1 00:00:01,370 --> 00:00:02,420 In the last video, we talked 2 00:00:02,740 --> 00:00:04,200 about the recommender system problem, 3 00:00:05,030 --> 00:00:06,270 where for example, you may 4 00:00:06,380 --> 00:00:07,810 have a set of movies and you 5 00:00:07,940 --> 00:00:09,140 may have a set of users, 6 00:00:09,810 --> 00:00:10,960 each of whom has rated 7 00:00:11,670 --> 00:00:13,170 some subset of the movies, 8 00:00:13,370 --> 00:00:14,340 rated the movies 1 to 9 00:00:14,500 --> 00:00:15,460 5 stars or 0 to 5 10 00:00:15,630 --> 00:00:16,830 stars, and what I would like 11 00:00:17,200 --> 00:00:18,170 to do, is look at 12 00:00:18,240 --> 00:00:19,720 these users and predict how 13 00:00:19,910 --> 00:00:22,540 they would have rated other movies that they have not yet rated. 14 00:00:23,530 --> 00:00:24,540 In this video, I would like 15 00:00:24,600 --> 00:00:25,950 to talk about our first approach 16 00:00:26,430 --> 00:00:28,190 to building a recommender system, this 17 00:00:28,360 --> 00:00:30,100 approach is called content based recommendations. 18 00:00:31,460 --> 00:00:32,690 Here's our data set from before, 19 00:00:33,310 --> 00:00:34,470 and just to remind you of a 20 00:00:34,550 --> 00:00:35,780 bit of notation, I was using 21 00:00:36,690 --> 00:00:37,870 Nu to denote the number 22 00:00:38,030 --> 00:00:39,110 of users, and so that's equal 23 00:00:39,290 --> 00:00:40,990 to 4, and Nm 24 00:00:41,990 --> 00:00:44,780 to denote the number of movies, I have five movies. 25 00:00:47,230 --> 00:00:48,140 So, how do I predict 26 00:00:48,960 --> 00:00:50,950 what these missing values would be? 27 00:00:52,490 --> 00:00:53,520 Let's suppose that for each 28 00:00:53,700 --> 00:00:55,500 of these movies, I have a 29 00:00:55,540 --> 00:00:57,460 set of features for them. 
30 00:00:57,910 --> 00:00:58,990 In particular, let's say that 31 00:00:59,690 --> 00:01:00,850 for each of the movies I have two features, 32 00:01:01,920 --> 00:01:03,500 which I'm going to denote X1 and 33 00:01:04,080 --> 00:01:05,700 X2, where X1 measures the degree 34 00:01:06,130 --> 00:01:07,450 to which a movie is a 35 00:01:07,650 --> 00:01:09,270 romantic movie and X2 measures 36 00:01:09,810 --> 00:01:12,080 the degree to which a movie is an action movie. 37 00:01:12,840 --> 00:01:13,700 So if you take the movie 38 00:01:14,470 --> 00:01:16,490 Love at last, you know, 39 00:01:16,800 --> 00:01:17,960 0.9 rating on the 40 00:01:18,030 --> 00:01:19,190 romance scale, it is a 41 00:01:19,260 --> 00:01:20,850 highly romantic movie but zero on 42 00:01:20,920 --> 00:01:22,400 the action scale, so almost no 43 00:01:22,520 --> 00:01:24,390 action in that 44 00:01:24,540 --> 00:01:25,860 movie. Romance forever was 1.0, 45 00:01:26,230 --> 00:01:27,610 a lot of romance, and 0.01 action, 46 00:01:27,860 --> 00:01:29,790 I don't know, maybe 47 00:01:30,700 --> 00:01:32,650 there's a minor car crash in 48 00:01:33,630 --> 00:01:35,580 that movie or something, so a little bit of action. 49 00:01:35,610 --> 00:01:36,760 Skipping one, let's do 50 00:01:37,860 --> 00:01:39,630 Swords vs. karate, maybe that 51 00:01:39,870 --> 00:01:41,110 has a zero romance rating, 52 00:01:41,520 --> 00:01:42,780 so no romance at all in that 53 00:01:43,250 --> 00:01:46,040 but plenty of action. And, you know, Nonstop car chases, 54 00:01:46,300 --> 00:01:47,120 maybe again there is a 55 00:01:47,220 --> 00:01:48,390 tiny bit of romance in 56 00:01:48,500 --> 00:01:49,800 that movie, but mainly action, 57 00:01:50,460 --> 00:01:51,560 and Cute puppies of 58 00:01:51,680 --> 00:01:52,730 love is again mainly a romance 59 00:01:53,510 --> 00:01:54,410 movie with no action at all.
60 00:01:55,990 --> 00:01:57,150 So if we have features 61 00:01:57,550 --> 00:01:59,220 like these then each movie 62 00:01:59,800 --> 00:02:01,510 can be represented with a feature vector. 63 00:02:02,380 --> 00:02:03,810 Let's take movie 1, so just 64 00:02:04,020 --> 00:02:06,210 call these movies, you know, movies 1, 2, 3, 4 and 5. 65 00:02:06,630 --> 00:02:08,180 For my first movie, 66 00:02:08,520 --> 00:02:09,810 Love at last, I have 67 00:02:10,170 --> 00:02:11,710 my two features, 0.9 and 68 00:02:12,180 --> 00:02:12,950 0, and so these are features 69 00:02:13,380 --> 00:02:16,170 X1 and X2, and 70 00:02:16,340 --> 00:02:17,270 let's add an extra feature 71 00:02:17,790 --> 00:02:18,780 as usual, which is my 72 00:02:19,350 --> 00:02:21,640 intercept feature X0, which is equal to 1 73 00:02:22,680 --> 00:02:23,810 and so, putting these together, 74 00:02:24,700 --> 00:02:26,150 I would then have a feature vector X1, 75 00:02:26,970 --> 00:02:28,420 the superscript 1 denotes it's 76 00:02:28,510 --> 00:02:29,430 the feature vector for my first 77 00:02:29,770 --> 00:02:30,720 movie, and this feature 78 00:02:30,980 --> 00:02:32,520 vector is equal to 1, 79 00:02:33,190 --> 00:02:34,880 where the first 1 there is this intercept, 80 00:02:35,740 --> 00:02:37,010 and then my two features 0.9, 0, 81 00:02:37,260 --> 00:02:39,330 like so. 82 00:02:40,370 --> 00:02:41,360 So, for Love at last, I 83 00:02:41,550 --> 00:02:43,470 would have a feature vector X1, 84 00:02:44,480 --> 00:02:46,220 for the movie Romance Forever, we 85 00:02:46,340 --> 00:02:47,510 have a separate feature vector 86 00:02:47,800 --> 00:02:49,310 X2 and so on, and 87 00:02:49,380 --> 00:02:50,780 for Swords vs. karate I would 88 00:02:51,510 --> 00:02:54,050 have a different feature vector x superscript 5.
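
As a small illustration of the feature vector just described, the sketch below prepends the intercept feature x0 = 1 to a movie's two features (romance, action). The helper name `with_intercept` is made up for this example and is not from the lecture.

```python
def with_intercept(features):
    """Return a new feature vector with the intercept feature x0 = 1 in front."""
    return [1.0] + list(features)


# Love at last: romance 0.9, action 0 (the values quoted in the lecture).
x1 = with_intercept([0.9, 0.0])
print(x1)  # three entries: x0, x1 (romance), x2 (action)
```

With n = 2 content features per movie, each vector built this way has n + 1 = 3 entries.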
89 00:02:56,150 --> 00:02:57,460 Also, consistent with our 90 00:02:57,680 --> 00:02:59,090 earlier notation that we were 91 00:02:59,300 --> 00:03:00,220 using, we're going to set N 92 00:03:00,490 --> 00:03:02,130 to be the number of features, not 93 00:03:02,360 --> 00:03:03,530 counting this X zero 94 00:03:03,810 --> 00:03:05,320 intercept term, so n is 95 00:03:05,420 --> 00:03:06,600 equal to two because we have 96 00:03:06,790 --> 00:03:08,180 two features x1 and x2 97 00:03:08,890 --> 00:03:10,140 capturing the degree of romance 98 00:03:10,640 --> 00:03:11,980 and the degree of action in each 99 00:03:12,630 --> 00:03:14,270 movie. Now in order 100 00:03:14,560 --> 00:03:17,930 to make predictions, here is one thing we could do, 101 00:03:19,230 --> 00:03:20,980 which is that we could treat predicting 102 00:03:21,160 --> 00:03:22,340 the ratings of each user 103 00:03:23,250 --> 00:03:26,210 as a separate linear regression problem. So 104 00:03:26,440 --> 00:03:27,660 specifically let's say that for each 105 00:03:27,920 --> 00:03:29,170 user j we are going 106 00:03:29,270 --> 00:03:30,860 to learn a parameter vector theta 107 00:03:31,340 --> 00:03:33,030 J which would be in R3 in this case, 108 00:03:33,540 --> 00:03:35,730 more generally theta j would 109 00:03:35,950 --> 00:03:37,960 be in R n+1, where 110 00:03:38,340 --> 00:03:39,460 n is the number of features, 111 00:03:39,700 --> 00:03:42,170 not counting the intercept term, and we're going 112 00:03:42,440 --> 00:03:43,880 to predict user J's 113 00:03:44,050 --> 00:03:45,780 rating of movie I, with just 114 00:03:46,000 --> 00:03:47,390 the inner product between the parameter 115 00:03:47,860 --> 00:03:50,590 vector theta J and the features XI. 116 00:03:51,830 --> 00:03:53,680 So, let's take a specific example. 117 00:03:55,130 --> 00:03:56,700 Let's take user one.
118 00:03:59,600 --> 00:04:01,120 So that would be Alice and 119 00:04:01,380 --> 00:04:02,700 associated with Alice would 120 00:04:02,830 --> 00:04:03,990 be some parameter vector, 121 00:04:04,810 --> 00:04:06,210 theta 1 and our 122 00:04:06,520 --> 00:04:07,610 second user Bob will be 123 00:04:07,720 --> 00:04:08,600 associated with a different 124 00:04:08,970 --> 00:04:10,290 parameter vector theta 2. 125 00:04:10,800 --> 00:04:12,190 Carol will be associated with a 126 00:04:12,300 --> 00:04:13,360 different parameter vector theta 127 00:04:13,660 --> 00:04:14,790 3 and Dave a different 128 00:04:15,750 --> 00:04:17,670 parameter vector, theta 4. So 129 00:04:18,090 --> 00:04:18,990 let's say we want to make a 130 00:04:19,320 --> 00:04:21,040 prediction for what Alice will 131 00:04:21,240 --> 00:04:22,450 think of the movie, Cute 132 00:04:22,690 --> 00:04:24,640 puppies of love. Well that 133 00:04:24,810 --> 00:04:25,670 movie is going to have some 134 00:04:26,810 --> 00:04:29,180 feature vector X3, where 135 00:04:29,410 --> 00:04:30,400 we have that X3 is going 136 00:04:30,430 --> 00:04:32,460 to be equal to 1 137 00:04:32,650 --> 00:04:34,580 which is my intercept term, and 138 00:04:34,800 --> 00:04:37,220 then 0.99, and then 0.
139 00:04:38,560 --> 00:04:39,680 And let's say for this 140 00:04:39,810 --> 00:04:41,040 example, let's say that you 141 00:04:41,190 --> 00:04:42,890 know we have somehow already gotten 142 00:04:43,290 --> 00:04:44,600 a parameter vector theta 1 143 00:04:44,830 --> 00:04:45,700 for Alice--we will 144 00:04:45,850 --> 00:04:47,560 say later exactly how 145 00:04:47,800 --> 00:04:48,520 we come up with this parameter 146 00:04:48,600 --> 00:04:50,530 vector--but let's 147 00:04:50,710 --> 00:04:52,000 just say for now that you 148 00:04:52,150 --> 00:04:53,560 know some unspecified learning algorithm 149 00:04:54,040 --> 00:04:55,040 has learned the parameter vector 150 00:04:55,180 --> 00:04:56,970 theta 1 and it is 151 00:04:57,120 --> 00:04:59,260 equal to 0, 5, 0. And so 152 00:05:00,150 --> 00:05:02,010 our prediction for this 153 00:05:02,270 --> 00:05:04,130 entry is going to 154 00:05:04,260 --> 00:05:06,930 be equal to theta 1, 155 00:05:07,440 --> 00:05:08,760 that is Alice's parameter vector, 156 00:05:09,620 --> 00:05:11,450 transpose X3, that 157 00:05:11,620 --> 00:05:13,730 is the feature vector for 158 00:05:14,170 --> 00:05:16,050 the Cute Puppies of Love movie, movie number 3. 159 00:05:16,250 --> 00:05:17,200 And so the inner 160 00:05:17,470 --> 00:05:18,470 product between these two vectors 161 00:05:19,910 --> 00:05:21,780 is going to be 5 x 0.99. 162 00:05:23,980 --> 00:05:26,340 Which is equal to 4.95. 163 00:05:27,360 --> 00:05:28,940 And so my prediction for this value 164 00:05:29,130 --> 00:05:30,930 over here is going to be 4.95. 165 00:05:31,970 --> 00:05:33,110 And maybe that seems like a 166 00:05:33,230 --> 00:05:34,660 reasonable value, if indeed 167 00:05:36,130 --> 00:05:37,830 this is my parameter vector theta 1.
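
The prediction worked through above can be checked with a few lines of code. This is a minimal sketch that uses plain Python lists; the function name `predict_rating` is an assumption made for this example, while the numbers are the ones quoted in the lecture.

```python
def predict_rating(theta, x):
    """Inner product theta^T x between a user's parameters and a movie's features."""
    return sum(t * xk for t, xk in zip(theta, x))


theta_alice = [0.0, 5.0, 0.0]       # theta 1, as given in the lecture example
x_cute_puppies = [1.0, 0.99, 0.0]   # x3: intercept, romance, action

prediction = predict_rating(theta_alice, x_cute_puppies)
print(round(prediction, 2))  # 0*1 + 5*0.99 + 0*0, which is about 4.95
```

Only the romance term contributes here, because Alice's intercept and action parameters are both zero in this example.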
168 00:05:38,950 --> 00:05:40,290 So all we are doing here is 169 00:05:40,520 --> 00:05:42,710 we are applying a different copy of 170 00:05:42,930 --> 00:05:44,480 essentially linear regression for each 171 00:05:44,760 --> 00:05:46,020 user and we are saying 172 00:05:46,230 --> 00:05:47,610 that what Alice does, is 173 00:05:47,820 --> 00:05:48,880 Alice has some parameter vector 174 00:05:49,160 --> 00:05:50,400 theta 1 that she uses, 175 00:05:51,410 --> 00:05:52,380 that we use to predict 176 00:05:53,310 --> 00:05:54,770 her ratings as a 177 00:05:54,950 --> 00:05:56,190 function of how romantic and how 178 00:05:56,470 --> 00:05:57,540 action packed the movie is 179 00:05:58,210 --> 00:05:59,600 and Bob, and Carol, and 180 00:05:59,740 --> 00:06:01,010 Dave each have a 181 00:06:01,220 --> 00:06:03,170 different linear function of the 182 00:06:03,330 --> 00:06:04,700 romantic-ness and action-ness or the degree 183 00:06:05,220 --> 00:06:06,510 of romance and the degree of action 184 00:06:07,580 --> 00:06:08,030 in a movie, 185 00:06:08,820 --> 00:06:11,300 and that is how we're going to predict their star ratings. 186 00:06:14,820 --> 00:06:16,330 More formally here is 187 00:06:16,610 --> 00:06:17,920 how we can write down the problem. 188 00:06:19,260 --> 00:06:20,320 Our notation is that RIJ 189 00:06:20,690 --> 00:06:21,600 is equal to one, if 190 00:06:21,680 --> 00:06:22,910 user J has rated movie I, 191 00:06:23,380 --> 00:06:24,630 and YIJ is the rating 192 00:06:25,850 --> 00:06:28,010 of that movie if that rating exists. 193 00:06:29,540 --> 00:06:30,520 That is if that user has actually 194 00:06:31,030 --> 00:06:32,830 rated that movie.
And 195 00:06:33,330 --> 00:06:34,360 on the previous slide we also 196 00:06:34,650 --> 00:06:36,540 defined theta J which 197 00:06:36,740 --> 00:06:38,790 is a parameter vector for each user, and XI 198 00:06:39,150 --> 00:06:40,830 which is a feature vector for a specific 199 00:06:41,220 --> 00:06:42,370 movie, and for each user 200 00:06:42,850 --> 00:06:43,780 and each movie you would predict 201 00:06:44,300 --> 00:06:45,620 that rating, as follows. 202 00:06:47,230 --> 00:06:49,560 So let me introduce, 203 00:06:49,650 --> 00:06:51,600 just temporarily, one extra 204 00:06:51,860 --> 00:06:53,530 bit of notation mj, we 205 00:06:53,760 --> 00:06:54,980 are gonna use mj to denote the 206 00:06:55,070 --> 00:06:56,140 number of movies rated by user 207 00:06:56,400 --> 00:06:57,350 j, we're gonna need this 208 00:06:57,580 --> 00:06:59,890 notation only for this slide. Now, in order to learn 209 00:07:00,160 --> 00:07:01,700 the parameter vector 210 00:07:01,760 --> 00:07:03,720 theta j, well, how can we do so? 211 00:07:04,410 --> 00:07:06,380 This is basically a linear regression problem. 212 00:07:06,930 --> 00:07:07,980 So what we can do, is 213 00:07:08,290 --> 00:07:09,810 just choose a parameter vector, theta j, 214 00:07:10,520 --> 00:07:12,100 so that the predicted values 215 00:07:12,570 --> 00:07:13,620 here are as close 216 00:07:13,980 --> 00:07:15,280 as possible to the values 217 00:07:15,800 --> 00:07:18,760 that we observed in our training set, the values we observed in our data. 218 00:07:19,900 --> 00:07:21,390 So, let's write that down.
219 00:07:22,290 --> 00:07:24,320 In order to learn the 220 00:07:24,380 --> 00:07:26,960 parameter vector theta j, let's minimize over 221 00:07:27,170 --> 00:07:28,510 my parameter vector theta j, 222 00:07:29,400 --> 00:07:30,360 of sum-- 223 00:07:31,920 --> 00:07:32,860 and I want to sum 224 00:07:33,290 --> 00:07:34,900 over all movies that user 225 00:07:35,240 --> 00:07:36,930 j has rated--so we write this as sum 226 00:07:37,270 --> 00:07:38,290 over all values of i, 227 00:07:39,100 --> 00:07:42,000 that is, i colon r i j equals 1. 228 00:07:43,870 --> 00:07:45,970 So the way to read this summation index is 229 00:07:46,370 --> 00:07:48,280 this is summation over all 230 00:07:48,470 --> 00:07:49,550 the values of i, such that 231 00:07:49,780 --> 00:07:51,180 r i j is equal to 1. 232 00:07:51,210 --> 00:07:52,470 So this is going to be summing over all the 233 00:07:52,560 --> 00:07:54,670 movies that user j has rated. 234 00:07:56,230 --> 00:07:57,000 And then I am going to 235 00:07:58,150 --> 00:07:59,910 compute theta j 236 00:08:01,810 --> 00:08:04,450 transpose xi so 237 00:08:04,610 --> 00:08:06,740 that's the prediction of user 238 00:08:07,030 --> 00:08:08,390 j's rating on movie i, 239 00:08:09,230 --> 00:08:10,960 minus y i j, 240 00:08:11,700 --> 00:08:13,700 so that's the actual observed rating, squared, 241 00:08:15,190 --> 00:08:16,790 and then, let me just divide 242 00:08:17,260 --> 00:08:18,650 by the number of movies 243 00:08:19,040 --> 00:08:20,990 that user J has 244 00:08:21,380 --> 00:08:23,910 actually rated, so just multiply by 1 over 2MJ. 245 00:08:24,000 --> 00:08:25,460 And so this is 246 00:08:25,690 --> 00:08:27,620 just like the least squares regression, 247 00:08:28,210 --> 00:08:29,550 it's just like linear regression 248 00:08:30,170 --> 00:08:31,170 where we want to choose 249 00:08:31,320 --> 00:08:34,480 the parameter vector theta J, to minimize this type of squared error term.
250 00:08:34,510 --> 00:08:35,090 And if you want to, you can 251 00:08:36,330 --> 00:08:39,580 also add in a regularization term 252 00:08:39,980 --> 00:08:41,870 so plus lambda over 2m, and 253 00:08:43,780 --> 00:08:44,930 this is really 2MJ because 254 00:08:45,420 --> 00:08:47,760 this is as if we have MJ examples, right? 255 00:08:47,920 --> 00:08:49,330 Because if user J has 256 00:08:49,650 --> 00:08:50,910 rated that many movies, it's 257 00:08:51,050 --> 00:08:53,340 sort of like we have that many data points with which to fit 258 00:08:53,680 --> 00:08:55,790 the parameters theta J. And then 259 00:08:56,650 --> 00:08:57,390 let me add in my usual 260 00:08:58,340 --> 00:09:00,260 regularization term here of 261 00:09:00,460 --> 00:09:02,530 theta J K squared. 262 00:09:03,110 --> 00:09:04,270 As usual this sum is from 263 00:09:04,840 --> 00:09:05,980 K equals 1 through N 264 00:09:06,330 --> 00:09:08,670 so here theta J is 265 00:09:08,880 --> 00:09:10,050 going to be an N plus 266 00:09:10,520 --> 00:09:12,400 1 dimensional vector where, 267 00:09:12,620 --> 00:09:14,630 in our earlier example, n was equal to two, 268 00:09:15,320 --> 00:09:17,090 but more generally, n is 269 00:09:17,260 --> 00:09:20,980 the number of features we have per movie. 270 00:09:21,730 --> 00:09:22,270 And so as usual we don't regularize over theta 0. 271 00:09:22,390 --> 00:09:23,710 We don't regularize over the 272 00:09:23,910 --> 00:09:24,750 bias term because the sum is 273 00:09:24,930 --> 00:09:28,590 from K equals 1 through N. If 274 00:09:28,760 --> 00:09:30,430 you minimize this as 275 00:09:30,570 --> 00:09:31,780 a function of theta J you get a 276 00:09:31,900 --> 00:09:33,010 good solution, you get a 277 00:09:33,180 --> 00:09:35,330 pretty good estimate of a parameter vector theta j 278 00:09:36,490 --> 00:09:37,200 with which to make the predictions 279 00:09:37,940 --> 00:09:39,460 for user J's movie ratings.
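
Written out, the per-user objective described in words above would look roughly as follows. The typeset notation is an assumption based on the spoken description: \theta^{(j)} for "theta J", x^{(i)} for "XI", r(i,j) and y^{(i,j)} for "RIJ" and "YIJ", and m_j for "mj", the number of movies rated by user j.

```latex
\min_{\theta^{(j)}} \;
\frac{1}{2 m_j} \sum_{i \,:\, r(i,j) = 1}
  \left( (\theta^{(j)})^{T} x^{(i)} - y^{(i,j)} \right)^{2}
\;+\;
\frac{\lambda}{2 m_j} \sum_{k=1}^{n} \left( \theta^{(j)}_{k} \right)^{2}
```

The second sum starts at k = 1, which is why the intercept parameter \theta^{(j)}_0 is not regularized.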
280 00:09:40,820 --> 00:09:42,250 For recommender systems, we're going 281 00:09:42,520 --> 00:09:44,140 to change this notation a little 282 00:09:44,500 --> 00:09:46,130 bit. So to simplify the subsequent math, 283 00:09:46,690 --> 00:09:48,440 I'm actually going to get rid of this term MJ. 284 00:09:49,570 --> 00:09:50,720 So that's just a constant, right, 285 00:09:50,970 --> 00:09:52,140 so I can delete it without changing 286 00:09:53,000 --> 00:09:54,310 the value of theta J that 287 00:09:54,430 --> 00:09:55,840 I get out of this optimization, 288 00:09:56,010 --> 00:09:57,030 so if you imagine taking this 289 00:09:57,220 --> 00:09:58,850 whole equation, taking this 290 00:09:59,010 --> 00:10:00,290 whole expression and multiplying it by 291 00:10:00,870 --> 00:10:02,540 MJ to get rid of that constant, then when 292 00:10:02,950 --> 00:10:04,110 I minimize this I should still get 293 00:10:04,200 --> 00:10:06,590 the same value of theta J as before. 294 00:10:06,710 --> 00:10:07,780 So, just to repeat what 295 00:10:08,440 --> 00:10:10,060 we wrote on the previous slide, here 296 00:10:10,340 --> 00:10:12,250 is our optimization objective: In order 297 00:10:12,580 --> 00:10:13,620 to learn theta J, which is 298 00:10:13,990 --> 00:10:15,080 the parameter vector for user J, 299 00:10:15,790 --> 00:10:17,570 we're going to minimize over theta j 300 00:10:17,770 --> 00:10:19,820 this optimization objective. So 301 00:10:20,100 --> 00:10:21,360 this is our usual squared 302 00:10:21,720 --> 00:10:24,830 error term and then this is our regularization term.
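
The simplified objective (with the MJ factor dropped) can be sketched as a small function. This is an illustrative sketch, not code from the lecture: the name `user_cost`, the dictionary layout for ratings, and the two ratings of 5 are all assumptions chosen for the example.

```python
def user_cost(theta, features, ratings, lam):
    """Squared-error cost for one user, summed only over movies that user rated.

    theta    : list of n + 1 parameters (theta[0] is the intercept parameter)
    features : dict mapping movie index -> feature vector with x0 = 1 in front
    ratings  : dict mapping movie index -> observed rating (rated movies only)
    lam      : regularization strength lambda
    """
    squared_error = 0.0
    for i, y in ratings.items():
        prediction = sum(t * xk for t, xk in zip(theta, features[i]))
        squared_error += (prediction - y) ** 2
    # The regularization sum runs over k = 1..n, so theta[0] is left out.
    regularization = sum(t * t for t in theta[1:])
    return 0.5 * squared_error + 0.5 * lam * regularization


features = {0: [1.0, 0.9, 0.0], 1: [1.0, 1.0, 0.01]}  # Love at last, Romance forever
ratings = {0: 5.0, 1: 5.0}                            # hypothetical ratings
cost = user_cost([0.0, 5.0, 0.0], features, ratings, lam=0.0)
print(cost)
```

Storing only the rated movies in `ratings` plays the role of the condition r(i,j) = 1 in the summation index.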
303 00:10:26,050 --> 00:10:27,410 Now of course in building 304 00:10:27,690 --> 00:10:28,790 a recommender system we don't 305 00:10:29,030 --> 00:10:29,800 just want to learn parameters 306 00:10:30,420 --> 00:10:31,500 for a single user, we want 307 00:10:31,650 --> 00:10:33,140 to learn parameters for all of 308 00:10:33,490 --> 00:10:35,640 our users, I have n subscript u 309 00:10:35,760 --> 00:10:36,730 users, so I want to 310 00:10:36,950 --> 00:10:38,920 learn all of these parameters and 311 00:10:39,060 --> 00:10:39,830 so what I'm going to do 312 00:10:40,140 --> 00:10:42,320 is take this minimization, take 313 00:10:42,500 --> 00:10:45,480 this optimization objective and just add an extra summation there. 314 00:10:45,800 --> 00:10:47,610 So, you know, this expression here 315 00:10:48,410 --> 00:10:49,200 with the one half on top again, so 316 00:10:49,240 --> 00:10:50,510 it's exactly the same 317 00:10:50,780 --> 00:10:52,520 as what we have on top except 318 00:10:52,950 --> 00:10:53,980 that now, instead of just 319 00:10:54,090 --> 00:10:55,670 doing this for a specific user's theta 320 00:10:55,960 --> 00:10:57,270 J, I'm going to sum 321 00:10:57,680 --> 00:10:59,340 my objective over all of 322 00:10:59,490 --> 00:11:00,940 my users and then minimize 323 00:11:01,260 --> 00:11:03,700 this overall optimization objective. 324 00:11:04,320 --> 00:11:05,570 Minimize this overall cost function. 325 00:11:06,730 --> 00:11:09,200 And when I minimize this 326 00:11:09,380 --> 00:11:10,560 as a function of theta 1, 327 00:11:11,360 --> 00:11:12,400 theta 2, up to 328 00:11:12,600 --> 00:11:14,130 theta NU, I will 329 00:11:14,270 --> 00:11:15,750 get a separate parameter 330 00:11:16,030 --> 00:11:17,340 vector for each user and 331 00:11:17,450 --> 00:11:18,720 I can then use that 332 00:11:19,090 --> 00:11:20,460 to make predictions for all of 333 00:11:20,530 --> 00:11:21,610 my users, for all of 334 00:11:21,720 --> 00:11:23,150 my N subscript u users.
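
For reference, the overall objective described above, summing the per-user objective over all n_u users, can be written as below. The typesetting is again an assumption reconstructed from the spoken description (\theta^{(j)} for user j's parameters, x^{(i)} for movie i's features).

```latex
\min_{\theta^{(1)}, \ldots, \theta^{(n_u)}} \;
\frac{1}{2} \sum_{j=1}^{n_u} \; \sum_{i \,:\, r(i,j) = 1}
  \left( (\theta^{(j)})^{T} x^{(i)} - y^{(i,j)} \right)^{2}
\;+\;
\frac{\lambda}{2} \sum_{j=1}^{n_u} \sum_{k=1}^{n} \left( \theta^{(j)}_{k} \right)^{2}
```

Each user's parameters appear only in that user's own terms, so minimizing the sum yields the same \theta^{(j)} as minimizing each per-user objective separately.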
335 00:11:24,520 --> 00:11:26,560 So putting everything together, this 336 00:11:27,180 --> 00:11:28,730 was our optimization objective on 337 00:11:28,880 --> 00:11:29,940 top and to give 338 00:11:30,170 --> 00:11:31,070 this thing a name, I'll just call this 339 00:11:31,930 --> 00:11:33,480 J of theta 1, 340 00:11:33,630 --> 00:11:35,520 dot, dot, dot theta NU. 341 00:11:36,050 --> 00:11:37,280 So J as usual is my 342 00:11:37,590 --> 00:11:39,830 optimization objective which I'm trying to minimize. 343 00:11:41,330 --> 00:11:42,500 Next, in order to actually 344 00:11:42,880 --> 00:11:44,310 do the minimization, if you 345 00:11:44,500 --> 00:11:45,840 were to derive the gradient 346 00:11:46,150 --> 00:11:47,410 descent updates, these are 347 00:11:47,530 --> 00:11:48,720 the equations you would get, 348 00:11:49,900 --> 00:11:51,300 so you would take theta 349 00:11:51,750 --> 00:11:53,310 JK and subtract from 350 00:11:53,430 --> 00:11:56,190 it alpha, which is the learning rate, times these terms here on the right. 351 00:11:56,280 --> 00:11:57,540 So we have slightly different cases 352 00:11:58,160 --> 00:11:59,660 so when K equals 0 and when K is not 353 00:11:59,840 --> 00:12:01,460 equal to 0, because our regularization 354 00:12:01,960 --> 00:12:04,380 term here regularizes only the 355 00:12:04,910 --> 00:12:06,430 values of theta JK for 356 00:12:06,610 --> 00:12:07,690 K not equal to zero. So 357 00:12:07,830 --> 00:12:09,470 we don't regularize theta 0 358 00:12:10,090 --> 00:12:11,610 so the slightly different updates 359 00:12:12,270 --> 00:12:13,580 for k equals zero, and k not equal to 0. 360 00:12:14,680 --> 00:12:16,080 And this term, over 361 00:12:16,250 --> 00:12:18,090 here, for example is just a partial 362 00:12:18,520 --> 00:12:20,790 derivative with respect to your parameter, 363 00:12:21,090 --> 00:12:24,300 that of your 364 00:12:25,350 --> 00:12:28,270 optimization objective, right? 
365 00:12:28,790 --> 00:12:30,280 And so, this is just 366 00:12:30,680 --> 00:12:33,000 gradient descent and I've 367 00:12:33,230 --> 00:12:35,440 already computed the derivatives and plugged them into here. 368 00:12:36,560 --> 00:12:39,580 If these gradient 369 00:12:40,570 --> 00:12:41,810 descent updates look a 370 00:12:41,980 --> 00:12:42,870 lot like what we had for 371 00:12:43,050 --> 00:12:44,700 linear regression, that's because these 372 00:12:44,880 --> 00:12:47,250 are essentially the same as linear regression. 373 00:12:48,190 --> 00:12:49,510 The only minor difference is that 374 00:12:49,780 --> 00:12:51,120 for linear regression we have 375 00:12:51,580 --> 00:12:52,600 these 1 over M terms 376 00:12:52,990 --> 00:12:54,710 - it's really 1 377 00:12:54,810 --> 00:12:56,770 over MJ - but 378 00:12:57,550 --> 00:12:59,230 because earlier when we were 379 00:12:59,370 --> 00:13:00,780 deriving the optimization objective 380 00:13:01,270 --> 00:13:03,540 we got rid of this, that's why we don't have this 1 over M term. 381 00:13:04,440 --> 00:13:05,880 But otherwise it's really sum over 382 00:13:06,080 --> 00:13:08,350 my training examples of, you 383 00:13:08,530 --> 00:13:09,890 know, the error times 384 00:13:10,230 --> 00:13:13,390 XK plus that regularization 385 00:13:14,900 --> 00:13:16,550 term contributes to the derivative. 
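
The gradient descent updates referred to above are shown on a slide and not read out in full. A reconstruction consistent with the objective described earlier (no 1/m_j factor, regularization only for k not equal to 0) would be the following, with the notation assumed as before.

```latex
\theta^{(j)}_{k} := \theta^{(j)}_{k}
  - \alpha \sum_{i \,:\, r(i,j) = 1}
    \left( (\theta^{(j)})^{T} x^{(i)} - y^{(i,j)} \right) x^{(i)}_{k}
  \qquad \text{for } k = 0

\theta^{(j)}_{k} := \theta^{(j)}_{k}
  - \alpha \left( \sum_{i \,:\, r(i,j) = 1}
    \left( (\theta^{(j)})^{T} x^{(i)} - y^{(i,j)} \right) x^{(i)}_{k}
    + \lambda \, \theta^{(j)}_{k} \right)
  \qquad \text{for } k \neq 0
```

In both cases the quantity multiplied by \alpha is the partial derivative of the overall cost function with respect to \theta^{(j)}_{k}.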
386 00:13:18,120 --> 00:13:19,040 So if you are using 387 00:13:19,200 --> 00:13:20,360 gradient descent, here is how 388 00:13:20,680 --> 00:13:22,140 you can minimize the cost 389 00:13:22,440 --> 00:13:23,880 function j, to learn all 390 00:13:24,110 --> 00:13:25,490 the parameters, and using these 391 00:13:25,640 --> 00:13:26,980 formulas for the derivatives, if 392 00:13:27,090 --> 00:13:28,240 you want, you can also plug them 393 00:13:28,440 --> 00:13:29,710 into a more advanced optimization 394 00:13:30,290 --> 00:13:31,710 algorithm like conjugate gradient or 395 00:13:31,810 --> 00:13:33,730 LBFGS or what have you, and use 396 00:13:33,940 --> 00:13:35,930 that to try to minimize the cost function J as well. 397 00:13:37,360 --> 00:13:38,450 So hopefully you now know 398 00:13:38,750 --> 00:13:40,510 how you can apply essentially a 399 00:13:41,000 --> 00:13:42,820 variation on linear regression in 400 00:13:42,950 --> 00:13:45,460 order to predict different movie ratings by different users. 401 00:13:46,350 --> 00:13:47,510 This particular algorithm is called 402 00:13:48,030 --> 00:13:49,930 content based recommendations, or a 403 00:13:50,040 --> 00:13:51,980 content based approach, because we 404 00:13:52,130 --> 00:13:53,200 assume that we have available 405 00:13:53,650 --> 00:13:55,430 to us features for the different movies. 406 00:13:56,150 --> 00:13:57,330 So we have features that 407 00:13:57,490 --> 00:13:58,610 capture what is the 408 00:13:58,700 --> 00:14:00,260 content of these movies. How romantic is this movie? 409 00:14:01,280 --> 00:14:03,050 How much action is in this movie? 410 00:14:03,430 --> 00:14:04,690 And we are really using features of the 411 00:14:04,780 --> 00:14:06,910 content of the movies to make our predictions.
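
To make the whole procedure concrete, here is a minimal sketch of batch gradient descent on the overall cost function, in plain Python. It is an illustration under stated assumptions, not the lecture's implementation: the function names, the ratings, the action value for Swords vs. karate, the learning rate, and the iteration count are all hypothetical.

```python
def predict(theta, x):
    """Predicted rating: inner product of user parameters and movie features."""
    return sum(t * xk for t, xk in zip(theta, x))


def total_cost(thetas, features, ratings, lam):
    """Overall cost J: per-user squared error plus regularization, summed over users."""
    cost = 0.0
    for j, theta in enumerate(thetas):
        for i, y in ratings[j].items():
            cost += 0.5 * (predict(theta, features[i]) - y) ** 2
        cost += 0.5 * lam * sum(t * t for t in theta[1:])  # skip theta[0]
    return cost


def gradient_step(thetas, features, ratings, alpha, lam):
    """One simultaneous gradient descent update for every user's parameters."""
    updated = []
    for j, theta in enumerate(thetas):
        grad = [0.0] * len(theta)
        for i, y in ratings[j].items():  # only movies with r(i, j) = 1
            error = predict(theta, features[i]) - y
            for k, xk in enumerate(features[i]):
                grad[k] += error * xk
        for k in range(1, len(theta)):   # the k = 0 term is not regularized
            grad[k] += lam * theta[k]
        updated.append([t - alpha * g for t, g in zip(theta, grad)])
    return updated


features = [
    [1.0, 0.9, 0.0],   # Love at last
    [1.0, 1.0, 0.01],  # Romance forever
    [1.0, 0.99, 0.0],  # Cute puppies of love
    [1.0, 0.0, 0.9],   # Swords vs. karate (action value assumed)
]
ratings = [
    {0: 5.0, 1: 5.0, 3: 0.0},  # hypothetical user who likes romance
    {1: 0.0, 2: 0.0, 3: 5.0},  # hypothetical user who likes action
]

thetas = [[0.0, 0.0, 0.0] for _ in ratings]
initial_cost = total_cost(thetas, features, ratings, lam=0.01)
for _ in range(500):
    thetas = gradient_step(thetas, features, ratings, alpha=0.05, lam=0.01)
final_cost = total_cost(thetas, features, ratings, lam=0.01)
print(initial_cost, final_cost)
```

The same `total_cost` and the gradient computed inside `gradient_step` are what one would hand to a more advanced optimizer instead of running the fixed-step loop.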
412 00:14:08,350 --> 00:14:09,770 But for many movies we 413 00:14:09,920 --> 00:14:11,300 don't actually have such features, 414 00:14:11,820 --> 00:14:13,630 or it may be very difficult to get 415 00:14:13,850 --> 00:14:14,970 such features for all of 416 00:14:15,050 --> 00:14:16,160 our movies, for all 417 00:14:16,460 --> 00:14:17,800 of whatever items we are trying to sell. 418 00:14:18,880 --> 00:14:20,430 So in the next video, we'll 419 00:14:20,590 --> 00:14:21,530 start to talk about an approach 420 00:14:22,010 --> 00:14:23,290 to recommender systems that isn't 421 00:14:23,570 --> 00:14:24,710 content based and does not 422 00:14:24,980 --> 00:14:26,090 assume that we have 423 00:14:26,670 --> 00:14:28,420 someone else giving us all of these features, 424 00:14:28,880 --> 00:14:30,300 for all of the movies in our data set.