1 00:00:00,400 --> 00:00:01,510 By now you've seen all 2 00:00:01,800 --> 00:00:03,600 of the main pieces of the 3 00:00:04,030 --> 00:00:06,760 recommender system algorithm or the collaborative filtering algorithm. 4 00:00:07,770 --> 00:00:08,770 In this video I want 5 00:00:08,940 --> 00:00:10,620 to just share one last implementational detail, 6 00:00:12,000 --> 00:00:14,140 namely mean normalization, which 7 00:00:14,350 --> 00:00:15,520 can sometimes just make the 8 00:00:15,570 --> 00:00:17,090 algorithm work a little bit better. 9 00:00:18,290 --> 00:00:20,820 To motivate the idea of mean normalization, let's 10 00:00:22,130 --> 00:00:24,390 consider an example of where 11 00:00:24,710 --> 00:00:26,530 there's a user that has not rated any movies. 12 00:00:28,050 --> 00:00:29,290 So, in addition to our 13 00:00:29,540 --> 00:00:30,790 four users, Alice, Bob, Carol, 14 00:00:31,060 --> 00:00:32,710 and Dave, I've added a 15 00:00:32,870 --> 00:00:35,110 fifth user, Eve, who hasn't rated any movies. 16 00:00:36,470 --> 00:00:37,920 Let's see what our collaborative filtering 17 00:00:38,350 --> 00:00:39,570 algorithm will do on this user. 18 00:00:41,020 --> 00:00:43,140 Let's say that n is equal to 2 and so 19 00:00:43,390 --> 00:00:44,420 we're going to learn two features 20 00:00:45,410 --> 00:00:46,470 and we are going to have 21 00:00:46,630 --> 00:00:47,890 to learn a parameter vector theta 22 00:00:48,140 --> 00:00:50,420 5, which is going to be 23 00:00:51,130 --> 00:00:52,560 in R2, remember this 24 00:00:52,750 --> 00:00:55,920 is now vectors in Rn not Rn+1, 25 00:00:57,070 --> 00:00:58,210 we'll learn the parameter vector theta 5 for our user number 5, Eve. 
26 00:00:59,780 --> 00:01:00,800 So if we look at 27 00:01:00,960 --> 00:01:02,020 the first term in this 28 00:01:02,200 --> 00:01:04,020 optimization objective, well the 29 00:01:04,220 --> 00:01:05,490 user Eve hasn't rated any 30 00:01:05,730 --> 00:01:07,860 movies, so there are 31 00:01:08,120 --> 00:01:10,750 no movies for 32 00:01:11,050 --> 00:01:12,810 which r(i,j) is equal to 33 00:01:13,130 --> 00:01:14,590 one for the user Eve and 34 00:01:14,700 --> 00:01:15,840 so this first term plays no 35 00:01:16,060 --> 00:01:17,400 role at all in determining theta 5 36 00:01:18,610 --> 00:01:19,790 because there are no movies that Eve has rated. 37 00:01:20,960 --> 00:01:22,120 And so the only term that 38 00:01:22,260 --> 00:01:24,300 affects theta 5 is this term. 39 00:01:24,880 --> 00:01:25,830 And so we're saying that we 40 00:01:25,910 --> 00:01:28,840 want to choose vector theta 5 so 41 00:01:28,950 --> 00:01:33,820 that the last regularization term is 42 00:01:34,540 --> 00:01:35,500 as small as possible. 43 00:01:35,920 --> 00:01:38,470 In other words we want to minimize this 44 00:01:39,040 --> 00:01:39,610 lambda over 2 times theta 5 subscript 1 squared 45 00:01:40,880 --> 00:01:43,150 plus theta 5 46 00:01:43,820 --> 00:01:45,840 subscript 2 squared, so 47 00:01:46,040 --> 00:01:47,170 that's the component of the 48 00:01:47,270 --> 00:01:49,460 regularization term that corresponds to 49 00:01:49,740 --> 00:01:51,610 user 5, and of course 50 00:01:51,850 --> 00:01:53,280 if your goal is to 51 00:01:53,550 --> 00:01:55,540 minimize this term, then 52 00:01:55,900 --> 00:01:56,790 what you're going to end up 53 00:01:56,980 --> 00:01:58,520 with is just theta 5 equals (0, 0).
54 00:01:59,670 --> 00:02:01,550 Because the regularization term 55 00:02:01,850 --> 00:02:03,270 is encouraging us to set 56 00:02:03,510 --> 00:02:05,120 parameters close to 0 57 00:02:05,620 --> 00:02:07,580 and if there is 58 00:02:07,730 --> 00:02:08,820 no data to try to 59 00:02:08,990 --> 00:02:10,210 pull the parameters away from 60 00:02:10,410 --> 00:02:11,460 0, because this first term 61 00:02:12,710 --> 00:02:13,800 doesn't affect theta 5, 62 00:02:13,880 --> 00:02:15,410 we just end up with theta 5 63 00:02:15,690 --> 00:02:18,450 equal to the vector of all zeros. And 64 00:02:18,590 --> 00:02:19,610 so when we go to 65 00:02:19,730 --> 00:02:20,920 predict how user 5 would 66 00:02:21,280 --> 00:02:22,570 rate any movie, we have 67 00:02:22,890 --> 00:02:25,850 that theta 5 transpose xi, 68 00:02:26,900 --> 00:02:28,380 for any i, that's just going 69 00:02:29,950 --> 00:02:31,060 to be equal to zero. 70 00:02:31,570 --> 00:02:33,320 Because theta 5 is 0, for any value of 71 00:02:33,750 --> 00:02:35,780 x this inner product is going to be equal to 0. And what we're 72 00:02:35,900 --> 00:02:37,160 going to have, therefore, is that 73 00:02:37,310 --> 00:02:38,780 we're going to predict that Eve 74 00:02:39,480 --> 00:02:40,870 is going to rate every single 75 00:02:41,170 --> 00:02:42,690 movie with zero stars. 76 00:02:44,050 --> 00:02:45,970 But this doesn't seem very useful, does it? 77 00:02:46,110 --> 00:02:47,310 I mean if you look at the different movies, 78 00:02:47,770 --> 00:02:49,710 Love at Last, this first movie, 79 00:02:50,130 --> 00:02:52,300 a couple people rated it 5 stars. 80 00:02:54,940 --> 00:02:56,870 And even for Swords vs. Karate, someone rated it 5 stars. 81 00:02:57,410 --> 00:02:58,780 So some people do like some movies. 82 00:02:59,270 --> 00:03:01,030 It seems not useful to 83 00:03:01,160 --> 00:03:03,750 just predict that Eve is going to rate everything 0 stars.
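Written out, the argument for Eve can be sketched as follows. This assumes the regularized per-user objective from the earlier collaborative filtering material; the superscript notation $\theta^{(5)}$, $x^{(i)}$, $y^{(i,5)}$ is an assumed rendering of the spoken "theta 5", "xi", and the ratings, and $\lambda > 0$ is assumed.

```latex
% Objective for user 5 (Eve), with n = 2 features:
\min_{\theta^{(5)}} \;
  \frac{1}{2} \sum_{i \,:\, r(i,5)=1}
    \left( (\theta^{(5)})^{T} x^{(i)} - y^{(i,5)} \right)^{2}
  + \frac{\lambda}{2} \left[ (\theta^{(5)}_{1})^{2} + (\theta^{(5)}_{2})^{2} \right]

% Eve has rated nothing, so no i satisfies r(i,5) = 1 and the first sum is empty:
\min_{\theta^{(5)}} \;
  \frac{\lambda}{2} \left[ (\theta^{(5)}_{1})^{2} + (\theta^{(5)}_{2})^{2} \right]
  \;\Longrightarrow\;
  \theta^{(5)} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}

% Hence the predicted rating is zero for every movie i:
(\theta^{(5)})^{T} x^{(i)} = 0 \quad \text{for all } i
```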
84 00:03:04,570 --> 00:03:05,850 And in fact if we're predicting 85 00:03:06,410 --> 00:03:08,340 that Eve is going to rate everything 0 stars, 86 00:03:09,050 --> 00:03:10,100 we also don't have any 87 00:03:10,280 --> 00:03:11,660 good way of recommending any movies 88 00:03:11,810 --> 00:03:12,930 to her, because, you know, 89 00:03:13,130 --> 00:03:15,320 all of these movies are getting exactly the 90 00:03:15,410 --> 00:03:16,810 same predicted rating for Eve, 91 00:03:17,010 --> 00:03:18,500 so there's no one movie with 92 00:03:18,660 --> 00:03:20,010 a higher predicted rating that 93 00:03:20,210 --> 00:03:22,880 we could recommend to her, so, that's not very good. 94 00:03:24,520 --> 00:03:27,340 The idea of mean normalization will let us fix this problem. 95 00:03:28,160 --> 00:03:29,410 So here's how it works. 96 00:03:30,760 --> 00:03:31,720 As before, let me group all of my 97 00:03:32,370 --> 00:03:33,750 movie ratings into this matrix 98 00:03:34,280 --> 00:03:35,400 Y, so just take all of 99 00:03:35,460 --> 00:03:36,700 these ratings and group them 100 00:03:37,240 --> 00:03:38,400 into matrix Y. And this 101 00:03:38,560 --> 00:03:39,740 column over here of all 102 00:03:39,910 --> 00:03:41,220 question marks corresponds to 103 00:03:41,670 --> 00:03:43,300 Eve's not having rated any movies. 104 00:03:44,830 --> 00:03:46,890 Now to perform mean normalization what I'm going to 105 00:03:47,140 --> 00:03:48,350 do is compute the average 106 00:03:48,720 --> 00:03:50,610 rating that each movie obtained. 107 00:03:51,120 --> 00:03:51,760 And I'm going to store that 108 00:03:52,040 --> 00:03:54,780 in a vector that we'll call mu. 109 00:03:55,210 --> 00:03:57,250 So the first movie got two 5-star and two 0-star ratings, 110 00:03:57,760 --> 00:03:58,960 so the average of that is a 2.5-star rating. 111 00:03:59,040 --> 00:04:01,470 The second movie had 112 00:04:01,620 --> 00:04:04,300 an average of 2.5 stars and so on.
113 00:04:04,470 --> 00:04:06,300 And the final movie has 0, 0, 5, 0. 114 00:04:06,330 --> 00:04:07,440 And the average of 0, 0, 115 00:04:07,520 --> 00:04:09,190 5, 0, that averages out to 116 00:04:09,620 --> 00:04:11,500 an average of 1.25 117 00:04:12,240 --> 00:04:14,910 rating. And what I'm going to 118 00:04:15,000 --> 00:04:15,900 do is look at all 119 00:04:16,020 --> 00:04:17,610 the movie ratings and I'm going 120 00:04:18,010 --> 00:04:19,550 to subtract off the mean rating. 121 00:04:20,110 --> 00:04:22,990 So from this first element 5 I'm going to subtract off 2.5 and that gives me 2.5. 122 00:04:26,900 --> 00:04:29,380 And from the second element 5 subtract off 2.5, 123 00:04:29,590 --> 00:04:30,000 get a 2.5. 124 00:04:30,410 --> 00:04:31,760 And then from the 0, 125 00:04:32,040 --> 00:04:34,560 0, subtract off 2.5 and you get -2.5, -2.5. 126 00:04:35,450 --> 00:04:36,530 In other words, what 127 00:04:36,620 --> 00:04:38,010 I'm going to do is take 128 00:04:38,310 --> 00:04:39,440 my matrix of movie ratings, 129 00:04:39,960 --> 00:04:42,070 take this Y matrix, and 130 00:04:42,730 --> 00:04:45,580 subtract from each row the average rating for that movie. 131 00:04:46,580 --> 00:04:47,580 So, what I'm doing is 132 00:04:48,010 --> 00:04:49,600 just normalizing each movie to 133 00:04:49,740 --> 00:04:51,610 have an average rating of zero. 134 00:04:52,800 --> 00:04:53,580 And so just one last example. 135 00:04:54,000 --> 00:04:56,010 If you look at this last row, 0 0 5 0. 136 00:04:56,270 --> 00:04:56,940 We're going to subtract 1.25, and 137 00:04:57,000 --> 00:04:58,590 so I end up with 138 00:05:00,950 --> 00:05:02,300 these values over here. 139 00:05:02,510 --> 00:05:03,730 And of course 140 00:05:03,940 --> 00:05:05,380 the question marks stay question 141 00:05:06,960 --> 00:05:06,960 marks.
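The subtraction step described here can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the rating matrix below is assumed from the lecture's running example (the transcript spells out only the first and last rows and some of the means), `None` stands in for a "?" entry, and the function names `movie_means` and `mean_normalize` are hypothetical.

```python
# Rows are movies, columns are users (Alice, Bob, Carol, Dave, Eve).
# None marks a missing rating ("?" in the lecture). Entries are assumed
# from the lecture's running example.
Y = [
    [5, 5, 0, 0, None],
    [5, None, None, 0, None],
    [None, 4, 0, None, None],
    [0, 0, 5, 4, None],
    [0, 0, 5, 0, None],
]


def movie_means(Y):
    """Average each row over the ratings that actually exist."""
    means = []
    for row in Y:
        rated = [r for r in row if r is not None]
        # Assumption: a movie with no ratings at all gets mean 0.0.
        means.append(sum(rated) / len(rated) if rated else 0.0)
    return means


def mean_normalize(Y):
    """Subtract each movie's mean from its known ratings; keep gaps as None."""
    mu = movie_means(Y)
    Y_norm = [
        [None if r is None else r - m for r in row]
        for row, m in zip(Y, mu)
    ]
    return Y_norm, mu


Y_norm, mu = mean_normalize(Y)
print(mu)         # per-movie means
print(Y_norm[0])  # first movie after normalization
print(Y_norm[4])  # last movie after normalization
```

Each row of `Y_norm` averages to zero over its known entries, and Eve's column stays entirely missing.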
142 00:05:07,880 --> 00:05:09,630 So each movie in 143 00:05:09,810 --> 00:05:11,040 this new matrix Y has 144 00:05:11,210 --> 00:05:12,780 an average rating of 0. 145 00:05:13,940 --> 00:05:15,180 What I'm going to do then, is 146 00:05:15,440 --> 00:05:16,850 take this set of ratings 147 00:05:17,590 --> 00:05:20,170 and use it with my collaborative filtering algorithm. 148 00:05:20,480 --> 00:05:22,130 So I'm going to pretend that this 149 00:05:22,430 --> 00:05:24,200 was the data that I had 150 00:05:24,420 --> 00:05:25,570 gotten from my users, or pretend that 151 00:05:25,810 --> 00:05:27,400 these are the actual ratings I 152 00:05:27,530 --> 00:05:28,940 had gotten from the users, and I'm 153 00:05:29,250 --> 00:05:30,130 going to use this as my 154 00:05:30,270 --> 00:05:31,730 data set with which to 155 00:05:32,000 --> 00:05:33,920 learn my parameters theta 156 00:05:34,560 --> 00:05:36,540 j and my features xi 157 00:05:36,860 --> 00:05:39,320 from these mean-normalized movie ratings. 158 00:05:41,280 --> 00:05:42,040 When I want to make predictions 159 00:05:42,660 --> 00:05:43,910 of movie ratings, what I'm 160 00:05:44,070 --> 00:05:44,980 going to do is the 161 00:05:45,250 --> 00:05:46,830 following: for user j on movie 162 00:05:47,130 --> 00:05:49,250 i, I'm gonna predict theta 163 00:05:49,600 --> 00:05:54,730 j transpose xi, where 164 00:05:55,070 --> 00:05:55,990 x and theta are the parameters 165 00:05:56,590 --> 00:05:58,230 that I've learned from this mean-normalized data set. 166 00:05:59,180 --> 00:06:00,680 But because, on the data 167 00:06:00,950 --> 00:06:02,260 set, I had subtracted off the 168 00:06:02,330 --> 00:06:04,000 means, in order to make 169 00:06:04,040 --> 00:06:05,210 a prediction on movie i, 170 00:06:05,510 --> 00:06:07,220 I'm going to need to add back in the mean, 171 00:06:08,070 --> 00:06:08,730 and so I'm going to add 172 00:06:08,840 --> 00:06:10,690 back in mu i.
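The prediction rule just stated, theta j transpose xi plus mu i, can be sketched as below. The feature values in `X` are made up purely for illustration, the means in `mu` are assumed from the lecture's running example, and `predict` is a hypothetical helper name. The example user has an all-zero parameter vector, which is what regularization produces for a user with no ratings.

```python
def predict(theta_j, x_i, mu_i):
    """Predicted rating: inner product of theta_j and x_i, plus the movie mean."""
    return sum(t * x for t, x in zip(theta_j, x_i)) + mu_i


# Per-movie means (assumed from the lecture's running example).
mu = [2.5, 2.5, 2.0, 2.25, 1.25]

# Hypothetical learned movie features, one row per movie, n = 2.
X = [
    [0.9, 0.0],
    [1.0, 0.01],
    [0.99, 0.0],
    [0.1, 1.0],
    [0.0, 0.9],
]

# A user with no ratings ends up with an all-zero parameter vector.
theta_new_user = [0.0, 0.0]

new_user_predictions = [
    predict(theta_new_user, x_i, mu_i) for x_i, mu_i in zip(X, mu)
]
print(new_user_predictions)  # equals mu: each movie's average rating
```

Because the inner product vanishes, the prediction falls back to each movie's average rating whatever the learned features are, and the movies are no longer all tied at zero.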
And 173 00:06:10,830 --> 00:06:11,780 so that's going to be 174 00:06:11,830 --> 00:06:13,350 my prediction: in my training 175 00:06:13,660 --> 00:06:14,860 data I subtracted off all the 176 00:06:14,930 --> 00:06:16,290 means, and so when we 177 00:06:16,440 --> 00:06:20,770 make predictions we need 178 00:06:21,770 --> 00:06:23,030 to add back in these 179 00:06:23,410 --> 00:06:23,880 means mu i for movie i. And 180 00:06:24,100 --> 00:06:25,320 so specifically if you take user 181 00:06:25,330 --> 00:06:26,840 5, which is Eve, the same argument as 182 00:06:27,010 --> 00:06:28,250 the previous slide still applies in 183 00:06:28,440 --> 00:06:29,870 the sense that Eve had 184 00:06:30,080 --> 00:06:31,600 not rated any movies and 185 00:06:31,760 --> 00:06:32,930 so the learned parameter for 186 00:06:33,710 --> 00:06:35,030 user 5 is still going to 187 00:06:35,970 --> 00:06:37,990 be equal to (0, 0). 188 00:06:38,270 --> 00:06:39,910 And so what we're 189 00:06:40,130 --> 00:06:41,320 going to get then is that 190 00:06:41,690 --> 00:06:42,980 on a particular movie i we're 191 00:06:43,130 --> 00:06:44,900 going to predict for Eve theta 192 00:06:45,480 --> 00:06:49,930 5 transpose xi, plus, 193 00:06:51,260 --> 00:06:52,890 adding back in, mu i, and 194 00:06:53,010 --> 00:06:54,360 so this first component is 195 00:06:54,460 --> 00:06:57,520 going to be equal to zero, if theta 5 is equal to zero. 196 00:06:58,290 --> 00:06:59,190 And so on movie i, we 197 00:06:59,260 --> 00:07:00,660 are going to end up predicting mu 198 00:07:01,090 --> 00:07:03,190 i. And, this actually makes sense. 199 00:07:03,380 --> 00:07:03,690 It means that 200 00:07:03,900 --> 00:07:05,270 on movie 1 we're 201 00:07:05,390 --> 00:07:06,990 going to predict Eve rates it 2.5. 202 00:07:07,270 --> 00:07:10,260 On movie 2 we're gonna predict Eve rates it 2.5.
203 00:07:10,420 --> 00:07:11,640 On movie 3 we're 204 00:07:11,880 --> 00:07:13,000 gonna predict Eve rates it at 2 205 00:07:13,200 --> 00:07:14,510 and so on. 206 00:07:14,780 --> 00:07:15,960 This actually makes sense, because it says 207 00:07:16,320 --> 00:07:17,730 that if Eve hasn't rated 208 00:07:18,020 --> 00:07:18,870 any movies and we just 209 00:07:19,100 --> 00:07:20,180 don't know anything about this new 210 00:07:20,410 --> 00:07:21,630 user Eve, what we're going 211 00:07:21,810 --> 00:07:23,770 to do is just predict for 212 00:07:23,940 --> 00:07:25,140 each of the movies, what is 213 00:07:25,230 --> 00:07:27,520 the average rating that those movies got. 214 00:07:30,060 --> 00:07:31,480 Finally, as an aside, in 215 00:07:31,810 --> 00:07:33,290 this video we talked about mean 216 00:07:33,540 --> 00:07:35,220 normalization, where we normalized 217 00:07:35,320 --> 00:07:36,450 each row of the matrix Y 218 00:07:37,510 --> 00:07:38,100 to have mean 0. 219 00:07:39,020 --> 00:07:40,730 In case you have some movies 220 00:07:41,020 --> 00:07:42,330 with no ratings, so it is 221 00:07:42,590 --> 00:07:44,320 analogous to a user who hasn't rated anything, 222 00:07:44,590 --> 00:07:45,550 but in case you have some 223 00:07:46,250 --> 00:07:47,530 movies with no ratings, you 224 00:07:47,590 --> 00:07:48,700 can also play with versions 225 00:07:49,320 --> 00:07:50,700 of the algorithm, where you 226 00:07:50,900 --> 00:07:52,190 normalize the different columns 227 00:07:52,790 --> 00:07:53,990 to have mean zero, instead of 228 00:07:54,280 --> 00:07:55,180 normalizing the rows to have mean 229 00:07:55,500 --> 00:07:56,990 zero, although that's maybe 230 00:07:57,240 --> 00:07:58,770 less important, because if you 231 00:07:58,870 --> 00:07:59,810 really have a movie with no 232 00:08:00,040 --> 00:08:01,390 rating, maybe you just 233 00:08:01,590 --> 00:08:03,920 shouldn't recommend that movie to anyone, anyway.
234 00:08:04,700 --> 00:08:08,010 And so, taking 235 00:08:08,540 --> 00:08:09,980 care of the case of a user who hasn't 236 00:08:10,490 --> 00:08:11,780 rated anything might be more 237 00:08:12,010 --> 00:08:13,170 important than taking care of 238 00:08:13,310 --> 00:08:14,550 the case of a movie 239 00:08:14,860 --> 00:08:16,090 that hasn't gotten a single rating. 240 00:08:18,930 --> 00:08:20,080 So to summarize, that's how 241 00:08:20,360 --> 00:08:21,830 you can do mean normalization as 242 00:08:22,110 --> 00:08:25,110 a sort of pre-processing step for collaborative filtering. 243 00:08:25,740 --> 00:08:26,670 Depending on your data set, 244 00:08:26,960 --> 00:08:28,140 this might sometimes make your implementation 245 00:08:28,540 --> 00:08:30,040 work just a little bit better.