In the last few videos, we talked about the collaborative filtering algorithm. In this video I'm going to say a little bit about a vectorized implementation of this algorithm, and also talk a little bit about other things you can do with it. For example, one of the things you can do is, given one product, find other products that are related to it, so that if a user has recently been looking at one product, there are other related products that you could recommend to this user. So let's see what we could do about that.

What I'd like to do is work out an alternative way of writing out the predictions of the collaborative filtering algorithm. To start, here is our data set with our five movies, and what I'm going to do is take all the ratings by all the users and group them into a matrix. So here we have five movies and four users, and so this matrix Y is going to be a 5 by 4 matrix. It's just taking all of the elements, all of this data, including question marks, and grouping them into this matrix. And of course the (i, j) element of this matrix is really what we were previously writing as y^(i,j): the rating given to movie i by user j.

Given this matrix Y of all the ratings that we have, there's an alternative way of writing out all the predicted ratings of the algorithm. In particular, if you look at what a certain user predicts on a certain movie, what user j predicts on movie i is given by this formula, (theta^(j))^T x^(i). And so, if you form a matrix of the predicted ratings, what you would have is the following matrix, where the (i, j) entry
corresponds to the rating that we predict user j will give to movie i, and is exactly equal to that (theta^(j))^T x^(i). So this is a matrix where the first element, the (1,1) element, is the predicted rating of user one on movie one, and this element, the (1,2) element, is the predicted rating of user two on movie one, and so on; and this is the predicted rating of user one on the last movie. If you want, this rating is what we would have predicted for this value, and this rating is what we would have predicted for that value, and so on.

Now, given this matrix of predicted ratings, there is a simpler, or vectorized, way of writing these out. In particular, if I define the matrix X, and this is going to be just like the matrix we had earlier for linear regression, to be (x^(1))^T, (x^(2))^T, down to (x^(nm))^T, I'm taking all the features for my movies and stacking them in rows. So if you think of each movie as one example, we stack all the features of the different movies in rows. And if we also define a matrix capital Theta, what I'm going to do is take each of the per-user parameter vectors and stack them in rows, like so. So that's theta^(1), which is the parameter vector for the first user, and theta^(2), and so on. You stack them in rows like this to define a matrix capital Theta, and so I have n_u parameter vectors all stacked in rows like this.
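Here's a minimal NumPy sketch of those stacked matrices. Everything below is an illustrative placeholder, not taken from the video: the variable names, the example ratings, and the random numbers standing in for the learned parameters.

```python
import numpy as np

# Y holds the raw ratings, 5 movies x 4 users, so Y[i, j] = y^(i,j).
# np.nan stands in for the question marks (unrated entries); the
# particular values here are made up for illustration.
Y = np.array([
    [5.0,    5.0,    0.0,    0.0],
    [5.0,    np.nan, np.nan, 0.0],
    [np.nan, 4.0,    0.0,    np.nan],
    [0.0,    0.0,    5.0,    4.0],
    [0.0,    0.0,    5.0,    np.nan],
])

num_movies, num_users = Y.shape
n = 2  # number of learned features per movie and per user

# X stacks the movie feature vectors x^(1), ..., x^(nm) in rows, and
# Theta stacks the per-user parameter vectors theta^(1), ..., theta^(nu)
# in rows. Random values stand in for whatever the algorithm learned.
X = np.random.randn(num_movies, n)
Theta = np.random.randn(num_users, n)
```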
Now, given this definition for the matrix X and this definition for the matrix Theta, in order to have a vectorized way of computing the matrix of all the predictions, you can just compute X times Theta transpose, and that gives you a vectorized way of computing this matrix over here (there's a small code sketch of this below).

To give the collaborative filtering algorithm that you've been using another name: the algorithm that we're using is also called low rank matrix factorization. And so if you hear people talk about low rank matrix factorization, that's essentially exactly the algorithm that we have been talking about. This term comes from the property that the matrix X times Theta transpose has a mathematical property in linear algebra of being a low rank matrix, and so that's what gives rise to the name low rank matrix factorization for these algorithms. In case you don't know what low rank means, or what a low rank matrix is, don't worry about it; you really don't need to know that in order to use this algorithm. But if you're an expert in linear algebra, that's what gives this algorithm its other name of low rank matrix factorization.

Finally, having run the collaborative filtering algorithm, here's something else that you can do, which is use the learned features in order to find related movies. Specifically, for each product i, really for each movie i, we've learned a feature vector x^(i).
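Here's that small sketch of the vectorized prediction, continuing the hypothetical X and Theta defined above; the indices used in the check are arbitrary.

```python
# Each entry (i, j) of X @ Theta.T is (theta^(j))^T x^(i), the
# predicted rating of user j on movie i.
predictions = X @ Theta.T          # shape (num_movies, num_users)

# Spot-check one entry against the per-user, per-movie formula.
i, j = 1, 2
assert np.isclose(predictions[i, j], X[i] @ Theta[j])

# The "low rank" in the name: a (5 x n) times (n x 4) product has
# rank at most n, small relative to the full 5 x 4 matrix when n is small.
print(np.linalg.matrix_rank(predictions))  # at most n
```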
Now, when you learn a certain set of features, you don't really know in advance what the different features are going to be. But if you run the algorithm, hopefully the features will tend to capture what the important aspects are of these different movies, or different products, or what have you: the important aspects that cause some users to like certain movies and cause some users to like different sets of movies. So maybe you end up learning a feature where x1 equals romance and x2 equals action, similar to an earlier video, and maybe you learn a different feature x3, which is the degree to which this is a comedy, and then some feature x4, which is some other thing. And you have n features all together. After you have learned the features, it's actually often pretty difficult to go into the learned features and come up with a human-understandable interpretation of what these features really are. But in practice, even though these features can be hard to visualize, and it can be hard to figure out just what they are, the algorithm will usually learn features that are very meaningful for capturing whatever are the most important or most salient properties of a movie that cause you to like or dislike it.

So now let's say we want to address the following problem. Say you have some specific movie i and you want to find other movies j that are related to that movie. Well, why would you want to do this? Maybe you have a user that's browsing movies and is currently watching movie j; what's a reasonable movie to recommend to them to watch after they're done with movie j? Or if someone's recently purchased movie j, what's a different movie that would be reasonable to recommend to them to consider purchasing?
So, now that you have learned these feature vectors, this gives us a very convenient way to measure how similar two movies are. In particular, movie i has a feature vector x^(i), and so if you can find a different movie j so that the distance ||x^(i) - x^(j)|| between x^(i) and x^(j) is small, then this is a pretty strong indication that movies j and i are somehow similar, at least in the sense that someone who likes movie i may be more likely to like movie j as well.

So, just to recap, if your user is looking at some movie i and you want to find the 5 most similar movies to it in order to recommend 5 new movies to them, what you do is find the five movies j with the smallest distance ||x^(i) - x^(j)|| between the features of these different movies. This could give you a few different movies to recommend to your user (there's a small sketch of this lookup below).

So with that, hopefully you now know how to use a vectorized implementation to compute all the predicted ratings of all the users on all the movies, and also how to do things like use the learned features to find movies and products that are related to each other.
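To close, here is that small sketch of the related-movies lookup, again assuming the hypothetical X from the earlier sketches holds the learned feature vectors, one movie per row; the helper name is made up for illustration.

```python
def most_similar_movies(X, i, k=5):
    """Indices of the k movies whose feature vectors x^(j) are closest
    in Euclidean distance to movie i's feature vector x^(i)."""
    dists = np.linalg.norm(X - X[i], axis=1)  # ||x^(i) - x^(j)|| for all j
    dists[i] = np.inf                         # exclude the movie itself
    return np.argsort(dists)[:k]              # k smallest distances

# With only 5 example movies, ask for the 3 nearest neighbors of movie 0.
related = most_similar_movies(X, i=0, k=3)
```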