In the last few videos, we talked about the collaborative filtering algorithm. In this video I'm going to say a little bit about a vectorized implementation of this algorithm, and also talk a little bit about other things you can do with it. For example, one of the things you can do is, given one product, find other products that are related to it, so that if a user has recently been looking at one product, there are other related products that you could recommend to this user. So let's see what we could do about that.

What I'd like to do is work out an alternative way of writing out the predictions of the collaborative filtering algorithm. To start, here is our data set with our five movies, and what I'm going to do is take all the ratings by all the users and group them into a matrix. So here we have five movies and four users, and so this matrix Y is going to be a 5 by 4 matrix. It's just taking all of the elements, all of this data, including question marks, and grouping them into this matrix. And of course the (i, j) element of this matrix is really what we were previously writing as y^(i,j): the rating given to movie i by user j.

Given this matrix Y of all the ratings that we have, there's an alternative way of writing out all the predicted ratings of the algorithm. In particular, if you look at what a certain user predicts on a certain movie, what user j predicts on movie i is given by this formula, (theta^(j))^T x^(i). And so, if you form a matrix of the predicted ratings, what you would have is the following matrix, where the (i, j) entry
corresponds to the rating that we predict user j will give to movie i, and is exactly equal to that (theta^(j))^T x^(i). So this is a matrix where the first element, the (1,1) element, is the predicted rating of user one on movie one, and this element, the (1,2) element, is the predicted rating of user two on movie one, and so on; and this is the predicted rating of user one on the last movie. If you want, this rating is what we would have predicted for this value, and this rating is what we would have predicted for that value, and so on.

Now, given this matrix of predicted ratings, there is a simpler, or vectorized, way of writing these out. In particular, if I define the matrix X, and this is going to be just like the matrix we had earlier for linear regression, to be (x^(1))^T, (x^(2))^T, down to (x^(nm))^T, I'm taking all the features for my movies and stacking them in rows. So if you think of each movie as one example, we stack all the features of the different movies in rows. And if we also define a matrix capital Theta, what I'm going to do is take each of the per-user parameter vectors and stack them in rows, like so. So that's theta^(1), which is the parameter vector for the first user, and theta^(2), and so on. You stack them in rows like this to define a matrix capital Theta, and so I have n_u parameter vectors all stacked in rows like this.
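Here's a minimal NumPy sketch of those stacked matrices. Everything below is an illustrative placeholder, not taken from the video: the variable names, the example ratings, and the random numbers standing in for the learned parameters.

```python
import numpy as np

# Y holds the raw ratings, 5 movies x 4 users, so Y[i, j] = y^(i,j).
# np.nan stands in for the question marks (unrated entries); the
# particular values here are made up for illustration.
Y = np.array([
    [5.0,    5.0,    0.0,    0.0],
    [5.0,    np.nan, np.nan, 0.0],
    [np.nan, 4.0,    0.0,    np.nan],
    [0.0,    0.0,    5.0,    4.0],
    [0.0,    0.0,    5.0,    np.nan],
])

num_movies, num_users = Y.shape
n = 2  # number of learned features per movie and per user

# X stacks the movie feature vectors x^(1), ..., x^(nm) in rows, and
# Theta stacks the per-user parameter vectors theta^(1), ..., theta^(nu)
# in rows. Random values stand in for whatever the algorithm learned.
X = np.random.randn(num_movies, n)
Theta = np.random.randn(num_users, n)
```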
Now, given this definition for the matrix X and this definition for the matrix Theta, in order to have a vectorized way of computing the matrix of all the predictions, you can just compute X times Theta transpose, and that gives you a vectorized way of computing this matrix over here (there's a small code sketch of this below).

To give the collaborative filtering algorithm that you've been using another name: the algorithm that we're using is also called low rank matrix factorization. And so if you hear people talk about low rank matrix factorization, that's essentially exactly the algorithm that we have been talking about. This term comes from the property that the matrix X times Theta transpose has a mathematical property in linear algebra of being a low rank matrix, and so that's what gives rise to the name low rank matrix factorization for these algorithms. In case you don't know what low rank means, or what a low rank matrix is, don't worry about it; you really don't need to know that in order to use this algorithm. But if you're an expert in linear algebra, that's what gives this algorithm its other name of low rank matrix factorization.

Finally, having run the collaborative filtering algorithm, here's something else that you can do, which is use the learned features in order to find related movies. Specifically, for each product i, really for each movie i, we've learned a feature vector x^(i).
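Here's that small sketch of the vectorized prediction, continuing the hypothetical X and Theta defined above; the indices used in the check are arbitrary.

```python
# Each entry (i, j) of X @ Theta.T is (theta^(j))^T x^(i), the
# predicted rating of user j on movie i.
predictions = X @ Theta.T          # shape (num_movies, num_users)

# Spot-check one entry against the per-user, per-movie formula.
i, j = 1, 2
assert np.isclose(predictions[i, j], X[i] @ Theta[j])

# The "low rank" in the name: a (5 x n) times (n x 4) product has
# rank at most n, small relative to the full 5 x 4 matrix when n is small.
print(np.linalg.matrix_rank(predictions))  # at most n
```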
Now, when you learn a certain set of features, you don't really know in advance what the different features are going to be. But if you run the algorithm, hopefully the features will tend to capture what the important aspects are of these different movies, or different products, or what have you: the important aspects that cause some users to like certain movies and cause some users to like different sets of movies. So maybe you end up learning a feature where x1 equals romance and x2 equals action, similar to an earlier video, and maybe you learn a different feature x3, which is the degree to which this is a comedy, and then some feature x4, which is some other thing. And you have n features all together. After you have learned the features, it's actually often pretty difficult to go into the learned features and come up with a human-understandable interpretation of what these features really are. But in practice, even though these features can be hard to visualize, and it can be hard to figure out just what they are, the algorithm will usually learn features that are very meaningful for capturing whatever are the most important or most salient properties of a movie that cause you to like or dislike it.

So now let's say we want to address the following problem. Say you have some specific movie i and you want to find other movies j that are related to that movie. Well, why would you want to do this? Maybe you have a user that's browsing movies and is currently watching movie j; what's a reasonable movie to recommend to them to watch after they're done with movie j? Or if someone's recently purchased movie j, what's a different movie that would be reasonable to recommend to them to consider purchasing?
So, now that you have learned these feature vectors, this gives us a very convenient way to measure how similar two movies are. In particular, movie i has a feature vector x^(i), and so if you can find a different movie j so that the distance ||x^(i) - x^(j)|| between x^(i) and x^(j) is small, then this is a pretty strong indication that movies j and i are somehow similar, at least in the sense that someone who likes movie i may be more likely to like movie j as well.

So, just to recap, if your user is looking at some movie i and you want to find the 5 most similar movies to it in order to recommend 5 new movies to them, what you do is find the five movies j with the smallest distance ||x^(i) - x^(j)|| between the features of these different movies. This could give you a few different movies to recommend to your user (there's a small sketch of this lookup below).

So with that, hopefully you now know how to use a vectorized implementation to compute all the predicted ratings of all the users on all the movies, and also how to do things like use the learned features to find movies and products that are related to each other.
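To close, here is that small sketch of the related-movies lookup, again assuming the hypothetical X from the earlier sketches holds the learned feature vectors, one movie per row; the helper name is made up for illustration.

```python
def most_similar_movies(X, i, k=5):
    """Indices of the k movies whose feature vectors x^(j) are closest
    in Euclidean distance to movie i's feature vector x^(i)."""
    dists = np.linalg.norm(X - X[i], axis=1)  # ||x^(i) - x^(j)|| for all j
    dists[i] = np.inf                         # exclude the movie itself
    return np.argsort(dists)[:k]              # k smallest distances

# With only 5 example movies, ask for the 3 nearest neighbors of movie 0.
related = most_similar_movies(X, i=0, k=3)
```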