1
00:00:01,060 --> 00:00:02,420
In this video we'll talk about

2
00:00:02,620 --> 00:00:03,900
an approach to building a

3
00:00:03,970 --> 00:00:06,390
recommender system that's called collaborative filtering.

4
00:00:07,540 --> 00:00:08,880
The algorithm that we're talking

5
00:00:09,180 --> 00:00:10,400
about has a very interesting

6
00:00:10,680 --> 00:00:11,830
property that it does

7
00:00:12,120 --> 00:00:13,290
what is called feature learning and

8
00:00:13,790 --> 00:00:14,800
by that I mean that this

9
00:00:14,960 --> 00:00:16,270
will be an algorithm that can

10
00:00:16,450 --> 00:00:19,010
start to learn for itself what features to use.

11
00:00:21,130 --> 00:00:22,100
Here was the data set that

12
00:00:22,220 --> 00:00:23,440
we had and we had

13
00:00:23,720 --> 00:00:25,030
assumed that for each movie,

14
00:00:25,690 --> 00:00:27,000
someone had come and told

15
00:00:27,320 --> 00:00:28,640
us how romantic that

16
00:00:28,840 --> 00:00:30,550
movie was and how much action there was in that movie.

17
00:00:31,680 --> 00:00:32,880
But as you can imagine it

18
00:00:33,020 --> 00:00:34,320
can be very difficult and time

19
00:00:34,500 --> 00:00:36,390
consuming and expensive to actually try

20
00:00:36,490 --> 00:00:37,860
to get someone to, you know,

21
00:00:38,050 --> 00:00:39,440
watch each movie and tell

22
00:00:39,700 --> 00:00:40,880
you how romantic each movie and

23
00:00:41,410 --> 00:00:42,570
how action packed is each

24
00:00:42,660 --> 00:00:44,270
movie, and often you'll

25
00:00:44,390 --> 00:00:46,760
want even more features than just these two.

26
00:00:46,980 --> 00:00:48,130
So where do you get these features from?

27
00:00:49,890 --> 00:00:50,920
So let's change the problem

28
00:00:51,500 --> 00:00:53,220
a bit and suppose that

29
00:00:53,980 --> 00:00:55,160
we have a data set where

30
00:00:55,410 --> 00:00:57,980
we do not know the values of these features.

31
00:00:58,380 --> 00:00:59,280
So we're given the data set

32
00:00:59,640 --> 00:01:01,140
of movies and of

33
00:01:01,270 --> 00:01:03,550
how the users rated them, but we

34
00:01:03,760 --> 00:01:05,190
have no idea how romantic each

35
00:01:05,370 --> 00:01:06,140
movie is and we have no

36
00:01:06,310 --> 00:01:07,660
idea how action packed each

37
00:01:07,820 --> 00:01:09,940
movie is so I've replaced all of these things with question marks.

38
00:01:11,310 --> 00:01:12,330
But now let's make a slightly different assumption.

39
00:01:13,870 --> 00:01:15,570
Let's say we've gone to each of our users, and each of our users has told has told us

40
00:01:15,980 --> 00:01:18,510
how much they like the

41
00:01:18,820 --> 00:01:20,040
romantic movies and how much

42
00:01:20,220 --> 00:01:22,320
they like action packed movies.

43
00:01:22,830 --> 00:01:26,090
So Alice has associated a current of theta 1.

44
00:01:26,820 --> 00:01:27,470
Bob theta 2.

45
00:01:27,910 --> 00:01:28,440
Carol theta 3.

46
00:01:28,970 --> 00:01:30,330
Dave theta 4.

47
00:01:30,500 --> 00:01:31,530
And let's say we also use this

48
00:01:31,780 --> 00:01:33,040
and that Alice tells us

49
00:01:33,380 --> 00:01:35,340
that she really

50
00:01:35,610 --> 00:01:36,960
likes romantic movies and so

51
00:01:37,140 --> 00:01:38,780
there's a five there which

52
00:01:39,280 --> 00:01:41,210
is the multiplier associated with X1 and lets

53
00:01:41,570 --> 00:01:42,680
say that Alice tells us she

54
00:01:42,840 --> 00:01:45,030
really doesn't like action movies and so there's a 0 there.

55
00:01:46,060 --> 00:01:47,190
And Bob tells us something similar

56
00:01:48,660 --> 00:01:49,770
so we have theta 2 over here.

57
00:01:50,630 --> 00:01:52,460
Whereas Carol tells us that

58
00:01:53,570 --> 00:01:54,720
she really likes action movies

59
00:01:55,240 --> 00:01:56,450
which is why there's a 5 there,

60
00:01:56,900 --> 00:01:58,600
that's the multiplier associated with X2,

61
00:01:58,980 --> 00:02:00,160
and remember there's also

62
00:02:01,210 --> 00:02:03,490
X0 equals 1 and let's

63
00:02:03,770 --> 00:02:05,390
say that Carol tells us

64
00:02:05,610 --> 00:02:07,000
she doesn't like romantic

65
00:02:07,390 --> 00:02:09,640
movies and so on, similarly for Dave.

66
00:02:09,840 --> 00:02:11,030
So let's assume that somehow

67
00:02:11,440 --> 00:02:12,830
we can go to users and

68
00:02:13,290 --> 00:02:14,600
each user J just tells

69
00:02:15,020 --> 00:02:16,160
us what is the value

70
00:02:17,090 --> 00:02:18,870
of theta J for them.

71
00:02:19,450 --> 00:02:22,190
And so basically specifies to us of how much they like different types of movies.

72
00:02:24,060 --> 00:02:25,280
If we can get these parameters

73
00:02:25,990 --> 00:02:27,890
theta from our users then it

74
00:02:28,050 --> 00:02:29,820
turns out that it becomes possible to

75
00:02:29,960 --> 00:02:31,210
try to infer what are the

76
00:02:31,310 --> 00:02:33,710
values of x1 and x2 for each movie.

77
00:02:34,800 --> 00:02:35,140
Let's look at an example.

78
00:02:35,730 --> 00:02:36,560
Let's look at movie 1.

79
00:02:38,690 --> 00:02:39,790
So that movie 1 has associated

80
00:02:40,580 --> 00:02:42,050
with it a feature vector x1.

81
00:02:42,890 --> 00:02:45,420
And you know this movie is called Love at last but let's ignore that.

82
00:02:45,770 --> 00:02:46,750
Let's pretend we don't know what

83
00:02:46,870 --> 00:02:49,060
this movie is about, so let's ignore the title of this movie.

84
00:02:50,180 --> 00:02:52,270
All we know is that Alice loved this move.

85
00:02:52,450 --> 00:02:53,110
Bob loved this movie.

86
00:02:53,750 --> 00:02:55,370
Carol and Dave hated this movie.

87
00:02:56,450 --> 00:02:57,450
So what can we infer?

88
00:02:57,830 --> 00:02:58,900
Well, we know from the

89
00:02:58,990 --> 00:03:00,510
feature vectors that Alice

90
00:03:00,780 --> 00:03:03,190
and Bob love romantic movies

91
00:03:03,700 --> 00:03:05,660
because they told us that there's a 5 here.

92
00:03:06,290 --> 00:03:07,480
Whereas Carol and Dave,

93
00:03:08,380 --> 00:03:10,150
we know that they hate

94
00:03:10,510 --> 00:03:11,920
romantic movies and that

95
00:03:12,300 --> 00:03:13,990
they love action movies. So

96
00:03:14,730 --> 00:03:16,050
because those are the parameter

97
00:03:16,340 --> 00:03:18,830
vectors that you know, uses 3 and 4, Carol and Dave, gave us.

98
00:03:20,110 --> 00:03:20,950
And so based on the fact

99
00:03:21,390 --> 00:03:22,340
that movie 1 is loved

100
00:03:22,880 --> 00:03:24,120
by Alice and Bob and

101
00:03:24,340 --> 00:03:26,460
hated by Carol and Dave, we might

102
00:03:26,910 --> 00:03:30,810
reasonably conclude that this is probably a romantic movie,

103
00:03:31,180 --> 00:03:34,240
it is probably not much of an action movie.

104
00:03:35,290 --> 00:03:36,360
this example is a little

105
00:03:36,520 --> 00:03:38,090
bit mathematically simplified but what

106
00:03:38,260 --> 00:03:40,330
we're really asking is what

107
00:03:40,590 --> 00:03:42,010
feature vector should X1 be

108
00:03:42,840 --> 00:03:45,370
so that theta 1 transpose

109
00:03:46,030 --> 00:03:48,940
x1 is approximately equal to 5,

110
00:03:49,660 --> 00:03:51,700
that's Alice's rating, and

111
00:03:51,920 --> 00:03:55,360
theta 2 transpose x1 is

112
00:03:55,510 --> 00:03:56,660
also approximately equal to 5,

113
00:03:57,670 --> 00:03:59,100
and theta 3 transpose x1 is

114
00:03:59,310 --> 00:04:02,180
approximately equal to 0,

115
00:04:03,020 --> 00:04:05,250
so this would be Carol's rating, and

116
00:04:06,970 --> 00:04:09,780
theta 4 transpose X1

117
00:04:10,740 --> 00:04:11,630
is approximately equal to 0.

118
00:04:12,590 --> 00:04:13,520
And from this it looks

119
00:04:13,770 --> 00:04:16,000
like, you know, X1 equals

120
00:04:16,870 --> 00:04:18,770
one that's the intercept term, and

121
00:04:19,080 --> 00:04:20,960
then 1.0, 0.0, that makes sense

122
00:04:21,310 --> 00:04:22,390
given what we know of Alice,

123
00:04:22,790 --> 00:04:24,110
Bob, Carol, and Dave's preferences

124
00:04:24,770 --> 00:04:25,940
for movies and the way they rated this movie.

125
00:04:27,700 --> 00:04:29,080
And so more generally, we can

126
00:04:29,220 --> 00:04:30,210
go down this list and try

127
00:04:30,430 --> 00:04:31,520
to figure out what might

128
00:04:31,700 --> 00:04:35,260
be reasonable features for these other movies as well.

129
00:04:39,160 --> 00:04:41,890
Let's formalize this problem of learning the features XI.

130
00:04:42,410 --> 00:04:44,220
Let's say that our

131
00:04:44,340 --> 00:04:45,860
users have given us their preferences.

132
00:04:46,580 --> 00:04:47,950
So let's say that our users have

133
00:04:48,130 --> 00:04:49,100
come and, you know, told us

134
00:04:49,330 --> 00:04:50,800
these values for theta 1

135
00:04:50,890 --> 00:04:52,990
through theta of NU

136
00:04:53,280 --> 00:04:54,430
and we want to learn the

137
00:04:54,790 --> 00:04:56,130
feature vector XI for movie

138
00:04:56,540 --> 00:04:58,020
number I. What we can

139
00:04:58,200 --> 00:05:00,830
do is therefore pose the following optimization problem.

140
00:05:01,220 --> 00:05:02,210
So we want to sum over

141
00:05:02,840 --> 00:05:04,600
all the indices J for

142
00:05:04,930 --> 00:05:06,280
which we have a rating

143
00:05:06,950 --> 00:05:08,340
for movie I because

144
00:05:08,750 --> 00:05:10,040
we're trying to learn the features

145
00:05:10,950 --> 00:05:13,560
for movie I that is this feature vector XI.

146
00:05:14,650 --> 00:05:15,660
So and then what we

147
00:05:15,780 --> 00:05:18,450
want to do is minimize this squared

148
00:05:19,020 --> 00:05:20,160
error, so we want to choose

149
00:05:20,420 --> 00:05:22,430
features XI, so that,

150
00:05:22,900 --> 00:05:25,000
you know, the predictive value of

151
00:05:25,200 --> 00:05:26,820
how user J rates movie

152
00:05:27,110 --> 00:05:28,170
I will be similar,

153
00:05:28,900 --> 00:05:30,130
will be not too far in the

154
00:05:30,440 --> 00:05:31,910
squared error sense of the actual

155
00:05:32,530 --> 00:05:35,330
value YIJ that we actually observe in

156
00:05:35,530 --> 00:05:37,130
the rating of user j

157
00:05:38,310 --> 00:05:40,790
on movie I.
So, just to

158
00:05:41,040 --> 00:05:42,320
summarize what this term does

159
00:05:42,840 --> 00:05:44,060
is it tries to choose features

160
00:05:45,040 --> 00:05:46,590
XI so that for

161
00:05:46,960 --> 00:05:48,210
all the users J that

162
00:05:48,360 --> 00:05:50,190
have rated that movie, the

163
00:05:50,860 --> 00:05:52,830
algorithm also predicts a

164
00:05:52,900 --> 00:05:55,490
value for how that user would have rated that movie

165
00:05:56,170 --> 00:05:57,720
that is not too far, in

166
00:05:57,810 --> 00:05:59,730
the squared error sense, from the

167
00:06:00,000 --> 00:06:02,310
actual value that the user had rated that movie.

168
00:06:03,380 --> 00:06:04,560
So that's the squared error term.

169
00:06:05,420 --> 00:06:07,200
As usual, we can

170
00:06:07,310 --> 00:06:08,430
also add this sort of

171
00:06:08,520 --> 00:06:09,850
regularization term to prevent

172
00:06:10,300 --> 00:06:11,870
the features from becoming too big.

173
00:06:13,720 --> 00:06:15,610
So this is how we

174
00:06:15,760 --> 00:06:16,910
would learn the features

175
00:06:17,420 --> 00:06:19,140
for one specific movie but

176
00:06:19,690 --> 00:06:20,480
what we want to do is

177
00:06:20,740 --> 00:06:22,060
learn all the features for all

178
00:06:22,230 --> 00:06:23,820
the movies and so what

179
00:06:24,080 --> 00:06:25,050
I'm going to do is add this

180
00:06:25,240 --> 00:06:26,620
extra summation here so

181
00:06:26,780 --> 00:06:28,840
I'm going to sum over all Nm

182
00:06:29,260 --> 00:06:33,140
movies, N subscript m movies, and minimize

183
00:06:33,830 --> 00:06:34,670
this objective on top

184
00:06:35,010 --> 00:06:37,080
that sums of all movies.

185
00:06:37,410 --> 00:06:39,930
And if you do that, you end up with the following optimization problem.

186
00:06:40,950 --> 00:06:42,320
And if you minimize

187
00:06:42,890 --> 00:06:44,520
this, you have hopefully a

188
00:06:44,680 --> 00:06:47,440
reasonable set of features for all of your movies.

189
00:06:48,650 --> 00:06:50,080
So putting everything together, what

190
00:06:50,210 --> 00:06:51,050
we, the algorithm we talked

191
00:06:51,330 --> 00:06:52,730
about in the previous video and

192
00:06:53,180 --> 00:06:54,810
the algorithm that we just talked about in this video.

193
00:06:55,730 --> 00:06:57,070
In the previous video, what we

194
00:06:57,180 --> 00:06:58,710
showed was that you know,

195
00:06:58,820 --> 00:06:59,700
if you have a set of

196
00:06:59,790 --> 00:07:00,640
movie ratings, so if you

197
00:07:00,640 --> 00:07:03,960
have the data the rij's and

198
00:07:04,090 --> 00:07:06,100
then you have the yij's that will be the movie ratings.

199
00:07:08,500 --> 00:07:09,650
Then given features for your

200
00:07:09,800 --> 00:07:11,800
different movies we can learn these parameters theta.

201
00:07:12,340 --> 00:07:13,110
So if you knew the features,

202
00:07:13,830 --> 00:07:15,000
you can learn the parameters

203
00:07:15,650 --> 00:07:16,850
theta for your different users.

204
00:07:18,250 --> 00:07:19,770
And what we showed earlier in

205
00:07:19,930 --> 00:07:21,400
this video is that if

206
00:07:21,790 --> 00:07:22,860
your users are willing to

207
00:07:23,000 --> 00:07:25,450
give you parameters, then you

208
00:07:25,560 --> 00:07:28,060
can estimate features for the different movies.

209
00:07:29,270 --> 00:07:31,490
So this is kind of a chicken and egg problem.

210
00:07:31,770 --> 00:07:32,290
Which comes first?

211
00:07:32,900 --> 00:07:35,570
You know, do we want if we can get the thetas, we can know the Xs.

212
00:07:36,060 --> 00:07:38,160
If we have the Xs, we can learn the thetas.

213
00:07:39,500 --> 00:07:40,500
And what you can

214
00:07:40,680 --> 00:07:41,790
do is, and then

215
00:07:41,910 --> 00:07:43,000
this actually works, what you

216
00:07:43,110 --> 00:07:44,530
can do is in fact randomly

217
00:07:45,170 --> 00:07:47,160
guess some value of the thetas.

218
00:07:48,210 --> 00:07:49,200
Now based on your initial random

219
00:07:49,530 --> 00:07:50,630
guess for the thetas, you can

220
00:07:50,940 --> 00:07:52,530
then go ahead and use

221
00:07:53,160 --> 00:07:54,210
the procedure that we just talked

222
00:07:54,460 --> 00:07:55,810
about in order to

223
00:07:56,060 --> 00:07:57,740
learn features for your different movies.

224
00:07:58,800 --> 00:07:59,990
Now given some initial set

225
00:08:00,130 --> 00:08:01,160
of features for your movies you

226
00:08:01,240 --> 00:08:02,730
can then use this first

227
00:08:03,050 --> 00:08:04,060
method that we talked about

228
00:08:04,130 --> 00:08:06,180
in the previous video to try to get

229
00:08:06,360 --> 00:08:08,590
an even better estimate for your parameters theta.

230
00:08:09,560 --> 00:08:12,420
Now that you have a better setting of the parameters theta for your users,

231
00:08:12,860 --> 00:08:13,850
we can use that to maybe

232
00:08:14,070 --> 00:08:15,140
even get a better set of

233
00:08:15,240 --> 00:08:17,110
features and so on.

234
00:08:17,380 --> 00:08:18,400
We can sort of keep

235
00:08:18,600 --> 00:08:19,440
iterating, going back and forth

236
00:08:19,790 --> 00:08:21,270
and optimizing theta, x theta,

237
00:08:21,560 --> 00:08:24,000
x theta, nd this

238
00:08:24,270 --> 00:08:25,290
actually works and if you

239
00:08:25,410 --> 00:08:26,340
do this, this will actually

240
00:08:26,800 --> 00:08:28,360
cause your album to converge

241
00:08:28,930 --> 00:08:30,430
to a reasonable set of

242
00:08:31,340 --> 00:08:32,650
features for you movies and a

243
00:08:32,790 --> 00:08:34,880
reasonable set of parameters for your different users.

244
00:08:36,080 --> 00:08:38,870
So this is a basic collaborative filtering algorithm.

245
00:08:39,770 --> 00:08:40,850
This isn't actually the final

246
00:08:41,020 --> 00:08:42,890
algorithm that we're going to use. In the next

247
00:08:43,120 --> 00:08:44,100
video we are going to be able to improve

248
00:08:44,790 --> 00:08:45,610
on this algorithm and make

249
00:08:45,920 --> 00:08:47,430
it quite a bit more computationally efficient.

250
00:08:48,390 --> 00:08:49,510
But, hopefully this gives you

251
00:08:49,640 --> 00:08:50,600
a sense of how you

252
00:08:50,680 --> 00:08:51,980
can formulate a

253
00:08:52,040 --> 00:08:52,990
problem where you can simultaneously

254
00:08:53,930 --> 00:08:57,200
learn the parameters and simultaneously learn the features from the different movies.

255
00:08:58,440 --> 00:08:59,660
And for this problem, for the

256
00:08:59,740 --> 00:09:01,100
recommender system problem, this is

257
00:09:01,390 --> 00:09:02,950
possible only because each user

258
00:09:03,490 --> 00:09:04,840
rates multiple movies and hopefully

259
00:09:05,100 --> 00:09:06,410
each movie is rated

260
00:09:06,790 --> 00:09:08,710
by multiple users. And so

261
00:09:09,280 --> 00:09:10,150
you can do this back and

262
00:09:10,270 --> 00:09:11,150
forth process to estimate theta

263
00:09:11,200 --> 00:09:14,400
and x. 
So to

264
00:09:14,830 --> 00:09:15,910
summarize, in this video we've

265
00:09:16,140 --> 00:09:18,710
seen an initial collaborative filtering algorithm.

266
00:09:19,680 --> 00:09:21,550
The term collaborative filtering refers

267
00:09:22,020 --> 00:09:23,620
to the observation that when

268
00:09:23,760 --> 00:09:25,020
you run this algorithm with a

269
00:09:25,210 --> 00:09:26,790
large set of users, what all

270
00:09:26,960 --> 00:09:28,410
of these users are effectively doing are sort of

271
00:09:29,070 --> 00:09:31,300
collaboratively--or collaborating to

272
00:09:31,490 --> 00:09:32,770
get better movie ratings for

273
00:09:33,010 --> 00:09:34,610
everyone because with every

274
00:09:34,840 --> 00:09:36,540
user rating some subset with the movies,

275
00:09:37,350 --> 00:09:39,040
every user is helping the

276
00:09:39,300 --> 00:09:41,490
algorithm a little bit to learn better features,

277
00:09:42,900 --> 00:09:44,390
and then by helping--

278
00:09:44,490 --> 00:09:46,690
by rating a few movies myself, I will be helping

279
00:09:47,810 --> 00:09:49,550
the system learn better features and

280
00:09:49,680 --> 00:09:50,750
then these features can be used

281
00:09:50,930 --> 00:09:52,610
by the system to make better

282
00:09:52,890 --> 00:09:54,380
movie predictions for everyone else.

283
00:09:54,640 --> 00:09:55,450
And so there is a sense of

284
00:09:55,530 --> 00:09:56,980
collaboration where every user is

285
00:09:57,370 --> 00:09:58,980
helping the system learn better features

286
00:09:59,360 --> 00:10:00,740
for the common good. This

287
00:10:00,810 --> 00:10:03,450
is this collaborative filtering.

288
00:10:04,070 --> 00:10:04,990
And, in the next video what we

289
00:10:05,140 --> 00:10:07,490
going to do is take the ideas that

290
00:10:07,740 --> 00:10:08,850
have worked out, and try to

291
00:10:09,090 --> 00:10:09,910
develop a better an even

292
00:10:10,170 --> 00:10:11,920
better algorithm, a slightly better

293
00:10:12,180 --> 00:10:13,640
technique for collaborative filtering.