1
00:00:01,370 --> 00:00:02,420
In the last video, we talked

2
00:00:02,740 --> 00:00:04,200
about the recommender system problem,

3
00:00:05,030 --> 00:00:06,270
where for example, you may

4
00:00:06,380 --> 00:00:07,810
have a set of movies and you

5
00:00:07,940 --> 00:00:09,140
may have a set of users,

6
00:00:09,810 --> 00:00:10,960
each of whom has rated

7
00:00:11,670 --> 00:00:13,170
some subset of the movies,

8
00:00:13,370 --> 00:00:14,340
rated the movies 1 to

9
00:00:14,500 --> 00:00:15,460
5 stars or 0 to 5

10
00:00:15,630 --> 00:00:16,830
stars, and what I would like

11
00:00:17,200 --> 00:00:18,170
to do, is look at

12
00:00:18,240 --> 00:00:19,720
these users and predict how

13
00:00:19,910 --> 00:00:22,540
they would have rated other movies that they have not yet rated.

14
00:00:23,530 --> 00:00:24,540
In this video, I would like

15
00:00:24,600 --> 00:00:25,950
to talk about our first approach

16
00:00:26,430 --> 00:00:28,190
to building a recommender system, this

17
00:00:28,360 --> 00:00:30,100
approach is called content-based recommendations.

18
00:00:31,460 --> 00:00:32,690
Here's our data set from before,

19
00:00:33,310 --> 00:00:34,470
and just to remind you of a

20
00:00:34,550 --> 00:00:35,780
bit of notation, I was using

21
00:00:36,690 --> 00:00:37,870
Nu to denote the number

22
00:00:38,030 --> 00:00:39,110
of users, and so that's equal

23
00:00:39,290 --> 00:00:40,990
to 4, and Nm

24
00:00:41,990 --> 00:00:44,780
to denote the number of movies, I have five movies.

25
00:00:47,230 --> 00:00:48,140
So, how do I predict

26
00:00:48,960 --> 00:00:50,950
what these missing values would be?

27
00:00:52,490 --> 00:00:53,520
Let's suppose that for each

28
00:00:53,700 --> 00:00:55,500
of these movies, I have a

29
00:00:55,540 --> 00:00:57,460
set of features for them.

30
00:00:57,910 --> 00:00:58,990
In particular, let's say that

31
00:00:59,690 --> 00:01:00,850
for each of the movies I have two features,

32
00:01:01,920 --> 00:01:03,500
which I'm going to denote X1 and

33
00:01:04,080 --> 00:01:05,700
X2, where X1 measures the degree

34
00:01:06,130 --> 00:01:07,450
to which a movie is a

35
00:01:07,650 --> 00:01:09,270
romantic movie and X2 measures

36
00:01:09,810 --> 00:01:12,080
the degree to which a movie is an action movie.

37
00:01:12,840 --> 00:01:13,700
So if you take a movie

38
00:01:14,470 --> 00:01:16,490
Love at last, you know,

39
00:01:16,800 --> 00:01:17,960
0.9 rating on the

40
00:01:18,030 --> 00:01:19,190
romance scale, it is a

41
00:01:19,260 --> 00:01:20,850
highly romantic movie but zero on

42
00:01:20,920 --> 00:01:22,400
the action scale, so almost no

43
00:01:22,520 --> 00:01:24,390
action in that

44
00:01:24,540 --> 00:01:25,860
movie. Romance forever was 1.0,

45
00:01:26,230 --> 00:01:27,610
lot of romance and 0.01 action,

46
00:01:27,860 --> 00:01:29,790
I don't know maybe

47
00:01:30,700 --> 00:01:32,650
there's a minor car crash in

48
00:01:33,630 --> 00:01:35,580
that movie or something, so little bit of action.

49
00:01:35,610 --> 00:01:36,760
Skipping one let's do

50
00:01:37,860 --> 00:01:39,630
Swords vs. karate, maybe that

51
00:01:39,870 --> 00:01:41,110
has a zero romance rating

52
00:01:41,520 --> 00:01:42,780
and no romance at all in that

53
00:01:43,250 --> 00:01:46,040
but plenty of action. And, you know, Nonstop car chases:

54
00:01:46,300 --> 00:01:47,120
maybe again there is a

55
00:01:47,220 --> 00:01:48,390
tiny bit of romance in

56
00:01:48,500 --> 00:01:49,800
that movie, but mainly action,

57
00:01:50,460 --> 00:01:51,560
and Cute puppies of

58
00:01:51,680 --> 00:01:52,730
love is again mainly a romance

59
00:01:53,510 --> 00:01:54,410
movie with no action at all.

60
00:01:55,990 --> 00:01:57,150
So if we have features

61
00:01:57,550 --> 00:01:59,220
like these then each movie

62
00:01:59,800 --> 00:02:01,510
can be represented with a feature vector.

63
00:02:02,380 --> 00:02:03,810
Let's take movie 1, so just

64
00:02:04,020 --> 00:02:06,210
call these movies, you know, movies 1, 2, 3, 4 and 5.

65
00:02:06,630 --> 00:02:08,180
For my first movie,

66
00:02:08,520 --> 00:02:09,810
Love at last, I have

67
00:02:10,170 --> 00:02:11,710
my two features, 0.9 and

68
00:02:12,180 --> 00:02:12,950
0, and so these are features

69
00:02:13,380 --> 00:02:16,170
X1 and X2, and

70
00:02:16,340 --> 00:02:17,270
let's add an extra feature

71
00:02:17,790 --> 00:02:18,780
as usual, which is my

72
00:02:19,350 --> 00:02:21,640
intercept feature X0, which is equal to 1

73
00:02:22,680 --> 00:02:23,810
and so, putting these together,

74
00:02:24,700 --> 00:02:26,150
I would then have a feature vector X1,

75
00:02:26,970 --> 00:02:28,420
the superscript 1 denotes it's

76
00:02:28,510 --> 00:02:29,430
the feature vector for my first

77
00:02:29,770 --> 00:02:30,720
movie, and this feature

78
00:02:30,980 --> 00:02:32,520
vector is equal to one.

79
00:02:33,190 --> 00:02:34,880
The first one there is this intercept term,

80
00:02:35,740 --> 00:02:37,010
and then my two features 0.9, 0,

81
00:02:37,260 --> 00:02:39,330
like so.

82
00:02:40,370 --> 00:02:41,360
So, for Love at last, I

83
00:02:41,550 --> 00:02:43,470
would have a feature vector X1,

84
00:02:44,480 --> 00:02:46,220
for the movie Romance Forever, we

85
00:02:46,340 --> 00:02:47,510
have the separate feature vector

86
00:02:47,800 --> 00:02:49,310
X2 and so on, and

87
00:02:49,380 --> 00:02:50,780
for Swords vs. karate I would

88
00:02:51,510 --> 00:02:54,050
have a different feature vector x superscript 5.

89
00:02:56,150 --> 00:02:57,460
Also, consistent with our

90
00:02:57,680 --> 00:02:59,090
earlier notation that we were

91
00:02:59,300 --> 00:03:00,220
using, we're going to set N

92
00:03:00,490 --> 00:03:02,130
to be the number of features, not

93
00:03:02,360 --> 00:03:03,530
counting this X zero

94
00:03:03,810 --> 00:03:05,320
intercept term so n is

95
00:03:05,420 --> 00:03:06,600
equal to two because we have

96
00:03:06,790 --> 00:03:08,180
two features x1 and x2

97
00:03:08,890 --> 00:03:10,140
capturing the degree of romance

98
00:03:10,640 --> 00:03:11,980
and the degree of action in each

99
00:03:12,630 --> 00:03:14,270
movie. 
Now in order

100
00:03:14,560 --> 00:03:17,930
to make predictions, here is one thing we could do,

101
00:03:19,230 --> 00:03:20,980
which is that we could treat predicting

102
00:03:21,160 --> 00:03:22,340
the ratings of each user

103
00:03:23,250 --> 00:03:26,210
as a separate linear regression problem. So

104
00:03:26,440 --> 00:03:27,660
specifically let's say that for each

105
00:03:27,920 --> 00:03:29,170
user j we are going

106
00:03:29,270 --> 00:03:30,860
to learn a parameter vector theta

107
00:03:31,340 --> 00:03:33,030
J which would be in R3 in this case,

108
00:03:33,540 --> 00:03:35,730
more generally theta j would

109
00:03:35,950 --> 00:03:37,960
be in R n+1, where

110
00:03:38,340 --> 00:03:39,460
n is the number of features,

111
00:03:39,700 --> 00:03:42,170
not counting the intercept term, and we're going

112
00:03:42,440 --> 00:03:43,880
to predict user J as

113
00:03:44,050 --> 00:03:45,780
rating movie I, with just

114
00:03:46,000 --> 00:03:47,390
the inner product between the parameters

115
00:03:47,860 --> 00:03:50,590
vector theta J and the features XI.

116
00:03:51,830 --> 00:03:53,680
So, let's take a specific example.

117
00:03:55,130 --> 00:03:56,700
Let's take user one.

118
00:03:59,600 --> 00:04:01,120
So that would be Alice and

119
00:04:01,380 --> 00:04:02,700
associated with Alice would

120
00:04:02,830 --> 00:04:03,990
be some parameter vector,

121
00:04:04,810 --> 00:04:06,210
theta 1 and our

122
00:04:06,520 --> 00:04:07,610
second user Bob will be

123
00:04:07,720 --> 00:04:08,600
associated, with a different

124
00:04:08,970 --> 00:04:10,290
parameter vector theta 2.

125
00:04:10,800 --> 00:04:12,190
Carol will be associated with a

126
00:04:12,300 --> 00:04:13,360
different parameter vector theta

127
00:04:13,660 --> 00:04:14,790
3 and Dave a different

128
00:04:15,750 --> 00:04:17,670
parameter vector, theta 4. So

129
00:04:18,090 --> 00:04:18,990
let's say we want to make a

130
00:04:19,320 --> 00:04:21,040
prediction for what Alice will

131
00:04:21,240 --> 00:04:22,450
think of the movie, Cute

132
00:04:22,690 --> 00:04:24,640
puppies of love. Well that

133
00:04:24,810 --> 00:04:25,670
movie is going to have some

134
00:04:26,810 --> 00:04:29,180
feature vector X3, where

135
00:04:29,410 --> 00:04:30,400
we have that X3 is going

136
00:04:30,430 --> 00:04:32,460
to be equal to 1

137
00:04:32,650 --> 00:04:34,580
which is my intercept term, and

138
00:04:34,800 --> 00:04:37,220
then 0.99, and then 0.

139
00:04:38,560 --> 00:04:39,680
And let's say for this

140
00:04:39,810 --> 00:04:41,040
example, let's say that you

141
00:04:41,190 --> 00:04:42,890
know we have somehow already gotten

142
00:04:43,290 --> 00:04:44,600
a parameter vector theta 1

143
00:04:44,830 --> 00:04:45,700
for Alice--we will

144
00:04:45,850 --> 00:04:47,560
say later exactly how

145
00:04:47,800 --> 00:04:48,520
we come up with this parameter

146
00:04:48,600 --> 00:04:50,530
vector--but let's

147
00:04:50,710 --> 00:04:52,000
just say for now that you

148
00:04:52,150 --> 00:04:53,560
know some unspecified learning algorithm

149
00:04:54,040 --> 00:04:55,040
has learned the parameter vector

150
00:04:55,180 --> 00:04:56,970
theta 1 and it is

151
00:04:57,120 --> 00:04:59,260
equal to 0, 5, 0. And so

152
00:05:00,150 --> 00:05:02,010
our prediction for this

153
00:05:02,270 --> 00:05:04,130
entry is going to

154
00:05:04,260 --> 00:05:06,930
be equal to theta 1,

155
00:05:07,440 --> 00:05:08,760
that is Alice's parameter vector,

156
00:05:09,620 --> 00:05:11,450
transpose X3, that

157
00:05:11,620 --> 00:05:13,730
is the feature vector for

158
00:05:14,170 --> 00:05:16,050
the Cute Puppies of Love movie number 3.

159
00:05:16,250 --> 00:05:17,200
And so the inner

160
00:05:17,470 --> 00:05:18,470
product between these two vectors

161
00:05:19,910 --> 00:05:21,780
is going to be 5 x 0.99.

162
00:05:23,980 --> 00:05:26,340
Which is equal to 4.95.

163
00:05:27,360 --> 00:05:28,940
And so my prediction for this value

164
00:05:29,130 --> 00:05:30,930
over here is going to be 4.95.
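As a rough sketch of the calculation just described, written in Python with a helper name of my own choosing (predict_rating is an assumption, not something from the lecture), the inner product could be computed like this:
```python
# Illustrative sketch only; predict_rating is a hypothetical helper name.
def predict_rating(theta, x):
    """Inner product theta^T x of a user's parameters and a movie's features."""
    return sum(t * f for t, f in zip(theta, x))
theta_1 = [0.0, 5.0, 0.0]  # Alice's parameter vector from the example
x_3 = [1.0, 0.99, 0.0]     # Cute puppies of love: intercept, romance, action
prediction = predict_rating(theta_1, x_3)
print(round(prediction, 2))  # prints 4.95
```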

165
00:05:31,970 --> 00:05:33,110
And maybe that seems like a

166
00:05:33,230 --> 00:05:34,660
reasonable value, if indeed

167
00:05:36,130 --> 00:05:37,830
this is my parameter vector theta 1.

168
00:05:38,950 --> 00:05:40,290
So all we are doing here is

169
00:05:40,520 --> 00:05:42,710
we are applying a different copy of

170
00:05:42,930 --> 00:05:44,480
essentially linear regression for each

171
00:05:44,760 --> 00:05:46,020
user and we are saying

172
00:05:46,230 --> 00:05:47,610
that what Alice does, is

173
00:05:47,820 --> 00:05:48,880
Alice has some parameter vector

174
00:05:49,160 --> 00:05:50,400
theta 1 that she uses,

175
00:05:51,410 --> 00:05:52,380
that we use to predict

176
00:05:53,310 --> 00:05:54,770
her ratings as a

177
00:05:54,950 --> 00:05:56,190
function of how romantic and how

178
00:05:56,470 --> 00:05:57,540
action packed the movie is

179
00:05:58,210 --> 00:05:59,600
and Bob, and Carol, and

180
00:05:59,740 --> 00:06:01,010
Dave each of them have a

181
00:06:01,220 --> 00:06:03,170
different linear function of the

182
00:06:03,330 --> 00:06:04,700
romantic-ness and action-ness or the degree

183
00:06:05,220 --> 00:06:06,510
of romance and the degree of action

184
00:06:07,580 --> 00:06:08,030
in a movie,

185
00:06:08,820 --> 00:06:11,300
and that is how we're going to predict their star ratings.

186
00:06:14,820 --> 00:06:16,330
More formally here is

187
00:06:16,610 --> 00:06:17,920
how we can write down the problem.

188
00:06:19,260 --> 00:06:20,320
Our notation is that RIJ

189
00:06:20,690 --> 00:06:21,600
is equal to one, if

190
00:06:21,680 --> 00:06:22,910
user J has rated movie I,

191
00:06:23,380 --> 00:06:24,630
and YIJ is the rating

192
00:06:25,850 --> 00:06:28,010
of that movie if that rating exists.

193
00:06:29,540 --> 00:06:30,520
That is if that user has actually

194
00:06:31,030 --> 00:06:32,830
rated that movie. And

195
00:06:33,330 --> 00:06:34,360
on the previous slide we also

196
00:06:34,650 --> 00:06:36,540
defined theta J which

197
00:06:36,740 --> 00:06:38,790
is a parameter vector for each user, and XI

198
00:06:39,150 --> 00:06:40,830
which is a feature vector for a specific

199
00:06:41,220 --> 00:06:42,370
movie and for each user

200
00:06:42,850 --> 00:06:43,780
and each movie you would predict

201
00:06:44,300 --> 00:06:45,620
that rating, as follows.

202
00:06:47,230 --> 00:06:49,560
So let me introduce,

203
00:06:49,650 --> 00:06:51,600
just temporarily, introduce one extra

204
00:06:51,860 --> 00:06:53,530
bit of notation mj, we

205
00:06:53,760 --> 00:06:54,980
are gonna use mj to denote the

206
00:06:55,070 --> 00:06:56,140
number of movies rated by user

207
00:06:56,400 --> 00:06:57,350
j, we're gonna need this

208
00:06:57,580 --> 00:06:59,890
notation only for this slide. Now, in order to learn

209
00:07:00,160 --> 00:07:01,700
the parameter vector

210
00:07:01,760 --> 00:07:03,720
theta j, well, how can we do so?

211
00:07:04,410 --> 00:07:06,380
This is basically a linear regression problem.

212
00:07:06,930 --> 00:07:07,980
So what we can do, is

213
00:07:08,290 --> 00:07:09,810
just choose a parameter vector, theta j,

214
00:07:10,520 --> 00:07:12,100
so the predicted values

215
00:07:12,570 --> 00:07:13,620
here are as close

216
00:07:13,980 --> 00:07:15,280
as possible to the values

217
00:07:15,800 --> 00:07:18,760
that we observed in our training set, the values we observed in our data.

218
00:07:19,900 --> 00:07:21,390
So, let's write that down.

219
00:07:22,290 --> 00:07:24,320
In order to learn the

220
00:07:24,380 --> 00:07:26,960
parameter vector theta j, let's minimize over

221
00:07:27,170 --> 00:07:28,510
my parameter vector theta j,

222
00:07:29,400 --> 00:07:30,360
of sum--

223
00:07:31,920 --> 00:07:32,860
and I want to sum

224
00:07:33,290 --> 00:07:34,900
over all movies that user

225
00:07:35,240 --> 00:07:36,930
j has rated--so we write this as sum

226
00:07:37,270 --> 00:07:38,290
over all values of i

227
00:07:39,100 --> 00:07:42,000
that is, i colon rij equals 1.

228
00:07:43,870 --> 00:07:45,970
So the way to read this summation index is

229
00:07:46,370 --> 00:07:48,280
this is summation over all

230
00:07:48,470 --> 00:07:49,550
the values of i, so that

231
00:07:49,780 --> 00:07:51,180
r i j is equal to 1.

232
00:07:51,210 --> 00:07:52,470
So this is going to be summing over all the

233
00:07:52,560 --> 00:07:54,670
movies that user j has rated.

234
00:07:56,230 --> 00:07:57,000
And then I am going to

235
00:07:58,150 --> 00:07:59,910
compute theta j

236
00:08:01,810 --> 00:08:04,450
transpose xi so

237
00:08:04,610 --> 00:08:06,740
that's the prediction of user

238
00:08:07,030 --> 00:08:08,390
j's rating on movie i,

239
00:08:09,230 --> 00:08:10,960
minus y i j,

240
00:08:11,700 --> 00:08:13,700
so that's the actual observed rating squared,

241
00:08:15,190 --> 00:08:16,790
and then, let me just divide

242
00:08:17,260 --> 00:08:18,650
by the number of movies

243
00:08:19,040 --> 00:08:20,990
that user J, has

244
00:08:21,380 --> 00:08:23,910
actually rated, so just multiply by 1 over 2MJ.

245
00:08:24,000 --> 00:08:25,460
And so this is

246
00:08:25,690 --> 00:08:27,620
just like the least squares regression,

247
00:08:28,210 --> 00:08:29,550
it's just like linear regression

248
00:08:30,170 --> 00:08:31,170
where we want to choose

249
00:08:31,320 --> 00:08:34,480
the parameter vector theta J, to minimize this type of squared error term.

250
00:08:34,510 --> 00:08:35,090
And if you want to, you can

251
00:08:36,330 --> 00:08:39,580
also add in a regularization term

252
00:08:39,980 --> 00:08:41,870
so plus lambda over 2m, and

253
00:08:43,780 --> 00:08:44,930
this is really 2MJ because,

254
00:08:45,420 --> 00:08:47,760
this is as if we have MJ examples, right?

255
00:08:47,920 --> 00:08:49,330
Because if user J has

256
00:08:49,650 --> 00:08:50,910
rated that many movies, it's

257
00:08:51,050 --> 00:08:53,340
sort of like we have that many data points with which to fit

258
00:08:53,680 --> 00:08:55,790
the parameters theta J. And then

259
00:08:56,650 --> 00:08:57,390
let me add in my usual

260
00:08:58,340 --> 00:09:00,260
regularization term here of

261
00:09:00,460 --> 00:09:02,530
theta J K squared.

262
00:09:03,110 --> 00:09:04,270
As usual this sum is from

263
00:09:04,840 --> 00:09:05,980
K equals 1 through N

264
00:09:06,330 --> 00:09:08,670
so here theta J is

265
00:09:08,880 --> 00:09:10,050
going to be an N plus

266
00:09:10,520 --> 00:09:12,400
1 dimensional vector where,

267
00:09:12,620 --> 00:09:14,630
in our earlier example, n was equal to two,

268
00:09:15,320 --> 00:09:17,090
but more generally, n is

269
00:09:17,260 --> 00:09:20,980
the number of features we have per movie.

270
00:09:21,730 --> 00:09:22,270
And so as usual we don't regularize over theta 0.

271
00:09:22,390 --> 00:09:23,710
We don't regularize over the

272
00:09:23,910 --> 00:09:24,750
bias term because the sum is

273
00:09:24,930 --> 00:09:28,590
from K equals 1 through N. If

274
00:09:28,760 --> 00:09:30,430
you minimize this as

275
00:09:30,570 --> 00:09:31,780
a function of theta J you get a

276
00:09:31,900 --> 00:09:33,010
good solution, you get a

277
00:09:33,180 --> 00:09:35,330
pretty good estimate of a parameter vector theta j

278
00:09:36,490 --> 00:09:37,200
with which to make the predictions

279
00:09:37,940 --> 00:09:39,460
for user J's movie ratings.
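To make that objective concrete, here is a minimal Python sketch of the per-user cost with the 1 over 2MJ scaling. The function name user_cost and the two ratings are my own assumptions for illustration; only the movie features and Alice's theta come from the example.
```python
# Sketch under assumptions: user_cost is a hypothetical name, ratings are made up.
def user_cost(theta, movies, ratings, lam):
    """Regularized squared-error cost for one user, scaled by 1/(2*m_j)."""
    rated = [i for i, y in enumerate(ratings) if y is not None]  # i with r(i,j) = 1
    m_j = len(rated)  # number of movies this user has rated
    sq_err = sum((sum(t * f for t, f in zip(theta, movies[i])) - ratings[i]) ** 2
                 for i in rated)
    reg = sum(t ** 2 for t in theta[1:])  # k = 1..n, so theta_0 is skipped
    return sq_err / (2 * m_j) + lam * reg / (2 * m_j)
movies = [[1.0, 0.9, 0.0], [1.0, 1.0, 0.01], [1.0, 0.99, 0.0]]  # x0, romance, action
ratings = [5.0, 5.0, None]  # made-up ratings; None means not yet rated
cost = user_cost([0.0, 5.0, 0.0], movies, ratings, lam=0.0)
```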

280
00:09:40,820 --> 00:09:42,250
For recommender systems, we're going

281
00:09:42,520 --> 00:09:44,140
to change this notation a little

282
00:09:44,500 --> 00:09:46,130
bit. So to simplify the subsequent math,

283
00:09:46,690 --> 00:09:48,440
I'm actually going to get rid of this term MJ.

284
00:09:49,570 --> 00:09:50,720
So that's just a constant right

285
00:09:50,970 --> 00:09:52,140
so I can delete it without changing

286
00:09:53,000 --> 00:09:54,310
the value of theta J that

287
00:09:54,430 --> 00:09:55,840
I get out of this optimization,

288
00:09:56,010 --> 00:09:57,030
so if you imagine taking this

289
00:09:57,220 --> 00:09:58,850
whole equation, taking this

290
00:09:59,010 --> 00:10:00,290
whole expression and multiplying it by

291
00:10:00,870 --> 00:10:02,540
MJ to get rid of that constant, and when

292
00:10:02,950 --> 00:10:04,110
I minimize this I should still get

293
00:10:04,200 --> 00:10:06,590
the same value of theta J as before.

294
00:10:06,710 --> 00:10:07,780
So, just to repeat what

295
00:10:08,440 --> 00:10:10,060
we wrote on the previous slide, here

296
00:10:10,340 --> 00:10:12,250
is our optimization objective: In order

297
00:10:12,580 --> 00:10:13,620
to learn theta J, which is

298
00:10:13,990 --> 00:10:15,080
a parameter for user J,

299
00:10:15,790 --> 00:10:17,570
we're going to minimize over theta j

300
00:10:17,770 --> 00:10:19,820
this optimization objective. So

301
00:10:20,100 --> 00:10:21,360
this is our usual squared

302
00:10:21,720 --> 00:10:24,830
error term and then this is our regularization term.

303
00:10:26,050 --> 00:10:27,410
Now of course in building

304
00:10:27,690 --> 00:10:28,790
a recommender system we don't

305
00:10:29,030 --> 00:10:29,800
just want to learn parameters

306
00:10:30,420 --> 00:10:31,500
for a single user, we want

307
00:10:31,650 --> 00:10:33,140
to learn parameters for all of

308
00:10:33,490 --> 00:10:35,640
our users, I have n subscript u

309
00:10:35,760 --> 00:10:36,730
users, so I want to

310
00:10:36,950 --> 00:10:38,920
learn all of these parameters and

311
00:10:39,060 --> 00:10:39,830
so what I'm going to do

312
00:10:40,140 --> 00:10:42,320
is take this minimization, take

313
00:10:42,500 --> 00:10:45,480
this optimization objective and just add an extra summation there.

314
00:10:45,800 --> 00:10:47,610
So, you know, this expression here

315
00:10:48,410 --> 00:10:49,200
with the one half on top again, so

316
00:10:49,240 --> 00:10:50,510
it's exactly the same

317
00:10:50,780 --> 00:10:52,520
as what we have on top except

318
00:10:52,950 --> 00:10:53,980
that now, instead of just

319
00:10:54,090 --> 00:10:55,670
doing this for a specific user theta

320
00:10:55,960 --> 00:10:57,270
J, I'm going to sum

321
00:10:57,680 --> 00:10:59,340
my objective over all of

322
00:10:59,490 --> 00:11:00,940
my users and then minimize

323
00:11:01,260 --> 00:11:03,700
this overall optimization objective.

324
00:11:04,320 --> 00:11:05,570
Minimize this overall cost function.

325
00:11:06,730 --> 00:11:09,200
And when I minimize this

326
00:11:09,380 --> 00:11:10,560
as a function of theta 1,

327
00:11:11,360 --> 00:11:12,400
theta 2, up to

328
00:11:12,600 --> 00:11:14,130
theta NU, I will

329
00:11:14,270 --> 00:11:15,750
get a separate parameter

330
00:11:16,030 --> 00:11:17,340
vector for each user and

331
00:11:17,450 --> 00:11:18,720
I can then use that

332
00:11:19,090 --> 00:11:20,460
to make predictions for all of

333
00:11:20,530 --> 00:11:21,610
my users for all of

334
00:11:21,720 --> 00:11:23,150
my N subscript u users.

335
00:11:24,520 --> 00:11:26,560
So putting everything together, this

336
00:11:27,180 --> 00:11:28,730
was our optimization objective on

337
00:11:28,880 --> 00:11:29,940
top and to give

338
00:11:30,170 --> 00:11:31,070
this thing a name, I'll just call this

339
00:11:31,930 --> 00:11:33,480
J of theta 1,

340
00:11:33,630 --> 00:11:35,520
dot, dot, dot theta NU.

341
00:11:36,050 --> 00:11:37,280
So J as usual is my

342
00:11:37,590 --> 00:11:39,830
optimization objective which I'm trying to minimize.
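A small Python sketch of this overall cost function might look like the following. The name total_cost, the second user's parameters, and the ratings table are assumptions made purely for illustration.
```python
# Sketch only: total_cost and the data below are illustrative assumptions.
def total_cost(thetas, movies, ratings, lam):
    """Sum every user's squared-error and regularization terms (no 1/m_j factor)."""
    total = 0.0
    for j, theta in enumerate(thetas):
        for i, x in enumerate(movies):
            y = ratings[i][j]
            if y is None:  # r(i,j) = 0: user j has not rated movie i
                continue
            prediction = sum(t * f for t, f in zip(theta, x))
            total += 0.5 * (prediction - y) ** 2
        total += 0.5 * lam * sum(t ** 2 for t in theta[1:])  # skip the intercept
    return total
movies = [[1.0, 0.9, 0.0], [1.0, 1.0, 0.01], [1.0, 0.99, 0.0]]  # x0, romance, action
ratings = [[5.0, None], [5.0, 0.0], [None, None]]  # made-up ratings[i][j]
thetas = [[0.0, 5.0, 0.0], [0.0, 0.0, 5.0]]  # made-up parameters for two users
J = total_cost(thetas, movies, ratings, lam=0.0)
```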

343
00:11:41,330 --> 00:11:42,500
Next, in order to actually

344
00:11:42,880 --> 00:11:44,310
do the minimization, if you

345
00:11:44,500 --> 00:11:45,840
were to derive the gradient

346
00:11:46,150 --> 00:11:47,410
descent updates, these are

347
00:11:47,530 --> 00:11:48,720
the equations you would get,

348
00:11:49,900 --> 00:11:51,300
so you would take theta

349
00:11:51,750 --> 00:11:53,310
JK and subtract from

350
00:11:53,430 --> 00:11:56,190
it alpha, which is the learning rate, times these terms here on the right.

351
00:11:56,280 --> 00:11:57,540
So we have slightly different cases

352
00:11:58,160 --> 00:11:59,660
so when K equals 0 and when K is not

353
00:11:59,840 --> 00:12:01,460
equal to 0, because our regularization

354
00:12:01,960 --> 00:12:04,380
term here regularizes only the

355
00:12:04,910 --> 00:12:06,430
values of theta JK for

356
00:12:06,610 --> 00:12:07,690
K not equal to zero. So

357
00:12:07,830 --> 00:12:09,470
we don't regularize theta 0

358
00:12:10,090 --> 00:12:11,610
so there are slightly different updates

359
00:12:12,270 --> 00:12:13,580
for k equals zero, and k not equal to 0.

360
00:12:14,680 --> 00:12:16,080
And this term, over

361
00:12:16,250 --> 00:12:18,090
here, for example is just a partial

362
00:12:18,520 --> 00:12:20,790
derivative with respect to your parameter,

363
00:12:21,090 --> 00:12:24,300
of your

364
00:12:25,350 --> 00:12:28,270
optimization objective, right?

365
00:12:28,790 --> 00:12:30,280
And so, this is just

366
00:12:30,680 --> 00:12:33,000
gradient descent and I've

367
00:12:33,230 --> 00:12:35,440
already computed the derivatives and plugged them into here.

368
00:12:36,560 --> 00:12:39,580
If these gradient

369
00:12:40,570 --> 00:12:41,810
descent updates look a

370
00:12:41,980 --> 00:12:42,870
lot like what we had for

371
00:12:43,050 --> 00:12:44,700
linear regression, that's because these

372
00:12:44,880 --> 00:12:47,250
are essentially the same as linear regression.

373
00:12:48,190 --> 00:12:49,510
The only minor difference is that

374
00:12:49,780 --> 00:12:51,120
for linear regression we have

375
00:12:51,580 --> 00:12:52,600
these 1 over M terms

376
00:12:52,990 --> 00:12:54,710
- it's really 1

377
00:12:54,810 --> 00:12:56,770
over MJ - but

378
00:12:57,550 --> 00:12:59,230
because earlier when we were

379
00:12:59,370 --> 00:13:00,780
deriving the optimization objective

380
00:13:01,270 --> 00:13:03,540
we got rid of this, that's why we don't have this 1 over M term.

381
00:13:04,440 --> 00:13:05,880
But otherwise it's really sum over

382
00:13:06,080 --> 00:13:08,350
my training examples of, you

383
00:13:08,530 --> 00:13:09,890
know, the error times

384
00:13:10,230 --> 00:13:13,390
XK plus that regularization

385
00:13:14,900 --> 00:13:16,550
term's contribution to the derivative.

386
00:13:18,120 --> 00:13:19,040
So if you are using

387
00:13:19,200 --> 00:13:20,360
gradient descent, here is how

388
00:13:20,680 --> 00:13:22,140
you can minimize the cost

389
00:13:22,440 --> 00:13:23,880
function j, to learn all

390
00:13:24,110 --> 00:13:25,490
the parameters, and using these

391
00:13:25,640 --> 00:13:26,980
formulas for the derivatives, if

392
00:13:27,090 --> 00:13:28,240
you want, you can also plug them

393
00:13:28,440 --> 00:13:29,710
into a more advanced optimization

394
00:13:30,290 --> 00:13:31,710
algorithm like conjugate gradient or

395
00:13:31,810 --> 00:13:33,730
L-BFGS or what have you, and use

396
00:13:33,940 --> 00:13:35,930
that to try to minimize the cost function J as well.
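As a hedged illustration, one gradient descent step for a single user could be sketched in Python as below. The name gradient_step, the learning rate, and the ratings are assumptions of mine; the update follows the two cases described, with no regularization on theta 0.
```python
# Sketch under assumptions: gradient_step is a hypothetical name, data is made up.
def gradient_step(theta, movies, ratings, alpha, lam):
    """One gradient descent update of one user's parameter vector."""
    rated = [i for i, y in enumerate(ratings) if y is not None]  # i with r(i,j) = 1
    errors = {i: sum(t * f for t, f in zip(theta, movies[i])) - ratings[i]
              for i in rated}
    new_theta = []
    for k in range(len(theta)):
        grad = sum(errors[i] * movies[i][k] for i in rated)  # sum of error times x_k
        if k != 0:
            grad += lam * theta[k]  # regularization applies only for k not equal to 0
        new_theta.append(theta[k] - alpha * grad)
    return new_theta
movies = [[1.0, 0.9, 0.0], [1.0, 1.0, 0.01], [1.0, 0.99, 0.0]]  # x0, romance, action
ratings = [5.0, 5.0, None]  # made-up ratings; None means not yet rated
theta_new = gradient_step([0.0, 0.0, 0.0], movies, ratings, alpha=0.1, lam=0.0)
```
Repeating this step for every user, until the cost stops decreasing, is one simple way to carry out the minimization.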

397
00:13:37,360 --> 00:13:38,450
So hopefully you now know

398
00:13:38,750 --> 00:13:40,510
how you can apply essentially a

399
00:13:41,000 --> 00:13:42,820
variation on linear regression in

400
00:13:42,950 --> 00:13:45,460
order to predict different movie ratings by different users.

401
00:13:46,350 --> 00:13:47,510
This particular algorithm is called

402
00:13:48,030 --> 00:13:49,930
content-based recommendations, or a

403
00:13:50,040 --> 00:13:51,980
content-based approach, because we

404
00:13:52,130 --> 00:13:53,200
assume that we have available

405
00:13:53,650 --> 00:13:55,430
to us, features for the different movies.

406
00:13:56,150 --> 00:13:57,330
So we have features that

407
00:13:57,490 --> 00:13:58,610
capture what is the

408
00:13:58,700 --> 00:14:00,260
content of these movies. How romantic is this movie?

409
00:14:01,280 --> 00:14:03,050
How much action is in this movie?

410
00:14:03,430 --> 00:14:04,690
And we are really using features of the

411
00:14:04,780 --> 00:14:06,910
content of the movies to make our predictions.

412
00:14:08,350 --> 00:14:09,770
But for many movies we

413
00:14:09,920 --> 00:14:11,300
don't actually have such features,

414
00:14:11,820 --> 00:14:13,630
or it may be very difficult to get

415
00:14:13,850 --> 00:14:14,970
such features for all of

416
00:14:15,050 --> 00:14:16,160
our movies, for all

417
00:14:16,460 --> 00:14:17,800
of whatever items we are trying to sell.

418
00:14:18,880 --> 00:14:20,430
So in the next video, we'll

419
00:14:20,590 --> 00:14:21,530
start to talk about an approach

420
00:14:22,010 --> 00:14:23,290
to recommender systems that isn't

421
00:14:23,570 --> 00:14:24,710
content-based and does not

422
00:14:24,980 --> 00:14:26,090
assume that we have

423
00:14:26,670 --> 00:14:28,420
someone else giving us all of these features,

424
00:14:28,880 --> 00:14:30,300
for all of the movies in our data set.