1
00:00:00,400 --> 00:00:01,510
By now you've seen all

2
00:00:01,800 --> 00:00:03,600
of the main pieces of the

3
00:00:04,030 --> 00:00:06,760
recommender system algorithm or the collaborative filtering algorithm.

4
00:00:07,770 --> 00:00:08,770
In this video I want

5
00:00:08,940 --> 00:00:10,620
to just share one last implementational detail,

6
00:00:12,000 --> 00:00:14,140
namely mean normalization, which

7
00:00:14,350 --> 00:00:15,520
can sometimes just make the

8
00:00:15,570 --> 00:00:17,090
algorithm work a little bit better.

9
00:00:18,290 --> 00:00:20,820
To motivate the idea of mean normalization, let's

10
00:00:22,130 --> 00:00:24,390
consider an example of where

11
00:00:24,710 --> 00:00:26,530
there's a user that has not rated any movies.

12
00:00:28,050 --> 00:00:29,290
So, in addition to our

13
00:00:29,540 --> 00:00:30,790
four users, Alice, Bob, Carol,

14
00:00:31,060 --> 00:00:32,710
and Dave, I've added a

15
00:00:32,870 --> 00:00:35,110
fifth user, Eve, who hasn't rated any movies.

16
00:00:36,470 --> 00:00:37,920
Let's see what our collaborative filtering

17
00:00:38,350 --> 00:00:39,570
algorithm will do on this user.

18
00:00:41,020 --> 00:00:43,140
Let's say that n is equal to 2 and so

19
00:00:43,390 --> 00:00:44,420
we're going to learn two features

20
00:00:45,410 --> 00:00:46,470
and we are going to have

21
00:00:46,630 --> 00:00:47,890
to learn a parameter vector theta

22
00:00:48,140 --> 00:00:50,420
5, which is going to be

23
00:00:51,130 --> 00:00:52,560
in R2, remember this

24
00:00:52,750 --> 00:00:55,920
is now vectors in Rn not Rn+1,

25
00:00:57,070 --> 00:00:58,210
we'll learn the parameter vector theta 5 for our user number 5, Eve.

26
00:00:59,780 --> 00:01:00,800
So if we look in

27
00:01:00,960 --> 00:01:02,020
the first term in this

28
00:01:02,200 --> 00:01:04,020
optimization objective, well the

29
00:01:04,220 --> 00:01:05,490
user Eve hasn't rated any

30
00:01:05,730 --> 00:01:07,860
movies, so there are

31
00:01:08,120 --> 00:01:10,750
no movies for

32
00:01:11,050 --> 00:01:12,810
which Rij is equal to

33
00:01:13,130 --> 00:01:14,590
one for the user Eve and

34
00:01:14,700 --> 00:01:15,840
so this first term plays no

35
00:01:16,060 --> 00:01:17,400
role at all in determining theta 5

36
00:01:18,610 --> 00:01:19,790
because there are no movies that Eve has rated.

37
00:01:20,960 --> 00:01:22,120
And so the only term that

38
00:01:22,260 --> 00:01:24,300
effects theta 5 is this term.

39
00:01:24,880 --> 00:01:25,830
And so we're saying that we

40
00:01:25,910 --> 00:01:28,840
want to choose vector theta 5 so

41
00:01:28,950 --> 00:01:33,820
that the last regularization term is

42
00:01:34,540 --> 00:01:35,500
as small as possible.

43
00:01:35,920 --> 00:01:38,470
In other words we want to minimize this

44
00:01:39,040 --> 00:01:39,610
lambda over 2 theta 5 subscript 1 squared

45
00:01:40,880 --> 00:01:43,150
plus theta 5

46
00:01:43,820 --> 00:01:45,840
subscript 2 squared so

47
00:01:46,040 --> 00:01:47,170
that's the component of the

48
00:01:47,270 --> 00:01:49,460
regularization term that corresponds to

49
00:01:49,740 --> 00:01:51,610
user 5, and of course

50
00:01:51,850 --> 00:01:53,280
if your goal is to

51
00:01:53,550 --> 00:01:55,540
minimize this term, then

52
00:01:55,900 --> 00:01:56,790
what you're going to end up

53
00:01:56,980 --> 00:01:58,520
with is just theta 5 equals 0 0.

54
00:01:59,670 --> 00:02:01,550
Because a regularization term

55
00:02:01,850 --> 00:02:03,270
is encouraging us to set

56
00:02:03,510 --> 00:02:05,120
parameters close to 0

57
00:02:05,620 --> 00:02:07,580
and if there is

58
00:02:07,730 --> 00:02:08,820
no data to try to

59
00:02:08,990 --> 00:02:10,210
pull the parameters away from

60
00:02:10,410 --> 00:02:11,460
0, because this first term

61
00:02:12,710 --> 00:02:13,800
doesn't effect theta 5,

62
00:02:13,880 --> 00:02:15,410
we just end up with theta 5

63
00:02:15,690 --> 00:02:18,450
equals the vector of all zeros. And

64
00:02:18,590 --> 00:02:19,610
so when we go to

65
00:02:19,730 --> 00:02:20,920
predict how user 5 would

66
00:02:21,280 --> 00:02:22,570
rate any movie, we have

67
00:02:22,890 --> 00:02:25,850
that theta 5 transpose xi,

68
00:02:26,900 --> 00:02:28,380
for any i, that's just going

69
00:02:29,950 --> 00:02:31,060
to be equal to zero.

70
00:02:31,570 --> 00:02:33,320
Because theta 5 is 0 for any value of

71
00:02:33,750 --> 00:02:35,780
x, this inner product is going to be equal to 0. And what we're

72
00:02:35,900 --> 00:02:37,160
going to have therefore, is that

73
00:02:37,310 --> 00:02:38,780
we're going to predict that Eve

74
00:02:39,480 --> 00:02:40,870
is going to rate every single

75
00:02:41,170 --> 00:02:42,690
movie with zero stars.

76
00:02:44,050 --> 00:02:45,970
But this doesn't seem very useful does it?

77
00:02:46,110 --> 00:02:47,310
I mean if you look at the different movies,

78
00:02:47,770 --> 00:02:49,710
Love at Last, this first movie,

79
00:02:50,130 --> 00:02:52,300
a couple people rated it 5 stars.

80
00:02:54,940 --> 00:02:56,870
And for even the Swords vs. Karate, someone rated it 5 stars.

81
00:02:57,410 --> 00:02:58,780
So some people do like some movies.

82
00:02:59,270 --> 00:03:01,030
It seems not useful to

83
00:03:01,160 --> 00:03:03,750
just predict that Eve is going to rate everything 0 stars.

84
00:03:04,570 --> 00:03:05,850
And in fact if we're predicting

85
00:03:06,410 --> 00:03:08,340
that eve is going to rate everything 0 stars,

86
00:03:09,050 --> 00:03:10,100
we also don't have any

87
00:03:10,280 --> 00:03:11,660
good way of recommending any movies

88
00:03:11,810 --> 00:03:12,930
to her, because you know

89
00:03:13,130 --> 00:03:15,320
all of these movies are getting exactly the

90
00:03:15,410 --> 00:03:16,810
same predicted rating for Eve

91
00:03:17,010 --> 00:03:18,500
so there's no one movie with

92
00:03:18,660 --> 00:03:20,010
a higher predicted rating that

93
00:03:20,210 --> 00:03:22,880
we could recommend to her, so, that's not very good.

94
00:03:24,520 --> 00:03:27,340
The idea of mean normalization will let us fix this problem.

95
00:03:28,160 --> 00:03:29,410
So here's how it works.

96
00:03:30,760 --> 00:03:31,720
As before let me group all of my

97
00:03:32,370 --> 00:03:33,750
movie ratings into this matrix

98
00:03:34,280 --> 00:03:35,400
Y, so just take all of

99
00:03:35,460 --> 00:03:36,700
these ratings and group them

100
00:03:37,240 --> 00:03:38,400
into matrix Y.  And this

101
00:03:38,560 --> 00:03:39,740
column over here of all

102
00:03:39,910 --> 00:03:41,220
question marks corresponds to

103
00:03:41,670 --> 00:03:43,300
Eve's not having rated any movies.

104
00:03:44,830 --> 00:03:46,890
Now to perform mean normalization what I'm going to

105
00:03:47,140 --> 00:03:48,350
do is compute the average

106
00:03:48,720 --> 00:03:50,610
rating that each movie obtained.

107
00:03:51,120 --> 00:03:51,760
And I'm going to store that

108
00:03:52,040 --> 00:03:54,780
in a vector that we'll call mu.

109
00:03:55,210 --> 00:03:57,250
So the first movie got two 5-star and two 0-star ratings,

110
00:03:57,760 --> 00:03:58,960
so the average of that is a 2.5-star rating.

111
00:03:59,040 --> 00:04:01,470
The second movie had

112
00:04:01,620 --> 00:04:04,300
an average of 2.5-stars and so on.

113
00:04:04,470 --> 00:04:06,300
And the final movie that has 0, 0, 5, 0.

114
00:04:06,330 --> 00:04:07,440
And the average of 0, 0,

115
00:04:07,520 --> 00:04:09,190
5, 0, that averages out to

116
00:04:09,620 --> 00:04:11,500
an average of 1.25

117
00:04:12,240 --> 00:04:14,910
rating. And what I'm going to

118
00:04:15,000 --> 00:04:15,900
do is look at all

119
00:04:16,020 --> 00:04:17,610
the movie ratings and I'm going

120
00:04:18,010 --> 00:04:19,550
to subtract off the mean rating.

121
00:04:20,110 --> 00:04:22,990
So this first element 5 I'm going to subtract off 2.5 and that gives me 2.5.

122
00:04:26,900 --> 00:04:29,380
And the second element 5 subtract off of 2.5,

123
00:04:29,590 --> 00:04:30,000
get a 2.5.

124
00:04:30,410 --> 00:04:31,760
And then the 0,

125
00:04:32,040 --> 00:04:34,560
0, subtract off 2.5 and you get -2.5, -2.5.

126
00:04:35,450 --> 00:04:36,530
In other words, what

127
00:04:36,620 --> 00:04:38,010
I'm going to do is take

128
00:04:38,310 --> 00:04:39,440
my matrix of movie ratings,

129
00:04:39,960 --> 00:04:42,070
take this wide matrix, and

130
00:04:42,730 --> 00:04:45,580
subtract form each row the average rating for that movie.

131
00:04:46,580 --> 00:04:47,580
So, what I'm doing is

132
00:04:48,010 --> 00:04:49,600
just normalizing each movie to

133
00:04:49,740 --> 00:04:51,610
have an average rating of zero.

134
00:04:52,800 --> 00:04:53,580
And so just one last example.

135
00:04:54,000 --> 00:04:56,010
If you look at this last row, 0 0 5 0.

136
00:04:56,270 --> 00:04:56,940
We're going to subtract 1.25, and

137
00:04:57,000 --> 00:04:58,590
so I end up with

138
00:05:00,950 --> 00:05:02,300
these values over here.

139
00:05:02,510 --> 00:05:03,730
So now and of course

140
00:05:03,940 --> 00:05:05,380
the question marks stay a question

141
00:05:06,960 --> 00:05:06,960
mark.

142
00:05:07,880 --> 00:05:09,630
So each movie in

143
00:05:09,810 --> 00:05:11,040
this new matrix Y has

144
00:05:11,210 --> 00:05:12,780
an average rating of 0.

145
00:05:13,940 --> 00:05:15,180
What I'm going to do then, is

146
00:05:15,440 --> 00:05:16,850
take this set of ratings

147
00:05:17,590 --> 00:05:20,170
and use it with my collaborative filtering algorithm.

148
00:05:20,480 --> 00:05:22,130
So I'm going to pretend that this

149
00:05:22,430 --> 00:05:24,200
was the data that I had

150
00:05:24,420 --> 00:05:25,570
gotten from my users, or pretend that

151
00:05:25,810 --> 00:05:27,400
these are the actual ratings I

152
00:05:27,530 --> 00:05:28,940
had gotten from the users, and I'm

153
00:05:29,250 --> 00:05:30,130
going to use this as my

154
00:05:30,270 --> 00:05:31,730
data set with which to

155
00:05:32,000 --> 00:05:33,920
learn my parameters theta

156
00:05:34,560 --> 00:05:36,540
J and my features XI

157
00:05:36,860 --> 00:05:39,320
- from these mean normalized movie ratings.

158
00:05:41,280 --> 00:05:42,040
When I want to make predictions

159
00:05:42,660 --> 00:05:43,910
of movie ratings, what I'm

160
00:05:44,070 --> 00:05:44,980
going to do is the

161
00:05:45,250 --> 00:05:46,830
following:  for user J on movie

162
00:05:47,130 --> 00:05:49,250
I, I'm gonna predict theta

163
00:05:49,600 --> 00:05:54,730
J transpose XI, where

164
00:05:55,070 --> 00:05:55,990
X and theta are the parameters

165
00:05:56,590 --> 00:05:58,230
that I've learned from this mean normalized data set.

166
00:05:59,180 --> 00:06:00,680
But, because on the data

167
00:06:00,950 --> 00:06:02,260
set, I had subtracted off the

168
00:06:02,330 --> 00:06:04,000
means in order to make

169
00:06:04,040 --> 00:06:05,210
a prediction on movie i,

170
00:06:05,510 --> 00:06:07,220
I'm going to need to add back in the mean,

171
00:06:08,070 --> 00:06:08,730
and so i'm going to add

172
00:06:08,840 --> 00:06:10,690
back in mu i. And

173
00:06:10,830 --> 00:06:11,780
so that's going to be

174
00:06:11,830 --> 00:06:13,350
my prediction where in my training

175
00:06:13,660 --> 00:06:14,860
data subtracted off all the

176
00:06:14,930 --> 00:06:16,290
means and so when we

177
00:06:16,440 --> 00:06:20,770
make predictions and we need

178
00:06:21,770 --> 00:06:23,030
to add back in these

179
00:06:23,410 --> 00:06:23,880
means mu i for movie i.  And

180
00:06:24,100 --> 00:06:25,320
so specifically if you user

181
00:06:25,330 --> 00:06:26,840
5 which is Eve, the same argument as

182
00:06:27,010 --> 00:06:28,250
the previous slide still applies in

183
00:06:28,440 --> 00:06:29,870
the sense that Eve had

184
00:06:30,080 --> 00:06:31,600
not rated any movies and

185
00:06:31,760 --> 00:06:32,930
so the learned parameter for

186
00:06:33,710 --> 00:06:35,030
user 5 is still going to

187
00:06:35,970 --> 00:06:37,990
be equal to 0, 0.

188
00:06:38,270 --> 00:06:39,910
And so what we're

189
00:06:40,130 --> 00:06:41,320
going to get then is that

190
00:06:41,690 --> 00:06:42,980
on a particular movie i we're

191
00:06:43,130 --> 00:06:44,900
going to predict for Eve theta

192
00:06:45,480 --> 00:06:49,930
5, transpose xi plus

193
00:06:51,260 --> 00:06:52,890
add back in mu i and

194
00:06:53,010 --> 00:06:54,360
so this first component is

195
00:06:54,460 --> 00:06:57,520
going to be equal to zero, if theta five is equal to zero.

196
00:06:58,290 --> 00:06:59,190
And so on movie i, we

197
00:06:59,260 --> 00:07:00,660
are going to end a predicting mu

198
00:07:01,090 --> 00:07:03,190
i. And, this actually makes sense.

199
00:07:03,380 --> 00:07:03,690
It means that

200
00:07:03,900 --> 00:07:05,270
on movie 1 we're

201
00:07:05,390 --> 00:07:06,990
going to predict Eve rates it 2.5.

202
00:07:07,270 --> 00:07:10,260
On movie 2 we're gonna predict Eve rates it 2.5.

203
00:07:10,420 --> 00:07:11,640
On movie 3 we're

204
00:07:11,880 --> 00:07:13,000
gonna predict Eve rates it at 2

205
00:07:13,200 --> 00:07:14,510
and so on.

206
00:07:14,780 --> 00:07:15,960
This actually makes sense, because it says

207
00:07:16,320 --> 00:07:17,730
that if Eve hasn't rated

208
00:07:18,020 --> 00:07:18,870
any movies and we just

209
00:07:19,100 --> 00:07:20,180
don't know anything about this new

210
00:07:20,410 --> 00:07:21,630
user Eve, what we're going

211
00:07:21,810 --> 00:07:23,770
to do is just predict for

212
00:07:23,940 --> 00:07:25,140
each of the movies, what are

213
00:07:25,230 --> 00:07:27,520
the average rating that those movies got.

214
00:07:30,060 --> 00:07:31,480
Finally, as an aside, in

215
00:07:31,810 --> 00:07:33,290
this video we talked about mean

216
00:07:33,540 --> 00:07:35,220
normalization, where we normalized

217
00:07:35,320 --> 00:07:36,450
each row of the matrix y,

218
00:07:37,510 --> 00:07:38,100
to have mean 0.

219
00:07:39,020 --> 00:07:40,730
In case you have some movies

220
00:07:41,020 --> 00:07:42,330
with no ratings, so it is

221
00:07:42,590 --> 00:07:44,320
analogous to a user who hasn't rated anything,

222
00:07:44,590 --> 00:07:45,550
but in case you have some

223
00:07:46,250 --> 00:07:47,530
movies with no ratings, you

224
00:07:47,590 --> 00:07:48,700
can also play with versions

225
00:07:49,320 --> 00:07:50,700
of the algorithm, where you

226
00:07:50,900 --> 00:07:52,190
normalize the different columns

227
00:07:52,790 --> 00:07:53,990
to have means zero, instead of

228
00:07:54,280 --> 00:07:55,180
normalizing the rows to have mean

229
00:07:55,500 --> 00:07:56,990
zero, although that's maybe

230
00:07:57,240 --> 00:07:58,770
less important, because if you

231
00:07:58,870 --> 00:07:59,810
really have a movie with no

232
00:08:00,040 --> 00:08:01,390
rating, maybe you just

233
00:08:01,590 --> 00:08:03,920
shouldn't recommend that movie to anyone, anyway.

234
00:08:04,700 --> 00:08:08,010
And so, taking

235
00:08:08,540 --> 00:08:09,980
care of the case of a user who hasn't

236
00:08:10,490 --> 00:08:11,780
rated anything might be more

237
00:08:12,010 --> 00:08:13,170
important than taking care of

238
00:08:13,310 --> 00:08:14,550
the case of a movie

239
00:08:14,860 --> 00:08:16,090
that hasn't gotten a single rating.

240
00:08:18,930 --> 00:08:20,080
So to summarize, that's how

241
00:08:20,360 --> 00:08:21,830
you can do mean normalization as

242
00:08:22,110 --> 00:08:25,110
a sort of pre-processing step for collaborative filtering.

243
00:08:25,740 --> 00:08:26,670
Depending on your data set,

244
00:08:26,960 --> 00:08:28,140
this might some times make your implementation

245
00:08:28,540 --> 00:08:30,040
work just a little bit better.