1
00:00:00,109 --> 00:00:02,030
In this video, I'd like to talk

2
00:00:02,030 --> 00:00:03,738
about a new large-scale

3
00:00:03,738 --> 00:00:05,369
machine learning setting called

4
00:00:05,369 --> 00:00:07,073
the online learning setting.

5
00:00:07,442 --> 00:00:08,731
The online learning setting

6
00:00:08,731 --> 00:00:10,659
allows us to model problems

7
00:00:10,659 --> 00:00:12,074
where we have a continuous flood

8
00:00:12,074 --> 00:00:14,064
or a continuous stream of data

9
00:00:14,064 --> 00:00:15,906
coming in and we would like

10
00:00:15,906 --> 00:00:17,839
an algorithm to learn from that.

11
00:00:18,762 --> 00:00:20,759
Today, many of the largest

12
00:00:20,759 --> 00:00:22,245
websites, or many of the largest

13
00:00:22,245 --> 00:00:24,335
website companies use different

14
00:00:24,335 --> 00:00:25,901
versions of online learning

15
00:00:25,901 --> 00:00:28,102
algorithms to learn from

16
00:00:28,117 --> 00:00:29,468
the flood of users that keep

17
00:00:29,468 --> 00:00:31,370
on coming back to the website.

18
00:00:31,370 --> 00:00:32,943
Specifically, if you have

19
00:00:32,943 --> 00:00:34,992
a continuous stream of data

20
00:00:34,992 --> 00:00:36,371
generated by a continuous

21
00:00:36,371 --> 00:00:37,703
stream of users coming to

22
00:00:37,703 --> 00:00:39,413
your website, what you can

23
00:00:39,413 --> 00:00:40,844
do is sometimes use an

24
00:00:40,844 --> 00:00:42,632
online learning algorithm to learn

25
00:00:42,632 --> 00:00:44,492
user preferences from the

26
00:00:44,492 --> 00:00:46,324
stream of data and use that

27
00:00:46,324 --> 00:00:47,470
to optimize some of the

28
00:00:47,470 --> 00:00:49,632
decisions on your website.

29
00:00:52,063 --> 00:00:54,506
Suppose you run a shipping service,

30
00:00:54,506 --> 00:00:56,163
so, you know, users come and ask

31
00:00:56,163 --> 00:00:57,307
you to help ship their package from

32
00:00:57,307 --> 00:01:01,533
location A to location B and suppose

33
00:01:01,533 --> 00:01:02,717
you run a website, where users

34
00:01:02,717 --> 00:01:04,110
repeatedly come and they

35
00:01:04,110 --> 00:01:05,689
tell you where they want

36
00:01:05,689 --> 00:01:07,291
to send the package from, and

37
00:01:07,291 --> 00:01:08,523
where they want to send it to

38
00:01:08,523 --> 00:01:10,947
(so the origin and destination) and

39
00:01:10,947 --> 00:01:12,748
your website offers to ship the package

40
00:01:12,748 --> 00:01:14,515
for some asking price,

41
00:01:14,515 --> 00:01:16,092
so I'll ship your package for $50,

42
00:01:16,092 --> 00:01:17,926
I'll ship it for $20.

43
00:01:17,926 --> 00:01:19,343
And based on the price

44
00:01:19,343 --> 00:01:20,922
that you offer to the users,

45
00:01:20,922 --> 00:01:23,522
the users sometimes choose to use your shipping service;

46
00:01:23,522 --> 00:01:25,891
that's a positive example and

47
00:01:25,891 --> 00:01:28,168
sometimes they go away and

48
00:01:28,168 --> 00:01:29,722
they do not choose to

49
00:01:29,722 --> 00:01:31,719
purchase your shipping service.

50
00:01:31,719 --> 00:01:34,552
So let's say that we want

51
00:01:34,552 --> 00:01:36,386
a learning algorithm to help us

52
00:01:36,386 --> 00:01:38,499
to optimize what is the asking

53
00:01:38,499 --> 00:01:41,680
price that we want to offer to our users.

54
00:01:41,680 --> 00:01:43,724
And specifically, let's say we

55
00:01:43,724 --> 00:01:44,908
come up with some sort of features

56
00:01:44,908 --> 00:01:46,510
that capture properties of the users.

57
00:01:46,510 --> 00:01:49,376
If we know anything about the demographics,

58
00:01:49,376 --> 00:01:50,875
they capture, you know, the origin and

59
00:01:50,875 --> 00:01:54,405
destination of the package, where they want to ship the package.

60
00:01:54,405 --> 00:01:55,635
And what is the price

61
00:01:55,635 --> 00:01:57,911
that we offer to them for shipping the package.

62
00:01:57,911 --> 00:01:59,931
And what we want to do

63
00:01:59,931 --> 00:02:00,883
is learn what is the

64
00:02:00,883 --> 00:02:02,439
probability that they will

65
00:02:02,439 --> 00:02:03,762
elect to ship the

66
00:02:03,762 --> 00:02:05,457
package, using our

67
00:02:05,457 --> 00:02:07,315
shipping service given these features, and

68
00:02:07,315 --> 00:02:10,197
again just as a reminder these

69
00:02:10,197 --> 00:02:14,121
features x also capture the price that we're asking for.

70
00:02:14,121 --> 00:02:15,790
And so if we could

71
00:02:15,790 --> 00:02:17,486
estimate the chance that they'll

72
00:02:17,486 --> 00:02:19,629
agree to use our service

73
00:02:19,629 --> 00:02:20,962
for any given price, then we

74
00:02:20,962 --> 00:02:21,967
can try to pick

75
00:02:21,967 --> 00:02:23,183
a price so that they

76
00:02:23,183 --> 00:02:25,125
have a pretty high probability of

77
00:02:25,125 --> 00:02:27,841
choosing our website while simultaneously

78
00:02:27,841 --> 00:02:29,188
hopefully offering us a

79
00:02:29,188 --> 00:02:31,371
fair return, offering us

80
00:02:31,371 --> 00:02:34,293
a fair profit for shipping their package.

81
00:02:34,585 --> 00:02:36,489
So if we can learn this probability

82
00:02:36,489 --> 00:02:37,733
of y equals 1 given

83
00:02:37,733 --> 00:02:38,632
any price and given the other

84
00:02:38,632 --> 00:02:39,660
features we could really

85
00:02:39,660 --> 00:02:41,657
use this to choose appropriate

86
00:02:41,657 --> 00:02:44,072
prices as new users come to us.

87
00:02:44,072 --> 00:02:45,907
So in order to model

88
00:02:45,907 --> 00:02:47,277
the probability of y equals 1,

89
00:02:47,277 --> 00:02:48,972
what we can do is use

90
00:02:48,972 --> 00:02:51,781
logistic regression or a neural

91
00:02:51,781 --> 00:02:53,756
network or some other algorithm like that.

92
00:02:53,756 --> 00:02:55,889
But let's start with logistic regression.

93
00:02:57,658 --> 00:02:59,583
Now if you have a

94
00:02:59,583 --> 00:03:01,835
website that just runs continuously,

95
00:03:01,835 --> 00:03:05,342
here's what an online learning algorithm would do.

96
00:03:05,342 --> 00:03:07,478
I'm going to write "repeat forever".

97
00:03:07,478 --> 00:03:09,730
This just means that our website

98
00:03:09,730 --> 00:03:11,170
is going to, you know, keep on

99
00:03:11,170 --> 00:03:12,911
staying up.

100
00:03:12,911 --> 00:03:14,351
What happens on the website is

101
00:03:14,351 --> 00:03:16,465
occasionally a user

102
00:03:16,465 --> 00:03:17,950
will come and for

103
00:03:17,950 --> 00:03:19,576
the user that comes we'll get

104
00:03:19,576 --> 00:03:25,380
some x,y pair corresponding to

105
00:03:25,380 --> 00:03:29,096
a customer or to a user on the website.

106
00:03:29,096 --> 00:03:30,884
So the features x are, you

107
00:03:30,884 --> 00:03:32,811
know, the origin and destination specified

108
00:03:32,811 --> 00:03:34,111
by this user and the price

109
00:03:34,111 --> 00:03:35,358
that we happened to offer to

110
00:03:35,358 --> 00:03:37,292
them this time around, and

111
00:03:37,292 --> 00:03:38,430
y is either one or

112
00:03:38,430 --> 00:03:40,148
zero depending on whether or

113
00:03:40,148 --> 00:03:41,518
not they chose to

114
00:03:41,518 --> 00:03:43,980
use our shipping service.

115
00:03:43,980 --> 00:03:45,419
Now once we get this (x, y)

116
00:03:45,419 --> 00:03:46,813
pair, what an online

117
00:03:46,813 --> 00:03:48,391
learning algorithm does is then

118
00:03:48,391 --> 00:03:50,690
update the parameters theta

119
00:03:50,690 --> 00:03:54,011
using just this example

120
00:03:54,011 --> 00:03:57,726
x,y, and in particular

121
00:03:57,726 --> 00:03:59,839
we would update my parameters theta

122
00:03:59,839 --> 00:04:01,842
as theta j gets updated as theta j

123
00:04:01,842 --> 00:04:06,619
minus the learning rate alpha times

124
00:04:06,619 --> 00:04:11,356
my usual gradient descent

125
00:04:11,356 --> 00:04:13,399
rule for logistic regression.

126
00:04:13,399 --> 00:04:14,491
So we do this for j

127
00:04:14,491 --> 00:04:15,652
equals zero up to n,

128
00:04:15,652 --> 00:04:19,088
and that's my close curly brace.

129
00:04:19,088 --> 00:04:21,218
So, for other learning algorithms

130
00:04:21,218 --> 00:04:22,873
instead of writing x, y, right, I

131
00:04:22,873 --> 00:04:24,011
was writing things like x(i),

132
00:04:24,011 --> 00:04:26,495
y(i), but

133
00:04:26,495 --> 00:04:27,842
in this online learning setting

134
00:04:27,842 --> 00:04:29,723
we're actually discarding the notion

135
00:04:29,723 --> 00:04:31,464
of there being a fixed training

136
00:04:31,464 --> 00:04:32,904
set; instead we have an algorithm.

137
00:04:32,904 --> 00:04:34,924
Now what happens is we get

138
00:04:34,924 --> 00:04:37,014
an example and then we

139
00:04:37,014 --> 00:04:38,825
learn using that example like

140
00:04:38,825 --> 00:04:41,031
so and then we throw that example away.

141
00:04:41,031 --> 00:04:43,098
We discard that example and we

142
00:04:43,098 --> 00:04:45,141
never use it again and

143
00:04:45,141 --> 00:04:47,161
so that's why we just look at one example at a time.

144
00:04:47,161 --> 00:04:48,879
We learn from that example.

145
00:04:48,879 --> 00:04:50,412
We discard it.

146
00:04:50,412 --> 00:04:51,527
Which is why, you know, we're

147
00:04:51,527 --> 00:04:52,943
also doing away with this

148
00:04:52,943 --> 00:04:54,615
notion of there being this

149
00:04:54,615 --> 00:04:58,191
sort of fixed training set indexed by i.

150
00:04:58,191 --> 00:04:59,328
And, if you really run

151
00:04:59,328 --> 00:05:01,488
a major website where you

152
00:05:01,488 --> 00:05:03,624
really have a continuous stream

153
00:05:03,624 --> 00:05:05,737
of users coming, then this

154
00:05:05,737 --> 00:05:07,525
sort of online learning algorithm

155
00:05:07,525 --> 00:05:10,358
is actually a pretty reasonable algorithm.

156
00:05:10,358 --> 00:05:12,076
Because if data is essentially

157
00:05:12,076 --> 00:05:13,330
free if you have so

158
00:05:13,330 --> 00:05:14,979
much data, that data

159
00:05:14,979 --> 00:05:17,022
is essentially unlimited then there

160
00:05:17,022 --> 00:05:17,997
really may be no

161
00:05:17,997 --> 00:05:18,949
need to look at a

162
00:05:18,949 --> 00:05:21,527
training example more than once.

163
00:05:21,527 --> 00:05:22,432
Of course if we had only

164
00:05:22,432 --> 00:05:24,220
a small number of users then

165
00:05:24,220 --> 00:05:26,333
rather than using an online learning

166
00:05:26,333 --> 00:05:27,912
algorithm like this, you might

167
00:05:27,912 --> 00:05:29,421
be better off saving away all

168
00:05:29,421 --> 00:05:30,884
your data in a fixed training

169
00:05:30,884 --> 00:05:34,042
set and then running some algorithm over that training set.

170
00:05:34,042 --> 00:05:35,018
But if you really have a continuous

171
00:05:35,018 --> 00:05:36,341
stream of data, then an

172
00:05:36,341 --> 00:05:39,881
online learning algorithm can be very effective.

173
00:05:39,881 --> 00:05:41,171
I should mention also that one

174
00:05:41,171 --> 00:05:43,015
interesting effect of this sort

175
00:05:43,015 --> 00:05:44,073
of online learning algorithm is

176
00:05:44,073 --> 00:05:49,391
that it can adapt to changing user preferences.

177
00:05:51,006 --> 00:05:54,592
And in particular, if over

178
00:05:54,592 --> 00:05:55,776
time because of changes in

179
00:05:55,776 --> 00:05:58,377
the economy maybe users

180
00:05:58,377 --> 00:05:59,957
start to become more price

181
00:05:59,957 --> 00:06:01,395
sensitive and,

182
00:06:01,395 --> 00:06:03,717
you know, less willing to pay high prices.

183
00:06:03,717 --> 00:06:06,527
Or if they become less price sensitive and they're willing to pay higher prices.

184
00:06:06,527 --> 00:06:08,292
Or if different things

185
00:06:08,292 --> 00:06:10,451
become more important to users,

186
00:06:10,451 --> 00:06:11,496
if you start to have new

187
00:06:11,496 --> 00:06:12,587
types of users coming to your website.

188
00:06:12,587 --> 00:06:14,933
This sort of online learning algorithm

189
00:06:14,933 --> 00:06:17,278
can also adapt to changing

190
00:06:17,278 --> 00:06:18,950
user preferences and kind

191
00:06:18,950 --> 00:06:20,157
of keep track of what your

192
00:06:20,157 --> 00:06:21,991
changing population of users

193
00:06:21,991 --> 00:06:24,685
may be willing to pay for.

194
00:06:24,685 --> 00:06:26,171
And it does that because if

195
00:06:26,171 --> 00:06:28,168
your pool of users changes,

196
00:06:28,168 --> 00:06:29,793
then these updates to your

197
00:06:29,793 --> 00:06:31,953
parameters theta will just slowly adapt

198
00:06:31,953 --> 00:06:33,555
your parameters to whatever your

199
00:06:33,555 --> 00:06:36,599
latest pool of users looks like.

200
00:06:36,599 --> 00:06:37,781
Here's another example of a

201
00:06:37,781 --> 00:06:40,753
sort of application to which you might apply online learning.

202
00:06:40,753 --> 00:06:43,472
This is an application in product

203
00:06:43,472 --> 00:06:44,701
search in which we want to

204
00:06:44,701 --> 00:06:46,117
apply a learning algorithm to learn

205
00:06:46,117 --> 00:06:48,973
to give good search listings to a user.

206
00:06:48,973 --> 00:06:51,156
Let's say you run an online

207
00:06:51,156 --> 00:06:53,083
store that sells phones - that

208
00:06:53,083 --> 00:06:55,312
sells mobile phones or sells cell phones.

209
00:06:55,312 --> 00:06:56,682
And you have a user interface

210
00:06:56,682 --> 00:06:58,284
where a user can come to

211
00:06:58,284 --> 00:06:59,445
your website and type in the

212
00:06:59,445 --> 00:07:02,626
query like "Android phone 1080p camera".

213
00:07:02,626 --> 00:07:03,509
So 1080p is a type

214
00:07:03,509 --> 00:07:04,623
of a specification for a

215
00:07:04,623 --> 00:07:05,808
video camera that you might

216
00:07:05,808 --> 00:07:08,710
have on a phone, a cell phone, a mobile phone.

217
00:07:08,710 --> 00:07:12,100
Suppose we have a hundred phones in our store.

218
00:07:12,100 --> 00:07:13,354
And because of the way our

219
00:07:13,354 --> 00:07:15,321
website is laid out, when

220
00:07:15,321 --> 00:07:16,558
a user types in a query,

221
00:07:16,558 --> 00:07:18,277
if it was a search query, we

222
00:07:18,277 --> 00:07:19,601
would like to find a

223
00:07:19,601 --> 00:07:20,900
choice of ten different phones to

224
00:07:20,900 --> 00:07:22,921
show, to offer to the user.

225
00:07:22,921 --> 00:07:24,987
What we'd like to do is have

226
00:07:24,987 --> 00:07:26,566
a learning algorithm help us figure

227
00:07:26,566 --> 00:07:28,447
out what are the ten phones

228
00:07:28,447 --> 00:07:29,771
out of the 100 we

229
00:07:29,771 --> 00:07:31,791
should return to the user in response to

230
00:07:31,791 --> 00:07:34,531
a user-search query like the one here.

231
00:07:34,531 --> 00:07:36,695
Here's how we can go about the problem.

232
00:07:37,218 --> 00:07:39,291
For each phone and given

233
00:07:39,291 --> 00:07:41,311
a specific user query, we

234
00:07:41,311 --> 00:07:44,120
can construct a feature vector

235
00:07:44,120 --> 00:07:45,676
X. So the feature

236
00:07:45,676 --> 00:07:47,650
vector X might capture different properties of the phone.

237
00:07:47,650 --> 00:07:49,972
It might capture things like,

238
00:07:49,972 --> 00:07:53,107
how similar the user search query is to the phone.

239
00:07:53,107 --> 00:07:54,059
We capture things like how many

240
00:07:54,059 --> 00:07:55,475
words in the user search

241
00:07:55,475 --> 00:07:56,172
query match the name of

242
00:07:56,172 --> 00:07:57,356
the phone, how many words

243
00:07:57,356 --> 00:08:01,303
in the user search query match the description of the phone and so on.

244
00:08:01,303 --> 00:08:02,789
So the features x capture

245
00:08:02,789 --> 00:08:03,672
properties of the phone and

246
00:08:03,672 --> 00:08:05,251
it captures things about how

247
00:08:05,251 --> 00:08:06,412
similar or how well

248
00:08:06,412 --> 00:08:10,591
the phone matches the user query along different dimensions.

249
00:08:10,591 --> 00:08:11,868
What we'd like to do is

250
00:08:11,868 --> 00:08:14,330
estimate the probability that a

251
00:08:14,330 --> 00:08:15,816
user will click on the

252
00:08:15,816 --> 00:08:17,673
link for a specific phone,

253
00:08:17,673 --> 00:08:18,881
because we want to show

254
00:08:18,881 --> 00:08:20,065
the user phones that they

255
00:08:20,065 --> 00:08:21,481
are likely to want to

256
00:08:21,481 --> 00:08:22,921
buy, want to show the user

257
00:08:22,921 --> 00:08:24,082
phones that they have high

258
00:08:24,082 --> 00:08:27,240
probability of clicking on in the web browser.

259
00:08:27,240 --> 00:08:29,562
So I'm going to define y equals

260
00:08:29,562 --> 00:08:30,676
one if the user clicks on

261
00:08:30,676 --> 00:08:31,930
the link for a phone and

262
00:08:31,930 --> 00:08:34,136
y equals zero otherwise and

263
00:08:34,136 --> 00:08:35,454
what I would like to do is

264
00:08:35,454 --> 00:08:36,992
learn the probability the user

265
00:08:36,992 --> 00:08:38,246
will click on a specific

266
00:08:38,246 --> 00:08:39,802
phone given, you know,

267
00:08:39,802 --> 00:08:41,693
the features x, which capture properties

268
00:08:41,693 --> 00:08:43,819
of the phone and how well the query matches the phone.

269
00:08:43,819 --> 00:08:45,700
To give this problem a name

270
00:08:45,700 --> 00:08:47,720
in the language of

271
00:08:47,720 --> 00:08:49,130
people that run websites like

272
00:08:49,130 --> 00:08:51,249
this, the problem of learning this is

273
00:08:51,249 --> 00:08:53,223
actually called the problem of

274
00:08:53,223 --> 00:08:57,296
learning the predicted click-through rate, the predicted CTR.

275
00:08:57,296 --> 00:08:58,796
It just means learning the probability

276
00:08:58,796 --> 00:09:00,491
that the user will click on

277
00:09:00,491 --> 00:09:01,698
the specific link that you

278
00:09:01,698 --> 00:09:03,022
offer them, so CTR is

279
00:09:03,022 --> 00:09:06,528
an abbreviation for click-through rate.

280
00:09:06,528 --> 00:09:07,550
And if you can estimate the

281
00:09:07,550 --> 00:09:09,245
predicted click-through rate for any

282
00:09:09,245 --> 00:09:10,847
particular phone, what we

283
00:09:10,847 --> 00:09:12,171
can do is use this to

284
00:09:12,171 --> 00:09:13,819
show the user the ten phones

285
00:09:13,819 --> 00:09:15,770
that they are most likely to click on,

286
00:09:15,770 --> 00:09:17,441
because out of the hundred phones,

287
00:09:17,441 --> 00:09:20,553
we can compute this for

288
00:09:20,553 --> 00:09:21,737
each of the 100 phones and

289
00:09:21,737 --> 00:09:22,759
just select the 10 phones

290
00:09:22,759 --> 00:09:25,754
that the user is most likely to click on,

291
00:09:25,754 --> 00:09:26,892
and this will be a pretty reasonable

292
00:09:26,892 --> 00:09:29,818
way to decide what ten results to show to the user.

293
00:09:29,818 --> 00:09:32,186
Just to be clear, suppose that

294
00:09:32,186 --> 00:09:33,440
every time a user does

295
00:09:33,440 --> 00:09:35,576
a search, we return ten results;

296
00:09:35,576 --> 00:09:37,225
what that will do is it

297
00:09:37,225 --> 00:09:38,990
will actually give us ten

298
00:09:38,990 --> 00:09:40,870
x,y pairs, this actually

299
00:09:40,870 --> 00:09:43,332
gives us ten training examples every

300
00:09:43,332 --> 00:09:44,640
time a user comes to

301
00:09:44,640 --> 00:09:46,257
our website because, for

302
00:09:46,257 --> 00:09:47,535
the ten phones that we chose

303
00:09:47,535 --> 00:09:48,881
to show the user, for each

304
00:09:48,881 --> 00:09:49,896
of those 10 phones we get

305
00:09:49,896 --> 00:09:51,389
a feature vector X, and

306
00:09:51,389 --> 00:09:52,737
for each of those 10 phones we

307
00:09:52,737 --> 00:09:54,563
show the user we will also

308
00:09:54,563 --> 00:09:56,172
get a value for y, we

309
00:09:56,172 --> 00:09:57,542
will also observe the value

310
00:09:57,542 --> 00:09:59,517
of y, depending on whether

311
00:09:59,517 --> 00:10:00,925
or not they clicked on that

312
00:10:00,925 --> 00:10:02,465
URL, and

313
00:10:02,465 --> 00:10:03,696
so, one way to run a

314
00:10:03,696 --> 00:10:04,903
website like this would be to

315
00:10:04,903 --> 00:10:06,830
continuously show the user,

316
00:10:06,830 --> 00:10:08,363
you know, your ten best guesses for

317
00:10:08,363 --> 00:10:09,895
what other phones they might like

318
00:10:09,895 --> 00:10:11,428
and so, each time a user

319
00:10:11,428 --> 00:10:12,728
comes you would get ten

320
00:10:12,728 --> 00:10:14,493
examples, ten x,y pairs,

321
00:10:14,493 --> 00:10:16,304
and then use an online

322
00:10:16,304 --> 00:10:17,953
learning algorithm to update the

323
00:10:17,953 --> 00:10:20,182
parameters using essentially 10

324
00:10:20,182 --> 00:10:21,691
steps of gradient descent on these

325
00:10:21,691 --> 00:10:23,386
10 examples, and then

326
00:10:23,386 --> 00:10:25,081
you can throw the data away, and

327
00:10:25,081 --> 00:10:26,590
if you really have a continuous

328
00:10:26,590 --> 00:10:27,891
stream of users coming to

329
00:10:27,891 --> 00:10:29,354
your website, this would be

330
00:10:29,354 --> 00:10:31,095
a pretty reasonable way to learn

331
00:10:31,095 --> 00:10:32,395
parameters for your algorithm

332
00:10:32,395 --> 00:10:33,835
so as to show the ten phones

333
00:10:33,835 --> 00:10:35,669
to your users that may

334
00:10:35,669 --> 00:10:39,013
be most promising and that they are most likely to click on.

335
00:10:39,013 --> 00:10:40,151
So, this is a product search

336
00:10:40,151 --> 00:10:41,498
problem or learning to rank

337
00:10:41,498 --> 00:10:44,214
phones, learning to search for phones example.

338
00:10:44,214 --> 00:10:46,422
So, I'll quickly mention a few others.

339
00:10:46,422 --> 00:10:47,372
One is, if you have

340
00:10:47,372 --> 00:10:48,231
a website and you're trying to

341
00:10:48,231 --> 00:10:49,439
decide, you know, what special

342
00:10:49,439 --> 00:10:50,321
offer to show the user,

343
00:10:50,321 --> 00:10:53,154
this is very similar to phones,

344
00:10:53,154 --> 00:10:54,710
or if you have a

345
00:10:54,710 --> 00:10:58,216
website and you show different users different news articles.

346
00:10:58,216 --> 00:10:59,911
So, if you're a news aggregator

347
00:10:59,911 --> 00:11:01,374
website, then you can

348
00:11:01,374 --> 00:11:02,303
again use a similar system to

349
00:11:02,303 --> 00:11:03,882
select, to show to

350
00:11:03,882 --> 00:11:05,554
the user, you know, what

351
00:11:05,554 --> 00:11:06,877
are the news articles that they

352
00:11:06,877 --> 00:11:08,154
are most likely to be interested

353
00:11:08,154 --> 00:11:11,103
in and what are the news articles that they are most likely to click on.

354
00:11:11,103 --> 00:11:13,495
Closely related to special offers will be product recommendations.

355
00:11:13,495 --> 00:11:15,097
And in fact, if you have

356
00:11:15,097 --> 00:11:17,953
a collaborative filtering system, you

357
00:11:17,953 --> 00:11:20,693
can even imagine a collaborative filtering

358
00:11:20,693 --> 00:11:22,643
system giving you additional

359
00:11:22,643 --> 00:11:23,897
features to feed into a

360
00:11:23,897 --> 00:11:25,732
logistic regression classifier to try

361
00:11:25,732 --> 00:11:28,100
to predict the click-through

362
00:11:28,100 --> 00:11:29,981
rate for different products that you might recommend to a user.

363
00:11:29,981 --> 00:11:32,280
Of course, I should say that

364
00:11:32,280 --> 00:11:34,207
any of these problems could also

365
00:11:34,207 --> 00:11:35,600
have been formulated as a

366
00:11:35,600 --> 00:11:39,873
standard machine learning problem, where you have a fixed training set.

367
00:11:39,873 --> 00:11:40,894
Maybe, you can run your

368
00:11:40,894 --> 00:11:41,823
website for a few days and

369
00:11:41,823 --> 00:11:43,727
then save away a training set,

370
00:11:43,727 --> 00:11:44,842
a fixed training set, and run

371
00:11:44,842 --> 00:11:45,771
a learning algorithm on that.

372
00:11:45,771 --> 00:11:48,696
But these are the actual

373
00:11:48,696 --> 00:11:49,950
sorts of problems, where you do

374
00:11:49,950 --> 00:11:51,901
see large companies get so

375
00:11:51,901 --> 00:11:53,712
much data, that there's really

376
00:11:53,712 --> 00:11:55,221
maybe no need to save away

377
00:11:55,221 --> 00:11:56,963
a fixed training set, but instead

378
00:11:56,963 --> 00:11:59,563
you can use an online learning algorithm to just learn continuously

379
00:11:59,563 --> 00:12:04,091
from the data that users are generating on your website.

380
00:12:05,183 --> 00:12:07,249
So, that was the online

381
00:12:07,249 --> 00:12:08,990
learning setting and as we

382
00:12:08,990 --> 00:12:10,616
saw, the algorithm that we apply to

383
00:12:10,616 --> 00:12:12,357
it is really very similar

384
00:12:12,357 --> 00:12:13,867
to the stochastic gradient descent

385
00:12:13,867 --> 00:12:15,330
algorithm, only instead of

386
00:12:15,330 --> 00:12:16,871
scanning through a fixed

387
00:12:16,871 --> 00:12:18,000
training set, we're instead getting

388
00:12:18,000 --> 00:12:19,974
one example from a user,

389
00:12:19,974 --> 00:12:21,290
learning from that example, then

390
00:12:21,290 --> 00:12:22,644
discarding it and moving on.

391
00:12:22,644 --> 00:12:25,593
And if you have a continuous

392
00:12:25,593 --> 00:12:26,777
stream of data for some application,

393
00:12:26,777 --> 00:12:28,356
this sort of algorithm may be

394
00:12:28,356 --> 00:12:31,816
well worth considering for your application.

395
00:12:31,816 --> 00:12:33,952
And of course, one advantage of

396
00:12:33,952 --> 00:12:36,128
online learning is also that

397
00:12:36,128 --> 00:12:37,458
if you have a changing pool

398
00:12:37,458 --> 00:12:38,967
of users, or if the

399
00:12:38,967 --> 00:12:40,082
things you're trying to predict are

400
00:12:40,082 --> 00:12:42,032
slowly changing like your user

401
00:12:42,032 --> 00:12:43,751
taste is slowly changing, the online

402
00:12:43,751 --> 00:12:45,492
learning algorithm can slowly

403
00:12:45,492 --> 00:12:47,211
adapt your learned hypothesis to

404
00:12:47,211 --> 00:12:49,161
whatever the latest sets of

405
00:12:49,161 --> 99:59:59,000
user behaviors are like as well.