1 00:00:00,109 --> 00:00:02,030 In this video, I'd like to talk 2 00:00:02,030 --> 00:00:03,738 about a new large-scale 3 00:00:03,738 --> 00:00:05,369 machine learning setting called 4 00:00:05,369 --> 00:00:07,073 the online learning setting. 5 00:00:07,442 --> 00:00:08,731 The online learning setting 6 00:00:08,731 --> 00:00:10,659 allows us to model problems 7 00:00:10,659 --> 00:00:12,074 where we have a continuous flood 8 00:00:12,074 --> 00:00:14,064 or a continuous stream of data 9 00:00:14,064 --> 00:00:15,906 coming in and we would like 10 00:00:15,906 --> 00:00:17,839 an algorithm to learn from that. 11 00:00:18,762 --> 00:00:20,759 Today, many of the largest 12 00:00:20,759 --> 00:00:22,245 websites, or many of the largest 13 00:00:22,245 --> 00:00:24,335 website companies use different 14 00:00:24,335 --> 00:00:25,901 versions of online learning 15 00:00:25,901 --> 00:00:28,102 algorithms to learn from 16 00:00:28,117 --> 00:00:29,468 the flood of users that keep 17 00:00:29,468 --> 00:00:31,370 on coming to, back to the website. 18 00:00:31,370 --> 00:00:32,943 Specifically, if you have 19 00:00:32,943 --> 00:00:34,992 a continuous stream of data 20 00:00:34,992 --> 00:00:36,371 generated by a continuous 21 00:00:36,371 --> 00:00:37,703 stream of users coming to 22 00:00:37,703 --> 00:00:39,413 your website, what you can 23 00:00:39,413 --> 00:00:40,844 do is sometimes use an 24 00:00:40,844 --> 00:00:42,632 online learning algorithm to learn 25 00:00:42,632 --> 00:00:44,492 user preferences from the 26 00:00:44,492 --> 00:00:46,324 stream of data and use that 27 00:00:46,324 --> 00:00:47,470 to optimize some of the 28 00:00:47,470 --> 00:00:49,632 decisions on your website. 
29 00:00:52,063 --> 00:00:54,506 Suppose you run a shipping service, 30 00:00:54,506 --> 00:00:56,163 so, you know, users come and ask 31 00:00:56,163 --> 00:00:57,307 you to help ship their package from 32 00:00:57,307 --> 00:01:01,533 location A to location B and suppose 33 00:01:01,533 --> 00:01:02,717 you run a website, where users 34 00:01:02,717 --> 00:01:04,110 repeatedly come and they 35 00:01:04,110 --> 00:01:05,689 tell you where they want 36 00:01:05,689 --> 00:01:07,291 to send the package from, and 37 00:01:07,291 --> 00:01:08,523 where they want to send it to 38 00:01:08,523 --> 00:01:10,947 (so the origin and destination) and 39 00:01:10,947 --> 00:01:12,748 your website offers to ship the package 40 00:01:12,748 --> 00:01:14,515 for some asking price, 41 00:01:14,515 --> 00:01:16,092 so I'll ship your package for $50, 42 00:01:16,092 --> 00:01:17,926 I'll ship it for $20. 43 00:01:17,926 --> 00:01:19,343 And based on the price 44 00:01:19,343 --> 00:01:20,922 that you offer to the users, 45 00:01:20,922 --> 00:01:23,522 the users sometimes chose to use a shipping service; 46 00:01:23,522 --> 00:01:25,891 that's a positive example and 47 00:01:25,891 --> 00:01:28,168 sometimes they go away and 48 00:01:28,168 --> 00:01:29,722 they do not choose to 49 00:01:29,722 --> 00:01:31,719 purchase your shipping service. 50 00:01:31,719 --> 00:01:34,552 So let's say that we want 51 00:01:34,552 --> 00:01:36,386 a learning algorithm to help us 52 00:01:36,386 --> 00:01:38,499 to optimize what is the asking 53 00:01:38,499 --> 00:01:41,680 price that we want to offer to our users. 54 00:01:41,680 --> 00:01:43,724 And specifically, let's say we 55 00:01:43,724 --> 00:01:44,908 come up with some sort of features 56 00:01:44,908 --> 00:01:46,510 that capture properties of the users. 
57 00:01:46,510 --> 00:01:49,376 If we know anything about the demographics, 58 00:01:49,376 --> 00:01:50,875 they capture, you know, the origin and 59 00:01:50,875 --> 00:01:54,405 destination of the package, where they want to ship the package. 60 00:01:54,405 --> 00:01:55,635 And what is the price 61 00:01:55,635 --> 00:01:57,911 that we offer to them for shipping the package. 62 00:01:57,911 --> 00:01:59,931 And what we want to do 63 00:01:59,931 --> 00:02:00,883 is learn what is the 64 00:02:00,883 --> 00:02:02,439 probability that they will 65 00:02:02,439 --> 00:02:03,762 elect to ship the 66 00:02:03,762 --> 00:02:05,457 package, using our 67 00:02:05,457 --> 00:02:07,315 shipping service given these features, and 68 00:02:07,315 --> 00:02:10,197 again just as a reminder these 69 00:02:10,197 --> 00:02:14,121 features X also capture the price that we're asking for. 70 00:02:14,121 --> 00:02:15,790 And so if we could 71 00:02:15,790 --> 00:02:17,486 estimate the chance that they'll 72 00:02:17,486 --> 00:02:19,629 agree to use our service 73 00:02:19,629 --> 00:02:20,962 for any given price, then we 74 00:02:20,962 --> 00:02:21,967 can try to pick 75 00:02:21,967 --> 00:02:23,183 a price so that they 76 00:02:23,183 --> 00:02:25,125 have a pretty high probability of 77 00:02:25,125 --> 00:02:27,841 choosing our website while simultaneously 78 00:02:27,841 --> 00:02:29,188 hopefully offering us a 79 00:02:29,188 --> 00:02:31,371 fair return, offering us 80 00:02:31,371 --> 00:02:34,293 a fair profit for shipping their package. 81 00:02:34,585 --> 00:02:36,489 So if we can learn this probability 82 00:02:36,489 --> 00:02:37,733 of y equals 1 given 83 00:02:37,733 --> 00:02:38,632 any price and given the other 84 00:02:38,632 --> 00:02:39,660 features we could really 85 00:02:39,660 --> 00:02:41,657 use this to choose appropriate 86 00:02:41,657 --> 00:02:44,072 prices as new users come to us. 
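As a rough sketch of how an estimated probability of y equals 1 could be used to choose a price: the expected-profit objective, the cost figure, the feature layout, and the parameter values below are all assumptions made for illustration, not something the lecture specifies.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def purchase_probability(theta, x):
    """Estimate p(y = 1 | x; theta) with a logistic regression model."""
    return sigmoid(sum(t * xj for t, xj in zip(theta, x)))


def pick_price(theta, other_features, candidate_prices, cost):
    """Return the candidate price with the highest expected profit.

    Expected profit is taken to be p(y = 1 | x; theta) * (price - cost).
    This objective is an assumption for illustration only.
    """
    best_price, best_profit = None, float("-inf")
    for price in candidate_prices:
        # Feature vector: bias term, other features, then the scaled price.
        x = [1.0] + list(other_features) + [price / 100.0]
        expected = purchase_probability(theta, x) * (price - cost)
        if expected > best_profit:
            best_price, best_profit = price, expected
    return best_price


# Hypothetical parameters: the negative weight on price means that
# higher asking prices are accepted less often.
theta = [2.0, 0.5, -10.0]
best = pick_price(theta, other_features=[1.0],
                  candidate_prices=[20, 30, 40, 50], cost=15)
```

With these made-up numbers the lowest price is accepted most often but earns little, and the highest price is rarely accepted, so an intermediate price wins.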
87 00:02:44,072 --> 00:02:45,907 So in order to model 88 00:02:45,907 --> 00:02:47,277 the probability of y equals 1, 89 00:02:47,277 --> 00:02:48,972 what we can do is use 90 00:02:48,972 --> 00:02:51,781 logistic regression or a neural 91 00:02:51,781 --> 00:02:53,756 network or some other algorithm like that. 92 00:02:53,756 --> 00:02:55,889 But let's start with logistic regression. 93 00:02:57,658 --> 00:02:59,583 Now if you have a 94 00:02:59,583 --> 00:03:01,835 website that just runs continuously, 95 00:03:01,835 --> 00:03:05,342 here's what an online learning algorithm would do. 96 00:03:05,342 --> 00:03:07,478 I'm gonna write repeat forever. 97 00:03:07,478 --> 00:03:09,730 This just means that our website 98 00:03:09,730 --> 00:03:11,170 is going to, you know, keep on 99 00:03:11,170 --> 00:03:12,911 staying up. 100 00:03:12,911 --> 00:03:14,351 What happens on the website is 101 00:03:14,351 --> 00:03:16,465 occasionally a user 102 00:03:16,465 --> 00:03:17,950 will come and for 103 00:03:17,950 --> 00:03:19,576 the user that comes we'll get 104 00:03:19,576 --> 00:03:25,380 some x,y pair corresponding to 105 00:03:25,380 --> 00:03:29,096 a customer or to a user on the website. 106 00:03:29,096 --> 00:03:30,884 So the features x are, you 107 00:03:30,884 --> 00:03:32,811 know, the origin and destination specified 108 00:03:32,811 --> 00:03:34,111 by this user and the price 109 00:03:34,111 --> 00:03:35,358 that we happened to offer to 110 00:03:35,358 --> 00:03:37,292 them this time around, and 111 00:03:37,292 --> 00:03:38,430 y is either one or 112 00:03:38,430 --> 00:03:40,148 zero depending on whether or 113 00:03:40,148 --> 00:03:41,518 not they chose to 114 00:03:41,518 --> 00:03:43,980 use our shipping service. 
115 00:03:43,980 --> 00:03:45,419 Now once we get this {x,y} 116 00:03:45,419 --> 00:03:46,813 pair, what an online 117 00:03:46,813 --> 00:03:48,391 learning algorithm does is then 118 00:03:48,391 --> 00:03:50,690 update the parameters theta 119 00:03:50,690 --> 00:03:54,011 using just this example 120 00:03:54,011 --> 00:03:57,726 x,y, and in particular 121 00:03:57,726 --> 00:03:59,839 we would update my parameters theta 122 00:03:59,839 --> 00:04:01,842 as Theta j gets updated as Theta j 123 00:04:01,842 --> 00:04:06,619 minus the learning rate alpha times 124 00:04:06,619 --> 00:04:11,356 my usual gradient descent 125 00:04:11,356 --> 00:04:13,399 rule for logistic regression. 126 00:04:13,399 --> 00:04:14,491 So we do this for j 127 00:04:14,491 --> 00:04:15,652 equals zero up to n, 128 00:04:15,652 --> 00:04:19,088 and that's my close curly brace. 129 00:04:19,088 --> 00:04:21,218 So, for other learning algorithms 130 00:04:21,218 --> 00:04:22,873 instead of writing X-Y, right, I 131 00:04:22,873 --> 00:04:24,011 was writing things like Xi, 132 00:04:24,011 --> 00:04:26,495 Yi but 133 00:04:26,495 --> 00:04:27,842 in this online learning setting 134 00:04:27,842 --> 00:04:29,723 we're actually discarding the notion 135 00:04:29,723 --> 00:04:31,464 of there being a fixed training 136 00:04:31,464 --> 00:04:32,904 set. Instead we have an algorithm 137 00:04:32,904 --> 00:04:34,924 where what happens is we get 138 00:04:34,924 --> 00:04:37,014 an example and then we 139 00:04:37,014 --> 00:04:38,825 learn using that example like 140 00:04:38,825 --> 00:04:41,031 so and then we throw that example away. 141 00:04:41,031 --> 00:04:43,098 We discard that example and we 142 00:04:43,098 --> 00:04:45,141 never use it again and 143 00:04:45,141 --> 00:04:47,161 so that's why we just look at one example at a time. 144 00:04:47,161 --> 00:04:48,879 We learn from that example. 145 00:04:48,879 --> 00:04:50,412 We discard it. 
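The repeat-forever loop described above can be sketched as follows. This is a minimal illustration assuming a logistic regression hypothesis, so the per-example rule is theta_j := theta_j - alpha * (h(x) - y) * x_j for j = 0 up to n. The function names, the default learning rate, and the finite example stream are assumptions for the sketch.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def online_update(theta, x, y, alpha):
    """One gradient step using only the single example (x, y).

    Implements theta_j := theta_j - alpha * (h(x) - y) * x_j for every j,
    where h(x) = sigmoid(theta . x) is the logistic regression hypothesis.
    """
    h = sigmoid(sum(t * xj for t, xj in zip(theta, x)))
    return [t - alpha * (h - y) * xj for t, xj in zip(theta, x)]


def learn_from_stream(stream, n_features, alpha=0.1):
    """Consume (x, y) pairs one at a time and never store them.

    On a live website this loop would run forever; here `stream` is any
    iterable so that the sketch terminates.
    """
    theta = [0.0] * n_features
    for x, y in stream:
        theta = online_update(theta, x, y, alpha)
        # The example is not saved anywhere: learn from it, then discard it.
    return theta


# Hypothetical stream: x = [bias, scaled asking price], y = 1 if the user bought.
examples = [([1.0, 0.2], 1), ([1.0, 0.5], 0), ([1.0, 0.3], 1)]
theta = learn_from_stream(iter(examples), n_features=2)
```

Because each example is consumed and dropped, memory use stays constant no matter how many users arrive.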
146 00:04:50,412 --> 00:04:51,527 Which is why, you know, we're 147 00:04:51,527 --> 00:04:52,943 also doing away with this 148 00:04:52,943 --> 00:04:54,615 notion of there being this 149 00:04:54,615 --> 00:04:58,191 sort of fixed training set indexed by i. 150 00:04:58,191 --> 00:04:59,328 And, if you really run 151 00:04:59,328 --> 00:05:01,488 a major website where you 152 00:05:01,488 --> 00:05:03,624 really have a continuous stream 153 00:05:03,624 --> 00:05:05,737 of users coming, then this 154 00:05:05,737 --> 00:05:07,525 sort of online learning algorithm 155 00:05:07,525 --> 00:05:10,358 is actually a pretty reasonable algorithm. 156 00:05:10,358 --> 00:05:12,076 Because if data is essentially 157 00:05:12,076 --> 00:05:13,330 free, if you have so 158 00:05:13,330 --> 00:05:14,979 much data that data 159 00:05:14,979 --> 00:05:17,022 is essentially unlimited, then there 160 00:05:17,022 --> 00:05:17,997 really may be no 161 00:05:17,997 --> 00:05:18,949 need to look at a 162 00:05:18,949 --> 00:05:21,527 training example more than once. 163 00:05:21,527 --> 00:05:22,432 Of course if we had only 164 00:05:22,432 --> 00:05:24,220 a small number of users then 165 00:05:24,220 --> 00:05:26,333 rather than using an online learning 166 00:05:26,333 --> 00:05:27,912 algorithm like this, you might 167 00:05:27,912 --> 00:05:29,421 be better off saving away all 168 00:05:29,421 --> 00:05:30,884 your data in a fixed training 169 00:05:30,884 --> 00:05:34,042 set and then running some algorithm over that training set. 170 00:05:34,042 --> 00:05:35,018 But if you really have a continuous 171 00:05:35,018 --> 00:05:36,341 stream of data, then an 172 00:05:36,341 --> 00:05:39,881 online learning algorithm can be very effective. 
173 00:05:39,881 --> 00:05:41,171 I should mention also that one 174 00:05:41,171 --> 00:05:43,015 interesting effect of this sort 175 00:05:43,015 --> 00:05:44,073 of online learning algorithm is 176 00:05:44,073 --> 00:05:49,391 that it can adapt to changing user preferences. 177 00:05:51,006 --> 00:05:54,592 And in particular, if over 178 00:05:54,592 --> 00:05:55,776 time because of changes in 179 00:05:55,776 --> 00:05:58,377 the economy maybe users 180 00:05:58,377 --> 00:05:59,957 start to become more price 181 00:05:59,957 --> 00:06:01,395 sensitive and willing to pay, 182 00:06:01,395 --> 00:06:03,717 you know, less willing to pay high prices. 183 00:06:03,717 --> 00:06:06,527 Or if they become less price sensitive and they're willing to pay higher prices. 184 00:06:06,527 --> 00:06:08,292 Or if different things 185 00:06:08,292 --> 00:06:10,451 become more important to users, 186 00:06:10,451 --> 00:06:11,496 if you start to have new 187 00:06:11,496 --> 00:06:12,587 types of users coming to your website. 188 00:06:12,587 --> 00:06:14,933 This sort of online learning algorithm 189 00:06:14,933 --> 00:06:17,278 can also adapt to changing 190 00:06:17,278 --> 00:06:18,950 user preferences and kind 191 00:06:18,950 --> 00:06:20,157 of keep track of what your 192 00:06:20,157 --> 00:06:21,991 changing population of users 193 00:06:21,991 --> 00:06:24,685 may be willing to pay for. 194 00:06:24,685 --> 00:06:26,171 And it does that because if 195 00:06:26,171 --> 00:06:28,168 your pool of users changes, 196 00:06:28,168 --> 00:06:29,793 then these updates to your 197 00:06:29,793 --> 00:06:31,953 parameters theta will just slowly adapt 198 00:06:31,953 --> 00:06:33,555 your parameters to whatever your 199 00:06:33,555 --> 00:06:36,599 latest pool of users looks like. 200 00:06:36,599 --> 00:06:37,781 Here's another example of a 201 00:06:37,781 --> 00:06:40,753 sort of application to which you might apply online learning. 
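To make the adaptation to changing user preferences concrete, here is a small deterministic sketch. The two phases, the single repeated feature vector, and the learning rate are all invented for illustration. In the first phase users accept a fairly high price, and in the second phase they reject the same price.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(theta, x):
    return sigmoid(sum(t * xj for t, xj in zip(theta, x)))


def online_update(theta, x, y, alpha):
    """Single-example logistic regression gradient step."""
    h = predict(theta, x)
    return [t - alpha * (h - y) * xj for t, xj in zip(theta, x)]


x_high_price = [1.0, 0.8]  # bias term and a scaled asking price (hypothetical)
theta = [0.0, 0.0]

# Phase 1: users are not very price sensitive and accept this price.
for _ in range(200):
    theta = online_update(theta, x_high_price, 1, alpha=0.5)
p_before = predict(theta, x_high_price)

# Phase 2: the user pool changes and the same price is now rejected.
for _ in range(200):
    theta = online_update(theta, x_high_price, 0, alpha=0.5)
p_after = predict(theta, x_high_price)
```

No retraining step is needed: the same update rule that learned the old behaviour pulls the parameters toward the new one as fresh examples arrive.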
202 00:06:40,753 --> 00:06:43,472 this is an application in product 203 00:06:43,472 --> 00:06:44,701 search in which we want to 204 00:06:44,701 --> 00:06:46,117 apply learning algorithm to learn 205 00:06:46,117 --> 00:06:48,973 to give good search listings to a user. 206 00:06:48,973 --> 00:06:51,156 Let's say you run an online 207 00:06:51,156 --> 00:06:53,083 store that sells phones - that 208 00:06:53,083 --> 00:06:55,312 sells mobile phones or sells cell phones. 209 00:06:55,312 --> 00:06:56,682 And you have a user interface 210 00:06:56,682 --> 00:06:58,284 where a user can come to 211 00:06:58,284 --> 00:06:59,445 your website and type in the 212 00:06:59,445 --> 00:07:02,626 query like "Android phone 1080p camera". 213 00:07:02,626 --> 00:07:03,509 So 1080p is a type 214 00:07:03,509 --> 00:07:04,623 of a specification for a 215 00:07:04,623 --> 00:07:05,808 video camera that you might 216 00:07:05,808 --> 00:07:08,710 have on a phone, a cell phone, a mobile phone. 217 00:07:08,710 --> 00:07:12,100 Suppose, suppose we have a hundred phones in our store. 218 00:07:12,100 --> 00:07:13,354 And because of the way our 219 00:07:13,354 --> 00:07:15,321 website is laid out, when 220 00:07:15,321 --> 00:07:16,558 a user types in a query, 221 00:07:16,558 --> 00:07:18,277 if it was a search query, we 222 00:07:18,277 --> 00:07:19,601 would like to find a 223 00:07:19,601 --> 00:07:20,900 choice of ten different phones to 224 00:07:20,900 --> 00:07:22,921 show what to offer to the user. 225 00:07:22,921 --> 00:07:24,987 What we'd like to do is have 226 00:07:24,987 --> 00:07:26,566 a learning algorithm help us figure 227 00:07:26,566 --> 00:07:28,447 out what are the ten phones 228 00:07:28,447 --> 00:07:29,771 out of the 100 we 229 00:07:29,771 --> 00:07:31,791 should return the user in response to 230 00:07:31,791 --> 00:07:34,531 a user-search query like the one here. 231 00:07:34,531 --> 00:07:36,695 Here's how we can go about the problem. 
232 00:07:37,218 --> 00:07:39,291 For each phone and given 233 00:07:39,291 --> 00:07:41,311 a specific user query, we 234 00:07:41,311 --> 00:07:44,120 can construct a feature vector 235 00:07:44,120 --> 00:07:45,676 X. So the feature 236 00:07:45,676 --> 00:07:47,650 vector X might capture different properties of the phone. 237 00:07:47,650 --> 00:07:49,972 It might capture things like, 238 00:07:49,972 --> 00:07:53,107 how similar the user search query is to the phones. 239 00:07:53,107 --> 00:07:54,059 We capture things like how many 240 00:07:54,059 --> 00:07:55,475 words in the user search 241 00:07:55,475 --> 00:07:56,172 query match the name of 242 00:07:56,172 --> 00:07:57,356 the phone, how many words 243 00:07:57,356 --> 00:08:01,303 in the user search query match the description of the phone and so on. 244 00:08:01,303 --> 00:08:02,789 So the features x capture 245 00:08:02,789 --> 00:08:03,672 properties of the phone and 246 00:08:03,672 --> 00:08:05,251 it captures things about how 247 00:08:05,251 --> 00:08:06,412 similar or how well 248 00:08:06,412 --> 00:08:10,591 the phone matches the user query along different dimensions. 249 00:08:10,591 --> 00:08:11,868 What we'd like to do is 250 00:08:11,868 --> 00:08:14,330 estimate the probability that a 251 00:08:14,330 --> 00:08:15,816 user will click on the 252 00:08:15,816 --> 00:08:17,673 link for a specific phone, 253 00:08:17,673 --> 00:08:18,881 because we want to show 254 00:08:18,881 --> 00:08:20,065 the user phones that they 255 00:08:20,065 --> 00:08:21,481 are likely to want to 256 00:08:21,481 --> 00:08:22,921 buy, want to show the user 257 00:08:22,921 --> 00:08:24,082 phones that they have high 258 00:08:24,082 --> 00:08:27,240 probability of clicking on in the web browser. 
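One possible way to build such a feature vector for a query and a phone is sketched below. The dictionary layout of a phone record, the whitespace tokenizer, the price scaling, and the example phone itself are assumptions for illustration; a real system would likely use richer features.

```python
def tokenize(text):
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


def query_phone_features(query, phone):
    """Build x for one (query, phone) pair.

    Features: bias term, number of query words found in the phone's name,
    number of query words found in its description, and a scaled price
    standing in for "properties of the phone".
    """
    query_words = tokenize(query)
    name_words = set(tokenize(phone["name"]))
    description_words = set(tokenize(phone["description"]))
    name_matches = sum(1 for w in query_words if w in name_words)
    description_matches = sum(1 for w in query_words if w in description_words)
    return [1.0, float(name_matches), float(description_matches),
            phone["price"] / 1000.0]


# Hypothetical catalog entry.
phone = {
    "name": "Acme Android Phone",
    "description": "android phone with 1080p video camera",
    "price": 300,
}
x = query_phone_features("Android phone 1080p camera", phone)
```

Here two query words match the name and four match the description, so the vector records how well the phone fits the query along two separate dimensions.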
259 00:08:27,240 --> 00:08:29,562 So I'm going to define y equals 260 00:08:29,562 --> 00:08:30,676 one if the user clicks on 261 00:08:30,676 --> 00:08:31,930 the link for a phone and 262 00:08:31,930 --> 00:08:34,136 y equals zero otherwise and 263 00:08:34,136 --> 00:08:35,454 what I would like to do is 264 00:08:35,454 --> 00:08:36,992 learn the probability the user 265 00:08:36,992 --> 00:08:38,246 will click on a specific 266 00:08:38,246 --> 00:08:39,802 phone given, you know, 267 00:08:39,802 --> 00:08:41,693 the features x, which capture properties 268 00:08:41,693 --> 00:08:43,819 of the phone and how well the query matches the phone. 269 00:08:43,819 --> 00:08:45,700 To give this problem a name 270 00:08:45,700 --> 00:08:47,720 in the language of 271 00:08:47,720 --> 00:08:49,130 people that run websites like 272 00:08:49,130 --> 00:08:51,249 this, the problem of learning this is 273 00:08:51,249 --> 00:08:53,223 actually called the problem of 274 00:08:53,223 --> 00:08:57,296 learning the predicted click-through rate, the predicted CTR. 275 00:08:57,296 --> 00:08:58,796 It just means learning the probability 276 00:08:58,796 --> 00:09:00,491 that the user will click on 277 00:09:00,491 --> 00:09:01,698 the specific link that you 278 00:09:01,698 --> 00:09:03,022 offer them, so CTR is 279 00:09:03,022 --> 00:09:06,528 an abbreviation for click through rate. 
280 00:09:06,528 --> 00:09:07,550 And if you can estimate the 281 00:09:07,550 --> 00:09:09,245 predicted click-through rate for any 282 00:09:09,245 --> 00:09:10,847 particular phone, what we 283 00:09:10,847 --> 00:09:12,171 can do is use this to 284 00:09:12,171 --> 00:09:13,819 show the user the ten phones 285 00:09:13,819 --> 00:09:15,770 that they are most likely to click on, 286 00:09:15,770 --> 00:09:17,441 because out of the hundred phones, 287 00:09:17,441 --> 00:09:20,553 we can compute this for 288 00:09:20,553 --> 00:09:21,737 each of the 100 phones and 289 00:09:21,737 --> 00:09:22,759 just select the 10 phones 290 00:09:22,759 --> 00:09:25,754 that the user is most likely to click on, 291 00:09:25,754 --> 00:09:26,892 and this will be a pretty reasonable 292 00:09:26,892 --> 00:09:29,818 way to decide what ten results to show to the user. 293 00:09:29,818 --> 00:09:32,186 Just to be clear, suppose that 294 00:09:32,186 --> 00:09:33,440 every time a user does 295 00:09:33,440 --> 00:09:35,576 a search, we return ten results. 296 00:09:35,576 --> 00:09:37,225 What that will do is it 297 00:09:37,225 --> 00:09:38,990 will actually give us ten 298 00:09:38,990 --> 00:09:40,870 x,y pairs, this actually 299 00:09:40,870 --> 00:09:43,332 gives us ten training examples every 300 00:09:43,332 --> 00:09:44,640 time a user comes to 301 00:09:44,640 --> 00:09:46,257 our website because for 302 00:09:46,257 --> 00:09:47,535 the ten phones that we chose 303 00:09:47,535 --> 00:09:48,881 to show the user, for each 304 00:09:48,881 --> 00:09:49,896 of those 10 phones we get 305 00:09:49,896 --> 00:09:51,389 a feature vector X, and 306 00:09:51,389 --> 00:09:52,737 for each of those 10 phones we 307 00:09:52,737 --> 00:09:54,563 show the user we will also 308 00:09:54,563 --> 00:09:56,172 get a value for y, we 309 00:09:56,172 --> 00:09:57,542 will also observe the value 310 00:09:57,542 --> 00:09:59,517 of y, depending on whether 311 00:09:59,517 --> 00:10:00,925 
or not they clicked on that 312 00:10:00,925 --> 00:10:02,465 URL, and 313 00:10:02,465 --> 00:10:03,696 so, one way to run a 314 00:10:03,696 --> 00:10:04,903 website like this would be to 315 00:10:04,903 --> 00:10:06,830 continuously show the user, 316 00:10:06,830 --> 00:10:08,363 you know, your ten best guesses for 317 00:10:08,363 --> 00:10:09,895 what are the phones they might like 318 00:10:09,895 --> 00:10:11,428 and so, each time a user 319 00:10:11,428 --> 00:10:12,728 comes you would get ten 320 00:10:12,728 --> 00:10:14,493 examples, ten x,y pairs, 321 00:10:14,493 --> 00:10:16,304 and then use an online 322 00:10:16,304 --> 00:10:17,953 learning algorithm to update the 323 00:10:17,953 --> 00:10:20,182 parameters using essentially 10 324 00:10:20,182 --> 00:10:21,691 steps of gradient descent on these 325 00:10:21,691 --> 00:10:23,386 10 examples, and then 326 00:10:23,386 --> 00:10:25,081 you can throw the data away, and 327 00:10:25,081 --> 00:10:26,590 if you really have a continuous 328 00:10:26,590 --> 00:10:27,891 stream of users coming to 329 00:10:27,891 --> 00:10:29,354 your website, this would be 330 00:10:29,354 --> 00:10:31,095 a pretty reasonable way to learn 331 00:10:31,095 --> 00:10:32,395 parameters for your algorithm 332 00:10:32,395 --> 00:10:33,835 so as to show the ten phones 333 00:10:33,835 --> 00:10:35,669 to your users that may 334 00:10:35,669 --> 00:10:39,013 be most promising and that they are most likely to click on. 335 00:10:39,013 --> 00:10:40,151 So, this is a product search 336 00:10:40,151 --> 00:10:41,498 problem or learning to rank 337 00:10:41,498 --> 00:10:44,214 phones, learning to search for phones example. 338 00:10:44,214 --> 00:10:46,422 So, I'll quickly mention a few others. 
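The product search loop just described (score all 100 phones, show the ten with the highest predicted click-through rate, then take ten gradient steps on the ten observed x,y pairs) might look roughly like this. The catalog, feature vectors, starting parameters, and the simulated click are all made up for the sketch.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predicted_ctr(theta, x):
    """Estimate p(y = 1 | x; theta), the predicted click-through rate."""
    return sigmoid(sum(t * xj for t, xj in zip(theta, x)))


def online_update(theta, x, y, alpha):
    """Single-example logistic regression gradient step."""
    h = predicted_ctr(theta, x)
    return [t - alpha * (h - y) * xj for t, xj in zip(theta, x)]


def top_k(theta, candidates, k=10):
    """Return the k (phone_id, x) pairs with the highest predicted CTR."""
    ranked = sorted(candidates,
                    key=lambda item: predicted_ctr(theta, item[1]),
                    reverse=True)
    return ranked[:k]


def update_from_results(theta, shown, clicked_ids, alpha=0.1):
    """Take one gradient step per shown phone, then keep nothing."""
    for phone_id, x in shown:
        y = 1 if phone_id in clicked_ids else 0
        theta = online_update(theta, x, y, alpha)
    return theta


# Hypothetical catalog of 100 phones: x = [bias, query-match score].
catalog = [(i, [1.0, i / 100.0]) for i in range(100)]
theta = [0.0, 1.0]

shown = top_k(theta, catalog)           # the ten results for this search
clicked = {shown[0][0]}                 # pretend the user clicked the first one
theta_after = update_from_results(theta, shown, clicked)
```

Each search therefore yields ten training examples, most of them with y equals 0, and the shown results can be discarded once the ten updates are done.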
339 00:10:46,422 --> 00:10:47,372 One is, if you have 340 00:10:47,372 --> 00:10:48,231 a website and you're trying to 341 00:10:48,231 --> 00:10:49,439 decide, you know, what special 342 00:10:49,439 --> 00:10:50,321 offer to show the user, 343 00:10:50,321 --> 00:10:53,154 this is very similar to phones, 344 00:10:53,154 --> 00:10:54,710 or if you have a 345 00:10:54,710 --> 00:10:58,216 website and you show different users different news articles. 346 00:10:58,216 --> 00:10:59,911 So, if you're a news aggregator 347 00:10:59,911 --> 00:11:01,374 website, then you can 348 00:11:01,374 --> 00:11:02,303 again use a similar system to 349 00:11:02,303 --> 00:11:03,882 select, to show to 350 00:11:03,882 --> 00:11:05,554 the user, you know, what 351 00:11:05,554 --> 00:11:06,877 are the news articles that they 352 00:11:06,877 --> 00:11:08,154 are most likely to be interested 353 00:11:08,154 --> 00:11:11,103 in and what are the news articles that they are most likely to click on. 354 00:11:11,103 --> 00:11:13,495 Closely related to special offers will be product recommendations. 355 00:11:13,495 --> 00:11:15,097 And in fact, if you have 356 00:11:15,097 --> 00:11:17,953 a collaborative filtering system, you 357 00:11:17,953 --> 00:11:20,693 can even imagine a collaborative filtering 358 00:11:20,693 --> 00:11:22,643 system giving you additional 359 00:11:22,643 --> 00:11:23,897 features to feed into a 360 00:11:23,897 --> 00:11:25,732 logistic regression classifier to try 361 00:11:25,732 --> 00:11:28,100 to predict the click through 362 00:11:28,100 --> 00:11:29,981 rate for different products that you might recommend to a user. 363 00:11:29,981 --> 00:11:32,280 Of course, I should say that 364 00:11:32,280 --> 00:11:34,207 any of these problems could also 365 00:11:34,207 --> 00:11:35,600 have been formulated as a 366 00:11:35,600 --> 00:11:39,873 standard machine learning problem, where you have a fixed training set. 
367 00:11:39,873 --> 00:11:40,894 Maybe, you can run your 368 00:11:40,894 --> 00:11:41,823 website for a few days and 369 00:11:41,823 --> 00:11:43,727 then save away a training set, 370 00:11:43,727 --> 00:11:44,842 a fixed training set, and run 371 00:11:44,842 --> 00:11:45,771 a learning algorithm on that. 372 00:11:45,771 --> 00:11:48,696 But these are the actual 373 00:11:48,696 --> 00:11:49,950 sorts of problems, where you do 374 00:11:49,950 --> 00:11:51,901 see large companies get so 375 00:11:51,901 --> 00:11:53,712 much data, that there's really 376 00:11:53,712 --> 00:11:55,221 maybe no need to save away 377 00:11:55,221 --> 00:11:56,963 a fixed training set, but instead 378 00:11:56,963 --> 00:11:59,563 you can use an online learning algorithm to just learn continuously 379 00:11:59,563 --> 00:12:04,091 from the data that users are generating on your website. 380 00:12:05,183 --> 00:12:07,249 So, that was the online 381 00:12:07,249 --> 00:12:08,990 learning setting and as we 382 00:12:08,990 --> 00:12:10,616 saw, the algorithm that we apply to 383 00:12:10,616 --> 00:12:12,357 it is really very similar 384 00:12:12,357 --> 00:12:13,867 to the stochastic gradient descent 385 00:12:13,867 --> 00:12:15,330 algorithm, only instead of 386 00:12:15,330 --> 00:12:16,871 scanning through a fixed 387 00:12:16,871 --> 00:12:18,000 training set, we're instead getting 388 00:12:18,000 --> 00:12:19,974 one example from a user, 389 00:12:19,974 --> 00:12:21,290 learning from that example, then 390 00:12:21,290 --> 00:12:22,644 discarding it and moving on. 391 00:12:22,644 --> 00:12:25,593 And if you have a continuous 392 00:12:25,593 --> 00:12:26,777 stream of data for some application, 393 00:12:26,777 --> 00:12:28,356 this sort of algorithm may be 394 00:12:28,356 --> 00:12:31,816 well worth considering for your application. 
395 00:12:31,816 --> 00:12:33,952 And of course, one advantage of 396 00:12:33,952 --> 00:12:36,128 online learning is also that 397 00:12:36,128 --> 00:12:37,458 if you have a changing pool 398 00:12:37,458 --> 00:12:38,967 of users, or if the 399 00:12:38,967 --> 00:12:40,082 things you're trying to predict are 400 00:12:40,082 --> 00:12:42,032 slowly changing like your user 401 00:12:42,032 --> 00:12:43,751 taste is slowly changing, the online 402 00:12:43,751 --> 00:12:45,492 learning algorithm can slowly 403 00:12:45,492 --> 00:12:47,211 adapt your learned hypothesis to 404 00:12:47,211 --> 00:12:49,161 whatever the latest sets of 405 00:12:49,161 --> 99:59:59,000 user behaviors are like as well.