And this is an example of an online learning problem. Data is arriving over time. You see an input x_t and you need to make a prediction, y hat t. So the input might be the text of the web page and information about you, and y hat t might be a prediction about what ads you might click on.

And then, given what happens in the real world, whether you click an ad, in which case y_t might be ad 2, or you don't click on anything, in which case y_t would be none of the above, no ad was good for you, whatever that outcome is gets fed into a machine learning algorithm that improves its coefficients, so it can improve its performance over time.

The question is, how do we design a machine learning algorithm that behaves like this? What's a good example of a machine learning algorithm that can improve its performance over time in an online fashion like this? And it turns out that we've seen one: stochastic gradient. Stochastic gradient is a learning algorithm that can be used for online learning.

So let's review it. You give me some initial set of coefficients, say everything equal to zero. At every time step, you get some input x_t. You make a prediction y hat t based on your current estimate of the coefficients. And then you're given the true label, y_t, and you want to feed those into the algorithm. Well, stochastic gradient will take those inputs, use them to compute the gradient, and then just update the coefficients: w_j(t+1) is going to be w_j(t) plus eta times the gradient, which is computed from these observed quantities in the real world.

So online learning is a different kind of learning that we haven't talked about at all in the specialization, but it's really important in practice. Data arrives over time and you need to make a decision right away about what to do with it. But based on that decision, you're going to get some feedback, and you're going to update the parameters immediately and keep going.

This online learning approach, where you update the parameters immediately as you see some information from the real world, can be extremely useful.
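To make that predict-observe-update loop concrete, here is a minimal sketch in Python. It assumes a logistic model over the features; the names (`predict`, `online_sgd`, the `stream` of (x_t, y_t) pairs) are illustrative assumptions, not from the lecture:

```python
import numpy as np

def predict(w, x):
    """Predicted probability that y_t = 1 under an assumed logistic model."""
    return 1.0 / (1.0 + np.exp(-np.dot(w, x)))

def online_sgd(stream, num_features, eta=0.1):
    """Online learning loop: predict, observe the true label, update immediately.

    `stream` yields one (x_t, y_t) pair at a time, with y_t in {0, 1}
    (e.g. clicked the ad or not).
    """
    w = np.zeros(num_features)           # initial coefficients, all zero
    for x_t, y_t in stream:
        y_hat = predict(w, x_t)          # prediction from current coefficients
        gradient = (y_t - y_hat) * x_t   # log-likelihood gradient for this one point
        w = w + eta * gradient           # w_j(t+1) = w_j(t) + eta * (gradient)_j
    return w
```

The coefficients are usable for prediction at every step of the loop, which is what makes stochastic gradient a natural fit for online learning.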
So, for example, your model is always up to date. It's always based on the latest data, the latest information in the world. It can have lower computational cost, because you can use techniques like stochastic gradient that don't have to look at all the data. And in fact, you don't even have to store all the data if it's too massive. However, most people do store the data because they might want to use it later, so that's a side note, but you don't have to.

However, online learning has some really difficult practical properties. The system you have to build, the actual design of how the data interacts with the world, where the data gets stored, where the coefficients get stored, and all of that, is really complex and complicated. It's hard to maintain. If you have oscillations in your machine learning algorithm, it can do really stupid things, and nobody wants their website to do stupid things. And you don't necessarily trust those individual stochastic gradient updates; sometimes they can give you bad predictions.

And so, in practice, most companies don't do something like this. What they do is save their data for a little while and update their models with the data from the last hour, or the last day, or the last week. It's very common, for example, for a large retailer to change its recommender system every night, running a big job every night to do that. And you can think about that as an extreme version of the mini-batches that we talked about earlier in this module, but now the batch is the whole data from the whole day. For you, that would be those 5 billion page views.
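As a rough sketch of that nightly pattern, under the same assumed logistic model as above (the function name and data layout are illustrative, not from the lecture), the overnight job just treats the whole day's log as one giant mini-batch:

```python
import numpy as np

def nightly_retrain(w, days_data, eta=0.01, num_passes=5):
    """Overnight job: update yesterday's coefficients w using the whole
    day's logged (x, y) pairs, treated as one giant mini-batch."""
    X = np.array([x for x, _ in days_data])
    y = np.array([label for _, label in days_data])
    for _ in range(num_passes):
        y_hat = 1.0 / (1.0 + np.exp(-X.dot(w)))  # logistic predictions for the day
        gradient = X.T.dot(y - y_hat) / len(y)   # average gradient over the whole day
        w = w + eta * gradient                   # same update rule, giant batch
    return w
```

The serving system keeps using yesterday's coefficients until the job finishes, so the model is at most a day stale, in exchange for the simpler, more trustworthy pipeline described above.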