[MUSIC] Let's take a moment to compare gradient to stochastic gradient to really understand how the two approaches relate to each other. And we'll see that this very, very small change in your algorithm and your implementation is going to make a big difference.

So we're going to build out the table here, comparing the two approaches, gradient in blue, stochastic gradient in green. And we saw already that gradient is slow at computing an update for large data, while stochastic gradient is fast. It doesn't depend on the dataset size, it's always fast. But the question is, what's the total time? Each iteration is cheaper for stochastic gradient, but is it cheaper overall? And there are two answers to this question. In theory, stochastic gradient for large datasets is always faster, always. In practice, it's a little more nuanced, but it's often faster, so it's often a good thing to do.

However, stochastic gradient has a significant problem: it's much more sensitive to the choice of parameters, like the choice of step size, and it has lots of practical problems. And so a lot of the focus of today's module is to talk about those practical challenges of stochastic gradient and how to overcome them, and how to get the most benefit out of this small change to your algorithm.

We'll see a lot of pictures like this, so I'm going to take a few minutes to explain it. This picture compares gradient to stochastic gradient. The red line here is the behavior of gradient as you iterate through the data, so as you make passes over the data. And on the y axis we see the data likelihood, so higher is better, we fit the data better. The blue line here is stochastic gradient. And to be able to compare the two approaches, on the x axis I'm not showing exactly running time, but I'm showing you how many data points you need to touch. So stochastic gradient is going to make an update every time it sees a data point. Gradient is going to make an update every time it makes a full pass over the data. And so we're showing how many passes we're making over the data. So the full x axis here is ten passes over the data, and you see that after ten passes over the data, gradient is getting to a likelihood that's much lower than that of stochastic gradient.
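To make that contrast concrete, here is a minimal Python sketch, not from the lecture itself, of the two update rules for a logistic regression log-likelihood. The function names `full_gradient_pass` and `stochastic_gradient_pass`, and the use of NumPy, are my own illustrative choices; the point is simply that one pass over the data gives gradient a single update but gives stochastic gradient N updates.

```python
# Illustrative sketch (assumed logistic regression model, not the course's code).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def full_gradient_pass(w, X, y, step_size):
    # Gradient: one pass over the data produces ONE update,
    # computed from the contributions of all N data points.
    grad = X.T @ (y - sigmoid(X @ w))
    return w + step_size * grad

def stochastic_gradient_pass(w, X, y, step_size):
    # Stochastic gradient: one pass over the data produces N updates,
    # each computed from the contribution of a single data point.
    for i in np.random.permutation(len(y)):
        grad_i = X[i] * (y[i] - sigmoid(X[i] @ w))
        w = w + step_size * grad_i
    return w
```

Both functions touch every data point exactly once per pass, which is why the lecture compares them on an x axis of passes over the data rather than wall-clock time.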
Stochastic gradient goes to a point that's much higher. And even if you look at longer time scales, you'll see stochastic gradient converge faster. However, it doesn't converge smoothly to the optimal solution. It oscillates around the optimal solution, and we will understand today why that happens. And oscillation is one of the challenges introduced by stochastic gradient. So now I've extended it from 10 passes over the data to 100 passes over the data, and now we see that gradient is getting to solutions very close to those of stochastic gradient. But again, you see a lot of noise and a lot of oscillation from stochastic gradient. Sometimes it's good, sometimes it's bad, sometimes it's good, sometimes it's bad. So that's the challenge there.

So here's a summary of what we've learned. We make a tiny change to our algorithm: instead of using the whole dataset to compute the gradient, we use a single data point, and we call that stochastic gradient. It's going to get you better quality faster. However, it's going to be tricky; there are going to be some oscillations, and you have to learn some of the practical issues that you need to address in order to make this really effective. But this change is going to allow you to scale to billions of data points. Even on your desktop you'll be able to deal with a ton of data, which is really super exciting. [MUSIC]
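If you want to see the behavior in the picture for yourself, here is a small self-contained sketch, my own illustration rather than the lecture's code, that runs both approaches on a synthetic logistic regression dataset and prints the average log-likelihood after each pass. The synthetic data, the step size of 0.1, and the choice to average the full gradient over the N points are all assumptions made just for the demo.

```python
# Illustrative demo of gradient vs. stochastic gradient on synthetic data
# (all settings below are assumptions, not the lecture's configuration).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
N, D = 1000, 5
X = rng.normal(size=(N, D))
true_w = rng.normal(size=D)
y = (rng.uniform(size=N) < sigmoid(X @ true_w)).astype(float)

def avg_log_likelihood(w):
    p = sigmoid(X @ w)
    return np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))

step_size = 0.1
w_full, w_sgd = np.zeros(D), np.zeros(D)
for epoch in range(10):  # 10 passes over the data, as in the first plot
    # Gradient: one update per pass, using the average gradient over all N points.
    w_full = w_full + step_size * (X.T @ (y - sigmoid(X @ w_full))) / N
    # Stochastic gradient: N updates per pass, one data point at a time.
    for i in rng.permutation(N):
        w_sgd = w_sgd + step_size * X[i] * (y[i] - sigmoid(X[i] @ w_sgd))
    print(epoch, avg_log_likelihood(w_full), avg_log_likelihood(w_sgd))
```

With settings like these, you would typically see the stochastic gradient column climb much faster over the first few passes while bouncing around rather than increasing smoothly, which is exactly the oscillation discussed above; how sensitive that behavior is to the step size is one of the practical issues covered later in the module.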