In this lesson, we'll discuss extensions of gradient descent that allow us to learn better and faster.

In gradient descent, we try to minimize a loss function which is usually a sum of loss functions on the separate examples from our training set. We start with some initialization w0, and then on every step we take the current approximation of the parameters, w t minus one, subtract the gradient at this point multiplied by eta t, the learning rate, and make the step. Then we check the stopping criterion. For example, we can check whether the new parameter vector is close to the previous one, and if it is close, we stop the gradient descent.

For example, for mean squared error, the loss function looks like a sum of squared errors on the separate examples from the training set. So to calculate the gradient of mean squared error, we should sum the gradients of the squared errors over all training examples. And if we have a million training examples, then we have to sum a million gradients to calculate the gradient for one step of our algorithm. That's a lot. For example, to make 1,000 gradient steps, we have to calculate a billion gradients. This makes gradient descent infeasible for large-scale problems.

To overcome this problem we can use stochastic gradient descent. It's very similar to gradient descent, with only one difference. We start with some initialization w0, and then on every step t we select a random example from our training set; we denote the index of this example by i. Then we calculate the gradient only on this example, and we make a step in the direction of this gradient. So in stochastic gradient descent we approximate the gradient of the whole loss function by the gradient of the loss on only one example.

Of course, this leads to very noisy approximations. If we analyze how stochastic gradient descent behaves on some sample, we see the following picture: from iteration to iteration, the loss function can increase or decrease, but if we make enough iterations, the method converges to some minimum. So stochastic gradient descent makes very noisy updates, but it has a large advantage: it needs only one example to make one gradient step.
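To make the contrast concrete, here is a minimal sketch, not from the lecture itself, of both procedures in Python with NumPy, assuming a linear model with mean squared error and a fixed learning rate eta; the function names and default parameters are illustrative only.

```python
import numpy as np

def mse_gradient(w, X, y):
    # Gradient of mean squared error for a linear model y ~ X @ w,
    # averaged over all training examples.
    return 2 * X.T @ (X @ w - y) / len(y)

def gradient_descent(X, y, eta=0.01, n_steps=1000, tol=1e-6):
    w = np.zeros(X.shape[1])                      # initialization w0
    for _ in range(n_steps):
        w_new = w - eta * mse_gradient(w, X, y)   # full gradient: touches every example
        if np.linalg.norm(w_new - w) < tol:       # stop when parameters barely move
            return w_new
        w = w_new
    return w

def stochastic_gradient_descent(X, y, eta=0.01, n_steps=1000):
    w = np.zeros(X.shape[1])                      # initialization w0
    rng = np.random.default_rng(0)
    for _ in range(n_steps):
        i = rng.integers(len(y))                  # pick one random example i
        grad_i = 2 * X[i] * (X[i] @ w - y[i])     # gradient of the loss on example i only
        w = w - eta * grad_i                      # cheap but noisy step
    return w
```

Note that each full-gradient step touches every row of X, while each stochastic step touches only one row, which is exactly the cost difference discussed above.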
Stochastic gradient descent may also be used in an online setting. Suppose that you have some stream of data, for example a click stream from a search engine, and you want to adapt your model to this stream, to learn online. If you use stochastic gradient descent, then when you receive another example from the click stream, you can just make one gradient step for this particular example, and that would be online learning.

There is also a disadvantage of stochastic gradient descent. The learning rate eta t should be chosen very carefully, because if you choose a large learning rate, the method may not converge, and if you choose a learning rate that is too small, convergence will be too slow: it can require thousands and maybe millions of iterations to reach the minimum.

To overcome some of these problems, one can use mini-batch gradient descent, which merges some properties of gradient descent and stochastic gradient descent. In mini-batch gradient descent, on every iteration we choose m random examples from our training sample; we denote their indices by i1, ..., im. Then we calculate the gradient on each of these examples, average these gradients, and make a step in this direction. So in mini-batch gradient descent we use m points to estimate the full gradient, instead of one point as in stochastic gradient descent. The updates of mini-batch gradient descent have much less noise than those of stochastic gradient descent, and this method can still be used in the online learning setting: just accumulate m examples from your stream, and then make an update. But still, the learning rate eta t should be chosen very carefully for mini-batch gradient descent.

And there is another problem with these methods, both stochastic and non-stochastic. Suppose that we have a difficult function whose level lines look like this: they have an elliptic form. It is known from calculus that the gradient is always orthogonal to the level line. So we start from some point w0, then we make a gradient step that takes us to the other side of the function. Then we take another gradient step that takes us in the opposite direction, and so on. So gradient descent will oscillate on this function, and this is not very good: it will take many iterations for it to converge, and it would be good to somehow overcome this problem.
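Returning to the mini-batch update described above, a minimal sketch under the same assumptions as before (linear model, mean squared error, fixed learning rate, NumPy) might look like this; the batch size m and the other defaults are arbitrary choices for illustration.

```python
import numpy as np

def minibatch_gradient_descent(X, y, eta=0.01, m=32, n_steps=1000):
    # Mini-batch gradient descent for a linear model with mean squared error:
    # every step averages the gradients of m randomly chosen examples.
    w = np.zeros(X.shape[1])                              # initialization w0
    rng = np.random.default_rng(0)
    for _ in range(n_steps):
        idx = rng.choice(len(y), size=m, replace=False)   # indices i1, ..., im
        X_b, y_b = X[idx], y[idx]
        grad = 2 * X_b.T @ (X_b @ w - y_b) / m            # average of the m per-example gradients
        w = w - eta * grad                                # less noisy than a one-example step
    return w
```

The batch size m trades noise against cost per step: m = 1 recovers stochastic gradient descent, and m equal to the training set size recovers ordinary gradient descent.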
So in this video, we've discussed stochastic extensions of gradient descent: stochastic gradient descent and mini-batch gradient descent. They are much faster than gradient descent and can be used in the online learning setting. But they have some problems: the learning rate still has to be chosen somehow, and they can struggle with difficult functions. In the next video, we will discuss some other extensions of stochastic methods that can overcome these difficulties.