In this video, we'll discuss advanced optimization techniques that allow us to improve gradient descent methods. As we remember from the previous video, if our function is difficult, for example, it has elliptical level lines like on this graph, then gradient descent will oscillate, and it will take many iterations for it to converge to the minimum. To improve this, we can somehow change our gradient descent methods. For example, take mini-batch gradient descent. In this method, we take some random m examples from our training set, we approximate the gradient based on these m examples, and then we make a gradient step.

So in this video, we'll suppose that we have some way to approximate the gradient. Maybe by a mini-batch, maybe by stochastic gradient descent with just one point, or maybe we calculate the full gradient. That doesn't matter; we just have some approximation of the gradient, and we denote it by g_t. And in this video, we'll discuss how we can modify the way we make a gradient step, how we add the gradient, or antigradient, to our previous approximation, w_{t-1}.

So let's start with the momentum method. In this method, we maintain an additional vector h at every iteration. To calculate h at step t, we take the vector h from step t-1, multiply it by some coefficient alpha, and add the gradient at the current iteration, g_t, multiplied by the learning rate, eta_t: h_t = alpha * h_{t-1} + eta_t * g_t. So actually, h_t is just a weighted sum of gradients from all previous iterations and from this iteration as well. And then we just make a step in the direction of -h_t, that is, w_t = w_{t-1} - h_t.

So what is the point of this method? Suppose that we have some function and we run gradient descent on it. For some coordinates of our parameter vector, the gradients always have the same sign, so they lead us to the minimum. And for some coordinates, the sign of the gradient changes from iteration to iteration. So the vector h_t will be large for components where the gradients have the same sign on every iteration, and we will make large steps along these coordinates. And for coordinates where the sign changes, the gradients will just cancel each other out, and h_t will be close to zero. So h_t dampens the coordinates that lead to oscillation and helps us achieve better convergence.
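To make the update concrete, here is a minimal sketch of one momentum step in NumPy; the function name, the default constants, and the small quadratic objective with elliptical level lines are illustrative assumptions, not something from the video.

```python
import numpy as np

def momentum_step(w, h, grad, lr=0.01, alpha=0.9):
    """One momentum update: h_t = alpha * h_{t-1} + lr * g_t;  w_t = w_{t-1} - h_t."""
    h = alpha * h + lr * grad
    w = w - h
    return w, h

# Illustrative objective with elliptical level lines: f(w) = 0.5 * w^T A w.
A = np.diag([1.0, 50.0])
grad_f = lambda w: A @ w  # gradient of the quadratic

w = np.array([5.0, 5.0])
h = np.zeros_like(w)
for t in range(200):
    w, h = momentum_step(w, h, grad_f(w), lr=0.01, alpha=0.9)
print(w)  # approaches the minimum at the origin
```

With alpha set to 0, this reduces to a plain gradient step, so the only new choice compared to ordinary gradient descent is the momentum coefficient.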
Here, we should somehow choose the parameter alpha, for example, we could just set it to 0.9, and we also need to somehow choose the learning rate, eta_t. In practice, this momentum method indeed leads to faster convergence. So for our last function, with elliptical level lines, we'll see a picture like this one. Here, we don't oscillate as much as in the previous picture, and we converge better.

There is also an extension of the momentum method called Nesterov momentum. In the simple momentum method, on every iteration we calculate the gradient at the current point, w_{t-1}, combine it with the momentum to get h_t, and then take a step. But since we know for sure that we'll move in the direction of the momentum anyway, it is smarter to first take a step in the direction of the momentum to get a new approximation of the parameter vector, and then to calculate the gradient at that new point. In this case, we'll get a better approximation on the next step.

Mathematically, we take the current point, w_{t-1}, subtract alpha multiplied by h_{t-1}, and calculate the gradient at this new point. Then we add this gradient, multiplied by the learning rate, to the momentum from the previous iteration, alpha multiplied by h_{t-1}: h_t = alpha * h_{t-1} + eta_t * g(w_{t-1} - alpha * h_{t-1}). And in practice, this method indeed leads to better convergence than the momentum method.
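Here is a minimal sketch of that Nesterov update, assuming a grad_fn callback that returns the gradient approximation (full, mini-batch, or stochastic) at any given point; the function name and default constants are again illustrative.

```python
import numpy as np

def nesterov_step(w, h, grad_fn, lr=0.01, alpha=0.9):
    """One Nesterov momentum update:
    h_t = alpha * h_{t-1} + lr * grad(w_{t-1} - alpha * h_{t-1});  w_t = w_{t-1} - h_t."""
    lookahead = w - alpha * h                # first move in the momentum direction
    h = alpha * h + lr * grad_fn(lookahead)  # gradient at the lookahead point, not at w
    w = w - h
    return w, h
```

Compared with the plain momentum sketch above, the only change is where the gradient is evaluated.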
So the momentum method and the Nesterov momentum method work better with difficult functions, with complex level sets. But they still require us to choose the learning rate, and they are very sensitive to it. So now we'll discuss some other optimization methods that try to choose the learning rate adaptively, so we don't have to choose it ourselves.

One such method is AdaGrad. Let's remember how we make a gradient step, just for one coordinate j of the parameter vector w. To make a gradient step, we take the j-th component of the parameter vector from the previous iteration, w_{t-1,j}, and we subtract the j-th component of the gradient at the current point, multiplied by the learning rate, to find the next parameter approximation, w_{t,j}. In AdaGrad, we maintain an additional value, G, for each parameter of our parameter vector. To calculate it, we take G_{t-1,j}, the value from the previous iteration, and we just add the square of the gradient at the current iteration.

So essentially, G is a sum of squares of gradients from all previous iterations, and then we modify our gradient step: we divide our learning rate, eta_t, by the square root of G_{t,j} plus some small number, epsilon. We add epsilon just to make sure that we don't divide the learning rate by zero. So this AdaGrad method chooses the learning rate adaptively. Here, we can just set eta_t, our learning rate, to some constant, for example 0.01, and not think about it at all. So we no longer need to choose the learning rate carefully, but the method also has a disadvantage. Our auxiliary parameter, G, accumulates squares of gradients, and at some step it can become too large. We'll then divide our learning rate, eta_t, by a huge number, and gradient descent will stop making progress. So to overcome this, we need other methods that improve on AdaGrad.

By the way, AdaGrad has another advantage: it chooses its own learning rate for each feature. Suppose that we are analyzing texts, and each feature in our sample corresponds to one word. For some frequent word that we see in every document, we get gradient updates on every step, so we'll make smaller steps. And for some rare word that we meet only in one of a thousand or ten thousand documents, we'll make larger updates, because updates for it come rarely and we need to move faster in the direction of such words.

One method that improves on AdaGrad is RMSprop. This method is very similar to AdaGrad, but here we take an exponentially weighted average of squares of gradients on every step. So here, to calculate G_j at step t, we take G_j from the previous step, t-1, and multiply it by some coefficient alpha. And then we add the square of the gradient at this iteration, g_{t,j} squared, multiplied by 1 - alpha. And then we use this G to divide our learning rate, just like in AdaGrad. So this method overcomes the problem of an ever-growing sum of squared gradients, and here our learning rate depends mostly on the last steps of our gradient descent, not on the whole history.
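Here is a minimal sketch of both per-coordinate updates in NumPy; the function names and the default constants (learning rate 0.01, alpha = 0.9, epsilon = 1e-8) are illustrative assumptions.

```python
import numpy as np

def adagrad_step(w, G, grad, lr=0.01, eps=1e-8):
    """AdaGrad: G accumulates squared gradients, which scales down each coordinate's step."""
    G = G + grad ** 2
    w = w - lr * grad / (np.sqrt(G) + eps)
    return w, G

def rmsprop_step(w, G, grad, lr=0.01, alpha=0.9, eps=1e-8):
    """RMSprop: G is an exponentially weighted average of squared gradients instead of a sum."""
    G = alpha * G + (1 - alpha) * grad ** 2
    w = w - lr * grad / (np.sqrt(G) + eps)
    return w, G
```

The only difference between the two is how G is accumulated: a running sum versus an exponentially weighted average, which is what keeps the RMSprop denominator from growing without bound.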
Now let's take the RMSprop method and slightly modify it. In RMSprop, we maintained an additional variable; here we'll denote it by v_{t,j}. It is just an exponentially weighted sum of squares of gradients from all iterations. So to calculate v for the j-th component at step t, we take v from the previous step, v_{t-1,j}, multiplied by some coefficient beta_2, and we add the square of the gradient at the current iteration, g_{t,j} squared, multiplied by 1 - beta_2.

We can notice that this approximation v has some bias towards zero, especially at the first steps, because we initialize it with zero. So to overcome this, we just divide it by 1 - beta_2 to the power of t. This normalization allows us to get rid of the bias: at the first steps the correction is large, and for large t it is almost equal to 1. And then we use this v to divide our learning rate, eta_t.

But as we know from the momentum method, the gradient approximation from one step can be noisy and lead to oscillations, so let's also smooth our gradients. To do that, we maintain another auxiliary variable, m, that is essentially a sum of gradients. Not squares of gradients, like in v, but just gradients. It is calculated in the same way, as an exponentially weighted average with coefficient beta_1 and with the same kind of bias correction. And then we replace g, the gradient approximation in our gradient step, by m, the weighted average of gradients from all previous iterations. So this method, Adam, combines both the momentum method and the adaptive learning rate methods, and in practice it achieves better and faster convergence.

In this video, we discussed some advanced optimization techniques that can improve gradient descent: the momentum method and the Nesterov momentum method, which allow us to work with complex functions with complex level sets; and the AdaGrad and RMSprop methods, which have adaptive learning rates, so we don't have to choose eta_t manually. And there is also the Adam method, which combines both of these ideas.
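To tie everything together, here is a minimal sketch of one Adam step, assuming the standard bias correction for both moment estimates and illustrative default constants (beta_1 = 0.9, beta_2 = 0.999, learning rate 0.001, epsilon = 1e-8).

```python
import numpy as np

def adam_step(w, m, v, grad, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: exponentially weighted first moment (m) and second moment (v),
    both bias-corrected, then a per-coordinate adaptive step."""
    m = beta1 * m + (1 - beta1) * grad       # smoothed gradient (momentum part)
    v = beta2 * v + (1 - beta2) * grad ** 2  # smoothed squared gradient (adaptive part)
    m_hat = m / (1 - beta1 ** t)             # bias correction; t is the 1-based step counter
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```

As in the earlier sketches, grad can be a full gradient, a mini-batch approximation, or a single stochastic example.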