In this video, we'll discuss advanced optimization techniques that allow us to improve gradient descent methods. As we remember from the previous video, if our function is difficult, for example, it has elliptical level lines like on this graph, then gradient descent will oscillate, and it will take many iterations for it to converge to the minimum. To improve this, we can somehow change our gradient descent methods. For example, take mini-batch gradient descent. In this method, we take some random m examples from our training set, we approximate the gradient based on these m examples, and then we make a gradient step.

So in this video, we'll suppose that we have some way to approximate the gradient. Maybe by a mini-batch, maybe by stochastic gradient descent with just one point, or maybe we calculate the full gradient. That doesn't matter; we just have some approximation of the gradient, and we denote it by g_t. And in this video, we'll discuss how we can modify the way we make a gradient step, how we add the gradient, or antigradient, to our previous approximation, w_{t-1}.

So let's start with the momentum method. In this method, we maintain an additional vector h at every iteration. To calculate h at step t, we take the vector h from step t-1, multiply it by some coefficient alpha, and add the gradient at the current iteration, g_t, multiplied by the learning rate, eta_t: h_t = alpha * h_{t-1} + eta_t * g_t. So actually, h_t is just a weighted sum of gradients from all previous iterations and from this iteration as well. And then we just make a step in the direction of -h_t, that is, w_t = w_{t-1} - h_t.

So what is the point of this method? Suppose that we have some function and we run gradient descent on it. For some coordinates of our parameter vector, the gradients always have the same sign, so they lead us to the minimum. And for some coordinates, the sign of the gradient changes from iteration to iteration. So the vector h_t will be large for components where the gradients have the same sign on every iteration, and we will make large steps along these coordinates. And for coordinates where the sign changes, the gradients will just cancel each other out, and h_t will be close to zero. So h_t dampens the coordinates that lead to oscillation and helps us achieve better convergence.
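To make the update concrete, here is a minimal sketch of one momentum step in NumPy; the function name, the default constants, and the small quadratic objective with elliptical level lines are illustrative assumptions, not something from the video.

```python
import numpy as np

def momentum_step(w, h, grad, lr=0.01, alpha=0.9):
    """One momentum update: h_t = alpha * h_{t-1} + lr * g_t;  w_t = w_{t-1} - h_t."""
    h = alpha * h + lr * grad
    w = w - h
    return w, h

# Illustrative objective with elliptical level lines: f(w) = 0.5 * w^T A w.
A = np.diag([1.0, 50.0])
grad_f = lambda w: A @ w  # gradient of the quadratic

w = np.array([5.0, 5.0])
h = np.zeros_like(w)
for t in range(200):
    w, h = momentum_step(w, h, grad_f(w), lr=0.01, alpha=0.9)
print(w)  # approaches the minimum at the origin
```

With alpha set to 0, this reduces to a plain gradient step, so the only new choice compared to ordinary gradient descent is the momentum coefficient.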
Here, we should somehow choose the parameter alpha, for example, we could just set it to 0.9, and we also need to somehow choose the learning rate, eta_t. In practice, this momentum method indeed leads to faster convergence. So for our last function, with elliptical level lines, we'll see a picture like this one. Here, we don't oscillate as much as in the previous picture, and we converge better.

There is also an extension of the momentum method called Nesterov momentum. In the simple momentum method, on every iteration we calculate the gradient at the current point, w_{t-1}, combine it with the momentum to get h_t, and then take a step. But since we know for sure that we'll move in the direction of the momentum anyway, it is smarter to first take a step in the direction of the momentum to get a new approximation of the parameter vector, and then to calculate the gradient at that new point. In this case, we'll get a better approximation on the next step.

Mathematically, we take the current point, w_{t-1}, subtract alpha multiplied by h_{t-1}, and calculate the gradient at this new point. Then we add this gradient, multiplied by the learning rate, to the momentum from the previous iteration, alpha multiplied by h_{t-1}: h_t = alpha * h_{t-1} + eta_t * g(w_{t-1} - alpha * h_{t-1}). And in practice, this method indeed leads to better convergence than the momentum method.
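Here is a minimal sketch of that Nesterov update, assuming a grad_fn callback that returns the gradient approximation (full, mini-batch, or stochastic) at any given point; the function name and default constants are again illustrative.

```python
import numpy as np

def nesterov_step(w, h, grad_fn, lr=0.01, alpha=0.9):
    """One Nesterov momentum update:
    h_t = alpha * h_{t-1} + lr * grad(w_{t-1} - alpha * h_{t-1});  w_t = w_{t-1} - h_t."""
    lookahead = w - alpha * h                # first move in the momentum direction
    h = alpha * h + lr * grad_fn(lookahead)  # gradient at the lookahead point, not at w
    w = w - h
    return w, h
```

Compared with the plain momentum sketch above, the only change is where the gradient is evaluated.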
So the momentum method and the Nesterov momentum method work better with difficult functions, with complex level sets. But they still require us to choose the learning rate, and they are very sensitive to it. So now we'll discuss some other optimization methods that try to choose the learning rate adaptively, so we don't have to choose it ourselves.

One such method is AdaGrad. Let's remember how we make a gradient step, just for one coordinate j of the parameter vector w. To make a gradient step, we take the j-th component of the parameter vector from the previous iteration, w_{t-1,j}, and we subtract the j-th component of the gradient at the current point, multiplied by the learning rate, to find the next parameter approximation, w_{t,j}. In AdaGrad, we maintain an additional value, G, for each parameter of our parameter vector. To calculate it, we take G_{t-1,j}, the value from the previous iteration, and we just add the square of the gradient at the current iteration.

So essentially, G is a sum of squares of gradients from all previous iterations, and then we modify our gradient step: we divide our learning rate, eta_t, by the square root of G_{t,j} plus some small number, epsilon. We add epsilon just to make sure that we don't divide the learning rate by zero. So this AdaGrad method chooses the learning rate adaptively. Here, we can just set eta_t, our learning rate, to some constant, for example 0.01, and not think about it at all. So we no longer need to choose the learning rate carefully, but the method also has a disadvantage. Our auxiliary parameter, G, accumulates squares of gradients, and at some step it can become too large. We'll then divide our learning rate, eta_t, by a huge number, and gradient descent will stop making progress. So to overcome this, we need other methods that improve on AdaGrad.

By the way, AdaGrad has another advantage: it chooses its own learning rate for each feature. Suppose that we are analyzing texts, and each feature in our sample corresponds to one word. For some frequent word that we see in every document, we get gradient updates on every step, so we'll make smaller steps. And for some rare word that we meet only in one of a thousand or ten thousand documents, we'll make larger updates, because updates for it come rarely and we need to move faster in the direction of such words.

One method that improves on AdaGrad is RMSprop. This method is very similar to AdaGrad, but here we take an exponentially weighted average of squares of gradients on every step. So here, to calculate G_j at step t, we take G_j from the previous step, t-1, and multiply it by some coefficient alpha. And then we add the square of the gradient at this iteration, g_{t,j} squared, multiplied by 1 - alpha. And then we use this G to divide our learning rate, just like in AdaGrad. So this method overcomes the problem of an ever-growing sum of squared gradients, and here our learning rate depends mostly on the last steps of our gradient descent, not on the whole history.
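Here is a minimal sketch of both per-coordinate updates in NumPy; the function names and the default constants (learning rate 0.01, alpha = 0.9, epsilon = 1e-8) are illustrative assumptions.

```python
import numpy as np

def adagrad_step(w, G, grad, lr=0.01, eps=1e-8):
    """AdaGrad: G accumulates squared gradients, which scales down each coordinate's step."""
    G = G + grad ** 2
    w = w - lr * grad / (np.sqrt(G) + eps)
    return w, G

def rmsprop_step(w, G, grad, lr=0.01, alpha=0.9, eps=1e-8):
    """RMSprop: G is an exponentially weighted average of squared gradients instead of a sum."""
    G = alpha * G + (1 - alpha) * grad ** 2
    w = w - lr * grad / (np.sqrt(G) + eps)
    return w, G
```

The only difference between the two is how G is accumulated: a running sum versus an exponentially weighted average, which is what keeps the RMSprop denominator from growing without bound.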
Now let's take the RMSprop method and slightly modify it. In RMSprop, we maintained an additional variable; here we'll denote it by v_{t,j}. It is just an exponentially weighted sum of squares of gradients from all iterations. So to calculate v for the j-th component at step t, we take v from the previous step, v_{t-1,j}, multiplied by some coefficient beta_2, and we add the square of the gradient at the current iteration, g_{t,j} squared, multiplied by 1 - beta_2.

We can notice that this approximation v has some bias towards zero, especially at the first steps, because we initialize it with zero. So to overcome this, we just divide it by 1 - beta_2 to the power of t. This normalization allows us to get rid of the bias: at the first steps the correction is large, and for large t it is almost equal to 1. And then we use this v to divide our learning rate, eta_t.

But as we know from the momentum method, the gradient approximation from one step can be noisy and lead to oscillations, so let's also smooth our gradients. To do that, we maintain another auxiliary variable, m, that is essentially a sum of gradients. Not squares of gradients, like in v, but just gradients. It is calculated in the same way, as an exponentially weighted average with coefficient beta_1 and with the same kind of bias correction. And then we replace g, the gradient approximation in our gradient step, by m, the weighted average of gradients from all previous iterations. So this method, Adam, combines both the momentum method and the adaptive learning rate methods, and in practice it achieves better and faster convergence.

In this video, we discussed some advanced optimization techniques that can improve gradient descent: the momentum method and the Nesterov momentum method, which allow us to work with complex functions with complex level sets; and the AdaGrad and RMSprop methods, which have adaptive learning rates, so we don't have to choose eta_t manually. And there is also the Adam method, which combines both of these ideas.
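To tie everything together, here is a minimal sketch of one Adam step, assuming the standard bias correction for both moment estimates and illustrative default constants (beta_1 = 0.9, beta_2 = 0.999, learning rate 0.001, epsilon = 1e-8).

```python
import numpy as np

def adam_step(w, m, v, grad, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: exponentially weighted first moment (m) and second moment (v),
    both bias-corrected, then a per-coordinate adaptive step."""
    m = beta1 * m + (1 - beta1) * grad       # smoothed gradient (momentum part)
    v = beta2 * v + (1 - beta2) * grad ** 2  # smoothed squared gradient (adaptive part)
    m_hat = m / (1 - beta1 ** t)             # bias correction; t is the 1-based step counter
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```

As in the earlier sketches, grad can be a full gradient, a mini-batch approximation, or a single stochastic example.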