In this lesson, we'll discuss extensions of gradient descent that allow us to learn better and faster.

In gradient descent, we try to minimize a loss function which is usually a sum of loss functions on the separate examples from our training set. We start with some initialization w0, and then on every step we take the current approximation of the parameters, w t minus one, subtract the gradient at this point multiplied by eta t, the learning rate, and make the step. Then we check the stopping criterion. For example, we can check whether the new parameter vector is close to the previous one, and if it is close, we stop the gradient descent.

For example, for mean squared error, the loss function looks like a sum of squared errors on the separate examples from the training set. So to calculate the gradient of mean squared error, we should sum the gradients of the squared errors over all training examples. And if we have a million training examples, then we have to sum a million gradients to calculate the gradient for one step of our algorithm. That's a lot. For example, to make 1,000 gradient steps, we have to calculate a billion gradients. This makes gradient descent infeasible for large-scale problems.

To overcome this problem we can use stochastic gradient descent. It's very similar to gradient descent, with only one difference. We start with some initialization w0, and then on every step t we select a random example from our training set; we denote the index of this example by i. Then we calculate the gradient only on this example, and we make a step in the direction of this gradient. So in stochastic gradient descent we approximate the gradient of the whole loss function by the gradient of the loss on only one example.

Of course, this leads to very noisy approximations. If we analyze how stochastic gradient descent behaves on some sample, we see the following picture: from iteration to iteration, the loss function can increase or decrease, but if we make enough iterations, the method converges to some minimum. So stochastic gradient descent makes very noisy updates, but it has a large advantage: it needs only one example to make one gradient step.
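To make the contrast concrete, here is a minimal sketch, not from the lecture itself, of both procedures in Python with NumPy, assuming a linear model with mean squared error and a fixed learning rate eta; the function names and default parameters are illustrative only.

```python
import numpy as np

def mse_gradient(w, X, y):
    # Gradient of mean squared error for a linear model y ~ X @ w,
    # averaged over all training examples.
    return 2 * X.T @ (X @ w - y) / len(y)

def gradient_descent(X, y, eta=0.01, n_steps=1000, tol=1e-6):
    w = np.zeros(X.shape[1])                      # initialization w0
    for _ in range(n_steps):
        w_new = w - eta * mse_gradient(w, X, y)   # full gradient: touches every example
        if np.linalg.norm(w_new - w) < tol:       # stop when parameters barely move
            return w_new
        w = w_new
    return w

def stochastic_gradient_descent(X, y, eta=0.01, n_steps=1000):
    w = np.zeros(X.shape[1])                      # initialization w0
    rng = np.random.default_rng(0)
    for _ in range(n_steps):
        i = rng.integers(len(y))                  # pick one random example i
        grad_i = 2 * X[i] * (X[i] @ w - y[i])     # gradient of the loss on example i only
        w = w - eta * grad_i                      # cheap but noisy step
    return w
```

Note that each full-gradient step touches every row of X, while each stochastic step touches only one row, which is exactly the cost difference discussed above.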
Stochastic gradient descent may also be used in an online setting. Suppose that you have some stream of data, for example a click stream from a search engine, and you want to adapt your model to this stream, to learn online. If you use stochastic gradient descent, then when you receive another example from the click stream, you can just make one gradient step for this particular example, and that would be online learning.

There is also a disadvantage of stochastic gradient descent. The learning rate eta t should be chosen very carefully, because if you choose a large learning rate, the method may not converge, and if you choose a learning rate that is too small, convergence will be too slow: it can require thousands and maybe millions of iterations to reach the minimum.

To overcome some of these problems, one can use mini-batch gradient descent, which merges some properties of gradient descent and stochastic gradient descent. In mini-batch gradient descent, on every iteration we choose m random examples from our training sample; we denote their indices by i1, ..., im. Then we calculate the gradient on each of these examples, average these gradients, and make a step in this direction. So in mini-batch gradient descent we use m points to estimate the full gradient, instead of one point as in stochastic gradient descent. The updates of mini-batch gradient descent have much less noise than those of stochastic gradient descent, and this method can still be used in the online learning setting: just accumulate m examples from your stream, and then make an update. But still, the learning rate eta t should be chosen very carefully for mini-batch gradient descent.

And there is another problem with these methods, both stochastic and non-stochastic. Suppose that we have a difficult function whose level lines look like this: they have an elliptic form. It is known from calculus that the gradient is always orthogonal to the level line. So we start from some point w0, then we make a gradient step that takes us to the other side of the function. Then we take another gradient step that takes us in the opposite direction, and so on. So gradient descent will oscillate on this function, and this is not very good: it will take many iterations for it to converge, and it would be good to somehow overcome this problem.
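Returning to the mini-batch update described above, a minimal sketch under the same assumptions as before (linear model, mean squared error, fixed learning rate, NumPy) might look like this; the batch size m and the other defaults are arbitrary choices for illustration.

```python
import numpy as np

def minibatch_gradient_descent(X, y, eta=0.01, m=32, n_steps=1000):
    # Mini-batch gradient descent for a linear model with mean squared error:
    # every step averages the gradients of m randomly chosen examples.
    w = np.zeros(X.shape[1])                              # initialization w0
    rng = np.random.default_rng(0)
    for _ in range(n_steps):
        idx = rng.choice(len(y), size=m, replace=False)   # indices i1, ..., im
        X_b, y_b = X[idx], y[idx]
        grad = 2 * X_b.T @ (X_b @ w - y_b) / m            # average of the m per-example gradients
        w = w - eta * grad                                # less noisy than a one-example step
    return w
```

The batch size m trades noise against cost per step: m = 1 recovers stochastic gradient descent, and m equal to the training set size recovers ordinary gradient descent.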
So in this video, we've discussed stochastic extensions of gradient descent: stochastic gradient descent and mini-batch gradient descent. They are much faster than gradient descent and can be used in the online learning setting. But they have some problems: the learning rate still has to be chosen somehow, and they can struggle with difficult functions. In the next video, we will discuss some other extensions of stochastic methods that can overcome these difficulties.