In this video, we're going to look at a method that was developed in the late 1980s by Robbie Jacobs and then improved by a number of other people. The idea is that each connection in the neural net should have its own adaptive learning rate, which we set empirically by observing what happens to the weight on that connection when we update it. So if the gradient on that connection keeps reversing sign, we turn down the learning rate, and if the gradient stays consistent, we turn up the learning rate.

Let's start by thinking about why having a separate adaptive learning rate on each connection is a good idea. The problem is that in a deep multilayer net, the appropriate learning rates can vary widely between different weights, especially between weights in different layers. If, for example, we start with small weights, the gradients are often much smaller in the early layers than in the later layers.

Another factor that calls for different learning rates for different weights is the fan-in of a unit. The fan-in determines the size of the overshoot effects you get when you simultaneously change many of the different incoming weights to fix up the same error. It may be that the unit didn't get enough input, and when you change all of those weights at the same time to fix up the error, it now gets too much input. Obviously, that effect is going to be bigger if there's a bigger fan-in. The net in the diagram on the right has more or less the same fan-in for both layers, but fan-ins can be very different in other nets.

So the idea is that we're going to use a global learning rate, which we set by hand, and then multiply it by a local gain that is determined empirically for each weight. A simple way to determine those local gains is to start with a local gain of one for every weight. So initially we change the weight w_ij by the learning rate times the gain g_ij of one times the error derivative for that weight. Then we adapt g_ij: we increase g_ij if the gradient for the weight does not change sign, using small additive increases and multiplicative decreases. So if the gradient for the weight at time t has the same sign as the gradient at time t-1, where t refers to weight updates, their product will be positive, because you either get two negative gradients or two positive gradients, and then we increase g_ij by a small additive amount. If the gradients have opposite signs, we decrease g_ij, and because we want to damp g_ij down quickly if it's already big, we decrease it multiplicatively. That ensures that big gains decay very rapidly if oscillations start.
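To make that concrete, here is a minimal sketch in Python of one such update, assuming a plain full-batch setting; the 0.05 and 0.95 constants, the toy error function, and the array names are illustrative choices rather than anything prescribed in the lecture.

```python
import numpy as np

def adaptive_gain_step(w, grad, prev_grad, gains, eps=0.01, add=0.05, mult=0.95):
    """One full-batch weight update with a separate adaptive gain per weight.

    eps is the hand-set global learning rate.  gains holds one local gain per
    weight and starts at one; add and mult are the small additive increase and
    the multiplicative decrease applied to the gains.
    """
    same_sign = grad * prev_grad > 0          # did the gradient keep its sign?
    gains = np.where(same_sign,
                     gains + add,             # consistent sign: small additive increase
                     gains * mult)            # sign reversal: multiplicative decrease
    w = w - eps * gains * grad                # delta w_ij = -eps * g_ij * dE/dw_ij
    return w, gains

# Toy usage on the error E = sum(w**2), whose gradient is 2*w.
w = np.random.randn(5)
gains = np.ones_like(w)                       # every local gain starts at one
prev_grad = 2 * w                             # seed so the first sign comparison is defined
for step in range(50):
    grad = 2 * w
    w, gains = adaptive_gain_step(w, grad, prev_grad, gains)
    prev_grad = grad
```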
It's interesting to ask what would happen if the gradient were totally random, so on each weight update you pick a random gradient. Then you'll get an equal number of increases and decreases, because the gradient will equally often have the same sign as the previous gradient or the opposite sign. So you'll get a bunch of additive 0.05 increases and multiplicative 0.95 decreases, and those have an equilibrium point, which is when the gain is one. If the gain is bigger than one, multiplying by 0.95 reduces it by more than adding 0.05 increases it; if the gain is smaller than one, adding 0.05 increases it by more than multiplying by 0.95 decreases it. So with random gradients the gain will hover around one. If the gradient is consistently in the same direction, the gain can get much bigger than one, and if the gradient is consistently in opposite directions, which means we're oscillating across a ravine, it can get much smaller than one.

There are a number of tricks for making adaptive learning rates work better. It's important to limit the size of the gains; a reasonable range is 0.1 to 10, or 0.1 to 100. You don't want the gains to get huge, because then you can easily get into an instability, the gains won't die down fast enough, and you'll destroy all the weights.

The adaptive learning rates were designed for full-batch learning. You can also apply them with mini-batches, but they had better be pretty big mini-batches. That ensures that changes in the sign of the gradient aren't due to the sampling error of the mini-batches, but really due to hitting the other side of the ravine.

There's nothing to prevent you from combining adaptive learning rates with momentum. Jacobs suggests that, instead of using the agreement in sign between the current gradient and the previous gradient, you use the agreement in sign between the current gradient and the velocity for that weight, that is, the accumulated gradient.
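A rough sketch of that combination, in the same spirit as the code above: treating the velocity as a decayed running sum of gradients, and the particular constants and the [0.1, 10] clipping range, are assumptions made for illustration rather than Jacobs' exact recipe.

```python
import numpy as np

def gain_momentum_step(w, grad, velocity, gains, eps=0.01, alpha=0.9,
                       add=0.05, mult=0.95, g_min=0.1, g_max=10.0):
    """Per-weight adaptive gains combined with momentum.

    The gain for a weight is increased additively when the current gradient
    agrees in sign with that weight's velocity (the accumulated gradient) and
    decreased multiplicatively when they disagree.  Gains are clipped so they
    cannot grow without bound.
    """
    agree = grad * velocity > 0              # gradient agrees with the accumulated gradient?
    gains = np.where(agree, gains + add, gains * mult)
    gains = np.clip(gains, g_min, g_max)     # limit the size of the gains, e.g. 0.1 to 10
    velocity = alpha * velocity + grad       # accumulate the gradient (momentum)
    w = w - eps * gains * velocity           # step along the velocity, scaled per weight
    return w, velocity, gains

# Toy usage on the same quadratic error as before.
w = np.random.randn(5)
gains = np.ones_like(w)
velocity = np.zeros_like(w)
for step in range(50):
    grad = 2 * w
    w, velocity, gains = gain_momentum_step(w, grad, velocity, gains)
```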
And if you do that, you get a nice combination of the advantages of momentum and the advantages of adaptive learning rates. Adaptive learning rates only deal with axis-aligned effects, whereas momentum doesn't care about the alignment of the axes. Momentum can deal with diagonal ellipses and move quickly in that diagonal direction, which adaptive learning rates can't do.
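One way to make that contrast explicit (a formalisation added here, not stated in the lecture): the per-weight gains amount to multiplying the gradient by a diagonal matrix, which can only rescale the individual weight axes, whereas momentum's velocity accumulates the gradient itself and so can build up speed along any consistent direction, including a diagonal one.

```latex
% Adaptive per-weight gains: a diagonal rescaling of the gradient.
\Delta w = -\,\varepsilon \,\mathrm{diag}(g)\, \frac{\partial E}{\partial w}

% Momentum: the velocity accumulates the gradient and can point in any direction.
v(t) = \alpha\, v(t-1) - \varepsilon\, \frac{\partial E}{\partial w},
\qquad \Delta w = v(t)
```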