In this video, we will learn how LSTM and GRU architectures work and understand why they are built the way they are.

Up to this point, we have been talking mostly about the simple recurrent neural network. In such a network, the dependence between the hidden units at neighboring time steps is given by a very simple formula: we just compute a non-linear function of a linear combination of the inputs. But, as you may remember from the beginning of the week, we can actually use something more sophisticated to compute the next hidden units from the previous ones, for example an MLP with more than one hidden layer, where a layer is a primitive function. That is what we'll do now: we'll construct a more effective primitive function for computing the hidden units. You might also think about this as the construction of a new type of recurrent layer.

Let's start from the function used in the simple recurrent neural network and construct the function for the LSTM step by step. On the slide, you can see the diagram of a simple recurrent layer. Here, the non-linearity and the summation are shown in circles, multiplications by the weight matrices are shown on the corresponding edges, and bias vectors are dropped from the picture for simplicity. When we do backpropagation through such a layer, gradients need to go through the non-linearity and through the multiplication by the recurrent weight matrix W. Both of these things can cause the vanishing gradient problem. The main idea here is to create a short way for the gradients, without any non-linearities or multiplications on it. The authors of the LSTM propose to do this by adding a new, separate way through the recurrent layer. So the LSTM layer has its own internal memory C, which other layers of the network don't have access to. In the LSTM layer, at each time step, we compute not only the vector of hidden units h but also a vector of memory cells c of the same dimension. As a result, we have two ways through such a layer: one between the hidden units h_{t-1} and h_t, and the second one between the memory cells c_{t-1} and c_t.
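Before going inside the LSTM, here is a minimal NumPy sketch of the simple recurrent layer we are starting from, as a reference point; the function name and the toy sizes are only illustrative, not from the lecture slides.

```python
import numpy as np

def simple_rnn_step(x_t, h_prev, V, W, b):
    """One step of a simple recurrent layer: h_t = tanh(V @ x_t + W @ h_prev + b).
    Backprop through this step goes through both the tanh and the repeated
    multiplication by W, which is what can make gradients vanish."""
    return np.tanh(V @ x_t + W @ h_prev + b)

# toy usage: n = 4 inputs, m = 3 hidden units
n, m = 4, 3
rng = np.random.default_rng(0)
V, W, b = rng.normal(size=(m, n)), rng.normal(size=(m, m)), np.zeros(m)
h = np.zeros(m)
for x in rng.normal(size=(5, n)):   # a sequence of 5 input vectors
    h = simple_rnn_step(x, h, V, W, b)
print(h.shape)  # (3,)
```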
Now, let's understand what is going on inside the LSTM layer. In the simple recurrent neural network, we compute the hidden units as some non-linear function of a linear combination of the inputs. Here, we do the same, but we do not return this value as the output of the layer: we want to use the internal memory too. To work with the memory, the LSTM needs at least two controllers: an input gate and an output gate. These gates are vectors of the same dimension as the hidden units. To compute them, we use a similar formula: a non-linearity over a linear combination. We can even rewrite the formulas for the information vector g and the gates as one vector formula, in which the weight matrices V and W and the bias vector b are concatenations of the corresponding parameters.

Suppose there are n inputs x and m hidden units h in the network. Then what sizes do the matrices V and W and the vector b have? Yep: the matrix V is 3m by n, the matrix W is 3m by m, and the vector b contains 3m elements. For the gates, we use only the sigmoid non-linearity. This is important because we want the elements of the gates to take values from zero to one. In this case, the value one can be interpreted as an open gate and the value zero as a closed gate. If we multiply some information by a gate vector, we either get the same information, or zero, or something in between.

As you've probably already guessed, the input gate controls what to store in the memory: the information vector g is multiplied by the input gate and then added to the memory. Multiplication here is element-wise. The output gate controls what to read from the memory and return to the outer world: the memory cells c are multiplied by the output gate and then returned as the new hidden units.
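As a rough illustration of this two-gate version (before the forget gate that is introduced next), here is a NumPy sketch of one step; the stacked matrix shapes match the 3m-by-n and 3m-by-m sizes mentioned above, and all names and toy sizes are made up for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step_no_forget(x_t, h_prev, c_prev, V, W, b, m):
    """One step of the two-gate LSTM described so far.
    V has shape (3m, n), W has shape (3m, m), b has 3m elements:
    one block of rows for g, one for the input gate i, one for the output gate o."""
    z = V @ x_t + W @ h_prev + b      # all linear combinations at once
    g = np.tanh(z[:m])                # information vector
    i = sigmoid(z[m:2*m])             # input gate: what to write
    o = sigmoid(z[2*m:])              # output gate: what to read
    c_t = c_prev + i * g              # add gated information to the memory
    h_t = o * c_t                     # read gated memory as the new hidden units
    return h_t, c_t

# toy usage with n = 4 inputs and m = 3 hidden units
n, m = 4, 3
rng = np.random.default_rng(0)
V, W, b = rng.normal(size=(3*m, n)), rng.normal(size=(3*m, m)), np.zeros(3*m)
h, c = np.zeros(m), np.zeros(m)
h, c = lstm_step_no_forget(rng.normal(size=n), h, c, V, W, b, m)
```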
Now we see that the LSTM has a pretty nice interpretation, with an internal memory and controllers for it, but how does this help with the vanishing gradient problem? As I already mentioned, there is now more than one way through the recurrent layer, and the important thing is that we have at least one short way for the information and for the gradients, between the memory cells c_{t-1} and c_t. There is no non-linearity or multiplication on this way, so if we calculate the Jacobian of c_t with respect to c_{t-1}, we see that it is equal to the identity. So there is no vanishing gradient problem anymore.

Is this an ideal architecture? Unfortunately, it's not. We can only write something new into the memory at each time step; we can't erase anything. What if we want to work with really long sequences? The memory cells c have finite capacity, so their values will become a mess after a lot of time steps. We need to be able to erase information from the memory sometimes. One more gate, called the forget gate, will help us with this. We compute it in the same manner as the previous two gates and apply it to the incoming memory cells before doing anything else with them. If the forget gate is closed, we erase the information from the memory. This version of the LSTM with three gates is the most standard one nowadays.

The forget gate is very important in tasks with long sequences, but because of it, we now have a multiplication on the short way through the LSTM layer. If we compute the Jacobian of c_t with respect to c_{t-1}, it is now equal to f_t, not one, and the forget gate f_t takes values from zero to one. So it is usually less than one and may cause the vanishing gradient problem. To deal with this, a proper initialization can be used. If the bias of the forget gate is initialized with high positive numbers, for example five, then the forget gate during the first iterations of training is almost equal to one. At the beginning, the LSTM doesn't forget anything and therefore can find long-range dependencies in the data. Later, it learns to forget when it is necessary.
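The following NumPy sketch shows the three-gate version together with the forget-gate bias trick. It follows the lecture's formulation (the output gate is applied directly to c_t, without the extra tanh on the memory that some common implementations add), and the concrete numbers are only illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, V, W, b, m):
    """Three-gate LSTM step. With the forget gate, V is (4m, n), W is (4m, m),
    and b has 4m elements."""
    z = V @ x_t + W @ h_prev + b
    g = np.tanh(z[:m])          # information vector
    i = sigmoid(z[m:2*m])       # input gate
    o = sigmoid(z[2*m:3*m])     # output gate
    f = sigmoid(z[3*m:])        # forget gate
    c_t = f * c_prev + i * g    # the Jacobian of c_t w.r.t. c_{t-1} is now diag(f), not the identity
    h_t = o * c_t
    return h_t, c_t

# initialization trick: a large positive forget-gate bias keeps f close to one early on
n, m = 4, 3
rng = np.random.default_rng(0)
V, W = rng.normal(size=(4*m, n)) * 0.1, rng.normal(size=(4*m, m)) * 0.1
b = np.zeros(4*m)
b[3*m:] = 5.0      # sigmoid(5) is about 0.993, so the memory is preserved at first
h, c = np.zeros(m), np.zeros(m)
h, c = lstm_step(rng.normal(size=n), h, c, V, W, b, m)
```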
To get some intuition about how the LSTM may behave in practice, let's look at the different extreme regimes in which it can work. On the slide, you can see a slightly different picture of the LSTM layer. Here, the internal memory C is pictured inside the layer as a yellow circle and the gates are represented with green circles. The regime depends on the state of the gates: each one is either open or closed. If only the input and forget gates are open, then the LSTM reads all the information from the inputs and stores it. In the opposite situation, when only the forget and output gates are open, the LSTM carries the information through time and releases it to the next layer. If both the input and output gates are closed, the LSTM either erases all the information or stores it, but in both cases nothing is passed to the outer world, so it doesn't read or write anything.

Now, I have a question for you: which combination of gate values makes the LSTM very similar to the simple recurrent neural network? Yeah, it is when the input and output gates are open and the forget gate is closed. In this case, the LSTM reads everything into the memory and returns the whole memory as the output, so the memory and the hidden units are essentially the same entities. Of course, all of these cases are just extreme cases; in practice, the model's behavior is somewhere in between. Because of all these different regimes, the LSTM can work with the information more accurately. For example, when the simple recurrent neural network reads something from the data, it outputs this information at each time step and gradually forgets it over time. In the same situation, the LSTM can carry the information for a much longer time and output it only at particular time steps.

The LSTM has a lot of advantages compared with the simple recurrent neural network but, at the same time, it has four times more parameters, because each gate and the information vector g has its own set of parameters V, W, and b. This makes the LSTM less efficient in terms of memory and time, and it also makes the GRU architecture more attractive. The GRU architecture is the LSTM's strongest rival. It has only three times more parameters compared to the simple recurrent neural network, and in terms of quality it works pretty much the same as the LSTM.
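As a quick back-of-the-envelope check of the "four times" and "three times" claims, here is a small, hypothetical helper that counts parameters per recurrent layer; it assumes one V, W, b block per gate plus one for the information vector g, which matches the sizes discussed above.

```python
def rnn_param_counts(n, m):
    """Rough parameter counts for one recurrent layer with n inputs and m hidden units:
    each block contributes an m-by-n matrix V, an m-by-m matrix W, and an m-vector b."""
    per_block = m * n + m * m + m
    return {
        "simple RNN": 1 * per_block,   # just the information vector
        "GRU":        3 * per_block,   # g + reset gate + update gate
        "LSTM":       4 * per_block,   # g + input, output, and forget gates
    }

print(rnn_param_counts(n=100, m=128))
```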
The GRU layer doesn't contain an additional internal memory, but it does contain two gates, which are called the reset and update gates. These gates are computed in the same manner as the ones in the LSTM, so they are equal to the sigmoid function of a linear combination of the inputs. As a result, they can take values from zero to one. On the slide, the diagram for computing the gates is pictured as a separate step for simplicity of the picture.

The reset gate controls which part of the hidden units from the previous time step we use as an input to the information vector g. It acts quite similarly to the input gate in the LSTM. The update gate controls the balance between keeping the previous values of the hidden units and writing the new information into the hidden units, so it works as a combination of the input and forget gates from the LSTM. The situation with the vanishing gradient problem in the GRU is very similar to the one in the LSTM. Here, we have a short way through the layer with only one multiplication, by the update gate, on it. This short way is actually an identity skip connection from h_{t-1} to h_t which is additionally controlled by the update gate. Which initialization trick should be used to make the GRU stable against vanishing gradients? We should initialize the bias vector of the update gate with some high positive numbers. Then, at the beginning of training, the gradients go through this multiplication easily and the network is capable of finding long-range dependencies in the data. But you shouldn't use numbers that are too high, since if the update gate is fully open, the GRU layer doesn't pay much attention to the inputs x.
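Here is a minimal NumPy sketch of one GRU step in the convention described here, where an open update gate keeps the previous hidden state through the identity skip connection (standard libraries may define the update gate with the opposite sign); all names, sizes, and initial values are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, Vg, Wg, bg, Vr, Wr, br, Vu, Wu, bu):
    """One GRU step: an open update gate u keeps the previous hidden state,
    a closed one writes the new information vector g instead."""
    r = sigmoid(Vr @ x_t + Wr @ h_prev + br)        # reset gate: which parts of h_prev feed g
    u = sigmoid(Vu @ x_t + Wu @ h_prev + bu)        # update gate: keep old vs. write new
    g = np.tanh(Vg @ x_t + Wg @ (r * h_prev) + bg)  # candidate information vector
    return u * h_prev + (1.0 - u) * g

n, m = 4, 3
rng = np.random.default_rng(0)
params = [rng.normal(size=s) * 0.1 for s in [(m, n), (m, m), m] * 3]
# initialization trick: a positive update-gate bias keeps the identity path open early on
params[-1] = np.full(m, 2.0)
h = np.zeros(m)
h = gru_step(rng.normal(size=n), h, *params)
```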
We have discussed two architectures, LSTM and GRU. How do we decide which of them to use in a particular task? There is no obvious answer here that one of them is better than the other, but there is a rule of thumb. First, train the LSTM, since it has more parameters and can be a little bit more flexible; then train the GRU, and if it works the same or the quality difference is negligible, use the GRU. Otherwise, return to the LSTM. Also, you can use a multi-layer recurrent neural network, so you can stack several recurrent layers as shown in the picture. In this case, for the last layer it's better to use the LSTM, since the GRU doesn't have a proper analogue of the output gate and cannot work with the output as accurately as the LSTM. For all the other layers, you can use either LSTM or GRU; it doesn't matter that much.

Let's summarize what we have learnt in this video. We discussed two gated recurrent architectures: LSTM and GRU. These models are more resistant to the vanishing gradient problem because there is an additional short way for the gradients through them. In the next video, we will discuss how to use recurrent neural networks to solve different practical tasks.