In this video, we will speak about the simple Recurrent Neural Network and how to use backpropagation to train it.

In the previous video, we spoke about the recurrent architecture. In this architecture, we work with sequences element by element. Here, the x's are the elements of the input sequence, and at each time step we have some output, or a prediction, ŷ. We can use this architecture for a lot of different tasks. For example, if we want to use it for language modeling, then the x's should be word embeddings, and the output of our model at each time step is a probability distribution over the vocabulary, which says which word is more probable to be the next one in our sequence. Of course, we can generate a word from this probability distribution and give it to our model as the input at the next time step, so we can generate text. We can also use a recurrent architecture to solve a POS tagging task. In this task, we want to assign each word in the text a part-of-speech tag. Here we have the same inputs, word embeddings, but the output is different now: at each time step, the output is a probability distribution over the set of all possible POS tags, which says which tag is more probable for the current word.

Now we see that we can use this architecture for different tasks, and here, in the middle of this architecture, we see this box MLP, which can be quite different inside. Let's use the simplest MLP that exists: an MLP with one layer of hidden neurons. Then we have a vector of hidden neurons in the hidden layer of our network, and we can use these hidden neurons both to calculate the output of our network, the predictions ŷ, and to calculate the hidden neurons at the next time step. On the slide, you can see the formulas for the forward pass in our network, and they are pretty similar to the ones you already saw in the course: again, you have some nonlinear functions f applied to linear combinations of the inputs. But here you need to remember that all the parameters are shared between time steps, so we use the same weight matrices and the same bias vectors at each time step. This type of visualization of a recurrent neural network is usually called the unfolded version.
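The formulas from the slide are not reproduced in this transcript, so here is a minimal NumPy sketch of the forward pass under one common choice of notation (V for input-to-hidden weights, W for recurrent weights, U for hidden-to-output weights, b and c for bias vectors; these names are assumptions, and the slide may use different symbols):

```python
import numpy as np

def softmax(z):
    z = z - z.max()                      # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def rnn_forward(xs, V, W, U, b, c):
    """Forward pass of a simple (Elman) RNN.

    xs: input vectors (e.g. word embeddings), one per time step.
    V, W, U, b, c: parameters, shared across all time steps.
    """
    h = np.zeros(W.shape[0])             # initial hidden state h_0
    hs, ys = [], []
    for x in xs:
        h = np.tanh(V @ x + W @ h + b)   # hidden state: current input plus previous hidden state
        y = softmax(U @ h + c)           # e.g. distribution over the vocabulary or the POS tags
        hs.append(h)
        ys.append(y)
    return hs, ys
```

The same V, W, U, b, and c are used at every iteration of the loop, which is exactly the parameter sharing mentioned above.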
There is also another version of the notation for a recurrent neural network, a more compact one. Actually, this compact visualization is more general because it doesn't depend on the length of the input sequence, while the unfolded version actually depends on it. But in the compact visualization we have this recurrent connection, which forms a loop in our scheme, and it's actually not clear how to train a network with a loop. That's why we will use the unfolded visualization when we speak about training.

Now, let's speak about training a Recurrent Neural Network. Here we see a Recurrent Neural Network in the unfolded form. To train it, we actually need some data. Let's say that we have a training dataset, so for some training sequence we know all the true labels y, and we make some predictions ŷ. To understand how wrong we are, we use some loss function L, but we use this loss function separately at each time step, and then we sum up all of these losses to get the total loss.

Now we have our usual neural network, we have data, we have a loss; what should we do? We need to backpropagate. As usual in backpropagation, we need to make a forward pass and a backward pass. In the forward pass, we need to calculate all the elements in our neural network, so we calculate the hidden units, then we make our predictions, and we calculate the loss. In the backward pass, we need to calculate the gradient of our loss function with respect to all the parameters, so with respect to the weight matrices and the bias vectors. In a usual feed-forward neural network, we backpropagate through the layers, so in the vertical direction, but here we have sequences. That's why we need to backpropagate not only through layers but also through time, in the horizontal direction. That's why the algorithm for training recurrent neural networks is usually called backpropagation through time. Additionally, we need to remember that all the parameters are shared between time steps. Therefore, if we want to calculate the gradient of our loss with respect to some parameter, we actually need to sum up the gradients from all the time steps.
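As a small sketch of the "separate loss at each time step, then sum" idea, assuming cross-entropy losses on top of the rnn_forward sketch above (representing the true labels as class indices is also an assumption):

```python
import numpy as np

def total_loss(ys_hat, ys_true):
    """Total loss of the unfolded RNN: the sum of per-time-step losses.

    ys_hat:  predicted probability distributions, one per time step.
    ys_true: indices of the correct class (e.g. the next word or the POS tag).
    """
    loss = 0.0
    for y_hat, y in zip(ys_hat, ys_true):
        loss += -np.log(y_hat[y] + 1e-12)   # cross-entropy at this time step
    return loss
```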
Let's, for example, calculate the gradient of the loss function with respect to the weight matrix U. As I already said, we need to sum up the gradients from all the time steps in our sequence, and to calculate the gradient of the loss at a specific time step t, we can simply use the chain rule. As a result, we have this product: the first factor we can calculate if we know which loss function we used, and the second factor we can also calculate very easily, because our prediction depends on the matrix U in only one place.

Now let's consider a more difficult example. Let's calculate the gradient of our loss function with respect to the recurrent weight matrix W. In the first step, we do the same thing as before: we simply sum up the gradients from all the time steps in our sequence. We would like to do the same thing as before in the next step too, but if we look at the formula for the hidden units, we see that there is not only one dependence between our weight matrix W and the hidden units at time step t, because the hidden units at the previous time step also depend on W. So here we need to go deeper and use not the simple chain rule but the formula for the total derivative. After applying this formula, we end up with the following equation for the gradient of our loss function with respect to W. In the last part of this equation, you can see that first we take into account the dependence between the hidden units at time step t and the weight matrix W. Then we take one step back and take into account the dependence between the previous hidden units and the weight matrix W, and then we take one more step back, and so on, until we reach the beginning of our sequence. As a result, the last part of this equation is the sum of the contributions from all the previous time steps to the gradient at time step t. And to calculate the contribution from time step k to the gradient at time step t, we actually need to go from the hidden units at time step t to the hidden units at time step k.
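The equation itself is not shown in the transcript; in the notation assumed above, the total-derivative expansion described here looks roughly like this (the exact form on the slide may differ):

```latex
\frac{\partial L}{\partial W} = \sum_{t} \frac{\partial L_t}{\partial W},
\qquad
\frac{\partial L_t}{\partial W}
= \sum_{k=1}^{t}
  \frac{\partial L_t}{\partial \hat{y}_t}\,
  \frac{\partial \hat{y}_t}{\partial h_t}
  \left( \prod_{i=k+1}^{t} \frac{\partial h_i}{\partial h_{i-1}} \right)
  \frac{\partial h_k}{\partial W}
```

Here the last factor is the immediate derivative of h_k with respect to W, treating h_{k-1} as a constant.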
In each element of the sum, we have a product of Jacobian matrices: the gradients of the hidden units at one time step with respect to the hidden units at the previous time step.

Now we have calculated the gradient of our loss function with respect to two weight matrices, U and W. When we calculated the gradient with respect to U, we needed to go backwards only through the layers, so in the vertical direction. When we calculated the gradient with respect to W, we needed to go backwards both in the vertical direction and in the horizontal direction, so backwards through layers and through time. Now we have the last weight matrix, V, and here is a question for you: do we really need to go backwards in time to calculate the gradient of our loss function with respect to the weight matrix V? Yes, here we have the same situation as with the recurrent weight matrix W, because the dependence between the hidden units and the weight matrix V is not only in one place: the hidden units from all the previous time steps also depend on V, so we need to go backwards in time to calculate this gradient.

Let's summarize what you've learned in this video. We now know what a simple Recurrent Neural Network is and how to train it using backpropagation through time. This backpropagation through time algorithm is actually just simple backpropagation, but with a fancy name. In the next video, we will discuss the difficulties that arise during the training of Recurrent Neural Networks.
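To make the difference between going backwards only through layers (for U) and backwards through layers and time (for W and V) concrete, here is a rough NumPy sketch of backpropagation through time for the toy RNN sketched earlier, again assuming softmax outputs trained with cross-entropy:

```python
import numpy as np

def rnn_backward(xs, hs, ys_hat, ys_true, V, W, U):
    """Backpropagation through time for the simple RNN sketched above.

    Every gradient is summed over time steps because the parameters are shared.
    dU needs no recursion through time, while dW and dV also receive a term
    carried backwards through the hidden states.
    """
    dV, dW, dU = np.zeros_like(V), np.zeros_like(W), np.zeros_like(U)
    db, dc = np.zeros(W.shape[0]), np.zeros(U.shape[0])
    dh_next = np.zeros(W.shape[0])             # gradient arriving from time step t+1

    for t in reversed(range(len(xs))):
        do = ys_hat[t].copy()
        do[ys_true[t]] -= 1.0                  # softmax + cross-entropy gradient w.r.t. the logits
        dU += np.outer(do, hs[t])              # vertical direction only
        dc += do
        dh = U.T @ do + dh_next                # gradient from the output AND from the future
        da = dh * (1.0 - hs[t] ** 2)           # backprop through tanh
        h_prev = hs[t - 1] if t > 0 else np.zeros_like(hs[t])
        dW += np.outer(da, h_prev)             # backwards through layers and time
        dV += np.outer(da, xs[t])              # also affected by the carried gradient
        db += da
        dh_next = W.T @ da                     # carry the gradient back to time step t-1
    return dV, dW, dU, db, dc
```

Note how dU is accumulated directly at each time step, while the updates for dW and dV include dh_next, the gradient carried backwards through the hidden states; that carried term is the horizontal direction discussed in the video.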