In this video, I am going to describe an approach to training recurrent neural networks that's called Long Short Term Memory. You can consider the dynamic state of a neural network to be a short term memory, and the idea is, you want to make that short term memory last for a long time. This is done by creating special modules that are designed to allow information to be gated in, and then information to be gated out when needed. In the intermediate period, the gate is closed, so the stuff that arrives in the intermediate period doesn't interfere with the remembered state.

Long short term memory has been very successful for tasks like recognizing handwriting, where it's won a number of competitions. In 1997, Hochreiter & Schmidhuber published a paper in Neural Computation that solved the problem of getting a recurrent neural network to remember things for a long time. Their recurrent nets could remember things for hundreds of time steps. They did this by designing a memory cell that used logistic and linear units with multiplicative interactions. So information gets into the memory cell whenever a logistic write gate is turned on. The rest of the recurrent network determines the state of that write gate, and when it wants information to be stored, it turns the write gate on, and whatever the current input from the rest of the net to the memory cell is gets stored in the memory cell. The information stays in the memory cell so long as its keep gate is on. So again, the rest of the system is determining the state of a logistic keep gate, and if it keeps it on, then the information will stay there. And finally, the information gets read from the memory cell so that it then goes off to the rest of the recurrent neural network and influences future states. It's read by turning on a read gate, which again is a logistic unit controlled by the rest of the neural network.

The memory cell actually stores an analog value, so we can think of it as a linear neuron that has an analog value and keeps writing that value to itself at each time step with a weight of one, so the information just stays there. That weight of one is determined by a keep gate: the rest of the system determines the state of that logistic keep gate, and if it puts it into a state of one or close to one, the information just cycles around and that value of 1.73 will stay there.
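Here is a minimal sketch of the cell update that was just described, with a single scalar memory cell and the three logistic gates. This is not code from the lecture; the function name, the argument names, and the exact way the write gate scales the incoming value are my own illustration of the idea, and a full LSTM would also learn the weights that produce the gate logits.

```python
import numpy as np

def sigmoid(x):
    """Logistic unit: its smooth derivative is what lets us backpropagate through the gates."""
    return 1.0 / (1.0 + np.exp(-x))

def memory_cell_step(prev_cell, candidate, keep_logit, write_logit, read_logit):
    """One time step of the gated memory cell (illustrative names, not the lecture's).

    prev_cell   -- the analog value the linear neuron is currently holding
    candidate   -- the value the rest of the net would like to store
    *_logit     -- pre-activations of the logistic keep / write / read gates,
                   all computed by the rest of the recurrent network
    """
    keep = sigmoid(keep_logit)    # near 1: recycle the stored value with an effective weight of one
    write = sigmoid(write_logit)  # near 1: let the candidate value into the cell
    read = sigmoid(read_logit)    # near 1: let the stored value out to the rest of the net
    cell = keep * prev_cell + write * candidate
    return cell, read * cell      # new cell contents, and what the rest of the network sees
```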
As soon as the rest of the system wants to get rid of that value, all it has to do is set the keep gate to have a value of zero, and the information will disappear. To store the information in the memory cell, the rest of the system has to turn on the write gate, and then whatever input is being provided to the memory cell from the rest of the system will get written into the memory cell. Similarly, to read the information from the memory cell, the rest of the system turns on the logistic read gate, and then the value in the memory cell comes out and affects the rest of the recurrent neural network. The point of using logistic units is that we can backpropagate through them because they have nice derivatives, and that means we can learn to use this kind of circuit over many time steps.

So I'm going to show you now a picture of what backpropagation through a memory cell looks like. First we're going to do a forward pass. At the initial time, let's suppose that the keep gate was set to zero, so we wiped out whatever information was in the memory cell before, and the write gate is set to one. So the value of 1.7 that is coming from the rest of the recurrent neural network gets written into the memory cell. And we're not going to read it at this time, so the read gate is set to zero. We then set the keep gate to one, or rather the rest of the neural network has to set the keep gate to one, and that means that the value is written back into the memory cell; it's stored. At the next time step, we're going to set the write gate to zero and the read gate to zero, so the information isn't influenced by what's going on in the rest of the net, and it doesn't influence what's going on in the rest of the net. It's insulated. Again, at the next time step, the keep gate is set to one, so the information is stored for one more time step. And then we're going to set the write gate to zero, so no information is written in, but we're now going to retrieve the information by setting the read gate to one. The value of 1.7 then comes out of the memory cell and goes off to influence the rest of the network. And if we don't need it anymore, then the keep gate can be set to zero and the information will be removed.
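Replaying that forward pass with idealized hard 0 and 1 gate values (real logistic gates only get close to these) shows the 1.7 being stored, insulated, retrieved and then discarded. The update rule is the same cell = keep * cell + write * input, output = read * cell as in the sketch above; the numbers follow the slide, everything else is my own illustration.

```python
cell = 0.42   # some stale value left in the memory cell from earlier

# Initial time: keep = 0 wipes the old contents, write = 1 stores the incoming 1.7,
# read = 0 so nothing is sent to the rest of the net yet.
cell = 0.0 * cell + 1.0 * 1.7

# Next couple of time steps: keep = 1, write = 0, read = 0.
# The value just cycles around with a weight of one, insulated from the rest of the net.
for _ in range(2):
    cell = 1.0 * cell + 0.0 * 0.0

# Retrieval: write = 0, read = 1 -- the stored value comes out unchanged.
print(1.0 * cell)    # 1.7

# Once it's no longer needed: keep = 0 and the information disappears.
cell = 0.0 * cell + 0.0 * 0.0
```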
Now, if you look at the 1.7 that comes out when we do the retrieve, and you look at the path back to the 1.7 that came in, along that path are these little triangular symbols, and next to each triangular symbol is a one. That means that the effective weight on that connection is a one. So as we go back along that path, whatever error derivative we have for the 1.7 when it's retrieved gets backpropagated to the 1.7 when it's stored. So if you'd rather have retrieved a bigger value to make the right things happen now, you can send the information back and tell it that it should have stored a bigger value. And notice that as long as the relevant gates have values of one, there's no attenuation in this backpropagated signal; it's got just the properties we want (the short sketch after this passage checks this numerically). Of course, if they're logistic gates there will be some slight attenuation, but it can be very small, and so information can travel back through hundreds of time steps.

Now, let's look at a task that a recurrent neural network with long short term memory is very good at. It's a very natural task for a recurrent neural network: reading cursive handwriting. The input is just a sequence of the x and y coordinates of the tip of the pen, plus some information about whether the pen is on the paper or not. The output is going to be a sequence of recognized characters. Graves & Schmidhuber, in 2009, showed that recurrent neural networks with long short term memory are extremely good at this task. So far as I know, they're currently the best systems there are, and I believe Canada Post is starting to use them for reading handwriting. Graves & Schmidhuber, in 2009, didn't use pen coordinates as input; they used a sequence of small images. And that means they can deal with optical input where the timing of the pen isn't known: they can look at images after they've been written and read them.
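Before the demo, here is a quick numeric check (my own, not from the lecture) of the "no attenuation" point above: the error derivative that arrives at the retrieved value gets multiplied by each gate it passes through on the way back, so the effective weight of the whole path is just the product of the gate values.

```python
def path_gain(write_gate, keep_gates, read_gate):
    """Product of the gate values along the backpropagation path from read-out to write-in."""
    gain = write_gate * read_gate
    for k in keep_gates:
        gain *= k
    return gain

print(path_gain(1.0, [1.0] * 300, 1.0))     # 1.0    -- gates exactly one: no attenuation at all
print(path_gain(0.99, [0.99] * 300, 0.99))  # ~0.05  -- saturated logistic gates: slight attenuation
print(0.9 ** 300)                           # ~2e-14 -- an ordinary recurrent weight of 0.9 would
                                            #           wipe the gradient out over the same distance
```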
So I'm now gonna show you a demonstration of Alex Graves's system working on pen coordinates. In the movie that follows, you're going to see four streams of information.

The top row shows the characters as they're recognized. The system never revises its output, so if it has to make a difficult decision, it delays it for a little bit, so that it can see a little distance into the future to help it resolve ambiguities.

The second row shows the states in a subset of the memory cells, and you should notice how they get reset when it recognizes a character.

The third row shows the actual writing, and all the net sees is the x and y coordinates of the tip of the pen, just two numbers, plus some information about whether the pen is up or down.

Finally, the fourth row shows something much more complicated. It shows the gradient backpropagated all the way to the x-y locations. So what you get to see is, for the most active character, if you backpropagate from that character and ask what would make that most active character more active, you get to see which bits of the input are affecting the probability that it's that character. So that lets you see how the decisions depend on things that happened in the past.

So here's the movie.
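As an aside on that fourth row: the quantity being visualized is the derivative of the most active character's score with respect to each input coordinate. The real system backpropagates this analytically; a crude way to approximate the same picture for any black-box sequence model is a finite-difference probe like the one below, where the function name and the probe itself are purely my own illustration of the idea.

```python
import numpy as np

def input_saliency(score_fn, pen_inputs, eps=1e-4):
    """Estimate d(most active character's score) / d(each pen coordinate).

    score_fn   -- stand-in for the trained network: maps the whole input sequence
                  to a vector of character scores
    pen_inputs -- float array of shape (time_steps, 2) holding the x, y coordinates
    """
    scores = score_fn(pen_inputs)
    target = int(np.argmax(scores))            # the most active character
    grads = np.zeros_like(pen_inputs)
    for idx in np.ndindex(pen_inputs.shape):   # bump each coordinate a little
        bumped = pen_inputs.copy()
        bumped[idx] += eps
        grads[idx] = (score_fn(bumped)[target] - scores[target]) / eps
    return grads                               # large entries = bits of input driving the decision
```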