In this video I'm going to describe echo state networks. These use a clever trick to make it much easier to learn a recurrent neural network. They initialize the connections in the recurrent neural network in such a way that it has a big reservoir of coupled oscillators. So if you provide input to it, it converts that input into the states of these oscillators, and then you can predict the output you want from the states of these oscillators. The only thing you have to learn is how to couple the output to the oscillators. This entirely gets rid of the problem of learning hidden-to-hidden connections, or even input-to-hidden connections. However, to get these networks to be good at complicated tasks, you need a very big hidden state.

As we'll see at the end of the video, there's no reason not to use the initialization that was carefully designed for echo state networks, and then to use backpropagation through time with momentum to train the networks to be even better at the tasks they're doing.

One interesting and quite recent idea about training recurrent neural networks is to not train the hidden-to-hidden connections at all, but to just fix them randomly, and hope that you can learn sequences by just training the way they affect the outputs. This has strong similarities with old ideas about perceptrons. A very simple way to train a feedforward neural network is to make the early layers of feature detectors just be random. You put in sensibly sized random weights, and then all you learn is the last layer, so you're learning a linear model from the activities of the hidden units in the last layer to the outputs. And of course it's much faster to learn a linear model. This relies on the idea that a big, random expansion of the input vector can often make it easy for a linear model to fit the data, when it couldn't fit the data well just looking at the raw inputs. In the little neural network here, the red weights are fixed at random. They expand the input vector, and then, using that expanded representation, we try to fit a linear model. This actually has some quite strong similarities with support vector machines, which are really just a very efficient way of doing this.

So those same ideas, many years later, were recycled for recurrent neural networks.
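To make the random-expansion idea concrete, here is a minimal sketch in NumPy (not from the lecture; the sizes, scales, and target function are all illustrative): a fixed random hidden layer expands the input, and only a linear readout is fitted, here by ridge regression.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: a nonlinear target that a linear model on the raw inputs cannot fit well.
X = rng.uniform(-1, 1, size=(200, 2))
y = np.sin(3 * X[:, 0]) * np.cos(2 * X[:, 1])

# Fixed random feature layer: sensibly sized random weights that are never trained.
n_hidden = 300
W_in = rng.normal(scale=1.0, size=(2, n_hidden))
b = rng.normal(scale=0.5, size=n_hidden)
H = np.tanh(X @ W_in + b)           # big random expansion of the input vector

# Learn only the last layer: a linear model from hidden activities to the output.
lam = 1e-3                          # small ridge penalty keeps the solve well conditioned
W_out = np.linalg.solve(H.T @ H + lam * np.eye(n_hidden), H.T @ y)

print("training MSE:", np.mean((H @ W_out - y) ** 2))
```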
The idea is to make the input-to-hidden connections and the hidden-to-hidden connections have random values that are carefully chosen, and to just learn the final layer of hidden-to-output connections. The learning is then very simple if you use linear output units, and it can be done extremely fast.

This approach is only ever going to work if you set the random connections very carefully, so that the recurrent neural network doesn't die out with no activity and doesn't explode. So the way the random connections are set in an echo state network is to choose the hidden-to-hidden weights so that the length of the activity vector stays about the same after each iteration. For those of you used to linear systems and matrices, you're setting it so the spectral radius is one; that is, the biggest eigenvalue of the matrix of hidden-to-hidden weights is one. Or rather, it would be one if it were a linear system, and you want to achieve the same property in a non-linear system. If you set those weights to be about the right magnitude, then an input can echo around in the recurrent state for a long time.

It's also important to use sparse connectivity. So instead of having lots of medium-sized weights, we have a few quite large weights, and nearly all of the hidden-to-hidden weights are zero. What this does is make a lot of loosely coupled oscillators, so information can hang around in one part of the net without being propagated to other parts of the net too quickly.

It's also important to choose the scale of the input-to-hidden connections very carefully. Those connections need to drive the states of the loosely coupled oscillators, but they mustn't wipe out the information that those oscillators contain about the recent history.

Fortunately the learning is very fast in echo state networks, so we can afford to experiment with the scales of the important connections. You could think of it as a little learning loop that's just learning the scales of those connections, and it's doing it by a sort of feedback that involves the experimenter. It also helps to tune the level of sparseness that's needed in the hidden-to-hidden connections, and again, because the learning is so fast, you can afford to experiment with that. That's important, because it's often necessary to do those experiments to get the system to work well.

So I'm now going to show you a simple example, taken from the web, of an echo state network.
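Here is a minimal NumPy sketch of the kind of initialization just described (the reservoir size, sparsity level, and input scale are illustrative knobs that you would tune by the sort of experimentation the lecture mentions):

```python
import numpy as np

rng = np.random.default_rng(0)

n_hidden = 500        # reservoir size
sparsity = 0.02       # fraction of non-zero hidden-to-hidden weights
input_scale = 0.5     # scale of the input-to-hidden connections

# Sparse random hidden-to-hidden weights: a few quite large weights, most exactly zero,
# giving a reservoir of loosely coupled oscillators.
W_hh = rng.normal(size=(n_hidden, n_hidden))
W_hh *= rng.random((n_hidden, n_hidden)) < sparsity

# Rescale so the spectral radius (largest |eigenvalue|) is about one, so activity in the
# (linearized) system neither dies out nor explodes.
W_hh /= np.max(np.abs(np.linalg.eigvals(W_hh)))

# Input-to-hidden weights: strong enough to drive the oscillators, but not so strong
# that they wipe out the information the reservoir carries about the recent history.
n_in = 1
W_ih = input_scale * rng.normal(size=(n_in, n_hidden))
```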
It has an input sequence, which is a real value that varies with time and specifies the frequency of a sine wave for the output of the echo state network. So you'd like this thing to generate sine waves, and the input is going to specify the frequency. The target output sequence is a sine wave with the frequency specified by the input. And it's going to be learned simply by fitting a linear model that takes the states of the hidden units and, from those, tries to predict the correct scalar output value.

So here's a picture, taken from Scholarpedia, of an echo state network doing this task. The input signal is the desired frequency of the sine wave. The output signal after it's learned, or the teacher signal while it's learning, is a sine wave with the frequency specified by the input. The stuff in the middle is a big dynamical reservoir: the input signal drives those loosely coupled oscillators and causes complicated dynamics that go on for a long time. The output weights learn to map that complicated dynamics to the particular dynamics you want for the output. All the other pictures are showing you the actual dynamics of individual units inside the dynamical reservoir.

One thing to notice is that there are also connections from the output back to the reservoir. Those aren't always needed, but they help to tell the reservoir what has been produced so far.

So here's an example of what the system actually produces after it's learned. You can see that at the beginning it's producing a sine wave in phase; at the end, it's producing a sine wave of the right frequency, but the phase is wrong. That's because we weren't telling it what phase the sine wave should be in, so it's satisfying the requirement of producing an appropriate frequency.

There are some very good aspects of echo state networks. They can be trained very fast, because they just fit a linear model. They also demonstrate how important it is to initialize the hidden-to-hidden weights sensibly. And they can do quite impressive modeling of one-dimensional time series; that's where they excel. They can look at a time series for a while, and then predict it very well a long time into the future.
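As a rough sketch of how the readout for a task like the sine wave example might be fitted (reusing W_hh, W_ih, and n_hidden from the previous sketch; the teacher signal, washout length, and ridge penalty are illustrative, and the output-to-reservoir feedback connections mentioned above are omitted for simplicity):

```python
# Continues the previous sketch (requires numpy as np, plus W_hh, W_ih, n_hidden from above).
T = 2000
freq = 0.5 + 0.4 * np.sin(2 * np.pi * np.arange(T) / 500)   # input: slowly varying desired frequency
target = np.sin(0.05 * np.cumsum(freq))                     # teacher: sine wave at that frequency

# Drive the reservoir with the input and record the hidden states.
h = np.zeros(n_hidden)
states = np.zeros((T, n_hidden))
for t in range(T):
    h = np.tanh(freq[t] * W_ih[0] + h @ W_hh)
    states[t] = h

# Discard an initial "washout" period, then fit a linear readout by ridge regression.
washout = 200
H, y = states[washout:], target[washout:]
lam = 1e-6
W_out = np.linalg.solve(H.T @ H + lam * np.eye(n_hidden), H.T @ y)

print("readout MSE:", np.mean((H @ W_out - y) ** 2))
```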
What they're not so good at is modeling high-dimensional data, like frames of acoustic coefficients or frames of video. In order to model data like that, they need many more hidden units than a recurrent neural network where you train the hidden-to-hidden connections.

Recently, Ilya Sutskever tried something fairly obvious, which is to initialize a recurrent neural network using all the tricks developed by the people doing echo state networks. Once you've done that, you could learn quite well just by learning the hidden-to-output connections. But then, presumably, you could learn even better if you also learn to make the hidden-to-hidden weights better. So Ilya tried using the echo state network initialization, but then training with backpropagation through time. He used rmsprop with momentum, and he discovered that that is actually a very effective way to train recurrent neural networks.
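As a hedged illustration of this combination (not the actual experimental setup), here is what an echo-state-style initialization followed by full backpropagation-through-time training with RMSprop plus momentum might look like in PyTorch; the sizes, sparsity, scales, and learning rate are all assumptions:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n_in, n_hidden, n_out = 1, 200, 1

rnn = nn.RNN(n_in, n_hidden, nonlinearity='tanh', batch_first=True)
readout = nn.Linear(n_hidden, n_out)

# Echo-state-style initialization: sparse recurrent weights, spectral radius about one,
# and a modest input scale.
with torch.no_grad():
    W_hh = torch.randn(n_hidden, n_hidden)
    W_hh *= (torch.rand(n_hidden, n_hidden) < 0.02).float()
    W_hh /= torch.linalg.eigvals(W_hh).abs().max()
    rnn.weight_hh_l0.copy_(W_hh)
    rnn.weight_ih_l0.copy_(0.5 * torch.randn(n_hidden, n_in))

# Unlike a pure echo state network, every weight is now trained, using
# backpropagation through time with RMSprop + momentum.
params = list(rnn.parameters()) + list(readout.parameters())
opt = torch.optim.RMSprop(params, lr=1e-4, momentum=0.9)

def train_step(inputs, targets):       # shapes: (batch, time, 1)
    states, _ = rnn(inputs)            # unrolls the RNN; gradients flow back through time
    loss = ((readout(states) - targets) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```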