In this video I'm going to describe echo state networks. These use a clever trick to make it much easier to learn a recurrent neural network. They initialize the connections in the recurrent neural network in such a way that it has a big reservoir of coupled oscillators. So if you provide input to it, it converts that input into the states of these oscillators, and then you can predict the output you want from the states of these oscillators. The only thing you have to learn is how to couple the output to the oscillators. This entirely gets rid of the problem of learning hidden-to-hidden connections, or even input-to-hidden connections. However, to get these networks to be good at complicated tasks, you need a very big hidden state.

As we'll see at the end of the video, there's no reason not to use the initialization that was carefully designed for echo state networks, and then to use backpropagation through time with momentum to train the networks to be even better at the tasks they're doing.

One interesting and quite recent idea about training recurrent neural networks is to not train the hidden-to-hidden connections at all, but to just fix them randomly, and hope that you can learn sequences by just training the way they affect the outputs. This has strong similarities with old ideas about perceptrons. A very simple way to train a feedforward neural network is to make the early layers of feature detectors just be random. You put in sensibly sized random weights, and then all you learn is the last layer, so you're learning a linear model from the activities of the hidden units in the last layer to the outputs. And of course it's much faster to learn a linear model. This relies on the idea that a big, random expansion of the input vector can often make it easy for a linear model to fit the data, when it couldn't fit the data well just looking at the raw inputs. In the little neural network here, the red weights are fixed at random. They expand the input vector, and then, using that expanded representation, we try to fit a linear model. This actually has some quite strong similarities with support vector machines, which are really just a very efficient way of doing this.

So those same ideas, many years later, were recycled for recurrent neural networks.
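To make the random-expansion idea concrete, here is a minimal sketch in NumPy (not from the lecture; the sizes, scales, and target function are all illustrative): a fixed random hidden layer expands the input, and only a linear readout is fitted, here by ridge regression.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: a nonlinear target that a linear model on the raw inputs cannot fit well.
X = rng.uniform(-1, 1, size=(200, 2))
y = np.sin(3 * X[:, 0]) * np.cos(2 * X[:, 1])

# Fixed random feature layer: sensibly sized random weights that are never trained.
n_hidden = 300
W_in = rng.normal(scale=1.0, size=(2, n_hidden))
b = rng.normal(scale=0.5, size=n_hidden)
H = np.tanh(X @ W_in + b)           # big random expansion of the input vector

# Learn only the last layer: a linear model from hidden activities to the output.
lam = 1e-3                          # small ridge penalty keeps the solve well conditioned
W_out = np.linalg.solve(H.T @ H + lam * np.eye(n_hidden), H.T @ y)

print("training MSE:", np.mean((H @ W_out - y) ** 2))
```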
The idea is to make the input-to-hidden connections and the hidden-to-hidden connections have random values that are carefully chosen, and to just learn the final layer of hidden-to-output connections. The learning is then very simple if you use linear output units, and it can be done extremely fast.

This approach is only ever going to work if you set the random connections very carefully, so that the recurrent neural network doesn't die out with no activity and doesn't explode. So the way the random connections are set in an echo state network is to choose the hidden-to-hidden weights so that the length of the activity vector stays about the same after each iteration. For those of you used to linear systems and matrices, you're setting it so the spectral radius is one; that is, the biggest eigenvalue of the matrix of hidden-to-hidden weights is one. Or rather, it would be one if it were a linear system, and you want to achieve the same property in a non-linear system. If you set those weights to be about the right magnitude, then an input can echo around in the recurrent state for a long time.

It's also important to use sparse connectivity. So instead of having lots of medium-sized weights, we have a few quite large weights, and nearly all of the hidden-to-hidden weights are zero. What this does is make a lot of loosely coupled oscillators, so information can hang around in one part of the net without being propagated to other parts of the net too quickly.

It's also important to choose the scale of the input-to-hidden connections very carefully. Those connections need to drive the states of the loosely coupled oscillators, but they mustn't wipe out the information that those oscillators contain about the recent history.

Fortunately the learning is very fast in echo state networks, so we can afford to experiment with the scales of the important connections. You could think of it as a little learning loop that's just learning the scales of those connections, and it's doing it by a sort of feedback that involves the experimenter. It also helps to tune the level of sparseness that's needed in the hidden-to-hidden connections, and again, because the learning is so fast, you can afford to experiment with that. That's important, because it's often necessary to do those experiments to get the system to work well.

So I'm now going to show you a simple example, taken from the web, of an echo state network.
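Here is a minimal NumPy sketch of the kind of initialization just described (the reservoir size, sparsity level, and input scale are illustrative knobs that you would tune by the sort of experimentation the lecture mentions):

```python
import numpy as np

rng = np.random.default_rng(0)

n_hidden = 500        # reservoir size
sparsity = 0.02       # fraction of non-zero hidden-to-hidden weights
input_scale = 0.5     # scale of the input-to-hidden connections

# Sparse random hidden-to-hidden weights: a few quite large weights, most exactly zero,
# giving a reservoir of loosely coupled oscillators.
W_hh = rng.normal(size=(n_hidden, n_hidden))
W_hh *= rng.random((n_hidden, n_hidden)) < sparsity

# Rescale so the spectral radius (largest |eigenvalue|) is about one, so activity in the
# (linearized) system neither dies out nor explodes.
W_hh /= np.max(np.abs(np.linalg.eigvals(W_hh)))

# Input-to-hidden weights: strong enough to drive the oscillators, but not so strong
# that they wipe out the information the reservoir carries about the recent history.
n_in = 1
W_ih = input_scale * rng.normal(size=(n_in, n_hidden))
```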
It has an input sequence, which is a real value that varies with time and specifies the frequency of a sine wave for the output of the echo state network. So you'd like this thing to generate sine waves, and the input is going to specify the frequency. The target output sequence is a sine wave with the frequency specified by the input. And it's going to be learned simply by fitting a linear model that takes the states of the hidden units and, from those, tries to predict the correct scalar output value.

So here's a picture, taken from Scholarpedia, of an echo state network doing this task. The input signal is the desired frequency of the sine wave. The output signal after it's learned, or the teacher signal while it's learning, is a sine wave with the frequency specified by the input. The stuff in the middle is a big dynamical reservoir: the input signal drives those loosely coupled oscillators and causes complicated dynamics that go on for a long time. The output weights learn to map that complicated dynamics to the particular dynamics you want for the output. All the other pictures are showing you the actual dynamics of individual units inside the dynamical reservoir.

One thing to notice is that there are also connections from the output back to the reservoir. Those aren't always needed, but they help to tell the reservoir what has been produced so far.

So here's an example of what the system actually produces after it's learned. You can see that at the beginning it's producing a sine wave in phase; at the end, it's producing a sine wave of the right frequency, but the phase is wrong. That's because we weren't telling it what phase the sine wave should be in, so it's satisfying the requirement of producing an appropriate frequency.

There are some very good aspects of echo state networks. They can be trained very fast, because they just fit a linear model. They also demonstrate how important it is to initialize the hidden-to-hidden weights sensibly. And they can do quite impressive modeling of one-dimensional time series; that's where they excel. They can look at a time series for a while, and then predict it very well a long time into the future.
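As a rough sketch of how the readout for a task like the sine wave example might be fitted (reusing W_hh, W_ih, and n_hidden from the previous sketch; the teacher signal, washout length, and ridge penalty are illustrative, and the output-to-reservoir feedback connections mentioned above are omitted for simplicity):

```python
# Continues the previous sketch (requires numpy as np, plus W_hh, W_ih, n_hidden from above).
T = 2000
freq = 0.5 + 0.4 * np.sin(2 * np.pi * np.arange(T) / 500)   # input: slowly varying desired frequency
target = np.sin(0.05 * np.cumsum(freq))                     # teacher: sine wave at that frequency

# Drive the reservoir with the input and record the hidden states.
h = np.zeros(n_hidden)
states = np.zeros((T, n_hidden))
for t in range(T):
    h = np.tanh(freq[t] * W_ih[0] + h @ W_hh)
    states[t] = h

# Discard an initial "washout" period, then fit a linear readout by ridge regression.
washout = 200
H, y = states[washout:], target[washout:]
lam = 1e-6
W_out = np.linalg.solve(H.T @ H + lam * np.eye(n_hidden), H.T @ y)

print("readout MSE:", np.mean((H @ W_out - y) ** 2))
```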
What they're not so good at is modeling high-dimensional data, like frames of acoustic coefficients or frames of video. In order to model data like that, they need many more hidden units than a recurrent neural network where you train the hidden-to-hidden connections.

Recently, Ilya Sutskever tried something fairly obvious, which is to initialize a recurrent neural network using all the tricks developed by the people doing echo state networks. Once you've done that, you could learn quite well just by learning the hidden-to-output connections. But then, presumably, you could learn even better if you also learn to make the hidden-to-hidden weights better. So Ilya tried using the echo state network initialization, but then training with backpropagation through time. He used rmsprop with momentum, and he discovered that that is actually a very effective way to train recurrent neural networks.
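As a hedged illustration of this combination (not the actual experimental setup), here is what an echo-state-style initialization followed by full backpropagation-through-time training with RMSprop plus momentum might look like in PyTorch; the sizes, sparsity, scales, and learning rate are all assumptions:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n_in, n_hidden, n_out = 1, 200, 1

rnn = nn.RNN(n_in, n_hidden, nonlinearity='tanh', batch_first=True)
readout = nn.Linear(n_hidden, n_out)

# Echo-state-style initialization: sparse recurrent weights, spectral radius about one,
# and a modest input scale.
with torch.no_grad():
    W_hh = torch.randn(n_hidden, n_hidden)
    W_hh *= (torch.rand(n_hidden, n_hidden) < 0.02).float()
    W_hh /= torch.linalg.eigvals(W_hh).abs().max()
    rnn.weight_hh_l0.copy_(W_hh)
    rnn.weight_ih_l0.copy_(0.5 * torch.randn(n_hidden, n_in))

# Unlike a pure echo state network, every weight is now trained, using
# backpropagation through time with RMSprop + momentum.
params = list(rnn.parameters()) + list(readout.parameters())
opt = torch.optim.RMSprop(params, lr=1e-4, momentum=0.9)

def train_step(inputs, targets):       # shapes: (batch, time, 1)
    states, _ = rnn(inputs)            # unrolls the RNN; gradients flow back through time
    loss = ((readout(states) - targets) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```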