In this video, I'm going to talk about the history of backpropagation. I'll start with where it came from in the '70s and '80s, and then I'll talk a bit about why it failed in the '90s; that is, why serious machine learning research abandoned it. There was a popular view of why this happened, and we can now see that that popular view was largely wrong. The real reasons it was abandoned were that computers were too slow and data sets were too small. I'll conclude by showing you a historical document: a bet made between two machine learning researchers in 1995. It's interesting to see what people back then believed and how wrong they were.

Backpropagation was invented independently several times in the '70s and '80s. It started in the late '60s with control theorists called Bryson and Ho, who invented a linear version of backpropagation. Paul Werbos went to their lectures and realized it could be made non-linear, and in his thesis in 1974 he published what's probably the first proper version of backpropagation. Rumelhart, Williams, and I invented it in 1981 without knowing about Paul Werbos's work. But we tried it out, and it didn't work very well for the first thing we tried it on, and so we abandoned it. David Parker invented it in 1985, and so did Yann LeCun. Also in 1985, I went back and tried again the thing that Rumelhart, Williams, and I had abandoned, and discovered it worked pretty well. In 1986, we produced a paper with a really convincing example of what it could do. It was clear that backpropagation had a lot of promise for learning multiple layers of non-linear feature detectors.

But it didn't really live up to its promise, and by the late 1990s most of the serious researchers in machine learning had given up on it. For example, in David MacKay's textbook there's very little mention of it. It was still widely used by psychologists for making psychological models, and it was also quite widely used in practical applications, such as credit card fraud detection. But in machine learning, people thought it had been supplanted by support vector machines.

The popular explanation of what happened to backpropagation in the late '90s was that it couldn't make use of multiple layers of non-linear features. This wasn't true of convolutional nets, which were the exception, but in general people couldn't get feed-forward neural networks trained with backpropagation to do impressive things if they had multiple hidden layers, except for some toy examples. It also did not work well in recurrent networks or in deep auto-encoders, which we'll cover in a later lecture. Recurrent networks were perhaps where it was most exciting, and so it was there that it was most disappointing that people couldn't make it work well. Support vector machines, by contrast, worked well: they didn't require as much expertise to make them work, they produced repeatable results, and they had a much fancier theory. So that was the popular explanation of what went wrong with backpropagation.

With more historical perspective, we can see why it really failed. The computers were thousands of times too slow, and the labeled data sets were hundreds of times too small, for the regime in which backpropagation would really shine. Also, the deep networks, as well as being too small, were not sensibly initialized: the initial weights were typically too small, so when you backpropagated through many layers the gradients tended to die.
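To make the dying-gradient point concrete, here is a minimal sketch. It is not from the lecture: the depth of 20, the width of 100, the logistic units, and the weight scale of 0.01 are my own illustrative choices. It pushes an error signal backwards through a deep stack of logistic layers initialized with small random weights and prints how the gradient norm shrinks.

```python
import numpy as np

# Illustrative sizes: a deep stack of logistic layers with small initial weights.
rng = np.random.default_rng(0)
depth, width, scale = 20, 100, 0.01

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Forward pass, remembering each layer's activations.
x = rng.standard_normal(width)
weights = [scale * rng.standard_normal((width, width)) for _ in range(depth)]
activations = [x]
for W in weights:
    activations.append(sigmoid(W @ activations[-1]))

# Backward pass: start from an arbitrary error signal at the top layer and
# backpropagate it, printing the gradient norm every few layers.
grad = np.ones(width)
for layer in reversed(range(depth)):
    h = activations[layer + 1]
    grad = weights[layer].T @ (grad * h * (1.0 - h))   # chain rule through a logistic layer
    if layer % 5 == 0:
        print(f"layer {layer:2d}: gradient norm = {np.linalg.norm(grad):.2e}")
```

With these made-up numbers, the printed norm drops by several orders of magnitude every few layers, so the layers near the input receive almost no learning signal; that is the sense in which badly-initialized deep nets of that era could barely learn.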
These issues prevented backpropagation from succeeding at tasks like vision and speech, where it would eventually be a big win.

So, we need to distinguish between different kinds of machine learning task. There are ones that are more typical of the kinds of things people study in statistics, and ones that are more typical of the kinds of things people study in artificial intelligence. At the statistics end of the spectrum, you typically have low-dimensional data; a statistician thinks of 100 dimensions as high-dimensional data. At the artificial intelligence end of the spectrum, things like images or coefficients representing speech typically have many more than 100 dimensions. At the statistics end of the spectrum, there's usually a lot of noise in the data, whereas at the AI end of the spectrum, noise isn't the real problem. For statistics, there's often not that much structure in the data, and what structure there is can be captured by a fairly simple model. At the AI end of the spectrum, there's typically a huge amount of structure in the data. If you take a set of images, it's highly structured data, but the structure is too complicated to be captured by a simple model.

So in statistics, the main problem is separating true structure from noise, and not mistaking noise for structure. A Bayesian neural net can do this pretty well, but for typical non-Bayesian neural nets, it's not the kind of problem they're good at. For problems like that, it makes sense to try a support vector machine, or a method called a Gaussian process if you're doing regression, which I'll talk about briefly later. At the artificial intelligence end of the spectrum, the main problem is to find a way of representing all this complicated structure so that it can be learned. The obvious thing to do is to try to hand-design appropriate representations. But actually, it's easier to let backpropagation figure out what representations to use, by giving it multiple layers and using a lot of computational power to let it decide what the representation should be.

I now want to talk very briefly about support vector machines. I'm not going to explain how they work, but I am going to say what I think their limitations are. There are several ways in which you can view a support vector machine, and I'm going to give you two different views of them. According to the first view, support vector machines are just the reincarnation of perceptrons with a clever trick called the kernel trick. The idea is that you take the raw input and expand it into a very large layer of non-linear but non-adaptive features. So that's just like perceptrons, where you have a big layer of features that it doesn't learn. Then you only have to learn one layer of adaptive weights: the weights from the features to the decision unit. Support vector machines have a very clever way of avoiding overfitting when they learn those weights: they look for what's called a maximum-margin hyperplane in a high-dimensional space, and they can do that much more efficiently than you might have thought possible. That's why they work well.
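As a rough illustration of this first view (the lecture gives no code; scikit-learn, the two-moons toy data, and all parameter values below are my own assumptions), this sketch compares a kernel SVM with the same recipe made explicit: expand the input through a large fixed, non-adaptive, non-linear feature layer, then learn a single layer of maximum-margin weights on top of it.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.kernel_approximation import RBFSampler
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC, LinearSVC

# A toy non-linearly-separable problem (purely illustrative).
X, y = make_moons(n_samples=2000, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A kernel SVM, which implicitly works in a huge fixed feature space.
kernel_svm = SVC(kernel="rbf", gamma=1.0, C=1.0).fit(X_train, y_train)

# The same idea made explicit: a large layer of fixed (random, non-adaptive)
# non-linear features, then one learned maximum-margin linear layer on top.
features = RBFSampler(gamma=1.0, n_components=500, random_state=0)
F_train = features.fit_transform(X_train)
F_test = features.transform(X_test)
linear_on_features = LinearSVC(C=1.0).fit(F_train, y_train)

print("kernel SVM accuracy:           ", kernel_svm.score(X_test, y_test))
print("fixed features + linear layer: ", linear_on_features.score(F_test, y_test))
```

The two test accuracies typically come out close, which is the point of this view: in both cases, the only thing that gets learned is the single layer of weights sitting on top of features that are never adapted.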
The second view also sees support vector machines as a clever reincarnation of perceptrons, but it has a completely different notion of what kind of features they're using. According to this second view, each input vector in the training set is used to define one feature. I'll spell it differently to indicate it's a completely different kind of feature from the first kind. Each of these features gives a scalar value that comes from doing a global match between a test input and that particular training input; roughly speaking, it measures how similar the test input is to a particular training case. Then there's a clever way of simultaneously finding how to weight those features so as to make the right decision, and also doing feature selection, that is, deciding which of those features not to use. (There's a small numerical sketch of this view at the end of this section.)

Although these views sound extremely different from one another, they're just two alternative ways of looking at the same thing: a support vector machine. In both cases, it's using non-adaptive features and then one layer of adaptive weights, and that limits what you can do. You can't learn multiple layers of representation with a support vector machine.

This is a historical document from 1995. It was given to me by [unknown], and it's a bet between Larry Jackel, who headed the adaptive systems research group at Bell Labs, and Vladimir Vapnik, who was the leading proponent of support vector machines. Larry Jackel bet that by 2000, people would understand why big neural nets trained with backpropagation worked well on large data sets; that is, they would understand it theoretically in terms of conditions and bounds. Vapnik bet that they wouldn't, but he made a side bet that if he was the one to figure it out, he would win anyway. Vapnik, in turn, bet that by 2005 nobody would be training big neural nets like that with backpropagation. It turns out that they were both wrong. The limitation on using big neural nets with backpropagation was not that we didn't have a good theory, and not that they were essentially hopeless, but that we didn't have big enough computers or big enough data sets. It was a practical limitation, not a theoretical one.
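Returning for a moment to the second view of support vector machines, here is the small numerical sketch promised above. Nothing in it comes from the lecture: scikit-learn, the toy data, and the parameter values are my own assumptions. It checks that a trained kernel SVM's decision value is just a weighted sum of global similarity matches between a new input and the training cases it kept (the support vectors); all the other training cases have been selected out and carry a weight of exactly zero.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

# Train an RBF-kernel SVM on a toy problem (data and parameters are illustrative).
X, y = make_moons(n_samples=500, noise=0.2, random_state=1)
clf = SVC(kernel="rbf", gamma=2.0, C=1.0).fit(X, y)

# Each retained training case (support vector) defines one "similarity feature";
# the decision is a weighted sum of those similarities plus a bias.
X_new = X[:5]
similarities = rbf_kernel(X_new, clf.support_vectors_, gamma=2.0)  # global matches
by_hand = similarities @ clf.dual_coef_[0] + clf.intercept_[0]     # weighted sum

print(np.allclose(by_hand, clf.decision_function(X_new)))  # True
```

So both views describe the same machine: a set of non-adaptive features, here similarity matches against stored training cases, followed by one learned layer of weights.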