In this video, we're gonna look at the limitations of perceptrons. These limitations stem from the kinds of features you use. If you use the right features, you can do almost anything. If you use the wrong features, you're extremely limited in what the learning part of a perceptron can do. And that's what caused perceptrons to go out of favor, and it emphasizes that the difficult bit of learning is to learn the right features. There's still a lot you can do with learning, even if you do not learn features. For example, if you want to say whether a sentence is a plausible English sentence, you could hand-define a huge number of features, and then learn how to weight them in order to decide whether a particular sentence is likely to be a good English sentence. But in the long run you need to learn features.
So the reason that neural network research came to a halt in the late 1960s and early 1970s is that perceptrons were shown to be very limited, and we're now gonna understand what those limitations are. If you're allowed to choose the features by hand, and if you use enough features, you can make the perceptron do almost anything. Suppose, for example, we have binary input vectors, and we create a separate feature unit that gets activated by exactly one of those binary input vectors. We'll need exponentially many feature units. But now we can make any possible discrimination on binary input vectors. So for binary input vectors there's no limitation if you're willing to make enough feature units. But of course, that's not a very good strategy for solving a practical problem, because you need an awful lot of feature units and it won't generalize. You can't look at a subset of all possible cases and have any hope of getting the remaining cases right, because those remaining cases require new feature units and you don't know what weights to put on those feature units. Once you've decided on the hand-coded features.
That is, once they've been determined, there are very strong limitations on what a perceptron can learn to do. So here's a classic example. What we're interested in is what you can learn to do with a binary threshold decision unit, that is, by changing its weights. And we're going to show that there are very simple things that it can't learn to do. So the simplest example is a problem in which there are two positive cases and two negative cases. And the features are just single-bit features, which have values of either one or zero. So the two positive cases consist of both features being on, in which case the right answer is one, or both features being off, in which case the right answer is also one. And the two negative cases are when one feature is on and the other one is off, in which case the right answer is zero. So all we're asking the binary threshold unit to do is decide whether the two features have the same value. And it can't even learn to do that.
We can prove that algebraically. Those four input/output pairs that I showed you give rise to four inequalities, and it's impossible to satisfy them all. So take the first positive case: when the two feature values are one, the output should be one. That gives us the inequality that one times W1 plus one times W2 must be greater than the threshold, so that we give an output of one. Then the second positive case gives us the inequality that zero times W1 plus zero times W2 must also be greater than the threshold. And the negative cases give us the inequalities that one times W1 plus zero times W2 must be less than the threshold, and similarly zero times W1 plus one times W2 must be less than the threshold. Now if you take those first two inequalities and you add them up, you get that W1 plus W2 must be greater than twice the threshold. And if you take the second two inequalities and you add them up, you get that W1 plus W2 must be less than twice the threshold. So there's clearly no way to satisfy all four inequalities.
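Written out in symbols, the argument above looks as follows. The notation is an assumption made for this sketch: $w_1$ and $w_2$ stand for the spoken "W1" and "W2", and $\theta$ stands for the threshold.

```latex
\begin{align*}
\text{both on, output 1:}   &\quad 1 \cdot w_1 + 1 \cdot w_2 > \theta \\
\text{both off, output 1:}  &\quad 0 \cdot w_1 + 0 \cdot w_2 > \theta \\
\text{first on, output 0:}  &\quad 1 \cdot w_1 + 0 \cdot w_2 < \theta \\
\text{second on, output 0:} &\quad 0 \cdot w_1 + 1 \cdot w_2 < \theta
\end{align*}
% Adding the two positive cases and, separately, the two negative cases:
\begin{align*}
w_1 + w_2 &> 2\theta \\
w_1 + w_2 &< 2\theta
\end{align*}
% Both cannot hold at once, so no choice of w_1, w_2, \theta works.
```

If the unit is instead defined to output one when the total input is greater than or equal to the threshold, the first sum becomes $w_1 + w_2 \ge 2\theta$, which still contradicts $w_1 + w_2 < 2\theta$.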
Or to put it another way, if you look at the binary threshold unit, where we're going to treat the threshold as a negative weight on an input line that always has a value of one: if you take that binary threshold unit shown at the bottom right, there's no way to set the threshold and the two weights so that it gets all four cases right.
We can also see this geometrically. So we're going to imagine a data space now, in which the axes correspond to components of an input vector. So in this space each point corresponds to a data point, and a weight vector is going to define a plane in this space. So it's just the opposite of what we were doing with weight space. In weight space we made each point be a weight vector, and we used a plane to define an input case. Of course that plane was defined by an input vector. Now what we're going to do is make each point be an input vector, and we're going to use a weight vector to define a plane in the data space. The plane defined by the weight vector is going to be perpendicular to the weight vector, and it's going to miss the origin by a distance equal to the threshold (strictly, the threshold divided by the length of the weight vector).
So here's a picture. You see the four data cases there. For the two data cases in red, we need to give an output of zero, and for the two data cases in green, we need to give an output of one. That means we need the green cases to be on the side of the weight plane where the output is one, and we need the red cases to be on the side where the output is zero, and we obviously cannot arrange the weight plane so that's true. When there's no hyperplane that will separate the cases where we want the answer to be one from the cases where we want the answer to be zero, we call that a set of training cases that's not linearly separable.
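As a small illustration of both points made so far, here is a sketch that runs the standard perceptron learning rule on the same-value problem twice: once on the two raw bits, and once on the "one feature unit per binary input vector" encoding described earlier. The function names, the epoch limit, and the convention that the unit outputs one when its total input is at least zero (with the threshold folded into a bias) are assumptions of this sketch. Failing to converge within the epoch limit is only an illustration; the algebraic argument is the actual proof.

```python
from itertools import product


def threshold_unit(weights, bias, features):
    """Binary threshold unit; the bias plays the role of minus the threshold."""
    total = bias + sum(w * f for w, f in zip(weights, features))
    return 1 if total >= 0 else 0


def train_perceptron(cases, n_features, max_epochs=1000):
    """Perceptron learning rule. Returns (weights, bias, converged)."""
    weights = [0] * n_features
    bias = 0
    for _ in range(max_epochs):
        mistakes = 0
        for features, target in cases:
            error = target - threshold_unit(weights, bias, features)
            if error != 0:
                mistakes += 1
                # Add the input vector on a missed positive, subtract it on a missed negative.
                weights = [w + error * f for w, f in zip(weights, features)]
                bias += error
        if mistakes == 0:
            return weights, bias, True
    return weights, bias, False


# The four cases: the target is 1 exactly when the two bits have the same value.
inputs = list(product([0, 1], repeat=2))
targets = [1 if a == b else 0 for a, b in inputs]

# Raw features: the two bits themselves.
raw_cases = list(zip(inputs, targets))


def one_unit_per_vector(x):
    """Hand-coded features: one feature unit per possible binary input vector."""
    return tuple(1 if x == v else 0 for v in inputs)


table_cases = [(one_unit_per_vector(x), t) for x, t in zip(inputs, targets)]

_, _, raw_converged = train_perceptron(raw_cases, n_features=2)
_, _, table_converged = train_perceptron(table_cases, n_features=4)
print(raw_converged, table_converged)  # prints: False True
```

With two input bits the table encoding needs only four feature units, but the count doubles with every extra bit, which is the exponential cost mentioned earlier.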
An even more devastating example for perceptrons, because it's much more general, is when we try to discriminate simple patterns that have to retain their identity when you translate them with wrap-around. I'll give you an example of what that means in a minute. But the idea is that we want to recognize a pattern, and we want to recognize it even when it's translated. So suppose we just use pixels as the features. The question is: can a binary threshold unit discriminate between two different patterns (we'll call one the positive examples and the other the negative examples) if they've got the same number of pixels in them? And the answer is no, it can't discriminate two patterns with the same number of pixels if that discrimination has to work when the patterns are translated, and if the patterns can wrap around when they're translated.
So, look at these examples of pattern A in a one-dimensional image. Pattern A has four pixels that are on, those four black pixels. It's like a little bar code. And it's the same pattern when we translate it a bit to the right. And we're going to allow ourselves to translate the pattern so it goes off the right-hand end and comes back on the left-hand end. So the third example is the same pattern, translated with some wrap-around. And pattern B also has four on pixels, but in a different arrangement. And in the third example of pattern B, it's been translated with wrap-around, so that's still an example of pattern B. And for two sets of patterns like that, a binary threshold unit cannot learn to discriminate them.
And here's the proof. What we're going to do is consider that for the positive examples we have pattern A in all possible translations. Now since pattern A has four on pixels, that means if we look at any pixel on the retina, there'll be four different positions in which we can put pattern A that will activate that pixel.
So each pixel will be activated by four different translations of pattern A. That means that the total input received by the decision unit, over all those various translations of pattern A, will be four times the sum of all the weights in the perceptron, because each pixel will activate the decision unit four different times. And so, summed over all those patterns, we'll get four times the sum of the weights. Now consider pattern B. We're going to make the negative cases be pattern B in all possible translations. And again, each pixel will be activated by four different translations of pattern B. So the total input the decision unit receives, over all those different translations of pattern B, will again be four times the sum of all the weights. But the perceptron, in order to discriminate correctly, has to have weights so that every single case of pattern A provides more input to the decision unit than every single case of pattern B. And that's clearly impossible if, when you sum over all these cases, all those different versions of pattern A and all of those different versions of pattern B provide exactly the same amount of input to the decision unit.
So we've proved that a perceptron cannot recognize patterns under translation if we allow wrap-around. That's a particular case of Minsky and Papert's group invariance theorem. And that result is devastating for perceptrons; it was historically devastating. Because the whole point of pattern recognition is to recognize patterns that undergo transformations, and to see that they're still the same pattern despite the transformation. Like, for example, translation.
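The counting argument can be checked numerically. The sketch below uses a ten-pixel one-dimensional retina and two made-up patterns with four on pixels each; the patterns, the retina size, the random integer weights, and the function names are all hypothetical choices for illustration, not the ones on the lecture slide.

```python
import random


def translations(pattern):
    """All cyclic shifts of a 1-D binary pattern, i.e. translation with wrap-around."""
    n = len(pattern)
    return [[pattern[(i - shift) % n] for i in range(n)] for shift in range(n)]


def total_input(weights, pattern):
    """Total input a decision unit receives from one pattern."""
    return sum(w * p for w, p in zip(weights, pattern))


# Two hypothetical patterns, each with four on pixels, that are not shifts of each other.
pattern_a = [1, 1, 0, 1, 1, 0, 0, 0, 0, 0]
pattern_b = [1, 0, 1, 0, 0, 1, 0, 1, 0, 0]

# Arbitrary integer weights, so the sums below are exact.
random.seed(0)
weights = [random.randint(-9, 9) for _ in range(len(pattern_a))]

inputs_a = [total_input(weights, t) for t in translations(pattern_a)]
inputs_b = [total_input(weights, t) for t in translations(pattern_b)]

# Every pixel is switched on by exactly four translations of each pattern,
# so both totals equal four times the sum of the weights.
print(sum(inputs_a), sum(inputs_b), 4 * sum(weights))

# Equal totals over the same number of cases mean the weakest A case
# cannot beat the strongest B case, whatever the weights are.
print(min(inputs_a) <= max(inputs_b))  # prints: True
```

Changing the seed or the weight range changes the numbers but not the two equalities, because neither depends on the particular weights.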
And when Minsky and Papert showed that a perceptron couldn't do that if the transformations formed a group, that is, that the learning part of a perceptron couldn't learn to do that, it became clear that the claims that had been made for what perceptrons could learn were somewhat exaggerated, and that to get them to do anything interesting, you had to choose just the right features to make it fairly easy for the last stage to learn the classification. So translations with wrap-around form a group, and Minsky and Papert proved a general theorem that transformations that form a group make it impossible for a perceptron, that is, for the learning part of a perceptron, to do the recognition. The perceptron architecture can still do the recognition, but you have to organize the features so they do the difficult part. So we have to have multiple feature units that recognize informative sub-patterns that tell you something about what class it is, and we have to have separate feature units for each position of those informative sub-patterns if we're trying to recognize under translation. So the tricky part of pattern recognition has to be solved by the hand-coded feature detectors, not the learning procedure.
The temporary conclusion from this is that perceptrons are no good and therefore neural networks are no good. The longer-term conclusion is that neural networks are only gonna be really powerful if we can learn the feature detectors. It's not enough just to learn the weights on feature detectors; we have to learn the feature detectors themselves. And the second generation of neural networks, which we'll come to in the next lecture, was all about how you learn the feature detectors. But it took twenty years before people figured out how to do that. So, networks without hidden units are very limited in what they can learn to model. If we add more layers of linear units, it doesn't help, because everything is linear.
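The remark about extra linear layers can be made concrete: two layers of linear units compute exactly what a single layer with the product of the two weight matrices computes. The weight values and helper names below are hypothetical, chosen only to keep the arithmetic easy to follow.

```python
from itertools import product


def matvec(matrix, vector):
    """Apply a layer of linear units: one row of weights per unit."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def matmul(a, b):
    """Matrix product a @ b for lists of rows."""
    columns_of_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in columns_of_b] for row in a]


# Hypothetical weights: 3 inputs -> 2 linear units -> 3 linear units.
w1 = [[1, -2, 0],
      [3, 1, 4]]
w2 = [[2, -1],
      [0, 5],
      [1, 1]]

# A single layer whose weight matrix is the product of the two.
collapsed = matmul(w2, w1)

x = [1, 0, 1]
two_layer = matvec(w2, matvec(w1, x))
one_layer = matvec(collapsed, x)
print(two_layer, one_layer)  # prints: [-5, 35, 8] [-5, 35, 8]

# The same holds for every binary input vector of length three.
all_match = all(
    matvec(w2, matvec(w1, list(v))) == matvec(collapsed, list(v))
    for v in product([0, 1], repeat=3)
)
print(all_match)  # prints: True
```

So whatever a stack of linear layers feeds into the final threshold unit, a single set of weights on the raw inputs could have produced the same thing.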
We can make them much more powerful by putting in hand-coded hidden units, but they're not really hidden units, because we hand-coded them. We told them what to do. It's not enough just to have fixed output non-linearities. What we need is multiple layers of adaptive non-linear hidden units. And the problem is: how can we train such nets? We need a way to adapt all the weights, not just the last layer like in a perceptron, and that's hard. In particular, learning the weights going into the hidden units is equivalent to learning features. And that's the hard thing to do, because nobody is telling us directly what the hidden units should be doing, when they should be active and when they should not be active. And the real problem is: how do we figure out how to learn these weights going into the hidden units, so that the hidden units turn into the kinds of feature detectors we need for solving a problem, when nobody is telling us what the feature detectors should be?
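To make the target concrete, here is a sketch of what hidden units could look like for the same-value problem from earlier. The weights are hand-coded here, which is exactly what the lecture says is not good enough; the sketch only shows that a suitable setting of the hidden units exists, not how to learn it. The function names and the convention that a unit outputs one when its total input is at least the threshold are assumptions of this sketch.

```python
def binary_threshold(inputs, weights, threshold):
    """Output 1 when the weighted sum of the inputs reaches the threshold."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total >= threshold else 0


def same_value_net(x1, x2):
    """Two hand-coded hidden units feeding one output unit."""
    both_on = binary_threshold((x1, x2), (1, 1), 2)     # active only for (1, 1)
    both_off = binary_threshold((x1, x2), (-1, -1), 0)  # active only for (0, 0)
    # The output unit fires if either hidden feature detector is active.
    return binary_threshold((both_on, both_off), (1, 1), 1)


outputs = {(a, b): same_value_net(a, b) for a in (0, 1) for b in (0, 1)}
print(outputs)  # prints: {(0, 0): 1, (0, 1): 0, (1, 0): 0, (1, 1): 1}
```

The two hidden units act as feature detectors for "both on" and "both off". The hard part described above is finding weights like these by learning, when nothing in the training data says what the hidden units should detect.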