1 00:00:00,000 --> 00:00:05,062 In this video, we're gonna look at the limitations of perceptrons. 2 00:00:06,082 --> 00:00:10,098 These limitations stem from the kinds of features you use. 3 00:00:10,098 --> 00:00:15,021 If you use the right features, you could do almost anything. 4 00:00:15,021 --> 00:00:20,074 If you use the wrong features, they're extremely limited in what the learning 5 00:00:20,074 --> 00:00:25,084 part purpose that font can do. And that's what cause perceptrons to guard 6 00:00:25,084 --> 00:00:31,044 favor, and it emphasizes that the difficult bit of learning is to learn the 7 00:00:31,044 --> 00:00:35,021 right features. There's still a lot you can do with 8 00:00:35,021 --> 00:00:37,098 learning, even if you do not learn features. 9 00:00:37,098 --> 00:00:43,025 For example, if you want to say whether a sentence is a plausible English sentence, 10 00:00:43,025 --> 00:00:48,040 you could hand define a huge number of features, and then learn how to write them 11 00:00:48,040 --> 00:00:53,003 in order to decide whether particular sentence is likely a good English 12 00:00:53,003 --> 00:00:56,006 sentence. But, in the long run you need to learn 13 00:00:56,006 --> 00:01:00,079 features. So the reason that neural network research 14 00:01:00,079 --> 00:01:07,053 came to a halt in the late 1960s and early 1970s is that perceptrons were shown to be 15 00:01:07,053 --> 00:01:13,009 very limited, and we're now gonna understand what those limitations are. 16 00:01:13,064 --> 00:01:19,021 If you'd like to choose the features by hand, and if you use enough features, you 17 00:01:19,021 --> 00:01:22,028 can make the perceptron do almost anything. 18 00:01:23,031 --> 00:01:26,069 Suppose for example we have binary input vectors. 19 00:01:26,069 --> 00:01:32,034 And we create a separate feature unit that gets activated by exactly one of those 20 00:01:32,034 --> 00:01:36,041 binary input vectors. We'll need exponentially many feature 21 00:01:36,041 --> 00:01:39,010 units. But now we can make any possible 22 00:01:39,010 --> 00:01:44,048 discrimination on binary input vectors. So for binary input vectors there's no 23 00:01:44,048 --> 00:01:48,048 limitation if you're willing to make enough feature units. 24 00:01:48,078 --> 00:01:53,091 But of course, that's not a very good strategy for solving a practical problem 25 00:01:53,091 --> 00:01:58,058 because you need an awful lot of feature units and it won't generalize. 26 00:01:58,058 --> 00:02:03,098 You can't look at a subset of all possible cases and have any hope of getting the 27 00:02:03,098 --> 00:02:08,484 remaining cases right because those remaining cases require new feature units 28 00:02:08,484 --> 00:02:12,093 and you don't know what weights to put on those feature units. 29 00:02:12,093 --> 00:02:15,083 Once you've decided the hand coded features. 30 00:02:15,083 --> 00:02:21,042 That is once they've been determined, there are very strong limitations on what 31 00:02:21,042 --> 00:02:26,096 a perceptron can learn to do. So here's a classic example. 32 00:02:28,034 --> 00:02:33,093 What we're interested in is what can you learn to do with the binary threshold 33 00:02:33,093 --> 00:02:37,019 decision unit that is by changing its weights. 34 00:02:38,012 --> 00:02:43,089 And we're going to show that there's very simple things that it can learn to do. 35 00:02:43,089 --> 00:02:49,045 So the simplest example is consider a problem in which there's two positive 36 00:02:49,045 --> 00:02:53,079 cases and two negative cases. And the features, just single bit 37 00:02:53,079 --> 00:02:56,031 features, that have values either one or zero. 38 00:02:56,063 --> 00:03:00,055 So the two positive cases consist of both features being on. 39 00:03:00,055 --> 00:03:04,073 In which case the right answer's one. Or both features being off. 40 00:03:04,073 --> 00:03:09,076 In which case the right answer's one. And the two negative cases are when one 41 00:03:09,076 --> 00:03:14,065 feature's on and the other one's off. In which case the right answer is zero. 42 00:03:14,065 --> 00:03:19,075 So all we're asking the binary threshold unit to do is decide whether the two 43 00:03:19,075 --> 00:03:24,006 features have the same value. And they can't even learn to do that. 44 00:03:25,030 --> 00:03:31,017 We can prove that algebraically. Those four input/output pairs that I 45 00:03:31,017 --> 00:03:36,052 showed you give rise to four inequalities, and it's impossible to satisfy them. 46 00:03:36,081 --> 00:03:42,098 So the first positive case, when the two feature values are one the output should 47 00:03:42,098 --> 00:03:46,040 be one. That gives us the inequality that: one 48 00:03:46,040 --> 00:03:51,042 times W1 plus one times W2 is gonna be greater than the threshold. 49 00:03:51,042 --> 00:03:56,090 So we give an output a one. Then the second positive case gives us the 50 00:03:56,090 --> 00:04:02,091 inequality that zero times W1 plus zero times W2, must also be greater than the 51 00:04:02,091 --> 00:04:06,095 threshold. And the negative cases give us the 52 00:04:06,095 --> 00:04:13,015 inequalites that one times W1 plus zero times W2, must be less than the threshold, 53 00:04:13,015 --> 00:04:19,012 and similarly zero times W1 plus one times W2, must be less than the threshold. 54 00:04:19,074 --> 00:04:24,076 Now if you take those first two inequalities and you add them up, you get 55 00:04:24,076 --> 00:04:28,062 the W1 plus W2 must be greater than twice the threshold. 56 00:04:28,062 --> 00:04:33,058 And if you take the second two inequalities and you add them up, you get 57 00:04:33,058 --> 00:04:36,096 W1 plus W2 must be less than twice the threshold. 58 00:04:36,096 --> 00:04:41,002 So there's clearly no way to satisfy all four inequalities. 59 00:04:41,002 --> 00:04:45,914 Or to put it another way, if you look at the binary decision unit where we're going 60 00:04:45,914 --> 00:04:52,068 to put the threshold as a negative weight on an input line that always has value of 61 00:04:52,068 --> 00:04:55,073 one. If you take that binary threshold unit 62 00:04:55,073 --> 00:05:01,049 shown at the bottom right, there's no way to set the threshold in the two weights, 63 00:05:01,049 --> 00:05:07,042 so it gets all four cases right. We can also see this geometrically. 64 00:05:08,083 --> 00:05:15,010 So we're going to imagine a data space now, in which the axis correspond to 65 00:05:15,010 --> 00:05:23,075 components of an input vector. So in this space each point corresponds to 66 00:05:23,075 --> 00:05:29,037 a data point. And, a weight vector is going to find a 67 00:05:29,037 --> 00:05:33,069 plane in this space. So it's just the opposite of what we're 68 00:05:33,069 --> 00:05:37,546 doing with weight space. In weight space we made each point be a 69 00:05:37,546 --> 00:05:42,002 weight vector, and we used a plane, to define an input case. 70 00:05:42,002 --> 00:05:45,076 Of course that plane was defined by an input vector. 71 00:05:45,076 --> 00:05:51,080 Now what we're going to do is we're going to make each point be an input vector and 72 00:05:51,080 --> 00:05:56,075 we're going to use a wait vector to define a plane in the data space. 73 00:05:58,018 --> 00:06:02,082 The plane defined by the weight vector is going to be perpendicular to the weight 74 00:06:02,082 --> 00:06:07,028 vector and it's going to miss the origin by a distance equal to the threshold. 75 00:06:07,028 --> 00:06:13,072 So here's a picture. You see the four data cases there, and for 76 00:06:13,072 --> 00:06:19,008 the two data cases in red, we need to give an output of zero. 77 00:06:19,008 --> 00:06:25,024 And with the two data cases in green, we need to put an output of one. 78 00:06:25,054 --> 00:06:30,069 That me, means we need the green cases to be on the side of the weight plane where 79 00:06:30,069 --> 00:06:34,894 the output is one and we need the red cases to be on the side where the output 80 00:06:34,894 --> 00:06:40,043 is zero, and we obviously cannot arrange the weight plane so that's true. 81 00:06:41,048 --> 00:06:47,073 We call a set of cases like that, where there's no hyperplane that will separate 82 00:06:47,073 --> 00:06:53,083 the cases where we want the answer to be one, from the cases where we want the 83 00:06:53,083 --> 00:06:58,013 answer to be zero. We call that a set of training cases 84 00:06:58,013 --> 00:07:05,038 that's not linearly separable. And even more devastating example for 85 00:07:05,038 --> 00:07:12,071 perceptrons because it's much more general is when we try and discriminate simple 86 00:07:12,071 --> 00:07:18,787 patterns that have to retain the identity when you translate them with wrap-around. 87 00:07:18,787 --> 00:07:23,490 I'll give you an example of what that means in a minute. 88 00:07:23,490 --> 00:07:30,349 But the idea is that we want to recognize a pattern and we want to recognize it even 89 00:07:30,349 --> 00:07:35,083 when it's translated. So suppose we just use pixels as the 90 00:07:35,083 --> 00:07:39,034 features. The question is can a binary threshold 91 00:07:39,034 --> 00:07:42,067 unit discriminate between two different patterns. 92 00:07:42,067 --> 00:07:48,017 We'll call one positive example and the other's negative examples if they've got 93 00:07:48,017 --> 00:07:53,039 the same number of pixels in them. And the answer is no it can't discriminate 94 00:07:53,039 --> 00:07:58,089 two patterns out of the same number of pixels if that discrimination has to work 95 00:07:58,089 --> 00:08:03,085 when the patterns are translated and if the patterns can wrap-around when 96 00:08:03,085 --> 00:08:08,087 translate. So, if you look at these examples of 97 00:08:08,087 --> 00:08:14,074 pattern A, in a one-dimensional image. Pattern A has four pixels that are on. 98 00:08:14,074 --> 00:08:19,092 Those four black pixels. It's like a little bar code. 99 00:08:20,019 --> 00:08:24,022 And it's the same pattern when we translate it a bit to the right. 100 00:08:24,022 --> 00:08:28,086 And we're going to allow ourselves to translate the pattern so it goes off the 101 00:08:28,086 --> 00:08:31,097 right hand end, and comes back on the left hand end. 102 00:08:31,097 --> 00:08:36,024 So the third example is the same pattern that's been translated with some 103 00:08:36,024 --> 00:08:42,029 wrap-around. And pattern B, it also has four patterns, 104 00:08:42,029 --> 00:08:45,090 but four pixels, but in a different arrangement. 105 00:08:45,090 --> 00:08:51,067 And in the third example of pattern B, it's been translated with wrap-around. 106 00:08:54,078 --> 00:08:57,089 So that's still an example of pattern B. And for two sets of patterns like that, a 107 00:08:57,089 --> 00:09:02,020 binary threshold unit cannot learn to discriminate them. 108 00:09:02,020 --> 00:09:10,010 And here's the proof. What we're going to do is we're going to 109 00:09:10,010 --> 00:09:15,027 consider that for the positive examples we have pattern A in all possible 110 00:09:15,027 --> 00:09:19,058 translations. Now since pattern A has four on pixels, 111 00:09:19,058 --> 00:09:25,031 that means if we look at any pixel on the retina, there'll be four different 112 00:09:25,031 --> 00:09:30,058 positions in which we can put pattern A that will activate that pixel. 113 00:09:30,058 --> 00:09:36,032 So each pixel will be activated by four different translations of pattern A. 114 00:09:37,070 --> 00:09:43,022 That means that the total input received by the decision unit, over all those 115 00:09:43,022 --> 00:09:49,003 various translations of pattern A, would be four times the sum of all the weights 116 00:09:49,003 --> 00:09:55,005 in the perceptron, because each pixel will activate the decision unit four different 117 00:09:55,005 --> 00:09:58,057 times. And so summed over all those patterns will 118 00:09:58,057 --> 00:10:02,094 get four times the sum of the weights. Now consider pattern B. 119 00:10:02,094 --> 00:10:08,053 We're going to make the negative cases be pattern B, in all possible translations. 120 00:10:09,055 --> 00:10:14,074 And again, each pixel will be activated by four different translations of pattern B. 121 00:10:15,032 --> 00:10:19,098 So the total input of the decision unit receives and, over all those different 122 00:10:19,098 --> 00:10:24,077 translations of pattern B, will again be four times the sum of all the weights. 123 00:10:26,058 --> 00:10:30,055 But the perceptron, in order to discriminate correctly, has to have 124 00:10:30,055 --> 00:10:35,002 weights so that every single case of pattern A provides more input to the 125 00:10:35,002 --> 00:10:38,003 decision unit than every single case of pattern B. 126 00:10:38,003 --> 00:10:42,062 And that's clearly impossible if when you sum of all these cases, all those 127 00:10:42,062 --> 00:10:47,050 different versions of pattern A and all of those different versions of pattern B, 128 00:10:47,050 --> 00:10:51,025 provide exactly the same amount of input to the decision unit. 129 00:10:51,025 --> 00:10:58,094 So we've proved that a perceptron cannot recognize patterns under translation if we 130 00:10:58,094 --> 00:11:04,006 allow wrap-around. That's a particular case of Minsky and 131 00:11:04,006 --> 00:11:12,010 Papert's group invariance theorem. And that result is devastating for 132 00:11:12,010 --> 00:11:14,084 perceptrons, it was historically devastating. 133 00:11:14,084 --> 00:11:19,064 Because the whole point of pattern recognition is to recognize patterns that 134 00:11:19,064 --> 00:11:24,038 undergo transformations and see that they're still the same pattern, despite 135 00:11:24,038 --> 00:11:27,037 the transformation. Like for example, translation. 136 00:11:27,098 --> 00:11:32,046 And when Minsky and Papert showed that a perceptron couldn't do that if the 137 00:11:32,046 --> 00:11:37,029 transformations formed a group, that is the learning part of a perceptron couldn't 138 00:11:37,029 --> 00:11:41,089 learn to do that, it became clear that the claims that have been made for what 139 00:11:41,089 --> 00:11:46,043 perceptrons could learn were somewhat exaggerated, and that to get them to do 140 00:11:46,043 --> 00:11:51,021 anything interesting, you had to choose just the right features to make it fairly 141 00:11:51,021 --> 00:11:54,027 easy for the last stage to learn the classification. 142 00:11:55,072 --> 00:11:59,651 So the translations within our prime form a group and, Minsky and Papert proved a 143 00:11:59,651 --> 00:12:04,381 general theorem for transformations that form a group, are making it impossible, 144 00:12:04,381 --> 00:12:08,024 for a perceptron. For the learning part of a perceptron to 145 00:12:08,024 --> 00:12:11,076 do the recognition. The perceptron architecture can still do 146 00:12:11,076 --> 00:12:16,038 the recognition, but you have to organize the features so they do the difficult 147 00:12:16,038 --> 00:12:19,092 part. So we have to have multiple feature units 148 00:12:19,092 --> 00:12:25,023 that recognize informative sub patterns that tell you something about what class 149 00:12:25,023 --> 00:12:30,020 it is, and we have to have separate feature units for each position of those 150 00:12:30,020 --> 00:12:34,091 informative sub patterns, if we're trying to recognize under translation. 151 00:12:36,022 --> 00:12:41,034 So the tricky part of pattern recognition has to be solved by the hand-coded feature 152 00:12:41,034 --> 00:12:48,050 detectors, not the learning procedure. The temporary conclusion from this is that 153 00:12:48,050 --> 00:12:53,032 perceptrons are no good and therefore neural networks are no good. 154 00:12:53,032 --> 00:12:58,044 The longer term conclusion is that neural networks are only gonna be really powerful 155 00:12:58,044 --> 00:13:03,020 if we can learn the feature detectors. It's not enough just to learn weight sum 156 00:13:03,020 --> 00:13:07,030 feature detectors, we have to learn the feature detectors themselves. 157 00:13:07,030 --> 00:13:11,094 And the second generation of neural networks, which we'll come to in the next 158 00:13:11,094 --> 00:13:15,044 lecture, was all about how you learn the feature detectors. 159 00:13:15,044 --> 00:13:19,041 But it took twenty years before people figured out how to do that. 160 00:13:20,039 --> 00:13:26,084 So, networks without hidden units are very limited in what they can learn to model. 161 00:13:27,058 --> 00:13:32,006 If we add more layers of linear units, it doesn't help because everything is linear. 162 00:13:32,006 --> 00:13:36,054 We can make them much more powerful by putting in hand coded hidden units but 163 00:13:36,054 --> 00:13:39,094 they're not really hidden units because we hand coded them. 164 00:13:39,094 --> 00:13:45,067 We told them what to do. It's not enough just to have fixed output 165 00:13:45,067 --> 00:13:49,038 non-linearities. What we need is multiple layers of 166 00:13:49,038 --> 00:13:54,025 adaptive non-linear hidden units. And the problem is how can we train such 167 00:13:54,025 --> 00:13:58,030 nets. We need a way to adapt all the weights not 168 00:13:58,030 --> 00:14:01,091 just the last layer like in a perceptron, and that's hard. 169 00:14:02,062 --> 00:14:08,016 In particular, leaning the weights go in to the hidden units, that's equivalent to 170 00:14:08,016 --> 00:14:11,060 learning features. And that's the hard thing to do. 171 00:14:11,060 --> 00:14:16,030 Because nobody is telling us directly, what the hidden unit should be doing, when 172 00:14:16,030 --> 00:14:19,041 they should be active and, when they should not be active. 173 00:14:19,041 --> 00:14:24,011 And the, real problem is, how do we figure out how to learn these weights go into 174 00:14:24,011 --> 00:14:28,093 hidden units so that the hidden units turn into the kinds of feature detectors we 175 00:14:28,093 --> 00:14:33,075 need for solving a problem, when nobody is telling us what the featured detector 176 00:14:33,075 --> 00:14:34,034 should be.