1
00:00:00,000 --> 00:00:05,062
In this video, we're gonna look at the
limitations of perceptrons.

2
00:00:06,082 --> 00:00:10,098
These limitations stem from the kinds of
features you use.

3
00:00:10,098 --> 00:00:15,021
If you use the right features, you could
do almost anything.

4
00:00:15,021 --> 00:00:20,074
If you use the wrong features, they're
extremely limited in what the learning

5
00:00:20,074 --> 00:00:25,084
part of a perceptron can do.
And that's what caused perceptrons to go out of

6
00:00:25,084 --> 00:00:31,044
favor, and it emphasizes that the
difficult bit of learning is to learn the

7
00:00:31,044 --> 00:00:35,021
right features.
There's still a lot you can do with

8
00:00:35,021 --> 00:00:37,098
learning, even if you do not learn
features.

9
00:00:37,098 --> 00:00:43,025
For example, if you want to say whether a
sentence is a plausible English sentence,

10
00:00:43,025 --> 00:00:48,040
you could hand define a huge number of
features, and then learn how to weight them

11
00:00:48,040 --> 00:00:53,003
in order to decide whether a particular
sentence is likely a good English

12
00:00:53,003 --> 00:00:56,006
sentence.
But, in the long run you need to learn

13
00:00:56,006 --> 00:01:00,079
features.
So the reason that neural network research

14
00:01:00,079 --> 00:01:07,053
came to a halt in the late 1960s and early
1970s is that perceptrons were shown to be

15
00:01:07,053 --> 00:01:13,009
very limited, and we're now gonna
understand what those limitations are.

16
00:01:13,064 --> 00:01:19,021
If you're allowed to choose the features by
hand, and if you use enough features, you

17
00:01:19,021 --> 00:01:22,028
can make the perceptron do almost
anything.

18
00:01:23,031 --> 00:01:26,069
Suppose for example we have binary input
vectors.

19
00:01:26,069 --> 00:01:32,034
And we create a separate feature unit that
gets activated by exactly one of those

20
00:01:32,034 --> 00:01:36,041
binary input vectors.
We'll need exponentially many feature

21
00:01:36,041 --> 00:01:39,010
units.
But now we can make any possible

22
00:01:39,010 --> 00:01:44,048
discrimination on binary input vectors.
So for binary input vectors there's no

23
00:01:44,048 --> 00:01:48,048
limitation if you're willing to make
enough feature units.
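
To make that concrete, here is a minimal sketch, not something from the lecture itself. It assumes a 3-bit input and uses parity as the target, a discrimination that no threshold unit on the raw bits can make. The names one_hot_features, train_perceptron and predict are made up for illustration.

```python
from itertools import product

def one_hot_features(x):
    """One feature unit per possible binary input vector (2**n units in total)."""
    n = len(x)
    return [1 if tuple(x) == pattern else 0 for pattern in product((0, 1), repeat=n)]

def train_perceptron(cases, n_features, epochs=20):
    """Perceptron learning rule on the feature activities, with a learned bias."""
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        for features, target in cases:
            total = bias + sum(w * f for w, f in zip(weights, features))
            output = 1 if total >= 0 else 0
            error = target - output  # +1, 0 or -1
            bias += error
            weights = [w + error * f for w, f in zip(weights, features)]
    return weights, bias

def predict(weights, bias, features):
    total = bias + sum(w * f for w, f in zip(weights, features))
    return 1 if total >= 0 else 0

# Assumed target for illustration: 3-bit parity.
inputs = list(product((0, 1), repeat=3))
cases = [(one_hot_features(x), sum(x) % 2) for x in inputs]
weights, bias = train_perceptron(cases, n_features=2 ** 3)
predictions = [predict(weights, bias, f) for f, _ in cases]
targets = [t for _, t in cases]
print(predictions == targets)
```

With one feature unit per input vector the learning part only has to fill in a lookup table, which is why the number of units grows as 2 to the power of the number of input bits.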

24
00:01:48,078 --> 00:01:53,091
But of course, that's not a very good
strategy for solving a practical problem

25
00:01:53,091 --> 00:01:58,058
because you need an awful lot of feature
units and it won't generalize.

26
00:01:58,058 --> 00:02:03,098
You can't look at a subset of all possible
cases and have any hope of getting the

27
00:02:03,098 --> 00:02:08,484
remaining cases right because those
remaining cases require new feature units

28
00:02:08,484 --> 00:02:12,093
and you don't know what weights to put on
those feature units.

29
00:02:12,093 --> 00:02:15,083
Once you've decided on the hand-coded
features,

30
00:02:15,083 --> 00:02:21,042
that is, once they've been determined,
there are very strong limitations on what

31
00:02:21,042 --> 00:02:26,096
a perceptron can learn to do.
So here's a classic example.

32
00:02:28,034 --> 00:02:33,093
What we're interested in is what you can
learn to do with a binary threshold

33
00:02:33,093 --> 00:02:37,019
decision unit, that is, by changing its
weights.

34
00:02:38,012 --> 00:02:43,089
And we're going to show that there are very
simple things that it can't learn to do.

35
00:02:43,089 --> 00:02:49,045
So the simplest example is to consider a
problem in which there are two positive

36
00:02:49,045 --> 00:02:53,079
cases and two negative cases.
And the features are just single-bit

37
00:02:53,079 --> 00:02:56,031
features that have values of either one or
zero.

38
00:02:56,063 --> 00:03:00,055
So the two positive cases consist of both
features being on.

39
00:03:00,055 --> 00:03:04,073
In which case the right answer's one.
Or both features being off.

40
00:03:04,073 --> 00:03:09,076
In which case the right answer's one.
And the two negative cases are when one

41
00:03:09,076 --> 00:03:14,065
feature's on and the other one's off.
In which case the right answer is zero.

42
00:03:14,065 --> 00:03:19,075
So all we're asking the binary threshold
unit to do is decide whether the two

43
00:03:19,075 --> 00:03:24,006
features have the same value.
And it can't even learn to do that.

44
00:03:25,030 --> 00:03:31,017
We can prove that algebraically.
Those four input/output pairs that I

45
00:03:31,017 --> 00:03:36,052
showed you give rise to four inequalities,
and it's impossible to satisfy them.

46
00:03:36,081 --> 00:03:42,098
So for the first positive case, when the two
feature values are one, the output should

47
00:03:42,098 --> 00:03:46,040
be one.
That gives us the inequality that: one

48
00:03:46,040 --> 00:03:51,042
times W1 plus one times W2 is gonna be
greater than the threshold.

49
00:03:51,042 --> 00:03:56,090
So we give an output of one.
Then the second positive case gives us the

50
00:03:56,090 --> 00:04:02,091
inequality that zero times W1 plus zero
times W2, must also be greater than the

51
00:04:02,091 --> 00:04:06,095
threshold.
And the negative cases give us the

52
00:04:06,095 --> 00:04:13,015
inequalities that one times W1 plus zero
times W2, must be less than the threshold,

53
00:04:13,015 --> 00:04:19,012
and similarly zero times W1 plus one times
W2, must be less than the threshold.

54
00:04:19,074 --> 00:04:24,076
Now if you take those first two
inequalities and you add them up, you get

55
00:04:24,076 --> 00:04:28,062
that W1 plus W2 must be greater than twice
the threshold.

56
00:04:28,062 --> 00:04:33,058
And if you take the second two
inequalities and you add them up, you get

57
00:04:33,058 --> 00:04:36,096
W1 plus W2 must be less than twice the
threshold.

58
00:04:36,096 --> 00:04:41,002
So there's clearly no way to satisfy all
four inequalities.
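
Written out as a sketch, using the symbol \theta for the threshold (the symbol is an assumption, the lecture only says "the threshold"), the four inequalities and the two sums look like this:

```latex
% Positive cases: both features on, or both features off.
\begin{align*}
1 \cdot w_1 + 1 \cdot w_2 &> \theta, & 0 \cdot w_1 + 0 \cdot w_2 &> \theta \\
\intertext{Negative cases: exactly one feature on.}
1 \cdot w_1 + 0 \cdot w_2 &< \theta, & 0 \cdot w_1 + 1 \cdot w_2 &< \theta \\
\intertext{Adding the first pair, and then the second pair:}
w_1 + w_2 &> 2\theta, & w_1 + w_2 &< 2\theta
\end{align*}
% The last two statements contradict each other, so no (w_1, w_2, \theta) exists.
```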

59
00:04:41,002 --> 00:04:45,914
Or to put it another way, if you look at
the binary decision unit where we're going

60
00:04:45,914 --> 00:04:52,068
to put the threshold as a negative weight
on an input line that always has a value of

61
00:04:52,068 --> 00:04:55,073
one.
If you take that binary threshold unit

62
00:04:55,073 --> 00:05:01,049
shown at the bottom right, there's no way
to set the threshold and the two weights

63
00:05:01,049 --> 00:05:07,042
so it gets all four cases right.
We can also see this geometrically.

64
00:05:08,083 --> 00:05:15,010
So we're going to imagine a data space
now, in which the axes correspond to

65
00:05:15,010 --> 00:05:23,075
components of an input vector.
So in this space each point corresponds to

66
00:05:23,075 --> 00:05:29,037
a data point.
And a weight vector is going to define a

67
00:05:29,037 --> 00:05:33,069
plane in this space.
So it's just the opposite of what we were

68
00:05:33,069 --> 00:05:37,546
doing with weight space.
In weight space we made each point be a

69
00:05:37,546 --> 00:05:42,002
weight vector, and we used a plane to
define an input case.

70
00:05:42,002 --> 00:05:45,076
Of course that plane was defined by an
input vector.

71
00:05:45,076 --> 00:05:51,080
Now what we're going to do is we're going
to make each point be an input vector and

72
00:05:51,080 --> 00:05:56,075
we're going to use a weight vector to define
a plane in the data space.

73
00:05:58,018 --> 00:06:02,082
The plane defined by the weight vector is
going to be perpendicular to the weight

74
00:06:02,082 --> 00:06:07,028
vector and it's going to miss the origin
by a distance equal to the threshold.

75
00:06:07,028 --> 00:06:13,072
So here's a picture.
You see the four data cases there, and for

76
00:06:13,072 --> 00:06:19,008
the two data cases in red, we need to give
an output of zero.

77
00:06:19,008 --> 00:06:25,024
And with the two data cases in green, we
need to give an output of one.

78
00:06:25,054 --> 00:06:30,069
That means we need the green cases to
be on the side of the weight plane where

79
00:06:30,069 --> 00:06:34,894
the output is one and we need the red
cases to be on the side where the output

80
00:06:34,894 --> 00:06:40,043
is zero, and we obviously cannot arrange
the weight plane so that's true.
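
One way to check this for yourself is a brute-force search over weight planes. This is only a sketch under assumptions: the grid of candidate values is arbitrary, the unit is taken to output one when its input is at least the threshold, and the name count_correct is made up for illustration.

```python
from itertools import product

# The four training cases from the lecture: (feature 1, feature 2) -> target.
CASES = [((1, 1), 1), ((0, 0), 1), ((1, 0), 0), ((0, 1), 0)]

def count_correct(w1, w2, theta):
    """Number of cases a binary threshold unit gets right (output 1 when input >= theta)."""
    correct = 0
    for (x1, x2), target in CASES:
        output = 1 if w1 * x1 + w2 * x2 >= theta else 0
        correct += output == target
    return correct

# Assumed search grid: values from -3.0 to 3.0 in steps of 0.25.
grid = [i / 4 for i in range(-12, 13)]
best = max(count_correct(w1, w2, theta) for w1, w2, theta in product(grid, repeat=3))
print(best)
```

The search never gets more than three of the four cases right, which is what the picture says: one line can cut off a single corner of the square, but it cannot put the two green corners on one side and the two red corners on the other.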

81
00:06:41,048 --> 00:06:47,073
We call a set of cases like that, where
there's no hyperplane that will separate

82
00:06:47,073 --> 00:06:53,083
the cases where we want the answer to be
one, from the cases where we want the

83
00:06:53,083 --> 00:06:58,013
answer to be zero.
We call that a set of training cases

84
00:06:58,013 --> 00:07:05,038
that's not linearly separable.
An even more devastating example for

85
00:07:05,038 --> 00:07:12,071
perceptrons because it's much more general
is when we try and discriminate simple

86
00:07:12,071 --> 00:07:18,787
patterns that have to retain their identity
when you translate them with wrap-around.

87
00:07:18,787 --> 00:07:23,490
I'll give you an example of what that
means in a minute.

88
00:07:23,490 --> 00:07:30,349
But the idea is that we want to recognize
a pattern and we want to recognize it even

89
00:07:30,349 --> 00:07:35,083
when it's translated.
So suppose we just use pixels as the

90
00:07:35,083 --> 00:07:39,034
features.
The question is can a binary threshold

91
00:07:39,034 --> 00:07:42,067
unit discriminate between two different
patterns

92
00:07:42,067 --> 00:07:48,017
(we'll call one the positive examples and the
other the negative examples) if they've got

93
00:07:48,017 --> 00:07:53,039
the same number of pixels in them.
And the answer is no it can't discriminate

94
00:07:53,039 --> 00:07:58,089
two patterns with the same number of
pixels if that discrimination has to work

95
00:07:58,089 --> 00:08:03,085
when the patterns are translated and if
the patterns can wrap-around when

96
00:08:03,085 --> 00:08:08,087
translated.
So, look at these examples of

97
00:08:08,087 --> 00:08:14,074
pattern A in a one-dimensional image.
Pattern A has four pixels that are on.

98
00:08:14,074 --> 00:08:19,092
Those four black pixels.
It's like a little bar code.

99
00:08:20,019 --> 00:08:24,022
And it's the same pattern when we
translate it a bit to the right.

100
00:08:24,022 --> 00:08:28,086
And we're going to allow ourselves to
translate the pattern so it goes off the

101
00:08:28,086 --> 00:08:31,097
right hand end, and comes back on the left
hand end.

102
00:08:31,097 --> 00:08:36,024
So the third example is the same pattern
that's been translated with some

103
00:08:36,024 --> 00:08:42,029
wrap-around.
And pattern B also has four pixels,

104
00:08:42,029 --> 00:08:45,090
but they're in a different
arrangement.

105
00:08:45,090 --> 00:08:51,067
And in the third example of pattern B,
it's been translated with wrap-around.

106
00:08:54,078 --> 00:08:57,089
So that's still an example of pattern B.
And for two sets of patterns like that, a

107
00:08:57,089 --> 00:09:02,020
binary threshold unit cannot learn to
discriminate them.

108
00:09:02,020 --> 00:09:10,010
And here's the proof.
What we're going to do is we're going to

109
00:09:10,010 --> 00:09:15,027
consider that for the positive examples we
have pattern A in all possible

110
00:09:15,027 --> 00:09:19,058
translations.
Now since pattern A has four on pixels,

111
00:09:19,058 --> 00:09:25,031
that means if we look at any pixel on the
retina, there'll be four different

112
00:09:25,031 --> 00:09:30,058
positions in which we can put pattern A
that will activate that pixel.

113
00:09:30,058 --> 00:09:36,032
So each pixel will be activated by four
different translations of pattern A.

114
00:09:37,070 --> 00:09:43,022
That means that the total input received
by the decision unit, over all those

115
00:09:43,022 --> 00:09:49,003
various translations of pattern A, would
be four times the sum of all the weights

116
00:09:49,003 --> 00:09:55,005
in the perceptron, because each pixel will
activate the decision unit four different

117
00:09:55,005 --> 00:09:58,057
times.
And so summed over all those patterns we'll

118
00:09:58,057 --> 00:10:02,094
get four times the sum of the weights.
Now consider pattern B.

119
00:10:02,094 --> 00:10:08,053
We're going to make the negative cases be
pattern B, in all possible translations.

120
00:10:09,055 --> 00:10:14,074
And again, each pixel will be activated by
four different translations of pattern B.

121
00:10:15,032 --> 00:10:19,098
So the total input the decision unit
receives, over all those different

122
00:10:19,098 --> 00:10:24,077
translations of pattern B, will again be
four times the sum of all the weights.

123
00:10:26,058 --> 00:10:30,055
But the perceptron, in order to
discriminate correctly, has to have

124
00:10:30,055 --> 00:10:35,002
weights so that every single case of
pattern A provides more input to the

125
00:10:35,002 --> 00:10:38,003
decision unit than every single case of
pattern B.

126
00:10:38,003 --> 00:10:42,062
And that's clearly impossible if, when you
sum over all these cases, all those

127
00:10:42,062 --> 00:10:47,050
different versions of pattern A and all of
those different versions of pattern B,

128
00:10:47,050 --> 00:10:51,025
provide exactly the same amount of input
to the decision unit.
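
The counting argument can be checked numerically. This is a sketch under assumptions: the 10-pixel retina, the two particular patterns and the random weights are all made up for illustration, as are the names translations and total_input.

```python
import math
import random

def translations(pattern):
    """All cyclic shifts (translations with wrap-around) of a 1-D binary pattern."""
    n = len(pattern)
    return [[pattern[(i - k) % n] for i in range(n)] for k in range(n)]

def total_input(weights, pattern):
    """Input the decision unit receives from one pattern, using pixels as features."""
    return sum(w * p for w, p in zip(weights, pattern))

# Assumed example patterns, each with four "on" pixels in a different arrangement.
pattern_a = [1, 1, 0, 1, 1, 0, 0, 0, 0, 0]
pattern_b = [1, 0, 1, 0, 0, 1, 1, 0, 0, 0]

random.seed(0)
weights = [random.uniform(-1, 1) for _ in pattern_a]  # any weights would do

inputs_a = [total_input(weights, t) for t in translations(pattern_a)]
inputs_b = [total_input(weights, t) for t in translations(pattern_b)]

# Summed over all translations, both patterns give four times the sum of the weights.
sums_match = (math.isclose(sum(inputs_a), 4 * sum(weights), abs_tol=1e-9)
              and math.isclose(sum(inputs_b), 4 * sum(weights), abs_tol=1e-9))
print(sums_match)
```

Since the two totals are equal, the averages are equal too, so the weakest case of pattern A cannot beat the strongest case of pattern B, whatever the weights are.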

129
00:10:51,025 --> 00:10:58,094
So we've proved that a perceptron cannot
recognize patterns under translation if we

130
00:10:58,094 --> 00:11:04,006
allow wrap-around.
That's a particular case of Minsky and

131
00:11:04,006 --> 00:11:12,010
Papert's group invariance theorem.
And that result is devastating for

132
00:11:12,010 --> 00:11:14,084
perceptrons, it was historically
devastating.

133
00:11:14,084 --> 00:11:19,064
Because the whole point of pattern
recognition is to recognize patterns that

134
00:11:19,064 --> 00:11:24,038
undergo transformations and see that
they're still the same pattern, despite

135
00:11:24,038 --> 00:11:27,037
the transformation.
Like for example, translation.

136
00:11:27,098 --> 00:11:32,046
And when Minsky and Papert showed that a
perceptron couldn't do that if the

137
00:11:32,046 --> 00:11:37,029
transformations formed a group, that is
the learning part of a perceptron couldn't

138
00:11:37,029 --> 00:11:41,089
learn to do that, it became clear that the
claims that had been made for what

139
00:11:41,089 --> 00:11:46,043
perceptrons could learn were somewhat
exaggerated, and that to get them to do

140
00:11:46,043 --> 00:11:51,021
anything interesting, you had to choose
just the right features to make it fairly

141
00:11:51,021 --> 00:11:54,027
easy for the last stage to learn the
classification.

142
00:11:55,072 --> 00:11:59,651
So the translations with wrap-around form
a group, and Minsky and Papert proved a

143
00:11:59,651 --> 00:12:04,381
general theorem that transformations that
form a group make it impossible

144
00:12:04,381 --> 00:12:08,024
for a perceptron,
that is, for the learning part of a perceptron, to

145
00:12:08,024 --> 00:12:11,076
do the recognition.
The perceptron architecture can still do

146
00:12:11,076 --> 00:12:16,038
the recognition, but you have to organize
the features so they do the difficult

147
00:12:16,038 --> 00:12:19,092
part.
So we have to have multiple feature units

148
00:12:19,092 --> 00:12:25,023
that recognize informative sub-patterns
that tell you something about what class

149
00:12:25,023 --> 00:12:30,020
it is, and we have to have separate
feature units for each position of those

150
00:12:30,020 --> 00:12:34,091
informative sub-patterns, if we're trying
to recognize under translation.

151
00:12:36,022 --> 00:12:41,034
So the tricky part of pattern recognition
has to be solved by the hand-coded feature

152
00:12:41,034 --> 00:12:48,050
detectors, not the learning procedure.
The temporary conclusion from this is that

153
00:12:48,050 --> 00:12:53,032
perceptrons are no good and therefore
neural networks are no good.

154
00:12:53,032 --> 00:12:58,044
The longer term conclusion is that neural
networks are only gonna be really powerful

155
00:12:58,044 --> 00:13:03,020
if we can learn the feature detectors.
It's not enough just to learn weights on

156
00:13:03,020 --> 00:13:07,030
feature detectors, we have to learn the
feature detectors themselves.

157
00:13:07,030 --> 00:13:11,094
And the second generation of neural
networks, which we'll come to in the next

158
00:13:11,094 --> 00:13:15,044
lecture, was all about how you learn the
feature detectors.

159
00:13:15,044 --> 00:13:19,041
But it took twenty years before people
figured out how to do that.

160
00:13:20,039 --> 00:13:26,084
So, networks without hidden units are very
limited in what they can learn to model.

161
00:13:27,058 --> 00:13:32,006
If we add more layers of linear units, it
doesn't help because everything is linear.
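
A small sketch of why extra linear layers add nothing: two linear layers applied in sequence give the same result as a single layer whose weight matrix is the product of the two. The weight values and the helper names matvec and matmul are made up for illustration.

```python
def matvec(matrix, vector):
    """Apply a linear layer: one weighted sum per row of the matrix."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def matmul(a, b):
    """Matrix product of a and b, both given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

# Assumed weights: 2 inputs -> 3 linear hidden units -> 1 linear output.
W1 = [[1, -2], [0, 3], [2, 1]]
W2 = [[1, 1, -1]]

x = [4, 5]
two_layer = matvec(W2, matvec(W1, x))   # pass through both layers in turn
one_layer = matvec(matmul(W2, W1), x)   # single layer with the combined weights
print(two_layer == one_layer)
```

So whatever the linear hidden layer computes, a network with no hidden layer could compute it as well.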

162
00:13:32,006 --> 00:13:36,054
We can make them much more powerful by
putting in hand-coded hidden units, but

163
00:13:36,054 --> 00:13:39,094
they're not really hidden units because we
hand-coded them.
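
As a sketch of what one hand-coded unit buys you, take the earlier problem of deciding whether two features have the same value. Adding a single unit that detects "both on" is enough for a threshold unit to get all four cases right. The weights and thresholds below are picked by hand for illustration, and and_unit and decision_unit are made-up names.

```python
def and_unit(x1, x2):
    """Hand-coded hidden unit: a threshold unit that fires only when both inputs are on."""
    return 1 if x1 + x2 >= 1.5 else 0

def decision_unit(x1, x2):
    """Binary threshold unit over the two raw features plus the hand-coded feature."""
    h = and_unit(x1, x2)
    total = -1.0 * x1 - 1.0 * x2 + 2.0 * h  # illustrative weights, chosen by hand
    return 1 if total >= -0.5 else 0

# The four cases: output 1 when the two features have the same value.
cases = {(1, 1): 1, (0, 0): 1, (1, 0): 0, (0, 1): 0}
outputs = {x: decision_unit(*x) for x in cases}
print(outputs == cases)
```

Nothing was learned about the hidden unit here. Its job was decided in advance, which is the sense in which it isn't really hidden.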

164
00:13:39,094 --> 00:13:45,067
We told them what to do.
It's not enough just to have fixed output

165
00:13:45,067 --> 00:13:49,038
non-linearities.
What we need is multiple layers of

166
00:13:49,038 --> 00:13:54,025
adaptive non-linear hidden units.
And the problem is how can we train such

167
00:13:54,025 --> 00:13:58,030
nets.
We need a way to adapt all the weights not

168
00:13:58,030 --> 00:14:01,091
just the last layer like in a perceptron,
and that's hard.

169
00:14:02,062 --> 00:14:08,016
In particular, learning the weights going
into the hidden units is equivalent to

170
00:14:08,016 --> 00:14:11,060
learning features.
And that's the hard thing to do.

171
00:14:11,060 --> 00:14:16,030
Because nobody is telling us directly
what the hidden units should be doing, when

172
00:14:16,030 --> 00:14:19,041
they should be active and when they
should not be active.

173
00:14:19,041 --> 00:14:24,011
And the real problem is, how do we figure
out how to learn the weights going into the

174
00:14:24,011 --> 00:14:28,093
hidden units so that the hidden units turn
into the kinds of feature detectors we

175
00:14:28,093 --> 00:14:33,075
need for solving a problem, when nobody is
telling us what the feature detectors

176
00:14:33,075 --> 00:14:34,034
should be.