In this video, I'm going to talk about convolutional neural networks for hand-written digit recognition. This was one of the big success stories of neural networks in the 1980s. The deep convolutional nets developed by Yann LeCun and his collaborators did a really good job of recognizing handwriting and were actually used in practice. They're one of the few examples from that period of deep neural nets that could be trained on the computers that existed then and that performed really well.

Convolutional neural networks are based on the idea of replicated features. Because objects move around and show up on different pixels, if we have a feature detector that's useful in one place in the image, it's likely that the same feature detector will be useful somewhere else. So the idea is to build many different copies of the same feature detector in all the different positions. If you look on the right, I've shown you three feature detectors which are replicas of each other. Each of them has weights to nine pixels, and those weights are identical between the three different feature detectors. So the red arrow has the same weight on it for all three feature detectors, and when we learn, we keep those red arrows all having the same weight as each other, and we keep the green arrows all having the same weight as each other, even though the red and green arrows will have different weights. We could also try replicating across scale and orientation, but that's much more difficult and expensive, and probably not a good idea.

Replication across position greatly reduces the number of free parameters that you have to learn. So the 27 pixels that you see feeding into those three replicated detectors only involve nine different weights.

Now, we don't just want to use one feature type, so we're going to have many maps. Each map will have replicas of the same feature: features that are constrained to be identical in different places. Different maps will then learn to detect different features. This allows each patch of the image to be represented by features of many different types.

Replicated features fit in nicely with backpropagation; that is, they're easy to learn using backpropagation. In fact, it's easy to modify the backpropagation algorithm to incorporate any linear constraint between the weights. What we do is compute the gradients as usual, but then we modify the gradients so that if the weights satisfied the linear constraint before the weight update, they'll also satisfy the linear constraint after the weight update.

The simplest example is when we want two weights to be equal: we want w1 to equal w2. That will stay true if we start off with w1 equal to w2, and then make sure that the change in w1 is always equal to the change in w2. The way we do that is to compute the gradient of the error with respect to w1 and the gradient with respect to w2, and then use the sum or average of those two gradients for both w1 and w2. By using weight constraints like that, we can force backpropagation to learn replicated feature detectors.
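Here is a minimal numpy sketch of that tied-weight trick for the simplest case of two weights constrained to be equal. The tiny network, toy data, and learning rate are made up purely for illustration.

```python
import numpy as np

# Minimal illustration: a linear unit y = w1*x1 + w2*x2 where we want w1 == w2.
# Start the two weights equal, compute both gradients as usual, then apply the
# *average* of the two gradients to both weights, so the constraint w1 == w2
# still holds after every update.

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 2))        # toy inputs (made up)
t = 3.0 * (x[:, 0] + x[:, 1])        # toy targets: the true tied weight is 3

w = np.array([0.0, 0.0])             # w1 and w2 start out equal
lr = 0.1

for _ in range(100):
    y = x @ w                        # forward pass
    err = y - t
    grad = x.T @ err / len(x)        # dE/dw1 and dE/dw2, computed separately
    tied = grad.mean()               # use the average for both weights...
    w -= lr * np.array([tied, tied]) # ...so the change in w1 equals the change in w2

print(w)                             # both weights stay equal, roughly [3.0, 3.0]
```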
There's quite a lot of confusion in the literature about what replicated feature detectors actually achieve. Many people claim they achieve translation invariance, and that's not true. Well, at least it's not true in the activities of the neurons. If you look at the activities, what replicated features achieve is equivariance, not invariance. An example should make that clear. Here's an image, and the black dots are the activated neurons. Here's a translated image, and notice that the black dots have also translated. So the image changed, and the representation also changed by just as much as the image. That's equivariance, not invariance.

There is something that is invariant, and that's the knowledge. If you learn replicated feature detectors, then if you know how to detect a feature in one place, you know how to detect that same feature in another place. So it's important to note that we're achieving equivariance in the activities and invariance in the weights.

If you want to achieve some invariance in the activities, what you need to do is pool the outputs of the replicated feature detectors. You can get a small amount of translation invariance at each level of a deep net by averaging four neighboring replicated detectors. One advantage of this is that it reduces the number of inputs to the next layer, so we can have more different maps, allowing us to learn more different kinds of features in the next layer. It actually works slightly better to take the maximum of four neighboring feature detectors rather than the average.
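A minimal numpy sketch of that pooling step, assuming non-overlapping 2x2 neighborhoods and a made-up 4x4 map of detector activities:

```python
import numpy as np

def pool2x2(acts, mode="max"):
    """Pool non-overlapping 2x2 neighborhoods of replicated detector outputs."""
    h, w = acts.shape
    blocks = acts[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2)
    if mode == "max":
        return blocks.max(axis=(1, 3))   # take the strongest of the four neighbors
    return blocks.mean(axis=(1, 3))      # or average them

acts = np.arange(16.0).reshape(4, 4)     # pretend these are one map's activities
print(pool2x2(acts, "max"))              # 4x4 map -> 2x2 map: fewer inputs to the next layer
print(pool2x2(acts, "mean"))
```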
But there is a problem. After several levels of this kind of pooling, we've lost precise information about where things are. That's okay if we just want to recognize that it's a face: the fact that we've got a couple of eyes, a nose, and a mouth floating about in vaguely the right positions is very good evidence that it's a face. But if you want to recognize whose face it is, you need to use the precise spatial relationships between the eyes and between the nose and the mouth, and that's been lost by these convolutional neural nets. I'll come back to that issue later on.

The first impressive example of a convolutional neural net was developed by Yann LeCun and his collaborators, who built a really good recognizer for hand-written digits. It had many hidden layers, and in each layer it had many maps of replicated units. It had pooling between layers, so you pool adjacent replicated units before you send them to the next layer. They also used a wide net that could cope with several characters at once, and that would work even if the characters overlapped, so you didn't have to segment out individual characters before you fed them to the net.

Something people often forget is that they used a clever way of training a complete system. They weren't just training a recognizer for individual characters; they were training a complete system, so that you put in pixels at one end and you get out whole zip codes at the other end. In training that system, they used a method that would now be called maximum margin, but they did it way before maximum margin had been invented. The net they used was at one point responsible for reading about ten percent of the checks in North America, so it was of great practical value. There are some very nice demos on Yann's web page. You should really go and look at all of them, because they show you just how well it copes with variations in size, orientation, position, overlap of digits, and all sorts of background noise that would kill most methods.

The architecture of LeNet-5 looks like this. There's an input, which is pixels, and then there's a whole sequence of feature maps followed by subsampling. In the C1 layer there are six different feature maps, each of which is 28 by 28. The units in those maps detect small features that just look at, I think, three by three pixels, and their weights are constrained to be identical, so per map there are only about nine parameters. That makes learning much more efficient, and it means you need much less data. Then, after the feature maps, there's what they call subsampling, which is now called pooling. You pool together the outputs of a bunch of neighboring replicated features in C1, and that gives you a smaller map, which then provides the input to the next layer, which is discovering more complicated replicated features. As you go up this hierarchy, you get features that are more complicated but more invariant to position.
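For concreteness, here is one way to write down a LeNet-5-flavoured architecture in PyTorch. It's only a sketch: the kernel sizes and map counts follow the published LeNet-5 figure (5x5 kernels, 6 and then 16 maps), but the original's partial connection table and RBF output layer are replaced by ordinary layers.

```python
import torch
import torch.nn as nn

# A simplified LeNet-5-style convolutional net: feature maps with shared weights,
# followed by "subsampling" (pooling), repeated, then fully connected layers.
class LeNetLike(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5),   # C1: 6 maps of 28x28 replicated features
            nn.Tanh(),
            nn.AvgPool2d(2),                  # S2: pool neighboring replicated units
            nn.Conv2d(6, 16, kernel_size=5),  # C3: more complicated replicated features
            nn.Tanh(),
            nn.AvgPool2d(2),                  # S4: pool again -> more invariance
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 5 * 5, 120),
            nn.Tanh(),
            nn.Linear(120, 10),               # one output per digit class
        )

    def forward(self, x):
        return self.classifier(self.features(x))

net = LeNetLike()
out = net(torch.zeros(1, 1, 32, 32))          # LeNet-5 took 32x32 inputs, so C1 maps are 28x28
print(out.shape)                              # torch.Size([1, 10])
```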
Here are the errors that LeNet-5 made, and they show you that the data it's dealing with is quite tricky. There are 10,000 test cases, and these are the 82 errors it makes, so it's doing better than 99 percent correct. Nevertheless, most of the errors it makes are on things that people find quite easy to recognize, so there's still some way to go. Nobody knows the human error rate on this data, but it's probably twenty to thirty errors. Of course, there might be digits that LeNet-5 got right and you would get wrong, so you have to be careful in estimating the human error rate. You can't just look at these 82 and ask which ones you'd get right and which ones you'd get wrong; you also have to worry about all the other ones that LeNet-5 might have got right and you might have got wrong.

I now want to make a very general point about how to inject prior knowledge into machine learning, and it applies particularly to neural networks. We can put in prior knowledge, as is done in LeNet-5, by the design of the network: we can use local connectivity, we can use weight constraints, or we can choose neural activities that are particularly appropriate for the task we're doing. This is much less intrusive than trying to hand-engineer the features, but it still prejudices the network towards the particular way of solving the problem that we had in mind. We have an idea about how to do object recognition, by gradually building bigger and bigger features and by replicating these features across space, and we force the network to do it that way.

There is an alternative way to put in prior knowledge that gives the network a much freer hand: we can use our prior knowledge to generate a whole lot more training data.
One of the first examples of this was work by Hofmann and Tresp on trying to model what happens in a steel mill. They wanted to know the relationship between what comes out of the steel mill and various input variables, and they actually had a big old Fortran simulator that would allow them to simulate the steel mill. Of course, the simulator wasn't reality; it was making all sorts of approximations. So they had real data and also a simulator. What they did was run the simulator to create some synthetic data, add that to the real data, and show that this did better than using the real data alone. If I remember right, their great big Fortran simulator was only worth a few dozen extra real examples, but nevertheless they made the point.

Of course, if you generate a lot of synthetic data, it may make learning take much longer. So in terms of the speed of learning, it's much more efficient to put in knowledge by using things like connectivity and weight constraints, as was done in LeNet-5. But as computers get faster, this other way of putting in knowledge, by generating synthetic examples, begins to look better and better. In particular, it allows the optimization to discover clever ways of using the multilayer network that we didn't think of. In fact, we might never fully understand how it does it. If we just want good solutions to a problem, that might be fine.

So, using the idea of synthetic data, there's a brute-force approach to hand-written digit recognition. LeNet-5 uses knowledge about invariances to design the connectivity, the weight sharing, and the pooling, and that achieves about 80 errors. Adding a lot more tricks, including synthetic data, [UNKNOWN] was able to get that down to about 40 errors. A group in Switzerland, led by [UNKNOWN], went to town on injecting knowledge by putting in synthetic data. They put a lot of work into creating very instructive synthetic data: for every real training case, they transformed it to make many more training examples. They then trained a large net, with many units per layer and many layers, on a graphics processing unit. The GPU gave them a factor of about thirteen in computation, and because of all the synthetic data they put in, it didn't overfit.
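The exact elastic distortions that group used aren't reproduced here; this is a minimal sketch of the general recipe, turning one (made-up) digit image into several extra training cases with small random rotations and shifts:

```python
import numpy as np
from scipy.ndimage import rotate, shift

def synthesize(image, rng, n=10):
    """Turn one real training case into n extra cases using small random
    rotations and translations (a much simpler recipe than the elastic
    distortions used in the work described above)."""
    out = []
    for _ in range(n):
        angle = rng.uniform(-15, 15)                 # degrees
        dx, dy = rng.uniform(-2, 2, size=2)          # pixels
        img = rotate(image, angle, reshape=False, order=1)
        img = shift(img, (dy, dx), order=1)
        out.append(np.clip(img, 0.0, 1.0))
    return np.stack(out)

rng = np.random.default_rng(0)
digit = rng.random((28, 28))                         # stand-in for a real 28x28 digit image
extra = synthesize(digit, rng)
print(extra.shape)                                   # (10, 28, 28)
```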
Without all that synthetic data, just using a large net on a GPU would have been a disaster: it would have overfitted terribly, doing very well on the training data but terribly on the test data. So they were really combining three tricks: put your effort into generating lots of synthetic data, train a large net, and train it on a GPU. That way they managed to achieve 35 errors.

Here are the 35 errors they got. The top printed digit is the right answer, and the two digits below it are the model's top two answers. What you'll notice is that they nearly always get the right answer in their top two; there are only five cases where they don't. With some more work, by building several different models like this and then using a consensus to decide what the digit was, they managed to get down to about 25 errors, and that must be somewhere around the human error rate.

One question this work raises is: how do you tell whether a model that makes 30 errors is really better than a model that makes 40 errors? Is that difference significant? Rather surprisingly, it turns out that it depends on which errors they make; the error counts alone don't give you enough information. You have to know which cases each model got wrong and which it got right. There's a statistical test called the McNemar test that uses the particular errors, and it's far more sensitive than just comparing the totals.

Let me give you an example. If you look at this two-by-two table, it shows you, in the top left-hand corner, how many examples model 1 got wrong and model 2 also got wrong: that's 29. In the bottom right, it shows how many examples model 1 got right and model 2 also got right. In the McNemar test, you simply ignore those numbers shown in black. All you're interested in are the cases where model 1 got it right and model 2 got it wrong, or model 2 got it right and model 1 got it wrong. If you look at those, there's an eleven-to-one ratio, and it turns out that's pretty significant: model 2 is definitely better than model 1, and that almost certainly didn't happen by accident.

By contrast, look at this second table. Model 1 is again making 40 errors and model 2 is making 30 errors, but now model 1 wins fifteen times when model 2 loses, and model 2 wins 25 times when model 1 loses.
That difference is not very significant, so we wouldn't be confident that model 2 is better than model 1.
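As a concrete illustration, here is a minimal sketch of the exact (binomial) form of the McNemar test applied to the two tables just described. Only the discordant counts enter the test, which is why the same 40-versus-30 comparison can be significant in one case and not in the other.

```python
from math import comb

def mcnemar_exact(n01, n10):
    """Exact (binomial) McNemar test.
    n01: cases model 1 got wrong and model 2 got right.
    n10: cases model 1 got right and model 2 got wrong.
    Cases where both models agree (both right or both wrong) are ignored.
    Returns the two-sided p-value under the hypothesis that the two models
    are equally likely to win on the discordant cases."""
    n = n01 + n10
    k = min(n01, n10)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Both comparisons above involve a 40-error model versus a 30-error model.
print(mcnemar_exact(11, 1))    # ~0.006 -> model 2 really is better
print(mcnemar_exact(25, 15))   # ~0.15  -> could easily be chance
```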