In this video, I'm going to talk about convolutional neural networks for handwritten digit recognition. This was one of the big success stories of neural networks in the 1980s. The deep convolutional nets developed by Yann LeCun and his collaborators did a really good job of recognizing handwriting and were actually used in practice. They're one of the few examples from that period of deep neural nets that it was possible to train on the computers that existed then, and that performed really well.

Convolutional neural networks are based on the idea of replicated features. Because objects move around and show up on different pixels, if we have a feature detector that's useful in one place in the image, it's likely that the same feature detector will be useful somewhere else. So the idea is to build many different copies of the same feature detector at all the different positions. If you look on the right, I've shown you three feature detectors, which are replicas of each other. Each of them has weights to nine pixels, and those weights are identical between the three different feature detectors. So the red arrow has the same weight on it for all three feature detectors, and when we learn, we keep those red arrows all having the same weight as each other, and we keep the green arrows all having the same weight as each other, even though the red and green arrows will have different weights. We could also try replicating across scale and orientation, but that's much more difficult and expensive, and probably not a good idea. Replication across position greatly reduces the number of free parameters that you have to learn: the 27 connections you see going into those three replicated detectors only have nine different weights.

Now, we don't just want to use one feature type, so we're going to have many maps. Each map will have replicas of the same feature, features that are constrained to be identical in different places, and different maps will learn to detect different features. This allows each patch of the image to be represented by features of many different types.

Replicated features fit in nicely with backpropagation; that is, it's easy to learn them using backpropagation. In fact, it's easy to modify the backpropagation algorithm to incorporate any linear constraint between the weights. What we do is compute the gradients as usual, but then we modify the gradients so that if the weights satisfied the linear constraint before the weight update, they'll also satisfy the linear constraint after the weight update. The simplest example is when we want two weights to be equal: we want w1 to equal w2. That stays true if we start off with w1 equal to w2 and then make sure the change in w1 is always equal to the change in w2. The way we do that is to compute the gradient of the error with respect to w1 and the gradient with respect to w2, and then use the sum or average of those two gradients for both w1 and w2. By using weight constraints like that, we can force backpropagation to learn replicated feature detectors.
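To make the gradient-tying step concrete, here is a minimal numpy sketch, not from the lecture: the toy linear model, the data, and the learning rate are all made up purely to illustrate the idea of computing the two gradients as usual and then applying their average to both weights, so that weights which start equal stay equal.

```python
import numpy as np

# Toy illustration of learning with a linear weight constraint (w1 == w2).
# The model and data are invented just to show the gradient-tying step.

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))          # two inputs per example
y = 3.0 * (X[:, 0] + X[:, 1])          # targets generated with equal weights

w = np.array([0.5, 0.5])               # start with w1 == w2 so the constraint holds
lr = 0.1

for step in range(200):
    pred = X @ w
    err = pred - y
    # Compute the gradients as usual.
    grad = X.T @ err / len(y)          # gradient w.r.t. w1 and w2 separately
    # Then modify them: use the average of the two gradients for both weights,
    # so if w1 == w2 before the update, w1 == w2 after the update as well.
    tied = grad.mean()
    w -= lr * np.array([tied, tied])

print(w)   # both weights stay equal and converge towards 3.0
```

The same trick extends to any set of tied weights: average (or sum) the gradients of all the copies and apply the result to every copy, which is exactly what learning a replicated feature detector amounts to.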
There's quite a lot of confusion in the literature about what replicated feature detectors are actually achieving. Many people claim they're achieving translation invariance, and that's not true, at least not in the activities of the neurons. If you look at the activities, what replicated features achieve is equivariance, not invariance. An example should make that clear. Here's an image, and the black dots are the activated neurons. Here's a translated image, and notice that the black dots have also translated. So the image changed, and the representation changed by just as much as the image. That's equivariance, not invariance. There is something that is invariant, and that's the knowledge: if you learn replicated feature detectors, then once you know how to detect a feature in one place, you know how to detect that same feature in another place. So we're achieving equivariance in the activities and invariance in the weights.

If you want to achieve some invariance in the activities, what you need to do is pool the outputs of replicated feature detectors. You can get a small amount of translation invariance at each level of a deep net by averaging four neighboring replicated detectors. One advantage of this is that it reduces the number of inputs to the next layer, so we can afford more different maps, allowing us to learn more different kinds of features in the next layer. It actually works slightly better to take the maximum of four neighboring feature detectors rather than an average. But there is a problem: after several levels of this kind of pooling, we've lost precise information about where things are. That's okay if we just want to recognize that it's a face. The fact that we've got a couple of eyes, a nose, and a mouth floating about in vaguely the right positions is very good evidence that it's a face. But if you want to recognize whose face it is, you need to use the precise spatial relationships between the eyes, the nose, and the mouth, and that's been lost by these convolutional neural nets. I'll come back to that issue later on.
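Here is a small numpy sketch, again not from the lecture (the 3x3 edge detector and the toy image are made up), that illustrates both points: a replicated feature detector gives a feature map that translates with the image, and max pooling over 2x2 neighbourhoods of those detectors hands the next layer a smaller map in which precise positions are only known coarsely.

```python
import numpy as np

def detect(image, kernel):
    """Apply the same 3x3 feature detector at every position (a replicated feature)."""
    H, W = image.shape
    out = np.zeros((H - 2, W - 2))
    for i in range(H - 2):
        for j in range(W - 2):
            out[i, j] = np.sum(image[i:i + 3, j:j + 3] * kernel)
    return out

def max_pool(fmap, size=2):
    """Take the maximum over non-overlapping size x size neighbourhoods."""
    H, W = fmap.shape
    H, W = H - H % size, W - W % size
    return fmap[:H, :W].reshape(H // size, size, W // size, size).max(axis=(1, 3))

# A made-up vertical-edge detector and a tiny image containing one bright square.
kernel = np.array([[1., 0., -1.],
                   [1., 0., -1.],
                   [1., 0., -1.]])
image = np.zeros((10, 10))
image[2:5, 2:5] = 1.0
shifted = np.roll(image, 2, axis=1)      # the same image, translated 2 pixels right

f_orig, f_shift = detect(image, kernel), detect(shifted, kernel)

# Equivariance: translating the image translates the feature map by the same amount.
print(np.allclose(np.roll(f_orig, 2, axis=1), f_shift))   # True for this image

# Pooling: each output is the max over a 2x2 neighbourhood of replicated detectors,
# so the map passed to the next layer is smaller and positions are only known coarsely.
print(f_orig.shape, max_pool(f_orig).shape)                # (8, 8) -> (4, 4)
```

Stacking several of these pooling stages is what gradually discards the precise spatial relationships mentioned above.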
So the first impressive example of a convolutional neural net was done by Yann LeCun and his collaborators, who developed a really good recognizer for handwritten digits. It had many hidden layers, and in each layer it had many maps of replicated units. It had pooling between layers, so you pool adjacent replicated units before you send them to the next layer. They also used a wide net that could cope with several characters at once, and that would work even if the characters overlapped, so you didn't have to segment out individual characters before you fed them to the net. And something people often forget is that they used a clever way of training a complete system. They weren't just training a recognizer for individual characters; they were training a complete system, so that you put in pixels at one end and you get out whole zip codes at the other end. In training that system, they used a method that would now be called maximum margin, but when they did it, it was way before maximum margin had been invented. The net they used was at one point responsible for reading about ten percent of the checks in North America, so it was of great practical value. There are some very nice demos on Yann's web page. You should really go and look at them, all of them, because they show you just how well it copes with variations in size, orientation, position, overlap of digits, and all sorts of background noise that would kill most methods.

The architecture of LeNet-5 looks like this. There's an input, which is pixels, and then there's a whole sequence of feature maps followed by subsampling. In C1 there are six different feature maps, each of which is 28 by 28. Those maps contain small features that each look at, I think, three by three pixels, and their weights are constrained to be the same, so per map there are only about nine parameters. That makes learning much more efficient, and it means you need much less data. After the feature maps there's what they call subsampling, which is now called pooling: you pool together the outputs of a bunch of neighboring replicated features in C1, and that gives you a smaller map, which then provides the input to the next layer, which discovers more complicated replicated features. As you go up this hierarchy, you get features that are more complicated but more invariant to position.

Here are the errors that LeNet-5 made, which show you that the data it's dealing with is quite tricky. There are 10,000 test cases, and these are the 82 errors that it makes, so it's doing better than 99% correct. Nevertheless, most of the errors it makes are on things that people find quite easy to recognize, so there's some way to go still. Nobody knows the human error rate on this data, but it's probably 20 to 30 errors. Of course, there might be digits that LeNet-5 got right and you would get wrong, so you have to be careful in estimating the human error rate: you can't just look at these 82 and ask which ones you'd get right and which ones you'd get wrong; you also have to worry about all the other ones that LeNet-5 might have got right and you might have got wrong.

I now want to make a very general point about how to inject prior knowledge into machine learning, and it applies particularly to neural networks. We can put in prior knowledge, as is done in LeNet-5, through the design of the network: we can use local connectivity, we can use weight constraints, or we can choose neuron activation functions that are particularly appropriate for the task we're doing. This is much less intrusive than trying to hand-engineer the features, but it still prejudices the network towards the particular way of solving the problem that we had in mind. We have an idea about how to do object recognition, by gradually making bigger and bigger features and by replicating those features across space, and we force the network to do it that way.

There is an alternative way to put in prior knowledge that gives the network a much freer hand: we can use our prior knowledge to generate a whole lot more training data. One of the first examples of this was work by Hofmann and Tresp on modeling what happens in a steel mill. They wanted to know the relationship between what comes out of the steel mill and various input variables, and they actually had a big old Fortran simulator of the steel mill. Of course, the simulator wasn't reality; it was making all sorts of approximations. So they had real data and also a simulator, and what they did was run the simulator to create some synthetic data. They then added that to the real data and showed that they could do better than by using the real data alone. If I remember right, their great big Fortran simulator was only worth a few dozen extra real examples, but nevertheless they made the point. Of course, if you generate a lot of synthetic data, it may make learning take much longer. So in terms of the speed of learning, it's much more efficient to put in knowledge by using things like connectivity and weight constraints, as was done in LeNet-5. But as computers get faster, this other way of putting in knowledge, by generating synthetic examples, begins to look better and better. In particular, it allows the optimization to discover clever ways of using the multilayer network that we didn't think of. In fact, we might never fully understand how it does it. If we just want good solutions to a problem, that might be fine.
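As a rough illustration of this way of injecting prior knowledge, here is a short sketch, assuming SciPy is available; the particular transformations, shift sizes, and angles are just illustrative, far simpler than what was actually used. It turns one labelled digit image into several synthetic training cases by applying transformations we believe don't change the label.

```python
import numpy as np
from scipy.ndimage import shift, rotate

def augment(image, label):
    """Turn one labelled digit image into several, using transformations that
    we assume do not change its label: small shifts and small rotations."""
    extras = []
    for dy, dx in [(-1, 0), (1, 0), (0, -1), (0, 1)]:
        moved = shift(image, (dy, dx), order=1, mode='constant', cval=0.0)
        extras.append((moved, label))
    for angle in (-10, 10):
        turned = rotate(image, angle, reshape=False, order=1, mode='constant', cval=0.0)
        extras.append((turned, label))
    return extras

# A made-up 8x8 "digit" just to exercise the function; real input would be
# 28x28 MNIST-style images.
digit = np.zeros((8, 8))
digit[1:7, 3:5] = 1.0            # a crude vertical stroke, labelled as a "1"

synthetic = augment(digit, label=1)
print(len(synthetic))            # 6 extra training cases from one real one
```

Each real case yields several extra cases that carry the same label, which is exactly the sense in which prior knowledge about invariances is being turned into data.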
Using the idea of synthetic data, there's a brute-force approach to handwritten digit recognition. LeNet-5 uses knowledge about invariances to design the connectivity, the weight sharing, and the pooling, and that achieves about 80 errors. Adding a lot more tricks, including synthetic data, Ranzato was able to get that down to about 40 errors. Then a group in Switzerland, led by Jurgen Schmidhuber, went to town on injecting knowledge through synthetic data. They put a lot of work into creating very instructive synthetic data: for every real training case, they applied transformations to make many more training examples. They then trained a large net, with many units per layer and many layers, on a graphics processor unit. The GPU gave them a factor of about 13 more computation, and because of all the synthetic data they put in, the net didn't overfit. If they had just used a large net on a GPU without the synthetic data, it would have overfitted terribly: it would have done well on the training data but terribly on the test data. So they were really combining three tricks: put a lot of effort into generating synthetic data, train a large net, and do it on a GPU. They managed to achieve 35 errors that way.

Here are the 35 errors that they got. The top printed digit is the right answer, and the bottom two digits are the model's top two answers. What you'll notice is that it nearly always gets the right answer in its top two; there are only five cases where it doesn't. With some more work, building several different models like this and then using a consensus to decide what the digit was, they managed to get down to about 25 errors, and that must be somewhere around the human error rate.

One question this work raises is: how do you tell whether a model that makes 30 errors is really better than a model that makes 40 errors? Is that a significant difference? Rather surprisingly, it turns out that it depends on which errors they make. The numbers alone don't provide enough information; you have to know which cases each model gets wrong and which it gets right. There's a statistical test called the McNemar test that uses the particular errors and is far more sensitive than just comparing the totals. Let me give you an example. If you look at this two-by-two table, it shows you, in the top left-hand corner, how many examples model 1 got wrong and model 2 also got wrong: that's 29. And in the bottom right, it shows how many examples model 1 got right and model 2 also got right. In the McNemar test, you can just ignore those numbers in black. All you're interested in are the cases where model 1 got it right and model 2 got it wrong, or model 2 got it right and model 1 got it wrong. If you look at those, there's an eleven-to-one ratio, and it turns out that's pretty significant: model 2 is definitely better than model 1, and that almost certainly didn't happen by accident. By contrast, if you look at this second table, model 1 is again making 40 errors and model 2 is making 30 errors, but now model 1 wins fifteen times when model 2 loses, and model 2 wins 25 times when model 1 loses. That difference is not very significant, so we wouldn't be confident that model 2 is better than model 1.
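Here is a small sketch of the exact McNemar (sign) test applied to the two comparisons above. It only uses the discordant counts, the cases that one model gets right and the other gets wrong, and asks how surprising the split would be if the two models were really equally good.

```python
from math import comb

def mcnemar_exact(b, c):
    """Exact McNemar (sign) test.

    b = cases model 1 got right and model 2 got wrong,
    c = cases model 2 got right and model 1 got wrong.
    Cases that both models got right, or both got wrong, are ignored.
    Returns the two-sided p-value under the null that the models are equally good.
    """
    n, k = b + c, min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# The two comparisons from the lecture (discordant counts only):
print(mcnemar_exact(1, 11))    # ~0.006: model 2 really is better
print(mcnemar_exact(15, 25))   # ~0.15: not convincing evidence of a difference
```

Under the null hypothesis, each discordant case is equally likely to favour either model, so the split of discordant cases follows a binomial distribution with p = 1/2; that is all the test uses, which is why it is more sensitive than comparing the total error counts.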