It was always a matter of speculation whether the kinds of nets developed for recognizing handwritten digits could actually be scaled up to what vision people call a real task: recognizing objects in high-resolution color images when the scene is cluttered. That means you have to do things like segmentation, you have to deal with 3D viewpoint, you have to deal with varying lighting, and there are many different objects around so you're not quite sure which is the intended one. Since the start of this course, we've got some interesting new results on that. In my first lecture, I described the network developed by Alex Krizhevsky and showed that it was good at object recognition. But at that point it hadn't been benchmarked against the best computer vision systems. Now it has.

People worked on MNIST for many years, gradually improving the ability of these networks to recognize handwritten digits. Many computer vision researchers thought this was a waste of time if you wanted to be able to recognize real objects in color images, because they thought the lessons learned from MNIST would not generalize to that domain. That was a fairly reasonable thing to think, and here are a number of reasons why it's a much more difficult task. First of all, there are many, many more different kinds of objects. Even if we only recognize a thousand classes, that's still a factor of a hundred. Secondly, there are many more pixels: even if we use down-sampled images that are only 256 by 256 with color pixels, that's still roughly 300 times as many pixels. Another factor is that in real scenes you have to deal with the fact that you've got a two-dimensional image of a three-dimensional reality, so a lot of information is being lost. And real scenes have clutter of a kind that doesn't occur in handwriting. In handwriting you can have overlapping letters, and that requires segmentation, but you don't have things like occlusion of large parts of objects by other opaque objects. You don't have many different kinds of objects in the same scene. And you don't have the lighting variations that you get in real scenes.

So the question is whether the same kind of convolutional neural network that proved to be so good at recognizing handwritten digits will work for real color images. In the domain of real color images we probably do need to wire in some prior knowledge, because if we try to do it in the Ciresan way, with no knowledge wired in and all the knowledge put in by generating extra training examples, the computational problem is still too large for current computers.

So there was a recent competition on a database called ImageNet. ImageNet actually has many more than a million images, but a subset of 1.2 million was chosen, and the classification task was to correctly label those images. Now, the images were hand-labelled with a thousand different classes, but this wasn't very reliable. There could be an image that has two of those thousand different objects in it and only one of them is labeled. So, to make the task feasible, the computer vision system is allowed to make five bets, and it's said to get it right if one of those bets corresponds to the label that a person has given the image.
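To make that scoring rule concrete, here is a minimal numpy sketch of how the "five bets" criterion could be checked. The function name, the array shapes, and the use of raw class scores are my own illustrative assumptions, not the competition's actual evaluation code.

```python
import numpy as np

def top5_error(scores, labels):
    """Fraction of images whose hand-given label is NOT among the
    five highest-scoring classes (the system's five bets).

    scores: (n_images, n_classes) array of class scores
    labels: (n_images,) array of integer class labels
    """
    # indices of the five best-scoring classes for each image
    top5 = np.argsort(scores, axis=1)[:, -5:]
    # an image counts as correct if its label appears among those five bets
    correct = (top5 == labels[:, None]).any(axis=1)
    return 1.0 - correct.mean()

# tiny example: 4 images, 1000 classes, random scores
scores = np.random.rand(4, 1000)
labels = np.array([3, 7, 512, 999])
print(top5_error(scores, labels))
```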
There's also a localization task. The reason for the localization task is that many computer vision systems use a bag-of-features approach: for the whole image, or for, say, a quadrant of the image, they know what the features are, but they don't know where they are. That allows them to recognize objects without knowing exactly where they are, which is very unlike how people behave, except for people with a curious kind of brain damage called Balint's syndrome, who can recognize objects but aren't sure where they are. So for the localization task you have to place a box around an object once you've recognized it, and to get it right your box must have at least a 50% overlap with the correct box.

On this task, people tried some of the best existing computer vision methods. Leading groups from Oxford, the French national research lab INRIA, Xerox's European research centre, and various other universities tried this task and discovered it's very hard. The computer vision systems typically use complicated multi-stage systems. The early stages of these systems are typically hand-tuned by optimizing a few parameters using some of the data, and the top stage is always a learning algorithm. But they don't learn all the way through, in the way that a deep neural net does when it's trained with backpropagation. They don't have end-to-end learning, where the parameters used in the early feature detectors are influenced by how useful they are for making the final decision about classes.

So here are some examples from the test set to show you what the data is like. You already saw some examples in the first lecture, but here are some more. You can see that it's fairly obvious what the object is in the first image, but a lot of it is missing: it doesn't have ears, it doesn't have legs. The predictions shown are the un-normalized probabilities of Alex Krizhevsky's deep neural network, and you can see it's confident that that is a cheetah, and that if it's not a cheetah, it thinks it's almost certainly a leopard. It also understands there are other possibilities, like a snow leopard, although it's the wrong color for a snow leopard, or an Egyptian cat.

Here's an example the other way around: there are many objects in the image, and the object of interest is only a very small fraction of the pixels. The network correctly says bullet train, but it also has other bets, like subway train or electric locomotive, which are reasonable bets. If you look at the image, there are lots of other things that could be labeled, like the roof, which occupies a much larger fraction of the image than the train, or the pillar that's supporting the roof, or the pedestrian, or the large apartment block in the background. In these kinds of images you really have to be able to cope with the fact that there are lots of alternative targets.

The last image shows a different kind of example, where there is no background clutter. The object is quite well isolated, probably a picture from a catalog or something. The network doesn't get it right with its first bet, but it does get it in its top five bets. Here the network isn't confident about anything. These are the relative probabilities, and the network correctly realizes it doesn't really know. If you look at the other possibilities, they're all perfectly plausible: if you screw your eyes up so you can't see the image too well, you can see how it might think it was a frying pan or a stethoscope.
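These examples describe the network's outputs both as un-normalized scores and as relative probabilities. The standard way to turn a vector of un-normalized scores into relative probabilities is a softmax; here is a tiny numpy sketch of that operation as a generic illustration, not a reproduction of the actual network's output layer, and the example scores are made up.

```python
import numpy as np

def softmax(scores):
    """Turn un-normalized class scores into relative probabilities.

    Subtracting the maximum score first does not change the result but
    keeps the exponentials from overflowing.
    """
    z = scores - np.max(scores)
    e = np.exp(z)
    return e / e.sum()

# hypothetical scores for (cheetah, leopard, snow leopard, Egyptian cat)
print(softmax(np.array([9.1, 6.2, 2.0, 1.5])))
```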
So how did the systems do on this data? Here are the error rates for the computer vision systems, and one thing you'll notice is that the best systems are all very similar. The University of Tokyo managed to get 26.1%, and here I'm just reporting the best system from each group. Oxford University, which has a very good computer vision group, generally recognized as possibly the best group in Europe, again got in the 26 percents, and the French national research lab INRIA, together with Xerox's European research centre, again very good computer vision groups, got 27%. So you'll guess from this that it's going to be hard to beat 26%, and if you do beat 26% you're comparable with the very best computer vision systems. Alex Krizhevsky's neural net got 16% error. That's a huge gap. Normally in these competitions you don't see big gaps like that.

So Alex Krizhevsky's network works like this. It's a very deep convolutional neural net of the type pioneered by Yann LeCun, which was first used for digit recognition and which Yann later applied to recognizing real objects. It uses all the lessons learned by Yann's group and various other groups in developing these deep neural nets for doing real vision. It has seven hidden layers, which is deeper than usual, and that's not counting some of the max-pooling layers. The early layers are convolutional. We could probably get away with using just local receptive fields, without tying any weights, if we had a much bigger computer; but by making them convolutional you cut down the number of parameters a lot, so you cut down the amount of training data you need a lot, which cuts down the amount of computation time a lot. The last two layers were globally connected, and that's where most of the parameters are. I think there are about sixteen million parameters between each pair of those layers. What the last two layers are doing is looking for combinations of the local features that were extracted by the early layers, and obviously there are combinatorially many combinations to look for, and that's why you need a lot of parameters there.

The activation functions were rectified linear units in every hidden layer. These train much faster than logistic units and they're more expressive. Most people seriously applying deep neural networks to real images for object recognition have now switched to rectified linear units. We also used competitive normalization within a layer to suppress the activity of a unit if other units that are looking at nearby localities are very active. This helps a lot with variations in intensity. So you might have an edge detector which gets somewhat active due to some fairly faint edge, and that's pretty much irrelevant if there are much more intense things around.

There are other tricks that we used to significantly improve the generalization of this net. First of all, we used the trick of enhancing the data with transformations. The images in the competition were down-sampled to 256 by 256, but instead of using those whole images, Alex Krizhevsky took random 224 by 224 patches from them, which gave him hugely more images to train on and helped him deal with translation invariance. Even though they're convolutional nets, that's still a help. He also used left-right reflections of the images, which again doubled the amount of data. He didn't use up-down reflections, because gravity is very important: left-right reflections don't really change what things look like much, unless they're things like writing. At test time, he doesn't just use one patch. He uses a number of different patches: the four corners and the middle, which gives him five, and then the left-right reflections of all of those, which gives him ten.
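Here is a rough numpy sketch of that cropping and reflection scheme. The function names and the exact way the random crop and the flips are sampled are my own illustrative choices, not a description of Alex's code.

```python
import numpy as np

def random_training_patch(img, size=224):
    """Take a random size-by-size crop from a 256 x 256 x 3 image, and
    flip it left-right half the time (up-down flips are not used)."""
    h, w, _ = img.shape
    top = np.random.randint(0, h - size + 1)
    left = np.random.randint(0, w - size + 1)
    patch = img[top:top + size, left:left + size]
    if np.random.rand() < 0.5:
        patch = patch[:, ::-1]          # left-right reflection only
    return patch

def ten_test_patches(img, size=224):
    """The four corner crops plus the centre crop, and the left-right
    reflection of each: ten patches whose predictions get combined."""
    h, w, _ = img.shape
    ct, cl = (h - size) // 2, (w - size) // 2
    offsets = [(0, 0), (0, w - size), (h - size, 0), (h - size, w - size), (ct, cl)]
    crops = [img[t:t + size, l:l + size] for t, l in offsets]
    return crops + [c[:, ::-1] for c in crops]

img = np.random.rand(256, 256, 3)
print(random_training_patch(img).shape, len(ten_test_patches(img)))
```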
He runs all ten patches through the network and then combines their opinions. In the top layers, where most of the parameters are, he uses a new regularization technique called drop-out, which is very effective and stops the network overfitting. That's worth several percent in his results. I'll describe drop-out at some length in a later lecture, but for now the basic idea is that each time you present a training example, you omit half the hidden units from a layer. This means that the other hidden units in that layer, the survivors, can't rely on their comrades being present. They can't learn to fix up the errors left over by the other hidden units in that layer, because the other hidden units might not be there, and then they'd be fixing up an error that doesn't exist. So they have to become more individualist. They have to individually do useful things, but they still have to do useful things that are different from what the other survivors do. So drop-out is stopping too much cooperation between the hidden units. A lot of cooperation is very good for fitting the training data, but if the test distribution is significantly different, then all that cooperation causes over-fitting.

Alex couldn't have done this work without significant hardware, but the hardware only costs a few thousand dollars now. Alex is a very good programmer, and he used a very efficient implementation of convolutional neural nets on two Nvidia GTX 580 graphics processors. Each of these has over 500 fast little cores, which are very good at doing arithmetic and not much good at anything else. The GPUs are very good at doing matrix-matrix multiplies. So if you stack together the vectors of activities of a hidden layer over many training cases, that gives you a matrix, and you multiply that by the matrix of weights to figure out the activities in the next hidden layer for all those training cases. If both those matrices are big, the GPUs give you a huge advantage, about a factor of 30. They also have very high bandwidth to memory, and that's needed for neural nets, because in neural nets you keep wanting another weight so that you can multiply it by an activity, and there are millions of these weights, so you can't keep them all in the cache. Using all that hardware, he could train his final network in a week, and he could also combine the results from the ten different patches at test time very quickly, so at test time you can run it at just about frame rate.
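Here is a small numpy illustration of those two points: stacking the activities of many training cases into a matrix so that a layer becomes one big matrix-matrix multiply, and drop-out omitting half of a layer's hidden units each time a case is presented. This is a minimal sketch of the ideas, with my own function name and shapes, not Alex's GPU implementation.

```python
import numpy as np

def hidden_layer_forward(x, W, b, drop_prob=0.5, training=True):
    """One rectified-linear hidden layer with drop-out.

    x: (n_cases, n_in) activities for a batch of training cases
    W: (n_in, n_hidden) weights, b: (n_hidden,) biases
    """
    # stacking the cases into a matrix turns the layer into one big
    # matrix-matrix multiply, which is what the GPU cores are good at
    h = np.maximum(0.0, x @ W + b)
    if training:
        # each time a case is presented, omit each hidden unit with
        # probability drop_prob, so survivors can't rely on their comrades
        mask = np.random.rand(*h.shape) > drop_prob
        h = h * mask
    else:
        # at test time all units are present, so scale their activities
        # down to match their expected value during training
        h = h * (1.0 - drop_prob)
    return h

x = np.random.rand(128, 4096)            # 128 cases, 4096 inputs
W = np.random.randn(4096, 4096) * 0.01   # roughly sixteen million weights
b = np.zeros(4096)
print(hidden_layer_forward(x, W, b).shape)   # (128, 4096)
```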
In the future we're going to be able to spread this kind of network over a large number of cores. As cores become cheaper, people at Google are already experimenting with that, and if we can communicate the states fast enough, we're going to be able to run much bigger networks on many more cores. Google has already simulated networks with 1.7 billion connections, and I think it's only going to get bigger. As the cores get cheaper and the data sets get bigger, these big deep neural nets are going to improve much faster than the old-fashioned computer vision systems, because they don't involve much hand engineering and they can make very good use of huge data sets and huge amounts of computation. So the fact that we've already opened up a big gap, I think, means there's no looking back. I think from now on all the best object recognition systems, at least for static images, will use big deep neural nets.

There are other application domains where we've learned the same lesson. Vlad Mnih used a net with local fields, but without convolution, to extract roads from aerial images. These are cluttered aerial images of urban scenes. Again, he uses multiple layers of rectified linear units. He takes a relatively large image patch and predicts, for the central 16 by 16 pixels, whether each of those pixels is a piece of road or not. The nice thing about this task is that there's a lot of labeled training data available. That's because maps tell you where the centre lines of roads are, and roads are roughly fixed width, so from the vectors in the map that tell you where the centre line of the road is, you can estimate which pixels are probably road.

Nevertheless, the task is very hard. There are the normal kinds of vision problems: roads are occluded by buildings, because the plane isn't looking straight down when it takes the photograph; they're occluded by trees; they're also occluded by cars that are sitting on the road. There are shadow effects from buildings, there are major lighting changes depending on whether it's a sunny day or a cloudy day, for example, and there are minor viewpoint changes. The plane is basically looking downwards, but in any large photo it can't be looking straight down at every pixel. The worst problems in this data are the incorrect labels. You get incorrect labels because the maps aren't perfectly registered. For most purposes, you don't need a map to be registered better than a few meters. The pixels are about one meter square in this data, and so if the registration of the map is off by three meters, you're going to get at least three of the labels wrong for the pixels across every road. Another severe problem is that the people making maps have to make arbitrary decisions about what counts as a road and what counts as a laneway. So in many of the maps, you look at something and you've no idea whether it's going to be considered a road or a laneway, and so you simply don't know what label it's going to get from the map. Big neural nets trained on big image patches, using millions of examples, are, I think, the only real hope for doing a good job at this task. It's very hard to find out what people can do.

So here is what the data looks like. This is a part of Toronto; if you know Toronto, you can tell that by the angle of the roads. Above the image of that part of Toronto, I've put two patches extracted from the image, and if you look at those patches, you can see it's not trivial to tell which the road pixels are. On the right is the output of Vlad's system. Green means correctly identified pixels of road, and red means things that his system thought might be road but actually aren't. Actually, that thing is a parking lot, but you can see why the system might have thought it was a road.
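As a rough sketch of the patch-based setup just described, here is how one training pair might be cut out in numpy: a larger context patch of the image, and the road/not-road labels for its central 16 by 16 pixels. The 64-pixel context size, the function name, and the way the road mask is produced here are my own assumptions for illustration; only the 16 by 16 target comes from the lecture.

```python
import numpy as np

def make_training_pair(image, road_mask, top, left, context=64, target=16):
    """Cut one (context x context) image patch and the (target x target)
    block of road / not-road labels at its centre.

    image:     (H, W, 3) aerial photo
    road_mask: (H, W) 0/1 labels estimated from the map's road centre
               lines and a roughly fixed road width
    """
    margin = (context - target) // 2
    patch = image[top:top + context, left:left + context]
    labels = road_mask[top + margin:top + margin + target,
                       left + margin:left + margin + target]
    # the net maps the patch to 16 x 16 road probabilities, one per central pixel
    return patch, labels

img = np.random.rand(500, 500, 3)
mask = (np.random.rand(500, 500) > 0.9).astype(np.float32)  # fake road mask
p, y = make_training_pair(img, mask, top=100, left=200)
print(p.shape, y.shape)   # (64, 64, 3) (16, 16)
```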