Okay, let's now try to imagine what happens if we expand this to a more practical feature learning problem. Imagine that we have some set of measurements, or features — maybe readings from our robots, or maybe not raw image data but some kind of high-level representation — and we want to learn a supervised problem from it. So we want our encoder not to compress the data, not to reduce its size, but to find a feature representation. There may even be more features than we started with, as long as, for example, the original representation is very convoluted but the hidden one is straightforward for XGBoost to pick up on.

This yields a very intuitive extension: we just make the hidden representation a bit larger, and larger, and even larger, until it gets larger than the original data. From a mathematical perspective this is a totally legitimate model. But you probably, again, smell something fishy here — there is something wrong. Can you guess what?

Well, right. If your network is allowed to maintain a representation which is larger than the initial one, let's remember what the task looks like from the network's perspective. You want my network to take an image which has, say, 1,000 pixels, compress it into 1 million numbers, and then decompress it so that nothing is lost. What do I do? I copy the image. I allocate the first 1,000 of my 1 million features to be an exact copy of the input pixels and propagate them to the decoder, so that my error is zero. This is not what you want your super feature representation to be like, because it's no better than the original one, plus some noise in the additional components.

Of course, you could still fall back to smaller representations, but let's see if we can fix this problem without having to compromise the architecture. One way we can regularize is to add an L1 or L2 penalty: we take the loss function and add to it, say, the absolute values of the hidden activations (L1) or their squares (L2). This time it is the activations we penalize, so a neuron pays a price whenever its activation is larger than zero, and it gets pushed toward exact zero. Now, there's one neat property of this L1 penalty when it is applied to, for example, the weights of a linear model. What happens then? Yeah, exactly: if the regularization is harsh enough, some of the irrelevant features are simply dropped from the model — their weights become exactly zero. That's what happens when you regularize weights. This time, however, we regularize not the weights but the activations, so we zero out not the weights but the activations for a particular sample. This pushes the model into a situation where it benefits, in terms of loss, from zeroing out most of the features for any particular example, so your features become sparse. If everything goes right, the features will still be useful — each feature will activate on some objects — but for any given object, most of the features will be zero. This is questionably desirable, because some classifiers work well with sparse representations and some don't. But if sparse is what you aim at, the sparse autoencoder is your thing; a minimal sketch of this penalty is shown below.
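To make this concrete, here is a minimal sketch of such a sparse autoencoder in Keras — not code from the lecture, just one way to express the idea. The layer sizes (1,000 inputs, a 4,000-unit code), the ReLU activation, and the penalty weight of 1e-4 are illustrative assumptions; the essential part is that the L1 penalty is applied to the hidden activations rather than to the weights.

```python
# A minimal sketch (not lecture code): an overcomplete autoencoder whose hidden
# code is larger than the input, kept honest by an L1 penalty on the activations.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers, regularizers

input_dim, code_dim = 1000, 4000          # the code is larger than the input
inputs = keras.Input(shape=(input_dim,))
# L1 activity regularizer: every nonzero activation adds to the loss, so for any
# given sample most code units are pushed to exactly zero.
code = layers.Dense(code_dim, activation="relu",
                    activity_regularizer=regularizers.l1(1e-4),
                    name="code")(inputs)
outputs = layers.Dense(input_dim, activation="linear")(code)

sparse_ae = keras.Model(inputs, outputs)
sparse_ae.compile(optimizer="adam", loss="mse")

x = np.random.rand(256, input_dim).astype("float32")     # stand-in for real data
sparse_ae.fit(x, x, epochs=5, batch_size=32, verbose=0)   # target = the input itself
```

With a harsh enough penalty, most units of the code end up exactly zero for any given sample, even though the code is larger than the input — which is exactly the sparsity described above.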
Another way to regularize is to use dropout, which is the deep learning way to regularize. And again, here you just drop the features, so that your decoder cannot access all the features from the encoder. This results in the features of your encoder being redundant, just like the neurons in your convolutional network become redundant when you apply dropout. Basically, if the decoder cannot rely on any particular feature, the encoder will learn a few features that are more or less about the same thing, so that if some of them get dropped and the rest are only partially present, the model is still able to reconstruct the data. This way the representation becomes redundant — and whether redundancy is desirable is, again, a matter of taste: some models are okay with it, some aren't.

Now, the most peculiar way you can regularize is to drop out the input data. Basically, you take your image and corrupt it before you feed it into the encoder. Maybe you take a face that you want to encode and decode, and before feeding it in you zero out particular regions — say, the region of the person's right eye. Or maybe you just add random noise to the image, or apply dropout: you take some random pixels and zero them out. What this forces your model to do is extrapolate — it has to guess what would be there in the image. We humans are quite capable when it comes to image extrapolation, because it's quite easy to guess what's behind, well, maybe the hat of a person if he wears a hat, or what's behind glasses — it's obviously eyes. But if you force a network to be as capable as we are, or at least to try, it won't be able to learn the identity mapping, it won't be able to just copy the data, because the data it has access to is imperfect, and you want your network to take the imperfect data and predict the perfect one — to remove the distortion. This is called a denoising autoencoder, and, again, it's about removing noise. The way it operates is exactly the same as the previous two models — it tries to minimize the reconstruction error — but it has this neat input corruption that changes the whole behavior of the model; a minimal sketch of it is shown below.
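Here is a matching sketch of the denoising variant, again only an illustration rather than the lecturer's code. The corruption strengths (Gaussian noise with standard deviation 0.1 and 30% input dropout) are arbitrary assumptions; the important detail is that the corrupted tensor goes into the encoder while the clean input stays as the reconstruction target.

```python
# A minimal sketch (not lecture code): a denoising autoencoder. The input is
# corrupted on the fly — random noise plus randomly zeroed pixels — while the
# reconstruction target stays the clean image, so copying the input cannot win.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

input_dim, code_dim = 1000, 4000
inputs = keras.Input(shape=(input_dim,))
corrupted = layers.GaussianNoise(0.1)(inputs)      # add random noise (train time only)
corrupted = layers.Dropout(0.3)(corrupted)         # zero out random input pixels
code = layers.Dense(code_dim, activation="relu", name="code")(corrupted)
outputs = layers.Dense(input_dim, activation="linear")(code)

denoising_ae = keras.Model(inputs, outputs)
denoising_ae.compile(optimizer="adam", loss="mse")

x = np.random.rand(256, input_dim).astype("float32")
denoising_ae.fit(x, x, epochs=5, batch_size=32, verbose=0)   # clean x as the target
```

Note that both corruption layers are only active during training, so at prediction time the encoder sees the uncorrupted data.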
There are a lot of ways you can compare those approaches, but the general intuition stays the same: the sparse autoencoder gets sparse representations; the redundant (dropout) autoencoder gets features that cover for one another; and the denoising autoencoder gets features that are able to extrapolate even if some pieces of the data are missing, so it's kind of stable to small distortions in the data. You could, for example, visualize the features of those autoencoders, or see how well they generalize to images they have not been shown, and there's more than one study covering this — though, frankly, most of them are not very conclusive: you can look at the filters as long as you want, it won't give you the slightest mathematical proof of whether you should use one model or the other. So again, this lets you train an autoencoder whose representation is richer than the original one. Now let's see how you can use that.

Okay, so imagine you have a problem — the problem of image classification. You have, maybe, photos of warplanes: allied warplanes and enemy warplanes, and you want to distinguish between them, so that the first get escorted and the second get shot down. Unfortunately, you only have, say, 500 pieces of data — 500 labeled pictures — because it's hard to label warplanes when they are shooting at you, maybe. This only leaves you with the possibility of using a pretrained model from the model zoo, or some hand-crafted features, which is even worse. But what you can do instead is take a lot of images of warplanes in general — you don't have to label them, and they could be random warplanes, even from a previous war, maybe — and train the autoencoder on them. So you get some kind of feature representation which is specific to warplanes — maybe not specific to your warplanes, but we'll get to that later.

Okay, so you have this autoencoder, and the encoder part of it is very useful, because it kind of resembles the model that we would use for classification. So again, you take this large chunk out — you slice the model — and you get the pretrained first n−1 layers, everything up to the hidden code, which you can then use with any other model, like gradient boosting. Or you can even stick more layers on top of it and train the whole thing with full backpropagation, like the usual fine-tuning. This gives you a nice feature representation, and you probably already know what to do once you have such a model.

Okay, let's now see how the two kinds of pretraining compare against one another. Supervised pretraining — take a model from the zoo, chop off its head, and fine-tune — works great, but only if you have a relevant supervised learning problem. If you have something that resembles ImageNet, you're golden: you take a model that classified cats against dogs and retrain it to classify particular breeds of foxes, or warplanes where it previously classified trucks. What it does better than unsupervised pretraining is give you some insight into which features are relevant: for example, if the source model classified cats, you are less likely to get features describing the scenery of the image, because the scenery is usually irrelevant for classification. But if your case is having a thousand labeled images of brain scans and a lot of unlabeled ones, while there is no large labeled dataset of similar scans — which is probably true for medicine at this particular moment — you would benefit from autoencoders much more than from supervised pretraining, because there is no similar labeled dataset the model could be pretrained on, and pretraining a brain cancer detector on images of cats is slightly unreasonable. So here it goes: supervised pretraining gives you more insight into what's relevant and what isn't, but it requires a large labeled dataset for a similar problem, and if you don't have one, use unsupervised pretraining. A minimal sketch of the whole unsupervised pipeline follows below.
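Finally, here is a minimal sketch of the pipeline described above: pretrain an autoencoder on plentiful unlabeled images, slice off the encoder, and feed the resulting features to gradient boosting. Everything here — the shapes, the plain single-layer autoencoder, the stand-in random data, and the choice of scikit-learn's GradientBoostingClassifier (XGBoost would be used the same way) — is an assumption made for illustration.

```python
# A minimal sketch (not lecture code): unsupervised pretraining with an
# autoencoder, then gradient boosting on top of the learned features.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers
from sklearn.ensemble import GradientBoostingClassifier

input_dim, code_dim = 1000, 256
inputs = keras.Input(shape=(input_dim,))
code = layers.Dense(code_dim, activation="relu", name="code")(inputs)
outputs = layers.Dense(input_dim, activation="linear")(code)
autoencoder = keras.Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="mse")

# Lots of unlabeled warplane images versus only 500 labeled ones (stand-in data).
x_unlabeled = np.random.rand(5000, input_dim).astype("float32")
x_labeled = np.random.rand(500, input_dim).astype("float32")
y_labeled = np.random.randint(0, 2, size=500)              # 0 = ally, 1 = enemy

# Pretrain on the unlabeled images: reconstruct the input from the code.
autoencoder.fit(x_unlabeled, x_unlabeled, epochs=10, batch_size=64, verbose=0)

# Slice the model: keep only the layers up to the hidden code.
encoder = keras.Model(autoencoder.input, autoencoder.get_layer("code").output)
features = encoder.predict(x_labeled, verbose=0)

# Any downstream model can consume these features.
clf = GradientBoostingClassifier().fit(features, y_labeled)
```

The same `encoder` could instead be extended with a classification head and fine-tuned end to end with backpropagation, which corresponds to the fine-tuning option mentioned above.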