Okay, let's now try to imagine what happens if we expand this to a more practical feature learning problem. Imagine that we have some set of measurements, or features — maybe readings from our robots, or maybe not raw image data but some kind of high-level representation — and we want to learn a supervised problem from it. So we want our encoder not to compress the data, not to reduce its size, but to find a feature representation. There may even be more features than we started with, as long as, for example, the original representation is very convoluted but the hidden one is straightforward for XGBoost to pick up on.

This yields a very intuitive extension: we just make the hidden representation a bit larger, and larger, and even larger, until it gets larger than the original data. From a mathematical perspective this is a totally legitimate model. But you probably, again, smell something fishy here — there is something wrong. Can you guess what?

Well, right. If your network is allowed to maintain a representation which is larger than the initial one, let's remember what the task looks like from the network's perspective. You want my network to take an image which has, say, 1,000 pixels, compress it into 1 million numbers, and then decompress it so that nothing is lost. What do I do? I copy the image. I allocate the first 1,000 of my 1 million features to be an exact copy of the input pixels and propagate them to the decoder, so that my error is zero. This is not what you want your super feature representation to be like, because it's no better than the original one, plus some noise in the additional components.

Of course, you could still fall back to smaller representations, but let's see if we can fix this problem without having to compromise the architecture. One way we can regularize is to add an L1 or L2 penalty: we take the loss function and add to it, say, the absolute values of the hidden activations (L1) or their squares (L2). This time it is the activations we penalize, so a neuron pays a price whenever its activation is larger than zero, and it gets pushed toward exact zero. Now, there's one neat property of this L1 penalty when it is applied to, for example, the weights of a linear model. What happens then? Yeah, exactly: if the regularization is harsh enough, some of the irrelevant features are simply dropped from the model — their weights become exactly zero. That's what happens when you regularize weights. This time, however, we regularize not the weights but the activations, so we zero out not the weights but the activations for a particular sample. This pushes the model into a situation where it benefits, in terms of loss, from zeroing out most of the features for any particular example, so your features become sparse. If everything goes right, the features will still be useful — each feature will activate on some objects — but for any given object, most of the features will be zero. This is questionably desirable, because some classifiers work well with sparse representations and some don't. But if sparse is what you aim at, the sparse autoencoder is your thing; a minimal sketch of this penalty is shown below.
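To make this concrete, here is a minimal sketch of such a sparse autoencoder in Keras — not code from the lecture, just one way to express the idea. The layer sizes (1,000 inputs, a 4,000-unit code), the ReLU activation, and the penalty weight of 1e-4 are illustrative assumptions; the essential part is that the L1 penalty is applied to the hidden activations rather than to the weights.

```python
# A minimal sketch (not lecture code): an overcomplete autoencoder whose hidden
# code is larger than the input, kept honest by an L1 penalty on the activations.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers, regularizers

input_dim, code_dim = 1000, 4000          # the code is larger than the input
inputs = keras.Input(shape=(input_dim,))
# L1 activity regularizer: every nonzero activation adds to the loss, so for any
# given sample most code units are pushed to exactly zero.
code = layers.Dense(code_dim, activation="relu",
                    activity_regularizer=regularizers.l1(1e-4),
                    name="code")(inputs)
outputs = layers.Dense(input_dim, activation="linear")(code)

sparse_ae = keras.Model(inputs, outputs)
sparse_ae.compile(optimizer="adam", loss="mse")

x = np.random.rand(256, input_dim).astype("float32")     # stand-in for real data
sparse_ae.fit(x, x, epochs=5, batch_size=32, verbose=0)   # target = the input itself
```

With a harsh enough penalty, most units of the code end up exactly zero for any given sample, even though the code is larger than the input — which is exactly the sparsity described above.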
Another way to regularize is to use dropout, which is the deep learning way to regularize. And again, here you just drop the features, so that your decoder cannot access all the features from the encoder. This results in the features of your encoder being redundant, just like the neurons in your convolutional network become redundant when you apply dropout. Basically, if the decoder cannot rely on any particular feature, the encoder will learn a few features that are more or less about the same thing, so that if some of them get dropped and the rest are only partially present, the model is still able to reconstruct the data. This way the representation becomes redundant — and whether redundancy is desirable is, again, a matter of taste: some models are okay with it, some aren't.

Now, the most peculiar way you can regularize is to drop out the input data. Basically, you take your image and corrupt it before you feed it into the encoder. Maybe you take a face that you want to encode and decode, and before feeding it in you zero out particular regions — say, the region of the person's right eye. Or maybe you just add random noise to the image, or apply dropout: you take some random pixels and zero them out. What this forces your model to do is extrapolate — it has to guess what would be there in the image. We humans are quite capable when it comes to image extrapolation, because it's quite easy to guess what's behind, well, maybe the hat of a person if he wears a hat, or what's behind glasses — it's obviously eyes. But if you force a network to be as capable as we are, or at least to try, it won't be able to learn the identity mapping, it won't be able to just copy the data, because the data it has access to is imperfect, and you want your network to take the imperfect data and predict the perfect one — to remove the distortion. This is called a denoising autoencoder, and, again, it's about removing noise. The way it operates is exactly the same as the previous two models — it tries to minimize the reconstruction error — but it has this neat input corruption that changes the whole behavior of the model; a minimal sketch of it is shown below.
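Here is a matching sketch of the denoising variant, again only an illustration rather than the lecturer's code. The corruption strengths (Gaussian noise with standard deviation 0.1 and 30% input dropout) are arbitrary assumptions; the important detail is that the corrupted tensor goes into the encoder while the clean input stays as the reconstruction target.

```python
# A minimal sketch (not lecture code): a denoising autoencoder. The input is
# corrupted on the fly — random noise plus randomly zeroed pixels — while the
# reconstruction target stays the clean image, so copying the input cannot win.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

input_dim, code_dim = 1000, 4000
inputs = keras.Input(shape=(input_dim,))
corrupted = layers.GaussianNoise(0.1)(inputs)      # add random noise (train time only)
corrupted = layers.Dropout(0.3)(corrupted)         # zero out random input pixels
code = layers.Dense(code_dim, activation="relu", name="code")(corrupted)
outputs = layers.Dense(input_dim, activation="linear")(code)

denoising_ae = keras.Model(inputs, outputs)
denoising_ae.compile(optimizer="adam", loss="mse")

x = np.random.rand(256, input_dim).astype("float32")
denoising_ae.fit(x, x, epochs=5, batch_size=32, verbose=0)   # clean x as the target
```

Note that both corruption layers are only active during training, so at prediction time the encoder sees the uncorrupted data.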
There are a lot of ways you can compare those approaches, but the general intuition stays the same: the sparse autoencoder gets sparse representations; the redundant (dropout) autoencoder gets features that cover for one another; and the denoising autoencoder gets features that are able to extrapolate even if some pieces of the data are missing, so it's kind of stable to small distortions in the data. You could, for example, visualize the features of those autoencoders, or see how well they generalize to images they have not been shown, and there's more than one study covering this — though, frankly, most of them are not very conclusive: you can look at the filters as long as you want, it won't give you the slightest mathematical proof of whether you should use one model or the other. So again, this lets you train an autoencoder whose representation is richer than the original one. Now let's see how you can use that.

Okay, so imagine you have a problem — the problem of image classification. You have, maybe, photos of warplanes: allied warplanes and enemy warplanes, and you want to distinguish between them, so that the first get escorted and the second get shot down. Unfortunately, you only have, say, 500 pieces of data — 500 labeled pictures — because it's hard to label warplanes when they are shooting at you, maybe. This only leaves you with the possibility of using a pretrained model from the model zoo, or some hand-crafted features, which is even worse. But what you can do instead is take a lot of images of warplanes in general — you don't have to label them, and they could be random warplanes, even from a previous war, maybe — and train the autoencoder on them. So you get some kind of feature representation which is specific to warplanes — maybe not specific to your warplanes, but we'll get to that later.

Okay, so you have this autoencoder, and the encoder part of it is very useful, because it kind of resembles the model that we would use for classification. So again, you take this large chunk out — you slice the model — and you get the pretrained first n−1 layers, everything up to the hidden code, which you can then use with any other model, like gradient boosting. Or you can even stick more layers on top of it and train the whole thing with full backpropagation, like the usual fine-tuning. This gives you a nice feature representation, and you probably already know what to do once you have such a model.

Okay, let's now see how the two kinds of pretraining compare against one another. Supervised pretraining — take a model from the zoo, chop off its head, and fine-tune — works great, but only if you have a relevant supervised learning problem. If you have something that resembles ImageNet, you're golden: you take a model that classified cats against dogs and retrain it to classify particular breeds of foxes, or warplanes where it previously classified trucks. What it does better than unsupervised pretraining is give you some insight into which features are relevant: for example, if the source model classified cats, you are less likely to get features describing the scenery of the image, because the scenery is usually irrelevant for classification. But if your case is having a thousand labeled images of brain scans and a lot of unlabeled ones, while there is no large labeled dataset of similar scans — which is probably true for medicine at this particular moment — you would benefit from autoencoders much more than from supervised pretraining, because there is no similar labeled dataset the model could be pretrained on, and pretraining a brain cancer detector on images of cats is slightly unreasonable. So here it goes: supervised pretraining gives you more insight into what's relevant and what isn't, but it requires a large labeled dataset for a similar problem, and if you don't have one, use unsupervised pretraining. A minimal sketch of the whole unsupervised pipeline follows below.
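Finally, here is a minimal sketch of the pipeline described above: pretrain an autoencoder on plentiful unlabeled images, slice off the encoder, and feed the resulting features to gradient boosting. Everything here — the shapes, the plain single-layer autoencoder, the stand-in random data, and the choice of scikit-learn's GradientBoostingClassifier (XGBoost would be used the same way) — is an assumption made for illustration.

```python
# A minimal sketch (not lecture code): unsupervised pretraining with an
# autoencoder, then gradient boosting on top of the learned features.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers
from sklearn.ensemble import GradientBoostingClassifier

input_dim, code_dim = 1000, 256
inputs = keras.Input(shape=(input_dim,))
code = layers.Dense(code_dim, activation="relu", name="code")(inputs)
outputs = layers.Dense(input_dim, activation="linear")(code)
autoencoder = keras.Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="mse")

# Lots of unlabeled warplane images versus only 500 labeled ones (stand-in data).
x_unlabeled = np.random.rand(5000, input_dim).astype("float32")
x_labeled = np.random.rand(500, input_dim).astype("float32")
y_labeled = np.random.randint(0, 2, size=500)              # 0 = ally, 1 = enemy

# Pretrain on the unlabeled images: reconstruct the input from the code.
autoencoder.fit(x_unlabeled, x_unlabeled, epochs=10, batch_size=64, verbose=0)

# Slice the model: keep only the layers up to the hidden code.
encoder = keras.Model(autoencoder.input, autoencoder.get_layer("code").output)
features = encoder.predict(x_labeled, verbose=0)

# Any downstream model can consume these features.
clf = GradientBoostingClassifier().fit(features, y_labeled)
```

The same `encoder` could instead be extended with a classification head and fine-tuned end to end with backpropagation, which corresponds to the fine-tuning option mentioned above.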