Okay, let's now try to imagine what happens if we expand this to a more practical feature learning problem. Imagine that we have some set of measurements, or features (maybe measurements from our robots, or not raw image data but some kind of high-level representation), and we want to learn some supervised problem from it. So we want our encoder not to compress the data, not to reduce its size, but to find a better feature representation. There may even be more features than before, as long as, for example, the original feature representation is very convoluted but the resulting hidden one is straightforward for XGBoost to take advantage of.

Basically, this yields a very intuitive extension: we can just make this hidden representation a bit larger, and larger, and even larger, until it gets larger than the original data. From a mathematical perspective this is a totally legitimate model, but you probably, again, smell something fishy here; there is something wrong. Can you guess what? Well, right. Let's remember what the problem looks like from the network's perspective once it is allowed to maintain a representation larger than the initial one. You want the network to take an image which has, say, 1,000 pixels, "compress" it into 1 million numbers, and then decompress it so that nothing is lost. What does it do? It copies the image. It simply allocates the first 1,000 of its 1 million features to be an exact copy of the input pixels and propagates them to the decoder, so the reconstruction error is zero. This is not what you want your super feature representation to be like, because it's no better than the original one, plus some noise in the additional components.

Of course, you could still work with representations like that, but let's see if we can fix this problem without having to compromise the architecture. One way to regularize is to add some kind of L1 or L2 penalty: we take the loss function and add to it, say, the absolute values of the activations. Note that this time it's the activations we penalize, not the weights: we punish a neuron whenever its activation is larger than zero and push it towards zero. Now, recall the one neat property this L1 penalty has when you apply it to, for example, the weights of a linear model. What happens? Yeah, exactly: if the regularization is harsh enough, some of the irrelevant features are simply dropped from the model. That is what happens when you penalize weights. This time, however, you are going to regularize not weights but activations, so you zero out not the weights but the activations for a particular sample. This creates a situation where your model benefits, in terms of loss, from zeroing out most of the features for any particular example, so your features become sparse. If everything goes right, your features will still be useful, so each feature will activate on some objects, but for any given object most of the features are going to be zero. This is, well, questionably desirable, because some classifiers work well with sparse representations and some don't. But if sparse is what you aim at, the sparse autoencoder is your thing.
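To make this concrete, here is a minimal sketch of such a sparsity penalty in PyTorch. The lecture does not prescribe any framework; the layer sizes, the ReLU encoder, and the penalty weight below are all illustrative assumptions:

```python
import torch
import torch.nn as nn

# Overcomplete autoencoder: more hidden units (4,000) than inputs (1,000).
encoder = nn.Sequential(nn.Linear(1000, 4000), nn.ReLU())
decoder = nn.Linear(4000, 1000)

l1_weight = 1e-4  # sparsity strength; a hyperparameter you would tune

def sparse_ae_loss(x):
    h = encoder(x)                          # hidden activations
    x_hat = decoder(h)                      # reconstruction
    mse = nn.functional.mse_loss(x_hat, x)  # reconstruction error
    l1 = h.abs().mean()                     # L1 penalty on activations, not weights
    return mse + l1_weight * l1

# One training step on a dummy batch standing in for real features.
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()))
x = torch.randn(32, 1000)
opt.zero_grad()
sparse_ae_loss(x).backward()
opt.step()
```

Pushing `l1_weight` higher drives more activations to exactly zero per sample; set it to zero and you are back to the plain overcomplete autoencoder that can simply copy its input.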
Another way to regularize is to use dropout, which is like the deep learning way to regularize. Again, you just drop features, so that your decoder cannot access all the features from the encoder. This results in the features of your encoder becoming redundant, just like the neurons in your convolutional network when you apply dropout there. Basically, if the decoder cannot rely on any particular feature, the encoder will learn a few features that are more or less about the same thing, so that if they are only partially present, if some of them get dropped, your model is still able to reconstruct the data. This way the representation becomes redundant, and redundancy is again a questionably desirable property: some of us are okay with it, some of us aren't.

Now, the most peculiar way you can regularize is to drop out the input data. Basically, you take your image, and before you feed it into the encoder, you corrupt it. Maybe you take a face that you want to encode and decode, and before feeding it in, you zero out particular regions: say, you zero out the right eye of the person. Or maybe you just add random noise to the image, or apply dropout, taking some random pixels and zeroing them out. What this forces your model to do is extrapolate: it has to guess what would be there in the image. We humans are quite capable when it comes to image extrapolation, because it's quite easy for us to guess what's behind, say, the hat of a person who wears one, or what's behind glasses (obviously, eyes). If you force the network to be as capable as we are, or at least to try, it won't be able to learn the identity mapping; it won't be able to just copy the data, because the data it has access to is imperfect, and you want the network to take the imperfect data and predict the perfect version. So we want it to remove the distortion. This is called a denoising autoencoder, and, again, it's all about removing noise. The way it operates is exactly the same as in the previous two models: it tries to minimize the reconstruction error, but it has this neat input dropout that changes the whole behavior of the model.
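Here is a sketch of that corruption step, reusing the `encoder` and `decoder` from the earlier sketch; the drop probability and noise level are made-up hyperparameters:

```python
import torch

def corrupt(x, drop_prob=0.3, noise_std=0.1):
    # Zero out random pixels (dropout on the input) and add Gaussian noise.
    mask = (torch.rand_like(x) > drop_prob).float()
    return x * mask + noise_std * torch.randn_like(x)

def denoising_loss(x_clean):
    # The model only sees the corrupted input but is scored against the
    # clean one, so copying the input can no longer reach zero error.
    x_hat = decoder(encoder(corrupt(x_clean)))
    return torch.nn.functional.mse_loss(x_hat, x_clean)
```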
There are a lot of ways you can compare those approaches, but the general intention stays the same. The sparse autoencoder gets you sparse representations. The redundant autoencoder gets you features that cover for one another. And the denoising autoencoder gets you features that are able to extrapolate even when some pieces of the data are missing, so it's kind of stable to small distortions in the data. You could, for example, visualize the learned features of those autoencoders and see how well they generalize to images they have not been shown, and there's more than one study covering this. Unfortunately for us, most of them don't settle anything: you can look at the filters as long as you want, but that won't give you the slightest mathematical proof of whether you should use one model or the other. So, again, what we have is a way to train an encoder whose representation is richer than the original one. Now let's see how you can use that.

Okay, so imagine you have a problem: image classification. You have, maybe, photos of warplanes, allied warplanes and enemy warplanes, and you want to distinguish between them, so that the first get escorted and the second get shot down. Unfortunately, you only have, say, 500 pieces of data, 500 labelled pictures, because it's hard to label warplanes when they are shooting at you, maybe. This would normally leave you with the option of using a pretrained model from a model zoo, or some hand-crafted features, which is even worse. But what you can do instead is take a lot of images of warplanes in general; you don't have to label them, and they could be random warplanes, even from a previous war. And you can train the autoencoder on them. This gets you some kind of feature representation which is specific to warplanes. Maybe not specific to warplanes only, but we'll come back to that later.

Okay, so you have this autoencoder, and the encoder part of it is very useful, because it kind of resembles the model we would use for classification. So you take this large chunk, you slice the model, and you get the pretrained first n-1 layers that you can then use with any other model, like gradient boosting.
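For instance, here is a minimal sketch of that slicing, with the frozen encoder from the earlier sketch feeding XGBoost (the library the lecture names); the data shapes and labels are stand-ins:

```python
import torch
from xgboost import XGBClassifier

# Stand-ins for the 500 labelled warplane photos, flattened to 1,000 features.
x_labeled = torch.randn(500, 1000)
y_labeled = torch.randint(0, 2, (500,)).numpy()  # 0 = allied, 1 = enemy

# Freeze the encoder trained on the unlabeled images and use it as a
# fixed feature extractor for the small labelled set.
with torch.no_grad():
    feats = encoder(x_labeled).numpy()

clf = XGBClassifier(n_estimators=200)
clf.fit(feats, y_labeled)
```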
Or you can even stick more layers on top of the encoder and train the whole thing with full backpropagation, which is regular fine-tuning. Now, this gives you a nice feature representation, but you probably already know what to do once you have a model.
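A sketch of that fine-tuning route, under the same illustrative assumptions (two classes, a single linear head, a deliberately small learning rate to avoid destroying the pretrained weights):

```python
import torch
import torch.nn as nn

# Put a small classification head on top of the pretrained encoder and
# fine-tune everything end to end on the labelled images.
head = nn.Linear(4000, 2)              # two classes: allied vs. enemy
model = nn.Sequential(encoder, head)

opt = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

# One supervised step on a dummy labelled batch.
x = torch.randn(32, 1000)
y = torch.randint(0, 2, (32,))
opt.zero_grad()
loss_fn(model(x), y).backward()
opt.step()
```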
Okay, let's now see how these approaches compare against one another. With supervised pretraining, you take a model from the zoo, something trained on ImageNet, cut off the head, and fine-tune it. This works great, but only if you have a relevant supervised learning problem. If you have something that resembles ImageNet, you're golden: you take the model that classified cats against dogs and retrain it to classify particular breeds of dogs, or warplanes where it previously classified, say, trucks. What it allows you to do better than unsupervised pretraining is that it gives you some insight into which features are relevant: for example, if the original model classified cats, its features are more likely to be about the cat than about the scenery, because the scenery is usually useless for classification.

So if your case is having a thousand labelled images of brain scans and a lot of unlabeled ones, and there is no large labelled dataset of similar scans, which is probably true for medicine at this particular moment, you would benefit from autoencoders much more than from supervised pretraining, because there is no large labelled dataset the supervised model could be pretrained on, and pretraining a brain cancer detector on images of cats is slightly unreasonable. So here it goes: supervised pretraining gives you more insight into what's relevant and what isn't, but it requires a lot of labelled data that solves similar problems, and if you don't have that, use unsupervised pretraining.