So, I've just promised you a lot of cool stuff that you can do with unsupervised learning. Now, let's cover how you actually do this, because otherwise it would be a cheat. As I've mentioned, there are many methods at play here, but let's start from the simplest to understand and the most, sort of, general one: the autoencoder.

Autoencoders are the kind of models that encode the data into a hidden representation and then decode it back. Now, this seems like a weird problem unless you want to compress the data, but trust me, they hold a lot of surprises. As the name suggests, an autoencoder consists of two parts: an encoder and a decoder. If your data is denoted by x, then you can encode x, maybe images, cat images, into a hidden representation enc(x), so that you can then decode it back with the decoder into the original representation. The mathematical objective here is, again, a bit weird: you want to compress the image and then decompress it back so that the decompressed image is as lossless as possible, so that it resembles the initial image in the sense of minimizing, for example, the pixel-wise MSE, the mean squared error, to be accurate.

Now, this is immediately useful when you want to compress the data, but the representation that you learn is also very useful if you want to apply classification or regression methods on top of it. For example, you could take raw image pixels, and you probably know that in most cases gradient boosting, for example, is useless when applied to raw pixels. But instead, you can feed it not the raw pixels, but the hidden representation that you found with the autoencoder.

Well, this is all nice and good, but, in fact, you've already learned some kind of autoencoder if you've studied even the basic topics of machine learning, because you probably already know such things as principal component analysis, singular value decomposition, or maybe non-negative matrix factorization. In fact, those are all familiar to you if you use scikit-learn or caret.
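To make that connection concrete, here is a minimal sketch, assuming scikit-learn, a made-up matrix of flattened images, and an arbitrary code size of 32: PCA already behaves like a linear autoencoder, where transform plays the role of the encoder, inverse_transform plays the role of the decoder, and the objective is exactly the mean squared reconstruction error.

import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data: 1000 flattened grayscale images of 28x28 pixels.
X = np.random.rand(1000, 28 * 28)

pca = PCA(n_components=32)            # size of the hidden representation
codes = pca.fit_transform(X)          # "encoder": pixels -> 32 numbers per image
X_rec = pca.inverse_transform(codes)  # "decoder": 32 numbers -> pixels again

# Reconstruction objective: mean squared error between input and reconstruction.
mse = np.mean((X - X_rec) ** 2)
print(codes.shape, mse)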
But the general idea behind all those methods is that they take a large matrix, usually the object-feature matrix of your dataset, so your pixels go here and each row corresponds to a particular image, and they try to represent this matrix as a product of two or more matrices. For example, a first matrix that maps your data, your full row, into some hidden representation, and a second matrix that maps your hidden representation back into the original pixel-wise representation. You try to learn this couple of matrices, or more depending on your method, to minimize some kind of reconstruction error. For singular value decomposition, one way to do so is to minimize the mean squared error between your original matrix and the product of the two matrices that substitute it.

Now look at this matrix decomposition thing differently. One way to rewrite it is as a process that first takes your data and kind of compresses it, here's the encoder part, compresses it linearly into a hidden representation. The second part then becomes the decoder, which takes your hidden representation and converts it back into pixels, or whatever form the data was in, so as to minimize the mean squared error between what was fed into the network and what emerged from it.

Now, one natural way to extend this, as we usually do with neural networks, is to decide that linear compression and linear decompression are somehow insufficient for us and make them nonlinear, with nonlinear layers of course. So in your encoder, instead of having a single linear transformation, you stick in a few dense layers, or maybe other layers that you've learned about, maybe with some dropout or whatever fancy names you remember, and then your autoencoder becomes nonlinear. And as we probably know, or believe, since the last two weeks, nonlinear representations can be more powerful in the sense that they can learn more abstract features.
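As a rough sketch of what that looks like in code, assuming Keras, flattened 784-pixel inputs, and an arbitrary 32-dimensional code, the nonlinear autoencoder is still just an encoder stacked on a decoder, trained to reproduce its own input under MSE:

from tensorflow import keras
from tensorflow.keras import layers

# Hypothetical setup: flattened 28x28 images, 32-dimensional hidden code.
inputs = keras.Input(shape=(784,))
# Encoder: a few dense layers squeezing the input down to the code.
h = layers.Dense(256, activation="relu")(inputs)
code = layers.Dense(32, activation="relu")(h)
# Decoder: mirror the encoder back up to the original pixel space.
h = layers.Dense(256, activation="relu")(code)
outputs = layers.Dense(784, activation="sigmoid")(h)

autoencoder = keras.Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="mse")
# Trained to reconstruct its own input: autoencoder.fit(X, X, epochs=10)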
And now the question is: imagine your data format is not just an arbitrary set of features, but an image, so there are three channels, RGB, with, say, a 100 by 100 pixel grid. Is there maybe some particular architecture that you can use to compress the data and decompress it thereafter, so that your features have some nice properties that are desirable for images, like being able to shift the same feature one meter to the right and still have this feature recognized?

Yes, right, one way to deal with this is to use convolutional layers, or a convolutional architecture in general. So, on the slide we have this super small architecture with one convolution and one pooling layer, but you could, of course, use a lot of stacked convolutions and poolings, or maybe some residual layers or inception modules, whatever you prefer for a particular problem. The general idea is that anything that maps your input into a hidden representation, and anything that maps it back from the hidden representation into the original one, fits as the model of an autoencoder, provided it's differentiable, of course. And since it's that easy, you can even do without dense layers at all: you can take, say, a convolutional encoder and then go straight to a convolutional decoder. This way your hidden representation stays in a small, image-like format.
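Just to illustrate that last point, here is a minimal sketch, again assuming Keras, where the 100 by 100 input size, the filter counts, and the use of UpSampling2D in the decoder are all placeholder choices; note that the hidden code here is itself a small 50 by 50 feature map rather than a flat vector:

from tensorflow import keras
from tensorflow.keras import layers

# Hypothetical RGB images of size 100x100.
inputs = keras.Input(shape=(100, 100, 3))
# Convolutional encoder: convolution + pooling shrink the spatial size.
x = layers.Conv2D(16, 3, activation="relu", padding="same")(inputs)
x = layers.MaxPooling2D(2)(x)
code = layers.Conv2D(8, 3, activation="relu", padding="same")(x)  # 50x50x8 image-like code
# Convolutional decoder: upsample back to the original resolution.
x = layers.UpSampling2D(2)(code)
outputs = layers.Conv2D(3, 3, activation="sigmoid", padding="same")(x)

conv_autoencoder = keras.Model(inputs, outputs)
conv_autoencoder.compile(optimizer="adam", loss="mse")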