So, I've just promised you a lot of cool stuff that you can do with unsupervised learning. Now, let's cover how you actually do this, because otherwise it would be a cheat. As I've mentioned, there are many methods at play here, but let's start from the simplest to understand and the most, sort of, general one: the autoencoder.

Autoencoders are the kind of models that encode the data into a hidden representation and then decode it back. Now, this seems like a weird problem unless you want to compress the data, but trust me, they hold a lot of surprises. As the name suggests, an autoencoder consists of two parts: an encoder and a decoder. If your data is denoted by x, then you can encode x, maybe images, cat images, into a hidden representation enc(x), so that you can then decode it back with the decoder into the original representation. The mathematical objective here is, again, a bit weird: you want to compress the image and then decompress it back so that the decompressed image is as lossless as possible, so that it resembles the initial image in the sense of minimizing, for example, the pixel-wise MSE, the mean squared error, to be accurate.

Now, this is immediately useful when you want to compress the data, but the representation that you learn is also very useful if you want to apply classification or regression methods on top of it. For example, you could take raw image pixels, and you probably know that in most cases gradient boosting, for example, is useless when applied to raw pixels. But instead, you can feed it not the raw pixels, but the hidden representation that you found with the autoencoder.

Well, this is all nice and good, but, in fact, you've already learned some kind of autoencoder if you've studied even the basic topics of machine learning, because you probably already know such things as principal component analysis, singular value decomposition, or maybe non-negative matrix factorization. In fact, those are all familiar to you if you use scikit-learn or caret.
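To make that connection concrete, here is a minimal sketch, assuming scikit-learn, a made-up matrix of flattened images, and an arbitrary code size of 32: PCA already behaves like a linear autoencoder, where transform plays the role of the encoder, inverse_transform plays the role of the decoder, and the objective is exactly the mean squared reconstruction error.

import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data: 1000 flattened grayscale images of 28x28 pixels.
X = np.random.rand(1000, 28 * 28)

pca = PCA(n_components=32)            # size of the hidden representation
codes = pca.fit_transform(X)          # "encoder": pixels -> 32 numbers per image
X_rec = pca.inverse_transform(codes)  # "decoder": 32 numbers -> pixels again

# Reconstruction objective: mean squared error between input and reconstruction.
mse = np.mean((X - X_rec) ** 2)
print(codes.shape, mse)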
But the general idea behind all those methods is that they take a large matrix, usually the object-feature matrix of your dataset, so your pixels go here and each row corresponds to a particular image, and they try to represent this matrix as a product of two or more matrices. For example, a first matrix that maps your data, your full row, into some hidden representation, and a second matrix that maps your hidden representation back into the original pixel-wise representation. You try to learn this couple of matrices, or more depending on your method, to minimize some kind of reconstruction error. For singular value decomposition, one way to do so is to minimize the mean squared error between your original matrix and the product of the two matrices that substitute it.

Now look at this matrix decomposition thing differently. One way to rewrite it is as a process that first takes your data and kind of compresses it, here's the encoder part, compresses it linearly into a hidden representation. The second part then becomes the decoder, which takes your hidden representation and converts it back into pixels, or whatever form the data was in, so as to minimize the mean squared error between what was fed into the network and what emerged from it.

Now, one natural way to extend this, as we usually do with neural networks, is to decide that linear compression and linear decompression are somehow insufficient for us and make them nonlinear, with nonlinear layers of course. So in your encoder, instead of having a single linear transformation, you stick in a few dense layers, or maybe other layers that you've learned about, maybe with some dropout or whatever fancy names you remember, and then your autoencoder becomes nonlinear. And as we probably know, or believe, since the last two weeks, nonlinear representations can be more powerful in the sense that they can learn more abstract features.
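As a rough sketch of what that looks like in code, assuming Keras, flattened 784-pixel inputs, and an arbitrary 32-dimensional code, the nonlinear autoencoder is still just an encoder stacked on a decoder, trained to reproduce its own input under MSE:

from tensorflow import keras
from tensorflow.keras import layers

# Hypothetical setup: flattened 28x28 images, 32-dimensional hidden code.
inputs = keras.Input(shape=(784,))
# Encoder: a few dense layers squeezing the input down to the code.
h = layers.Dense(256, activation="relu")(inputs)
code = layers.Dense(32, activation="relu")(h)
# Decoder: mirror the encoder back up to the original pixel space.
h = layers.Dense(256, activation="relu")(code)
outputs = layers.Dense(784, activation="sigmoid")(h)

autoencoder = keras.Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="mse")
# Trained to reconstruct its own input: autoencoder.fit(X, X, epochs=10)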
And now the question is: imagine your data format is not just an arbitrary set of features, but an image, so there are three channels, RGB, with, say, a 100 by 100 pixel grid. Is there maybe some particular architecture that you can use to compress the data and decompress it thereafter, so that your features have some nice properties that are desirable for images, like being able to shift the same feature one meter to the right and still have this feature recognized?

Yes, right, one way to deal with this is to use convolutional layers, or a convolutional architecture in general. So, on the slide we have this super small architecture with one convolution and one pooling layer, but you could, of course, use a lot of stacked convolutions and poolings, or maybe some residual layers or inception modules, whatever you prefer for a particular problem. The general idea is that anything that maps your input into a hidden representation, and anything that maps it back from the hidden representation into the original one, fits as the model of an autoencoder, provided it's differentiable, of course. And since it's that easy, you can even do without dense layers at all: you can take, say, a convolutional encoder and then go straight to a convolutional decoder. This way your hidden representation stays in a small, image-like format.
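Just to illustrate that last point, here is a minimal sketch, again assuming Keras, where the 100 by 100 input size, the filter counts, and the use of UpSampling2D in the decoder are all placeholder choices; note that the hidden code here is itself a small 50 by 50 feature map rather than a flat vector:

from tensorflow import keras
from tensorflow.keras import layers

# Hypothetical RGB images of size 100x100.
inputs = keras.Input(shape=(100, 100, 3))
# Convolutional encoder: convolution + pooling shrink the spatial size.
x = layers.Conv2D(16, 3, activation="relu", padding="same")(inputs)
x = layers.MaxPooling2D(2)(x)
code = layers.Conv2D(8, 3, activation="relu", padding="same")(x)  # 50x50x8 image-like code
# Convolutional decoder: upsample back to the original resolution.
x = layers.UpSampling2D(2)(code)
outputs = layers.Conv2D(3, 3, activation="sigmoid", padding="same")(x)

conv_autoencoder = keras.Model(inputs, outputs)
conv_autoencoder.compile(optimizer="adam", loss="mse")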