Okay, let's now try to imagine what happens if we expand this to a more practical feature learning problem. Imagine that we have some set of measurements, or features (maybe measurements from our robots, or not raw image data but some kind of high-level representation), and we want to learn some supervised problem from it. So we want our encoder not to compress the data, not to reduce its size, but to find a better feature representation. There may even be more features than before, as long as, for example, the original feature representation is very convoluted but the resulting hidden one is straightforward for XGBoost to take advantage of.

Basically, this yields a very intuitive extension: we can just make this hidden representation a bit larger, and larger, and even larger, until it gets larger than the original data. From a mathematical perspective this is a totally legitimate model, but you probably, again, smell something fishy here; there is something wrong. Can you guess what? Well, right. Let's remember what the problem looks like from the network's perspective once it is allowed to maintain a representation larger than the initial one. You want the network to take an image which has, say, 1,000 pixels, "compress" it into 1 million numbers, and then decompress it so that nothing is lost. What does it do? It copies the image. It simply allocates the first 1,000 of its 1 million features to be an exact copy of the input pixels and propagates them to the decoder, so the reconstruction error is zero. This is not what you want your super feature representation to be like, because it's no better than the original one, plus some noise in the additional components.

Of course, you could still work with representations like that, but let's see if we can fix this problem without having to compromise the architecture. One way to regularize is to add some kind of L1 or L2 penalty: we take the loss function and add to it, say, the absolute values of the activations. Note that this time it's the activations we penalize, not the weights: we punish a neuron whenever its activation is larger than zero and push it towards zero. Now, recall the one neat property this L1 penalty has when you apply it to, for example, the weights of a linear model. What happens? Yeah, exactly: if the regularization is harsh enough, some of the irrelevant features are simply dropped from the model. That is what happens when you penalize weights. This time, however, you are going to regularize not weights but activations, so you zero out not the weights but the activations for a particular sample. This creates a situation where your model benefits, in terms of loss, from zeroing out most of the features for any particular example, so your features become sparse. If everything goes right, your features will still be useful, so each feature will activate on some objects, but for any given object most of the features are going to be zero. This is, well, questionably desirable, because some classifiers work well with sparse representations and some don't. But if sparse is what you aim at, the sparse autoencoder is your thing.
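To make this concrete, here is a minimal sketch of such a sparsity penalty in PyTorch. The lecture does not prescribe any framework; the layer sizes, the ReLU encoder, and the penalty weight below are all illustrative assumptions:

```python
import torch
import torch.nn as nn

# Overcomplete autoencoder: more hidden units (4,000) than inputs (1,000).
encoder = nn.Sequential(nn.Linear(1000, 4000), nn.ReLU())
decoder = nn.Linear(4000, 1000)

l1_weight = 1e-4  # sparsity strength; a hyperparameter you would tune

def sparse_ae_loss(x):
    h = encoder(x)                          # hidden activations
    x_hat = decoder(h)                      # reconstruction
    mse = nn.functional.mse_loss(x_hat, x)  # reconstruction error
    l1 = h.abs().mean()                     # L1 penalty on activations, not weights
    return mse + l1_weight * l1

# One training step on a dummy batch standing in for real features.
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()))
x = torch.randn(32, 1000)
opt.zero_grad()
sparse_ae_loss(x).backward()
opt.step()
```

Pushing `l1_weight` higher drives more activations to exactly zero per sample; set it to zero and you are back to the plain overcomplete autoencoder that can simply copy its input.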
Another way to regularize is to use dropout, which is like the deep learning way to regularize. Again, you just drop features, so that your decoder cannot access all the features from the encoder. This results in the features of your encoder becoming redundant, just like the neurons in your convolutional network when you apply dropout there. Basically, if the decoder cannot rely on any particular feature, the encoder will learn a few features that are more or less about the same thing, so that if they are only partially present, if some of them get dropped, your model is still able to reconstruct the data. This way the representation becomes redundant, and redundancy is again a questionably desirable property: some of us are okay with it, some of us aren't.

Now, the most peculiar way you can regularize is to drop out the input data. Basically, you take your image, and before you feed it into the encoder, you corrupt it. Maybe you take a face that you want to encode and decode, and before feeding it in, you zero out particular regions: say, you zero out the right eye of the person. Or maybe you just add random noise to the image, or apply dropout, taking some random pixels and zeroing them out. What this forces your model to do is extrapolate: it has to guess what would be there in the image. We humans are quite capable when it comes to image extrapolation, because it's quite easy for us to guess what's behind, say, the hat of a person who wears one, or what's behind glasses (obviously, eyes). If you force the network to be as capable as we are, or at least to try, it won't be able to learn the identity mapping; it won't be able to just copy the data, because the data it has access to is imperfect, and you want the network to take the imperfect data and predict the perfect version. So we want it to remove the distortion. This is called a denoising autoencoder, and, again, it's all about removing noise. The way it operates is exactly the same as in the previous two models: it tries to minimize the reconstruction error, but it has this neat input dropout that changes the whole behavior of the model.
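Here is a sketch of that corruption step, reusing the `encoder` and `decoder` from the earlier sketch; the drop probability and noise level are made-up hyperparameters:

```python
import torch

def corrupt(x, drop_prob=0.3, noise_std=0.1):
    # Zero out random pixels (dropout on the input) and add Gaussian noise.
    mask = (torch.rand_like(x) > drop_prob).float()
    return x * mask + noise_std * torch.randn_like(x)

def denoising_loss(x_clean):
    # The model only sees the corrupted input but is scored against the
    # clean one, so copying the input can no longer reach zero error.
    x_hat = decoder(encoder(corrupt(x_clean)))
    return torch.nn.functional.mse_loss(x_hat, x_clean)
```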
There are a lot of ways you can compare those approaches, but the general intention stays the same. The sparse autoencoder gets you sparse representations. The redundant autoencoder gets you features that cover for one another. And the denoising autoencoder gets you features that are able to extrapolate even when some pieces of the data are missing, so it's kind of stable to small distortions in the data. You could, for example, visualize the learned features of those autoencoders and see how well they generalize to images they have not been shown, and there's more than one study covering this. Unfortunately for us, most of them don't settle anything: you can look at the filters as long as you want, but that won't give you the slightest mathematical proof of whether you should use one model or the other. So, again, what we have is a way to train an encoder whose representation is richer than the original one. Now let's see how you can use that.

Okay, so imagine you have a problem: image classification. You have, maybe, photos of warplanes, allied warplanes and enemy warplanes, and you want to distinguish between them, so that the first get escorted and the second get shot down. Unfortunately, you only have, say, 500 pieces of data, 500 labelled pictures, because it's hard to label warplanes when they are shooting at you, maybe. This would normally leave you with the option of using a pretrained model from a model zoo, or some hand-crafted features, which is even worse. But what you can do instead is take a lot of images of warplanes in general; you don't have to label them, and they could be random warplanes, even from a previous war. And you can train the autoencoder on them. This gets you some kind of feature representation which is specific to warplanes. Maybe not specific to warplanes only, but we'll come back to that later.

Okay, so you have this autoencoder, and the encoder part of it is very useful, because it kind of resembles the model we would use for classification. So you take this large chunk, you slice the model, and you get the pretrained first n-1 layers that you can then use with any other model, like gradient boosting.
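For instance, here is a minimal sketch of that slicing, with the frozen encoder from the earlier sketch feeding XGBoost (the library the lecture names); the data shapes and labels are stand-ins:

```python
import torch
from xgboost import XGBClassifier

# Stand-ins for the 500 labelled warplane photos, flattened to 1,000 features.
x_labeled = torch.randn(500, 1000)
y_labeled = torch.randint(0, 2, (500,)).numpy()  # 0 = allied, 1 = enemy

# Freeze the encoder trained on the unlabeled images and use it as a
# fixed feature extractor for the small labelled set.
with torch.no_grad():
    feats = encoder(x_labeled).numpy()

clf = XGBClassifier(n_estimators=200)
clf.fit(feats, y_labeled)
```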
Or you can even stick more layers on top of the encoder and train the whole thing with full backpropagation, which is regular fine-tuning. Now, this gives you a nice feature representation, but you probably already know what to do once you have a model.
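A sketch of that fine-tuning route, under the same illustrative assumptions (two classes, a single linear head, a deliberately small learning rate to avoid destroying the pretrained weights):

```python
import torch
import torch.nn as nn

# Put a small classification head on top of the pretrained encoder and
# fine-tune everything end to end on the labelled images.
head = nn.Linear(4000, 2)              # two classes: allied vs. enemy
model = nn.Sequential(encoder, head)

opt = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

# One supervised step on a dummy labelled batch.
x = torch.randn(32, 1000)
y = torch.randint(0, 2, (32,))
opt.zero_grad()
loss_fn(model(x), y).backward()
opt.step()
```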
Okay, let's now see how these approaches compare against one another. With supervised pretraining, you take a model from the zoo, something trained on ImageNet, cut off the head, and fine-tune it. This works great, but only if you have a relevant supervised learning problem. If you have something that resembles ImageNet, you're golden: you take the model that classified cats against dogs and retrain it to classify particular breeds of dogs, or warplanes where it previously classified, say, trucks. What it allows you to do better than unsupervised pretraining is that it gives you some insight into which features are relevant: for example, if the original model classified cats, its features are more likely to be about the cat than about the scenery, because the scenery is usually useless for classification.

So if your case is having a thousand labelled images of brain scans and a lot of unlabeled ones, and there is no large labelled dataset of similar scans, which is probably true for medicine at this particular moment, you would benefit from autoencoders much more than from supervised pretraining, because there is no large labelled dataset the supervised model could be pretrained on, and pretraining a brain cancer detector on images of cats is slightly unreasonable. So here it goes: supervised pretraining gives you more insight into what's relevant and what isn't, but it requires a lot of labelled data that solves similar problems, and if you don't have that, use unsupervised pretraining.