[MUSIC] Welcome back. We just studied how deep learning works and how to train your neural networks, and with some luck you've already made it through the practical assignments. So you basically know that deep learning can help you when one of your models just doesn't cut it. And you probably hope that this will repeat itself on other problems. To a large extent this is true, but today let's talk about where it isn't, and about what deep learning is not. You already know some of these things; now let's talk about some of the limitations.

The one thing deep learning is not is magic. It won't just solve all the problems for you. It won't be a silver bullet that you can just unpack and hope that it does much better than anything you have tried previously for years. This is what a lot of people expect from neural networks, but please don't, because it won't solve your problems for free.

Instead, deep learning is just a practical field. It has its strengths, and we'll talk about them in the second part. But it also has its weak points. For one thing, deep learning lacks a core theoretical understanding. That sounds like a lame accusation to make about a practical field, since the absence of a theory obviously isn't preventing it from working. The problem is that when you try to build an architecture, to develop something new for a model, the absence of a theoretical core that could explain things for you forces you into a lot more experimentation. In effect, deep learning only offers you intuitions: "this works", "this idea kind of applies wherever you have this situation", and so on. But those intuitive rules of thumb are not 100% accurate, and that is a problem if you want to develop something new.

Another problem is that, in order to capture complex dependencies, neural networks and deep learning models in general have a lot of parameters. This not only means that they can capture complex dependencies in the data, but also that they can overfit tremendously. So for any problem, if you use a neural network, you generally need a much larger dataset to train on than you would with linear models or decision trees.
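To get a feel for the scale involved, here is a back-of-the-envelope parameter count in Python; the image size and layer widths below are made-up, illustrative numbers, not anything from this course.

```python
# Rough parameter counts: a linear model vs. a small fully-connected network,
# both taking a flattened 100x100 grayscale image as input (illustrative sizes).

n_inputs = 100 * 100          # flattened image pixels
hidden_sizes = [256, 128]     # assumed hidden layer widths
n_outputs = 1                 # a single predicted value

# Linear model: one weight per input plus a bias.
linear_params = n_inputs * n_outputs + n_outputs

# Fully-connected network: weights plus biases for every layer.
mlp_params = 0
prev = n_inputs
for width in hidden_sizes + [n_outputs]:
    mlp_params += prev * width + width
    prev = width

print(f"linear model: {linear_params:,} parameters")  # 10,001
print(f"small MLP:    {mlp_params:,} parameters")     # 2,593,281
```

Hundreds of times more parameters than a linear model is exactly why such a network can both fit richer patterns and overfit a small dataset.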
Whenever you end up in some new area which is not image classification or text processing, sometimes you'll find out that for practical reasons it's better to use decision trees or even linear models.

And finally, deep learning models are computationally heavy. Whenever you want your machine learning to run super fast or to require as little memory as possible, say if you're running on smartphones or embedded systems, you'll generally have to do some, again, dark magic to make your neural network run as fast as you require. This isn't true for, say, linear models, which apply almost instantaneously.

There's one more disadvantage, and it's a hard one to fix, although it has some strong points: deep learning is pathologically overhyped. Basically machine learning, the super-domain of deep learning, is overhyped as well, but deep learning is the most advertised, the hottest topic within the hottest area of mathematics, which is machine learning. This is good, because deep learning attracts a lot of talented researchers and talented practitioners. The problem is that, since it's so hyped, a lot of people expect wonders from it. So sometimes, if you're trying to apply deep learning in business, you'll find yourself in the company of people who don't understand deep learning. They believe it's some super artificial intelligence, big data, blah blah, yada yada, that will get you to the top position in your business and solve all your problems for you. So not only should you not expect deep learning to work wonders by itself, because wonders, as you know, require a lot of hard work, you also have to fight with other people who expect otherwise.

Now, all those arguments paint a rather grim picture of what deep learning is, but there are a lot of positive sides to it as well. For one, you can think of deep learning as a kind of machine learning language. Like any language, it is a tool to express something. A natural language is a tool with which you can express yourself, at least to other humans, and a programming language is a means to express what you want your computer to do in a way that the computer can execute.
Deep learning, in turn, is a language that allows you to hint to your machine learning model what you want it to learn: hints about what kind of features you want it to have, and what kind of expert knowledge can be applied to this dataset.

Let's go through a few examples to prove this point. Say you have a usual prediction problem with two sets of features: raw low-level features and high-level features, and you want to predict some kind of target given them. This whole thing is beginning to sound a little abstract, so let's get to a concrete scenario. Say you want to run a regression on the price of a car, a second-hand car to be accurate. You have, well, a photo of the car, and some high-level features like the brand, the model, maybe the production date, and any blemishes and enhancements installed on this car.

What you want to do is build a model that uses both of those feature types, and the simplest way to do so is to just concatenate them and feed the whole thing into your model, a neural network for example. Of course you can do that, but the problem is that this approach is kind of inefficient. If we speak about neural networks, the resulting model would look like this, for example. The main problem with this model is that the first dense layer tries to combine two worlds, two domains of features, and it tries to combine them linearly. So what it does is take the age, measured in years or months, multiply it by some coefficient, and add it up with a pixel intensity. That's technically possible, I mean no one will punish you for doing so unless there's a physicist nearby, but it's kind of unnatural, and in practical applications this architecture tends to work worse than it otherwise could.

What you can do instead is say the following thing in this language: you want to build a representation of those raw features which is as complex as the high-level ones. The way you express this is by, well, adding more layers. Basically you now have two branches of data, and for some amount of time you process them independently, roughly as in the sketch below.
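Here is a minimal sketch of that two-branch idea in Keras, just to make it concrete; the layer widths and activations are assumptions for illustration, not the exact architecture from the lecture.

```python
# Two-branch model: the raw pixels get their own stack of dense layers before
# they are allowed to mix with the hand-crafted car attributes.
from tensorflow import keras
from tensorflow.keras import layers

image_pixels = keras.Input(shape=(100 * 100,), name="raw_pixels")       # flattened photo
car_attributes = keras.Input(shape=(100,), name="high_level_features")  # brand, age, ...

# Process the raw pixels independently for a few layers first,
# so they become higher-level features before mixing.
x = layers.Dense(256, activation="relu")(image_pixels)
x = layers.Dense(128, activation="relu")(x)
x = layers.Dense(64, activation="relu")(x)

# Only now combine the learned image features with the high-level attributes.
combined = layers.Concatenate()([x, car_attributes])
combined = layers.Dense(64, activation="relu")(combined)
price = layers.Dense(1, name="price")(combined)

model = keras.Model(inputs=[image_pixels, car_attributes], outputs=price)
model.compile(optimizer="adam", loss="mse")
```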
You take those raw features and apply dense layers, maybe two or three stacked dense layers, that only extract features from the raw image pixels. And only then, once you've got features like the presence of a blemish, or maybe a crack in the windshield, or anything like that, only then do you combine those features with the high-level features you've got. This makes slightly more sense, although it's still not the perfect model. Generally, stacking more layers to extract features also gives you more abstract kinds of features, and if you stack enough layers, you'll eventually get features that are easy to combine.

Let's now consider a similar, although slightly different, problem. This time we're still solving car price regression, but we also want to infuse another piece of prior knowledge. Say that, based on some external information we've got, we don't want our network to trust the image data too enthusiastically. For example, I might be unwilling to trust the car dealers that much: some of their images have been shown to be too optimistic, showing the car in a better condition than the actual one. By default, our network does the exact opposite: it trusts the images too much, because there are, say, 10,000 image pixels, 100 by 100 pixels, and only, say, 100 attributes among the high-level features. So we want to do the opposite.

You can of course achieve this by means of usual machine learning, simply by regularizing the raw features, the pixels, more heavily. But in deep learning you can also do this by means of the architecture. In this case we have introduced a thing called a bottleneck layer: this one layer with 32 units, which is much smaller than any other layer. And since it's a bottleneck, any information the neural network takes from the image has to go through this layer. It kind of limits the amount of useful features your model can get from the image, and biases it toward trusting the raw image features less. This is of course not guaranteed: technically, if you fit your model for too long, it might just encode everything into some super-complex non-linear dependency and still get all the information through.
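Continuing the earlier sketch, the bottleneck version only changes the image branch; the 32 units follow the lecture, while every other size is still an illustrative assumption.

```python
# Same two-branch model, but the image branch is squeezed through a narrow
# bottleneck layer before it may mix with the high-level attributes.
from tensorflow import keras
from tensorflow.keras import layers

image_pixels = keras.Input(shape=(100 * 100,), name="raw_pixels")
car_attributes = keras.Input(shape=(100,), name="high_level_features")

x = layers.Dense(256, activation="relu")(image_pixels)
x = layers.Dense(128, activation="relu")(x)
x = layers.Dense(32, activation="relu", name="bottleneck")(x)  # all image information must pass through 32 units

combined = layers.Concatenate()([x, car_attributes])
combined = layers.Dense(64, activation="relu")(combined)
price = layers.Dense(1, name="price")(combined)

bottleneck_model = keras.Model(inputs=[image_pixels, car_attributes], outputs=price)
bottleneck_model.compile(optimizer="adam", loss="mse")
```

The squeeze doesn't forbid the network from using the image; it just makes it relatively harder to push a lot of image information into the prediction.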
Still, it's one way you can approach this problem. [MUSIC]