1
00:00:00,000 --> 00:00:02,820
Let's say in building a machine learning system you're trying to

2
00:00:02,820 --> 00:00:05,690
decide whether or not to use an end-to-end approach.

3
00:00:05,690 --> 00:00:08,250
Let's take a look at some of the pros and cons of

4
00:00:08,250 --> 00:00:10,770
end-to-end deep learning so that you can come

5
00:00:10,770 --> 00:00:12,690
away with some guidelines on whether or

6
00:00:12,690 --> 00:00:17,040
not an end-to-end approach seems promising for your application.

7
00:00:17,040 --> 00:00:20,265
Here are some of the benefits of applying end-to-end learning.

8
00:00:20,265 --> 00:00:25,435
First is that end-to-end learning really just lets the data speak.

9
00:00:25,435 --> 00:00:29,010
So if you have enough X,Y data

10
00:00:29,010 --> 00:00:33,450
then whatever is the most appropriate function mapping from X to Y,

11
00:00:33,450 --> 00:00:35,475
if you train a big enough neural network,

12
00:00:35,475 --> 00:00:38,395
hopefully the neural network will figure it out.

13
00:00:38,395 --> 00:00:41,040
And by having a pure machine learning approach,

14
00:00:41,040 --> 00:00:44,550
your neural network learning input from X to Y may be

15
00:00:44,550 --> 00:00:48,540
more able to capture whatever statistics are in the data,

16
00:00:48,540 --> 00:00:52,800
rather than being forced to reflect human preconceptions.

17
00:00:52,800 --> 00:00:55,695
So for example, in the case of speech recognition

18
00:00:55,695 --> 00:00:58,530
earlier speech systems had this notion of

19
00:00:58,530 --> 00:01:01,260
a phoneme which was a basic unit of sound like C,

20
00:01:01,260 --> 00:01:04,240
A, and T for the word cat.

21
00:01:04,240 --> 00:01:09,890
And I think that phonemes are an artifact created by human linguists.

22
00:01:09,890 --> 00:01:12,120
I actually think that phonemes are a fantasy of

23
00:01:12,120 --> 00:01:15,690
linguists that are a reasonable description of language,

24
00:01:15,690 --> 00:01:21,785
but it's not obvious that you want to force your learning algorithm to think in phonemes.

25
00:01:21,785 --> 00:01:25,895
And if you let your learning algorithm learn whatever representation it wants to

26
00:01:25,895 --> 00:01:30,675
learn rather than forcing your learning algorithm to use phonemes as a representation,

27
00:01:30,675 --> 00:01:34,645
then its overall performance might end up being better.

28
00:01:34,645 --> 00:01:37,140
The second benefit to end-to-end deep learning is

29
00:01:37,140 --> 00:01:40,480
that there's less hand designing of components needed.

30
00:01:40,480 --> 00:01:43,960
And so this could also simplify your design work flow,

31
00:01:43,960 --> 00:01:47,655
that you just don't need to spend a lot of time hand designing features,

32
00:01:47,655 --> 00:01:51,040
hand designing these intermediate representations.

33
00:01:51,040 --> 00:01:52,650
How about the disadvantages.

34
00:01:52,650 --> 00:01:54,335
Here are some of the cons.

35
00:01:54,335 --> 00:01:57,010
First, it may need a large amount of data.

36
00:01:57,010 --> 00:02:00,225
So to learn this X to Y mapping directly,

37
00:02:00,225 --> 00:02:03,030
you might need a lot of data of X,

38
00:02:03,030 --> 00:02:06,600
Y and we were seeing in a previous video some examples of

39
00:02:06,600 --> 00:02:10,720
where you could obtain a lot of data for subtasks.

40
00:02:10,720 --> 00:02:13,675
Such as for face recognition,

41
00:02:13,675 --> 00:02:17,310
we could find a lot data for finding a face in the image,

42
00:02:17,310 --> 00:02:20,510
as well as identifying the face once you found a face,

43
00:02:20,510 --> 00:02:24,780
but there was just less data available for the entire end-to-end task.

44
00:02:24,780 --> 00:02:32,989
So X, this is the input end of the end-to-end learning and Y is the output end.

45
00:02:32,989 --> 00:02:36,180
And so you need all the data X Y with

46
00:02:36,180 --> 00:02:40,540
both the input end and the output end in order to train these systems,

47
00:02:40,540 --> 00:02:45,100
and this is why we call it end-to-end learning value as well because you're learning

48
00:02:45,100 --> 00:02:52,690
a direct mapping from one end of the system all the way to the other end of the system.

49
00:02:52,690 --> 00:02:58,960
The other disadvantage is that it excludes potentially useful hand designed components.

50
00:02:58,960 --> 00:03:04,510
So machine learning researchers tend to speak disparagingly of hand designing things.

51
00:03:04,510 --> 00:03:06,880
But if you don't have a lot of data,

52
00:03:06,880 --> 00:03:09,490
then your learning algorithm doesn't have

53
00:03:09,490 --> 00:03:13,450
that much insight it can gain from your data if your training set is small.

54
00:03:13,450 --> 00:03:17,080
And so hand designing a component can really be a way for

55
00:03:17,080 --> 00:03:21,138
you to inject manual knowledge into the algorithm,

56
00:03:21,138 --> 00:03:24,035
and that's not always a bad thing.

57
00:03:24,035 --> 00:03:28,060
I think of a learning algorithm as having two main sources of knowledge.

58
00:03:28,060 --> 00:03:33,590
One is the data and the other is whatever you hand design,

59
00:03:33,590 --> 00:03:37,070
be it components, or features, or other things.

60
00:03:37,070 --> 00:03:39,840
And so when you have a ton of data it's less

61
00:03:39,840 --> 00:03:44,125
important to hand design things but when you don't have much data,

62
00:03:44,125 --> 00:03:49,170
then having a carefully hand-designed system can actually allow humans to inject

63
00:03:49,170 --> 00:03:51,555
a lot of knowledge about the problem

64
00:03:51,555 --> 00:03:54,985
into an algorithm deck and that should be very helpful.

65
00:03:54,985 --> 00:03:58,200
So one of the downsides of end-to-end deep learning is

66
00:03:58,200 --> 00:04:02,605
that it excludes potentially useful hand-designed components.

67
00:04:02,605 --> 00:04:06,490
And hand-designed components could be very helpful if well designed.

68
00:04:06,490 --> 00:04:09,660
They could also be harmful if it really limits your performance,

69
00:04:09,660 --> 00:04:12,960
such as if you force an algorithm to think in phonemes

70
00:04:12,960 --> 00:04:16,770
when maybe it could have discovered a better representation by itself.

71
00:04:16,770 --> 00:04:19,110
So it's kind of a double edged sword that could

72
00:04:19,110 --> 00:04:21,660
hurt or help but it does tend to help more,

73
00:04:21,660 --> 00:04:26,320
hand-designed components tend to help more when you're training on a small training set.

74
00:04:26,320 --> 00:04:29,250
So if you're building a new machine learning system and you're trying to

75
00:04:29,250 --> 00:04:32,535
decide whether or not to use end-to-end deep learning,

76
00:04:32,535 --> 00:04:34,500
I think the key question is,

77
00:04:34,500 --> 00:04:37,330
do you have sufficient data to learn the function of

78
00:04:37,330 --> 00:04:41,340
the complexity needed to map from X to Y?

79
00:04:41,340 --> 00:04:44,790
I don't have a formal definition of this phrase,

80
00:04:44,790 --> 00:04:49,170
complexity needed, but intuitively,

81
00:04:49,170 --> 00:04:52,140
if you're trying to learn a function from X to Y,

82
00:04:52,140 --> 00:04:54,685
that is looking at an image like this

83
00:04:54,685 --> 00:04:57,495
and recognizing the position of the bones in this image,

84
00:04:57,495 --> 00:05:01,435
then maybe this seems like a relatively simple problem to

85
00:05:01,435 --> 00:05:06,030
identify the bones of the image and maybe they'll need that much data for that task.

86
00:05:06,030 --> 00:05:12,020
Or given a picture of a person,

87
00:05:12,020 --> 00:05:18,995
maybe finding the face of that person in the image doesn't seem like that hard a problem,

88
00:05:18,995 --> 00:05:23,420
so maybe you don't need too much data to find the face of a person.

89
00:05:23,420 --> 00:05:28,465
Or at least maybe you can find enough data to solve that task, whereas in contrast,

90
00:05:28,465 --> 00:05:34,210
the function needed to look at the hand and map that directly to the age of the child,

91
00:05:34,210 --> 00:05:38,995
that seems like a much more complex problem that intuitively maybe you need

92
00:05:38,995 --> 00:05:45,815
more data to learn if you were to apply a pure end-to-end deep learning approach.

93
00:05:45,815 --> 00:05:50,185
So let me finish this video with a more complex example.

94
00:05:50,185 --> 00:05:52,645
You may know that I've been spending time helping out

95
00:05:52,645 --> 00:05:55,360
an autonomous driving company, Drive.ai.

96
00:05:55,360 --> 00:06:00,108
So I'm actually very excited about autonomous driving.

97
00:06:00,108 --> 00:06:03,950
So how do you build a car that drives itself?

98
00:06:03,950 --> 00:06:06,065
Well, here's one thing you could do,

99
00:06:06,065 --> 00:06:08,795
and this is not an end-to-end deep learning approach.

100
00:06:08,795 --> 00:06:15,620
You can take as input an image of what's in front of your car, maybe radar, lighter,

101
00:06:15,620 --> 00:06:17,075
other sensor readings as well,

102
00:06:17,075 --> 00:06:20,030
but to simplify the description,

103
00:06:20,030 --> 00:06:24,700
let's just say you take a picture of what's in front or what's around your car.

104
00:06:24,700 --> 00:06:28,430
And then to drive your car safely you need to detect

105
00:06:28,430 --> 00:06:33,135
other cars and you also need to detect pedestrians.

106
00:06:33,135 --> 00:06:35,840
You need to detect other things, of course,

107
00:06:35,840 --> 00:06:39,033
but we'll just present a simplified example here.

108
00:06:39,033 --> 00:06:42,625
Having figured out where are the other cars and pedestrians,

109
00:06:42,625 --> 00:06:48,783
you then need to plan your own route.

110
00:06:48,783 --> 00:06:50,305
So in other words,

111
00:06:50,305 --> 00:06:54,880
if you see where are the other cars,

112
00:06:54,880 --> 00:06:58,300
where are the pedestrians, you need to decide how to steer your own car,

113
00:06:58,300 --> 00:07:02,110
what path to steer your own car for the next several seconds.

114
00:07:02,110 --> 00:07:08,185
And having decided that you're going to drive a certain path,

115
00:07:08,185 --> 00:07:14,725
maybe this is a top down view of a road and that's your car.

116
00:07:14,725 --> 00:07:17,625
Maybe you've decided to drive that path,

117
00:07:17,625 --> 00:07:18,750
that's what a route is,

118
00:07:18,750 --> 00:07:25,405
then you need to execute this by generating the appropriate steering,

119
00:07:25,405 --> 00:07:28,850
as well as acceleration and braking commands.

120
00:07:28,850 --> 00:07:34,030
So in going from your image or your sensory inputs to detecting cars and pedestrians,

121
00:07:34,030 --> 00:07:37,630
that can be done pretty well using deep learning,

122
00:07:37,630 --> 00:07:40,870
but then having figured out where the other cars and pedestrians are going,

123
00:07:40,870 --> 00:07:45,220
to select this route to exactly how you want to move your car,

124
00:07:45,220 --> 00:07:47,420
usually that's not to done with deep learning.

125
00:07:47,420 --> 00:07:51,715
Instead that's done with a piece of software called Motion Planning.

126
00:07:51,715 --> 00:07:55,420
And if you ever take a course in robotics you'll learn about motion planning.

127
00:07:55,420 --> 00:07:59,240
And then having decided what's the path you want to steer your car through,

128
00:07:59,240 --> 00:08:00,880
there'll be some other algorithm,

129
00:08:00,880 --> 00:08:06,355
we're going to say it's a control algorithm that then generates the exact decision,

130
00:08:06,355 --> 00:08:09,040
that then decides exactly how much to turn

131
00:08:09,040 --> 00:08:13,160
the steering wheel and how much to step on the accelerator or step on the brake.

132
00:08:13,160 --> 00:08:16,990
So I think what this example illustrates is that

133
00:08:16,990 --> 00:08:21,340
you want to use machine learning or use deep learning to learn

134
00:08:21,340 --> 00:08:30,640
some individual components and when applying supervised learning you should

135
00:08:30,640 --> 00:08:37,650
carefully choose what types of X to Y mappings you want to

136
00:08:37,650 --> 00:08:44,745
learn depending on what task

137
00:08:44,745 --> 00:08:48,875
you can get data for.

138
00:08:48,875 --> 00:08:51,765
And in contrast, it is exciting to talk about

139
00:08:51,765 --> 00:08:54,435
a pure end-to-end deep learning approach where you input

140
00:08:54,435 --> 00:08:57,180
the image and directly output a steering.

141
00:08:57,180 --> 00:09:04,570
But given data availability

142
00:09:04,570 --> 00:09:08,140
and the types of things we can learn with neural networks today,

143
00:09:08,140 --> 00:09:12,880
this is actually not the most promising approach or this is

144
00:09:12,880 --> 00:09:18,170
not an approach that I think teams have gotten to work best.

145
00:09:18,170 --> 00:09:22,780
And I think this pure end-to-end deep learning approach is actually

146
00:09:22,780 --> 00:09:27,170
less promising than more sophisticated approaches like this,

147
00:09:27,170 --> 00:09:32,260
given the availability of data and our ability to train neural networks today.

148
00:09:32,260 --> 00:09:35,055
So that's it for end-to-end deep learning.

149
00:09:35,055 --> 00:09:38,380
It can sometimes work really well but you also have to be

150
00:09:38,380 --> 00:09:42,453
mindful of where you apply end-to-end deep learning.

151
00:09:42,453 --> 00:09:46,985
Finally, thank you and congrats on making it this far with me.

152
00:09:46,985 --> 00:09:50,290
If you finish last week's videos and this week's videos then I

153
00:09:50,290 --> 00:09:53,860
think you will already be much smarter and much more strategic

154
00:09:53,860 --> 00:09:57,250
and much more able to make good prioritization decisions in

155
00:09:57,250 --> 00:10:01,138
terms of how to move forward on your machine learning project,

156
00:10:01,138 --> 00:10:03,520
even compared to a lot of machine learning engineers

157
00:10:03,520 --> 00:10:06,310
and researchers that I see here in Silicon Valley.

158
00:10:06,310 --> 00:10:11,320
So congrats on all that you've learned so far and I hope you now also take a look at

159
00:10:11,320 --> 00:10:13,480
this week's homework problems which should give you

160
00:10:13,480 --> 00:10:18,770
another opportunity to practice these ideas and make sure that you're mastering them.