One of the most exciting recent developments in deep learning has been the rise of end-to-end deep learning. So what is end-to-end learning? Briefly, there have been some data processing systems, or learning systems, that require multiple stages of processing. And what end-to-end deep learning does is take all those multiple stages and replace them, usually, with just a single neural network.

Let's look at some examples. Take speech recognition, where your goal is to take an input X, such as an audio clip, and map it to an output Y, which is a transcript of the audio clip. Traditionally, speech recognition required many stages of processing. First, you would extract some features, some hand-designed features, of the audio. If you've heard of MFCC, that's an algorithm for extracting a certain set of hand-designed features from audio. Then, having extracted some low-level features, you might apply a machine learning algorithm to find the phonemes in the audio clip. Phonemes are the basic units of sound. For example, the word "cat" is made out of three sounds, the "Cu", "Ah", and "Tu", so you extract those. You then string phonemes together to form individual words, and string those together to form the transcript of the audio clip.

In contrast to this pipeline with a lot of stages, what end-to-end deep learning does is let you train a huge neural network that just inputs the audio clip and directly outputs the transcript.
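To make the contrast concrete, here is a minimal sketch in Python. Every stage function and the end-to-end model are hypothetical placeholders standing in for real components (extract_mfcc, features_to_phonemes, and phonemes_to_words are illustrative names, not a specific library's API); the point is the shape of the two designs, not a working recognizer.

```python
import numpy as np

# Hypothetical stage functions for the traditional pipeline; in a real
# system, each would be a separately engineered component or model.
def extract_mfcc(audio):
    # Hand-designed features (e.g. MFCC); placeholder output shape only.
    return np.zeros((len(audio) // 160, 13))

def features_to_phonemes(features):
    # ML model mapping features to a phoneme sequence; placeholder result.
    return ["k", "ae", "t"]

def phonemes_to_words(phonemes):
    # String phonemes together into words; placeholder result.
    return ["cat"]

def traditional_pipeline(audio):
    features = extract_mfcc(audio)
    phonemes = features_to_phonemes(features)
    words = phonemes_to_words(phonemes)
    return " ".join(words)  # the transcript

def end_to_end(audio, model):
    # One large neural network, trained on (audio, transcript) pairs,
    # replaces every intermediate stage above.
    return model(audio)
```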
One interesting sociological effect in AI is that as end-to-end deep learning started to work better, there were some researchers who had, for example, spent many years of their careers designing individual steps of the pipeline. There were researchers in different disciplines, not just in speech recognition, maybe in computer vision and other areas as well, who had spent a lot of time, written multiple papers, maybe even built a large part of their careers, engineering features or engineering other pieces of the pipeline. And when end-to-end deep learning just took a large training set and learned the function mapping from X to Y directly, really bypassing a lot of these intermediate steps, it was challenging for some disciplines to come around to accepting this alternative way of building AI systems, because it really obsoleted, in some cases, many years of research on some of the intermediate components.

It turns out that one of the challenges of end-to-end deep learning is that you might need a lot of data before it works well. For example, if you're training on 3,000 hours of data to build a speech recognition system, then the traditional pipeline, the full traditional pipeline, works really well. It's only when you have a very large dataset, say 10,000 hours of data, anything going up to maybe 100,000 hours of data, that the end-to-end approach suddenly starts to work really well. So when you have a smaller dataset, the more traditional pipeline approach actually works just as well, often even better, and you need a large dataset before the end-to-end approach really shines. And if you have a medium amount of data, there are also intermediate approaches, where maybe you input the audio, bypass the hand-designed features, and just have the neural network learn to output the phonemes, and then add some other stages as well. That would be a step toward end-to-end learning, but not all the way there.

So this is a picture of a face recognition turnstile built by a researcher, Yuanqing Lin, at Baidu, where a camera looks at the person approaching the gate, and if it recognizes the person, the turnstile automatically lets them through. So rather than needing to swipe an RFID badge to enter the facility, in increasingly many offices in China, and hopefully more and more in other countries as well, you can just approach the turnstile, and if it recognizes your face, it lets you through without needing you to carry an RFID badge.

So, how do you build a system like this? Well, one thing you could do is just look at the image that the camera is capturing. So, I guess this is my bad drawing, but maybe this is a camera image.
And you have someone approaching the turnstile, so this might be the image X that your camera is capturing. One thing you could do is try to learn a function mapping directly from the image X to the identity of the person Y. It turns out this is not the best approach. One of the problems is that the person approaching the turnstile can approach from lots of different directions. They could be in the green position, they could be in the blue position. Sometimes they're closer to the camera, so they appear bigger in the image, and sometimes they're really close to the camera, so their face appears much bigger.

So what he has actually done to build these turnstiles is not to just take the raw image and feed it to a neural net to try to figure out a person's identity. Instead, the best approach to date seems to be a multi-step approach, where first you run one piece of software to detect the person's face, a detector to figure out where the person's face is. Having detected the person's face, you then zoom in to that part of the image and crop it so that the person's face is centered. Then it is this picture, which I guess I drew here in red, that is fed to the neural network, to try to learn or estimate the person's identity.

And what researchers have found is that instead of trying to learn everything in one step, breaking this problem down into two simpler steps, first, figure out where the face is, and second, look at the face and figure out who this actually is, allows the learning algorithm, or really two learning algorithms, to solve two much simpler tasks, and results in overall better performance.
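A minimal sketch of that two-step structure might look like the following. The detector, the cropping helper, and the recognizer are all hypothetical placeholders (detect_face, crop_to_face, and the recognizer callable are illustrative names, not a real library's API); what matters is the structure: detect, crop, then recognize.

```python
import numpy as np

def detect_face(image):
    # Step 1 (hypothetical detector): return a bounding box (x, y, w, h)
    # around the person's face in the camera frame. Placeholder box here.
    h, w = image.shape[:2]
    return (w // 4, h // 4, w // 2, h // 2)

def crop_to_face(image, box):
    # Zoom in and crop so the face is centered, at the size the
    # recognizer expects.
    x, y, w, h = box
    return image[y:y + h, x:x + w]

def identify_person(camera_image, recognizer):
    # Two simpler tasks instead of one hard one: find the face,
    # then estimate the identity from the closely cropped face.
    box = detect_face(camera_image)
    face = crop_to_face(camera_image, box)
    return recognizer(face)
```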
By the way, if you want to know how step two here actually works, I've simplified the description a bit. The way the second step is actually trained is that you train a neural network that takes as input two images, and what the neural network does is tell you whether these two images are of the same person or not. So if you then have, say, 10,000 employee IDs on file, you can take this image in red and quickly compare it against maybe all 10,000 employee IDs on file, to try to figure out whether this picture in red is indeed one of your 10,000 employees whom you should allow into this facility or into your office building. This is a turnstile that is giving employees access to a workplace.

So why is it that the two-step approach works better? There are actually two reasons for that. One is that each of the two problems you're solving is actually much simpler. But the second is that you have a lot of data for each of the two sub-tasks. In particular, there is a lot of data you can obtain for face detection, for task one over here, where the task is to look at an image and figure out where the person's face is in the image. There is a lot of labeled data (X, Y), where X is a picture and Y shows the position of the person's face, so you could build a neural network to do task one quite well. And then separately, there's a lot of data for task two as well. Today, leading companies have, let's say, hundreds of millions of pictures of people's faces. So given a closely cropped image, like this red image or this one down here, today's leading face recognition teams have at least hundreds of millions of images that they can use to look at two images and figure out whether or not it's the same person. So there's also a lot of data for task two.
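Here is a minimal sketch of how that verification step might be used at the turnstile, assuming a network has already been trained so that two crops of the same person map to nearby embedding vectors. The embed function, the distance threshold, and the employee database are all hypothetical, an illustration of the comparison loop rather than any production system.

```python
import numpy as np

def verify_employee(face_crop, employee_db, embed, threshold=0.7):
    """Compare a cropped face against every employee photo on file.

    embed: hypothetical trained network mapping a face crop to a vector,
           trained so that crops of the same person land close together.
    employee_db: dict of employee_id -> stored face embedding.
    threshold: assumed tuning parameter, not a standard value.
    """
    query = embed(face_crop)
    for employee_id, stored in employee_db.items():
        # A small distance means the verification network judges the two
        # images to show the same person.
        if np.linalg.norm(query - stored) < threshold:
            return employee_id   # recognized: open the turnstile
    return None                  # unknown face: keep the gate closed
```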
But in contrast, if you were to try to learn everything at the same time, there is much less data of the form (X, Y), where X is an image like this, taken from the turnstile, and Y is the identity of the person. So because you don't have enough data to solve this end-to-end learning problem, but you do have enough data to solve sub-problems one and two, in practice, breaking this down into two sub-problems results in better performance than a pure end-to-end deep learning approach. Although, if you had enough data for the end-to-end approach, maybe the end-to-end approach would work better, that's not actually what works best in practice today.

Let's look at a few more examples. Take machine translation. Traditionally, machine translation systems also had a long, complicated pipeline, where you would first take, say, English text and do text analysis, basically extract a bunch of features off the text, and so on, and after many, many steps you'd end up with, say, a translation of the English text into French. Because, for machine translation, you do have a lot of pairs of English and French sentences, end-to-end deep learning works quite well for machine translation. And that's because today it is possible to gather large datasets of (X, Y) pairs, where X is the English sentence and Y is the corresponding French translation. So in this example, end-to-end deep learning works well.

One last example: let's say you want to look at an X-ray picture of a child's hand and estimate the age of the child. When I first heard about this problem, I thought it was a very cool crime scene investigation task, where you find, maybe tragically, the skeleton of a child, and you want to figure out how old the child was. It turns out that the typical application of this problem, estimating the age of a child from an X-ray, is less dramatic than the crime scene investigation I was picturing: pediatricians use this to estimate whether or not a child is growing or developing normally. A non-end-to-end approach to this would be to take the image and then segment out, or recognize, the bones, that is, figure out where this bone segment is, where that bone segment is, and so on. Then, knowing the lengths of the different bones, you can go to a lookup table showing the average bone lengths in a child's hand, and use that to estimate the child's age. And this approach actually works pretty well.
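A minimal sketch of that multi-step approach, assuming a hypothetical segmentation model and a made-up reference table (the numbers below are placeholders for illustration, not real pediatric reference data):

```python
import numpy as np

# Hypothetical reference table: average bone length (cm) in a child's
# hand, keyed by age in years. Placeholder values only.
REFERENCE = {4: 6.0, 6: 7.0, 8: 8.0, 10: 9.0, 12: 10.0}

def estimate_age(xray_image, segment_bones):
    # Step 1: a segmentation model (hypothetical) finds each bone in the
    # X-ray and returns its length in centimeters.
    bone_lengths = segment_bones(xray_image)
    avg_length = float(np.mean(bone_lengths))
    # Step 2: look up the age whose average bone length is closest.
    return min(REFERENCE, key=lambda age: abs(REFERENCE[age] - avg_length))
```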
In contrast, if you were to go straight from the image to the child's age, then you would need a lot of data to do that directly, and as far as I know, this approach does not work as well today, just because there isn't enough data to train this task in an end-to-end fashion. Whereas, in contrast, you can imagine that by breaking this problem down into two steps, step one is a relatively simple problem, so maybe you don't need that much data, maybe you don't need that many X-ray images, to segment out the bones. And for task two, by collecting statistics from a number of children's hands, you can also get decent estimates without too much data. So this multi-step approach seems promising, maybe more promising than the end-to-end approach, at least until you can get more data for the end-to-end learning approach.

So end-to-end deep learning works. It can work really well, and it can really simplify the system and not require you to build so many hand-designed individual components. But it's also not a panacea; it doesn't always work. In the next video, I want to share with you a more systematic description of when you should, and maybe when you shouldn't, use end-to-end deep learning, and how to piece together these complex machine learning systems.