If the basic technical ideas behind deep learning, behind neural networks, have been around for decades, why are they only just now taking off? In this video let's go over some of the main drivers behind the rise of deep learning, because I think this will help you spot the best opportunities within your own organization to apply these to.

Over the last few years a lot of people have asked me, "Andrew, why is deep learning suddenly working so well?" and when I'm asked that question, this is usually the picture I draw for them. Let's say we plot a figure where on the horizontal axis we plot the amount of data we have for a task, and on the vertical axis we plot the performance of our learning algorithm, such as the accuracy of our spam classifier or our ad click predictor, or the accuracy of our neural net at figuring out the position of other cars for our self-driving car. It turns out that if you plot the performance of a traditional learning algorithm, like a support vector machine or logistic regression, as a function of the amount of data you have, you might get a curve that looks like this, where the performance improves for a while as you add more data, but after a while the performance pretty much plateaus into a horizontal line. It's as if those older algorithms didn't know what to do with huge amounts of data.

What happened in our society over the last ten years or so is that for a lot of problems we went from having a relatively small amount of data to having, often, a fairly large amount of data. All of this was thanks to the digitization of society, where so much human activity is now in the digital realm. We spend so much time on computers, on websites, on mobile apps, and activity on digital devices creates data. And thanks to the rise of inexpensive cameras built into our cell phones, accelerometers, and all sorts of sensors in the
Internet of Things, we also have just been collecting more and more data. So over the last 20 years, for a lot of applications, we simply accumulated a lot more data, more than traditional learning algorithms were able to effectively take advantage of.

What neural networks changed is this: it turns out that if you train a small neural net, its performance maybe looks like that. If you train a somewhat larger neural net, call it a medium-sized neural net, its performance is often a little bit better. And if you train a very large neural net, its performance often just keeps getting better and better. So, a couple of observations. One is that if you want to hit this very high level of performance, you need two things: first, you often need to be able to train a big enough neural network to take advantage of the huge amount of data, and second, you need to be out here on the x-axis, so you do need a lot of data. We therefore often say that scale has been driving deep learning progress, and by scale I mean both the size of the neural network, meaning a network with a lot of hidden units, a lot of parameters, and a lot of connections, as well as the scale of the data. In fact, today one of the most reliable ways to get better performance from a neural network is often to either train a bigger network or throw more data at it. That only works up to a point, because eventually you run out of data, or eventually the network is so big that it takes too long to train, but just improving scale has taken us a long way in the world of deep learning.

To make this diagram a bit more technically precise and add a few more things: I wrote the amount of data on the x-axis, but technically this is the amount of labeled data, where by labeled data I mean training examples for which we have both the input x and the label y. Let me also introduce a little bit of notation that we'll use later in this course: we're going to use lowercase m to denote the size of the training set, or the number of training examples. So that's the horizontal axis.
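Since the figure itself is not reproduced in this transcript, here is a minimal sketch, in Python with numpy and matplotlib, of the kind of schematic curves being described. The curves are purely synthetic saturating functions, with ceilings and rates made up only to mimic the shapes described above rather than to measure any real algorithm; m stands for the number of labeled training examples on the horizontal axis.

    # Illustrative only: synthetic curves mimicking the schematic
    # "performance vs. amount of labeled data" figure described above.
    import numpy as np
    import matplotlib.pyplot as plt

    m = np.logspace(1, 7, 200)   # m = number of labeled training examples (log scale)

    def saturating(m, ceiling, rate):
        """A made-up saturating curve: performance rises with data, then flattens."""
        return ceiling * (1.0 - np.exp(-rate * np.log10(m)))

    plt.figure()
    plt.plot(m, saturating(m, 0.70, 0.45), label="traditional algorithm (SVM, logistic regression)")
    plt.plot(m, saturating(m, 0.80, 0.40), label="small neural net")
    plt.plot(m, saturating(m, 0.90, 0.35), label="medium neural net")
    plt.plot(m, saturating(m, 0.99, 0.30), label="large neural net")
    plt.xscale("log")
    plt.xlabel("m = number of labeled training examples")
    plt.ylabel("performance (e.g. accuracy)")
    plt.title("Schematic: scale drives deep learning progress (illustrative curves)")
    plt.legend()
    plt.show()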
A couple of other details about this figure. In this regime of smaller training sets, the relative ordering of the algorithms is actually not very well defined. If you don't have a lot of training data, it is often your skill at hand-engineering features that determines performance. So it's quite possible that if someone training an SVM is more motivated to hand-engineer features than someone training an even larger neural net, then in this small-training-set regime the SVM could do better. In this region to the left of the figure, the relative ordering between the algorithms is not that well defined, and performance depends much more on your skill at engineering features and on other minor details of the algorithms. It's only in the big-data regime, the very large training set, very large m regime toward the right, that we more consistently see large neural nets dominating the other approaches. So if any of your friends ask you why neural networks are taking off, I would encourage you to draw this picture for them as well.

I will also say that in the early days of the modern rise of deep learning, it was scale of data and scale of computation, just our ability to train very large neural networks either on a CPU or a GPU, that enabled us to make a lot of progress. But increasingly, especially in the last several years, we've seen tremendous algorithmic innovation as well, so I don't want to understate that. Interestingly, many of the algorithmic innovations have been about trying to make neural networks run much faster. As a concrete example, one of the huge breakthroughs in neural networks has been switching from the sigmoid function, which looks like this, to the ReLU function,
which we talked about briefly in an earlier video and which looks like this. If you don't understand the details of what I'm about to say, don't worry about it. But it turns out that one of the problems of using sigmoid functions in machine learning is that there are these regions where the slope of the function, the gradient, is nearly zero, and so learning becomes really slow, because when you implement gradient descent and the gradient is nearly zero, the parameters change very slowly. Whereas by changing what's called the activation function of the neural network to use this function called the ReLU function, the rectified linear unit, the gradient is equal to one for all positive values of the input, so the gradient is much less likely to gradually shrink to zero. The gradient here, the slope of this line, is zero on the left, but it turns out that just switching from the sigmoid function to the ReLU function has made an algorithm called gradient descent work much faster. So this is an example of a maybe relatively simple algorithmic innovation, but ultimately its impact was that it really helped computation. There are actually quite a lot of examples like this where we changed the algorithm because it allows the code to run much faster, and this allows us to train bigger neural networks, or to do so within a reasonable amount of time, even when we have a large network and a lot of data.
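To make the gradient argument concrete, here is a minimal sketch in Python with numpy, not taken from the lecture itself, comparing the slope of the sigmoid with the slope of the ReLU and the size of the resulting gradient-descent step. The learning rate and the input values are arbitrary illustrative choices.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def sigmoid_grad(z):
        s = sigmoid(z)
        return s * (1.0 - s)            # nearly zero in the flat tails where |z| is large

    def relu_grad(z):
        return (z > 0).astype(float)    # exactly 1 for every positive input, 0 otherwise

    z = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
    print("sigmoid'(z):", sigmoid_grad(z))   # roughly 4.5e-05 at z = +/- 10
    print("relu'(z):   ", relu_grad(z))      # [0, 0, 0, 1, 1]

    # Gradient descent changes a parameter by (learning rate) * (gradient), and the
    # activation's slope enters that gradient through the chain rule, so a near-zero
    # slope means a tiny update and very slow learning.
    alpha = 0.1                              # arbitrary illustrative learning rate
    print("update scale with sigmoid at z=10:", alpha * sigmoid_grad(10.0))
    print("update scale with ReLU at z=10:   ", alpha * relu_grad(np.array(10.0)))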
The other reason that fast computation is important is that it turns out the process of training a neural network is very iterative. Often you have an idea for a neural network architecture, so you implement your idea in code. Implementing your idea lets you run an experiment, which tells you how well your neural network does, and then by looking at the result you go back and change the details of your network, and you go around this cycle over and over. When your neural network takes a long time to train, it just takes a long time to go around this cycle, and there's a huge difference in your productivity building effective neural networks when you can have an idea, try it, and see whether it works in ten minutes, or maybe at most a day, versus having to train your neural network for a month, which sometimes does happen. When you get a result back in ten minutes or in a day, you can just try a lot more ideas and be much more likely to discover a neural network that works well for your application. So faster computation has really helped speed up the rate at which you can get an experimental result back, and this has helped both practitioners of neural networks and researchers working in deep learning iterate much faster and improve their ideas much faster. All of this has also been a huge boon to the entire deep learning research community, which has been incredible at inventing new algorithms and making nonstop progress on that front.

So these are some of the forces powering the rise of deep learning, but the good news is that these forces are still working powerfully to make deep learning even better. Take data: society is still throwing off more and more digital data. Or take computation: with the rise of specialized hardware like GPUs, faster networking, and many other types of hardware, I'm actually quite confident that our ability to train very large neural networks, from a computation point of view, will keep on getting better. And take algorithms: the deep learning research community is continuously phenomenal at innovating on the algorithms front. Because of this, I think we can be optimistic, and I am optimistic, that deep learning will keep on getting better for many years to come.
So with that, let's go on to the last video of this section, where we'll talk a little bit more about what you'll learn from this course.