If the basic technical ideas behind deep learning, behind neural networks, have been around for decades, why are they only just now taking off? In this video let's go over some of the main drivers behind the rise of deep learning, because I think this will help you spot the best opportunities within your own organization to apply these to.

Over the last few years a lot of people have asked me, "Andrew, why is deep learning suddenly working so well?" and when I'm asked that question, this is usually the picture I draw for them. Let's say we plot a figure where on the horizontal axis we plot the amount of data we have for a task, and on the vertical axis we plot the performance of our learning algorithm, such as the accuracy of our spam classifier or our ad click predictor, or the accuracy of our neural net at figuring out the position of other cars for our self-driving car. It turns out that if you plot the performance of a traditional learning algorithm, like a support vector machine or logistic regression, as a function of the amount of data you have, you might get a curve that looks like this, where the performance improves for a while as you add more data, but after a while the performance pretty much plateaus into a horizontal line. It's as if those older algorithms didn't know what to do with huge amounts of data.

What happened in our society over the last ten years or so is that for a lot of problems we went from having a relatively small amount of data to having, often, a fairly large amount of data. All of this was thanks to the digitization of society, where so much human activity is now in the digital realm. We spend so much time on computers, on websites, on mobile apps, and activity on digital devices creates data. And thanks to the rise of inexpensive cameras built into our cell phones, accelerometers, and all sorts of sensors in the
Internet of Things, we also have just been collecting more and more data. So over the last 20 years, for a lot of applications, we simply accumulated a lot more data, more than traditional learning algorithms were able to effectively take advantage of.

What neural networks changed is this: it turns out that if you train a small neural net, its performance maybe looks like that. If you train a somewhat larger neural net, call it a medium-sized neural net, its performance is often a little bit better. And if you train a very large neural net, its performance often just keeps getting better and better. So, a couple of observations. One is that if you want to hit this very high level of performance, you need two things: first, you often need to be able to train a big enough neural network to take advantage of the huge amount of data, and second, you need to be out here on the x-axis, so you do need a lot of data. We therefore often say that scale has been driving deep learning progress, and by scale I mean both the size of the neural network, meaning a network with a lot of hidden units, a lot of parameters, and a lot of connections, as well as the scale of the data. In fact, today one of the most reliable ways to get better performance from a neural network is often to either train a bigger network or throw more data at it. That only works up to a point, because eventually you run out of data, or eventually the network is so big that it takes too long to train, but just improving scale has taken us a long way in the world of deep learning.

To make this diagram a bit more technically precise and add a few more things: I wrote the amount of data on the x-axis, but technically this is the amount of labeled data, where by labeled data I mean training examples for which we have both the input x and the label y. Let me also introduce a little bit of notation that we'll use later in this course: we're going to use lowercase m to denote the size of the training set, or the number of training examples. So that's the horizontal axis.
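Since the figure itself is not reproduced in this transcript, here is a minimal sketch, in Python with numpy and matplotlib, of the kind of schematic curves being described. The curves are purely synthetic saturating functions, with ceilings and rates made up only to mimic the shapes described above rather than to measure any real algorithm; m stands for the number of labeled training examples on the horizontal axis.

    # Illustrative only: synthetic curves mimicking the schematic
    # "performance vs. amount of labeled data" figure described above.
    import numpy as np
    import matplotlib.pyplot as plt

    m = np.logspace(1, 7, 200)   # m = number of labeled training examples (log scale)

    def saturating(m, ceiling, rate):
        """A made-up saturating curve: performance rises with data, then flattens."""
        return ceiling * (1.0 - np.exp(-rate * np.log10(m)))

    plt.figure()
    plt.plot(m, saturating(m, 0.70, 0.45), label="traditional algorithm (SVM, logistic regression)")
    plt.plot(m, saturating(m, 0.80, 0.40), label="small neural net")
    plt.plot(m, saturating(m, 0.90, 0.35), label="medium neural net")
    plt.plot(m, saturating(m, 0.99, 0.30), label="large neural net")
    plt.xscale("log")
    plt.xlabel("m = number of labeled training examples")
    plt.ylabel("performance (e.g. accuracy)")
    plt.title("Schematic: scale drives deep learning progress (illustrative curves)")
    plt.legend()
    plt.show()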
A couple of other details about this figure. In this regime of smaller training sets, the relative ordering of the algorithms is actually not very well defined. If you don't have a lot of training data, it is often your skill at hand-engineering features that determines performance. So it's quite possible that if someone training an SVM is more motivated to hand-engineer features than someone training an even larger neural net, then in this small-training-set regime the SVM could do better. In this region to the left of the figure, the relative ordering between the algorithms is not that well defined, and performance depends much more on your skill at engineering features and on other minor details of the algorithms. It's only in the big-data regime, the very large training set, very large m regime toward the right, that we more consistently see large neural nets dominating the other approaches. So if any of your friends ask you why neural networks are taking off, I would encourage you to draw this picture for them as well.

I will also say that in the early days of the modern rise of deep learning, it was scale of data and scale of computation, just our ability to train very large neural networks either on a CPU or a GPU, that enabled us to make a lot of progress. But increasingly, especially in the last several years, we've seen tremendous algorithmic innovation as well, so I don't want to understate that. Interestingly, many of the algorithmic innovations have been about trying to make neural networks run much faster. As a concrete example, one of the huge breakthroughs in neural networks has been switching from the sigmoid function, which looks like this, to the ReLU function,
which we talked about briefly in an earlier video and which looks like this. If you don't understand the details of what I'm about to say, don't worry about it. But it turns out that one of the problems of using sigmoid functions in machine learning is that there are these regions where the slope of the function, the gradient, is nearly zero, and so learning becomes really slow, because when you implement gradient descent and the gradient is nearly zero, the parameters change very slowly. Whereas by changing what's called the activation function of the neural network to use this function called the ReLU function, the rectified linear unit, the gradient is equal to one for all positive values of the input, so the gradient is much less likely to gradually shrink to zero. The gradient here, the slope of this line, is zero on the left, but it turns out that just switching from the sigmoid function to the ReLU function has made an algorithm called gradient descent work much faster. So this is an example of a maybe relatively simple algorithmic innovation, but ultimately its impact was that it really helped computation. There are actually quite a lot of examples like this where we changed the algorithm because it allows the code to run much faster, and this allows us to train bigger neural networks, or to do so within a reasonable amount of time, even when we have a large network and a lot of data.
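To make the gradient argument concrete, here is a minimal sketch in Python with numpy, not taken from the lecture itself, comparing the slope of the sigmoid with the slope of the ReLU and the size of the resulting gradient-descent step. The learning rate and the input values are arbitrary illustrative choices.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def sigmoid_grad(z):
        s = sigmoid(z)
        return s * (1.0 - s)            # nearly zero in the flat tails where |z| is large

    def relu_grad(z):
        return (z > 0).astype(float)    # exactly 1 for every positive input, 0 otherwise

    z = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
    print("sigmoid'(z):", sigmoid_grad(z))   # roughly 4.5e-05 at z = +/- 10
    print("relu'(z):   ", relu_grad(z))      # [0, 0, 0, 1, 1]

    # Gradient descent changes a parameter by (learning rate) * (gradient), and the
    # activation's slope enters that gradient through the chain rule, so a near-zero
    # slope means a tiny update and very slow learning.
    alpha = 0.1                              # arbitrary illustrative learning rate
    print("update scale with sigmoid at z=10:", alpha * sigmoid_grad(10.0))
    print("update scale with ReLU at z=10:   ", alpha * relu_grad(np.array(10.0)))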
The other reason that fast computation is important is that it turns out the process of training a neural network is very iterative. Often you have an idea for a neural network architecture, so you implement your idea in code. Implementing your idea lets you run an experiment, which tells you how well your neural network does, and then by looking at the result you go back and change the details of your network, and you go around this cycle over and over. When your neural network takes a long time to train, it just takes a long time to go around this cycle, and there's a huge difference in your productivity building effective neural networks when you can have an idea, try it, and see whether it works in ten minutes, or maybe at most a day, versus having to train your neural network for a month, which sometimes does happen. When you get a result back in ten minutes or in a day, you can just try a lot more ideas and be much more likely to discover a neural network that works well for your application. So faster computation has really helped speed up the rate at which you can get an experimental result back, and this has helped both practitioners of neural networks and researchers working in deep learning iterate much faster and improve their ideas much faster. All of this has also been a huge boon to the entire deep learning research community, which has been incredible at inventing new algorithms and making nonstop progress on that front.

So these are some of the forces powering the rise of deep learning, but the good news is that these forces are still working powerfully to make deep learning even better. Take data: society is still throwing off more and more digital data. Or take computation: with the rise of specialized hardware like GPUs, faster networking, and many other types of hardware, I'm actually quite confident that our ability to train very large neural networks, from a computation point of view, will keep on getting better. And take algorithms: the deep learning research community is continuously phenomenal at innovating on the algorithms front. Because of this, I think we can be optimistic, and I am optimistic, that deep learning will keep on getting better for many years to come.
So with that, let's go on to the last video of this section, where we'll talk a little bit more about what you'll learn from this course.