1 00:00:00,000 --> 00:00:02,820 Let's say in building a machine learning system you're trying to 2 00:00:02,820 --> 00:00:05,690 decide whether or not to use an end-to-end approach. 3 00:00:05,690 --> 00:00:08,250 Let's take a look at some of the pros and cons of 4 00:00:08,250 --> 00:00:10,770 end-to-end deep learning so that you can come 5 00:00:10,770 --> 00:00:12,690 away with some guidelines on whether or 6 00:00:12,690 --> 00:00:17,040 not an end-to-end approach seems promising for your application. 7 00:00:17,040 --> 00:00:20,265 Here are some of the benefits of applying end-to-end learning. 8 00:00:20,265 --> 00:00:25,435 First is that end-to-end learning really just lets the data speak. 9 00:00:25,435 --> 00:00:29,010 So if you have enough X,Y data 10 00:00:29,010 --> 00:00:33,450 then whatever is the most appropriate function mapping from X to Y, 11 00:00:33,450 --> 00:00:35,475 if you train a big enough neural network, 12 00:00:35,475 --> 00:00:38,395 hopefully the neural network will figure it out. 13 00:00:38,395 --> 00:00:41,040 And by having a pure machine learning approach, 14 00:00:41,040 --> 00:00:44,550 your neural network learning input from X to Y may be 15 00:00:44,550 --> 00:00:48,540 more able to capture whatever statistics are in the data, 16 00:00:48,540 --> 00:00:52,800 rather than being forced to reflect human preconceptions. 17 00:00:52,800 --> 00:00:55,695 So for example, in the case of speech recognition 18 00:00:55,695 --> 00:00:58,530 earlier speech systems had this notion of 19 00:00:58,530 --> 00:01:01,260 a phoneme which was a basic unit of sound like C, 20 00:01:01,260 --> 00:01:04,240 A, and T for the word cat. 21 00:01:04,240 --> 00:01:09,890 And I think that phonemes are an artifact created by human linguists. 22 00:01:09,890 --> 00:01:12,120 I actually think that phonemes are a fantasy of 23 00:01:12,120 --> 00:01:15,690 linguists that are a reasonable description of language, 24 00:01:15,690 --> 00:01:21,785 but it's not obvious that you want to force your learning algorithm to think in phonemes. 25 00:01:21,785 --> 00:01:25,895 And if you let your learning algorithm learn whatever representation it wants to 26 00:01:25,895 --> 00:01:30,675 learn rather than forcing your learning algorithm to use phonemes as a representation, 27 00:01:30,675 --> 00:01:34,645 then its overall performance might end up being better. 28 00:01:34,645 --> 00:01:37,140 The second benefit to end-to-end deep learning is 29 00:01:37,140 --> 00:01:40,480 that there's less hand designing of components needed. 30 00:01:40,480 --> 00:01:43,960 And so this could also simplify your design work flow, 31 00:01:43,960 --> 00:01:47,655 that you just don't need to spend a lot of time hand designing features, 32 00:01:47,655 --> 00:01:51,040 hand designing these intermediate representations. 33 00:01:51,040 --> 00:01:52,650 How about the disadvantages. 34 00:01:52,650 --> 00:01:54,335 Here are some of the cons. 35 00:01:54,335 --> 00:01:57,010 First, it may need a large amount of data. 36 00:01:57,010 --> 00:02:00,225 So to learn this X to Y mapping directly, 37 00:02:00,225 --> 00:02:03,030 you might need a lot of data of X, 38 00:02:03,030 --> 00:02:06,600 Y and we were seeing in a previous video some examples of 39 00:02:06,600 --> 00:02:10,720 where you could obtain a lot of data for subtasks. 40 00:02:10,720 --> 00:02:13,675 Such as for face recognition, 41 00:02:13,675 --> 00:02:17,310 we could find a lot data for finding a face in the image, 42 00:02:17,310 --> 00:02:20,510 as well as identifying the face once you found a face, 43 00:02:20,510 --> 00:02:24,780 but there was just less data available for the entire end-to-end task. 44 00:02:24,780 --> 00:02:32,989 So X, this is the input end of the end-to-end learning and Y is the output end. 45 00:02:32,989 --> 00:02:36,180 And so you need all the data X Y with 46 00:02:36,180 --> 00:02:40,540 both the input end and the output end in order to train these systems, 47 00:02:40,540 --> 00:02:45,100 and this is why we call it end-to-end learning value as well because you're learning 48 00:02:45,100 --> 00:02:52,690 a direct mapping from one end of the system all the way to the other end of the system. 49 00:02:52,690 --> 00:02:58,960 The other disadvantage is that it excludes potentially useful hand designed components. 50 00:02:58,960 --> 00:03:04,510 So machine learning researchers tend to speak disparagingly of hand designing things. 51 00:03:04,510 --> 00:03:06,880 But if you don't have a lot of data, 52 00:03:06,880 --> 00:03:09,490 then your learning algorithm doesn't have 53 00:03:09,490 --> 00:03:13,450 that much insight it can gain from your data if your training set is small. 54 00:03:13,450 --> 00:03:17,080 And so hand designing a component can really be a way for 55 00:03:17,080 --> 00:03:21,138 you to inject manual knowledge into the algorithm, 56 00:03:21,138 --> 00:03:24,035 and that's not always a bad thing. 57 00:03:24,035 --> 00:03:28,060 I think of a learning algorithm as having two main sources of knowledge. 58 00:03:28,060 --> 00:03:33,590 One is the data and the other is whatever you hand design, 59 00:03:33,590 --> 00:03:37,070 be it components, or features, or other things. 60 00:03:37,070 --> 00:03:39,840 And so when you have a ton of data it's less 61 00:03:39,840 --> 00:03:44,125 important to hand design things but when you don't have much data, 62 00:03:44,125 --> 00:03:49,170 then having a carefully hand-designed system can actually allow humans to inject 63 00:03:49,170 --> 00:03:51,555 a lot of knowledge about the problem 64 00:03:51,555 --> 00:03:54,985 into an algorithm deck and that should be very helpful. 65 00:03:54,985 --> 00:03:58,200 So one of the downsides of end-to-end deep learning is 66 00:03:58,200 --> 00:04:02,605 that it excludes potentially useful hand-designed components. 67 00:04:02,605 --> 00:04:06,490 And hand-designed components could be very helpful if well designed. 68 00:04:06,490 --> 00:04:09,660 They could also be harmful if it really limits your performance, 69 00:04:09,660 --> 00:04:12,960 such as if you force an algorithm to think in phonemes 70 00:04:12,960 --> 00:04:16,770 when maybe it could have discovered a better representation by itself. 71 00:04:16,770 --> 00:04:19,110 So it's kind of a double edged sword that could 72 00:04:19,110 --> 00:04:21,660 hurt or help but it does tend to help more, 73 00:04:21,660 --> 00:04:26,320 hand-designed components tend to help more when you're training on a small training set. 74 00:04:26,320 --> 00:04:29,250 So if you're building a new machine learning system and you're trying to 75 00:04:29,250 --> 00:04:32,535 decide whether or not to use end-to-end deep learning, 76 00:04:32,535 --> 00:04:34,500 I think the key question is, 77 00:04:34,500 --> 00:04:37,330 do you have sufficient data to learn the function of 78 00:04:37,330 --> 00:04:41,340 the complexity needed to map from X to Y? 79 00:04:41,340 --> 00:04:44,790 I don't have a formal definition of this phrase, 80 00:04:44,790 --> 00:04:49,170 complexity needed, but intuitively, 81 00:04:49,170 --> 00:04:52,140 if you're trying to learn a function from X to Y, 82 00:04:52,140 --> 00:04:54,685 that is looking at an image like this 83 00:04:54,685 --> 00:04:57,495 and recognizing the position of the bones in this image, 84 00:04:57,495 --> 00:05:01,435 then maybe this seems like a relatively simple problem to 85 00:05:01,435 --> 00:05:06,030 identify the bones of the image and maybe they'll need that much data for that task. 86 00:05:06,030 --> 00:05:12,020 Or given a picture of a person, 87 00:05:12,020 --> 00:05:18,995 maybe finding the face of that person in the image doesn't seem like that hard a problem, 88 00:05:18,995 --> 00:05:23,420 so maybe you don't need too much data to find the face of a person. 89 00:05:23,420 --> 00:05:28,465 Or at least maybe you can find enough data to solve that task, whereas in contrast, 90 00:05:28,465 --> 00:05:34,210 the function needed to look at the hand and map that directly to the age of the child, 91 00:05:34,210 --> 00:05:38,995 that seems like a much more complex problem that intuitively maybe you need 92 00:05:38,995 --> 00:05:45,815 more data to learn if you were to apply a pure end-to-end deep learning approach. 93 00:05:45,815 --> 00:05:50,185 So let me finish this video with a more complex example. 94 00:05:50,185 --> 00:05:52,645 You may know that I've been spending time helping out 95 00:05:52,645 --> 00:05:55,360 an autonomous driving company, Drive.ai. 96 00:05:55,360 --> 00:06:00,108 So I'm actually very excited about autonomous driving. 97 00:06:00,108 --> 00:06:03,950 So how do you build a car that drives itself? 98 00:06:03,950 --> 00:06:06,065 Well, here's one thing you could do, 99 00:06:06,065 --> 00:06:08,795 and this is not an end-to-end deep learning approach. 100 00:06:08,795 --> 00:06:15,620 You can take as input an image of what's in front of your car, maybe radar, lighter, 101 00:06:15,620 --> 00:06:17,075 other sensor readings as well, 102 00:06:17,075 --> 00:06:20,030 but to simplify the description, 103 00:06:20,030 --> 00:06:24,700 let's just say you take a picture of what's in front or what's around your car. 104 00:06:24,700 --> 00:06:28,430 And then to drive your car safely you need to detect 105 00:06:28,430 --> 00:06:33,135 other cars and you also need to detect pedestrians. 106 00:06:33,135 --> 00:06:35,840 You need to detect other things, of course, 107 00:06:35,840 --> 00:06:39,033 but we'll just present a simplified example here. 108 00:06:39,033 --> 00:06:42,625 Having figured out where are the other cars and pedestrians, 109 00:06:42,625 --> 00:06:48,783 you then need to plan your own route. 110 00:06:48,783 --> 00:06:50,305 So in other words, 111 00:06:50,305 --> 00:06:54,880 if you see where are the other cars, 112 00:06:54,880 --> 00:06:58,300 where are the pedestrians, you need to decide how to steer your own car, 113 00:06:58,300 --> 00:07:02,110 what path to steer your own car for the next several seconds. 114 00:07:02,110 --> 00:07:08,185 And having decided that you're going to drive a certain path, 115 00:07:08,185 --> 00:07:14,725 maybe this is a top down view of a road and that's your car. 116 00:07:14,725 --> 00:07:17,625 Maybe you've decided to drive that path, 117 00:07:17,625 --> 00:07:18,750 that's what a route is, 118 00:07:18,750 --> 00:07:25,405 then you need to execute this by generating the appropriate steering, 119 00:07:25,405 --> 00:07:28,850 as well as acceleration and braking commands. 120 00:07:28,850 --> 00:07:34,030 So in going from your image or your sensory inputs to detecting cars and pedestrians, 121 00:07:34,030 --> 00:07:37,630 that can be done pretty well using deep learning, 122 00:07:37,630 --> 00:07:40,870 but then having figured out where the other cars and pedestrians are going, 123 00:07:40,870 --> 00:07:45,220 to select this route to exactly how you want to move your car, 124 00:07:45,220 --> 00:07:47,420 usually that's not to done with deep learning. 125 00:07:47,420 --> 00:07:51,715 Instead that's done with a piece of software called Motion Planning. 126 00:07:51,715 --> 00:07:55,420 And if you ever take a course in robotics you'll learn about motion planning. 127 00:07:55,420 --> 00:07:59,240 And then having decided what's the path you want to steer your car through, 128 00:07:59,240 --> 00:08:00,880 there'll be some other algorithm, 129 00:08:00,880 --> 00:08:06,355 we're going to say it's a control algorithm that then generates the exact decision, 130 00:08:06,355 --> 00:08:09,040 that then decides exactly how much to turn 131 00:08:09,040 --> 00:08:13,160 the steering wheel and how much to step on the accelerator or step on the brake. 132 00:08:13,160 --> 00:08:16,990 So I think what this example illustrates is that 133 00:08:16,990 --> 00:08:21,340 you want to use machine learning or use deep learning to learn 134 00:08:21,340 --> 00:08:30,640 some individual components and when applying supervised learning you should 135 00:08:30,640 --> 00:08:37,650 carefully choose what types of X to Y mappings you want to 136 00:08:37,650 --> 00:08:44,745 learn depending on what task 137 00:08:44,745 --> 00:08:48,875 you can get data for. 138 00:08:48,875 --> 00:08:51,765 And in contrast, it is exciting to talk about 139 00:08:51,765 --> 00:08:54,435 a pure end-to-end deep learning approach where you input 140 00:08:54,435 --> 00:08:57,180 the image and directly output a steering. 141 00:08:57,180 --> 00:09:04,570 But given data availability 142 00:09:04,570 --> 00:09:08,140 and the types of things we can learn with neural networks today, 143 00:09:08,140 --> 00:09:12,880 this is actually not the most promising approach or this is 144 00:09:12,880 --> 00:09:18,170 not an approach that I think teams have gotten to work best. 145 00:09:18,170 --> 00:09:22,780 And I think this pure end-to-end deep learning approach is actually 146 00:09:22,780 --> 00:09:27,170 less promising than more sophisticated approaches like this, 147 00:09:27,170 --> 00:09:32,260 given the availability of data and our ability to train neural networks today. 148 00:09:32,260 --> 00:09:35,055 So that's it for end-to-end deep learning. 149 00:09:35,055 --> 00:09:38,380 It can sometimes work really well but you also have to be 150 00:09:38,380 --> 00:09:42,453 mindful of where you apply end-to-end deep learning. 151 00:09:42,453 --> 00:09:46,985 Finally, thank you and congrats on making it this far with me. 152 00:09:46,985 --> 00:09:50,290 If you finish last week's videos and this week's videos then I 153 00:09:50,290 --> 00:09:53,860 think you will already be much smarter and much more strategic 154 00:09:53,860 --> 00:09:57,250 and much more able to make good prioritization decisions in 155 00:09:57,250 --> 00:10:01,138 terms of how to move forward on your machine learning project, 156 00:10:01,138 --> 00:10:03,520 even compared to a lot of machine learning engineers 157 00:10:03,520 --> 00:10:06,310 and researchers that I see here in Silicon Valley. 158 00:10:06,310 --> 00:10:11,320 So congrats on all that you've learned so far and I hope you now also take a look at 159 00:10:11,320 --> 00:10:13,480 this week's homework problems which should give you 160 00:10:13,480 --> 00:10:18,770 another opportunity to practice these ideas and make sure that you're mastering them.