So whereas in transfer learning you have a sequential process, where you learn from task A and then transfer that to task B, in multi-task learning you start off simultaneously, trying to have one neural network do several things at the same time, and each of these tasks hopefully helps all of the other tasks. Let's look at an example.

Let's say you're building an autonomous vehicle, building a self-driving car. Then your self-driving car would need to detect several different things, such as pedestrians, other cars, stop signs, and also traffic lights, and other things as well. So for example, in this example on the left, there is a stop sign in this image and there is a car in this image, but there aren't any pedestrians or traffic lights.

So if this image is the input for an example x(i), then instead of having one label y(i), you would actually have four labels. In this example, there are no pedestrians, there is a car, there is a stop sign, and there are no traffic lights. And if you try to detect other things, then maybe y(i) has even more dimensions, but for now let's stick with these four. So y(i) is a 4 by 1 vector.

And if you look at the training set labels as a whole, then similar to before, we'll stack the training data's labels horizontally as follows, y(1) up to y(m), except that now each y(i) is a 4 by 1 vector, so each of these is a tall column vector. And so this matrix Y is now a 4 by m matrix, whereas previously, when y was a single real number, this would have been a 1 by m matrix.

So what you can do is now train a neural network to predict these values of y. So you can have a neural network that takes an input x and outputs a four-dimensional value for y. Notice that for the output I've drawn four nodes. The first node tries to predict, is there a pedestrian in this picture; the second output predicts, is there a car; the third predicts, is there a stop sign; and the fourth predicts, is there a traffic light. So y hat here is four-dimensional.
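To make the shapes concrete, here is a minimal numpy sketch of that setup. It is not code from the course: the example label values, the sigmoid helper, and the toy final layer are my own assumptions, just to show a 4 by m label matrix Y and one four-dimensional y hat column per image.

```python
import numpy as np

def sigmoid(z):
    # Element-wise logistic function; assumed activation for each of the four outputs.
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical labels for m = 3 images, one column per example.
# Rows: [pedestrian, car, stop sign, traffic light]; Y has shape (4, m).
Y = np.array([[0., 1., 0.],
              [1., 1., 0.],
              [1., 0., 1.],
              [0., 0., 1.]])

# A toy final layer: hidden activations A of shape (5, m) mapped to four sigmoid outputs.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))          # pretend hidden-layer activations
W = rng.standard_normal((4, 5)) * 0.01   # weights of the four output units
b = np.zeros((4, 1))

Y_hat = sigmoid(W @ A + b)               # shape (4, m): a four-dimensional y hat per image
print(Y.shape, Y_hat.shape)              # (4, 3) (4, 3)
```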
So to train this neural network, you now need to define the loss for the neural network. Given a predicted output y hat(i), which is 4 by 1 dimensional, the loss averaged over your entire training set would be 1 over m, sum from i = 1 through m, sum from j = 1 through 4, of the losses on the individual predictions. So it's just summing over the four components: pedestrian, car, stop sign, traffic light. And this script L is the usual logistic loss. So just to write this out, this is -y_j^(i) log(y hat_j^(i)) - (1 - y_j^(i)) log(1 - y hat_j^(i)).

And the main difference compared to the earlier binary classification examples is that you're now summing over j equals 1 through 4. And the main difference between this and softmax regression is that unlike softmax regression, which assigns a single label to a single example, here one image can have multiple labels. So you're not saying that each image is either a picture of a pedestrian, or a picture of a car, or a picture of a stop sign, or a picture of a traffic light. You're asking, for each picture, does it have a pedestrian, or a car, or a stop sign, or a traffic light, and multiple objects could appear in the same image. In fact, in the example on the previous slide, we had both a car and a stop sign in that image, but no pedestrians or traffic lights. So you're not assigning a single label to an image; you're going through the different classes and asking, for each of the classes, does that type of object appear in the image? So that's why I'm saying that in this setting, one image can have multiple labels.

If you train a neural network to minimize this cost function, you are carrying out multi-task learning, because what you're doing is building a single neural network that is looking at each image and basically solving four problems. It's trying to tell you whether each image has each of these four objects in it. One other thing you could have done is just train four separate neural networks, instead of training one network to do four things. But if some of the earlier features in the neural network can be shared between these different types of objects, then you find that training one neural network to do four things results in better performance than training four completely separate neural networks to do the four tasks separately. So that's the power of multi-task learning.
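As a rough illustration of that cost, (1/m) times the sum over i and over j of L(y hat_j^(i), y_j^(i)), here is a small sketch, again my own and not from the course; the eps term is only a numerical-stability choice, not part of the formula.

```python
import numpy as np

def multitask_loss(Y_hat, Y, eps=1e-8):
    """Average multi-task logistic loss over m examples and four tasks.

    Y_hat and Y have shape (4, m): one row per task, one column per example.
    """
    m = Y.shape[1]
    # Element-wise logistic loss: -y*log(y_hat) - (1 - y)*log(1 - y_hat)
    per_entry = -(Y * np.log(Y_hat + eps) + (1 - Y) * np.log(1 - Y_hat + eps))
    # Sum over the four task rows and all columns, then average over the m examples.
    return per_entry.sum() / m

# Toy check with m = 2 examples (rows: pedestrian, car, stop sign, traffic light).
Y     = np.array([[0., 1.], [1., 1.], [1., 0.], [0., 0.]])
Y_hat = np.array([[0.1, 0.8], [0.9, 0.7], [0.8, 0.2], [0.2, 0.1]])
print(multitask_loss(Y_hat, Y))
```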
And one other detail: so far I've described this algorithm as if every image had every single label. It turns out that multi-task learning also works even if some of the images are labeled with only some of the objects. So for the first training example, let's say your labeler told you there's a pedestrian and there's no car, but they didn't bother to label whether or not there's a stop sign or whether or not there's a traffic light. And maybe for the second example, there is a pedestrian and there is a car, but again, when the labeler looked at that image, they just didn't label whether it had a stop sign or a traffic light, and so on. And maybe some examples are fully labeled, and maybe for some examples they were just labeling the presence or absence of cars, so there are some question marks, and so on.

So with a data set like this, you can still train your learning algorithm to do four tasks at the same time, even when some images have only a subset of the labels and others are sort of question marks or don't-cares. And the way you train your algorithm, even when some of these labels are question marks or really unlabeled, is that in this sum over j from 1 to 4, you would sum only over the values of j with a 0 or 1 label. So whenever there's a question mark, you just omit that term from the summation and sum only over the values where there is a label. And so that allows you to use data sets like this as well.
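One possible way to express that "sum only over labeled entries" rule in code is sketched below. This is a minimal illustration under my own assumptions; in particular, it assumes the missing labels (the question marks) are stored as NaN, which is not something the course specifies.

```python
import numpy as np

def masked_multitask_loss(Y_hat, Y, eps=1e-8):
    """Multi-task logistic loss that skips unlabeled entries.

    Missing labels ("question marks") are assumed to be stored as np.nan;
    only entries that actually have a 0/1 label contribute to the sum.
    """
    m = Y.shape[1]
    labeled = ~np.isnan(Y)                         # which entries have a label
    Y_filled = np.where(labeled, Y, 0.0)           # placeholder value where unlabeled
    per_entry = -(Y_filled * np.log(Y_hat + eps)
                  + (1 - Y_filled) * np.log(1 - Y_hat + eps))
    per_entry = np.where(labeled, per_entry, 0.0)  # drop the unlabeled terms
    return per_entry.sum() / m

# Toy example: the second image is missing its stop-sign and traffic-light labels.
Y     = np.array([[1., 1.], [0., 1.], [1., np.nan], [0., np.nan]])
Y_hat = np.array([[0.9, 0.8], [0.1, 0.7], [0.8, 0.5], [0.2, 0.5]])
print(masked_multitask_loss(Y_hat, Y))
```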
So when does multi-task learning make sense? I'll say it usually makes sense when three things are true.

One is if you're training on a set of tasks that could benefit from having shared lower-level features. So for the autonomous driving example, it makes sense that recognizing traffic lights and cars and pedestrians should involve similar features that could also help you recognize stop signs, because these are all features of roads.

Second, and this is less of a hard-and-fast rule, so it isn't always true, but what I see in a lot of successful multi-task learning settings is that the amount of data you have for each task is quite similar. So if you recall from transfer learning, you learn from some task A and transfer it to some task B. If you have a million examples for task A and 1,000 examples for task B, then all the knowledge you learned from that million examples could really help augment the much smaller data set you have for task B. Well, how about multi-task learning? In multi-task learning you usually have a lot more tasks than just two. So previously we had four tasks, but let's say you have 100 tasks, and you're going to do multi-task learning to try to recognize 100 different types of objects at the same time. What you may find is that you have 1,000 examples per task. So if you focus on the performance of just one task, let's focus on the performance on the 100th task, which you can call A100. If you were trying to do this final task in isolation, you would have had just a thousand examples to train this one task, this one of the 100 tasks, whereas by training on the 99 other tasks, these in aggregate have 99,000 training examples, which could be a big boost, could give a lot of knowledge to augment this otherwise relatively small 1,000-example training set that you have for task A100. And symmetrically, every one of the other 99 tasks can provide some data, or provide some knowledge, that helps every one of the other tasks in this list of 100 tasks.

So the second bullet isn't a hard-and-fast rule, but what I tend to look at is: if you focus on any one task, for that task to get a big boost from multi-task learning, the other tasks in aggregate need to have quite a lot more data than that one task. And one way to satisfy that is if there are a lot of tasks, like we have in this example on the right, and if the amount of data you have for each task is quite similar. But the key really is that if you already have 1,000 examples for one task, then for all of the other tasks you'd better have a lot more than 1,000 examples in aggregate, if those other tasks are meant to help you do better on this final task.

And finally, multi-task learning tends to make more sense when you can train a big enough neural network to do well on all the tasks. The alternative to multi-task learning would be to train a separate neural network for each task.
So rather than training one neural network for pedestrian, car, stop sign, and traffic light detection, you could have trained one neural network for pedestrian detection, one neural network for car detection, one neural network for stop sign detection, and one neural network for traffic light detection. What the researcher Rich Caruana found many years ago is that the only time multi-task learning hurts performance, compared to training separate neural networks, is when your neural network isn't big enough. But if you can train a big enough neural network, then multi-task learning should not, or should only very rarely, hurt performance, and hopefully it will actually help performance compared to training neural networks to do these different tasks in isolation.

So that's it for multi-task learning. In practice, multi-task learning is used much less often than transfer learning. I see a lot of applications of transfer learning, where you have a problem you want to solve with a small amount of data, so you find a related problem with a lot of data, learn something from it, and transfer that to the new problem. But it's just rarer with multi-task learning that you have a huge set of tasks that you want to do well on and can train on all at the same time. Maybe the one exception is computer vision: in object detection I see more applications of multi-task learning, where one neural network trying to detect a whole bunch of objects at the same time works better than separate neural networks each trained to detect one kind of object. But I would say that, on average, transfer learning is used much more today than multi-task learning, though both are useful tools to have in your arsenal.

So to summarize, multi-task learning enables you to train one neural network to do many tasks, and this can give you better performance than if you were to do the tasks in isolation. Now, one note of caution: in practice I see that transfer learning is used much more often than multi-task learning. So I do see a lot of tasks where, if you want to solve a machine learning problem but you have a relatively small data set, transfer learning can really help.
If you find a related problem with a much bigger data set, you can train your neural network on that and then transfer it to the problem where you have very little data. So transfer learning is used a lot today. There are some applications of multi-task learning as well, but multi-task learning, I think, is used much less often than transfer learning. And maybe the one exception is computer vision object detection, where I do see a lot of applications of training a single neural network to detect lots of different objects, and that works better than training separate neural networks to detect the individual objects. But on average, I think that even though transfer learning and multi-task learning are often presented in a similar way, in practice I've seen a lot more applications of transfer learning than of multi-task learning. I think that's because it's often just difficult to set up, or to find, so many different tasks that you would actually want to train a single neural network on, with computer vision object detection examples being the most notable exception.

So that's it for multi-task learning. Multi-task learning and transfer learning are both important tools to have in your tool bag. And finally, I'd like to move on to discuss end-to-end deep learning. So let's go on to the next video to discuss end-to-end learning.