One of the most powerful ideas in deep learning is that sometimes you can take knowledge the neural network has learned from one task and apply that knowledge to a separate task. So for example, maybe you could have the neural network learn to recognize objects like cats, and then use that knowledge, or part of that knowledge, to help you do a better job reading x-ray scans. This is called transfer learning. Let's take a look.

Let's say you've trained your neural network on image recognition. So you first take a neural network and train it on (X, Y) pairs, where X is an image and Y is a label for the object in it: a cat, a dog, a bird, or something else. If you want to take this neural network and adapt it, or as we say, transfer what it has learned, to a different task, such as radiology diagnosis, meaning really reading x-ray scans, what you can do is take the last output layer of the neural network, delete it, delete the weights feeding into that last output layer, create a new set of randomly initialized weights just for the last layer, and have it now output a radiology diagnosis.

So to be concrete, during the first phase of training, when you're training on the image recognition task, you train all of the usual parameters of the neural network, all the weights in all the layers, and you end up with something that learns to make image recognition predictions. Having trained that neural network, what you now do to implement transfer learning is swap in a new dataset (X, Y), where now X are radiology images and Y are the diagnoses you want to predict, and you initialize the last layer's weights, let's call them W^[L] and b^[L], randomly. And now, retrain the neural network on this new dataset, the new radiology dataset.

You have a couple of options for how you retrain the neural network with the radiology data. If you have a small radiology dataset, you might want to retrain only the weights of the last layer, just W^[L] and b^[L], and keep the rest of the parameters fixed. If you have enough data, you could also retrain all the layers of the rest of the neural network.
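As a rough sketch of that layer swap, here is what it might look like in PyTorch, using a torchvision ImageNet model as a stand-in for the image recognition network; the three diagnosis classes are a made-up example, not part of the original lecture.

```python
import torch.nn as nn
import torchvision.models as models

# Network pretrained on image recognition (an ImageNet model stands in here).
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Delete the last output layer and the weights feeding into it by replacing
# it with a new, randomly initialized layer that outputs radiology
# diagnoses instead of image classes.
num_features = model.fc.in_features      # size of the layer feeding into it
model.fc = nn.Linear(num_features, 3)    # new W^[L], b^[L], randomly initialized
```

Only the new output layer starts from random weights; every other layer keeps what it learned from image recognition.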
And the rule of thumb is: if you have a small dataset, then just retrain the last layer, the output layer, or maybe the last one or two layers. But if you have a lot of data, then maybe you can retrain all the parameters in the network. And if you retrain all the parameters in the neural network, then this initial phase of training on image recognition is sometimes called pre-training, because you're using the image recognition data to pre-initialize, or really pre-train, the weights of the neural network. And then, if you are updating all the weights afterwards, the training on the radiology data is sometimes called fine-tuning. So if you hear the words pre-training and fine-tuning in a deep learning context, this is what they mean: pre-training and then fine-tuning the weights in a transfer learning setting.

What you've done in this example is take knowledge learned from image recognition and apply it, or transfer it, to radiology diagnosis. The reason this can be helpful is that a lot of the low level features, such as detecting edges, detecting curves, detecting parts of objects, learning those from a very large image recognition dataset might help your learning algorithm do better in radiology diagnosis. It has just learned a lot about the structure and the nature of what images look like, and some of that knowledge will be useful. So having learned to recognize images, it might have learned enough about what parts of different images look like, that knowledge about lines, dots, curves, and maybe small parts of objects, that it could help your radiology diagnosis network learn a bit faster or learn with less data.
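Before the next example, here is a rough sketch of those two retraining options, continuing the hypothetical PyTorch setup from above; the dataset-size flag, class count, and learning rate are illustrative assumptions, not recommendations.

```python
import torch
import torch.nn as nn
import torchvision.models as models

# Continuing the sketch above: pretrained image network, new output layer.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, 3)   # hypothetical 3 diagnoses

small_radiology_dataset = True   # hypothetical flag for the rule of thumb

if small_radiology_dataset:
    # Small dataset: freeze everything except the new output layer,
    # so only W^[L] and b^[L] are retrained on the radiology data.
    for p in model.parameters():
        p.requires_grad = False
    for p in model.fc.parameters():
        p.requires_grad = True
    trainable_params = model.fc.parameters()
else:
    # Lots of data: retrain (fine-tune) all the parameters; the earlier
    # image recognition phase then serves as pre-training.
    trainable_params = model.parameters()

optimizer = torch.optim.Adam(trainable_params, lr=1e-4)
# ...then loop over (x-ray image, diagnosis) batches as usual.
```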
Here's another example. Let's say that you've trained a speech recognition system, so now X is an input of audio, or audio snippets, and Y is the transcript. So you've trained the speech recognition system to output transcripts. And let's say that you now want to build a "wake word" or "trigger word" detection system.

Recall that a wake word or trigger word is the word we say in order to wake up speech-controlled devices in our houses, such as saying "Alexa" to wake up an Amazon Echo, "OK Google" to wake up a Google device, "Hey Siri" to wake up an Apple device, or "Ni hao Baidu" to wake up a Baidu device. In order to do this, you might again take out the last layer of the neural network and create a new output node. But sometimes another thing you could do is create not just a single new output, but several new layers on top of your neural network, to predict the labels Y for your wake word detection problem. Then again, depending on how much data you have, you might retrain only the new layers of the network, or maybe you could retrain even more layers of the neural network.

So, when does transfer learning make sense? Transfer learning makes sense when you have a lot of data for the problem you're transferring from and usually relatively less data for the problem you're transferring to. For example, let's say you have a million examples for the image recognition task. That's a lot of data to learn a lot of low level features, or to learn a lot of useful features, in the earlier layers of the neural network. But for the radiology task, maybe you have only a hundred examples. So you have very little data for the radiology diagnosis problem, maybe only 100 x-ray scans. So a lot of the knowledge you learn from image recognition can be transferred and can really help you get going with radiology diagnosis, even if you don't have a lot of data for radiology. For speech recognition, maybe you've trained a speech recognition system on 10,000 hours of data. So you've learned a lot about what human voices sound like from those 10,000 hours of data, which really is a lot. But for your trigger word detection, maybe you have only one hour of data. So that's not a lot of data to fit a lot of parameters.
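Picking the wake word architecture back up for a moment, here is a rough sketch of the "several new layers" option described above; the encoder here is only a stand-in for the earlier layers of a speech recognition network you have already trained, and every layer size is an assumption.

```python
import torch.nn as nn

# Stand-in for the earlier layers of a trained speech recognition network
# (architecture and sizes are assumptions for illustration only).
speech_encoder = nn.Sequential(
    nn.Linear(16000, 512), nn.ReLU(),   # e.g. one second of audio features in
    nn.Linear(512, 256), nn.ReLU(),
)

# Instead of a single new output node, stack several new, randomly
# initialized layers on top to predict the wake word label Y.
wake_word_head = nn.Sequential(
    nn.Linear(256, 64), nn.ReLU(),
    nn.Linear(64, 1),                   # single logit: wake word or not
)

model = nn.Sequential(speech_encoder, wake_word_head)
# With very little wake word data, train only wake_word_head's parameters;
# with more data, unfreeze and retrain parts of speech_encoder as well.
```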
So in this case, a lot of what you've learned about what human voices sound like, what the components of human speech are, and so on, can be really helpful for building a good wake word detector, even though you have a relatively small dataset, or at least a much smaller dataset, for the wake word detection task. So in both of these cases, you're transferring from a problem with a lot of data to a problem with relatively little data.

One case where transfer learning would not make sense is if the opposite were true. So if you had a hundred images for image recognition and you had 100 images, or even a thousand images, for radiology diagnosis, one way to think about it is that, to do well on radiology diagnosis, assuming what you really want is to do well on radiology diagnosis, having radiology images is much more valuable than having cat and dog and so on images. So each radiology example is much more valuable than each image recognition example, at least for the purpose of building a good radiology system. So if you already have more data for radiology, it's not that likely that having 100 images of random objects like cats and dogs and cars will be that helpful, because the value of one example from your image recognition task of cats and dogs is just less than the value of one x-ray image for the task of building a good radiology system. So this would be one example where transfer learning might not hurt, but I wouldn't expect it to give you any meaningful gain either.

And similarly, if you'd built a speech recognition system on 10 hours of data, and you actually have 10 hours, or maybe even more, say 50 hours, of data for wake word detection, it may or may not hurt, maybe it won't hurt, to include those 10 hours of data in your transfer learning, but you just wouldn't expect to get a meaningful gain.

So to summarize, when does transfer learning make sense? If you're trying to learn from some Task A and transfer some of the knowledge to some Task B, then transfer learning makes sense when Task A and Task B have the same input X. In the first example, A and B both have images as input.
In the second example, both have audio clips as input. And it tends to make sense when you have a lot more data for Task A than for Task B. All of this is under the assumption that what you really want to do well on is Task B. And because data for Task B is more valuable for Task B, you usually need a lot more data for Task A, since each example from Task A is just less valuable for Task B than each example from Task B. And then finally, transfer learning will tend to make more sense if you suspect that low level features from Task A could be helpful for learning Task B. In both of the earlier examples, maybe learning image recognition teaches you enough about images to help with radiology diagnosis, and maybe learning speech recognition teaches you enough about human speech to help you with trigger word or wake word detection.

So to summarize, transfer learning has been most useful when you're trying to do well on some Task B, usually a problem where you have relatively little data. For example, in radiology, it's difficult to get that many x-ray scans to build a good radiology diagnosis system. In that case, you might find a related but different task, such as image recognition, where you can get maybe a million images and learn a lot of low level features from that, so that you can then try to do well on Task B, your radiology task, despite not having that much data for it. When transfer learning makes sense, it can help the performance of your learning task significantly. But I've also sometimes seen transfer learning applied in settings where Task A actually has less data than Task B, and in those cases, you kind of don't expect to see much of a gain.

So, that's it for transfer learning, where you learn from one task and try to transfer to a different task. There's another version of learning from multiple tasks, which is called multi-task learning, where you try to learn from multiple tasks at the same time, rather than learning from one and then, sequentially or after that, trying to transfer to a different task. So in the next video, let's discuss multi-task learning.