In this video, we will talk about tricks that will make training of new neural networks much faster. The first one is transfer learning. Remember that deep networks learn a complex feature extractor, but we need lots of data to train it from scratch.

Let's look at the ImageNet classification architecture. There are lots of convolutional layers, and we call them a feature extractor, because the last convolutional layer extracts features that are useful for classification with an MLP, the last pink layers. This architecture is trained on the ImageNet dataset. But what if we could reuse the existing feature extractor, the blue convolutions on the slide, for a new task? How do we do that? We add a new classifier on top of those features, and those orange weights are all we need to train. You need less data, because you train only the final MLP layers. It works if the domain of the new task is similar to ImageNet's. It won't work for human emotion classification, because ImageNet doesn't have human faces in the dataset, so it doesn't know the concept of a human face.

But what if we need to classify human emotions? Maybe we can partially reuse the ImageNet feature extractor. Let's look at the perfect feature extractor that we would need. It looks like a bunch of convolutional layers, so let's look at what activation stimuli we actually want to get. The first convolutional layers will have the highest activations for edge detectors with different rotations. If we go deeper, the convolutional layers learn the concept of a human eye, nose, or mouth. And if we go deeper than that, we have layers that learn the representation of a whole human face. That is the perfect feature extractor that we want; let's compare it with the ImageNet feature extractor. ImageNet definitely has those edge detectors as well, but, given that it doesn't have human faces in the dataset, it doesn't know the concept of a nose or a mouth, so we will need to train those layers ourselves.
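Before moving on, here is a minimal sketch of the plain transfer learning setup described above, assuming the Keras applications API with a VGG16 backbone; the 224x224 input size, the layer widths, and the 10-class target task are illustrative assumptions, not part of the lecture.

```python
# Transfer learning sketch: reuse a pre-trained ImageNet feature extractor
# (the "blue convolutions") and train only a new MLP head (the "orange weights").
from tensorflow.keras.applications import VGG16
from tensorflow.keras import layers, models

# Pre-trained feature extractor, without the original ImageNet classifier.
extractor = VGG16(weights='imagenet', include_top=False,
                  input_shape=(224, 224, 3))
extractor.trainable = False  # freeze: reuse the ImageNet features as-is

# New classifier on top; these are the only weights that get trained.
model = models.Sequential([
    extractor,
    layers.Flatten(),
    layers.Dense(256, activation='relu'),
    layers.Dense(10, activation='softmax'),  # hypothetical 10-class task
])
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
# model.fit(x_train, y_train, ...)  # only the Dense layers receive gradients
```

Because the frozen extractor contributes no trainable parameters, far less data is needed than when training the whole network from scratch.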
Let's look at how an architecture for human emotion classification might look. What we do is we actually reuse the first convolutional layers, which are in green, and we add new convolutional layers and a new multi-layer perceptron to train for our new task. All we need to train are those blue convolutions and orange fully connected layers. It works much better, because we have to train far fewer parameters.

What if we don't start from scratch and don't initialize those blue convolutions with random numbers, but rather use the initialization from a pre-trained ImageNet network? That leads us to the so-called fine-tuning technique: you don't start with a random initialization, but rather reuse those complex representations that are suitable for ImageNet classification. What is the intuition behind this? We don't start with random features; we start with features that are useful for some other task. They are not perfect for our task, but they might be much better than random. What we do next is propagate all the gradients, but with a smaller learning rate, so that we don't lose the initialization that we got from ImageNet.

Fine-tuning is very frequently used thanks to the wide spectrum of ImageNet classes. Keras, a deep learning framework, has the weights of pre-trained VGG, Inception, and ResNet architectures. What is so special about that is that you can fine-tune a bunch of different architectures and make an ensemble out of them. And you don't have to wait two or three weeks to train your network on the ImageNet dataset.

Let's summarize a little bit. If you have a small dataset and it is from the ImageNet domain, which means that you have objects that are somewhat similar to those seen in the ImageNet dataset, then all you need to do is use transfer learning and train the last MLP layers. If you have a bigger dataset, then it makes sense to fine-tune deeper layers, so that you squeeze a little bit more quality out of the network. If you have a big dataset but it's not similar to the ImageNet domain, then it makes sense to train from scratch, because most likely you can't reuse the features from ImageNet. But if you have a small dataset which is not similar to ImageNet, then you're out of luck, and most likely you will have to collect more data.
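To make the fine-tuning recipe concrete, here is a sketch that continues the Keras example above; unfreezing only the last VGG16 convolutional block and the 1e-5 learning rate are illustrative choices for this sketch, not fixed rules.

```python
# Fine-tuning sketch: start from the ImageNet initialization and propagate
# gradients into the deeper convolutions, but with a small learning rate
# so the pre-trained representation is refined rather than destroyed.
from tensorflow.keras.optimizers import Adam

extractor.trainable = True
for layer in extractor.layers:
    # Keep the early edge-detector layers frozen; adapt only the deepest block.
    layer.trainable = layer.name.startswith('block5')

model.compile(optimizer=Adam(learning_rate=1e-5),  # small step size
              loss='categorical_crossentropy',
              metrics=['accuracy'])
# model.fit(x_train, y_train, ...)  # block5 convolutions + the MLP are updated
```

With more data you can unfreeze more blocks; with less data, keep more of the network frozen, matching the summary above.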
In the next video, we will take a look at other computer vision problems that utilize convolutional networks.