In this video, we will take a quick look at other computer vision problems that successfully utilize convolutional networks.

So far, we have examined the image classification task, which takes an image as input and produces a class label as output. In this video, we will review two more tasks. The first one is semantic segmentation, where you have an image as input and, as output, you need to give a class label for each pixel of that image: for example, which pixels correspond to water, which to a duck, and which to grass. Another example is image classification plus localization. In this task, you not only need to say which object you see in the image, but also where you see it. For that, you need to define a bounding box, that is, a rectangular region that contains your object.

Let's start with semantic segmentation. For this task, we need to classify each pixel of our image. So what do we do when we have an image as input? We stack convolutional layers, right? And we keep the same width and height as our input image, because we will need to classify each pixel.

What do we do next? Usually we would add pooling layers, but in this particular task that is problematic: pooling effectively downsamples the image, so our classification would not be crisp, it would be pixelated. And we don't want that. So let's maintain the width and height of our feature maps as we stack more and more convolutional layers.

The final layer will be a different one. It will have the number of output channels equal to the number of classes that we need for our segmentation; for example, each depth slice will be responsible for classifying, let's say, water, duck, or grass. What we do in the end is, for every pixel (remember, we maintain the width and height), take all the values encoded in the depth of that final volume, the one shown in orange, and apply a softmax function over those values in the output channels.

This is a rather naive approach: we stack convolutional layers and add a per-pixel softmax. We go deep, but we don't add pooling.
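To make this concrete, here is a minimal sketch of the naive architecture in PyTorch. The framework choice and all names (NaiveSegNet, n_classes, width) are ours, not from the lecture; in practice you would train on the raw logits with nn.CrossEntropyLoss rather than applying the softmax yourself.

```python
import torch
import torch.nn as nn

class NaiveSegNet(nn.Module):
    """Naive fully convolutional segmentation: no pooling, so the
    spatial size is preserved by 'same' padding all the way through."""
    def __init__(self, n_classes=3, width=64):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, width, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(width, width, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(width, width, kernel_size=3, padding=1), nn.ReLU(),
        )
        # Final layer: one output channel per class (e.g. water, duck, grass).
        self.classifier = nn.Conv2d(width, n_classes, kernel_size=1)

    def forward(self, x):
        logits = self.classifier(self.features(x))  # (N, n_classes, H, W)
        # Per-pixel class distribution: softmax over the channel dimension.
        return torch.softmax(logits, dim=1)

net = NaiveSegNet()
probs = net(torch.randn(1, 3, 64, 64))
print(probs.shape)  # torch.Size([1, 3, 64, 64]): one distribution per pixel
```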
But going deep without any pooling is computationally expensive. So let's add pooling, which acts like downsampling; we will paint it with a pink color. We have an image as input, then the first convolutional layer, followed by a pooling layer. After the pooling layer, we reduce the width and height of our volume and increase the depth. Then we have one more convolutional layer and one more pooling layer, and then we stack one more convolutional layer.

But wait a second, we need to classify each pixel, and right now the width and height of our volume are significantly reduced. We need to do unpooling somehow. For that task, we will use a special layer which will do upsampling, and we color it with a green color. After each upsampling layer, we use convolutional layers so that we can learn a transformation back to the original pixels. We add one more upsampling layer and one more convolutional layer, and that is how we get our semantic segmentation of the input pixels.

How do we do that unpooling? The easiest way is to fill in nearest-neighbor values: we take a 2x2 input and replace each cell with a 2x2 patch of the same value. This way, we get a pixelated output, and it is not crisp. It's not the best way to go.

Another technique is called max unpooling. Let's look at our architecture: we have corresponding pairs of downsampling and upsampling layers, which do the same thing but in reverse order. Let's use that correspondence. What if we remember which element was the maximum during pooling and fill that position during unpooling? Let's look at an example. We have a 4x4 input, and we apply 2x2 max pooling with stride 2, remembering which neurons gave us the maximum activations. Then comes the rest of the network, and at some point we have to do unpooling: we have to produce a 4x4 output out of a 2x2 input. We do that not by filling in nearest-neighbor values; instead, we put the values into the locations where we had the maximum activations during the corresponding pooling, and fill the remaining positions with zeros. This way, we get a crisper image, and it actually works better.
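Both unpooling schemes are available in PyTorch; here is a small sketch (the tensor values are our own toy example, not from the lecture):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.tensor([[[[1., 2., 6., 3.],
                    [3., 5., 2., 1.],
                    [1., 2., 2., 1.],
                    [7., 3., 4., 8.]]]])  # one image, one channel, 4x4

# 2x2 max pooling with stride 2; return_indices remembers where each max was.
pool = nn.MaxPool2d(kernel_size=2, stride=2, return_indices=True)
pooled, indices = pool(x)                 # pooled: [[5., 6.], [7., 8.]]
# ... the rest of the network would process `pooled` here ...

# Nearest-neighbor unpooling: every cell becomes a 2x2 patch of the same value.
nearest = F.interpolate(pooled, scale_factor=2, mode="nearest")

# Max unpooling: each value goes back to where its maximum came from,
# and all other positions are filled with zeros.
unpool = nn.MaxUnpool2d(kernel_size=2, stride=2)
crisp = unpool(pooled, indices)
print(nearest.squeeze())  # blocky, pixelated 4x4
print(crisp.squeeze())    # 4x4 with the maxima restored in place
```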
The previous approaches are not data-driven. Imagine that your objects are round rather than square; nearest-neighbor unpooling and max unpooling are not aware of that, but we could actually use that information to do better unpooling.

Remember that we can replace a max pooling layer with a convolutional layer that has a bigger stride. What if we apply convolutions to do the unpooling as well? Let's see how it might work. We have a 2x2 input, and we somehow need to produce a 4x4 output. Let's use a 3x3 convolutional filter for that. How does it work? We take the convolutional filter, multiply it by the value in the red input cell, and add those values to the output. Then we move to the next pixel in the input, and in the output we move with stride 2, so that we double the resolution we had in the input. We take the kernel weights again, multiply them by the value in the blue input cell, and add the result to the output as well. But what do we do with the values where our placed filters intersect? We simply take the sum of those values, and it still works. This operation is commonly called a transposed convolution, and because the kernel weights are learned, the upsampling becomes data-driven.
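Here is a small sketch of that operation using PyTorch's nn.ConvTranspose2d. We fix the kernel to all ones so the overlap sums are easy to see, and the padding settings are our choice so that a 2x2 input gives exactly a 4x4 output, which is one common convention:

```python
import torch
import torch.nn as nn

# 3x3 kernel, stride 2 in the output; padding/output_padding chosen so
# that a 2x2 input produces exactly a 4x4 output.
tconv = nn.ConvTranspose2d(1, 1, kernel_size=3, stride=2,
                           padding=1, output_padding=1, bias=False)
with torch.no_grad():
    tconv.weight.fill_(1.0)  # all-ones kernel makes the sums visible

x = torch.tensor([[[[1., 2.],
                    [3., 4.]]]])
y = tconv(x)
print(y.squeeze())
# tensor([[ 1.,  3.,  2.,  2.],
#         [ 4., 10.,  6.,  6.],
#         [ 3.,  7.,  4.,  4.],
#         [ 3.,  7.,  4.,  4.]])
# Where the shifted copies of the kernel overlap, the contributions are
# summed: the 10 in the middle is 1 + 2 + 3 + 4. In a real network these
# weights are learned, so the model learns its own upsampling.
```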
Now let's move on to the object classification and localization task. For this, we need to find a bounding box to localize our object. Let's parameterize the bounding box with four numbers: x, y, w, and h. Here, x and y are the coordinates of the upper-left corner of the box, and w and h are its width and height. We can use regression to predict those four parameters. Let's see how it might work.

We have a classification network, which looks like a bunch of convolutional layers followed by a multilayer perceptron, and we train it using a cross-entropy loss. But do we need a second network to do the bounding box regression? Actually, no: we can reuse those convolutional layers for the new task and train a new fully connected head that predicts the bounding box parameters, using a mean squared error loss for it.

But how do we train such a network when we have two different losses? We simply take the sum of those losses, L = L_cross-entropy + L_MSE, and that gives us the final loss through which we propagate the gradients during backpropagation.

In this video, we took a sneak peek at other computer vision problems that successfully utilize convolutional neural networks. This video concludes our introduction to neural networks for images.