[MUSIC] In this video, we will take a quick look at other computer vision problems that successfully utilize convolutional networks. So far, we have examined the image classification task, which has an image as an input and a class label as an output. In this video, we will review two more tasks. The first one is semantic segmentation, where you have an image as an input and, as an output, you need to give a class label for each pixel of that image. For example, which pixels correspond to water, which pixels correspond to a duck or to grass. Another example is image classification plus localization. In this task, you not only need to say which object you see in the image, but also where you see it. And for that, you need to define a bounding box, that is, a rectangular region (yellow on the slide) that contains your object.

Let's start with semantic segmentation. For this task, we need to classify each pixel of our image. So what do we do when we have an image as an input? We stack convolutional layers, right? And we use the same width and height as in our input image, because we will need to classify each pixel. And what do we do next? Usually we add pooling layers, but in this particular task that is not so easy, because pooling effectively downsamples your image, and then our classification will not be crisp, it will be pixelated. And we don't want that. So let's maintain the width and height of our feature maps as we stack more and more of these layers. The final layer will be a different one: it will have the number of output channels equal to the number of classes that we need for our segmentation. For example, each depth slice will be responsible for the classification of, let's say, water, duck, or grass. And what we do in the end is, for every pixel (remember, we maintain the width and height), we take all the values that are encoded in the depth of that final volume (the orange one on the slide) and apply a softmax function over those values in the output channels. This is a rather naive approach: we stack convolutional layers and add a per-pixel softmax. We go deep, but we don't add pooling, and that is too expensive and harder to train.

Let's add pooling, which acts like downsampling; we will paint it pink on the slide. So we have an image as an input, we have the first convolutional layer, followed by a pooling layer. After the pooling layer, we reduce the width and height of our volume and increase the depth. Then we have one more convolutional layer and one more pooling layer, and then we stack one more convolutional layer. But wait a second, we need to classify each pixel, and right now the width and height of our volume are significantly reduced. We need to do unpooling somehow. For that task, we will use a special layer which will do upsampling, and we color it green. After upsampling, we will use convolutional layers, so that we can learn a transformation back to the original pixels. We add one more upsampling layer and one more convolutional layer, and that is how we get our semantic segmentation of the input pixels.

How do we do that unpooling? The easiest way is to fill in nearest-neighbor values: we have a 2x2 input, and we replace each cell of that input with a 2x2 patch of the same value. This way, we get a pixelated output, and it is not crisp, so it's not the best way to go.
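To make this concrete, here is a rough sketch of the naive, full-resolution approach described above, assuming PyTorch; the layer sizes, the 64x64 input, and the three classes (water, duck, grass) are illustrative assumptions, not the exact network from the video:

```python
import torch
import torch.nn as nn

num_classes = 3  # e.g. water, duck, grass

# Keep width and height fixed throughout (3x3 kernels with padding=1),
# no pooling; the last layer has one output channel per class.
naive_segmenter = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(32, num_classes, kernel_size=1),  # depth slices = class scores
)

image = torch.randn(1, 3, 64, 64)        # (batch, channels, height, width)
scores = naive_segmenter(image)          # (1, num_classes, 64, 64)
probs = scores.softmax(dim=1)            # per-pixel softmax over the channels
```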
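And here is a minimal sketch of the pooling/upsampling architecture just described, with the same caveats about sizes; in this version the green upsampling layers are plain nearest-neighbor upsampling, and better choices are discussed next:

```python
import torch
import torch.nn as nn

num_classes = 3

encoder_decoder = nn.Sequential(
    # downsampling path: convolution followed by pooling (the pink layers)
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),                         # 64x64 -> 32x32
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),                         # 32x32 -> 16x16
    nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
    # upsampling path: upsampling (the green layers) followed by convolution
    nn.Upsample(scale_factor=2),             # 16x16 -> 32x32
    nn.Conv2d(64, 32, 3, padding=1), nn.ReLU(),
    nn.Upsample(scale_factor=2),             # 32x32 -> 64x64
    nn.Conv2d(32, num_classes, 3, padding=1),
)

image = torch.randn(1, 3, 64, 64)
print(encoder_decoder(image).shape)          # torch.Size([1, 3, 64, 64])
```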
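The nearest-neighbor unpooling itself is easy to see on a toy tensor, for example with torch.nn.functional.interpolate:

```python
import torch
import torch.nn.functional as F

x = torch.tensor([[1., 2.],
                  [3., 4.]]).reshape(1, 1, 2, 2)   # (batch, channels, 2, 2)

# every input cell becomes a 2x2 patch filled with the same value
y = F.interpolate(x, scale_factor=2, mode="nearest")
print(y.squeeze())
# tensor([[1., 1., 2., 2.],
#         [1., 1., 2., 2.],
#         [3., 3., 4., 4.],
#         [3., 3., 4., 4.]])
```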
Another technique is called max unpooling. Let's look at our architecture: we have corresponding pairs of downsampling and upsampling layers, which do the same thing but in reverse order. Let's use that information. What if we remember which element was the maximum during pooling and fill that position during unpooling? Let's look at an example. We have a 4x4 input, and we apply 2x2 max pooling with stride 2, remembering which neurons gave us the maximum activations. Then comes the rest of the network, and at some point we have to do unpooling, producing a 4x4 output out of a 2x2 input. We do that not by filling in nearest-neighbor values; rather, we take these values and put them in the locations where we had the maximum activations during the corresponding pooling. This way, we get a crisper image, and it actually works better.

The previous approaches are not data-driven. Imagine that your objects are round, not square; nearest-neighbor unpooling and max unpooling are not aware of that, but we could actually use that information to do better unpooling. Remember that we can replace a max pooling layer with a convolutional layer that has a bigger stride. What if we apply convolutions to do unpooling? Let's see how it might work. We have a 2x2 input and we somehow need to produce a 4x4 output. Let's use a 3x3 convolutional filter for that. How does it work? We take the convolutional filter, multiply it by the value in the red input cell, and add those values to the output. Then we move to the next pixel in the input, and in the output we move with a stride of 2, so that we double the resolution we had in the input. We take the kernel weights, multiply them by the value in the blue input cell, and add that to the output as well. But what do we do with the values where our filter placements intersect? We simply take the sum of those values, and it still works.

Let's move on to the object classification and localization task. For this, we need to find a bounding box that localizes our object. Let's parameterize the bounding box with four numbers: x, y, w, and h. X and y stand for the coordinates of the upper left corner of the box, and w and h stand for its width and height. We can use regression for those four parameters. Let's see how it might work. We have a classification network, which looks like a bunch of convolutional layers followed by a multilayer perceptron, and we train it using cross-entropy. But do we need a second network to do bounding box regression? Actually, we can reuse those convolutional layers for our new task and train a new fully connected layer that will predict the bounding box parameters, using mean squared error for that. But how do we train such a network when we have two different losses? We simply take the sum of those losses, and that gives us the final loss through which we propagate the gradients during backpropagation.
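To make these ideas concrete as well, here is a sketch of max unpooling with PyTorch's built-in layers: nn.MaxPool2d can return the positions of the maxima, and nn.MaxUnpool2d puts values back into exactly those positions (everything else is filled with zeros):

```python
import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=2, stride=2, return_indices=True)
unpool = nn.MaxUnpool2d(kernel_size=2, stride=2)

x = torch.arange(16, dtype=torch.float32).reshape(1, 1, 4, 4)

# remember which neurons gave the maximum activations during pooling
pooled, indices = pool(x)            # pooled is (1, 1, 2, 2)

# ... the rest of the network goes here ...

# put the values back into the remembered positions during unpooling
restored = unpool(pooled, indices)   # restored is (1, 1, 4, 4)
print(restored.squeeze())
```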
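The learnable, convolution-based upsampling just described corresponds to a transposed convolution. A sketch with nn.ConvTranspose2d, using a 3x3 kernel and stride 2 as in the example; the padding settings are an assumption chosen so that a 2x2 input gives exactly a 4x4 output:

```python
import torch
import torch.nn as nn

# Each input value stamps a weighted copy of the 3x3 kernel into the output,
# moving with stride 2 in the output; overlapping contributions are summed.
upconv = nn.ConvTranspose2d(in_channels=1, out_channels=1,
                            kernel_size=3, stride=2,
                            padding=1, output_padding=1, bias=False)

x = torch.randn(1, 1, 2, 2)
y = upconv(x)
print(y.shape)   # torch.Size([1, 1, 4, 4])
```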
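Finally, a sketch of the shared-backbone setup for classification plus localization, assuming PyTorch and hypothetical layer sizes: the same convolutional features feed a class head trained with cross-entropy and a box head (x, y, w, h) trained with mean squared error, and the final loss is simply the sum of the two.

```python
import torch
import torch.nn as nn

class ClassifyAndLocalize(nn.Module):
    def __init__(self, num_classes=20):
        super().__init__()
        # shared convolutional layers, reused for both tasks
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Flatten(),
        )
        feat_dim = 32 * 16 * 16                              # for 64x64 inputs
        self.class_head = nn.Linear(feat_dim, num_classes)   # class scores
        self.box_head = nn.Linear(feat_dim, 4)               # x, y, w, h

    def forward(self, images):
        features = self.backbone(images)
        return self.class_head(features), self.box_head(features)

model = ClassifyAndLocalize()
images = torch.randn(8, 3, 64, 64)
labels = torch.randint(0, 20, (8,))
boxes = torch.rand(8, 4)                                     # ground-truth x, y, w, h

class_scores, box_preds = model(images)
loss = (nn.CrossEntropyLoss()(class_scores, labels)          # classification loss
        + nn.MSELoss()(box_preds, boxes))                    # bounding box regression loss
loss.backward()                                              # gradients flow through both heads
```

In this video, we took a sneak peek at other computer vision problems that successfully utilize convolutional neural networks. This video concludes our introduction to neural networks for images. [MUSIC]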