[MUSIC] In this video, we will take a quick look at other computer vision problems that successfully utilize convolutional networks. So far, we have examined the image classification task, which has an image as an input and a class label as an output. In this video, we will review two more tasks. The first one is semantic segmentation, where you have an image as an input and, as an output, you need to give a class label for each pixel of that image. For example, which pixels correspond to water, which pixels correspond to a duck or to grass. Another example is image classification plus localization. In this task, you not only need to say which object you see in the image, but also where you see it. And for that, you need to define a bounding box, that is, a rectangular region (yellow on the slide) that contains your object.

Let's start with semantic segmentation. For this task, we need to classify each pixel of our image. So what do we do when we have an image as an input? We stack convolutional layers, right? And we use the same width and height as in our input image, because we will need to classify each pixel. And what do we do next? Usually we add pooling layers, but in this particular task that is not so easy, because pooling effectively downsamples your image, and then our classification will not be crisp, it will be pixelated. And we don't want that. So let's maintain the width and height of our feature maps as we stack more and more of these layers. The final layer will be a different one: it will have the number of output channels equal to the number of classes that we need for our segmentation. For example, each depth slice will be responsible for the classification of, let's say, water, duck, or grass. And what we do in the end is, for every pixel (remember, we maintain the width and height), we take all the values that are encoded in the depth of that final volume (the orange one on the slide) and apply a softmax function over those values in the output channels. This is a rather naive approach: we stack convolutional layers and add a per-pixel softmax. We go deep, but we don't add pooling, and that is too expensive and harder to train.

Let's add pooling, which acts like downsampling; we will paint it pink on the slide. So we have an image as an input, we have the first convolutional layer, followed by a pooling layer. After the pooling layer, we reduce the width and height of our volume and increase the depth. Then we have one more convolutional layer and one more pooling layer, and then we stack one more convolutional layer. But wait a second, we need to classify each pixel, and right now the width and height of our volume are significantly reduced. We need to do unpooling somehow. For that task, we will use a special layer which will do upsampling, and we color it green. After upsampling, we will use convolutional layers, so that we can learn a transformation back to the original pixels. We add one more upsampling layer and one more convolutional layer, and that is how we get our semantic segmentation of the input pixels.

How do we do that unpooling? The easiest way is to fill in nearest-neighbor values: we have a 2x2 input, and we replace each cell of that input with a 2x2 patch of the same value. This way, we get a pixelated output, and it is not crisp, so it's not the best way to go.
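To make this concrete, here is a rough sketch of the naive, full-resolution approach described above, assuming PyTorch; the layer sizes, the 64x64 input, and the three classes (water, duck, grass) are illustrative assumptions, not the exact network from the video:

```python
import torch
import torch.nn as nn

num_classes = 3  # e.g. water, duck, grass

# Keep width and height fixed throughout (3x3 kernels with padding=1),
# no pooling; the last layer has one output channel per class.
naive_segmenter = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(32, num_classes, kernel_size=1),  # depth slices = class scores
)

image = torch.randn(1, 3, 64, 64)        # (batch, channels, height, width)
scores = naive_segmenter(image)          # (1, num_classes, 64, 64)
probs = scores.softmax(dim=1)            # per-pixel softmax over the channels
```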
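And here is a minimal sketch of the pooling/upsampling architecture just described, with the same caveats about sizes; in this version the green upsampling layers are plain nearest-neighbor upsampling, and better choices are discussed next:

```python
import torch
import torch.nn as nn

num_classes = 3

encoder_decoder = nn.Sequential(
    # downsampling path: convolution followed by pooling (the pink layers)
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),                         # 64x64 -> 32x32
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),                         # 32x32 -> 16x16
    nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
    # upsampling path: upsampling (the green layers) followed by convolution
    nn.Upsample(scale_factor=2),             # 16x16 -> 32x32
    nn.Conv2d(64, 32, 3, padding=1), nn.ReLU(),
    nn.Upsample(scale_factor=2),             # 32x32 -> 64x64
    nn.Conv2d(32, num_classes, 3, padding=1),
)

image = torch.randn(1, 3, 64, 64)
print(encoder_decoder(image).shape)          # torch.Size([1, 3, 64, 64])
```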
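The nearest-neighbor unpooling itself is easy to see on a toy tensor, for example with torch.nn.functional.interpolate:

```python
import torch
import torch.nn.functional as F

x = torch.tensor([[1., 2.],
                  [3., 4.]]).reshape(1, 1, 2, 2)   # (batch, channels, 2, 2)

# every input cell becomes a 2x2 patch filled with the same value
y = F.interpolate(x, scale_factor=2, mode="nearest")
print(y.squeeze())
# tensor([[1., 1., 2., 2.],
#         [1., 1., 2., 2.],
#         [3., 3., 4., 4.],
#         [3., 3., 4., 4.]])
```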
Another technique is called max unpooling. Let's look at our architecture: we have corresponding pairs of downsampling and upsampling layers, which do the same thing but in reverse order. Let's use that information. What if we remember which element was the maximum during pooling and fill that position during unpooling? Let's look at an example. We have a 4x4 input, and we apply 2x2 max pooling with stride 2, remembering which neurons gave us the maximum activations. Then comes the rest of the network, and at some point we have to do unpooling, producing a 4x4 output out of a 2x2 input. We do that not by filling in nearest-neighbor values; rather, we take these values and put them in the locations where we had the maximum activations during the corresponding pooling. This way, we get a crisper image, and it actually works better.

The previous approaches are not data-driven. Imagine that your objects are round, not square; nearest-neighbor unpooling and max unpooling are not aware of that, but we could actually use that information to do better unpooling. Remember that we can replace a max pooling layer with a convolutional layer that has a bigger stride. What if we apply convolutions to do unpooling? Let's see how it might work. We have a 2x2 input and we somehow need to produce a 4x4 output. Let's use a 3x3 convolutional filter for that. How does it work? We take the convolutional filter, multiply it by the value in the red input cell, and add those values to the output. Then we move to the next pixel in the input, and in the output we move with a stride of 2, so that we double the resolution we had in the input. We take the kernel weights, multiply them by the value in the blue input cell, and add that to the output as well. But what do we do with the values where our filter placements intersect? We simply take the sum of those values, and it still works.

Let's move on to the object classification and localization task. For this, we need to find a bounding box that localizes our object. Let's parameterize the bounding box with four numbers: x, y, w, and h. X and y stand for the coordinates of the upper left corner of the box, and w and h stand for its width and height. We can use regression for those four parameters. Let's see how it might work. We have a classification network, which looks like a bunch of convolutional layers followed by a multilayer perceptron, and we train it using cross-entropy. But do we need a second network to do bounding box regression? Actually, we can reuse those convolutional layers for our new task and train a new fully connected layer that will predict the bounding box parameters, using mean squared error for that. But how do we train such a network when we have two different losses? We simply take the sum of those losses, and that gives us the final loss through which we propagate the gradients during backpropagation.
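To make these ideas concrete as well, here is a sketch of max unpooling with PyTorch's built-in layers: nn.MaxPool2d can return the positions of the maxima, and nn.MaxUnpool2d puts values back into exactly those positions (everything else is filled with zeros):

```python
import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=2, stride=2, return_indices=True)
unpool = nn.MaxUnpool2d(kernel_size=2, stride=2)

x = torch.arange(16, dtype=torch.float32).reshape(1, 1, 4, 4)

# remember which neurons gave the maximum activations during pooling
pooled, indices = pool(x)            # pooled is (1, 1, 2, 2)

# ... the rest of the network goes here ...

# put the values back into the remembered positions during unpooling
restored = unpool(pooled, indices)   # restored is (1, 1, 4, 4)
print(restored.squeeze())
```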
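The learnable, convolution-based upsampling just described corresponds to a transposed convolution. A sketch with nn.ConvTranspose2d, using a 3x3 kernel and stride 2 as in the example; the padding settings are an assumption chosen so that a 2x2 input gives exactly a 4x4 output:

```python
import torch
import torch.nn as nn

# Each input value stamps a weighted copy of the 3x3 kernel into the output,
# moving with stride 2 in the output; overlapping contributions are summed.
upconv = nn.ConvTranspose2d(in_channels=1, out_channels=1,
                            kernel_size=3, stride=2,
                            padding=1, output_padding=1, bias=False)

x = torch.randn(1, 1, 2, 2)
y = upconv(x)
print(y.shape)   # torch.Size([1, 1, 4, 4])
```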
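Finally, a sketch of the shared-backbone setup for classification plus localization, assuming PyTorch and hypothetical layer sizes: the same convolutional features feed a class head trained with cross-entropy and a box head (x, y, w, h) trained with mean squared error, and the final loss is simply the sum of the two.

```python
import torch
import torch.nn as nn

class ClassifyAndLocalize(nn.Module):
    def __init__(self, num_classes=20):
        super().__init__()
        # shared convolutional layers, reused for both tasks
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Flatten(),
        )
        feat_dim = 32 * 16 * 16                              # for 64x64 inputs
        self.class_head = nn.Linear(feat_dim, num_classes)   # class scores
        self.box_head = nn.Linear(feat_dim, 4)               # x, y, w, h

    def forward(self, images):
        features = self.backbone(images)
        return self.class_head(features), self.box_head(features)

model = ClassifyAndLocalize()
images = torch.randn(8, 3, 64, 64)
labels = torch.randint(0, 20, (8,))
boxes = torch.rand(8, 4)                                     # ground-truth x, y, w, h

class_scores, box_preds = model(images)
loss = (nn.CrossEntropyLoss()(class_scores, labels)          # classification loss
        + nn.MSELoss()(box_preds, boxes))                    # bounding box regression loss
loss.backward()                                              # gradients flow through both heads
```

In this video, we took a sneak peek at other computer vision problems that successfully utilize convolutional neural networks. This video concludes our introduction to neural networks for images. [MUSIC]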