So whereas in transfer learning you have a sequential process, where you learn from task A and then transfer that to task B, in multi-task learning you start off simultaneously, trying to have one neural network do several things at the same time, and each of these tasks hopefully helps all of the other tasks. Let's look at an example. Let's say you're building an autonomous vehicle, building a self-driving car. Then your self-driving car would need to detect several different things, such as pedestrians, other cars, stop signs, and also traffic lights, among other things. So for example, in this example on the left, there is a stop sign in this image and there is a car in this image, but there aren't any pedestrians or traffic lights. So if this image is an input example x(i), then instead of having one label y(i), you would actually have four labels. In this example, there are no pedestrians, there is a car, there is a stop sign, and there are no traffic lights. And if you try to detect other things, y(i) may have even more dimensions, but for now let's stick with these four. So y(i) is a 4 by 1 vector. And if you look at the training set labels as a whole, then similar to before, we'll stack the training labels horizontally as follows, y(1) up to y(m). Except that now each y(i) is a 4 by 1 vector, so each of these is a tall column vector, and this matrix Y is now a 4 by m matrix, whereas previously, when y was a single real number, this would have been a 1 by m matrix.

So what you can do now is train a neural network to predict these values of y. You can have a neural network that takes x as input and outputs a four-dimensional value for y. Notice that for the output I've drawn four nodes. The first node tries to predict: is there a pedestrian in this picture? The second output predicts whether there is a car, the third whether there is a stop sign, and the fourth whether there is a traffic light. So y-hat here is four-dimensional.

To train this neural network, you now need to define the loss for the neural network. Given a predicted output y-hat(i), which is 4 by 1 dimensional, the loss averaged over your entire training set would be 1 over m, sum from i = 1 through m, sum from j = 1 through 4, of the losses on the individual predictions. So it's just summing over the four components: pedestrian, car, stop sign, traffic light. And this script L is the usual logistic loss, which written out is -y_j^(i) log(y-hat_j^(i)) - (1 - y_j^(i)) log(1 - y-hat_j^(i)). The main difference compared to the earlier binary classification examples is that you're now summing over j equals 1 through 4. And the main difference between this and softmax regression is that unlike softmax regression, which assigns a single label to each example, here one image can have multiple labels. So you're not saying that each image is either a picture of a pedestrian, or a picture of a car, or a picture of a stop sign, or a picture of a traffic light. You're asking, for each picture, does it have a pedestrian, or a car, or a stop sign, or a traffic light, and multiple objects could appear in the same image. In fact, in the example on the previous slide, we had both a car and a stop sign in that image, but no pedestrians and no traffic lights. So you're not assigning a single label to an image; you're going through the different classes and asking, for each of the classes, does that type of object appear in the image?
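To make that loss concrete, here is a minimal NumPy sketch (not code from the course): it assumes the sigmoid outputs y-hat are stacked into a 4 by m array with the same layout as Y, and the function name and example numbers are made up for illustration.

```python
import numpy as np

def multitask_logistic_loss(Y_hat, Y):
    """Averaged multi-task logistic loss.

    Y_hat and Y are (4, m) arrays: one row per task (pedestrian, car,
    stop sign, traffic light), one column per training example.
    Returns (1/m) * sum over examples i and tasks j of
    -y_j^(i) log(y_hat_j^(i)) - (1 - y_j^(i)) log(1 - y_hat_j^(i)).
    """
    m = Y.shape[1]
    eps = 1e-12  # avoid log(0)
    losses = -(Y * np.log(Y_hat + eps) + (1 - Y) * np.log(1 - Y_hat + eps))
    return losses.sum() / m

# Example labels from the slide: no pedestrian, a car, a stop sign,
# no traffic light, for a single training example (m = 1).
Y = np.array([[0.0], [1.0], [1.0], [0.0]])
Y_hat = np.array([[0.1], [0.8], [0.7], [0.2]])  # made-up predictions
print(multitask_logistic_loss(Y_hat, Y))
```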
So that's why I'm saying that in this setting, one image can have multiple labels. If you train a neural network to minimize this cost function, you are carrying out multi-task learning, because what you're doing is building a single neural network that looks at each image and basically solves four problems: it's trying to tell you whether each image has each of these four objects in it. One other thing you could have done is just train four separate neural networks, instead of training one network to do four things. But if some of the earlier features in the neural network can be shared between these different types of objects, then you find that training one neural network to do four things results in better performance than training four completely separate neural networks to do the four tasks separately. So that's the power of multi-task learning.

And one other detail: so far I've described this algorithm as if every image had every single label. It turns out that multi-task learning also works even if some of the images are labeled with only some of the objects. So for the first training example, let's say your labeler told you there's a pedestrian and there's no car, but they didn't bother to label whether or not there's a stop sign or a traffic light. And maybe for the second example, there is a pedestrian and there is a car, but again, when the labeler looked at that image, they just didn't label whether it had a stop sign or a traffic light, and so on. Maybe some examples are fully labeled, and maybe for some examples they were labeling only for the presence or absence of cars, so there are some question marks, and so on. With a data set like this, you can still train your learning algorithm to do four tasks at the same time, even when some images have only a subset of the labels and others have question marks or don't-cares. The way you train your algorithm, even when some of these labels are question marks or unlabeled, is that in the sum over j from 1 to 4, you sum only over the values of j with a 0 or 1 label. So whenever there's a question mark, you just omit that term from the summation and sum only over the values where there is a label. And that allows you to use data sets like this as well, as the small sketch below illustrates.
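Here is a minimal sketch of how that masking could be implemented, assuming (purely for illustration) that the question-mark labels are stored as NaN; the function name and the small example are hypothetical, not from the course.

```python
import numpy as np

def masked_multitask_loss(Y_hat, Y):
    """Multi-task logistic loss that skips unlabeled entries.

    Y uses 0/1 for labeled entries and np.nan for the 'question mark'
    entries the labeler left blank; those terms are simply omitted
    from the sum over j.
    """
    m = Y.shape[1]
    mask = ~np.isnan(Y)                  # True where a label exists
    Y_filled = np.where(mask, Y, 0.0)    # placeholder values, masked out below
    eps = 1e-12
    losses = -(Y_filled * np.log(Y_hat + eps)
               + (1 - Y_filled) * np.log(1 - Y_hat + eps))
    return (losses * mask).sum() / m

# First example labeled only for pedestrian (yes) and car (no);
# stop sign and traffic light are question marks.
Y = np.array([[1.0], [0.0], [np.nan], [np.nan]])
Y_hat = np.array([[0.9], [0.2], [0.5], [0.5]])  # made-up predictions
print(masked_multitask_loss(Y_hat, Y))
```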
So when does multi-task learning make sense? I'll say it usually makes sense when three things are true. First, you're training on a set of tasks that could benefit from having shared low-level features. For the autonomous driving example, it makes sense that recognizing traffic lights, cars, and pedestrians involves features that could also help you recognize stop signs, because these are all features of roads. Second, and this is less of a hard and fast rule, so it isn't always true, but what I see in a lot of successful multi-task learning settings is that the amount of data you have for each task is quite similar. Recall that in transfer learning, you learn from some task A and transfer it to some task B. If you have a million examples for task A and 1,000 examples for task B, then all the knowledge you learned from that million examples can really help augment the much smaller data set you have for task B. Well, how about multi-task learning? In multi-task learning you usually have a lot more tasks than just two. Previously we had four tasks, but let's say you have 100 tasks, and you're going to do multi-task learning to try to recognize 100 different types of objects at the same time. What you may find is that you have 1,000 examples per task. If you focus on the performance of just one task, say the 100th task, which we can call A100, then trying to do this final task in isolation, you would have had just 1,000 examples to train it. But by training on the 99 other tasks, which in aggregate have 99,000 training examples, you can get a big boost of knowledge to augment this otherwise relatively small 1,000-example training set for task A100. And symmetrically, every one of the other 99 tasks can provide some data or some knowledge that helps every other task in this list of 100. So the second bullet isn't a hard and fast rule, but what I tend to look at is this: if you focus on any one task, then for it to get a big boost from multi-task learning, the other tasks in aggregate need to have quite a lot more data than that one task. One way to satisfy that is if you have a lot of tasks, like in this example on the right, and the amount of data you have for each task is quite similar. But the key really is that if you already have 1,000 examples for one task, then for all of the other tasks you'd better have a lot more than 1,000 examples in total if those other tasks are meant to help you do better on this final task.

And finally, multi-task learning tends to make more sense when you can train a big enough neural network to do well on all the tasks. The alternative to multi-task learning would be to train a separate neural network for each task. So rather than training one neural network for pedestrian, car, stop sign, and traffic light detection, you could have trained one neural network for pedestrian detection, one for car detection, one for stop sign detection, and one for traffic light detection. What a researcher, Rich Caruana, found many years ago was that the only time multi-task learning hurts performance compared to training separate neural networks is if your neural network isn't big enough. But if you can train a big enough neural network, then multi-task learning should very rarely hurt performance, and hopefully it will actually help compared to training neural networks to do these different tasks in isolation.

So that's it for multi-task learning. In practice, multi-task learning is used much less often than transfer learning. I see a lot of applications of transfer learning, where you have a problem you want to solve with a small amount of data, so you find a related problem with a lot of data, learn something from it, and transfer that to the new problem. But it's rarer with multi-task learning that you have a huge set of tasks you want to do well on and can train on all at the same time. Maybe the one exception is computer vision: in object detection I see more applications of multi-task learning, where one neural network trying to detect a whole bunch of objects at the same time works better than different neural networks trained separately to detect those objects. But I would say that on average, transfer learning is used much more today than multi-task learning, although both are useful tools to have in your arsenal.
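As a rough illustration of "one big enough network doing all four tasks" rather than four separate networks, here is a small Keras sketch with shared lower layers feeding four sigmoid outputs, one per task; the input size, layer widths, and optimizer are placeholder choices, not the architecture from the lecture.

```python
import tensorflow as tf
from tensorflow.keras import layers

# All sizes here are made-up placeholders, just to show the structure.
inputs = layers.Input(shape=(64, 64, 3))            # small RGB image
x = layers.Conv2D(16, 3, activation="relu")(inputs)  # shared low-level features
x = layers.MaxPooling2D()(x)
x = layers.Flatten()(x)
x = layers.Dense(64, activation="relu")(x)
# Four independent sigmoid outputs: pedestrian, car, stop sign, traffic light.
outputs = layers.Dense(4, activation="sigmoid")(x)

model = tf.keras.Model(inputs, outputs)
# Per-label binary cross-entropy gives the same multi-label logistic loss
# discussed above, one term per task.
model.compile(optimizer="adam", loss="binary_crossentropy")
model.summary()
```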
So to summarize, multi-task learning enables you to train one neural network to do many tasks, and this can give you better performance than if you were to do the tasks in isolation. One note of caution: in practice, I see transfer learning used much more often than multi-task learning. I do see a lot of tasks where, if you want to solve a machine learning problem but have a relatively small data set, transfer learning can really help: you find a related problem with a much bigger data set, train a neural network on that, and then transfer it to the problem where you have very little data. So transfer learning is used a lot today. There are some applications of multi-task learning as well, but multi-task learning is used much less often than transfer learning. Maybe the one exception is computer vision object detection, where I do see a lot of applications of training a single neural network to detect lots of different objects, and that works better than training separate neural networks to detect those objects. But on average, even though transfer learning and multi-task learning are often presented in a similar way, in practice I've seen a lot more applications of transfer learning than of multi-task learning. I think that's because it's often difficult to set up, or to find, so many different tasks that you would actually want to train a single neural network for, with computer vision object detection examples being the most notable exception. So that's it for multi-task learning. Multi-task learning and transfer learning are both important tools to have in your tool bag. And finally, I'd like to move on to discuss end-to-end deep learning. So let's go on to the next video to discuss end-to-end learning.