1 00:00:00,000 --> 00:00:04,814 [MUSIC] 2 00:00:04,814 --> 00:00:07,832 Finally, let's talk about multiclass classification. 3 00:00:07,832 --> 00:00:11,108 And in particular, we're going to talk about perhaps the simplest, 4 00:00:11,108 --> 00:00:14,338 but very useful, approach for doing multiclass classification. 5 00:00:14,338 --> 00:00:16,658 It's called the 1 versus all approach. 6 00:00:16,658 --> 00:00:19,432 So let's say, here's an example of multiclass classification. 7 00:00:19,432 --> 00:00:23,832 I give you an image of some object, maybe it's an image of my dog, and 8 00:00:23,832 --> 00:00:28,959 I feed this into the classifier to try to predict what object is in that image. 9 00:00:28,959 --> 00:00:31,626 So the output y is the object in the image. 10 00:00:31,626 --> 00:00:36,341 Maybe a Labrador retriever, a golden retriever, a monitor, 11 00:00:36,341 --> 00:00:38,290 a camera and so on. 12 00:00:38,290 --> 00:00:40,490 So that's the prediction task that we're going to do. 13 00:00:40,490 --> 00:00:43,625 And it has more than two classes, not just +1 and -1, but 14 00:00:43,625 --> 00:00:45,990 maybe a thousand different categories. 15 00:00:45,990 --> 00:00:47,366 So, how do we solve a problem like this? 16 00:00:47,366 --> 00:00:49,676 There are many approaches to doing that. 17 00:00:49,676 --> 00:00:54,808 Let's talk about a very simple but super useful approach called 1 versus all, and 18 00:00:54,808 --> 00:00:59,288 then we're going to use the following example where we have three classes: 19 00:00:59,288 --> 00:01:01,326 triangles, hearts and donuts. 20 00:01:01,326 --> 00:01:04,236 But of course, you might have many more classes, and 21 00:01:04,236 --> 00:01:07,630 I'm using capital C to denote the total number of classes. 22 00:01:07,630 --> 00:01:12,076 Capital C in this case is three, but it could be 10,000 in a different case, and 23 00:01:12,076 --> 00:01:13,614 we still have n data points.
24 00:01:13,614 --> 00:01:18,606 And now for each data point, we have associated with it not just x1, x2 and 25 00:01:18,606 --> 00:01:23,526 so on, but also the category y, which is no longer just plus 1 or minus 1. 26 00:01:23,526 --> 00:01:27,452 In this case, it's triangles, hearts or donuts, and 27 00:01:27,452 --> 00:01:32,070 what I'd like to know is, for a particular input xi, 28 00:01:32,070 --> 00:01:38,140 what is the probability that this input corresponds to a triangle, 29 00:01:38,140 --> 00:01:40,030 to a heart or to a donut? 30 00:01:40,030 --> 00:01:44,697 Or in the case of the image, whether it corresponds to a golden retriever, 31 00:01:44,697 --> 00:01:46,849 a Labrador retriever or a camera. 32 00:01:46,849 --> 00:01:49,892 The 1 versus all model is extremely simple and 33 00:01:49,892 --> 00:01:53,411 is exactly what you'd expect from the words 1 versus all. 34 00:01:53,411 --> 00:01:57,336 Here's what we do: you train a classifier for each category. 35 00:01:57,336 --> 00:02:02,031 So for example, you train one that says the +1 category is going to be 36 00:02:02,031 --> 00:02:06,979 all the triangles and the negative category is going to be everything else. 37 00:02:06,979 --> 00:02:11,779 In our example, hearts and donuts, and what you're trying to do is learn 38 00:02:11,779 --> 00:02:16,959 a classifier that separates most of the triangles from the hearts and the donuts. 39 00:02:16,959 --> 00:02:20,233 So in particular, we're going to train a 40 00:02:20,233 --> 00:02:25,241 classifier, denoted by P hat sub-triangle, which outputs +1 41 00:02:25,241 --> 00:02:30,848 if the input x is more likely to be a triangle than everything else, 42 00:02:30,848 --> 00:02:35,040 a donut or a heart. 43 00:02:35,040 --> 00:02:40,348 And then the way that we estimate the probability 44 00:02:40,348 --> 00:02:45,267 that an input xi is a triangle is just by saying 45 00:02:45,267 --> 00:02:49,425 it's P hat sub-triangle of y = +1.
46 00:02:49,425 --> 00:02:52,522 So in our picture on the right, here's what our classifier's going to do. 47 00:02:52,522 --> 00:03:00,290 It's going to assign a score of xi to be greater than 0 on the triangle side. 48 00:03:00,290 --> 00:03:05,070 A score of xi to be less than 0 on the donuts and 49 00:03:05,070 --> 00:03:07,921 hearts side, hopefully. 50 00:03:07,921 --> 00:03:14,347 Which is going to mean that on the triangle side, the probability 51 00:03:14,347 --> 00:03:19,387 that y is equal to triangle given the input xi and 52 00:03:19,387 --> 00:03:24,931 the parameters w is going to be greater than 0.5, and 53 00:03:24,931 --> 00:03:29,593 on the other side, the probability that y is 54 00:03:29,593 --> 00:03:33,751 equal to triangle given the input xi and 55 00:03:33,751 --> 00:03:40,315 the parameters w is going to be, hopefully, less than 0.5. 56 00:03:40,315 --> 00:03:43,139 So this lets me take a data point and say, 57 00:03:43,139 --> 00:03:47,280 is this more likely to be a triangle than a donut or a heart? 58 00:03:47,280 --> 00:03:50,492 That doesn't tell me how to do multiclass classification in general. 59 00:03:50,492 --> 00:03:53,696 How do we do multiclass classification in general? 60 00:03:53,696 --> 00:03:57,512 And so what we'll do is we'll learn a model for each one of these cases. 61 00:03:57,512 --> 00:04:00,856 So it's going to be 1 versus all for each one of the classes. 62 00:04:00,856 --> 00:04:06,193 So we will learn a 1 versus all model that tries to compare triangles 63 00:04:06,193 --> 00:04:11,119 against donuts and hearts, which is going to be defined by P hat 64 00:04:11,119 --> 00:04:15,758 sub-triangle of y equals plus 1, given xi and w. 65 00:04:15,758 --> 00:04:20,258 We're going to learn a model for hearts that tries to separate hearts from 66 00:04:20,258 --> 00:04:23,483 triangles and donuts, which is going to be P hat, and 67 00:04:23,483 --> 00:04:27,312 since we learn it from a different dataset, it's going to be sub-hearts.
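The score-to-probability relationship described here (score above 0 on the triangle side meaning probability above 0.5) can be sketched in code. This is a minimal sketch assuming a logistic (sigmoid) link between the score w·xi and the probability; the function names and example weights are my own, not from the lecture:

```python
import math

def sigmoid(score):
    # Logistic link: maps a real-valued score to a probability in (0, 1).
    # score = 0 maps exactly to 0.5, so score > 0 <=> probability > 0.5.
    return 1.0 / (1.0 + math.exp(-score))

def triangle_probability(w, xi):
    # Score is the dot product w . xi; the probability that xi is a
    # triangle (versus everything else) is the sigmoid of that score.
    score = sum(wj * xj for wj, xj in zip(w, xi))
    return sigmoid(score)
```

A point on the triangle side of the line (positive score) gets probability above 0.5, and a point on the hearts-and-donuts side (negative score) gets probability below 0.5.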
68 00:04:27,312 --> 00:04:30,157 That's a gnarly heart. 69 00:04:30,157 --> 00:04:33,616 I want pretty hearts here. 70 00:04:33,616 --> 00:04:36,135 That's slightly prettier, only slightly. 71 00:04:36,135 --> 00:04:44,684 [LAUGH] The probability that it's plus one given the input xi and the parameters w. 72 00:04:44,684 --> 00:04:50,475 And lastly, we have, for donuts, 73 00:04:50,475 --> 00:04:57,067 the probability according to the donut 74 00:04:57,067 --> 00:05:02,470 model of y = +1 given xi and w. 75 00:05:02,470 --> 00:05:04,738 And one last little note: this last model 76 00:05:04,738 --> 00:05:08,896 is here separating donuts from everything else. 77 00:05:08,896 --> 00:05:10,413 The w's for each one of these models are different, 78 00:05:10,413 --> 00:05:12,252 as you can see by the lines being different. 79 00:05:12,252 --> 00:05:15,700 So the first one is w of triangles, 80 00:05:15,700 --> 00:05:20,290 the second one is w of hearts, and then the last one is w of donuts. 81 00:05:22,010 --> 00:05:25,776 So, we train these 1 versus all models, and what do we output? 82 00:05:25,776 --> 00:05:32,376 As a prediction, we just say, whatever class has the highest probability wins. 83 00:05:32,376 --> 00:05:36,733 So in other words, if the probability that an input is a heart against 84 00:05:36,733 --> 00:05:40,263 everything else is higher than the probability that the point is a triangle, and 85 00:05:40,263 --> 00:05:44,410 higher than the probability that the point is a donut, you say that the class is heart. 86 00:05:44,410 --> 00:05:47,666 More explicitly, in multiclass classification, 87 00:05:47,666 --> 00:05:50,995 we're going to train a model that asks, for each class, 88 00:05:50,995 --> 00:05:55,196 what's the probability that it wins against everything else? 89 00:05:55,196 --> 00:06:01,276 The probability of y equals +1, given x, and you estimate one of those for each class.
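One way to set up these C training problems is to relabel the same dataset once per class: +1 for the points in that class, -1 for everything else, and then fit an ordinary binary classifier (with its own w) to each relabeled copy. A minimal sketch of the relabeling step; the function and variable names are my own:

```python
def one_vs_all_labels(ys, target_class):
    # Relabel a multiclass dataset for one binary 1-versus-all problem:
    # +1 for the target class, -1 for every other class.
    return [+1 if y == target_class else -1 for y in ys]

# Toy multiclass labels for four data points.
ys = ["triangle", "heart", "donut", "triangle"]
classes = ["triangle", "heart", "donut"]

# One binary label vector per class; each copy would be used to fit
# that class's own parameters w (w of triangles, w of hearts, w of donuts).
binary_problems = {c: one_vs_all_labels(ys, c) for c in classes}
```

Each entry of `binary_problems` pairs with the same inputs x1, ..., xn, which is why the three learned lines (and their w's) come out different even though the data points are shared.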
90 00:06:01,276 --> 00:06:04,240 And then when you get a particular input xi, for example, 91 00:06:04,240 --> 00:06:05,267 this image of my dog, 92 00:06:05,267 --> 00:06:09,454 what we're going to do is compute the probability that every class estimates 93 00:06:09,454 --> 00:06:14,287 and output the maximum, and I've just written out here the kind of natural algorithm. 94 00:06:14,287 --> 00:06:18,258 So, we start with the maximum probability being zero and 95 00:06:18,258 --> 00:06:20,664 y hat being the null category, zero. 96 00:06:20,664 --> 00:06:24,770 And we go class by class and ask, is the probability that y = +1, 97 00:06:24,770 --> 00:06:29,478 according to the model for this class, the model for Labrador retriever, 98 00:06:29,478 --> 00:06:33,016 the model for golden retriever, the model for camera, 99 00:06:33,016 --> 00:06:36,116 is that higher than the max probability seen so far? 100 00:06:36,116 --> 00:06:39,590 If it's higher, that means that class looks like it's winning, so 101 00:06:39,590 --> 00:06:42,133 we say that y hat is whatever this class says, and 102 00:06:42,133 --> 00:06:46,436 we update the maximum probability to be the probability according to this class. 103 00:06:46,436 --> 00:06:49,512 And as we iterate over each one of these, the maximum is going to win. 104 00:06:49,512 --> 00:06:51,196 So, this is just kind of an algorithm 105 00:06:51,196 --> 00:06:53,046 that does exactly what you would expect. 106 00:06:53,046 --> 00:06:57,509 We check what each model believes, whether it believes it's a dog, 107 00:06:57,509 --> 00:07:01,201 a Labrador retriever, a golden retriever, a camera, 108 00:07:01,201 --> 00:07:03,494 what the probabilities are, and 109 00:07:03,494 --> 00:07:08,266 we just output the object that has the highest probability.
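The loop just described can be written out directly. A minimal sketch, where each trained 1 versus all model is represented as a function returning its estimate of P(y = +1 | xi, w); the names and the toy probabilities below are my own, not the lecture's code:

```python
def predict_one_vs_all(models, xi):
    # models: dict mapping a class name to a function xi -> P(y = +1 | xi, w)
    # for that class's 1-versus-all model.
    # Start with the maximum probability at zero and y hat at no category.
    max_prob, y_hat = 0.0, None
    for c, p_hat in models.items():
        prob = p_hat(xi)
        if prob > max_prob:
            # This class looks like it's winning so far: record it and
            # update the maximum probability seen so far.
            max_prob, y_hat = prob, c
    return y_hat

# Toy models with fixed probabilities, just to show the argmax behavior.
models = {"triangle": lambda x: 0.2, "heart": lambda x: 0.7, "donut": lambda x: 0.4}
```

Here `predict_one_vs_all(models, xi)` returns whichever class's model assigns the input the highest probability of being +1, which is exactly the "highest probability wins" rule.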
110 00:07:08,266 --> 00:07:12,984 And with that simple, simple algorithm, we now have a multiclass classification 111 00:07:12,984 --> 00:07:16,012 system by using a number of these binary classifiers. 112 00:07:16,012 --> 00:07:21,179 [MUSIC]