1 00:00:00,000 --> 00:00:04,814 [MUSIC] 2 00:00:04,814 --> 00:00:07,832 Finally, let's talk about multiclass classification. 3 00:00:07,832 --> 00:00:11,108 And in particular, we're going to talk about perhaps the simplest, 4 00:00:11,108 --> 00:00:14,338 but very useful, approach for doing multiclass classification. 5 00:00:14,338 --> 00:00:16,658 It's called the 1 versus all approach. 6 00:00:16,658 --> 00:00:19,432 So let's say, here's an example of multiclass classification. 7 00:00:19,432 --> 00:00:23,832 I give you an image of some object, maybe it's an image of my dog, and 8 00:00:23,832 --> 00:00:28,959 I feed this into the classifier to try to predict what object is in that image. 9 00:00:28,959 --> 00:00:31,626 So the output y is the object in the image. 10 00:00:31,626 --> 00:00:36,341 Maybe a Labrador retriever, a golden retriever, a monitor, 11 00:00:36,341 --> 00:00:38,290 a camera and so on. 12 00:00:38,290 --> 00:00:40,490 So that's the prediction task that we're going to do. 13 00:00:40,490 --> 00:00:43,625 And it has more than two classes, not just +1 and -1, but 14 00:00:43,625 --> 00:00:45,990 maybe a thousand different categories. 15 00:00:45,990 --> 00:00:47,366 So, how do we solve a problem like this? 16 00:00:47,366 --> 00:00:49,676 There are many approaches to doing that. 17 00:00:49,676 --> 00:00:54,808 Let's talk about a very simple but super useful approach called 1 versus all, and 18 00:00:54,808 --> 00:00:59,288 then we're going to use the following example where we have three classes: 19 00:00:59,288 --> 00:01:01,326 triangles, hearts and donuts. 20 00:01:01,326 --> 00:01:04,236 But of course, you might have many more classes, and 21 00:01:04,236 --> 00:01:07,630 I'm using capital C to denote the total number of classes. 22 00:01:07,630 --> 00:01:12,076 Capital C in this case is three, but it could be 10,000 in a different case, and 23 00:01:12,076 --> 00:01:13,614 we still have n data points.
24 00:01:13,614 --> 00:01:18,606 And now for each data point, we have associated with it not just x1, x2 and 25 00:01:18,606 --> 00:01:23,526 so on, but also the category y, which is no longer just plus 1 or minus 1. 26 00:01:23,526 --> 00:01:27,452 In this case, it's triangles, hearts or donuts, and 27 00:01:27,452 --> 00:01:32,070 what I'd like to know is, for a particular input xi, 28 00:01:32,070 --> 00:01:38,140 what is the probability that this input corresponds to a triangle, 29 00:01:38,140 --> 00:01:40,030 to a heart or to a donut? 30 00:01:40,030 --> 00:01:44,697 Or in the case of the image, whether it corresponds to a golden retriever, 31 00:01:44,697 --> 00:01:46,849 a Labrador retriever or a camera. 32 00:01:46,849 --> 00:01:49,892 The 1 versus all model is extremely simple and 33 00:01:49,892 --> 00:01:53,411 is exactly what you'd expect from the words 1 versus all. 34 00:01:53,411 --> 00:01:57,336 Here's what we do: you train a classifier for each category. 35 00:01:57,336 --> 00:02:02,031 So for example, you train one that says the +1 category is going to be 36 00:02:02,031 --> 00:02:06,979 all the triangles and the negative category is going to be everything else. 37 00:02:06,979 --> 00:02:11,779 In our example, hearts and donuts, and what you're trying to do is learn 38 00:02:11,779 --> 00:02:16,959 a classifier that separates most of the triangles from the hearts and the donuts. 39 00:02:16,959 --> 00:02:20,233 So in particular, we're going to train a 40 00:02:20,233 --> 00:02:25,241 classifier, denoted by P hat sub-triangle, which outputs +1 41 00:02:25,241 --> 00:02:30,848 if the input x is more likely to be a triangle than everything else, 42 00:02:30,848 --> 00:02:35,040 a donut or a heart. 43 00:02:35,040 --> 00:02:40,348 And then the way that we estimate the probability 44 00:02:40,348 --> 00:02:45,267 that an input xi is a triangle is just by saying 45 00:02:45,267 --> 00:02:49,425 it's P hat sub-triangle of y = +1.
46 00:02:49,425 --> 00:02:52,522 So in our picture on the right, here's what our classifier's going to do. 47 00:02:52,522 --> 00:03:00,290 It's going to assign a score of xi to be greater than 0 on the triangle side. 48 00:03:00,290 --> 00:03:05,070 A score of xi to be less than 0 on the donuts and 49 00:03:05,070 --> 00:03:07,921 hearts side, hopefully. 50 00:03:07,921 --> 00:03:14,347 Which is going to mean that on the triangle side, the probability 51 00:03:14,347 --> 00:03:19,387 that y is equal to triangle given the input xi and 52 00:03:19,387 --> 00:03:24,931 the parameters w is going to be greater than 0.5, and 53 00:03:24,931 --> 00:03:29,593 on the other side, the probability that y is 54 00:03:29,593 --> 00:03:33,751 equal to triangle given the input xi and 55 00:03:33,751 --> 00:03:40,315 the parameters w is going to be, hopefully, less than 0.5. 56 00:03:40,315 --> 00:03:43,139 So this lets me take a data point and say, 57 00:03:43,139 --> 00:03:47,280 is this more likely to be a triangle than a donut or a heart? 58 00:03:47,280 --> 00:03:50,492 That doesn't tell me how to do multiclass classification in general. 59 00:03:50,492 --> 00:03:53,696 How do we do multiclass classification in general? 60 00:03:53,696 --> 00:03:57,512 And so what we'll do is we'll learn a model for each one of these cases. 61 00:03:57,512 --> 00:04:00,856 So it's going to be 1 versus all for each one of the classes. 62 00:04:00,856 --> 00:04:06,193 So we will learn a 1 versus all model that tries to compare triangles 63 00:04:06,193 --> 00:04:11,119 against donuts and hearts, which is going to be defined by P hat 64 00:04:11,119 --> 00:04:15,758 sub-triangle of y equals plus 1, given xi and w. 65 00:04:15,758 --> 00:04:20,258 We're going to learn a model for hearts that tries to separate hearts from 66 00:04:20,258 --> 00:04:23,483 triangles and donuts, which is going to be P hat, and 67 00:04:23,483 --> 00:04:27,312 since we learn it from a different dataset, it's going to be sub-hearts.
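The score-to-probability relationship described here (score above 0 on the triangle side meaning probability above 0.5) can be sketched in code. This is a minimal sketch assuming a logistic (sigmoid) link between the score w·xi and the probability; the function names and example weights are my own, not from the lecture:

```python
import math

def sigmoid(score):
    # Logistic link: maps a real-valued score to a probability in (0, 1).
    # score = 0 maps exactly to 0.5, so score > 0 <=> probability > 0.5.
    return 1.0 / (1.0 + math.exp(-score))

def triangle_probability(w, xi):
    # Score is the dot product w . xi; the probability that xi is a
    # triangle (versus everything else) is the sigmoid of that score.
    score = sum(wj * xj for wj, xj in zip(w, xi))
    return sigmoid(score)
```

A point on the triangle side of the line (positive score) gets probability above 0.5, and a point on the hearts-and-donuts side (negative score) gets probability below 0.5.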
68 00:04:27,312 --> 00:04:30,157 That's a gnarly heart. 69 00:04:30,157 --> 00:04:33,616 I want pretty hearts here. 70 00:04:33,616 --> 00:04:36,135 That's slightly prettier, only slightly. 71 00:04:36,135 --> 00:04:44,684 [LAUGH] The probability that it's plus one given the input xi and the parameters w. 72 00:04:44,684 --> 00:04:50,475 And lastly, we have, for donuts, 73 00:04:50,475 --> 00:04:57,067 the probability according to the donut 74 00:04:57,067 --> 00:05:02,470 model of y = +1 given xi and w. 75 00:05:02,470 --> 00:05:04,738 And one last little note: this last model 76 00:05:04,738 --> 00:05:08,896 is here separating donuts from everything else. 77 00:05:08,896 --> 00:05:10,413 The w's for each one of these models are different, 78 00:05:10,413 --> 00:05:12,252 as you can see by the lines being different. 79 00:05:12,252 --> 00:05:15,700 So the first one is w of triangles, 80 00:05:15,700 --> 00:05:20,290 the second one is w of hearts, and then the last one is w of donuts. 81 00:05:22,010 --> 00:05:25,776 So, we train these 1 versus all models, and what do we output? 82 00:05:25,776 --> 00:05:32,376 As a prediction, we just say, whatever class has the highest probability wins. 83 00:05:32,376 --> 00:05:36,733 So in other words, if the probability that an input is a heart against 84 00:05:36,733 --> 00:05:40,263 everything else is higher than the probability that the point is a triangle, and 85 00:05:40,263 --> 00:05:44,410 higher than the probability that the point is a donut, you say that the class is heart. 86 00:05:44,410 --> 00:05:47,666 More explicitly, in multiclass classification, 87 00:05:47,666 --> 00:05:50,995 we're going to train a model that asks, for each class, 88 00:05:50,995 --> 00:05:55,196 what's the probability that it wins against everything else? 89 00:05:55,196 --> 00:06:01,276 The probability of y equals +1, given x, and you estimate one of those for each class.
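One way to set up these C training problems is to relabel the same dataset once per class: +1 for the points in that class, -1 for everything else, and then fit an ordinary binary classifier (with its own w) to each relabeled copy. A minimal sketch of the relabeling step; the function and variable names are my own:

```python
def one_vs_all_labels(ys, target_class):
    # Relabel a multiclass dataset for one binary 1-versus-all problem:
    # +1 for the target class, -1 for every other class.
    return [+1 if y == target_class else -1 for y in ys]

# Toy multiclass labels for four data points.
ys = ["triangle", "heart", "donut", "triangle"]
classes = ["triangle", "heart", "donut"]

# One binary label vector per class; each copy would be used to fit
# that class's own parameters w (w of triangles, w of hearts, w of donuts).
binary_problems = {c: one_vs_all_labels(ys, c) for c in classes}
```

Each entry of `binary_problems` pairs with the same inputs x1, ..., xn, which is why the three learned lines (and their w's) come out different even though the data points are shared.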
90 00:06:01,276 --> 00:06:04,240 And then when you get a particular input xi, for example, 91 00:06:04,240 --> 00:06:05,267 this image of my dog, 92 00:06:05,267 --> 00:06:09,454 what we're going to do is compute the probability that every class estimates 93 00:06:09,454 --> 00:06:14,287 and output the maximum, and I've just written out here the kind of natural algorithm. 94 00:06:14,287 --> 00:06:18,258 So, we start with the maximum probability being zero and 95 00:06:18,258 --> 00:06:20,664 y hat being the null category, zero. 96 00:06:20,664 --> 00:06:24,770 And we go class by class and ask, is the probability that y = +1, 97 00:06:24,770 --> 00:06:29,478 according to the model for this class, the model for Labrador retriever, 98 00:06:29,478 --> 00:06:33,016 the model for golden retriever, the model for camera, 99 00:06:33,016 --> 00:06:36,116 is that higher than the max probability seen so far? 100 00:06:36,116 --> 00:06:39,590 If it's higher, that means that class looks like it's winning, so 101 00:06:39,590 --> 00:06:42,133 we say that y hat is whatever this class says, and 102 00:06:42,133 --> 00:06:46,436 we update the maximum probability to be the probability according to this class. 103 00:06:46,436 --> 00:06:49,512 And as we iterate over each one of these, the maximum is going to win. 104 00:06:49,512 --> 00:06:51,196 So, this is just kind of an algorithm 105 00:06:51,196 --> 00:06:53,046 that does exactly what you would expect. 106 00:06:53,046 --> 00:06:57,509 We check what each model believes, whether it believes it's a dog, 107 00:06:57,509 --> 00:07:01,201 a Labrador retriever, a golden retriever, a camera, 108 00:07:01,201 --> 00:07:03,494 what the probabilities are, and 109 00:07:03,494 --> 00:07:08,266 we just output the object that has the highest probability.
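The loop just described can be written out directly. A minimal sketch, where each trained 1 versus all model is represented as a function returning its estimate of P(y = +1 | xi, w); the names and the toy probabilities below are my own, not the lecture's code:

```python
def predict_one_vs_all(models, xi):
    # models: dict mapping a class name to a function xi -> P(y = +1 | xi, w)
    # for that class's 1-versus-all model.
    # Start with the maximum probability at zero and y hat at no category.
    max_prob, y_hat = 0.0, None
    for c, p_hat in models.items():
        prob = p_hat(xi)
        if prob > max_prob:
            # This class looks like it's winning so far: record it and
            # update the maximum probability seen so far.
            max_prob, y_hat = prob, c
    return y_hat

# Toy models with fixed probabilities, just to show the argmax behavior.
models = {"triangle": lambda x: 0.2, "heart": lambda x: 0.7, "donut": lambda x: 0.4}
```

Here `predict_one_vs_all(models, xi)` returns whichever class's model assigns the input the highest probability of being +1, which is exactly the "highest probability wins" rule.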
110 00:07:08,266 --> 00:07:12,984 And with that simple, simple algorithm, we now have a multiclass classification 111 00:07:12,984 --> 00:07:16,012 system by using a number of these binary classifiers. 112 00:07:16,012 --> 00:07:21,179 [MUSIC]