In this video, we will discuss how to adapt linear methods to classification problems. Let's start with the simplest classification problem, binary classification. Here we have only two values for the target: minus one and one, the negative class and the positive class. Essentially, a linear model calculates the dot product between w, the weight vector, and x, the feature vector. This dot product is a real value, and we should somehow transform it into minus one or one. To do that, we can just take the sign of the dot product. So a linear classifier looks like the sign of w transposed x. It has d parameters. And if you remember, we agreed that there is also a constant feature that has the value one on every example, so we don't have to explicitly include a bias in our model: the coefficient of this constant feature will be the bias. So actually there are d + 1 parameters in this model.

Geometrically, it looks like this. Suppose that we have two features, so the axes on this graph correspond to our features, and we denote the negative class by red points and the positive class by blue points. The linear model tries to find some line that separates the blue points from the red points. As we know from geometry, the sign of the dot product indicates on which side of the line a point lies. So if the dot product is positive, the point lies on the positive side of this line, and if the dot product is negative, the point lies on the negative side of the line.

Okay. Let's switch to the multi-class classification problem with K classes, 1 to K. In this case, we should use some more complicated techniques to build our classifier. One of the most popular approaches is to build a separate classifier for each class. So, for example, for the first class we'll have a linear model, a linear classifier, that separates points of the first class from all other points. Essentially, we try to fit the model so that points of the first class lie on the positive side of this hyperplane, and points from all other classes lie on the negative side of this hyperplane. The dot product of this model is essentially a score.
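Here is a minimal sketch of the binary linear classifier described above, assuming NumPy; the feature values and weights are made up purely for illustration.

```python
import numpy as np

def add_bias_feature(X):
    """Append a constant feature equal to 1 to every example,
    so the bias becomes just one more coefficient in w."""
    ones = np.ones((X.shape[0], 1))
    return np.hstack([X, ones])

def predict_binary(w, X):
    """Return the sign of the dot product w^T x for every example:
    +1 for the positive class, -1 for the negative class."""
    scores = X @ w                       # real-valued scores
    return np.where(scores >= 0, 1, -1)  # sign, mapping 0 to +1 by convention

# Toy usage: two original features, so w has d + 1 = 3 components.
X = add_bias_feature(np.array([[1.0, 2.0], [-1.5, 0.5]]))
w = np.array([0.4, -0.2, 0.1])           # the last component plays the role of the bias
print(predict_binary(w, X))              # [ 1 -1]
```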
The higher the score, the more confident the model is that this point belongs to the first class. We build such a model for every class, so we have K linear models, and each model calculates a score. Then we assign a new example to the class with the largest score, the class with the highest confidence. For example, if we have three classes and our score vector is (7, -7.5, 10), then we assign the example to the third class, because the third component of the score vector is the largest.

Okay. Now we have the model, and we should somehow learn it, so we need a loss function. Let's start with the simplest loss function, the accuracy loss. To define it, we'll need some notation: Iverson brackets. These are just square brackets with some logical statement written inside. If the statement is true, the value of the brackets is one, and if the statement is false, the value of the brackets is zero.

Now let's define the accuracy metric. We take an example x_i, find the prediction a(x_i), compare it to the true class y_i, and write the equality of the predicted class and the true class in Iverson brackets. The value of the bracket will be one if we guessed the class correctly, and zero if we misclassified this point. Then we just average these brackets over all our data points, over the whole training set. So, essentially, accuracy is just the ratio of correctly classified points in our training set.

This metric is good and can be easily interpreted, but it has two large disadvantages. First, as we'll learn in the next videos, we need a gradient to optimize our loss function effectively, and accuracy doesn't have gradients with respect to the model parameters. So we cannot optimize it; we cannot train the model with the accuracy score. Second, this metric doesn't take into account how confident the model is in its prediction.
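A small sketch of the one-versus-all decision rule and the accuracy metric just described, assuming NumPy; here W is a hypothetical (K x d) matrix whose k-th row is the weight vector of the classifier for class k, and all values are made up.

```python
import numpy as np

def predict_multiclass(W, X):
    """Compute K scores per example and pick the class with the largest score."""
    scores = X @ W.T                      # shape (n_examples, K)
    return np.argmax(scores, axis=1) + 1  # classes numbered 1..K

def accuracy(y_pred, y_true):
    """Average of Iverson brackets [a(x_i) == y_i] over the data set."""
    return np.mean(y_pred == y_true)

# The score vector (7, -7.5, 10) from the example: the third class wins.
print(np.argmax(np.array([7.0, -7.5, 10.0])) + 1)      # 3

# Toy usage with K = 3 classes and two features.
X = np.array([[1.0, 0.0], [0.0, 2.0]])
W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
y_pred = predict_multiclass(W, X)                       # [1 2]
print(y_pred, accuracy(y_pred, np.array([1, 2])))       # [1 2] 1.0
```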
Indeed, we have the dot product of the weight vector and the feature vector, w and x, and the larger this score, the more confident the model is in its prediction. If the dot product has a positive sign and a large value, the model is confident. But if the sign is positive and the value is close to zero, the model is not confident. We want not only a model that makes correct decisions, that guesses the classes correctly, but a confident model, and it's known from machine learning that models with high confidence generalize better.

Okay, accuracy doesn't fit, so we need some other loss function. Maybe we can use mean squared error. Suppose that we have some example x_i that belongs to the positive class, class one, and consider the squared loss on this example. We take the dot product between w and x, compare it to one, and take the square of the difference. If our model predicts one, the guess is correct and the loss is zero. If our model gives a prediction between zero and one, it is not confident in its decision, and we penalize it for low confidence. If the model gives a value lower than zero, it misclassifies this point, so we give it an even larger penalty. That's okay, but if the model predicts a value larger than one, we also penalize it, this time for high confidence, and that's not very good. We should give a small or zero loss to high-confidence decisions.

Okay. So we can just take one branch of our squared loss, penalize for low confidence and for misclassification, and give zero loss for high confidence. Actually, there are many loss functions that look like this one, and each of them leads to its own classification method. We'll discuss one of the most important methods for us, logistic regression. To talk about it, we should first find a way to convert the scores from our linear classifiers into probabilities, into a distribution. So, we have some vector of scores z, whose components are the dot products of each class's weight vector with x; these are the scores for each of our classes.
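A small numerical illustration of the point about the two losses, assuming the target is y = +1. The video does not name a specific "one-branch" loss; the hinge loss is used here only as one standard example of a loss with that shape.

```python
import numpy as np

def squared_loss(score, y=1.0):
    # Squared loss keeps growing when the score exceeds the target,
    # i.e. it penalizes confident correct predictions.
    return (score - y) ** 2

def hinge_loss(score, y=1.0):
    # Zero loss for confident correct predictions, growing penalty
    # for low confidence and for misclassification.
    return np.maximum(0.0, 1.0 - y * score)

for s in [-1.0, 0.2, 1.0, 3.0]:
    print(f"score={s:+.1f}  squared={squared_loss(s):.2f}  hinge={hinge_loss(s):.2f}")
# score=+3.0 gets squared loss 4.00 but hinge loss 0.00
```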
Dot products can have any sign and any magnitude, so we cannot interpret them as a probability distribution, and we should somehow change that. We'll do it in two steps. In the first step, we take the first component of our vector and take e to the power of this component. We do the same with the second component, and so on up to the last component. After this step, we have a vector, e to the power of z, that has only positive coordinates. Now we only need to normalize these components to get a distribution. To do that, we sum all the components of this e-to-the-z vector and divide each component by the sum. After that, we get a vector sigma of z that is normalized and has only non-negative components, so we can interpret it as a probability distribution. This transform is called the softmax function, the softmax transform.

Consider an example with three classes and the scores (7, -7.5, 10). If we apply the softmax transform to this vector, we get the vector sigma of z with components 0.05, zero, and 0.95. So the third component was the largest before the transform, and it has the largest probability after the softmax transform.

Okay. Now we have a way to transform our scores into probabilities; these are the predicted probabilities of the classes. Next we need some target vector, the vector that we want our probabilities to be equal to. Of course, we want the probability of the true class to be one, and the probabilities of all other classes to be zero. So we form a target vector b, which is just a binary vector of size K, where K is the number of classes; it has a one in the component that corresponds to the true class of the current example and zeros in all other coordinates.

Now we have the target vector b and the vector of predicted class probabilities sigma of z, and we should somehow measure the distance between these probability distributions. To do that, we can use cross-entropy. Essentially, cross-entropy is just minus the log of the predicted probability of the true class.
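A sketch of the softmax transform and the cross-entropy loss just described, reproducing the (7, -7.5, 10) example. The max-subtraction inside softmax is a standard numerical-stability trick not mentioned in the video; it does not change the result.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))   # exponentiate each score (shifted for stability)
    return e / e.sum()          # normalize to get a probability distribution

def cross_entropy(p, b):
    """Minus sum over classes of b_k * log(p_k), where b is the one-hot target."""
    eps = 1e-12                 # avoid log(0) in code; mathematically the loss is +inf there
    return -np.sum(b * np.log(p + eps))

z = np.array([7.0, -7.5, 10.0])
p = softmax(z)
print(np.round(p, 2))                   # roughly [0.05 0.   0.95]

b = np.array([0.0, 0.0, 1.0])           # one-hot target: the true class is the third one
print(round(cross_entropy(p, b), 3))    # -log(0.95), about 0.049
```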
We can also write it as minus the sum over classes k of the indicator that our class y equals k, multiplied by the log of the predicted probability of class k.

Let's look at some examples of cross-entropy. Suppose that we have three classes and our example belongs to the first class, so y equals one. Suppose that we have some model that assigns a probability of one to the first class and zero probability to the second and third classes. This model makes a correct guess, and the cross-entropy is zero, because it's a perfect model for us. If we have a model that predicts 0.5 for the first class and 0.25 for each of the two other classes, then the cross-entropy is approximately 0.7, so there is some loss here. But if we have a model that assigns a probability of one to the second class and zero probability to the first class, then the cross-entropy equals plus infinity, because we multiply one by the logarithm of zero. So cross-entropy gives a very high penalty to models that are confident in wrong decisions.

Okay. Now we can just sum the cross-entropies over all examples in our training set, and that will be our loss function. It's quite complicated, and we cannot find an analytical solution for this problem, so we need some numerical method to optimize it; we'll discuss such methods in the following videos.

So in this video, we discussed how to apply linear models to classification problems, both binary classification and multi-class classification, and we discussed how a loss for classification problems should look. One of the most important methods for learning linear classifiers is logistic regression, and we discussed what the loss looks like in this case. In the next video, we'll talk about gradient descent, a numerical method that optimizes any differentiable loss function.
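To close, here is a minimal sketch of the summed cross-entropy loss described above for a multi-class linear model. W is a hypothetical (K x d) weight matrix and the data are made up; there is no closed-form minimizer, so in practice a numerical method such as the gradient descent mentioned for the next video would be used to optimize it.

```python
import numpy as np

def softmax_rows(Z):
    """Row-wise softmax: turn each row of scores into a probability distribution."""
    E = np.exp(Z - Z.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

def total_cross_entropy(W, X, y):
    """Sum of -log p(true class) over all examples; y holds labels 0..K-1."""
    P = softmax_rows(X @ W.T)                   # predicted class probabilities
    return -np.sum(np.log(P[np.arange(len(y)), y] + 1e-12))

# Toy data: 3 classes, 2 features, 4 examples.
X = np.array([[1.0, 2.0], [0.5, -1.0], [-2.0, 0.3], [1.5, 1.5]])
y = np.array([0, 1, 2, 0])
W = np.zeros((3, 2))                            # with zero weights every class gets p = 1/3
print(round(total_cross_entropy(W, X, y), 3))   # 4 * (-log(1/3)), about 4.394
```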