In this video, we will discuss how to adapt linear methods to classification problems. Let's start with the simplest classification problem, binary classification. Here we have only two values for the target: minus one and one, the negative class and the positive class. Essentially, a linear model calculates the dot product between w, the weight vector, and x, the feature vector. This dot product is a real value, and we should somehow transform it into minus one or one. To do that, we can just take the sign of the dot product. So a linear classifier looks like the sign of w transposed x. It has d parameters. And if you remember, we agreed that there is also a constant feature that has the value one on every example, so we don't have to explicitly include a bias in our model: the coefficient of this constant feature will be the bias. So actually there are d + 1 parameters in this model.

Geometrically, it looks like this. Suppose that we have two features, so the axes on this graph correspond to our features, and we denote the negative class by red points and the positive class by blue points. The linear model tries to find some line that separates the blue points from the red points. As we know from geometry, the sign of the dot product indicates on which side of the line a point lies. So if the dot product is positive, the point lies on the positive side of this line, and if the dot product is negative, the point lies on the negative side of the line.

Okay. Let's switch to the multi-class classification problem with K classes, 1 to K. In this case, we should use some more complicated techniques to build our classifier. One of the most popular approaches is to build a separate classifier for each class. So, for example, for the first class we'll have a linear model, a linear classifier, that separates points of the first class from all other points. Essentially, we try to fit the model so that points of the first class lie on the positive side of this hyperplane, and points from all other classes lie on the negative side of this hyperplane. The dot product of this model is essentially a score.
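Here is a minimal sketch of the binary linear classifier described above, assuming NumPy; the feature values and weights are made up purely for illustration.

```python
import numpy as np

def add_bias_feature(X):
    """Append a constant feature equal to 1 to every example,
    so the bias becomes just one more coefficient in w."""
    ones = np.ones((X.shape[0], 1))
    return np.hstack([X, ones])

def predict_binary(w, X):
    """Return the sign of the dot product w^T x for every example:
    +1 for the positive class, -1 for the negative class."""
    scores = X @ w                       # real-valued scores
    return np.where(scores >= 0, 1, -1)  # sign, mapping 0 to +1 by convention

# Toy usage: two original features, so w has d + 1 = 3 components.
X = add_bias_feature(np.array([[1.0, 2.0], [-1.5, 0.5]]))
w = np.array([0.4, -0.2, 0.1])           # the last component plays the role of the bias
print(predict_binary(w, X))              # [ 1 -1]
```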
The higher the score, the more confident the model is that this point belongs to the first class. We build such a model for every class, so we have K linear models, and each model calculates a score. Then we assign a new example to the class with the largest score, the class with the highest confidence. For example, if we have three classes and our score vector is (7, -7.5, 10), then we assign the example to the third class, because the third component of the score vector is the largest.

Okay. Now we have the model, and we should somehow learn it, so we need a loss function. Let's start with the simplest loss function, the accuracy loss. To define it, we'll need some notation: Iverson brackets. These are just square brackets with some logical statement written inside. If the statement is true, the value of the brackets is one, and if the statement is false, the value of the brackets is zero.

Now let's define the accuracy metric. We take an example x_i, find the prediction a(x_i), compare it to the true class y_i, and write the equality of the predicted class and the true class in Iverson brackets. The value of the bracket will be one if we guessed the class correctly, and zero if we misclassified this point. Then we just average these brackets over all our data points, over the whole training set. So, essentially, accuracy is just the ratio of correctly classified points in our training set.

This metric is good and can be easily interpreted, but it has two large disadvantages. First, as we'll learn in the next videos, we need a gradient to optimize our loss function effectively, and accuracy doesn't have gradients with respect to the model parameters. So we cannot optimize it; we cannot train the model with the accuracy score. Second, this metric doesn't take into account how confident the model is in its prediction.
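A small sketch of the one-versus-all decision rule and the accuracy metric just described, assuming NumPy; here W is a hypothetical (K x d) matrix whose k-th row is the weight vector of the classifier for class k, and all values are made up.

```python
import numpy as np

def predict_multiclass(W, X):
    """Compute K scores per example and pick the class with the largest score."""
    scores = X @ W.T                      # shape (n_examples, K)
    return np.argmax(scores, axis=1) + 1  # classes numbered 1..K

def accuracy(y_pred, y_true):
    """Average of Iverson brackets [a(x_i) == y_i] over the data set."""
    return np.mean(y_pred == y_true)

# The score vector (7, -7.5, 10) from the example: the third class wins.
print(np.argmax(np.array([7.0, -7.5, 10.0])) + 1)      # 3

# Toy usage with K = 3 classes and two features.
X = np.array([[1.0, 0.0], [0.0, 2.0]])
W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
y_pred = predict_multiclass(W, X)                       # [1 2]
print(y_pred, accuracy(y_pred, np.array([1, 2])))       # [1 2] 1.0
```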
Indeed, we have the dot product of the weight vector and the feature vector, w and x, and the larger this score, the more confident the model is in its prediction. If the dot product has a positive sign and a large value, the model is confident. But if the sign is positive and the value is close to zero, the model is not confident. We want not only a model that makes correct decisions, that guesses the classes correctly, but a confident model, and it's known from machine learning that models with high confidence generalize better.

Okay, accuracy doesn't fit, so we need some other loss function. Maybe we can use mean squared error. Suppose that we have some example x_i that belongs to the positive class, class one, and consider the squared loss on this example. We take the dot product between w and x, compare it to one, and take the square of the difference. If our model predicts one, the guess is correct and the loss is zero. If our model gives a prediction between zero and one, it is not confident in its decision, and we penalize it for low confidence. If the model gives a value lower than zero, it misclassifies this point, so we give it an even larger penalty. That's okay, but if the model predicts a value larger than one, we also penalize it, this time for high confidence, and that's not very good. We should give a small or zero loss to high-confidence decisions.

Okay. So we can just take one branch of our squared loss, penalize for low confidence and for misclassification, and give zero loss for high confidence. Actually, there are many loss functions that look like this one, and each of them leads to its own classification method. We'll discuss one of the most important methods for us, logistic regression. To talk about it, we should first find a way to convert the scores from our linear classifiers into probabilities, into a distribution. So, we have some vector of scores z, whose components are the dot products of each class's weight vector with x; these are the scores for each of our classes.
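A small numerical illustration of the point about the two losses, assuming the target is y = +1. The video does not name a specific "one-branch" loss; the hinge loss is used here only as one standard example of a loss with that shape.

```python
import numpy as np

def squared_loss(score, y=1.0):
    # Squared loss keeps growing when the score exceeds the target,
    # i.e. it penalizes confident correct predictions.
    return (score - y) ** 2

def hinge_loss(score, y=1.0):
    # Zero loss for confident correct predictions, growing penalty
    # for low confidence and for misclassification.
    return np.maximum(0.0, 1.0 - y * score)

for s in [-1.0, 0.2, 1.0, 3.0]:
    print(f"score={s:+.1f}  squared={squared_loss(s):.2f}  hinge={hinge_loss(s):.2f}")
# score=+3.0 gets squared loss 4.00 but hinge loss 0.00
```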
Dot products can have any sign and any magnitude, so we cannot interpret them as a probability distribution, and we should somehow change that. We'll do it in two steps. In the first step, we take the first component of our vector and take e to the power of this component. We do the same with the second component, and so on up to the last component. After this step, we have a vector, e to the power of z, that has only positive coordinates. Now we only need to normalize these components to get a distribution. To do that, we sum all the components of this e-to-the-z vector and divide each component by the sum. After that, we get a vector sigma of z that is normalized and has only non-negative components, so we can interpret it as a probability distribution. This transform is called the softmax function, the softmax transform.

Consider an example with three classes and the scores (7, -7.5, 10). If we apply the softmax transform to this vector, we get the vector sigma of z with components 0.05, zero, and 0.95. So the third component was the largest before the transform, and it has the largest probability after the softmax transform.

Okay. Now we have a way to transform our scores into probabilities; these are the predicted probabilities of the classes. Next we need some target vector, the vector that we want our probabilities to be equal to. Of course, we want the probability of the true class to be one, and the probabilities of all other classes to be zero. So we form a target vector b, which is just a binary vector of size K, where K is the number of classes; it has a one in the component that corresponds to the true class of the current example and zeros in all other coordinates.

Now we have the target vector b and the vector of predicted class probabilities sigma of z, and we should somehow measure the distance between these probability distributions. To do that, we can use cross-entropy. Essentially, cross-entropy is just minus the log of the predicted probability of the true class.
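A sketch of the softmax transform and the cross-entropy loss just described, reproducing the (7, -7.5, 10) example. The max-subtraction inside softmax is a standard numerical-stability trick not mentioned in the video; it does not change the result.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))   # exponentiate each score (shifted for stability)
    return e / e.sum()          # normalize to get a probability distribution

def cross_entropy(p, b):
    """Minus sum over classes of b_k * log(p_k), where b is the one-hot target."""
    eps = 1e-12                 # avoid log(0) in code; mathematically the loss is +inf there
    return -np.sum(b * np.log(p + eps))

z = np.array([7.0, -7.5, 10.0])
p = softmax(z)
print(np.round(p, 2))                   # roughly [0.05 0.   0.95]

b = np.array([0.0, 0.0, 1.0])           # one-hot target: the true class is the third one
print(round(cross_entropy(p, b), 3))    # -log(0.95), about 0.049
```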
We can also write it as minus the sum over classes k of the indicator that our class y equals k, multiplied by the log of the predicted probability of class k.

Let's look at some examples of cross-entropy. Suppose that we have three classes and our example belongs to the first class, so y equals one. Suppose that we have some model that assigns a probability of one to the first class and zero probability to the second and third classes. This model makes a correct guess, and the cross-entropy is zero, because it's a perfect model for us. If we have a model that predicts 0.5 for the first class and 0.25 for each of the two other classes, then the cross-entropy is approximately 0.7, so there is some loss here. But if we have a model that assigns a probability of one to the second class and zero probability to the first class, then the cross-entropy equals plus infinity, because we multiply one by the logarithm of zero. So cross-entropy gives a very high penalty to models that are confident in wrong decisions.

Okay. Now we can just sum the cross-entropies over all examples in our training set, and that will be our loss function. It's quite complicated, and we cannot find an analytical solution for this problem, so we need some numerical method to optimize it; we'll discuss such methods in the following videos.

So in this video, we discussed how to apply linear models to classification problems, both binary classification and multi-class classification, and we discussed how a loss for classification problems should look. One of the most important methods for learning linear classifiers is logistic regression, and we discussed what the loss looks like in this case. In the next video, we'll talk about gradient descent, a numerical method that optimizes any differentiable loss function.
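To close, here is a minimal sketch of the summed cross-entropy loss described above for a multi-class linear model. W is a hypothetical (K x d) weight matrix and the data are made up; there is no closed-form minimizer, so in practice a numerical method such as the gradient descent mentioned for the next video would be used to optimize it.

```python
import numpy as np

def softmax_rows(Z):
    """Row-wise softmax: turn each row of scores into a probability distribution."""
    E = np.exp(Z - Z.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

def total_cross_entropy(W, X, y):
    """Sum of -log p(true class) over all examples; y holds labels 0..K-1."""
    P = softmax_rows(X @ W.T)                   # predicted class probabilities
    return -np.sum(np.log(P[np.arange(len(y)), y] + 1e-12))

# Toy data: 3 classes, 2 features, 4 examples.
X = np.array([[1.0, 2.0], [0.5, -1.0], [-2.0, 0.3], [1.5, 1.5]])
y = np.array([0, 1, 2, 0])
W = np.zeros((3, 2))                            # with zero weights every class gets p = 1/3
print(round(total_cross_entropy(W, X, y), 3))   # 4 * (-log(1/3)), about 4.394
```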