In this and the next few videos, I want to start to talk about classification problems, where the variable y that you want to predict is discrete-valued. We'll develop an algorithm called logistic regression, which is one of the most popular and most widely used learning algorithms today.

Here are some examples of classification problems. Earlier, we talked about email spam classification as an example of a classification problem. Another example would be classifying online transactions. So, if you have a website that sells stuff and you want to know whether a particular transaction is fraudulent or not, that is, whether someone is using a stolen credit card or has stolen the user's password, that's another classification problem. And earlier we also talked about the example of classifying tumors as malignant or as benign.

In all of these problems, the variable that we're trying to predict is a variable y that we can think of as taking on two values, either zero or one: either spam or not spam, fraudulent or not fraudulent, malignant or benign.

Another name for the class that we denote with 0 is the negative class, and another name for the class that we denote with 1 is the positive class. So 0 may denote a benign tumor, and 1, the positive class, may denote a malignant tumor. The assignment of the two classes, spam and not spam and so on, to positive and negative, to 0 and 1, is somewhat arbitrary, and it doesn't really matter. But often there is the intuition that the negative class conveys the absence of something, like the absence of a malignant tumor, whereas the positive class, 1, conveys the presence of something that we may be looking for.
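A compact restatement of the label convention just described, written out as a math block (the tumor and spam examples are the ones from the lecture):

```latex
y \in \{0, 1\}
\qquad
\begin{cases}
y = 0: & \text{negative class (e.g., benign tumor, not spam)} \\
y = 1: & \text{positive class (e.g., malignant tumor, spam)}
\end{cases}
```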
But the definition of which is negative and which is positive is somewhat arbitrary, and it doesn't matter that much.

For now, we're going to start with classification problems with just two classes, zero and one. Later on, we'll talk about multi-class problems as well, where the variable y may take on, say, four values: zero, one, two, and three. This is called a multi-class classification problem, but for the next few videos, let's start with the two-class, or binary, classification problem, and we'll worry about the multi-class setting later.

So, how do we develop a classification algorithm? Here's an example of a training set for a classification task, for classifying a tumor as malignant or benign, and notice that malignancy takes on only two values: zero, or no, and one, or yes. So, one thing we could do given this training set is to apply an algorithm that we already know, linear regression, to this data set, and just try to fit a straight line to the data. So, if you take this training set and fit a straight line to it, maybe you get a hypothesis that looks like that. All right, so that's my hypothesis, h(x) equals theta transpose x. If you want to make predictions, one thing you could try doing is to threshold the classifier outputs at 0.5, that is, at the vertical-axis value 0.5. If the hypothesis outputs a value that's greater than or equal to 0.5, you predict y equals one; if it's less than 0.5, you predict y equals zero. Let's see what happens when we do that. So, let's take 0.5, and that's where the threshold is, using linear regression this way. (The sketch after this paragraph shows the fit-then-threshold idea in code.)
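None of the following code is from the lecture; it's a minimal sketch of this fit-then-threshold idea in NumPy, with made-up tumor sizes and labels:

```python
import numpy as np

# Hypothetical training set: tumor sizes (feature x) and labels y (0 = benign, 1 = malignant).
x = np.array([1.0, 1.5, 2.0, 2.5, 3.5, 4.0, 4.5, 5.0])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Fit h(x) = theta0 + theta1 * x by ordinary least squares (plain linear regression).
X = np.column_stack([np.ones_like(x), x])  # prepend an intercept column
theta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Threshold the real-valued hypothesis output at 0.5 to turn it into a classifier.
h = X @ theta
predictions = (h >= 0.5).astype(int)
print("hypothesis outputs:", np.round(h, 2))  # real values around the labels
print("predictions:", predictions)            # 0/1 after thresholding
```

On this tidy, well-separated data set, the thresholded line classifies every example correctly, which is exactly the "doing something reasonable" case described next.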
Everything to the right of this point we will end up predicting as the positive class, because the output values are greater than 0.5 on the vertical axis, and everything to the left of that point we will end up predicting as the negative class. In this particular example, it looks like linear regression is actually doing something reasonable, even though this is a classification task we're interested in.

But now let's try changing the problem a bit. Let me extend out the horizontal axis a bit, and let's say we got one more training example way out there on the right. Notice that that additional training example, this one out here, doesn't actually change anything, right? Looking at the training set, it is pretty clear what a good hypothesis is: everything to the right of somewhere around here we should predict as positive, and everything to the left we should probably predict as negative, because from this training set it looks like all the tumors larger than a certain value around here are malignant, and all the tumors smaller than that are not malignant, at least for this training set.

But once we've added that extra example out here, if you now run linear regression, you instead get a straight-line fit to the data that maybe looks like this. And if you now threshold this hypothesis at 0.5, you end up with a threshold that's around here, so that everything to the right of this point you predict as positive, and everything to the left of that point you predict as negative. And this seems like a pretty bad thing for linear regression to have done, right? Because these are our positive examples, and these are our negative examples. (The sketch below reproduces this effect numerically.)
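To make that concrete with hypothetical numbers (the lecture only shows a figure): adding a single very large, clearly malignant tumor drags the least-squares line down and shifts the 0.5 crossover point to the right, so a previously correct prediction flips.

```python
import numpy as np

def fit_line(x, y):
    """Least-squares fit of h(x) = theta0 + theta1 * x; returns (theta0, theta1)."""
    X = np.column_stack([np.ones_like(x), x])
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return theta

x = np.array([1.0, 1.5, 2.0, 2.5, 3.5, 4.0, 4.5, 5.0])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

theta0, theta1 = fit_line(x, y)
print("crossover without outlier:", (0.5 - theta0) / theta1)  # where h(x) = 0.5; x = 3.0

# One more example way out on the right: a huge tumor that is, unsurprisingly, malignant.
x2 = np.append(x, 20.0)
y2 = np.append(y, 1)

theta0, theta1 = fit_line(x2, y2)
print("crossover with outlier:", (0.5 - theta0) / theta1)     # shifted right, to about 3.7
```

With the outlier added, the malignant example at x = 3.5 falls below 0.5 and is now predicted benign, even though the new point told us nothing we didn't already know.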
It's pretty clear we should really be separating the two classes somewhere around there. But somehow, by adding one example way out here to the right, an example that really isn't giving us any new information, I mean, it should be no surprise to the learning algorithm that the example way out there turns out to be malignant, adding that example caused linear regression to change its straight-line fit to the data from this magenta line out here to this blue line over here, and caused it to give us a worse hypothesis.

So, applying linear regression to a classification problem often isn't a great idea. In the first example, before I added this extra training example, linear regression was just getting lucky, and it got us a hypothesis that worked well for that particular data set. But usually, if you apply linear regression to a data set, you might get lucky, but often it isn't a good idea, so I wouldn't use linear regression for classification problems.

Here is one other funny thing about what would happen if we were to use linear regression for a classification problem. For classification, we know that y is either zero or one, but if you are using linear regression, the hypothesis can output values much larger than one or less than zero, even if all of the training examples have labels y equals zero or one. And it seems kind of strange that, even though we know the labels should be zero or one, the algorithm can output values much larger than one or much smaller than zero. (The sketch below shows this happening with the hypothetical fit from above.)
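Continuing the same hypothetical fit: the straight line h(x) = theta0 + theta1 * x dips below 0 for small tumors and climbs well above 1 for large ones, even though every label is 0 or 1.

```python
import numpy as np

x = np.array([1.0, 1.5, 2.0, 2.5, 3.5, 4.0, 4.5, 5.0])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
X = np.column_stack([np.ones_like(x), x])
theta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Evaluate the linear hypothesis at small, medium, and large tumor sizes.
for size in [0.0, 3.0, 10.0]:
    h = theta[0] + theta[1] * size
    print(f"h({size}) = {h:.2f}")  # about -0.5, 0.5, and 2.83: outside [0, 1] at both ends
```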
So, what we'll do in the next few videos is develop an algorithm called logistic regression, which has the property that the output, the predictions of logistic regression, are always between zero and one, and don't become bigger than one or less than zero. And by the way, logistic regression is, and we will use it as, a classification algorithm. It may sometimes be confusing that the term "regression" appears in its name, even though logistic regression is actually a classification algorithm. But that's just the name it was given for historical reasons, so don't be confused by that. Logistic regression is actually a classification algorithm that we apply to settings where the label y is discrete-valued, that is, zero or one. So hopefully you now know why, if you have a classification problem, using linear regression isn't a good idea. In the next video, we'll start working out the details of the logistic regression algorithm.
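As a preview of where the next videos go (the sigmoid form below is the standard logistic-regression hypothesis, not something defined in this video): passing theta transpose x through g(z) = 1 / (1 + e^(-z)) keeps every prediction strictly between zero and one.

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# No matter how extreme theta transpose x gets, the output never leaves (0, 1).
for z in [-100.0, -1.0, 0.0, 1.0, 100.0]:
    print(f"g({z}) = {sigmoid(z):.4f}")
```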