1 00:00:04,443 --> 00:00:09,144 We started with the intuition of a linear classifier using the sentiment 2 00:00:09,144 --> 00:00:10,820 analysis example. 3 00:00:10,820 --> 00:00:13,040 Let's take a little bit of a deeper dive and 4 00:00:13,040 --> 00:00:18,350 understand more about what a linear classifier model really captures. 5 00:00:18,350 --> 00:00:21,574 In particular, we're going to take some data set, 6 00:00:21,574 --> 00:00:25,150 some data set which is going to feed us x's. 7 00:00:25,150 --> 00:00:28,470 We're going to get features out of that just like we did in the regression course. 8 00:00:28,470 --> 00:00:31,632 And we're going to feed those into the machine learning model for 9 00:00:31,632 --> 00:00:35,610 classification, which is going to output predictions y hat. 10 00:00:35,610 --> 00:00:38,420 And that model depends on some parameters w hat which 11 00:00:38,420 --> 00:00:39,750 we're going to train from data. 12 00:00:41,190 --> 00:00:44,180 Now let's go back to that example that we had with 13 00:00:44,180 --> 00:00:49,060 just two features with non-zero coefficients, awesome and awful. 14 00:00:49,060 --> 00:00:52,320 Now suppose that we had a third feature with a non-zero coefficient. 15 00:00:52,320 --> 00:00:53,410 Let's say it's the word great. 16 00:00:54,420 --> 00:00:59,220 Now in this case every data point corresponds to a point 17 00:00:59,220 --> 00:01:00,900 in this three-dimensional space. 18 00:01:00,900 --> 00:01:08,200 So for example this data point over here might have five awesomes, 19 00:01:08,200 --> 00:01:13,750 it might have three awfuls, and 20 00:01:13,750 --> 00:01:19,340 it might have two greats associated with it. 21 00:01:19,340 --> 00:01:22,630 And what a linear classifier model is going to do is try to 22 00:01:22,630 --> 00:01:26,820 build a hyperplane that separates the positive examples from the negative examples. 23 00:01:26,820 --> 00:01:30,153 And the hyperplane is associated with a score function. 24 00:01:30,153 --> 00:01:34,367 The score function is going to be a weighted combination of 25 00:01:34,367 --> 00:01:38,928 the coefficients with the features that we have. 26 00:01:38,928 --> 00:01:43,603 So w0 + w1 times the number of awesomes, which in 27 00:01:43,603 --> 00:01:48,646 our case is 5, + w2 times the number of awfuls, 28 00:01:48,646 --> 00:01:52,336 which in our case is 3, and finally, 29 00:01:52,336 --> 00:01:57,890 w3 times the number of greats, which in our case is 2. 30 00:01:57,890 --> 00:02:02,118 So for this data point over here, 31 00:02:02,118 --> 00:02:06,950 the score of xi is going to be defined by 32 00:02:06,950 --> 00:02:11,631 w0 + 5w1 + 3w2 + 2w3, and 33 00:02:11,631 --> 00:02:17,069 depending on the coefficients, 34 00:02:17,069 --> 00:02:22,220 that score may be positive or negative. 35 00:02:22,220 --> 00:02:24,860 So, since this is a positive training example, 36 00:02:24,860 --> 00:02:28,990 we want to choose the w's that make that score positive. 37 00:02:28,990 --> 00:02:30,710 Now that we've set up the classification problem 38 00:02:30,710 --> 00:02:35,300 and the task that we're after, let's do a quick review of notation for the course. 39 00:02:35,300 --> 00:02:38,130 In this course we're going to use the same notation that we used 40 00:02:38,130 --> 00:02:43,030 in the regression course, which was the second course in the specialization. 41 00:02:43,030 --> 00:02:47,920 And here we have an output y, which is the thing you're trying to predict.
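Before moving on to the notation review, here is a minimal Python sketch of the score computation just described. The coefficient values below are made up purely for illustration; only the feature counts (5 awesomes, 3 awfuls, 2 greats) come from the example.

# Hypothetical coefficients: [w0 (intercept), w1 (#awesome), w2 (#awful), w3 (#great)]
w = [1.0, 1.2, -2.0, 0.5]

# Feature counts for the example data point: 5 awesomes, 3 awfuls, 2 greats
x = [5, 3, 2]

# Score(x) = w0 + 5*w1 + 3*w2 + 2*w3
score = w[0] + sum(wj * xj for wj, xj in zip(w[1:], x))
print(score)  # 2.0 with these made-up weights, so the point lands on the positive side

With a different choice of w's the same counts could produce a negative score, which is exactly why, for this positive training example, we want w's that make the score positive.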
42 00:02:47,920 --> 00:02:53,600 In the regression case that was a real value, but in our case it's a class. 43 00:02:53,600 --> 00:02:57,906 And we have a set of inputs, x, which has little d dimensions: 44 00:02:57,906 --> 00:03:01,520 x[1], x[2], through x[d]. 45 00:03:01,520 --> 00:03:05,380 So x is really a d-dimensional vector, and y is the output we're trying to predict, 46 00:03:05,380 --> 00:03:07,920 which in our case is either minus one or plus one in 47 00:03:07,920 --> 00:03:11,338 the binary classification setting, which is where we're starting out today. 48 00:03:11,338 --> 00:03:20,210 Now we use x[j] to denote the jth input, which is a scalar value. 49 00:03:20,210 --> 00:03:26,020 We're going to use hj of x to denote the jth feature. 50 00:03:26,020 --> 00:03:31,160 And then we'll use x sub i, very importantly, to denote the ith data point. 51 00:03:32,820 --> 00:03:38,600 And x sub i of j denotes the jth input of the ith data point. 52 00:03:38,600 --> 00:03:41,627 It's a little bit of a handful, but 53 00:03:41,627 --> 00:03:47,160 it's exactly the same as what we did in the regression course. 54 00:03:47,160 --> 00:03:51,905 Now, equipped with this notation, we can go back and 55 00:03:51,905 --> 00:03:56,050 define our simple hyperplane, the one we just saw with awfuls and awesomes. 56 00:03:56,050 --> 00:04:00,560 And just say that, in this case, y hat, our prediction, is the sign 57 00:04:00,560 --> 00:04:04,840 of the score that we have for this particular input. 58 00:04:04,840 --> 00:04:09,360 And this sign function just says that if the score is greater than 0, 59 00:04:09,360 --> 00:04:11,080 predict plus 1. 60 00:04:11,080 --> 00:04:13,510 If the score is less than 0, predict minus 1. 61 00:04:13,510 --> 00:04:19,070 And then at zero, you have the choice of either predicting minus 1 or plus 1. 62 00:04:19,070 --> 00:04:20,660 You can make an arbitrary choice. 63 00:04:20,660 --> 00:04:25,140 The way I think about it is if it's 0, I predict plus 1. 64 00:04:25,140 --> 00:04:30,262 Now the score of an input xi is w0 + 65 00:04:30,262 --> 00:04:35,914 w1 times xi[1] + w2 times xi[2], 66 00:04:35,914 --> 00:04:41,921 all the way to wd, the dth coefficient, 67 00:04:41,921 --> 00:04:47,420 times xi[d], the dth entry of the xi vector. 68 00:04:50,200 --> 00:04:54,060 And here, among the features of the input, the first one is 1, 69 00:04:54,060 --> 00:04:59,640 the constant feature, like we did with regression, and x[1] could 70 00:04:59,640 --> 00:05:04,460 be the number of awesomes, x[2] could be the number of awfuls, and 71 00:05:04,460 --> 00:05:09,040 say the last one, x[d], could be the number of times the word ramen shows up. 72 00:05:09,040 --> 00:05:12,790 Which to me might be associated with a negative review, but 73 00:05:12,790 --> 00:05:14,390 it might be kind of indifferent. 74 00:05:14,390 --> 00:05:15,880 It depends on what coefficient you have there. 75 00:05:18,230 --> 00:05:23,010 So our goal is to fit the score, that is, to learn those coefficients from data. 76 00:05:23,010 --> 00:05:27,456 And I'm going to use w transpose xi as a shorthand so 77 00:05:27,456 --> 00:05:33,435 I don't have to always write w0 plus w1 times x[1] plus w2 times x[2] and so on. 78 00:05:33,435 --> 00:05:37,646 So we use this transpose notation, which is the same one that we 79 00:05:37,646 --> 00:05:40,354 talked about in the regression course. 80 00:05:40,354 --> 00:05:44,839 [MUSIC]
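Putting the sign rule and the score together, here is a minimal sketch of the prediction y hat = sign(w transpose x), assuming NumPy, a constant feature of 1 paired with w0, and the convention from the lecture that a score of exactly zero maps to plus 1. The coefficient values are again hypothetical, not learned from data.

import numpy as np

def predict(w, x):
    # Score(x) = w0*1 + w1*x[1] + ... + wd*x[d]; w[0] pairs with the constant feature
    score = w[0] + np.dot(w[1:], x)
    # sign of the score, with the tie at 0 broken toward +1 as in the lecture
    return 1 if score >= 0 else -1

# Hypothetical coefficients for [constant, #awesome, #awful, #great]
w_hat = np.array([1.0, 1.2, -2.0, 0.5])

print(predict(w_hat, np.array([5, 3, 2])))  # +1: the example review scores positive
print(predict(w_hat, np.array([0, 4, 0])))  # -1: many awfuls push the score negative

In practice, of course, w hat is not hand-picked like this; it is trained from data, which is what the rest of the course builds toward.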