Now let's take a deep dive and understand how we're going to predict the probability that a sentence is positive or negative using linear classifiers, or, as we call them, linear models.

We take some input data x, we compute some features h, and we output P hat, the estimated probability that the label is positive or negative.

Going back to our decision-boundary example, we computed the score of every data point as w transpose h of x, that is, w0 h0 + w1 h1 + w2 h2 + w3 h3, and so on. Everything below the line had a score greater than 0; everything above the line had a score less than 0. But we don't know how far: the score could be a lot less than 0 or a lot greater than 0. So how do we relate these scores, which could be anywhere between minus infinity and positive infinity, to the probability that the output is plus one, the probability that the sentence is positive? That's the task we're going to take on today.

In fact, w transpose h, the score, can range from minus infinity to plus infinity. If it's positive, greater than 0, we're going to output +1, and if it's negative, less than 0, we're going to output -1.
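As a small concrete sketch of the score computation described above, here is the dot product w transpose h for one data point. The weights and feature values are made-up numbers for illustration, not the course's actual model:

```python
# Hypothetical learned weights and extracted features (made-up values).
w = [0.0, 1.0, -1.5]   # w0 (intercept), w1 (e.g. #awesome), w2 (e.g. #awful)
h = [1.0, 2.0, 1.0]    # h0 = 1 (constant), h1 = 2, h2 = 1

# Score = w^T h = w0*h0 + w1*h1 + w2*h2
score = sum(wi * hi for wi, hi in zip(w, h))
print(score)           # 0.5

# Positive score -> predict +1; negative score -> predict -1.
y_hat = +1 if score > 0 else -1
print(y_hat)           # 1
```

The sign of the score gives the predicted label, but the magnitude (0.5 here versus, say, 50) is what we still need to turn into a probability.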
Now, what we want to say is: if that score is really, really, really big, like infinity, then we're very sure that y hat is +1. So we're going to say that the probability that y = +1 given this input is 1. On the other end of the spectrum, if the score is very, very, very negative, like minus infinity, we're very sure that y hat is -1, and we should output probability 0 that y is +1 for this particular input x.

Now, if the score is 0, we're right at the decision boundary, where we're neither positive nor negative. So we're indifferent between predicting y hat as +1 or -1, and with probabilities we can express that indifference: we say the probability that y is +1 given the input is 0.5, 50-50. It could go either way. So that's our goal: to predict those probabilities from the scores.

So we have the scores. The scores range from minus infinity to plus infinity, and they're a weighted combination of the features. Probabilities are between 0 and 1. If the score is minus infinity, I want the predicted probability to be 0. If the score is plus infinity, I want the predicted probability to be 1.0. And if the score is 0.0, I want to say the probability is 0.5.
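The three requirements above (a very negative score maps near 0, a score of 0 maps to exactly 0.5, a very positive score maps near 1) can be checked directly. The logistic sigmoid shown here is one standard function that satisfies all three; this is a sketch of the desired behavior, not yet the course's formal derivation:

```python
import math

def squash(score):
    # Logistic sigmoid: maps any real-valued score into the interval (0, 1).
    return 1.0 / (1.0 + math.exp(-score))

print(squash(-100.0))  # ~0.0: very negative score -> P(y = +1) near 0
print(squash(0.0))     # 0.5: on the decision boundary -> 50-50
print(squash(100.0))   # ~1.0: very positive score -> P(y = +1) near 1
```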
Scores range from minus infinity to plus infinity; probabilities range between 0 and 1. The question is: how do I relate a score between minus infinity and plus infinity to a probability between 0 and 1? How do I link these two things?

And now we're going to see some magic. [LAUGH] The magic that glues, that links, the range minus infinity to plus infinity to the range 0 to 1 is called a link function, so named because it links the two. I'm going to take the score, which is between minus infinity and plus infinity, push it through a function g that squeezes that huge line into the interval 0 to 1, and use the result to predict the probability that y equals +1.

And when you take a linear model, w transpose h, ranging from minus infinity to plus infinity, and squeeze it into 0 to 1 using a link function, you're building what's called a generalized linear model. So if somebody stops you in the street today and asks what a generalized linear model is, say: no problem, it's just like a regression model, but you squeeze the output into 0 to 1 by pushing it through a link function. It's a little abstract, so we're going to talk about it in the context of logistic regression, which uses a specific link function.
Now, I talked about generalized linear models as squeezing minus infinity to plus infinity into the interval 0 to 1. That's true for classifiers, and for most kinds of classifiers. There are other types of generalized linear models that don't squeeze between 0 and 1, but for our purposes you can think about them in that context.

So in this context, our goal now becomes: take your training data and push it through some feature extraction, which gives us the h's: TF-IDF, the number of "awesome"s, or however else you represent the data. Then we build a linear model, w transpose h, push it through the link function that squeezes it into the interval 0 to 1, and use that to predict the probability that the sentiment of your review is positive given the input sentence.
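Putting the whole pipeline together, here is a hedged end-to-end sketch. The word-count feature extraction and weight values are invented for illustration, and the link function is the logistic sigmoid used by logistic regression:

```python
import math

# Hypothetical learned weights: (intercept, per-'awesome' weight, per-'awful' weight)
WEIGHTS = [0.0, 1.2, -1.5]

def extract_features(sentence):
    # Toy feature extraction: a constant feature plus two word counts.
    words = sentence.lower().split()
    return [1.0, float(words.count("awesome")), float(words.count("awful"))]

def predict_probability(sentence):
    h = extract_features(sentence)
    score = sum(w * hi for w, hi in zip(WEIGHTS, h))  # linear model: w^T h
    return 1.0 / (1.0 + math.exp(-score))             # link function g

print(predict_probability("awesome awesome product"))  # > 0.5: likely positive
print(predict_probability("awful awful awful"))        # < 0.5: likely negative
```

A sentence containing none of the tracked words gets score 0 and probability exactly 0.5, matching the decision-boundary case described earlier.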