[MUSIC]

In order to learn the coefficients w hat from data, we have to define some kind of quality metric, so that the coefficients that give you the highest quality are what w hat is supposed to be. So if we go back to our learning loop: we take some training data, we push it through a feature generation, or feature extraction, system, and that gets us h(x). Our model is a logistic regression model, and now we're going to talk about the quality metric, which is going to feed into the machine learning algorithm that outputs w hat. So we're turning to this orange box over here.

In particular, we're now given some training data as the input, and the data is kind of like the one I'm showing you. If it only had two features, it would have, for example, the number of awesomes, the number of awfuls, and the output label, or sentiment, which might be plus one or negative one. All this data, the N data points, we feed into some kind of learning algorithm that optimizes the quality metric to give us w hat. So what does the quality metric look like? Let's get right into it and try to understand what the likelihood function that we hinted at in the previous module is really about.

Let's take our data set, just for intuition, and split it into the data points that have positive sentiment on the right and the data points that have negative sentiment on the left. So we have two tables, with slightly more negative sentiments than positive sentiments in this example. So what do we want w hat to satisfy? What would a good w satisfy? For all the data points with positive sentiment, in the extreme we want the probability that the sentiment is positive to go all the way to plus one. For all the negative ones, we want that probability to go all the way to zero: it's not positive, so it must be negative. So our goal is to find a w hat that makes this happen, or gets as close as possible.

In other words, if we take the negative examples and the positive examples, there might not be a w hat that achieves exactly zero for the negatives and one for the positives, for all of them. So the quality metric, the likelihood function, measures the quality on average throughout all the data points: with respect to the coefficients w, how well we're making these extremes happen.
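To make this concrete, here is a minimal sketch in Python of what evaluating such a quality metric could look like: the likelihood of a coefficient vector w is the product, over the data points, of the probability the logistic regression model assigns to each observed sentiment. The toy data set and the function names here are made up for illustration; they are not the course's actual code.

```python
import numpy as np

def sigmoid(score):
    # P(y = +1 | x, w) under the logistic regression model
    return 1.0 / (1.0 + np.exp(-score))

def likelihood(w, H, y):
    """Product over all N data points of P(y_i | x_i, w).

    H: (N, D) matrix of features h(x_i), one row per data point
    y: (N,) sentiment labels, +1 or -1
    w: (D,) coefficient vector
    """
    p_plus = sigmoid(H @ w)                         # P(y = +1) for every point
    per_point = np.where(y == 1, p_plus, 1 - p_plus)
    return per_point.prod()

# Toy data: feature columns are [constant, #awesome, #awful]
H = np.array([[1.0, 3.0, 0.0],
              [1.0, 0.0, 2.0],
              [1.0, 2.0, 1.0],
              [1.0, 1.0, 4.0]])
y = np.array([+1, -1, +1, -1])

print(likelihood(np.array([0.0, 1.0, -1.0]), H, y))
```

A w that pushes P(y = +1) toward one on the positive rows and toward zero on the negative rows makes every factor, and therefore the whole product, larger.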
Now, if I have the likelihood function, I can evaluate multiple lines, or multiple classifiers. So for example, for the green line here, the likelihood function may have a certain value, let's say 10 to the minus 6. For this other line, where instead of having w0 be 0, now w0 is 1, but the w1 and w2 coefficients are the same, the likelihood is slightly higher, say 10 to the minus 5. But for the best line, which maybe sets w0 to be 1, w1 to be 0.5, and w2 to be -1.5, the likelihood is biggest, which in this case is 10 to the minus 4.

Now, you see these numbers, and they're kind of weird: 10 to the minus something. But this is what likelihoods will come out to. They're going to be very, very small numbers, less than one. But the higher you get, the closer you get to one, the better. And so the question is, how do we find the best w's, the best classifiers? We'll find the ones that make this likelihood function, which we're going to talk about, as big as possible. So we're going to define this function l(w), and then we're going to use gradient ascent to find w hat. And you should have some fond memories, maybe some sad, sad memories, from the regression course, where we talked about gradient descent and we explored the idea of using the gradient to find the best possible parameters to optimize the quality metric. And we're going to go through that in this case again.
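Since we'll pick up gradient ascent in detail later, here is just a minimal preview sketch, assuming the standard logistic regression log-likelihood gradient. In practice we climb the log of the likelihood, which turns those tiny products into manageable sums and has the same maximizer. The step size, iteration count, and toy data below are arbitrary illustrative choices, not values from the course.

```python
import numpy as np

def sigmoid(score):
    return 1.0 / (1.0 + np.exp(-score))

def gradient_ascent(H, y, step_size=0.1, n_iters=200):
    """Maximize the log-likelihood of logistic regression.

    Uses the indicator form of the gradient:
      dl/dw_j = sum_i h_j(x_i) * (1[y_i = +1] - P(y = +1 | x_i, w))
    """
    w = np.zeros(H.shape[1])              # start at w = 0
    indicator = (y == 1).astype(float)
    for _ in range(n_iters):
        errors = indicator - sigmoid(H @ w)
        w += step_size * (H.T @ errors)   # step uphill along the gradient
    return w

# Same toy data as above: [constant, #awesome, #awful] features
H = np.array([[1.0, 3.0, 0.0],
              [1.0, 0.0, 2.0],
              [1.0, 2.0, 1.0],
              [1.0, 1.0, 4.0]])
y = np.array([+1, -1, +1, -1])

w_hat = gradient_ascent(H, y)
print(w_hat)  # coefficients that approximately maximize the likelihood
```

Each iteration nudges w in the direction that most increases the log-likelihood, which is exactly the uphill analogue of the gradient descent steps from the regression course.

[MUSIC]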