Before we get to gradient descent, though, we need to figure out what quality metric we're trying to optimize. That quality metric is called the data likelihood. So let's dive in and explore exactly that concept.

To better understand the data likelihood, let's start from some simple examples and try to understand what w is trying to do, what we're trying to do here in the learning problem. Say we have a data point with two awesomes and one awful; let's call that input x1. Its output y1 is +1: this is a review with positive sentiment. If our classifier were really good, if w were good, what would happen? For this particular input it would output y hat 1 = +1. In other words, y hat agrees with the true label y.

So what should we do so that y hat agrees with the true label y? We're using logistic regression, so we're going to try to find a w that makes the probability that y = +1, when the input is x1 and the parameters are w, as big as possible. In other words, we want to make the probability that y = +1, when the number of awesomes is 2 and the number of awfuls is 1, as big as possible for the parameters w.

But that was for a positive example, where we make the probability that y = +1 as big as possible. Now let's take another example. Let's call this example x2, with 0 awesomes and 2 awfuls; it must be an awful review of an awful restaurant. So the sentiment y2 is -1. If our classifier is good, if w is good, in this case it should predict y hat 2 = -1 and agree with the true label. So in this case we're not maximizing the probability that y = +1; we're maximizing the probability that y = -1 when the input is x2, for the parameters w. In other words, when the input is x1 we maximize the probability that y = +1, and when the input is x2 we maximize the probability that y = -1. That's what the data likelihood tries to do, but let's dig in and understand it a little bit better.
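As a quick aside, here is a minimal sketch of those two probabilities in Python, assuming the usual logistic regression model P(y = +1 | x, w) = 1 / (1 + exp(-w·h(x))) with features h(x) = [#awesome, #awful]; the weight vector is made up purely for illustration.

```python
import numpy as np

def sigmoid(score):
    return 1.0 / (1.0 + np.exp(-score))

# Hypothetical weights for the two features [#awesome, #awful].
w = np.array([1.0, -1.5])

x1 = np.array([2, 1])   # 2 awesomes, 1 awful  -> true label y1 = +1
x2 = np.array([0, 2])   # 0 awesomes, 2 awfuls -> true label y2 = -1

p_plus_given_x1 = sigmoid(w @ x1)         # P(y = +1 | x1, w): want this big
p_minus_given_x2 = 1.0 - sigmoid(w @ x2)  # P(y = -1 | x2, w): want this big
print(p_plus_given_x1, p_minus_given_x2)
```

A w that scores awesome positively and awful negatively makes both of these probabilities large, which is exactly the behavior we want from good parameters.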
Now, we don't just have two training examples; we have a ton of examples, a big dataset. So what are we trying to do? Let's look at the first example. Since it's a positive example, we want to maximize the probability that y = +1 when the input is x1 and the parameters are w. In this case, that's the probability that y = +1 when the number of awesomes is 2 and the number of awfuls is 1, given the parameters w. So we try to make the probability that y = +1 given x1 as big as possible. For the next example, which is a negative example, we try to make the probability that y = -1 given x2 and w as big as possible. The third one is also a negative example, so we try to make the probability that y = -1 given x3 and w as big as possible. The fourth one is a positive example, so we try to make the probability that y = +1 given x4 and w as large as possible. In other words, for the positive examples we try to make the probability that y = +1 as big as possible, and for the negative examples we try to make the probability that y = -1 as big as possible, which is pretty natural. And we want to do that for each example, for every single one of those.

Now the question is: how do we combine these into a single quality metric? You can imagine multiple ways of combining them, averages, all sorts of ideas out there. The way you typically combine them, when you're doing what's called maximum likelihood estimation, or maximizing the likelihood, is by multiplying these things together. So you multiply. In other words, you take the probability that y = +1 given x1 and w, from the first line, times the probability that y = -1 given x2 and w, from the second line, and the third line is also negative, so times the probability that y = -1 given x3 and w, and so on. You just multiply them all together. Here's a little side note for those who know about probabilities: the reason you multiply is that you assume every row is independent of the others. So there's an independence assumption that comes into play, but don't worry too much about this.
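To make that multiplication concrete, here is a rough sketch for a tiny four-review dataset, using the same hypothetical model and weights as in the sketch above; the feature counts for x3 and x4 are invented for illustration.

```python
import numpy as np

def sigmoid(score):
    return 1.0 / (1.0 + np.exp(-score))

w = np.array([1.0, -1.5])        # hypothetical weights for [#awesome, #awful]

# (features, true label) pairs; the counts for x3 and x4 are made up
data = [
    (np.array([2, 1]), +1),      # x1: positive review
    (np.array([0, 2]), -1),      # x2: negative review
    (np.array([3, 3]), -1),      # x3: negative review
    (np.array([4, 1]), +1),      # x4: positive review
]

likelihood = 1.0
for x, y in data:
    p_plus = sigmoid(w @ x)                             # P(y = +1 | x, w)
    likelihood *= p_plus if y == +1 else 1.0 - p_plus   # multiply them all together
print(likelihood)
```

Finding the w that makes this product as large as possible is the maximum likelihood estimation just mentioned.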
Just think of it as a multiplication.

Okay, so let's do that a little bit more explicitly. Now we have these data points 1 through 4, and for each one of them we're trying to maximize a specific probability. For the positive examples, we're trying to maximize the probability that y = +1; for the negative examples, the probability that y = -1, given the parameters w. We're going to find the w that makes those as big as possible. So the likelihood function, the thing we want to optimize, is the product of each one of these probabilities.

I like that animation. Isn't it pretty cool?

So, we're going to use a shorthand notation here. For the first example, y1 up here was +1, so we denote that by the probability of y1 given x1 and w. The y1 comes from this +1, and the x1 comes from this representation of x. And similarly for the other ones, since the notation is going to get long and heavy: we just write the probability of y1 given x1 and w, times the probability of y2 given x2 and w, times the probability of y3 given x3 and w, and we just multiply them together.

And finally, so that the line doesn't get really long, because if we had a million or two million examples we would have a million of these entries, we use the product notation. That's the little product symbol over here, and it just says I'm going to write the same function: l(w) is equal to the product, ranging from the first data point to N, the number of data points, of the probability of whatever label yi has, +1 or -1, given the input xi, which is the sentence of that review, and the parameters w. This is the likelihood function that we're trying to optimize. So our goal here is to pick w to make this crazy thing, I mean, this function, as large as possible. That's our goal.
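Putting that product notation into a minimal sketch, here is the likelihood l(w) = P(y1 | x1, w) x ... x P(yN | xN, w) as a single function over the same tiny illustrative dataset; a real implementation would typically work with the log of this product for numerical stability, but that detail comes later.

```python
import numpy as np

def sigmoid(score):
    return 1.0 / (1.0 + np.exp(-score))

def data_likelihood(w, X, y):
    """l(w) = product over i of P(yi | xi, w) under the logistic model (sketch)."""
    p_plus = sigmoid(X @ w)                         # P(y = +1 | xi, w) for every row
    per_example = np.where(y == +1, p_plus, 1.0 - p_plus)
    return np.prod(per_example)

# Same tiny illustrative dataset as before: columns are [#awesome, #awful].
X = np.array([[2, 1], [0, 2], [3, 3], [4, 1]])
y = np.array([+1, -1, -1, +1])

print(data_likelihood(np.array([1.0, -1.5]), X, y))  # the quantity we want to make large
print(data_likelihood(np.array([0.0,  0.0]), X, y))  # a worse w gives a smaller value
```

The gradient method mentioned at the start of the lecture is how we will actually search for the w that makes this function as large as possible.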