Now that we've learned about the likelihood function, the thing that we're trying to maximize, let's talk about the gradient ascent algorithm that tries to make it as large as possible. In this section, we're going to go through a little bit of math and a little bit of detail, but in the end, the gradient ascent algorithm for learning a logistic regression classifier is going to be extremely simple and extremely intuitive. Even if the likelihood function is a little bit fuzzy for you and the gradient material isn't totally clear, in the end, the algorithm that you're going to implement only requires a few lines of code. In fact, you'll be able to do it extremely easily.

Good. We defined the model we want to fit, the logistic regression model, and we talked about the quality metric, the likelihood function. Now we're going to define the gradient ascent algorithm, which is the machine learning algorithm that tries to make the likelihood function as large as possible, in order to find that famous W hat that fits our data really well.

Now, we can go back to this picture that we've seen a few times, where we have multiple lines, each with its own likelihood, and we're trying to find the one with the best likelihood. Take this line here, with W0 = 1, W1 = 0.5, W2 = -1.5. We now know that the likelihood function is exactly this function up here: the product over the data points of the probability of the true label given the input sentence, ℓ(w) = ∏_i P(y_i | x_i, w). Our goal is to take this ℓ and optimize it with gradient ascent. That's what we're going to go after right now.

As a quick review, we have our likelihood function ℓ(W0, W1, W2), which is a function of three parameters in this little example over here, and we want to find the parameter values that maximize it. We're trying to find the maximum over all possible settings of W0, W1, and W2, and there are infinitely many of those, so if you tried to enumerate them, it would be impossible to try them all. But gradient ascent is this magically simple yet wonderful algorithm where you start from some point over here in the parameter space, which might be where the weight for awful is 0 and the weight for awesome is -6.
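To make that quantity concrete before we climb it, here is a minimal sketch in Python of how the (log of the) likelihood could be computed. This is an illustration under my own assumptions, not code from the lecture: it assumes NumPy, a feature matrix X whose first column is all ones so that its coefficient plays the role of W0, and labels y coded as 0/1. Working with the log-likelihood instead of the raw product is standard, since it has the same maximizer and avoids multiplying many tiny probabilities.

```python
import numpy as np

def sigmoid(score):
    # P(y = 1 | x, w) under the logistic regression model
    return 1.0 / (1.0 + np.exp(-score))

def log_likelihood(X, y, w):
    # X: (n, d) feature matrix; first column all ones, so w[0] is W0 (assumption)
    # y: (n,) true labels coded as 0/1 (assumption)
    # w: (d,) coefficient vector, e.g. d = 3 for (W0, W1, W2)
    scores = X @ w
    # Log of the product over data points of P(true label | input, w):
    # each term simplifies to y_i * score_i - log(1 + exp(score_i)).
    return np.sum(y * scores - np.log1p(np.exp(scores)))
```

You could compare the competing lines in the picture by evaluating log_likelihood at each of their coefficient vectors and keeping the one with the largest value, but with infinitely many candidates we need gradient ascent rather than enumeration.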
And you slowly climb up the hill in order to find the optimum, the top of the hill here, which is going to be our famous W hat. That might say that the weight for awesome is probably going to be a positive number, so maybe somewhere like this, say 0.5, and the weight for awful is maybe -1.

Now, in this plot I've only shown two of the coordinates, W1 and W2. I didn't show W0 because it's really hard to plot in four-dimensional space, so I'm just showing you three out of those four dimensions.

Now, let's discuss the gradient ascent algorithm that goes ahead and does that.
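As a preview of that discussion, here is one way the hill-climbing loop could look, continuing the NumPy sketch from above. Stepping in the direction of the gradient of the log-likelihood is the standard gradient ascent update for logistic regression; the step size and iteration count below are illustrative placeholders I chose, not values from the lecture.

```python
def gradient_ascent(X, y, step_size=1e-4, num_iterations=500):
    # Start from some point in parameter space (here, all zeros).
    w = np.zeros(X.shape[1])
    for _ in range(num_iterations):
        # indicator(y_i = 1) minus the currently predicted probability
        errors = y - sigmoid(X @ w)
        # Gradient of the log-likelihood with respect to w
        gradient = X.T @ errors
        # Take a small step uphill
        w = w + step_size * gradient
    return w  # our estimate of the famous W hat
```

With two word counts plus the intercept, the returned vector corresponds to (W0, W1, W2), and each iteration moves the point in the plot a little further up the likelihood hill.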