[MUSIC] Very good. We've now seen the simple gradient ascent algorithm for logistic regression: how we update the parameters, how we implement it, and we talked a little bit about how to set that step size parameter, the [INAUDIBLE] parameter, and what impact it has on the progress of our algorithm. Now I'm going to take a little bit of time to show you how to derive the derivative of the likelihood function for logistic regression, how that gradient is computed. The material we're going to talk about here is quite mathematical. This is really PhD-level material, and it can be a little annoying, but for some folks it might be exciting to go through every detail. For others, this could be something that you skip, and it doesn't change anything. Up to you, you've been warned, but it's there for when you want to learn more about it. We're going to jump into deriving the gradient for the likelihood function for logistic regression. Again, PhD-level material. Here we go.

As we said, our goal is to pick the coefficients w to maximize the likelihood function, and that is the product over our data points of the probability of yi given xi and w. Now, it turns out that all the math we need to do becomes quite a lot simpler if you take the log of that likelihood function. I'm going to call that ll(w), and this is the natural log, ln, of that product. It turns out that for most of machine learning, especially for derivations like this, you often take the log, and it usually makes your math a lot simpler. Logs are your friends.

Let me do a quick review of the natural log function. Here's the function I'm showing: on the x-axis you have some value z, and on the y-axis this is what the log of z looks like. At zero it would be minus infinity; it grows quickly at first and then much, much more slowly later, but this is what the log does. The reason that it's useful is that the big product we had actually becomes a sum. In general, the log of a times b is equal to the log of a plus the log of b. Similarly, the log of a over b is the log of a minus the log of b.
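To make that product-to-sum step concrete, here is the identity being described, written out as a minimal sketch; the number of data points N is notation I'm assuming here rather than something stated in the transcript:

\ell\ell(w) \;=\; \ln \prod_{i=1}^{N} P(y_i \mid x_i, w) \;=\; \sum_{i=1}^{N} \ln P(y_i \mid x_i, w)

A sum like this is much easier to differentiate than a product, since the gradient of a sum is just the sum of the per-data-point gradients, which is what makes the gradient derivation that follows tractable.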
Here's a side note. I remember, in high school I think, I learned about the log function, and that the log of a times b is log a plus log b, and I thought, my god, this is the most useless thing I've ever seen. Why are they spending time teaching it? It really took about six years before I actually saw that it was useful for something, and it's actually extremely useful for machine learning. So, a funny side note. But anyway, the log has some very interesting properties. The other property it has is that if you take a function f and compute its maximum, taking the log doesn't change anything.

If we denote w hat here to be, we're going to call it, the arg max over w of f(w), this notation just means the w that makes f(w) largest. And if you were to take the log of that function, let's call w hat underscore ln the thing that maximizes the log of f(w), so w hat ln = arg max over w of log(f(w)). Because log is what's called a positive monotonic function, this transformation doesn't change the optimum. It turns out that w hat is going to be equal to w hat ln. So what we did in the previous slide, just taking the log of the likelihood function, still keeps the optimum exactly in the same place, and it's going to make your math quite a bit easier. [MUSIC]
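As a quick numerical illustration of that last point, here is a minimal sketch, assuming a hypothetical one-dimensional grid of w values and a hypothetical strictly positive function f standing in for a likelihood; it just checks that taking the log leaves the arg max unchanged:

import numpy as np

# Hypothetical 1-D grid of coefficient values w (a stand-in for the real parameter space).
w = np.linspace(-3.0, 3.0, 1001)

# Hypothetical strictly positive "likelihood-like" function f(w), peaked near w = 1.
f = np.exp(-(w - 1.0) ** 2) + 0.5

# Because ln is monotonically increasing, the index of the maximum is the same either way.
assert np.argmax(f) == np.argmax(np.log(f))

print("w_hat    =", w[np.argmax(f)])          # maximizer of f
print("w_hat_ln =", w[np.argmax(np.log(f))])  # maximizer of ln f(w) (identical)

The same monotonicity argument is exactly why maximizing ll(w) gives the same coefficients w hat as maximizing the likelihood itself.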