[MUSIC]

We'll start with a very quick review of gradient ascent. In the regression course, Emily went into quite a lot of detail explaining the gradient ascent algorithm, where it comes from, and the details. I really recommend you go back to that material if you want to remind yourself of where it all comes from. I'm just going to do a very quick review here so that we remember the fundamentals.

You can think about the gradient ascent algorithm as a kind of hill climbing algorithm. If you look at the picture on the left here, where you only have one parameter w, you can imagine starting at some point, let's say w t at iteration t, and then moving a little bit uphill to the next parameter, w t+1. The amount you move from one to the other has to do with this term over here, which is the derivative of our likelihood function with respect to the parameter w, and it's computed at the current parameter w t. Now, remember we have this little extra parameter eta, which we call the step size. A little later in the module, we're going to discuss how that step size actually gets picked and what effect it has. But imagine it's given, so you're going to move by eta times the derivative. And so you move here, then you move a little bit more, and you keep moving uphill, and eventually you get to the top of the hill, to the optimum, which we're going to call w star for right now. And this is our goal, to get to that very top of the hill.

Now that we've seen hill climbing in the abstract, stepping up until we hit the top here, which is w star, we can talk about when we know we're done in this iterative algorithm: when have we converged? You might remember that for a concave function, which is what we're dealing with today, and we talked in quite a bit of detail about what that means, the derivative of the likelihood with respect to w is equal to zero at the optimum, because the curve flattens out there. Now, if we could get to the point where the derivative is exactly equal to zero, we would be absolutely done. However, we're never going to get there exactly, so we're going to stop when the derivative is small enough.
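To make that one-dimensional update concrete, here is a minimal sketch of the hill-climbing loop in Python. The toy likelihood, starting point, step size, and tolerance below are made up purely for illustration; they are not the course's actual settings.

    # Minimal 1-D gradient ascent sketch (toy example for illustration only).
    def gradient_ascent_1d(dl_dw, w0=0.0, eta=0.1, epsilon=1e-3, max_iter=10000):
        """Repeat w(t+1) = w(t) + eta * dl/dw, with dl/dw evaluated at w(t)."""
        w = w0
        for _ in range(max_iter):
            slope = dl_dw(w)           # derivative of the likelihood at the current w
            if abs(slope) < epsilon:   # close enough to the top of the hill
                break
            w = w + eta * slope        # move a little bit uphill
        return w

    # Toy concave "likelihood" l(w) = -(w - 3)^2, so dl/dw = -2*(w - 3) and w* = 3.
    w_hat = gradient_ascent_1d(lambda w: -2.0 * (w - 3.0))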
So we look at the derivative with respect to the parameter w, computed at the current parameter w t, and we stop when that derivative is smaller in magnitude than some tolerance parameter epsilon. And, you know, epsilon is set to some small number, 0.001 or something like that. It depends on your problem, and you'll explore this in your homework. You keep going; you won't get exactly to the top, but you're going to get pretty close to it, and that will be your w hat. So we'll continue going up and up and up until our derivative is sufficiently small.

Now, that was for a one-dimensional space, but we're really in a high-dimensional space, because we can have thousands of coefficients we're trying to fit. So, instead of computing a one-dimensional derivative, we need what's called a gradient. And the gradient is just a stacking of the partial derivative of l with respect to the first parameter, w0, the partial derivative of l with respect to the second parameter, w1, all the way to the partial derivative of l with respect to the last parameter, wD. So this is a D+1 dimensional vector when you have D features. And in our case here, if we start from this point and the derivative is taken this way, the gradient is going to be something like this, maybe. That's what the gradient corresponds to. And then we're just going to follow that gradient all the way to the top of the hill.

In a little bit more detail, more pictorially, we're going to start from some point. There are mostly just heuristics for where to start, but let's say we start over here, and we're just going to follow the gradient. On the right side I'm showing you what's called a contour plot, which corresponds to the level sets, the contours, of this surface. So if I start on the left side over here, it's kind of like starting over here. I take a gradient step, and I keep following the gradient until I get to the optimum here, which is going to be our w star, and that corresponds to climbing this hill until I get to the top, which again is our w star. And again, Emily went into quite a bit of detail on contour plots and how they relate to that 3D plot. But that's our goal.
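As a small illustration of that stacking, here is a sketch of a gradient as a (D+1)-dimensional vector of partial derivatives, using a made-up concave quadratic as the likelihood; the numbers are arbitrary and only show the shape of the object.

    import numpy as np

    # Toy concave "likelihood" over D+1 coefficients: l(w) = -||w - w_opt||^2.
    w_opt = np.array([1.0, -2.0, 0.5])   # made-up optimum; D = 2, so D+1 = 3 coefficients

    def gradient(w):
        # One partial derivative per coefficient w_0, w_1, ..., w_D, stacked into a vector:
        # dl/dw_j = -2 * (w_j - w_opt_j)
        return -2.0 * (w - w_opt)

    w = np.zeros_like(w_opt)
    print(gradient(w))        # a (D+1)-dimensional vector pointing uphill from w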
So finally, just as a reminder, here's what the gradient ascent algorithm looks like. You start from some point, w 0, and so you might say w = 0, or random parameters, or something else. And you just keep following the gradient until the magnitude of the gradient is sufficiently small. So that's our algorithm; that's what's going to take us to the optimum. It's summarized over here, and the eta here is our famous step size. Great, so now we're done with our review of gradient ascent.

[MUSIC]
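Putting those pieces together, here is a minimal sketch of the full gradient ascent loop just described: start from some w 0, keep stepping by eta times the gradient, and stop once the gradient's magnitude drops below a tolerance epsilon. The step size, tolerance, and toy gradient (reused from the earlier made-up quadratic) are assumptions for illustration only.

    import numpy as np

    def gradient_ascent(grad, w0, eta=0.1, epsilon=1e-3, max_iter=100000):
        """Repeat w <- w + eta * grad(w) until ||grad(w)|| < epsilon."""
        w = np.asarray(w0, dtype=float)
        for _ in range(max_iter):
            g = grad(w)
            if np.linalg.norm(g) < epsilon:   # magnitude of the gradient is small enough
                break
            w = w + eta * g                   # step uphill along the gradient
        return w

    # Using the toy concave quadratic from before, whose optimum is w_opt:
    w_opt = np.array([1.0, -2.0, 0.5])
    w_hat = gradient_ascent(lambda w: -2.0 * (w - w_opt), w0=np.zeros(3))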