We've previously defined the cost function J. In this video I want to tell you about an algorithm called gradient descent for minimizing the cost function J. It turns out gradient descent is a more general algorithm, and is used not only in linear regression; it's actually used all over the place in machine learning. And later in the class we'll use gradient descent to minimize other functions as well, not just the cost function J for linear regression. So in this video, I'm going to talk about gradient descent for minimizing some arbitrary function J, and then in later videos we'll take this algorithm and apply it specifically to the cost function J that we had defined for linear regression.

So here's the problem setup. We're going to assume that we have some function J of (theta0, theta1). Maybe it's a cost function from linear regression, maybe it's some other function we want to minimize, and we want to come up with an algorithm for minimizing it as a function of theta0 and theta1. Just as an aside, it turns out that gradient descent actually applies to more general functions. So imagine you have a function J of (theta0, theta1, theta2, up to some theta n), and you want to minimize over (theta0 up to theta n) this J of (theta0 up to theta n). It turns out gradient descent is an algorithm for solving this more general problem, but for the sake of brevity, for the sake of succinctness of notation, I'm just going to use two parameters throughout the rest of this video.

Here's the idea for gradient descent. What we're going to do is start off with some initial guesses for theta0 and theta1. It doesn't really matter what they are, but a common choice is to set theta0 to 0 and theta1 to 0, just initialize them to 0. What we're going to do in gradient descent is keep changing theta0 and theta1 a little bit to try to reduce J of (theta0, theta1), until hopefully we wind up at a minimum, or maybe a local minimum.
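To make "some function J" a bit more concrete, here is a small sketch of the squared-error cost from the earlier linear regression videos. This is my own illustration rather than anything shown in this video, and the toy data values are made up.

```python
import numpy as np

def compute_cost(theta0, theta1, x, y):
    """Squared-error cost J(theta0, theta1) for linear regression:
    J = 1/(2m) * sum((theta0 + theta1*x_i - y_i)^2)."""
    m = len(y)
    predictions = theta0 + theta1 * x          # hypothesis h(x) for every training example
    return np.sum((predictions - y) ** 2) / (2 * m)

# Hypothetical toy data, just for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.5, 2.5, 3.5, 4.5])
print(compute_cost(0.0, 0.0, x, y))   # cost at the common initialization theta0 = theta1 = 0
```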
So let's see some pictures of what gradient descent does. Let's say I'm trying to minimize this function. Notice the axes: theta0 and theta1 are on the horizontal axes, and J is on the vertical axis, so the height of the surface shows J, and we want to minimize this function. So we're going to start off with (theta0, theta1) at some point. Imagine picking some value for (theta0, theta1); that corresponds to starting at some point on the surface of this function. So whatever value of (theta0, theta1) you pick gives you some point here. I initialized them to 0, but sometimes you initialize them to other values as well.

Now, I want you to imagine that this figure shows a hill. Imagine this is the landscape of some grassy park with two hills like so, and I want you to imagine that you are physically standing at that point on the hill, on this little red hill in your park. In gradient descent, what we're going to do is spin 360 degrees around, look all around us, and ask, "If I were to take a little baby step in some direction, and I want to go downhill as quickly as possible, what direction do I take that little baby step in if I want to go down, if I sort of want to physically walk down this hill as rapidly as possible?" It turns out that if you're standing at that point on the hill and you look all around, you find that the best direction to take a little step downhill is roughly that direction. Okay. And now you're at this new point on your hill. You're going to, again, look all around and ask, "What direction should I step in to take a little baby step downhill?" You take a step in that direction, and then you keep going. From this new point, you look around, decide what direction will take you downhill most quickly, take another step, another step, and so on, until you converge to this local minimum down here.

Gradient descent has an interesting property. The first time we ran gradient descent, we started at this point over here. Now imagine we had initialized gradient descent just a couple of steps to the right, at that point on the upper right. If you were to repeat this process, standing at that point, looking all around, taking a little step in the direction of steepest descent, then looking around, taking another step, and so on, then having started just a couple of steps to the right, gradient descent would have taken you to this second local optimum over on the right. So if you had started at the first point, you would have wound up at this local optimum, but if you had started at a slightly different location, you would have wound up at a very different local optimum. And this is a property of gradient descent that we'll say a little bit more about later.
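As a rough numerical illustration of this sensitivity to the starting point, here is a sketch of my own, not from the lecture: a one-dimensional function with two valleys, where repeatedly stepping a little bit downhill from two different starting values ends up in two different local minima.

```python
def f(t):
    # A made-up one-dimensional "landscape" with two local minima.
    return t ** 4 - 3 * t ** 2 + t

def df(t):
    # Its derivative, i.e. the local slope of the landscape.
    return 4 * t ** 3 - 6 * t + 1

def descend(t, alpha=0.01, steps=2000):
    # Keep taking small steps in the downhill direction.
    for _ in range(steps):
        t = t - alpha * df(t)
    return t

print(descend(-2.0))  # ends up near one local minimum (around t = -1.3)
print(descend(+2.0))  # ends up near the other local minimum (around t = 1.1)
```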
So that's the intuition in pictures. Let's look at the math. This is the definition of the gradient descent algorithm: we're going to repeatedly do this, until convergence. We update the parameter theta j by taking theta j and subtracting from it alpha times this derivative term, that is, theta j := theta j - alpha * (d/d theta j) J(theta0, theta1).

So let's see, there are a lot of details in this equation, so let me unpack some of it. First, this notation here, colon equals: we're going to use := to denote assignment, so it's the assignment operator. Concretely, if I write A := B, what this means on a computer is: take the value in B and use it to overwrite whatever value A has. So this means we set A to be equal to the value of B; it's assignment. I can also write A := A + 1; this means take A and increase its value by one. Whereas in contrast, if I use the equals sign and write A = B, then this is a truth assertion. So if I write A = B, I'm asserting that the value of A equals the value of B. The former is a computer operation, where you set the value of A; the latter is an assertion, I'm just making a claim that the values of A and B are the same. And so, whereas I can write A := A + 1, meaning increment A by 1, hopefully I won't ever write A = A + 1, because that's just wrong: A and A + 1 can never be equal.

So that's the first part of the definition. This alpha here is a number called the learning rate, and what alpha does is basically control how big a step we take downhill with gradient descent. If alpha is very large, that corresponds to a very aggressive gradient descent procedure, where we're trying to take huge steps downhill; and if alpha is very small, then we're taking little, little baby steps downhill. I'll come back and say more about this later, about how to set alpha and so on.

And finally, this term here, that's the derivative term. I don't want to talk about it right now, but later I will derive this derivative term and tell you exactly what it is. Some of you will be more familiar with calculus than others, but even if you aren't familiar with calculus, don't worry about it; I'll tell you what you need to know about this term.
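As a quick note on how that notation maps onto code: in Python, for example, a plain equals sign is the assignment operator playing the role of the lecture's :=, and == is the truth assertion. A tiny sketch of my own, not from the lecture:

```python
a = 5              # assignment: overwrite whatever value a had with 5 (the lecture's a := 5)
a = a + 1          # also assignment: take a and increase its value by one
print(a == 6)      # '==' is the truth assertion; this prints True
print(a == a + 1)  # a can never be equal to a + 1, so this prints False
```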
Now there's one more subtlety about gradient descent, which is that in gradient descent we're going to update theta0 and theta1. This update is performed for j = 0 and for j = 1, so you're going to update theta0 and update theta1. And the subtlety of how you implement gradient descent is that, for this update equation, you want to simultaneously update theta0 and theta1. What I mean by that is that in this equation, we're going to update theta0 := theta0 - something, and update theta1 := theta1 - something. The way to implement this is: compute the right hand side for both theta0 and theta1, and then simultaneously, at the same time, update theta0 and theta1.

So let me say what I mean by that. This is a correct implementation of gradient descent, meaning simultaneous updates: I set temp0 equal to that and temp1 equal to that, so basically compute the right hand sides; and then, having computed the right hand sides and stored them in temp0 and temp1, I update theta0 and theta1 simultaneously, because that's the correct implementation.

In contrast, here's an incorrect implementation that does not do a simultaneous update. In this incorrect implementation, we compute temp0 and then update theta0, and only then do we compute temp1 and update theta1. The difference between this and the correct implementation is that, if you look at this step down here, by this time you've already updated theta0, so you would be using the new value of theta0 to compute the derivative term, and that gives you a different value of temp1 than the correct version, because you've now plugged the new value of theta0 into this equation. And so this second version is not a correct implementation of gradient descent.

I'm not going to say too much about why you need to do the simultaneous updates; it turns out that the way gradient descent is usually implemented, which we'll say more about later, it's actually more natural to implement the simultaneous update. And when people talk about gradient descent, they always mean the simultaneous update. If you implement the non-simultaneous update, it turns out it will probably work anyway, but that algorithm is not what people refer to as gradient descent; it's some other algorithm with different properties, and for various reasons it can behave in slightly stranger ways. So what you should do is really implement the simultaneous update of gradient descent.
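Here is a rough sketch, my own rather than the lecture's slide, of the correct simultaneous update next to the incorrect sequential one. The derivative term is approximated numerically just so the example runs on its own; the actual derivative for the linear regression cost is derived in a later video, and the simple bowl-shaped J below is made up for illustration.

```python
alpha = 0.1  # learning rate

def J(theta0, theta1):
    # A made-up, bowl-shaped cost so the example is self-contained.
    return (theta0 - 1.0) ** 2 + (theta1 - 2.0) ** 2

def dJ_dtheta0(theta0, theta1, eps=1e-6):
    # Numerical stand-in for the partial derivative of J with respect to theta0.
    return (J(theta0 + eps, theta1) - J(theta0 - eps, theta1)) / (2 * eps)

def dJ_dtheta1(theta0, theta1, eps=1e-6):
    # Numerical stand-in for the partial derivative of J with respect to theta1.
    return (J(theta0, theta1 + eps) - J(theta0, theta1 - eps)) / (2 * eps)

# Correct: simultaneous update. Both right hand sides use the OLD theta0 and theta1.
theta0, theta1 = 0.0, 0.0
for _ in range(200):
    temp0 = theta0 - alpha * dJ_dtheta0(theta0, theta1)
    temp1 = theta1 - alpha * dJ_dtheta1(theta0, theta1)
    theta0, theta1 = temp0, temp1
print(theta0, theta1)   # converges close to the minimum at (1, 2)

# Incorrect (not gradient descent): theta0 is overwritten before the second
# derivative is computed, so that derivative sees the NEW theta0.
theta0, theta1 = 0.0, 0.0
for _ in range(200):
    theta0 = theta0 - alpha * dJ_dtheta0(theta0, theta1)
    theta1 = theta1 - alpha * dJ_dtheta1(theta0, theta1)
```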
So that's the outline of the gradient descent algorithm. In the next video, we're going to go into the details of the derivative term, which I wrote out but didn't really define. If you've taken a calculus class before and you're familiar with partial derivatives and derivatives, it turns out that's exactly what this derivative term is. But in case you aren't familiar with calculus, don't worry about it; the next video will give you the intuitions and tell you everything you need to know to compute that derivative term, even if you haven't seen calculus, or even if you haven't seen partial derivatives before. And with that, with the next video, hopefully we'll be able to give you all the intuitions you need to apply gradient descent.