Okay, well, like we discussed, the other approach we can take is gradient descent, where we're walking down this surface of residual sum of squares, trying to get to the minimum. Of course, we might overshoot it and go back and forth, but that's the general idea: we're doing an iterative procedure.

And in this case it's useful to reinterpret the gradient of the residual sum of squares that we computed previously. So this is what we've been working with. But what I want to point out here is that this term, y_i, is our actual house sales observation. And what is this term here? Well, it's the predicted value if we use w0 and w1 to form that prediction. So I'll call it the predicted value, ŷ_i, but I'm going to write it as a function of w0 and w1, to make it clear that it's the prediction I'm forming when using w0 and w1.

Okay, so what we can do is rewrite this residual sum of squares in terms of these predicted values. Then, when we go to write our gradient descent algorithm, what does the algorithm say? Well, we have: while not converged, we're going to take our previous vector of w0 at iteration t and w1 at iteration t, and what are we going to do? We're going to subtract. I'm going to write it up here: we're going to subtract eta times the gradient. Maybe I'll write it in two steps. So we're subtracting eta times the gradient, and what's the gradient? The gradient is minus two times the sum over i = 1 to N of y_i minus ŷ_i(w0 at iteration t, w1 at iteration t), those being the values I'm using to predict my observation i. And likewise for the second component, for w1, but in this case I'm going to have to multiply by x_i, because the gradient is a little bit different there: it's y_i minus ŷ_i(w0 at iteration t, w1 at iteration t), multiplied by x_i. And that is my update to form my next estimate of w0 and w1.

Okay, let me just quickly rewrite this in the way I was going to before: in both components of this gradient vector I have this −2 term, and out front I have a minus eta. So I'm going to bring that −2 out; I'll just erase it here.
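Written out after that step (a sketch of the on-board update, under the assumption that the prediction is ŷ_i(w0, w1) = w0 + w1·x_i, as set up earlier in the course), the update is:

w0^(t+1) = w0^(t) + 2·eta · Σ_{i=1..N} ( y_i − ŷ_i(w0^(t), w1^(t)) )
w1^(t+1) = w1^(t) + 2·eta · Σ_{i=1..N} ( y_i − ŷ_i(w0^(t), w1^(t)) ) · x_i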
And rewrite this: the minus two times the minus eta is going to turn into a plus sign. So I'll just write that explicitly; this plus came from minus two times minus eta. Okay. So we are doing gradient descent. Even though you see a plus sign, we're still doing gradient descent; it's just that our gradient had a negative sign, and that made it become a positive sign here, okay?

But I want it in this form to provide a little bit of intuition. Because what happens if, overall, we just tend to be underestimating our values y? So if, overall, our predictions ŷ_i are under-predicting, then the sum of y_i minus ŷ_i is going to be positive, because we're saying that ŷ_i is, in general, below the true value y_i. So this sum is going to be positive. And what's going to happen? Well, this term here is positive, we're multiplying it by a positive step size, and adding that to our w0. So w0 is going to increase. And that makes sense, because we have some current estimate of our regression fit, but if generally we're under-predicting our observations, that probably means the line is too low, so we want to shift it up. And what does that mean? That means increasing w0.

So there's a lot of intuition in this formula for what's going on in this gradient descent algorithm. And that's just talking about the first term, w0; then there's this second term, w1, which is the slope of the line. And in this case there's a similar intuition, so I'll say: similar intuition for w1, but we need to multiply by this x_i, accounting for the fact that this is a slope term.

Okay. So that's our gradient descent algorithm for minimizing our residual sum of squares, where, when we assess convergence, what we're going to output is ŵ0 and ŵ1; that's going to be our fitted regression line. And this is an alternative approach to setting the gradient equal to zero and solving for ŵ0 and ŵ1 in that way.
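To make the whole loop concrete, here is a minimal sketch in Python of the procedure described above, assuming the simple model ŷ_i = w0 + w1·x_i and a gradient-magnitude check as the "while not converged" test. The function name, starting point, step size, and tolerance are illustrative choices, not taken from the course.

import numpy as np

def simple_regression_gd(x, y, eta=1e-4, tol=1e-3, max_iter=100000):
    # Minimal sketch of the gradient descent loop from the lecture (names and
    # defaults are made up for illustration, not the course's own code).
    # Model: y_hat_i = w0 + w1 * x_i, cost = residual sum of squares.
    w0, w1 = 0.0, 0.0                      # arbitrary starting point
    for _ in range(max_iter):
        y_hat = w0 + w1 * x                # predictions from current w0, w1
        residuals = y - y_hat              # y_i - y_hat_i
        # Gradient of RSS is [-2 * sum(residuals), -2 * sum(residuals * x)];
        # subtracting eta times it is where the "+ 2 * eta" updates come from.
        grad_w0 = -2.0 * residuals.sum()
        grad_w1 = -2.0 * (residuals * x).sum()
        # "While not converged": stop once the gradient magnitude is small.
        if np.sqrt(grad_w0**2 + grad_w1**2) < tol:
            break
        w0 = w0 - eta * grad_w0
        w1 = w1 - eta * grad_w1
    return w0, w1                          # w_hat_0, w_hat_1: the fitted line

# Toy usage on a noisy line y = 2 + 3x, just to see it recover the coefficients.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2.0 + 3.0 * x + rng.normal(scale=0.5, size=100)
w0_hat, w1_hat = simple_regression_gd(x, y)
print(w0_hat, w1_hat)                      # should come out close to 2 and 3

Note that eta has to be small relative to the scale of the sums over all N data points; if it is too large, the updates don't just overshoot and bounce back and forth, they diverge.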