Now, let's actually summarize the entire gradient descent algorithm for multiple regression, stepping very carefully through every step of this algorithm. In particular, the first thing we're gonna do is initialize all of our different parameters to zero at the first iteration. Or you could initialize them randomly, or you could do something a bit smarter. But let's just assume that they're all initialized to zero, and we're gonna start our iteration counter at one.

And then what we're doing is we're saying: while we're not converged. And what was the condition we talked about before, in our simple regression module, for not being converged? We said that while the magnitude of the gradient of our residual sum of squares is sufficiently large, larger than some tolerance epsilon, then we were going to keep going.

So what is the magnitude of the gradient of the residual sum of squares? Just to be very explicit, it's the square root of a sum of squares. So what are the elements of the gradient of the residual sum of squares? Well, it's a vector where every element is the partial derivative with respect to some parameter. I'm gonna refer to that as partial j, okay? So when I take the magnitude of the vector, I multiply the vector by its transpose and take the square root. That's equivalent to saying I'm gonna sum up the partial derivative with respect to the first feature, squared, plus all the terms up to the partial derivative with respect to the capital-Dth feature, squared (sorry, the indexing really starts with zero), and then I take the square root.

So if the result of this is greater than epsilon, then I'm gonna continue my gradient descent iterates. If it's less than epsilon, then I'm gonna stop.

But let's talk about what the actual iterates are. Well, for every feature in my multiple regression model, the first thing I'm going to do is calculate the partial derivative with respect to the jth feature. And I'm going to store that, because it's going to be useful both in taking the gradient step and in monitoring convergence, as I wrote here.
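As a concrete illustration of the check just described, here is a minimal NumPy sketch of computing every partial of the residual sum of squares and their combined magnitude. The function names, the matrix H (whose row i stacks the feature values h_0(x_i) through h_D(x_i)), and the explicit form of the partials (minus two times the sum over data points of the jth feature value times the residual) are assumptions of this sketch rather than notation taken from the slides.

```python
import numpy as np

def rss_gradient(H, y, w):
    """Gradient of the residual sum of squares.

    H : (N, D+1) matrix; row i holds the feature values h_0(x_i), ..., h_D(x_i)
    y : (N,) vector of observed outputs
    w : (D+1,) vector of current coefficients
    Entry j of the result is the partial derivative of RSS with respect to w_j.
    """
    predictions = H @ w            # predicted outputs y_hat_i
    errors = y - predictions       # residuals y_i - y_hat_i
    return -2.0 * (H.T @ errors)   # assumed form: -2 * sum_i h_j(x_i) * (y_i - y_hat_i)

def gradient_magnitude(H, y, w):
    """Square root of the sum of squared partials."""
    g = rss_gradient(H, y, w)
    return np.sqrt(g @ g)          # same as multiplying the vector by its transpose, then square root
```

The magnitude computed this way is exactly the quantity compared against epsilon in the while condition.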
So this jth partial, we derived it on the previous slide, and it has the form we saw there. My gradient step then takes that jth coefficient at time t and subtracts my step size times that partial derivative. And then once I cycle through all the features in my model, I'm gonna increment this t counter, and I'm gonna check whether I've achieved convergence or not. If not, I'm gonna loop through again, and I'm gonna do this until this condition is met, until the magnitude of my gradient is less than epsilon.

Okay, so I wanna take a few moments to talk about this gradient descent algorithm, because we presented it specifically in the context of multiple regression, and also for the simple regression case. But this algorithm is really, really important. It's probably the most widely used machine learning algorithm out there, and we're gonna see it when we talk about classification, all the way to talking about deep learning. So even though we presented this in the context of multiple regression, this is a really, really useful algorithm, actually an extremely useful algorithm, as the title of this slide shows.
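Putting the whole loop together, the following is a rough end-to-end sketch of the algorithm summarized in this video, under the same assumptions as the sketch above: coefficients start at zero, the iteration counter starts at one, and updates continue while the gradient magnitude exceeds the tolerance epsilon. The function name, argument names, and the max_iter safeguard are illustrative choices, not notation from the lecture.

```python
import numpy as np

def multiple_regression_gradient_descent(H, y, step_size, tolerance, max_iter=100000):
    """Gradient descent for multiple regression, as outlined in the lecture.

    Loops until the magnitude of the RSS gradient drops below `tolerance`
    (the epsilon in the lecture) or until `max_iter` iterations have run.
    """
    w = np.zeros(H.shape[1])                   # initialize all coefficients w_0, ..., w_D to zero
    for t in range(1, max_iter + 1):           # t is the iteration counter, starting at one
        gradient = -2.0 * (H.T @ (y - H @ w))  # all partials of RSS at once
        if np.sqrt(gradient @ gradient) < tolerance:
            break                              # converged: gradient magnitude < epsilon
        w = w - step_size * gradient           # w_j gets w_j minus step size times partial j
    return w
```

The convergence test runs before each update so that, once the gradient magnitude falls below epsilon, the current coefficients are returned without taking another step.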