Congratulations on getting through this really challenging course. We covered a lot of ground, discussing many ideas relating to regression, as well as more general ideas that are foundational to machine learning. So let's recap where we've been, and then look ahead to what's next in the specialization.

Okay, I know, I love to start slides with "okay," but that's how it is. So, okay, what have we learned in this regression course?

Well, in the first module we talked about something called simple regression. Let's just remember what simple regression was. This is where we assume that we have just a single input, and we fit a very simple line as the relationship between that input and the output. Then we discussed the cost of using a given line, and for this we defined something called the residual sum of squares. Once we've defined the residual sum of squares, it lets us assess how well different lines fit, and with that we can choose the line that best fits our specific training data set.

In particular, to find the best-fitting line, we talked about our objective being an objective over two variables, which represent our slope and our intercept. The specific algorithm we went through in this module was gradient descent, which we discussed is an iterative algorithm that moves in the direction of the negative gradient and, for convex functions, converges to the optimum.
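To make that recap concrete, here is a minimal sketch (not from the lecture itself) of gradient descent on the residual sum of squares for the two-parameter, single-input case. The function name, step size, and toy data are all assumptions chosen for illustration.

```python
import numpy as np

def simple_regression_gradient_descent(x, y, step_size=1e-3, tolerance=1e-6, max_iterations=10000):
    """Fit y ~ intercept + slope * x by gradient descent on the residual sum of squares."""
    intercept, slope = 0.0, 0.0
    for _ in range(max_iterations):
        errors = (intercept + slope * x) - y       # prediction errors on the training data
        d_intercept = 2.0 * errors.sum()           # partial derivative of RSS w.r.t. the intercept
        d_slope = 2.0 * (errors * x).sum()         # partial derivative of RSS w.r.t. the slope
        intercept -= step_size * d_intercept       # step in the direction of the negative gradient
        slope -= step_size * d_slope
        if np.sqrt(d_intercept ** 2 + d_slope ** 2) < tolerance:
            break                                  # gradient is small enough: treat as converged
    return intercept, slope

# Toy usage: noisy observations of the line y = 1 + 2x
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 100)
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.1, 100)
print(simple_regression_gradient_descent(x, y))    # roughly (1.0, 2.0)
```

Because the RSS objective is convex in the slope and intercept, a small enough step size makes this procedure converge to the same optimum a closed-form solution would give.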
We then turned to multiple regression. Multiple regression allowed us to fit more complicated relationships between our single input and our output. We talked about polynomial regression, seasonality, and lots of different features of our single input that we could use in a multiple regression model. But then we talked more generically about incorporating different inputs, like, in our housing application, square feet, number of bathrooms, number of bedrooms, lot size, and year built, as well as features of these inputs. So, generically, we wrote our multiple regression model as y_i = w_0 h_0(x_i) + w_1 h_1(x_i) + ... + w_D h_D(x_i) + epsilon_i: a weighted collection of features h of our inputs x_i for the ith house, plus an epsilon term, which is the error representing the noise in our observations.

So, for this case of multiple regression, we have a more complicated relationship between a whole collection of features and our output. We talked about how to define the residual sum of squares, and then, in order to derive both a closed-form solution and a gradient descent algorithm, we talked about taking the gradient of that residual sum of squares.

It's pretty cool how, even when dealing with a large collection of features, we can derive a closed-form solution: given all of our data, we just have to compute w_hat = (H^T H)^{-1} H^T y, and that gives us our estimated set of coefficients for all of our features. But we also talked about the fact that this can be computationally intensive: the cost of the operations grows cubically in the number of features, and in addition the matrix H^T H that we have to invert might not be invertible. This is where ridge regression can be so useful. In cases where we might not have an invertible matrix, ridge regression makes a very simple modification to this closed-form solution that leads to a matrix that is always invertible, and thus lets us handle cases where we have lots and lots of features, even more features than we have observations.

Instead of this closed-form solution, which we mentioned can be very computationally intensive, we also talked about a gradient descent algorithm, an iterative procedure for solving this optimization objective. We can start at any point in our space, compute the gradient, and take steps. Remember, there was a step size we had to set, and it determined a lot of the properties of how quickly we converged to the optimum. But we showed that, in well-behaved cases, we converge to this optimal solution. And this idea of gradient descent, this optimization algorithm, is a very general-purpose tool that we're going to see again later in this specialization.
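To make the closed-form discussion concrete, here is a minimal sketch (again, not the course's own code) of the normal-equations solution and its ridge-regularized variant. The function name and toy data are assumptions, and for simplicity this sketch penalizes every coefficient, including the constant feature.

```python
import numpy as np

def regression_closed_form(H, y, l2_penalty=0.0):
    """Closed-form least-squares coefficients; l2_penalty > 0 gives the ridge variant.

    l2_penalty = 0:  w_hat = (H^T H)^{-1} H^T y
    l2_penalty > 0:  w_hat = (H^T H + lambda * I)^{-1} H^T y, whose matrix is always invertible
    """
    num_features = H.shape[1]
    A = H.T @ H + l2_penalty * np.eye(num_features)
    # Solving the linear system is more stable than forming the inverse explicitly,
    # but the cost still grows cubically in the number of features.
    return np.linalg.solve(A, H.T @ y)

# Toy usage: a constant feature plus two numeric features for 50 observations
rng = np.random.default_rng(1)
H = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])
y = H @ np.array([1.0, -2.0, 0.5]) + rng.normal(0.0, 0.1, 50)
print(regression_closed_form(H, y))                    # plain least squares
print(regression_closed_form(H, y, l2_penalty=1.0))    # ridge-regularized coefficients
```

The ridge call illustrates the point made above: adding lambda times the identity to H^T H keeps the system solvable even when there are more features than observations.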