This course consists of six different modules, and let's preview what these are about.

To start with, we're gonna talk about simple regression. Why is it called simple regression? Because we're gonna focus on having just a single input, for example the size of your house. When we think about defining the relationship between this single input and our output, the house sales price, we're just gonna look at fitting a very simple line to the data. That's why it's called simple regression.

One thing we're gonna need to do in order to discover the best-fitting line is define something called a goodness-of-fit metric that allows us to say that some lines fit better than other lines. According to this metric, we can then search over all possible lines to find the one that fits the data best.

In the context of this goodness-of-fit metric, we're then going to discuss some important optimization techniques that are much more general than regression itself. This is an example of one of those general concepts that we're going to see again, many times, in the specialization. The method we're going to explore here is something called gradient descent.

In the context of our regression problem, where we have just a simple line, the question is: what are the slope and intercept of the line that minimize this goodness-of-fit metric? The blue curve I'm showing here shows how good the fit is, where lower means a better fit, and we're gonna minimize over all possible combinations of intercept and slope to find the one that fits best. This gradient descent algorithm is an iterative algorithm that takes multiple steps and eventually converges to the optimal solution, the one that minimizes this metric measuring how well the line fits.

Then, once we've found this minimum, so we've found the best intercept and slope, we're gonna talk about how to interpret these estimated parameters, and then how to use them to form our predictions.
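The lecture stays at the level of pictures, but as a minimal sketch of the ideas above, here is gradient descent on the intercept and slope of a line, using residual sum of squares as the goodness-of-fit metric (one standard choice). The house data, step size, and iteration count are all made up for illustration.

```python
import numpy as np

# Toy data: house size (sq. ft.) vs. sale price. Made-up numbers.
size = np.array([1000., 1500., 1800., 2200., 2600., 3000.])
price = np.array([250e3, 320e3, 360e3, 430e3, 490e3, 540e3])

def rss(w0, w1):
    """Residual sum of squares: our goodness-of-fit metric (lower = better fit)."""
    return np.sum((price - (w0 + w1 * size)) ** 2)

w0, w1 = 0.0, 0.0   # intercept, slope
step = 1e-8         # step size: too large diverges, too small crawls
for _ in range(1000):
    resid = price - (w0 + w1 * size)
    w0 -= step * (-2 * np.sum(resid))          # d RSS / d intercept
    w1 -= step * (-2 * np.sum(resid * size))   # d RSS / d slope

print(f"intercept {w0:.2f}, slope {w1:.2f}, RSS {rss(w0, w1):.3e}")
print(f"predicted price for a 2000 sq. ft. house: {w0 + w1 * 2000:,.0f}")
```

With unscaled inputs like these, the slope converges quickly while the intercept moves very slowly; in practice you would rescale the features or run many more iterations.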
We're then gonna move from simple regression to multiple regression. Multiple regression allows us to fit more complicated relationships between our single input and an output, so more complicated than just a line, as well as to consider more inputs in our model. For our housing application, maybe in addition to a measure of the size of the house, we have attributes like how many bedrooms there are, how many bathrooms, what the lot size is, and what year the house was built. We might have a whole bunch of different attributes of the house that we wanna incorporate into forming our predictions, and that's what multiple regression allows us to do.

Having learned how to fit these models to data, we're then gonna turn to how to assess whether the fit is any good, for example for forming predictions. The fit shown here between house size and house price seems to be pretty good, but if we go to a much more complicated model, we see that it's doing really crazy things. If you look carefully, this fitted function is actually explaining our observed data, these circles, really well. But what it's not doing is generalizing to predictions for new houses that we haven't observed. At least, I wouldn't use this to form predictions of my house value, because I wouldn't really believe that this is the relationship between house size and price. So the question is, what's going wrong here? This is something that we call overfitting to the data, and we're gonna talk about it at great length in this module of the course.

In particular, we're gonna define three different measures of error: one called training error, one called test error, and one called our generalization or true error. The generalization error is actually the error we really care about, but we're gonna show that it's something we can't actually attain, so we have to use proxies instead. And we're gonna look at these measures of error as a function of model complexity to describe this notion of overfitting.
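Here's a small sketch of that training-versus-test picture, again with everything synthetic: polynomial fits of increasing degree stand in for "model complexity", numpy's polyfit stands in for the regression learner, and half the points are held out as a test set.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: a smooth true relationship plus noise.
x = np.sort(rng.uniform(0, 1, 30))
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)

# Hold out every other point as a test set.
train, test = np.arange(0, 30, 2), np.arange(1, 30, 2)

for degree in (1, 3, 9, 12):
    coefs = np.polyfit(x[train], y[train], degree)  # fit on training data only
    train_err = np.mean((np.polyval(coefs, x[train]) - y[train]) ** 2)
    test_err = np.mean((np.polyval(coefs, x[test]) - y[test]) ** 2)
    print(f"degree {degree:2d}: train error {train_err:.4f}, test error {test_err:.4f}")
```

Training error keeps falling as the model gets more complex, while test error, our attainable proxy for the true generalization error, eventually climbs back up. That U shape is the signature of overfitting.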
Then we're gonna discuss these ideas in the context of something called the bias-variance trade-off. What this describes is the fact that simple models tend to be really well behaved, but they're often just too simple to describe the relationship that we're interested in. On the other hand, really complex models are really flexible, so they have the potential to fit really intricate relationships that might describe pretty well what's happening with the data. Unfortunately, these really complex models can also have very strange behavior. That tension is what's called the bias-variance trade-off, and machine learning is all about exploring this trade-off.
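To make the trade-off concrete, here's one more synthetic sketch: refit a simple and a complex model on many freshly drawn datasets, so that the average fit's systematic error approximates the squared bias and the fit-to-fit spread approximates the variance.

```python
import numpy as np

rng = np.random.default_rng(1)

def true_f(x):
    return np.sin(2 * np.pi * x)    # the "true" relationship we pretend not to know

x_grid = np.linspace(0.05, 0.95, 50)

for degree in (1, 9):               # a simple model class vs. a complex one
    preds = []
    for _ in range(200):            # 200 independently drawn training sets
        x = rng.uniform(0, 1, 20)
        y = true_f(x) + rng.normal(scale=0.3, size=x.shape)
        preds.append(np.polyval(np.polyfit(x, y, degree), x_grid))
    preds = np.array(preds)

    bias2 = np.mean((preds.mean(axis=0) - true_f(x_grid)) ** 2)  # systematic error
    variance = np.mean(preds.var(axis=0))                        # spread across refits
    print(f"degree {degree}: bias^2 {bias2:.3f}, variance {variance:.3f}")
```

The degree-1 line misses the curvature (high bias, low variance), while the degree-9 fits track it on average but swing wildly from dataset to dataset (low bias, high variance).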