1
00:00:00,000 --> 00:00:05,399
In this video we'll define something
called the cost function. This will let us
2
00:00:05,399 --> 00:00:10,688
figure out how to fit the best possible
straight line to our data. In linear
3
00:00:10,688 --> 00:00:16,758
regression we have a training set like
that shown here. Remember our notation M
4
00:00:16,758 --> 00:00:21,972
was the number of training examples. So
maybe M=47. And the form of the
5
00:00:21,972 --> 00:00:27,731
hypothesis, which we use to make
predictions, is this linear function. To
6
00:00:27,731 --> 00:00:33,723
introduce a little bit more terminology,
these theta zero and theta one, right,
7
00:00:33,723 --> 00:00:39,759
these theta i's are what I call the
parameters of the model. What we're
8
00:00:39,759 --> 00:00:44,578
going to do in this video is talk about
how to go about choosing these two
9
00:00:44,578 --> 00:00:49,654
parameter values, theta zero and theta
one. With different choices of parameters
10
00:00:49,654 --> 00:00:54,408
theta zero and theta one we get different
hypotheses, different hypothesis
11
00:00:54,408 --> 00:00:59,355
functions. I know some of you will
probably be already familiar with what I'm
12
00:00:59,355 --> 00:01:04,367
going to do on this slide, but just to
review here are a few examples. If theta
13
00:01:04,367 --> 00:01:09,378
zero is 1.5 and theta one is 0, then
the hypothesis function will look like
14
00:01:09,378 --> 00:01:15,701
this. Right, because your hypothesis
function will be h(x) equals 1.5 plus
15
00:01:15,701 --> 00:01:22,645
0 times x which is this constant value
function, this is flat at 1.5. If
16
00:01:22,645 --> 00:01:29,332
theta zero equals 0 and theta one
equals 0.5, then the hypothesis will look
17
00:01:29,332 --> 00:01:35,536
like this. And it should pass through this
point (2, 1), since you now have h(x), or
18
00:01:35,536 --> 00:01:40,666
really some h_theta(x), but
sometimes I'll just omit theta for
19
00:01:40,666 --> 00:01:46,518
brevity. So, h(x) will be equal to just
0.5 times x which looks like that. And
20
00:01:46,518 --> 00:01:52,443
finally if theta zero equals 1 and theta
one equals 0.5 then we end up with the
21
00:01:52,443 --> 00:01:58,598
hypothesis that looks like this. Let's
see, it should pass through the (2, 2)
22
00:01:58,598 --> 00:02:04,468
point like so. And this is my new h(x)
or my new h_theta(x). All right? Well
23
00:02:04,468 --> 00:02:09,980
you remember that this is
h_theta(x), but as a shorthand
24
00:02:09,980 --> 00:02:16,584
sometimes I just write this as h(x).
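To collect the examples from this slide in one place, here is the hypothesis written out with the three parameter choices used above (the evaluations at x = 2 are just the points noted on the slide):

    h_\theta(x) = \theta_0 + \theta_1 x
    \theta_0 = 1.5,\ \theta_1 = 0   \implies h_\theta(x) = 1.5 \quad \text{(a flat line at 1.5)}
    \theta_0 = 0,\ \theta_1 = 0.5   \implies h_\theta(x) = 0.5x,\quad h_\theta(2) = 1 \quad \text{(passes through (2, 1))}
    \theta_0 = 1,\ \theta_1 = 0.5   \implies h_\theta(x) = 1 + 0.5x,\quad h_\theta(2) = 2 \quad \text{(passes through (2, 2))}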
25
00:02:16,584 --> 00:02:22,439
In linear regression we have a training set,
like maybe the one I've plotted here. What
we want to do is come up with values for
26
00:02:22,439 --> 00:02:28,295
the parameters theta zero and theta one,
so that the straight line we get out
27
00:02:28,295 --> 00:02:33,799
of this somehow fits the data well. Like
28
00:02:33,799 --> 00:02:39,756
maybe that line over there. So how do we
come up with values theta zero, theta one
29
00:02:39,756 --> 00:02:45,350
that correspond to a good fit to the
data? The idea is we're going to choose
30
00:02:45,350 --> 00:02:51,162
our parameters theta zero, theta one so
that h(x), meaning the value we predict
31
00:02:51,162 --> 00:02:56,330
on input x, is at least close to
the values y for the examples in our
32
00:02:56,330 --> 00:03:01,133
training set, for our training examples.
So, in our training set we're given a
33
00:03:01,133 --> 00:03:06,505
number of examples where we know the size
of the house x, and we know the actual price
34
00:03:06,505 --> 00:03:11,796
it sold for. So let's try to
choose values for the parameters so that
35
00:03:11,796 --> 00:03:17,302
at least in the training set, given the
x's in the training set, we make
36
00:03:17,302 --> 00:03:23,507
reasonably accurate predictions for the y
values. Let's formalize this. So in linear
37
00:03:23,507 --> 00:03:29,401
regression, what we're going to do is
solve a minimization
38
00:03:29,401 --> 00:03:38,787
problem. So I'm going to write minimize over theta
zero, theta one. And I want this to be
39
00:03:38,787 --> 00:03:44,379
small, right, I want the difference
between h(x) and y to be small. And one
40
00:03:44,379 --> 00:03:50,493
thing I'm going to do is try to minimize the
square difference between the output of
41
00:03:50,493 --> 00:03:56,159
the hypothesis and the actual price of the
house. Okay? So let's fill in some
42
00:03:56,159 --> 00:04:01,379
details. Remember that I was using the
notation (x(i), y(i)) to represent the
43
00:04:01,379 --> 00:04:07,418
ith training example. So what I
want really is to sum over my training
44
00:04:07,418 --> 00:04:13,202
set. Sum from i equals 1 to M of
the square difference between
45
00:04:13,202 --> 00:04:18,896
the prediction of my hypothesis
when it is given the size of house number
46
00:04:18,896 --> 00:04:24,380
i, and the actual price that
house number i will sell for. And I want to
47
00:04:24,380 --> 00:04:29,588
minimize this sum over my training set, sum
from i equals 1 through M, of the
48
00:04:29,588 --> 00:04:35,281
squared difference between the predicted
49
00:04:35,281 --> 00:04:41,091
price of the house and the price
that it will actually sell for. And just
50
00:04:41,091 --> 00:04:47,723
to remind you of our notation, M here is
the size of my training set, right,
51
00:04:47,723 --> 00:04:53,347
so the M there is my number of training
examples. Right? That hash sign is the
52
00:04:53,347 --> 00:04:59,045
abbreviation for "number" of training
examples. Okay?
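Written out in the notation used so far, the quantity being minimized at this point (before the scaling that comes next) is:

    \min_{\theta_0,\, \theta_1} \; \sum_{i=1}^{M} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2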
53
00:04:59,045 --> 00:05:04,888
And to make the math a little bit easier, I'm
going to actually look at 1
54
00:05:04,888 --> 00:05:09,578
over M times that. So we're going to try
to minimize my average error, which we're
55
00:05:09,578 --> 00:05:13,926
going to minimize as one over 2M times this sum.
Putting the constant one half in
56
00:05:13,926 --> 00:05:18,386
front just makes some of the math a
little easier. So minimizing one half of
57
00:05:18,386 --> 00:05:23,073
something, right, should give you the same
values of the parameters theta zero, theta
58
00:05:23,073 --> 00:05:27,647
one as minimizing the original function. And just
to make sure this equation is
59
00:05:27,647 --> 00:05:35,569
clear, right? This expression in here,
h_theta(x(i)), this is
60
00:05:35,569 --> 00:05:44,880
our usual hypothesis, right? That's equal to theta zero
plus theta one times x(i). And this notation,
61
00:05:44,880 --> 00:05:49,814
minimize over theta zero and theta one,
this means find me the values of theta
62
00:05:49,814 --> 00:05:54,369
zero and theta one that cause this
expression to be minimized. And this
63
00:05:54,369 --> 00:05:59,557
expression depends on theta zero and theta
one. Okay? So just to recap, we're posing
64
00:05:59,557 --> 00:06:04,382
this problem as find me the values of
theta zero and theta one so that the
65
00:06:04,575 --> 00:06:09,292
average, one over 2M times the
sum of squared errors between my
66
00:06:09,292 --> 00:06:14,590
predictions on the training set minus the
actual values of the houses on the
67
00:06:14,590 --> 00:06:19,694
training set is minimized. So this is
going to be my overall objective function
68
00:06:19,694 --> 00:06:25,127
for linear regression. And just to
rewrite this out a little bit more
69
00:06:25,127 --> 00:06:30,580
cleanly, what we usually do by convention
is define a cost function,
70
00:06:30,860 --> 00:06:38,965
which is going to be exactly this
formula that I have up here. And what I
71
00:06:38,965 --> 00:06:48,388
want to do is minimize over theta zero and
theta one my function J of theta zero
72
00:06:48,388 --> 00:06:57,428
comma theta one. Just to write this
out, this is my cost function.
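Putting the pieces of this video together, the cost function and the minimization problem can be written as:

    J(\theta_0, \theta_1) = \frac{1}{2M} \sum_{i=1}^{M} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2,
    \qquad h_\theta\!\left(x^{(i)}\right) = \theta_0 + \theta_1 x^{(i)},

    \text{with the goal } \min_{\theta_0,\, \theta_1} \; J(\theta_0, \theta_1).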
73
00:06:57,428 --> 00:07:06,943
So, this cost function is also called the squared
error function, or sometimes the
74
00:07:06,943 --> 00:07:14,461
squared error cost function. Now, why do we take
75
00:07:14,461 --> 00:07:19,006
the squares of the errors? It turns out
that the squared error cost function is a
76
00:07:19,006 --> 00:07:23,214
reasonable choice and will work well for
most problems, for most regression
77
00:07:23,214 --> 00:07:27,815
problems. There are other cost functions
that will work pretty well, but the squared
78
00:07:27,815 --> 00:07:32,473
error cost function is probably the most
commonly used one for regression problems.
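As a minimal sketch (not part of the lecture), the squared error cost function could be computed like this in Python; the function name and the toy data are illustrative only:

    # Squared error cost for univariate linear regression (illustrative sketch).
    # x and y hold the training inputs (house sizes) and targets (prices).
    def compute_cost(theta0, theta1, x, y):
        m = len(x)                          # number of training examples
        total = 0.0
        for xi, yi in zip(x, y):
            h = theta0 + theta1 * xi        # hypothesis h_theta(x)
            total += (h - yi) ** 2          # squared difference to the actual price
        return total / (2 * m)              # 1/(2M) times the sum of squared errors

    # Example usage: cost of the line h(x) = 0.5 * x on a tiny made-up data set.
    # compute_cost(0.0, 0.5, [1, 2, 3], [1, 2, 3])  -> about 0.583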
79
00:07:32,473 --> 00:07:36,793
Later in this class we'll also talk about
alternative cost functions as well, but this
80
00:07:36,793 --> 00:07:41,282
choice that we just made should be a
pretty reasonable thing to try for
81
00:07:41,282 --> 00:07:45,706
most linear regression problems. Okay. So
that's the cost function. So far we've
82
00:07:45,706 --> 00:07:50,899
just seen a mathematical definition of
this cost function. In case this
83
00:07:50,899 --> 00:07:55,973
function J of theta zero, theta one
seems a little bit abstract
84
00:07:55,973 --> 00:08:00,808
and you still don't have a good sense of
what it's doing, in the
85
00:08:00,808 --> 00:08:05,882
next couple of videos we're going to
go a little bit deeper into what the cost
86
00:08:05,882 --> 00:08:10,776
function J is doing, and try to give you
better intuition about what it's computing
87
00:08:10,776 --> 00:08:12,329
and why we want to use it.