1
00:00:02,338 --> 00:00:04,677
Our first learning algorithm will be
linear regression. In this video, you'll see
2
00:00:06,956 --> 00:00:09,234
what the model looks like and, more
importantly, you'll see what the overall
3
00:00:09,234 --> 00:00:14,801
process of supervised learning looks like. Let's
use, as a motivating example, the problem of predicting
4
00:00:14,801 --> 00:00:20,036
housing prices. We're going to use a data
set of housing prices from the city of
5
00:00:20,036 --> 00:00:25,205
Portland, Oregon. And here I'm going to
plot my data set of a number of houses
6
00:00:25,205 --> 00:00:30,833
of different sizes that were sold
for a range of different prices. Let's say
7
00:00:30,833 --> 00:00:35,872
that, given this data set, you have a
friend who's trying to sell a house, and
8
00:00:35,872 --> 00:00:41,238
let's say your friend's house is
1250 square feet, and you want to tell them
9
00:00:41,238 --> 00:00:46,459
how much they might be able to sell the
house for. Well, one thing you could do is
10
00:00:46,648 --> 00:00:53,039
fit a model. Maybe fit a straight line
to this data; it looks something like that, and based
11
00:00:53,039 --> 00:00:59,168
on that, maybe you could tell your friend
that they might be able to sell the
12
00:00:59,168 --> 00:01:03,575
house for around $220,000.
So this is an example of a
13
00:01:03,575 --> 00:01:08,834
supervised learning algorithm. And it's
supervised learning because we're given
14
00:01:08,834 --> 00:01:14,287
the, quote, "right answer" for each of
our examples. Namely, we're told the
15
00:01:14,287 --> 00:01:19,351
actual price that each of the
houses in our data
16
00:01:19,351 --> 00:01:24,441
set was sold for. Moreover, this is
an example of a regression problem, where
17
00:01:24,441 --> 00:01:29,545
the term regression refers to the fact
that we are predicting a real-valued output
18
00:01:29,545 --> 00:01:34,586
namely, the price. And just to remind you,
the other most common type of supervised
19
00:01:34,586 --> 00:01:39,006
learning problem is called the
classification problem, where we predict
20
00:01:39,006 --> 00:01:45,202
discrete-valued outputs such as if we are
looking at cancer tumors and trying to
21
00:01:45,202 --> 00:01:52,032
decide if a tumor is malignant or benign.
So that's a zero-one valued discrete output. More
22
00:01:52,032 --> 00:01:57,087
formally, in supervised learning, we have
a data set and this data set is called a
23
00:01:57,087 --> 00:02:02,018
training set. So for the housing prices
example, we have a training set of
24
00:02:02,018 --> 00:02:07,386
different housing prices and our job is to
learn from this data how to predict prices
25
00:02:07,386 --> 00:02:11,907
of the houses. Let's define some notation
that we'll be using throughout this course.
26
00:02:11,907 --> 00:02:16,100
We're going to define quite a lot of
symbols. It's okay if you don't remember
27
00:02:16,100 --> 00:02:20,075
all the symbols right now, but as the
course progresses it will be useful to have
28
00:02:20,075 --> 00:02:24,267
convenient notation. So I'm going to use
lowercase m throughout this course to
29
00:02:24,267 --> 00:02:28,897
denote the number of training examples. So
in this data set, if I have, let's say,
30
00:02:28,897 --> 00:02:34,366
47 rows in this table, then I
have 47 training examples and m equals 47.
31
00:02:34,366 --> 00:02:39,497
Let me use lowercase x to denote the
input variables, often also called the
32
00:02:39,497 --> 00:02:44,290
features. Those would be the x's here; they are the input features. And I'm going to
33
00:02:44,290 --> 00:02:49,556
use y to denote my output variable, or the
target variable, which I'm going to
34
00:02:49,556 --> 00:02:54,552
predict, and so that's the second
column here. Using this notation, I'm
35
00:02:54,552 --> 00:03:05,749
going to use (x, y) to denote a single
training example. So, a single row in this
36
00:03:05,749 --> 00:03:12,068
table corresponds to a single training
example and to refer to a specific
37
00:03:12,068 --> 00:03:19,708
training example, I'm going to use the
notation (x(i), y(i)). And we're
38
00:03:25,322 --> 00:03:30,935
going to use this to refer to the ith
training example. So this superscript i
39
00:03:30,935 --> 00:03:37,864
over here, this is not exponentiation,
right? This (x(i), y(i)): the superscript i in
40
00:03:37,864 --> 00:03:44,873
parentheses is just an index into my
training set and refers to the ith row in
41
00:03:44,873 --> 00:03:51,629
this table, okay? So this is not x to
the power of i, y to the power of i. Instead
42
00:03:51,629 --> 00:03:58,216
(x(i), y(i)) just refers to the ith row of this
table. So for example, x(1) refers to the
43
00:03:58,216 --> 00:04:04,972
input value for the first training example, so
that's 2104. That's this x in the first
44
00:04:04,972 --> 00:04:11,685
row. x(2) will be equal to
1416, right? That's the second x,
45
00:04:11,685 --> 00:04:17,385
and y(1) will be equal to 460,
the y value for my first
46
00:04:17,385 --> 00:04:24,526
training example; that's what that
superscript (1) refers to.
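To make this notation concrete, here is a small Python sketch of the training set and the (x(i), y(i)) indexing. Only the values quoted above (2104, 1416, 460) come from the lecture's table; the variable and function names are purely illustrative.

# Training set for the housing example: sizes in square feet (x)
# and prices in $1000s (y). Only the values quoted in the lecture
# are shown; the full table has m = 47 rows.
X = [2104, 1416]  # x(1) = 2104, x(2) = 1416, ... (remaining rows elided)
Y = [460]         # y(1) = 460, ...               (remaining rows elided)

m = 47  # number of training examples (rows in the full table)

def training_example(i):
    """Return the ith training example (x(i), y(i)), using the
    lecture's 1-based index on top of Python's 0-based lists."""
    return X[i - 1], Y[i - 1]

print(training_example(1))  # -> (2104, 460), i.e. (x(1), y(1))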
47
00:04:24,526 --> 00:04:28,345
As mentioned, occasionally I'll ask you a
question to let you check your understanding, and a few seconds into this
48
00:04:28,345 --> 00:04:34,044
video, a multiple-choice question
will pop up. When it does,
49
00:04:34,044 --> 00:04:40,362
please use your mouse to select what you
think is the right answer based on
50
00:04:40,362 --> 00:04:45,124
what the training set defines. So here's how this
supervised learning algorithm works.
51
00:04:45,124 --> 00:04:50,513
We start with a training set, like our
training set of housing prices, and we feed
52
00:04:50,513 --> 00:04:55,715
that to our learning algorithm. It's the job
of the learning algorithm to then output a
53
00:04:55,715 --> 00:05:00,101
function, which by convention is
usually denoted lowercase h, where h
54
00:05:00,101 --> 00:05:06,574
stands for hypothesis. The hypothesis
is a function that
55
00:05:06,574 --> 00:05:12,471
takes as input the size of a house, like
maybe the size of the new house your friend is
56
00:05:12,471 --> 00:05:18,368
trying to sell, so it takes in the value of
x and it tries to output the estimated
57
00:05:18,368 --> 00:05:31,630
value of y for the corresponding house.
So h is a function that maps from x's to y's.
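As a rough sketch of this flow in Python: the learning algorithm consumes a training set and returns the hypothesis function h. The closed-form least-squares fit below is only a stand-in so the sketch runs end to end (the course develops its own fitting procedure in later videos), and the training data here is hypothetical, chosen so the fitted line reproduces the roughly $220,000 prediction from earlier.

def learn(training_set):
    """Consume a list of (x, y) pairs; return a hypothesis h
    that maps an input x to an estimated y."""
    m = len(training_set)
    xs = [x for x, _ in training_set]
    ys = [y for _, y in training_set]
    mean_x = sum(xs) / m
    mean_y = sum(ys) / m
    # Slope and intercept of the least-squares straight line.
    theta_1 = (sum((x - mean_x) * (y - mean_y) for x, y in training_set)
               / sum((x - mean_x) ** 2 for x in xs))
    theta_0 = mean_y - theta_1 * mean_x

    def h(x):
        """Hypothesis: map a house size x to an estimated price y."""
        return theta_0 + theta_1 * x

    return h

# Hypothetical training set: (size in square feet, price in $1000s).
h = learn([(1000, 180), (1500, 260), (2000, 340), (2500, 420)])
print(h(1250))  # -> 220.0, i.e. about $220,000 for the friend's house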
58
00:05:31,630 --> 00:05:37,729
People often ask me
why this function is called a
59
00:05:37,729 --> 00:05:42,121
hypothesis. Some of you may know the
meaning of the term hypothesis from the
60
00:05:42,121 --> 00:05:46,744
dictionary or from science or whatever. It
turns out that in machine learning, this
61
00:05:46,744 --> 00:05:51,310
is a name that was used in the early days of
machine learning, and it kind of stuck. It's
62
00:05:51,310 --> 00:05:55,239
maybe not a great name for this sort of
function, for mapping from sizes of
63
00:05:55,239 --> 00:05:59,978
houses to price predictions.
I think the term hypothesis maybe isn't
64
00:05:59,978 --> 00:06:04,543
the best possible name for this, but this is the
standard terminology that people use in
65
00:06:04,543 --> 00:06:09,362
machine learning. So don't worry too much
about why people call it that. When
66
00:06:09,362 --> 00:06:14,332
designing a learning algorithm, the next
thing we need to decide is how do we
67
00:06:14,332 --> 00:06:20,540
represent this hypothesis h. For this
and the next few videos,
68
00:06:20,540 --> 00:06:26,978
our initial choice for representing the
hypothesis will be the following. We're going to
69
00:06:26,978 --> 00:06:33,009
represent h as follows. We will write this as
h_theta(x) = theta_0
70
00:06:33,009 --> 00:06:39,254
+ theta_1 x. And as a shorthand,
sometimes instead of writing
71
00:06:39,254 --> 00:06:45,441
h subscript theta of x,
I'll sometimes just write h(x).
72
00:06:45,441 --> 00:06:51,627
But more often I'll write it with the
subscript theta there. And plotting
73
00:06:51,627 --> 00:06:58,210
this in a picture, all this means is that
we are going to predict that y is a linear
74
00:06:58,210 --> 00:07:04,634
function of x. Right, so that's the
data set, and what this function is doing
75
00:07:04,634 --> 00:07:11,698
is predicting that y is some straight-line
function of x. That's h(x) = theta_0
76
00:07:11,698 --> 00:07:18,450
+ theta_1 x, okay?
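Written as a minimal Python function, the representation is just the following. The parameter values here are purely hypothetical, picked only so the example reproduces the roughly $220,000 prediction from earlier; how to actually choose theta_0 and theta_1 is the subject of the upcoming videos.

def h(x, theta_0, theta_1):
    """The hypothesis h_theta(x) = theta_0 + theta_1 * x."""
    return theta_0 + theta_1 * x

# Hypothetical parameters: intercept in $1000s and price per square
# foot in $1000s; illustrative values only.
theta_0, theta_1 = 20.0, 0.16
print(h(1250, theta_0, theta_1))  # -> 220.0, i.e. about $220,000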
77
00:07:18,450 --> 00:07:23,405
And why a linear function? Well, sometimes we'll want to
fit more complicated, perhaps non-linear,
functions as well. But since the linear
78
00:07:23,405 --> 00:07:28,298
case is the simple building block, we will
start with this example of fitting
79
00:07:28,298 --> 00:07:32,943
linear functions, and we will build on
this to eventually have more complex
80
00:07:32,943 --> 00:07:37,403
models, and more complex learning
algorithms. Let me also give this
81
00:07:37,403 --> 00:07:42,628
particular model a name. This model is
called linear regression. This, for
82
00:07:42,628 --> 00:07:48,271
example, is actually linear regression
with one variable, with the variable being
83
00:07:48,271 --> 00:07:53,914
x: predicting all the prices as a function
of one variable x. And another name for
84
00:07:53,914 --> 00:07:58,852
this model is univariate linear
regression. And univariate is just a
85
00:07:58,852 --> 00:08:04,400
fancy way of saying one variable. So,
that's linear regression. In the next
86
00:08:04,400 --> 00:08:09,760
video we'll start to talk about just how
we go about implementing this model.