Our first learning algorithm will be linear regression. In this video, you'll see what the model looks like and, more importantly, what the overall process of supervised learning looks like.

Let's use a motivating example of predicting housing prices. We're going to use a data set of housing prices from the city of Portland, Oregon. Here I'm going to plot my data set of a number of houses of different sizes that sold for a range of different prices. Let's say that, given this data set, you have a friend who's trying to sell a house, your friend's house is 1250 square feet, and you want to tell them how much they might be able to sell the house for. Well, one thing you could do is fit a model, maybe fit a straight line to this data, and based on that, maybe you could tell your friend that they can sell the house for around $220,000.

So this is an example of a supervised learning algorithm. It's supervised learning because we're given the, quote, "right answer" for each of our examples. Namely, we're told the actual price that each of the houses in our data set was sold for. Moreover, this is an example of a regression problem, where the term regression refers to the fact that we are predicting a real-valued output, namely the price. And just to remind you, the other most common type of supervised learning problem is called a classification problem, where we predict discrete-valued outputs, such as looking at cancer tumors and trying to decide whether a tumor is malignant or benign. That's a zero-one valued discrete output.

More formally, in supervised learning we have a data set, and this data set is called a training set. So for the housing prices example, we have a training set of different housing prices, and our job is to learn from this data how to predict the prices of houses. Let's define some notation that we'll be using throughout this course. We're going to define quite a lot of symbols. It's okay if you don't remember all the symbols right now, but as the course progresses it will be useful to have convenient notation.

I'm going to use lowercase m throughout this course to denote the number of training examples. So in this data set, if I have, say, 47 rows in this table, then I have 47 training examples and m equals 47. I'll use lowercase x to denote the input variables, often also called the features; those are the x's here, the input features. And I'm going to use y to denote my output variable, or the target variable, which I'm going to predict; that's the second column here. As for notation, I'm going to use (x, y) to denote a single training example, so a single row in this table corresponds to a single training example. And to refer to a specific training example, I'm going to use the notation (x^(i), y^(i)) to refer to the i-th training example. So this superscript i over here is not exponentiation, right? The superscript i in parentheses in (x^(i), y^(i)) is just an index into my training set and refers to the i-th row in this table, okay? So this is not x to the power of i and y to the power of i; (x^(i), y^(i)) just refers to the i-th row of the table. So for example, x^(1) refers to the input value for the first training example, which is 2104; that's the x in the first row. x^(2) will be equal to 1416, right? That's the second x. And y^(1) will be equal to 460, the y value for my first training example: that's what that superscript (1) refers to.
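To make this notation concrete, here's a minimal sketch in Python. Only x^(1) = 2104, x^(2) = 1416, and y^(1) = 460 come from the lecture; the second price below is an assumed value for illustration.

```python
# A tiny training set of (size in square feet, price in $1000s).
# x(1) = 2104, x(2) = 1416, and y(1) = 460 match the lecture;
# the second price is an assumed, illustrative value.
training_set = [
    (2104, 460),
    (1416, 232),  # price assumed for illustration
]

m = len(training_set)  # m = the number of training examples, here 2

# The lecture's notation is 1-indexed, while Python lists are
# 0-indexed, so the i-th training example (x(i), y(i)) is:
def example(i):
    return training_set[i - 1]

x1, y1 = example(1)  # x(1) = 2104, y(1) = 460
x2, y2 = example(2)  # x(2) = 1416
```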
As mentioned, occasionally I'll ask you a question to let you check your understanding, and a few seconds into this video a multiple-choice question will pop up. When it does, please use your mouse to select what you think is the right answer about what the training set defines.

So here's how this supervised learning algorithm works. We take a training set, like our training set of housing prices, and we feed that to our learning algorithm. It's the job of the learning algorithm to then output a function, which by convention is usually denoted lowercase h, where h stands for hypothesis. The hypothesis is a function that takes as input the size of a house, like maybe the size of the new house your friend is trying to sell; it takes in the value of x and tries to output the estimated value of y for the corresponding house. So h is a function that maps from x's to y's.
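Here's a minimal sketch of that pipeline, with the actual fitting left as a stand-in, since learning the parameters is the subject of the upcoming videos; the function name and parameter values below are mine for illustration, not part of the lecture.

```python
def learning_algorithm(training_set):
    """Take a training set and output a hypothesis h.

    How the parameters are actually fit is covered in later
    videos; here we return a fixed straight line as a stand-in,
    to show that the learner's output is a function from x to y.
    """
    theta0, theta1 = 0.0, 0.176  # placeholder values, not learned
    def h(x):
        return theta0 + theta1 * x
    return h

training_set = [(2104, 460), (1416, 232)]  # sizes vs. prices ($1000s)
h = learning_algorithm(training_set)

# Estimate a price for the friend's 1250-square-foot house.
print(h(1250))  # 220.0, i.e. about $220,000 with these placeholders
```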
People often ask me why this function is called a hypothesis. Some of you may know the meaning of the term hypothesis from the dictionary or from science. It turns out that in machine learning, this is a name that was used in the early days of the field, and it kind of stuck. It's maybe not a great name for this sort of function, mapping from sizes of houses to price predictions; I think the term hypothesis maybe isn't the best possible name, but this is the standard terminology that people use in machine learning, so don't worry too much about why people call it that.

When designing a learning algorithm, the next thing we need to decide is how we represent this hypothesis h. For this and the next few videos, our initial choice for representing the hypothesis will be the following. We're going to write h_θ(x) = θ₀ + θ₁x. And as a shorthand, sometimes instead of writing h subscript θ of x, I'll just write h(x), but more often I'll write it with the subscript θ. Plotting this in the picture, all this means is that we are going to predict that y is a linear function of x. So that's the data set, and what this function is doing is predicting that y is some straight-line function of x; that's h(x) = θ₀ + θ₁x, okay? (We'll sketch this representation in code at the end of this section.)

And why a linear function? Well, sometimes we'll want to fit more complicated, perhaps non-linear functions as well. But since this linear case is the simple building block, we will start with this example of fitting linear functions first, and we will build on it to eventually get to more complex models and more complex learning algorithms.

Let me also give this particular model a name. This model is called linear regression; this example, in fact, is linear regression with one variable, with the variable being x: predicting all the prices as functions of the one variable x. And another name for this model is univariate linear regression, where univariate is just a fancy way of saying one variable. So, that's linear regression. In the next video we'll start to talk about just how we go about implementing this model.
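As promised above, here's a quick sketch of the representation h_θ(x) = θ₀ + θ₁x; the θ values below are placeholders chosen only to draw different lines, not values learned from data.

```python
def make_hypothesis(theta0, theta1):
    """Return the hypothesis h_theta(x) = theta0 + theta1 * x."""
    return lambda x: theta0 + theta1 * x

# Different choices of theta0 and theta1 give different straight
# lines; picking values that fit the training data well is the
# topic of the next videos. These values are illustrative only.
for theta0, theta1 in [(0.0, 0.10), (50.0, 0.10), (0.0, 0.176)]:
    h = make_hypothesis(theta0, theta1)
    print(f"theta0={theta0}, theta1={theta1}, h(1250)={h(1250)}")
```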