In this video we'll define something called the cost function. This will let us figure out how to fit the best possible straight line to our data. In linear regression we have a training set like the one shown here. Remember our notation: m is the number of training examples, so maybe m = 47. And the form of the hypothesis, which we use to make predictions, is this linear function. To introduce a little more terminology, theta zero and theta one, these theta i's, are what I call the parameters of the model. What we're going to do in this video is talk about how to go about choosing these two parameter values, theta zero and theta one. With different choices of the parameters theta zero and theta one we get different hypotheses, different hypothesis functions. Some of you may already be familiar with what I'm going to do on this slide, but just to review, here are a few examples. If theta zero is 1.5 and theta one is 0, then the hypothesis function will look like this, because h(x) = 1.5 + 0 times x, which is a constant value function: it's flat at 1.5. If theta zero equals 0 and theta one equals 0.5, then the hypothesis will look like this, and it should pass through the point (2, 1). So now h(x), or really h_theta(x), though sometimes I'll just omit theta for brevity, will be equal to 0.5 times x, which looks like that. And finally, if theta zero equals 1 and theta one equals 0.5, then we end up with a hypothesis that looks like this; let's see, it should pass through the point (2, 2), like so. And this is my new h(x), or my new h_theta(x). All right? Remember that this is h_theta(x), but as a shorthand sometimes I just write h(x). In linear regression we have a training set, like maybe the one I've plotted here. What we want to do is come up with values for the parameters theta zero and theta one so that the straight line we get out of this somehow fits the data well, like maybe that line over there. So how do we come up with values of theta zero and theta one that correspond to a good fit to the data? The idea is that we're going to choose our parameters theta zero and theta one so that h(x), meaning the value we predict on input x, is at least close to the value y for the examples in our training set, for our training examples.
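To make the role of the parameters concrete, here is a minimal sketch (not from the lecture itself) of the hypothesis as a plain Python function, evaluated at the three example parameter settings from the slide:

```python
# A minimal sketch: the hypothesis is just a straight line
# parameterized by theta0 (intercept) and theta1 (slope).
def h(x, theta0, theta1):
    """Predicted value h_theta(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x

# The three example settings from the slide, evaluated at x = 2:
print(h(2, 1.5, 0.0))   # flat line at 1.5          -> 1.5
print(h(2, 0.0, 0.5))   # passes through (2, 1)     -> 1.0
print(h(2, 1.0, 0.5))   # passes through (2, 2)     -> 2.0
```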
So, in our training set we're given a number of examples where we know the size of the house x and we know the actual price it sold for. So let's try to choose values for the parameters so that, at least on the training set, given the x's in the training set, we make reasonably accurate predictions for the y values. Let's formalize this. In linear regression, what we're going to do is solve a minimization problem. So I'm going to write: minimize over theta zero, theta one. And I want this to be small; I want the difference between h(x) and y to be small. And one thing I'm going to do is try to minimize the squared difference between the output of the hypothesis and the actual price of the house. Okay? So let's fill in some details. Remember that I was using the notation (x(i), y(i)) to represent the i-th training example. So what I really want is to sum over my training set, from i equals 1 to m, the squared difference between the prediction of my hypothesis when its input is the size of house number i, and the actual price that house number i will sell for. So I want to minimize the sum over my training set, from i equals 1 through m, of this squared error, the squared difference between the predicted price of the house and the price it will actually sell for. And just to remind you of the notation, m here is the size of my training set, so the m there is my number of training examples; that hash sign is the abbreviation for the "number" of training examples. Okay? And to make the math a little bit easier, I'm actually going to look at 1 over m times that, so we're going to try to minimize my average error, which we're going to minimize as 1 over 2m. Putting the 2, the constant one half, in front just makes some of the math a little easier. So minimizing one half of something should give you the same values of the parameters theta zero and theta one as minimizing that function. And just to make sure this equation is clear: this expression in here, h_theta(x(i)), is our usual hypothesis, right? That's equal to theta zero plus theta one times x(i). And this notation, minimize over theta zero and theta one, means find me the values of theta zero and theta one that cause this expression to be minimized. And this expression depends on theta zero and theta one. Okay?
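Written out as a formula, the minimization problem just described in words is the following (this is simply the expression from the lecture, not anything new):

```latex
% Minimize the average squared prediction error over the training set
\min_{\theta_0,\,\theta_1} \; \frac{1}{2m} \sum_{i=1}^{m}
    \left( h_\theta\!\big(x^{(i)}\big) - y^{(i)} \right)^2,
\qquad \text{where } h_\theta\!\big(x^{(i)}\big) = \theta_0 + \theta_1 x^{(i)}.
```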
So just to recap, we're posing this problem as: find me the values of theta zero and theta one so that the average, really 1 over 2m times the sum of squared errors between my predictions on the training set and the actual values of the houses in the training set, is minimized. So this is going to be my overall objective function for linear regression. And just to rewrite this out a little bit more cleanly, what I'm going to do by convention is define a cost function, which is going to be exactly this formula that I have up here. And what I want to do is minimize over theta zero and theta one my function J of theta zero comma theta one. Just to write this out, this is my cost function. So, this cost function is also called the squared error function, or sometimes the squared error cost function. And why do we take the squares of the errors? It turns out that the squared error cost function is a reasonable choice and will work well for most problems, for most regression problems. There are other cost functions that will work pretty well, but the squared error cost function is probably the most commonly used one for regression problems. Later in this class we'll also talk about alternative cost functions, but the choice we just made should be a pretty reasonable thing to try for most linear regression problems. Okay, so that's the cost function. So far we've just seen a mathematical definition of this cost function, and in case this function J of theta zero, theta one seems a little bit abstract and you still don't have a good sense of what it's doing, in the next video, in the next couple of videos, we're actually going to go a little bit deeper into what the cost function J is doing and try to give you better intuition about what it's computing and why we want to use it.
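In code, a minimal sketch of this cost function might look like the following; the training set values and the two candidate parameter settings are made up purely for illustration, and a lower J simply means a better fit on that data.

```python
def cost(theta0, theta1, xs, ys):
    """Squared error cost J(theta0, theta1) = (1 / (2m)) * sum((h(x_i) - y_i)^2)."""
    m = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

# Hypothetical training set (sizes in square feet, prices in $1000s), for illustration only.
xs = [2104, 1416, 1534, 852]
ys = [460, 232, 315, 178]

# Comparing two candidate parameter settings: the one with the smaller J fits this data better.
print(cost(0.0, 0.20, xs, ys))
print(cost(50.0, 0.15, xs, ys))
```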