In this video we'll define something called the cost function. This will let us figure out how to fit the best possible straight line to our data.

In linear regression we have a training set like the one shown here. Remember our notation: M is the number of training examples, so maybe M = 47. And the form of the hypothesis, which we use to make predictions, is this linear function. To introduce a little more terminology, theta zero and theta one, these theta i's, are what I call the parameters of the model. What we're going to do in this video is talk about how to go about choosing these two parameter values, theta zero and theta one.

With different choices of the parameters theta zero and theta one we get different hypotheses, different hypothesis functions. I know some of you will probably already be familiar with what I'm going to do on this slide, but just to review, here are a few examples. If theta zero is 1.5 and theta one is 0, then the hypothesis function will look like this, because your hypothesis function will be h(x) equals 1.5 plus 0 times x, which is this constant-valued function, flat at 1.5. If theta zero equals 0 and theta one equals 0.5, then the hypothesis will look like this, and it should pass through the point (2, 1). So you now have h(x), or really h_theta(x), but sometimes I'll just omit theta for brevity; h(x) will be equal to just 0.5 times x, which looks like that. And finally, if theta zero equals 1 and theta one equals 0.5, then we end up with a hypothesis that looks like this; let's see, it should pass through the point (2, 2), like so. And this is my new h(x), or my new h_theta(x). Remember that this is h_theta(x), but as a shorthand sometimes I'll just write it as h(x).

In linear regression we have a training set, like maybe the one I've plotted here. What we want to do is come up with values for the parameters theta zero and theta one so that the straight line we get out of this corresponds to a straight line that fits the data well, like maybe that line over there. So how do we come up with values of theta zero and theta one that correspond to a good fit to the data? The idea is that we're going to choose our parameters theta zero, theta one so that h(x), meaning the value we predict on input x, is at least close to the value y for the examples in our training set, for our training examples. In our training set we're given a number of examples where we know x, the size of the house, and we know the actual price it sold for. So let's try to choose values for the parameters so that, at least on the training set, given the x's in the training set, we make reasonably accurate predictions for the y values.

Let's formalize this. In linear regression, what we're going to do is solve a minimization problem. So I'm going to write: minimize over theta zero, theta one. And I want this to be small; I want the difference between h(x) and y to be small. And one thing I'm going to do is try to minimize the square difference between the output of the hypothesis and the actual price of the house. Okay? So let's fill in some details. Remember that I was using the notation (x(i), y(i)) to represent the i-th training example. So what I really want is to sum over my training set.
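Written out in symbols, the objective being described so far (before the 1/(2M) factor that comes next) is:

```latex
\min_{\theta_0,\,\theta_1} \; \sum_{i=1}^{M} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^{2},
\qquad \text{where } h_\theta(x) = \theta_0 + \theta_1 x .
```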
I'll sum, from i equals 1 to M, the square difference between the prediction of my hypothesis when it is given as input the size of house number i, minus the actual price that house number i will sell for. So I want to minimize, summed over my training set from i equals 1 through M, this squared error, the square difference between the predicted price of the house and the price it will actually sell for. And just to remind you of the notation, M here is the size of my training set, that is, my number of training examples; the hash sign is just an abbreviation for "number of" training examples.

And to make some of the math a little easier, I'm actually going to look at one over M times that, so we're going to try to minimize my average error; in fact, we'll minimize one over 2M times the sum. Putting the constant one half in front just makes some of the math a little easier, and minimizing one half of a function should give you the same values of the parameters theta zero, theta one as minimizing the function itself. And just to make sure this equation is clear: this expression in here, h_theta(x(i)), is our usual hypothesis, which is equal to theta zero plus theta one times x(i). And this notation, minimize over theta zero and theta one, means find me the values of theta zero and theta one that cause this expression to be minimized; and this expression depends on theta zero and theta one.

So just to recap, we're posing this problem as: find me the values of theta zero and theta one so that the average, really one over 2M times the sum of squared errors between my predictions on the training set and the actual values of the houses in the training set, is minimized. This is going to be my overall objective function for linear regression.

And just to rewrite this out a little more cleanly, what I'm going to do by convention is define a cost function, which is going to be exactly that formula I have up here. And what I want to do is minimize over theta zero and theta one my function J of theta zero comma theta one. Just to write this out, this is my cost function. This cost function is also called the squared error function, or sometimes the squared error cost function. Why do we take the squares of the errors? It turns out that the squared error cost function is a reasonable choice and will work well for most problems, for most regression problems. There are other cost functions that will work pretty well, but the squared error cost function is probably the most commonly used one for regression problems. Later in this class we'll also talk about alternative cost functions, but the choice we just made should be a pretty reasonable thing to try for most linear regression problems.

Okay, so that's the cost function. So far we've just seen a mathematical definition of this cost function. In case this function J of theta zero, theta one seems a little abstract and you still don't have a good sense of what it's doing, in the next couple of videos we're going to go a little deeper into what the cost function J is doing, and try to give you better intuition about what it's computing and why we want to use it.
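To make the definition concrete, here is a minimal Python sketch (not from the lecture) of this squared error cost function, J(theta0, theta1) = (1 / (2M)) times the sum of squared differences between predictions and actual values. The function names and the tiny training set below are made up purely for illustration.

```python
def hypothesis(theta0, theta1, x):
    """h_theta(x) = theta0 + theta1 * x, the linear hypothesis."""
    return theta0 + theta1 * x


def cost(theta0, theta1, xs, ys):
    """Squared error cost J(theta0, theta1) over M training examples."""
    m = len(xs)
    total = sum((hypothesis(theta0, theta1, x) - y) ** 2 for x, y in zip(xs, ys))
    return total / (2 * m)


# Hypothetical training set: sizes (xs) and prices (ys); values are made up.
xs = [1.0, 2.0, 3.0]
ys = [1.0, 2.0, 3.0]

print(cost(0.0, 0.5, xs, ys))  # poorer fit on this toy data -> larger cost
print(cost(0.0, 1.0, xs, ys))  # perfect fit on this toy data -> cost 0.0
```

Parameter values that make the predictions h_theta(x(i)) close to the actual y(i) give a small J, which is exactly what the minimization over theta zero and theta one is looking for.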