Let's look at the problem of prediction from the framework that we defined way back in the lecture on learning. Once again, we have many data points. This time, we'll use a slightly different notation. We have m data points, and each data point has n - 1 features. There's a reason why we're using n - 1; it will become clear shortly. And, of course, with each set of features, or instance, we have an output variable, possibly more than one, which is what we want to predict. Just as before, we have a large number of data points, each having a number of features, and given a new data point, we would like to know its value. In the case of learning, we had categories that we wanted to predict: whether this was a buyer or a browser, a positive comment or a negative comment. But now the y's are numbers, and we would like to predict the actual value of y. For example, the x_i's could be features about certain products, and the numbers y might be the prices they fetch in different markets, or the volume of sales you might expect based on the actions you have taken for each product. So, instead of trying to decide what class a new instance belongs to, you want to predict something about the instance, such as its price or how much it's going to sell: a value. In this sense, there's a close relationship between learning and prediction, but here we are trying to predict a value rather than a class. To a certain extent, what we studied in learning was also prediction, but we were predicting categories as opposed to actual values. When we talk about prediction in this lecture, we'll be predicting actual values. There is, of course, a close relationship: if you know the class of something you observe, you can make predictions about what it will do in the future, based on your understanding of how other instances of that class behave. But that requires a certain degree of reasoning, and it doesn't come directly from the data. For example, if you found that somebody was likely to be a buyer, you would naturally predict that they will go and click on an ad or buy something. That's not something that was learned from the data; it was part of how the data was coded, and it involved some degree of reasoning and preparation of the data, as well as inference once one has observed the class. It's a subtle but important point that one needs to understand: the distinction between predicting values and learning categories. In that sense, prediction, learning, and reasoning will all come together later in this lecture. But for the moment, let's talk about simply trying to predict the value y. In particular, we'll be talking about just one output variable y, given knowledge about a large number of instances where you have both the features and the y values. The technique we'll talk about is called linear prediction, or regression, or least squares, and it follows fairly naturally from our formulation of the learning problem in terms of finding the expected value of y given an instance x. It turns out, and this is very important for linear prediction, that the function f of x, which we are so familiar with from our previous lectures, also minimizes the error between the actual y and the predicted value f of x. In particular, it minimizes the expected sum of squares of this error.
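To make that last claim concrete, here is one way to write it in standard notation; this is a sketch of the usual statement of the result, not something taken verbatim from the lecture:

```latex
% Among all functions f, the expected squared error
%   E[(y - f(x))^2]
% is minimized by the conditional mean f*(x) = E[y | x].
f^{*} \;=\; \arg\min_{f}\; \mathbb{E}\!\left[\bigl(y - f(\mathbf{x})\bigr)^{2}\right],
\qquad
f^{*}(\mathbf{x}) \;=\; \mathbb{E}\left[\, y \mid \mathbf{x} \,\right]
```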
Now, the proof of this is not difficult, but we won't go into it here. Let's take it as given that if we want to find the expected value of y given x, we might as well minimize the error, or rather the sum of squares of the errors, between y and f of x. It's important to clarify the notation here because it could get a bit confusing. Bold x means one data point, which may consist of n - 1 different features. A bold x with a subscript is not one of the features; it is one particular instance, a whole data point. As far as output variables are concerned, we are dealing with only one, so bold y is just the ordinary, un-bold y in this case, and y_i is not one of the features but the output variable for the i-th data point. Now, let's take a look at what this means. For one particular data point, the error is the difference between the output variable y_i and the predicted value, f of the data point x_i. We square it, and take the average of that square across all the data points that we have. There are m data points, so we take the sum of all these squared errors and divide by m to get the average error, and that is nothing but the expected value of the error, as measured by the square of the difference between y and the predicted y. Even if we set aside the relationship between the expected value and the sum of squares, just think of it this way: we have a prediction f, which says that y should be f of x, and we have the actual values y_i, and we want to make sure the difference isn't too large. The most natural way to do that is to measure the differences and find an f which minimizes the average squared difference between y and f of x. Let's see what that looks like for a particular form of f. We're dealing with linear prediction, so we're saying that f is not just any function, but a particular type of function: a linear function. So f of x is x times some set of weights: we treat x_i as a row vector consisting of all the features, followed by a 1 because there will be a constant term, and then x_i transpose f is our prediction. We're essentially taking a linear combination of the features: f_1 times x_1 plus f_2 times x_2, and so on for every feature, plus some f_0 which is just a constant, and that's our function f. What we really want is that the differences between the predicted values and the actual y's are very small. If you write this as a matrix, what that works out to is a matrix, capital X, where each row is one particular data point: the n - 1 features of that data point sit in different columns, followed by the extra 1, giving n columns in all. Multiplying X by f, the unknown vector we are trying to find, should give us, as closely as possible, the output for each data point. So x_1 transpose f is almost y_1, x_2 transpose f is almost y_2, and so on. The extra 1 is there to make sure we don't just take a combination of the features, but also allow for a constant term in the linear combination. Please look through this matrix formulation and convince yourself that what we're really trying to do is minimize the difference between x_i transpose f and y_i.
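As a small sketch of this setup, here is how one might build the design matrix X with the extra column of ones and measure the average squared error for a candidate weight vector f. The data here is made up purely for illustration, and the helper name mean_squared_error is our own, not something defined in the lecture:

```python
import numpy as np

# Toy data: m data points, each with n - 1 features (values are made up).
m, n_minus_1 = 5, 2
rng = np.random.default_rng(0)
features = rng.normal(size=(m, n_minus_1))   # the x_i's
y = rng.normal(size=m)                       # the outputs y_i

# Design matrix X: each row is one data point's features plus a constant 1,
# so each row has n columns in total.
X = np.hstack([features, np.ones((m, 1))])

def mean_squared_error(f):
    """Average of (y_i - x_i^T f)^2 over all m data points."""
    predictions = X @ f          # x_i^T f for every row i
    errors = y - predictions
    return np.mean(errors ** 2)

# Any candidate weight vector f (of length n) can be scored this way.
f_candidate = np.zeros(n_minus_1 + 1)
print(mean_squared_error(f_candidate))
```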
And if you write it down in matrix form, we're really trying to find an f such that X times f is almost equal to y. Some of you might have seen matrix equations before and might wonder why we have this "almost equal to". The reason is that the number of rows in this matrix is much larger than the number of columns: you have a few features, maybe a few dozen, but you may have observed hundreds of thousands of data points. We can't solve this exactly because there are just too many rows and too few columns, so we have to solve it approximately. Of course, if the matrix were n by n, we could solve it exactly, which is what you learned in high school: how to solve a system of linear equations. Instead, we'll have to solve it approximately, using a technique called least squares. And here's how that works.
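As a quick sketch of where this is headed, and using NumPy's built-in least-squares routine rather than the derivation the lecture is about to walk through, the approximate solution of X f ≈ y can be computed like this (again with made-up toy data):

```python
import numpy as np

# Toy setup as before: design matrix X (with the column of ones) and outputs y.
rng = np.random.default_rng(0)
m = 5
features = rng.normal(size=(m, 2))
X = np.hstack([features, np.ones((m, 1))])
y = rng.normal(size=m)

# Least squares: find the f that minimizes ||X f - y||^2.
f_hat, residuals, rank, _ = np.linalg.lstsq(X, y, rcond=None)

print("estimated weights:", f_hat)     # the fitted vector f
print("predictions X f:", X @ f_hat)   # approximately equal to y
```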