Okay. So, let's look at our matrix equation once more. We have X, which is simply the data: a row of features for each data point. The values y sub i for each data point sit in this m by one column vector y. And we want to find f such that X times f is almost y. I'm going to use bold y and bold f to indicate that they're vectors. A small mistake: I've used bold y here where it actually should be unbold y; that was a typo, I apologize for that. Let's go on, though. We want to find f that approximately satisfies this equation, and we're going to do that by minimizing the difference between x_i transpose f and y_i for each of these data points, by taking the sum of the squares. That's what we decided earlier. If we write the sum of squares in matrix form, it turns out to be simply the difference between Xf and y, transposed and multiplied by itself. Essentially, it's the norm, or the sum of squares, of this difference vector, which is exactly what we have here. This quantity needs to be minimized. Now, to minimize a quantity, if you go back to high school, you need to make its derivative zero. The derivative with respect to what? The unknown quantity is f, and in this case it's a vector, so it's a bit more complicated than high school: you want to take the derivative with respect to every element of f. If you have studied vector calculus, it's fairly easy to do; we won't go through it here, but just think of f as being one variable for the time being. The way you take the derivative is to expand this out. You get one term which is f transpose X transpose X f, so there are two f's there; when you take the derivative of that, you get twice X transpose X f. Then you get two terms, y transpose X f and f transpose X transpose y, and the derivative of each of those gives you X transpose y. So, if you set the derivative to zero, you get twice the first quantity minus twice the second being zero, which gives what are called the normal equations: X transpose X times f should be equal to X transpose y. Now, X transpose X, if you look at it in matrix form, is no longer a very large matrix. It's only an n by n matrix, because you took this tall matrix and multiplied it by its transpose. You have f, which is a vector of length n, and X transpose y again becomes a much smaller column vector of length n. So one side of the equation is n by n and the other side is length n, and you should be able to solve it exactly, as long as, of course, the matrix is not singular, so that it actually has a unique solution. Most of the time, it does have a unique solution. And so, we solve this to get our f. Once we have our f, our least squares estimate for y is nothing but x prime transpose f, where x prime is (x, 1). So, given an x, the value of y predicted by least squares is (x, 1) transpose f, and that's our least squares estimate.
Let's do an example. Let's take a simple four data point example with only one feature, so x is just one variable; x can be many variables, of course. And y is a set of values as well. We saw the normal equations; here they look something like this. You take X transpose X, where each row of X is (x, 1), and y is just the values of y. The f that you get by solving the normal equations turns out to be this one. And when you plot f transpose x, which is simply the line 0.11x - 0.26, on a graph, with the (x, y) coordinates of the data plotted as well, you see that the line actually almost fits the data.
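To make this concrete, here is a minimal sketch in Python with NumPy of solving the normal equations for a one-feature problem. The four data points below are made up for illustration; they are not the points from the lecture's example.

```python
import numpy as np

# Illustrative data only: one feature x and target values y.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([0.1, 0.0, 0.2, 0.1])

# Each row of X is (x_i, 1), so f = (slope, intercept).
X = np.column_stack([x, np.ones_like(x)])

# Normal equations: (X^T X) f = X^T y, an n-by-n system (here 2-by-2).
f = np.linalg.solve(X.T @ X, X.T @ y)

# Least squares prediction for a new x: (x, 1)^T f
x_new = 2.5
y_hat = np.array([x_new, 1.0]) @ f
print(f, y_hat)
```

In practice one would usually call a library routine such as numpy.linalg.lstsq rather than forming X transpose X explicitly, since that is numerically better behaved, but the normal equations make the idea transparent.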
This is all there is to linear regression. Of course, if there are many variables, you will possibly have many such plots, one for each feature x, and the f will be much longer in terms of the number of coefficients. So, we have an approximation for y for every value of x that we might come across. But we also might want to ask: how good is this fit to the data? A common measure of how good the fit is, is called the R squared value. You take the error of each estimate, f transpose x_i minus the actual value y_i for each data point, square it and sum, and you divide that by the variation of y about its mean; R squared is one minus this ratio. So, y has a certain mean, which would be somewhere here, and you want to see how the sum of all these errors compares with the actual variation in the y values about their mean. It doesn't sound like a great measure, because if you have a steeper slope, you probably have a larger variation about the mean, and so you're actually tolerating more error in your estimate. But it is a common and easy-to-calculate measure. There are better measures, which tell you how good each of the coefficients is in terms of how confident one should be in it, but we will not go into that in detail. For this particular example, the data pretty much fits a line, and the R squared value comes out to be somewhere around 0.95. As you can see, the R squared value can be close to one, in which case it is a good fit, and close to zero if there isn't a good fit.
Let's take another example, one which is a little bit more realistic. It comes from a book called Super Crunchers, written by Ian Ayres in 2007, which talks about the power of reasoning from data; it's a great book if you want to read it. He tells the story of a wine expert called Orley Ashenfelter, who predicted the quality of wine based on the winter rainfall, the average temperature, and the harvest rainfall in a season. So, Ashenfelter could predict how good a particular wine would be depending on the weather in the growing region for that wine, and do this much before the wine hit the market. It turns out that his estimates were simply based on linear regression, and they turned out to be extremely successful, and surprised many experienced wine critics, who would have to taste the wine many times and judge it before predicting whether it would be a high-priced wine or not. The kind of estimate he got was something like this: a linear least squares fit, with the f0, the f1, the f2, and the f3, and he got a very nice fit. The important thing to note is that we can have positive correlations between the output variable y and an input variable, such as the positive values here, here, and here, or we can have a negative correlation. So, in this case, the 0.26 is negative and the 0.00386 is negative: if the harvest rainfall goes higher, the quality goes down, unlike the positive correlation with respect to, say, the winter rainfall.
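Here is a short sketch of how the R squared value just described could be computed for any fitted model, taking R squared to be one minus the ratio of the squared fit errors to the variation of y about its mean. The data is again made up for illustration.

```python
import numpy as np

def r_squared(y, y_hat):
    ss_res = np.sum((y - y_hat) ** 2)         # squared errors of our estimates
    ss_tot = np.sum((y - np.mean(y)) ** 2)    # variation of y about its mean
    return 1.0 - ss_res / ss_tot

# Illustrative one-feature example fitted via the normal equations.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([-0.15, -0.02, 0.05, 0.18])
X = np.column_stack([x, np.ones_like(x)])
f = np.linalg.solve(X.T @ X, X.T @ y)
print(r_squared(y, X @ f))
```

The same function works unchanged for a multi-variable fit like the wine example, since only the predictions y_hat enter the formula.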
Let's take a look at a few more examples of the kinds of correlation one might get with a single variable. Of course, all this applies to multiple variables as well; you can always draw graphs of the value with respect to any one of the feature variables. This first case is clearly a very strong correlation: you will get a very high R squared value for data that looks like this. Of course, if it looks a bit more scattered, you'll get a lower R squared value. And similarly, if the data is negatively correlated, the slope will be negative, and again the R squared will be lower if the points are more scattered. Now, suppose your data looks like this. Well, it doesn't look like there's any correlation, and the R squared will reflect that: the variation of y about its mean is fairly large, and the variation of the y's with respect to whatever line one decides to draw will also be very large. So, the ratio will be close to one and R squared will be close to zero. However, let's look at this case, which is a little bit more subtle. Here y is almost constant as x varies, so there is a line which goes through these points. But what's the R squared value? y doesn't change at all, so the variation of y about its mean is tiny, and the errors of the fitted line, even if it is very close to exact, are of the same small size. So, this ratio is again very close to one, and the R squared value is very small. Essentially, you're saying there's no correlation, and that's actually true, because whatever the value of x, y doesn't change. So, it's not only scattered data that gives you no correlation; even if the data lies on a straight line, if y doesn't change with x, then there is no correlation either. And lastly, we have another situation, which we'll come to in the next segment, which looks like this: whatever line one draws through these points, one will always make a lot of error. This is an example of non-linear correlation. The data is correlated, and you could draw a curve, such as a parabola, which would certainly work, but a straight line does not. For this and many other reasons, as we will see in the next segment, we have to go beyond linear least squares to more complicated prediction models.
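As a final sketch, assuming made-up parabola-shaped data, the code below fits a straight line, which gets an R squared near zero, and then a model that adds an x squared column to X, which is still solved by the same normal equations but captures the curve. The quadratic column is only an illustrative extension, not necessarily the approach the next segment takes.

```python
import numpy as np

def fit(X, y):
    # Solve the normal equations (X^T X) f = X^T y.
    return np.linalg.solve(X.T @ X, X.T @ y)

def r_squared(y, y_hat):
    return 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - np.mean(y)) ** 2)

# Illustrative data generated from a parabola.
x = np.linspace(-2.0, 2.0, 21)
y = x ** 2

X_line = np.column_stack([x, np.ones_like(x)])          # straight line
X_quad = np.column_stack([x ** 2, x, np.ones_like(x)])  # adds an x^2 feature

print(r_squared(y, X_line @ fit(X_line, y)))   # close to zero
print(r_squared(y, X_quad @ fit(X_quad, y)))   # essentially one
```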