Okay. So, let's look at our matrix equation once more. We have X, which is simply the data: a row of features for each data point. The values y sub i for each data point sit in this m by one column vector y. And we want to find f such that X times f is almost y. I'm going to use bold y and bold f to indicate that they're vectors. A small mistake: I've used bold y here where it actually should be unbold y; that was a typo, I apologize for that. Let's go on, though. We want to find f that approximately satisfies this equation, and we're going to do that by minimizing the difference between x_i transpose f and y_i for each of these data points, by taking the sum of the squares. That's what we decided earlier. If we write the sum of squares in matrix form, it turns out to be simply the difference between Xf and y, transposed and multiplied by itself. Essentially, it's the norm, or the sum of squares, of this difference vector, which is exactly what we have here. This quantity needs to be minimized. Now, to minimize a quantity, if you go back to high school, you need to make its derivative zero. The derivative with respect to what? The unknown quantity is f, and in this case it's a vector, so it's a bit more complicated than high school: you want to take the derivative with respect to every element of f. If you have studied vector calculus, it's fairly easy to do; we won't go through it here, but just think of f as being one variable for the time being. The way you take the derivative is to expand this out. You get one term which is f transpose X transpose X f, so there are two f's there; when you take the derivative of that, you get twice X transpose X f. Then you get two terms, y transpose X f and f transpose X transpose y, and the derivative of each of those gives you X transpose y. So, if you set the derivative to zero, you get twice the first quantity minus twice the second being zero, which gives what are called the normal equations: X transpose X times f should be equal to X transpose y. Now, X transpose X, if you look at it in matrix form, is no longer a very large matrix. It's only an n by n matrix, because you took this tall matrix and multiplied it by its transpose. You have f, which is a vector of length n, and X transpose y again becomes a much smaller column vector of length n. So one side of the equation is n by n and the other side is length n, and you should be able to solve it exactly, as long as, of course, the matrix is not singular, so that it actually has a unique solution. Most of the time, it does have a unique solution. And so, we solve this to get our f. Once we have our f, our least squares estimate for y is nothing but x prime transpose f, where x prime is (x, 1). So, given an x, the value of y predicted by least squares is (x, 1) transpose f, and that's our least squares estimate.
Let's do an example. Let's take a simple four data point example with only one feature, so x is just one variable; x can be many variables, of course. And y is a set of values as well. We saw the normal equations; here they look something like this. You take X transpose X, where each row of X is (x, 1), and y is just the values of y. The f that you get by solving the normal equations turns out to be this one. And when you plot f transpose x, which is simply the line 0.11x - 0.26, on a graph, with the (x, y) coordinates of the data plotted as well, you see that the line actually almost fits the data.
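To make this concrete, here is a minimal sketch in Python with NumPy of solving the normal equations for a one-feature problem. The four data points below are made up for illustration; they are not the points from the lecture's example.

```python
import numpy as np

# Illustrative data only: one feature x and target values y.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([0.1, 0.0, 0.2, 0.1])

# Each row of X is (x_i, 1), so f = (slope, intercept).
X = np.column_stack([x, np.ones_like(x)])

# Normal equations: (X^T X) f = X^T y, an n-by-n system (here 2-by-2).
f = np.linalg.solve(X.T @ X, X.T @ y)

# Least squares prediction for a new x: (x, 1)^T f
x_new = 2.5
y_hat = np.array([x_new, 1.0]) @ f
print(f, y_hat)
```

In practice one would usually call a library routine such as numpy.linalg.lstsq rather than forming X transpose X explicitly, since that is numerically better behaved, but the normal equations make the idea transparent.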
This is all there is to linear regression. Of course, if there are many variables, you will possibly have many such plots, one for each feature x, and the f will be much longer in terms of the number of coefficients. So, we have an approximation for y for every value of x that we might come across. But we also might want to ask: how good is this fit to the data? A common measure of how good the fit is, is called the R squared value. You take the error of each estimate, f transpose x_i minus the actual value y_i for each data point, square it and sum, and you divide that by the variation of y about its mean; R squared is one minus this ratio. So, y has a certain mean, which would be somewhere here, and you want to see how the sum of all these errors compares with the actual variation in the y values about their mean. It doesn't sound like a great measure, because if you have a steeper slope, you probably have a larger variation about the mean, and so you're actually tolerating more error in your estimate. But it is a common and easy-to-calculate measure. There are better measures, which tell you how good each of the coefficients is in terms of how confident one should be in it, but we will not go into that in detail. For this particular example, the data pretty much fits a line, and the R squared value comes out to be somewhere around 0.95. As you can see, the R squared value can be close to one, in which case it is a good fit, and close to zero if there isn't a good fit.
Let's take another example, one which is a little bit more realistic. It comes from a book called Super Crunchers, written by Ian Ayres in 2007, which talks about the power of reasoning from data; it's a great book if you want to read it. He tells the story of a wine expert called Orley Ashenfelter, who predicted the quality of wine based on the winter rainfall, the average temperature, and the harvest rainfall in a season. So, Ashenfelter could predict how good a particular wine would be depending on the weather in the growing region for that wine, and do this much before the wine hit the market. It turns out that his estimates were simply based on linear regression, and they turned out to be extremely successful, and surprised many experienced wine critics, who would have to taste the wine many times and judge it before predicting whether it would be a high-priced wine or not. The kind of estimate he got was something like this: a linear least squares fit, with the f0, the f1, the f2, and the f3, and he got a very nice fit. The important thing to note is that we can have positive correlations between the output variable y and an input variable, such as the positive values here, here, and here, or we can have a negative correlation. So, in this case, the 0.26 is negative and the 0.00386 is negative: if the harvest rainfall goes higher, the quality goes down, unlike the positive correlation with respect to, say, the winter rainfall.
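Here is a short sketch of how the R squared value just described could be computed for any fitted model, taking R squared to be one minus the ratio of the squared fit errors to the variation of y about its mean. The data is again made up for illustration.

```python
import numpy as np

def r_squared(y, y_hat):
    ss_res = np.sum((y - y_hat) ** 2)         # squared errors of our estimates
    ss_tot = np.sum((y - np.mean(y)) ** 2)    # variation of y about its mean
    return 1.0 - ss_res / ss_tot

# Illustrative one-feature example fitted via the normal equations.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([-0.15, -0.02, 0.05, 0.18])
X = np.column_stack([x, np.ones_like(x)])
f = np.linalg.solve(X.T @ X, X.T @ y)
print(r_squared(y, X @ f))
```

The same function works unchanged for a multi-variable fit like the wine example, since only the predictions y_hat enter the formula.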
Let's take a look at a few more examples of the kinds of correlation one might get with a single variable. Of course, all this applies to multiple variables as well; you can always draw graphs of the value with respect to any one of the feature variables. This first case is clearly a very strong correlation: you will get a very high R squared value for data that looks like this. Of course, if it looks a bit more scattered, you'll get a lower R squared value. And similarly, if the data is negatively correlated, the slope will be negative, and again the R squared will be lower if the points are more scattered. Now, suppose your data looks like this. Well, it doesn't look like there's any correlation, and the R squared will reflect that: the variation of y about its mean is fairly large, and the variation of the y's with respect to whatever line one decides to draw will also be very large. So, the ratio will be close to one and R squared will be close to zero. However, let's look at this case, which is a little bit more subtle. Here y is almost constant as x varies, so there is a line which goes through these points. But what's the R squared value? y doesn't change at all, so the variation of y about its mean is tiny, and the errors of the fitted line, even if it is very close to exact, are of the same small size. So, this ratio is again very close to one, and the R squared value is very small. Essentially, you're saying there's no correlation, and that's actually true, because whatever the value of x, y doesn't change. So, it's not only scattered data that gives you no correlation; even if the data lies on a straight line, if y doesn't change with x, then there is no correlation either. And lastly, we have another situation, which we'll come to in the next segment, which looks like this: whatever line one draws through these points, one will always make a lot of error. This is an example of non-linear correlation. The data is correlated, and you could draw a curve, such as a parabola, which would certainly work, but a straight line does not. For this and many other reasons, as we will see in the next segment, we have to go beyond linear least squares to more complicated prediction models.
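As a final sketch, assuming made-up parabola-shaped data, the code below fits a straight line, which gets an R squared near zero, and then a model that adds an x squared column to X, which is still solved by the same normal equations but captures the curve. The quadratic column is only an illustrative extension, not necessarily the approach the next segment takes.

```python
import numpy as np

def fit(X, y):
    # Solve the normal equations (X^T X) f = X^T y.
    return np.linalg.solve(X.T @ X, X.T @ y)

def r_squared(y, y_hat):
    return 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - np.mean(y)) ** 2)

# Illustrative data generated from a parabola.
x = np.linspace(-2.0, 2.0, 21)
y = x ** 2

X_line = np.column_stack([x, np.ones_like(x)])          # straight line
X_quad = np.column_stack([x ** 2, x, np.ones_like(x)])  # adds an x^2 feature

print(r_squared(y, X_line @ fit(X_line, y)))   # close to zero
print(r_squared(y, X_quad @ fit(X_quad, y)))   # essentially one
```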