Let's look at the problem of prediction from the framework that we defined way back in the lecture on learning. Once again, we have many data points. This time, we'll use a slightly different notation. We have m data points, and each data point has n - 1 features. There's a reason why we're using n - 1; it will become clear shortly. And, of course, with each set of features, or instance, we have an output variable, possibly more than one, which is what we want to predict. Just as before, we have a large number of data points, each having a number of features, and given a new data point, we would like to know its value. In the case of learning, we had categories that we wanted to predict: whether this was a buyer or a browser, a positive comment or a negative comment. But now the y's are numbers, and we would like to predict the actual value of y. For example, the x_i's could be features about certain products, and the numbers y might be the prices they fetch in different markets, or the volume of sales you might expect based on the actions you have taken for each product. So, instead of trying to decide what class a new instance belongs to, you want to predict something about the instance, such as its price or how much it's going to sell: a value. In this sense, there's a close relationship between learning and prediction, but here we are trying to predict a value rather than a class. To a certain extent, what we studied in learning was also prediction, but we were predicting categories as opposed to actual values. When we talk about prediction in this lecture, we'll be predicting actual values. There is, of course, a close relationship: if you know the class of something you observe, you can make predictions about what it will do in the future, based on your understanding of how other instances of that class behave. But that requires a certain degree of reasoning, and it doesn't come directly from the data. For example, if you found that somebody was likely to be a buyer, you would naturally predict that they will go and click on an ad or buy something. That's not something that was learned from the data; it was part of how the data was coded, and it involved some degree of reasoning and preparation of the data, as well as inference once one has observed the class. It's a subtle but important point that one needs to understand: the distinction between predicting values and learning categories. In that sense, prediction, learning, and reasoning will all come together later in this lecture. But for the moment, let's talk about simply trying to predict the value y. In particular, we'll be talking about just one output variable y, given knowledge about a large number of instances where you have both the features and the y values. The technique we'll talk about is called linear prediction, or regression, or least squares, and it follows fairly naturally from our formulation of the learning problem in terms of finding the expected value of y given an instance x. It turns out, and this is very important for linear prediction, that the function f of x, which we are so familiar with from our previous lectures, also minimizes the error between the actual y and the predicted value f of x. In particular, it minimizes the expected sum of squares of this error.
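To make that last claim concrete, here is one way to write it in standard notation; this is a sketch of the usual statement of the result, not something taken verbatim from the lecture:

```latex
% Among all functions f, the expected squared error
%   E[(y - f(x))^2]
% is minimized by the conditional mean f*(x) = E[y | x].
f^{*} \;=\; \arg\min_{f}\; \mathbb{E}\!\left[\bigl(y - f(\mathbf{x})\bigr)^{2}\right],
\qquad
f^{*}(\mathbf{x}) \;=\; \mathbb{E}\left[\, y \mid \mathbf{x} \,\right]
```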
Now, the proof of this is not difficult, but we won't go into it here. Let's take it as given that if we want to find the expected value of y given x, we might as well minimize the error, or rather the sum of squares of the errors, between y and f of x. It's important to clarify the notation here because it could get a bit confusing. Bold x means one data point, which may consist of n - 1 different features. A bold x with a subscript is not one of the features; it is one particular instance, a whole data point. As far as output variables are concerned, we are dealing with only one, so bold y is just the ordinary, un-bold y in this case, and y_i is not one of the features but the output variable for the i-th data point. Now, let's take a look at what this means. For one particular data point, the error is the difference between the output variable y_i and the predicted value, f of the data point x_i. We square it, and take the average of that square across all the data points that we have. There are m data points, so we take the sum of all these squared errors and divide by m to get the average error, and that is nothing but the expected value of the error, as measured by the square of the difference between y and the predicted y. Even if we set aside the relationship between the expected value and the sum of squares, just think of it this way: we have a prediction f, which says that y should be f of x, and we have the actual values y_i, and we want to make sure the difference isn't too large. The most natural way to do that is to measure the differences and find an f which minimizes the average squared difference between y and f of x. Let's see what that looks like for a particular form of f. We're dealing with linear prediction, so we're saying that f is not just any function, but a particular type of function: a linear function. So f of x is x times some set of weights: we treat x_i as a row vector consisting of all the features, followed by a 1 because there will be a constant term, and then x_i transpose f is our prediction. We're essentially taking a linear combination of the features: f_1 times x_1 plus f_2 times x_2, and so on for every feature, plus some f_0 which is just a constant, and that's our function f. What we really want is that the differences between the predicted values and the actual y's are very small. If you write this as a matrix, what that works out to is a matrix, capital X, where each row is one particular data point: the n - 1 features of that data point sit in different columns, followed by the extra 1, giving n columns in all. Multiplying X by f, the unknown vector we are trying to find, should give us, as closely as possible, the output for each data point. So x_1 transpose f is almost y_1, x_2 transpose f is almost y_2, and so on. The extra 1 is there to make sure we don't just take a combination of the features, but also allow for a constant term in the linear combination. Please look through this matrix formulation and convince yourself that what we're really trying to do is minimize the difference between x_i transpose f and y_i.
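As a small sketch of this setup, here is how one might build the design matrix X with the extra column of ones and measure the average squared error for a candidate weight vector f. The data here is made up purely for illustration, and the helper name mean_squared_error is our own, not something defined in the lecture:

```python
import numpy as np

# Toy data: m data points, each with n - 1 features (values are made up).
m, n_minus_1 = 5, 2
rng = np.random.default_rng(0)
features = rng.normal(size=(m, n_minus_1))   # the x_i's
y = rng.normal(size=m)                       # the outputs y_i

# Design matrix X: each row is one data point's features plus a constant 1,
# so each row has n columns in total.
X = np.hstack([features, np.ones((m, 1))])

def mean_squared_error(f):
    """Average of (y_i - x_i^T f)^2 over all m data points."""
    predictions = X @ f          # x_i^T f for every row i
    errors = y - predictions
    return np.mean(errors ** 2)

# Any candidate weight vector f (of length n) can be scored this way.
f_candidate = np.zeros(n_minus_1 + 1)
print(mean_squared_error(f_candidate))
```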
And if you write it down in matrix form, we're really trying to find an f such that X times f is almost equal to y. Some of you might have seen matrix equations before and might wonder why we have this "almost equal to". The reason is that the number of rows in this matrix is much larger than the number of columns: you have a few features, maybe a few dozen, but you may have observed hundreds of thousands of data points. We can't solve this exactly because there are just too many rows and too few columns, so we have to solve it approximately. Of course, if the matrix were n by n, we could solve it exactly, which is what you learned in high school: how to solve a system of linear equations. Instead, we'll have to solve it approximately, using a technique called least squares. And here's how that works.
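As a quick sketch of where this is headed, and using NumPy's built-in least-squares routine rather than the derivation the lecture is about to walk through, the approximate solution of X f ≈ y can be computed like this (again with made-up toy data):

```python
import numpy as np

# Toy setup as before: design matrix X (with the column of ones) and outputs y.
rng = np.random.default_rng(0)
m = 5
features = rng.normal(size=(m, 2))
X = np.hstack([features, np.ones((m, 1))])
y = rng.normal(size=m)

# Least squares: find the f that minimizes ||X f - y||^2.
f_hat, residuals, rank, _ = np.linalg.lstsq(X, y, rcond=None)

print("estimated weights:", f_hat)     # the fitted vector f
print("predictions X f:", X @ f_hat)   # approximately equal to y
```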