So now, we'll go beyond least squares. Let me mention, however, that the technique we're going to build on this week is least squares, which has already been covered. What follows is essentially a broad overview of a variety of different areas dealing with different types of prediction models, and of the relationships between them, which are increasingly becoming apparent in the research community.

Let's first see what happens if we have categorical data. That is, the target variables that we are trying to predict, the y's, are not numbers but yes's and no's, which is essentially the learning problem that we had earlier. Well, the challenge here is that the x's are still numbers. So it might not be such a great idea to use something like a naive Bayes classifier, because one would have to discretize those numbers into categorical variables, which might not be appropriate. So one tends to try to use regression techniques. But using a linear regression when all one is trying to do is separate out the no's, which are the reds, from the blues, which may be the yes's, is quite brittle, because the points right near the line might get classified either way, and the distinctions one has to make are very small and subject to a lot of error. The other problem, of course, is that there may be no such line separating the data, and we'll come to that in a minute.

But just to make regression work for categorical data, the most common technique being used today is logistic regression, which essentially replaces f transpose x by a function of the form 1 / (1 + e^(-f^T x)). Let's see how this function behaves. Suppose f transpose x is a very large positive value. Then e to the minus f transpose x is close to zero, the denominator is close to one, and f(x) is close to one. On the other hand, if f transpose x becomes a large negative value, e to the minus f transpose x rapidly increases towards infinity, the denominator blows up, and f(x) goes close to zero. And when f transpose x is exactly zero, f(x) is one over one plus one, that is, a half. So, the speed at which the function swings between zero and one as f transpose x moves away from zero is what makes logistic regression particularly attractive for separating out the yes's from the no's. Of course, fitting such a function is not as easy as least squares, but I'll show how that can be done very soon.

The other problem with linear regression is that the data might not be separable using a line. So, for example, you might have a parabolic relationship, as we described earlier. To support more complex functions, and when one doesn't even know what kind of function to use, the technique called support vector machines has become very popular. There, the kind of function f is parameterized by kernel parameters, and these parameters are also learned along with the function itself. So, not only do we learn the function, we also learn the type of function from the data. And so, if it's a parabola, you learn a parabola. If it's something more complicated, like a circle with a hole inside, rather like a doughnut, the support vector machine will learn such functions as well.
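To make the behavior just described concrete, here is a minimal Python sketch (using numpy, which is my own choice; the lecture does not prescribe any library). It simply evaluates the logistic function 1 / (1 + e^(-f^T x)) for a few values of f transpose x and shows how quickly it saturates towards zero on one side and one on the other, with the value of one half exactly at zero.

```python
import numpy as np

def logistic(z):
    """Logistic function 1 / (1 + e^(-z)), where z plays the role of f-transpose-x."""
    return 1.0 / (1.0 + np.exp(-z))

# Sample values of f^T x, from strongly negative to strongly positive.
z_values = np.array([-10.0, -2.0, -0.5, 0.0, 0.5, 2.0, 10.0])

for z in z_values:
    p = logistic(z)
    label = "yes" if p >= 0.5 else "no"   # threshold at one half to separate yes's from no's
    print(f"f^T x = {z:6.1f}  ->  logistic = {p:.4f}  ->  {label}")

# The printout is near 0 for large negative f^T x, near 1 for large positive f^T x,
# and exactly 0.5 at f^T x = 0; the sharp swing near zero is what makes logistic
# regression attractive for separating the two classes.
```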
The point I'm trying to make is that, starting from linear least squares, one increases the complexity of the functions one is trying to learn, and we arrive at other types of regression and at support vector machines.

Way back in the early days of AI, one found many efforts to mimic the brain, which essentially resulted in the field of neural networks, where one tried to create structures that look somewhat like how neurons in the brain affect each other, based on their connections to other neurons. The first neural networks, created in the 50's by McCulloch, Pitts, Rosenblatt and others, essentially looked like linear combinations of inputs. So they were pretty much like least squares, in the sense that you would have a neural network which says these are neurons, and the input to this neuron is the rainfall, this one's the temperature, this one's the harvest rainfall, and then we have another one whose input is just one. The output would be the wine quality, and the weights, that is, the connections between these neurons, would be learned by a process of iterative refinement, which could essentially be looked upon as a way of solving the least squares problem. To deal with exactly the same problems of classification, we found logistic functions being included in neural networks. So essentially, this field was evolving in parallel to statistical prediction techniques.

Interest in neural networks sort of waned over the years. Along the way, of course, we figured out different types of neural networks which were multi-layer, and essentially, by combining logistic functions, linear functions, and other types of functions, one could learn more complex non-linear functions of the input variables. Hidden layers were introduced, which essentially create more complicated forms of f, parameterized by the weights, which would then get learned through a process of optimization, as we'll see shortly.

Finally, in recent years, in spite of the fact that interest in neural networks had waned, they have become interesting once again with the notion of feedback. Notice that in the networks shown here, all the links go forward from the input nodes to the output nodes. So they are feed-forward, even though they're multi-layer. But when we start having feedback, that is, when the output of a node in the middle is fed back into the previous nodes, the network starts behaving like a Bayesian belief network. In particular, feedback neural networks display the same kind of explaining-away effect which we found in Bayesian networks: once you figure out that a particular cause is more likely, other causes which might also contribute to the same node become less likely. So that makes this much more interesting than the older neural networks and has sparked much recent interest in this field. Deep belief networks, multi-layer feedback neural networks, temporal neural networks: all these are coming together, and deep relationships between them are being explored. We'll come to these points soon. But, for the moment, let's try to understand: if one has such a complex f, how would one learn the parameters of f?
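To give a concrete sense of what such a complex f looks like before we get to learning it, here is a minimal Python sketch of a feed-forward network with one hidden layer of logistic units (numpy, the two-unit hidden layer, and the particular weight values are my own illustrative assumptions, not taken from the lecture). The entries of W_hidden, b_hidden, w_out and b_out are exactly the parameters of f that a learning procedure would have to find.

```python
import numpy as np

def logistic(z):
    """Logistic activation: the same 1 / (1 + e^(-z)) used in logistic regression."""
    return 1.0 / (1.0 + np.exp(-z))

def feed_forward(x, W_hidden, b_hidden, w_out, b_out):
    """A one-hidden-layer feed-forward network: logistic hidden units, linear output.

    x        : input vector (e.g. rainfall, temperature, harvest rainfall)
    W_hidden : hidden-layer weights, one row per hidden neuron
    b_hidden : hidden-layer biases (the neuron whose input is just one)
    w_out    : weights from the hidden layer to the output neuron
    b_out    : bias of the output neuron
    """
    hidden = logistic(W_hidden @ x + b_hidden)   # non-linear hidden layer
    return w_out @ hidden + b_out                # linear combination at the output

# Illustrative, made-up parameter values for a 3-input, 2-hidden-unit network.
W_hidden = np.array([[ 0.8, -0.5,  0.3],
                     [-0.2,  0.6,  0.9]])
b_hidden = np.array([0.1, -0.3])
w_out    = np.array([1.5, -0.7])
b_out    = 0.2

x = np.array([0.6, 1.7, 1.2])   # scaled rainfall, temperature, harvest rainfall
print("predicted quality:", feed_forward(x, W_hidden, b_hidden, w_out, b_out))
```

Stacking more such layers, or allowing connections that feed back from later nodes to earlier ones, gives the richer families of networks mentioned above; how the weights themselves are learned is the question we turn to next.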