Whatever our model, one needs to minimize the error between f(x) and y, that is, between the predicted values and the real values you already have for the data. But f is complicated; there is no closed-form formula like the normal equations we had for linear least squares. So all of these methods, whether they are regression with a logistic function, support vector machines, or neural networks, learn their parameters through an iterative process. We start with some f, which is a parameterization of the function. It may not be linear like f-transpose-x; it could be some other parameterization with squares, exponentials, all kinds of things. But you have some set of parameters which tells you, for each choice of parameters, what the function f is, and we need to refine those parameters iteratively to find the f which minimizes the error. In the world of neural networks, the first such learning algorithm was called back-propagation, which, despite the name, has nothing to do with feedback in the network itself; it is simply an iterative algorithm for learning the parameters. In other areas, which were perhaps more mathematically grounded, straightforward optimization techniques were used to find the minimum of this error.

So, if we start off with a guess, and perhaps a slightly different f as a second guess, one way to start moving towards a value of f which is likely to have lower error is to use something called gradient descent. The formula looks a bit daunting, so let me explain it in simple language. The error epsilon(f) is a function of the parameters of f. One can measure how fast this function changes as one changes each component of f separately, which is nothing but the partial derivative of epsilon with respect to that component. String all those partial derivatives together and you get a vector, the gradient ∇_f ε. That vector tells you in which direction to move each component of f so as to reduce epsilon. For example, if increasing the first component of f decreases epsilon, you would choose to increase that component; if epsilon decreases only when you decrease the second component, then that element of the step would be negative. So you compute this vector using your current value of f, that is, your previous iterate, starting with f0 for example, and add some small multiple of this direction to your current guess to create a new guess; in the usual notation, f_{i+1} = f_i − η ∇_f ε(f_i) for a small step size η. Now, you need to compute these derivatives, so you use the difference between epsilon at f_i and epsilon at f_{i-1}, divided component-wise by f_i minus f_{i-1}, to get estimates of the derivative. That is why you needed two guesses to start with: you have to bootstrap the iteration.

This is a very popular and very simple method, and there are more sophisticated ways of finding the minimum of epsilon, such as Newton's iteration, where you use second derivatives in addition to first derivatives, and various other techniques from the mathematics literature. This works fine if you have numbers, that is, if the x's are actual numeric values as we have been assuming, but there are, of course, some caveats.
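To make the iteration concrete, here is a minimal Python sketch of gradient descent on a squared prediction error. It is an illustration under my own assumptions (a linear model f(x) = w·x and synthetic data), not code from the lecture, and for simplicity it estimates each partial derivative by perturbing one component at a time rather than by differencing two successive guesses as described above.

```python
import numpy as np

def epsilon(w, X, y):
    """Squared prediction error for the illustrative linear model f(x) = w . x."""
    return np.mean((X @ w - y) ** 2)

def finite_diff_grad(err, w, h=1e-6):
    """Estimate each partial derivative d(err)/d(w_j) by perturbing component j."""
    grad = np.zeros_like(w)
    for j in range(len(w)):
        w_h = w.copy()
        w_h[j] += h
        grad[j] = (err(w_h) - err(w)) / h
    return grad

def gradient_descent(X, y, w0, lr=0.1, steps=500):
    w = w0.astype(float).copy()
    for _ in range(steps):
        g = finite_diff_grad(lambda v: epsilon(v, X, y), w)
        w -= lr * g  # step each component in the direction that lowers epsilon
    return w

# Synthetic data: targets generated from known weights plus a little noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=100)

print(gradient_descent(X, y, w0=np.zeros(3)))  # roughly [2.0, -1.0, 0.5]
```

The same loop works for any parameterization of f; only the error function and the way the gradient is estimated change.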
This iteration works in most situations provided there is only one minimum, or you are starting very close to it. If there are many different minima, you might get stuck in a local minimum and never reach the true minimum value, so you need better techniques. The other issue is that the function f might not be parameterized by a single unconstrained set of values; there may be constraints. You might have a function made of, say, five different line segments which must meet at their connecting points, and you try to fit those five lines to your data. Then you have constraints, which makes the situation more complicated and more complex, but we won't go into those details right now.

The other problem is if the x's are not numbers. If at least some of the x's are categorical, like yes/no or red/blue/green, then you somehow have to convert them to a binary form, for example to zero-or-one indicator variables, as sketched below. Of course, you then end up with many more dimensions than you started with. Take words: if your categorical variables are words, you suddenly blow up the dimension of the space, because a single word can be drawn from a set of millions, so now you have a million-dimensional space with a zero or a one depending on whether that word is present. That tremendously complicates such methods, and they often become unviable when your categorical data is very high dimensional, so you have to use different techniques to learn parameters. You could also convert these categorical variables to real numbers through a process of fuzzy classification, which we won't talk about here, but essentially you treat high, low, and medium not as fixed categories but as degrees of membership, something being 50 percent high, 20 percent low, and so on, and then use such fuzzy membership to convert your categorical variables into real numbers. Once you have real numbers, you can use this technique. Or you could work in the space of categorical data itself, with techniques like neighborhood search, heuristic AI search, genetic algorithms, and so on, which don't rely on converting your parameters into a real space; but then you face a combinatorial explosion and have to use all kinds of techniques to find good ways of minimizing the error. And lastly, as we have seen in previous lectures, probabilistic models can deal with the probabilities of variables taking on particular categorical values, as opposed to having to convert those to numbers or deal with them directly. These techniques have become increasingly powerful and popular precisely because they let you work with such variables without getting into too much of an exponential, combinatorial blow-up.

Now, that's quite a mouthful, and many of you might not have got all of it. But those of you who are on the verge of graduate research might want to listen to that spiel a couple of times, because it gives a bigger picture of how learning parameters in predictive models works and what the possible options are.
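As a small illustration of the categorical-to-binary conversion mentioned above, here is a plain Python one-hot encoder; the colour values are just an assumed example. Each category gets its own 0/1 dimension, which is why a vocabulary of a million words turns into a million-dimensional space.

```python
def one_hot_encode(values, categories=None):
    """Turn categorical values into 0/1 indicator vectors, one dimension per category."""
    if categories is None:
        categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1
        rows.append(row)
    return categories, rows

cats, rows = one_hot_encode(["red", "blue", "green", "red"])
print(cats)  # ['blue', 'green', 'red']
print(rows)  # [[0, 0, 1], [1, 0, 0], [0, 1, 0], [0, 0, 1]]
```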
There are some more linkages which I'd like to explore. Here we are trying to minimize the error in a predictive model. In other situations, we might instead need to maximize something: one might want to find the best way to transport goods across the country, or the best way to allocate one's advertising funds across different channels. There you are trying to maximize some function, and that is closely related to the problem of minimizing a prediction error. Similar techniques apply, gradient descent, Newton's method and various others, and the same problems of categorical data, local minima, and constraints apply as well.

Then, once one has decided where one wants to go, one needs to figure out how to get there. Typically, if you are driving a car, for example, you have a system whose angle of steering is, say, theta, and every time you change theta the car moves in a certain direction, so your system ends up in a new state. The goal is to figure out which sequence of control actions will ultimately minimize the difference between the state of your system and where you want that state to be. There could be other constraints too, for example that the path should be fairly smooth, so you get all the problems of constraints again. Again, it is a minimization problem. But here, unlike in the first two cases of minimizing a prediction error or optimizing a function, each iteration is actually executed: what you are dealing with is the response of a real system rather than a function in the abstract. An actual control system will make a change, see what the output is, and make another change; again, techniques like gradient descent, Newton's method, and so on start applying, but now what you are adjusting are control actions.

So, from predicting where things are going to go, to figuring out how best to use that information to optimize one's own goals, to deciding how to change one's behavior or one's system so as to reach the goal state, all of these are tightly related, and I hope you've seen that the mathematics is related too. Now let's talk about some of the applications. Before we do that: in practice, whether you're dealing with support vector machines, logistic regression, neural networks, or what have you, you won't ever really have to implement these things yourself. There are packages available that let you apply any of these techniques without having to get into the math or into the iteration. The important thing, of course, is choosing which package and which kind of technique to use, and when we do the applications I'll try to summarize that from my experience. And then we'll move on to deep belief networks.
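Before moving on, here is one more purely illustrative Python sketch, my own toy example rather than anything from the lecture, of the control-style iteration just described: at each step the controller picks an action by gradient descent on the gap between the predicted next state and the goal, then actually executes that action and observes the new state.

```python
def system_step(state, action):
    """Toy dynamics (an assumption): the state moves by half the chosen action."""
    return state + 0.5 * action

def choose_action(state, goal, lr=0.5, inner_steps=100, h=1e-4, max_action=3.0):
    """Pick the action minimizing the one-step-ahead error by gradient descent."""
    err = lambda a: (system_step(state, a) - goal) ** 2
    a = 0.0
    for _ in range(inner_steps):
        grad = (err(a + h) - err(a - h)) / (2 * h)  # finite-difference derivative
        a -= lr * grad
    return max(-max_action, min(max_action, a))     # respect an actuator limit

state, goal = 0.0, 10.0
for t in range(10):
    action = choose_action(state, goal)
    state = system_step(state, action)  # execute the action and observe the response
    print(t, round(state, 2))           # the state closes in on the goal step by step
```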