Whatever our model, one needs to minimize the error between f(x) and y, that is, between the predicted values and the real values you already have for the data. But f is complicated; there is no closed-form formula like the normal equations we had for linear least squares. So all of these methods, whether they are regression with a logistic function, support vector machines, or neural networks, learn their parameters through an iterative process. We start with some f, which is a parameterization of the function. It may not be linear like f-transpose-x; it could be some other parameterization with squares, exponentials, all kinds of things. But you have some set of parameters which tells you, for each choice of parameters, what the function f is, and we need to refine those parameters iteratively to find the f which minimizes the error. In the world of neural networks, the first such learning algorithm was called back-propagation, which, despite the name, has nothing to do with feedback in the network itself; it is simply an iterative algorithm for learning the parameters. In other areas, which were perhaps more mathematically grounded, straightforward optimization techniques were used to find the minimum of this error.

So, if we start off with a guess, and perhaps a slightly different f as a second guess, one way to start moving towards a value of f which is likely to have lower error is to use something called gradient descent. The formula looks a bit daunting, so let me explain it in simple language. The error epsilon(f) is a function of the parameters of f. One can measure how fast this function changes as one changes each component of f separately, which is nothing but the partial derivative of epsilon with respect to that component. String all those partial derivatives together and you get a vector, the gradient ∇_f ε. That vector tells you in which direction to move each component of f so as to reduce epsilon. For example, if increasing the first component of f decreases epsilon, you would choose to increase that component; if epsilon decreases only when you decrease the second component, then that element of the step would be negative. So you compute this vector using your current value of f, that is, your previous iterate, starting with f0 for example, and add some small multiple of this direction to your current guess to create a new guess; in the usual notation, f_{i+1} = f_i − η ∇_f ε(f_i) for a small step size η. Now, you need to compute these derivatives, so you use the difference between epsilon at f_i and epsilon at f_{i-1}, divided component-wise by f_i minus f_{i-1}, to get estimates of the derivative. That is why you needed two guesses to start with: you have to bootstrap the iteration.

This is a very popular and very simple method, and there are more sophisticated ways of finding the minimum of epsilon, such as Newton's iteration, where you use second derivatives in addition to first derivatives, and various other techniques from the mathematics literature. This works fine if you have numbers, that is, if the x's are actual numeric values as we have been assuming, but there are, of course, some caveats.
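To make the iteration concrete, here is a minimal Python sketch of gradient descent on a squared prediction error. It is an illustration under my own assumptions (a linear model f(x) = w·x and synthetic data), not code from the lecture, and for simplicity it estimates each partial derivative by perturbing one component at a time rather than by differencing two successive guesses as described above.

```python
import numpy as np

def epsilon(w, X, y):
    """Squared prediction error for the illustrative linear model f(x) = w . x."""
    return np.mean((X @ w - y) ** 2)

def finite_diff_grad(err, w, h=1e-6):
    """Estimate each partial derivative d(err)/d(w_j) by perturbing component j."""
    grad = np.zeros_like(w)
    for j in range(len(w)):
        w_h = w.copy()
        w_h[j] += h
        grad[j] = (err(w_h) - err(w)) / h
    return grad

def gradient_descent(X, y, w0, lr=0.1, steps=500):
    w = w0.astype(float).copy()
    for _ in range(steps):
        g = finite_diff_grad(lambda v: epsilon(v, X, y), w)
        w -= lr * g  # step each component in the direction that lowers epsilon
    return w

# Synthetic data: targets generated from known weights plus a little noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=100)

print(gradient_descent(X, y, w0=np.zeros(3)))  # roughly [2.0, -1.0, 0.5]
```

The same loop works for any parameterization of f; only the error function and the way the gradient is estimated change.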
This iteration works in most situations provided there is only one minimum, or you are starting very close to it. If there are many different minima, you might get stuck in a local minimum and never reach the true minimum value, so you need better techniques. The other issue is that the function f might not be parameterized by a single unconstrained set of values; there may be constraints. You might have a function made of, say, five different line segments which must meet at their connecting points, and you try to fit those five lines to your data. Then you have constraints, which makes the situation more complicated and more complex, but we won't go into those details right now.

The other problem is if the x's are not numbers. If at least some of the x's are categorical, like yes/no or red/blue/green, then you somehow have to convert them to a binary form, for example to zero-or-one indicator variables, as sketched below. Of course, you then end up with many more dimensions than you started with. Take words: if your categorical variables are words, you suddenly blow up the dimension of the space, because a single word can be drawn from a set of millions, so now you have a million-dimensional space with a zero or a one depending on whether that word is present. That tremendously complicates such methods, and they often become unviable when your categorical data is very high dimensional, so you have to use different techniques to learn parameters. You could also convert these categorical variables to real numbers through a process of fuzzy classification, which we won't talk about here, but essentially you treat high, low, and medium not as fixed categories but as degrees of membership, something being 50 percent high, 20 percent low, and so on, and then use such fuzzy membership to convert your categorical variables into real numbers. Once you have real numbers, you can use this technique. Or you could work in the space of categorical data itself, with techniques like neighborhood search, heuristic AI search, genetic algorithms, and so on, which don't rely on converting your parameters into a real space; but then you face a combinatorial explosion and have to use all kinds of techniques to find good ways of minimizing the error. And lastly, as we have seen in previous lectures, probabilistic models can deal with the probabilities of variables taking on particular categorical values, as opposed to having to convert those to numbers or deal with them directly. These techniques have become increasingly powerful and popular precisely because they let you work with such variables without getting into too much of an exponential, combinatorial blow-up.

Now, that's quite a mouthful, and many of you might not have got all of it. But those of you who are on the verge of graduate research might want to listen to that spiel a couple of times, because it gives a bigger picture of how learning parameters in predictive models works and what the possible options are.
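As a small illustration of the categorical-to-binary conversion mentioned above, here is a plain Python one-hot encoder; the colour values are just an assumed example. Each category gets its own 0/1 dimension, which is why a vocabulary of a million words turns into a million-dimensional space.

```python
def one_hot_encode(values, categories=None):
    """Turn categorical values into 0/1 indicator vectors, one dimension per category."""
    if categories is None:
        categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1
        rows.append(row)
    return categories, rows

cats, rows = one_hot_encode(["red", "blue", "green", "red"])
print(cats)  # ['blue', 'green', 'red']
print(rows)  # [[0, 0, 1], [1, 0, 0], [0, 1, 0], [0, 0, 1]]
```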
There are some more linkages which I'd like to explore. Here we are trying to minimize the error in a predictive model. In other situations, we might instead need to maximize something: one might want to find the best way to transport goods across the country, or the best way to allocate one's advertising funds across different channels. There you are trying to maximize some function, and that is closely related to the problem of minimizing a prediction error. Similar techniques apply, gradient descent, Newton's method and various others, and the same problems of categorical data, local minima, and constraints apply as well.

Then, once one has decided where one wants to go, one needs to figure out how to get there. Typically, if you are driving a car, for example, you have a system whose angle of steering is, say, theta, and every time you change theta the car moves in a certain direction, so your system ends up in a new state. The goal is to figure out which sequence of control actions will ultimately minimize the difference between the state of your system and where you want that state to be. There could be other constraints too, for example that the path should be fairly smooth, so you get all the problems of constraints again. Again, it is a minimization problem. But here, unlike in the first two cases of minimizing a prediction error or optimizing a function, each iteration is actually executed: what you are dealing with is the response of a real system rather than a function in the abstract. An actual control system will make a change, see what the output is, and make another change; again, techniques like gradient descent, Newton's method, and so on start applying, but now what you are adjusting are control actions.

So, from predicting where things are going to go, to figuring out how best to use that information to optimize one's own goals, to deciding how to change one's behavior or one's system so as to reach the goal state, all of these are tightly related, and I hope you've seen that the mathematics is related too. Now let's talk about some of the applications. Before we do that: in practice, whether you're dealing with support vector machines, logistic regression, neural networks, or what have you, you won't ever really have to implement these things yourself. There are packages available that let you apply any of these techniques without having to get into the math or into the iteration. The important thing, of course, is choosing which package and which kind of technique to use, and when we do the applications I'll try to summarize that from my experience. And then we'll move on to deep belief networks.
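Before moving on, here is one more purely illustrative Python sketch, my own toy example rather than anything from the lecture, of the control-style iteration just described: at each step the controller picks an action by gradient descent on the gap between the predicted next state and the goal, then actually executes that action and observes the new state.

```python
def system_step(state, action):
    """Toy dynamics (an assumption): the state moves by half the chosen action."""
    return state + 0.5 * action

def choose_action(state, goal, lr=0.5, inner_steps=100, h=1e-4, max_action=3.0):
    """Pick the action minimizing the one-step-ahead error by gradient descent."""
    err = lambda a: (system_step(state, a) - goal) ** 2
    a = 0.0
    for _ in range(inner_steps):
        grad = (err(a + h) - err(a - h)) / (2 * h)  # finite-difference derivative
        a -= lr * grad
    return max(-max_action, min(max_action, a))     # respect an actuator limit

state, goal = 0.0, 10.0
for t in range(10):
    action = choose_action(state, goal)
    state = system_step(state, action)  # execute the action and observe the response
    print(t, round(state, 2))           # the state closes in on the goal step by step
```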