So now, we'll go beyond least squares. Let me mention, however, that the technique we're going to build on this week is least squares, which has already been covered. What follows is essentially a broad overview of a variety of different areas dealing with different types of prediction models, and of the relationships between them, which are increasingly becoming apparent in the research community.

Let's first see what happens if we have categorical data. That is, the target variables that we are trying to predict, the y's, are not numbers but yes's and no's, which is essentially the learning problem that we had earlier. Well, the challenge here is that the x's are still numbers. So it might not be such a great idea to use something like a naive Bayes classifier, because one would have to discretize those numbers into categorical variables, which might not be appropriate. So one tends to try to use regression techniques. But using a linear regression when all one is trying to do is separate out the no's, which are the reds, from the blues, which may be the yes's, is quite brittle, because the points right near the line might get classified either way, and the distinctions one has to make are very small and subject to a lot of error. The other problem, of course, is that there may be no such line separating the data, and we'll come to that in a minute.

But just to make regression work for categorical data, the most common technique being used today is logistic regression, which essentially replaces f transpose x by a function of the form 1 / (1 + e^(-f^T x)). Let's see how this function behaves. Suppose f transpose x is a very large positive value. Then e to the minus f transpose x is close to zero, the denominator is close to one, and f(x) is close to one. On the other hand, if f transpose x becomes a large negative value, e to the minus f transpose x rapidly increases towards infinity, the denominator blows up, and f(x) goes close to zero. And when f transpose x is exactly zero, f(x) is one over one plus one, that is, a half. So, the speed at which the function swings between zero and one as f transpose x moves away from zero is what makes logistic regression particularly attractive for separating out the yes's from the no's. Of course, fitting such a function is not as easy as least squares, but I'll show how that can be done very soon.

The other problem with linear regression is that the data might not be separable using a line. So, for example, you might have a parabolic relationship, as we described earlier. To support more complex functions, and when one doesn't even know what kind of function to use, the technique called support vector machines has become very popular. There, the kind of function f is parameterized by kernel parameters, and these parameters are also learned along with the function itself. So, not only do we learn the function, we also learn the type of function from the data. And so, if it's a parabola, you learn a parabola. If it's something more complicated, like a circle with a hole inside, rather like a doughnut, the support vector machine will learn such functions as well.
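To make the behavior just described concrete, here is a minimal Python sketch (using numpy, which is my own choice; the lecture does not prescribe any library). It simply evaluates the logistic function 1 / (1 + e^(-f^T x)) for a few values of f transpose x and shows how quickly it saturates towards zero on one side and one on the other, with the value of one half exactly at zero.

```python
import numpy as np

def logistic(z):
    """Logistic function 1 / (1 + e^(-z)), where z plays the role of f-transpose-x."""
    return 1.0 / (1.0 + np.exp(-z))

# Sample values of f^T x, from strongly negative to strongly positive.
z_values = np.array([-10.0, -2.0, -0.5, 0.0, 0.5, 2.0, 10.0])

for z in z_values:
    p = logistic(z)
    label = "yes" if p >= 0.5 else "no"   # threshold at one half to separate yes's from no's
    print(f"f^T x = {z:6.1f}  ->  logistic = {p:.4f}  ->  {label}")

# The printout is near 0 for large negative f^T x, near 1 for large positive f^T x,
# and exactly 0.5 at f^T x = 0; the sharp swing near zero is what makes logistic
# regression attractive for separating the two classes.
```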
The point I'm trying to make is that, starting from linear least squares, one increases the complexity of the functions one is trying to learn, and we arrive at other types of regression and at support vector machines.

Way back in the early days of AI, one found many efforts to mimic the brain, which essentially resulted in the field of neural networks, where one tried to create structures that look somewhat like how neurons in the brain affect each other, based on their connections to other neurons. The first neural networks, created in the 50's by McCulloch, Pitts, Rosenblatt and others, essentially looked like linear combinations of inputs. So they were pretty much like least squares, in the sense that you would have a neural network which says these are neurons, and the input to this neuron is the rainfall, this one's the temperature, this one's the harvest rainfall, and then we have another one whose input is just one. The output would be the wine quality, and the weights, that is, the connections between these neurons, would be learned by a process of iterative refinement, which could essentially be looked upon as a way of solving the least squares problem. To deal with exactly the same problems of classification, we found logistic functions being included in neural networks. So essentially, this field was evolving in parallel to statistical prediction techniques.

Interest in neural networks sort of waned over the years. Along the way, of course, we figured out different types of neural networks which were multi-layer, and essentially, by combining logistic functions, linear functions, and other types of functions, one could learn more complex non-linear functions of the input variables. Hidden layers were introduced, which essentially create more complicated forms of f, parameterized by the weights, which would then get learned through a process of optimization, as we'll see shortly.

Finally, in recent years, in spite of the fact that interest in neural networks had waned, they have become interesting once again with the notion of feedback. Notice that in the networks shown here, all the links go forward from the input nodes to the output nodes. So they are feed-forward, even though they're multi-layer. But when we start having feedback, that is, when the output of a node in the middle is fed back into the previous nodes, the network starts behaving like a Bayesian belief network. In particular, feedback neural networks display the same kind of explaining-away effect which we found in Bayesian networks: once you figure out that a particular cause is more likely, other causes which might also contribute to the same node become less likely. So that makes this much more interesting than the older neural networks and has sparked much recent interest in this field. Deep belief networks, multi-layer feedback neural networks, temporal neural networks: all these are coming together, and deep relationships between them are being explored. We'll come to these points soon. But, for the moment, let's try to understand: if one has such a complex f, how would one learn the parameters of f?
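To give a concrete sense of what such a complex f looks like before we get to learning it, here is a minimal Python sketch of a feed-forward network with one hidden layer of logistic units (numpy, the two-unit hidden layer, and the particular weight values are my own illustrative assumptions, not taken from the lecture). The entries of W_hidden, b_hidden, w_out and b_out are exactly the parameters of f that a learning procedure would have to find.

```python
import numpy as np

def logistic(z):
    """Logistic activation: the same 1 / (1 + e^(-z)) used in logistic regression."""
    return 1.0 / (1.0 + np.exp(-z))

def feed_forward(x, W_hidden, b_hidden, w_out, b_out):
    """A one-hidden-layer feed-forward network: logistic hidden units, linear output.

    x        : input vector (e.g. rainfall, temperature, harvest rainfall)
    W_hidden : hidden-layer weights, one row per hidden neuron
    b_hidden : hidden-layer biases (the neuron whose input is just one)
    w_out    : weights from the hidden layer to the output neuron
    b_out    : bias of the output neuron
    """
    hidden = logistic(W_hidden @ x + b_hidden)   # non-linear hidden layer
    return w_out @ hidden + b_out                # linear combination at the output

# Illustrative, made-up parameter values for a 3-input, 2-hidden-unit network.
W_hidden = np.array([[ 0.8, -0.5,  0.3],
                     [-0.2,  0.6,  0.9]])
b_hidden = np.array([0.1, -0.3])
w_out    = np.array([1.5, -0.7])
b_out    = 0.2

x = np.array([0.6, 1.7, 1.2])   # scaled rainfall, temperature, harvest rainfall
print("predicted quality:", feed_forward(x, W_hidden, b_hidden, w_out, b_out))
```

Stacking more such layers, or allowing connections that feed back from later nodes to earlier ones, gives the richer families of networks mentioned above; how the weights themselves are learned is the question we turn to next.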