Let's now revisit machine learning, as we studied in the Listen chapter, where the naive Bayes classifier was one method for machine learning. Let's try to do this a little more formally, in a manner that makes some of the material that comes later fall naturally into the same formal framework.

We have a bunch of data: in this case, features X1 to Xn, which we call X, capital X, a vector, a collection of features. Earlier we had different examples, you know, the query terms that somebody used, the words in a comment, and in general anything that we observe about instances which we then want to classify into different classes. Then we have some output variables, which are the labels that we would like to assign: whether a positive or a negative sentiment, or a buyer or a browser. In this case, we'll say the output variable is zero if it's a buyer and one if it's a browser, or zero if it's negative and one if it's positive, and so on. In general, we might have many different output variables, each of them 0/1.

The points in the space X could be shown as points in some high-dimensional space. I've shown them here as just points in the plane, but in general each of these will have many coordinates, n for example. Some of them are red, which means they have a zero assigned to them, and some of them are blue, which means they have a one assigned to them, in terms of the Y variable. The goal of classification is to figure out, given a new point, whether it's likely to be red or blue. In general this need not be a binary task, and you may want to classify into three, four, or many classes, but for the moment we stick to binary classification, as it makes things a bit simpler.

Now, to formally model the classification problem, given a bunch of features in a space, we define a function, let's say f(X), which is the expected value of the output variable Y given the input variable X. Now, this is a little bit different from what we had earlier, which was the probability of some variable Y being equal to zero or one given a particular combination of X's, but as we shall soon see it is really the same thing, at least for classification. For other types of machine learning we'll have different forms of f, and that's why this framework is useful: it unifies our whole concept of machine learning.

So what we're trying to do is figure out the expected value of Y. Is it a zero or a one? Is it closer to zero or closer to one, given a particular combination X? Well, since Y is zero or one, the expected value is nothing but one times the probability that Y equals one given X, plus zero times the probability that Y equals zero given X. Since the second term is multiplied by zero, it disappears, and we simply get the probability that Y equals one given X. And this is exactly what we had estimated earlier using various assumptions, like independence, and Bayes' rule. We figured out that all we needed was a training set from which we could compute the required likelihoods, which would enable us to compute the probability of Y equal to one given a combination X.

Now, that's fairly basic, but this formulation tells us a few things. First, it's not always necessary to compute this probability of Y equal to one given X directly; we did that through some approximations in naive Bayes. We might instead want to compute this function f(X).
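As a quick check of that derivation, here is a minimal sketch, using a hypothetical, made-up probability value, showing that for a binary Y the expected value collapses to the probability that Y equals one given X, and that classifying by whether f(X) is more than a half is just picking the more probable class.

# Minimal sketch: for a binary label Y in {0, 1}, the expected value
# E[Y | X] = 1 * P(Y=1 | X) + 0 * P(Y=0 | X) = P(Y=1 | X).
# The probability values passed in below are hypothetical, for illustration only.

def expected_y(p_y1_given_x):
    """E[Y | X] for a binary Y, given an estimate of P(Y = 1 | X)."""
    p_y0_given_x = 1.0 - p_y1_given_x
    return 1 * p_y1_given_x + 0 * p_y0_given_x   # the second term vanishes

def classify(p_y1_given_x):
    """Predict class 1 when f(X) = E[Y | X] exceeds a half, else class 0."""
    return 1 if expected_y(p_y1_given_x) > 0.5 else 0

print(expected_y(0.7))   # 0.7 -- identical to P(Y=1 | X)
print(classify(0.7))     # 1
print(classify(0.2))     # 0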
In this case it turns out to be just the same as this probability. In other cases it might not be the same; for example, if you had a classifier with more than two classes, you might have a different form of f. The second thing it tells us is that computing this probability is only one way of figuring out where a particular combination X belongs. Another way might be to somehow estimate f and estimate the boundary where f(X) is more than or less than a half. For example, in this case, if you could figure out that the boundary lay somewhere along this line over here, so that we minimize the number of false positives and false negatives, we could find a much easier way to classify the points as being positive or negative than actually computing this rather complex animal, which is the a posteriori probability.

Directly trying to find estimates of f is what more sophisticated machine learning techniques, like support vector machines, end up doing. They do other complicated things, like changing the nature of the space: instead of dealing with the space as it's given, they use various combinations of the x's, sometimes squaring them, sometimes doing other funny things to them, to project the space into a different space, and then find a line or a plane which nicely separates the positive and negative instances. So what they're really doing is estimating the boundary of f directly, without actually computing the a posteriori probability itself. Now, that's another long-winded way of defining machine learning, but it will serve to unify the different types of learning that we'll see as we go along this week.

Let's take a look at some examples using this new notation that we just introduced. Recall our machine learning task of deciding whether somebody is a buyer or a browser, based on the queries that they issue in a search box. In this case the query words that we had last time were red, flower, gift, or cheap, and we had various instances of people querying and then possibly buying or not buying. So our X was essentially the set of query words, and Y was whether or not one buys. So we have (Y, X) as our space: B, R, F, G, and C in the notation we had last time.

Similarly, we had machine learning of positive or negative comments. In this case, the sentiment was the Y, and the set of all words formed our X, which is an extremely high-dimensional space. A word could either be present or absent in a comment, and all of these would still be binary variables.

We turn to another example now. Imagine a baby observing various animals and wanting to figure out which animal has what name. So we have various features: the size of the animal, the size of its head, the noise the animal makes, the number of legs it has, and the animal itself. The baby observes many instances, and somehow it is able to discern these features, in terms of whether the size of the head is large, small, or medium, whether the animal itself is large or small, the noise that the animal makes, etc. The machine learning task is then to classify a new animal into the appropriate category: lion, cat, elephant, etc. So here, in (Y, X), the animal itself is Y, and the size of the animal, its head size, its noise, and its number of legs are the features; it's a fixed set. Here the four features are multi-valued categorical variables, so they take values in specific categories. They're not real numbers, for example; each takes one of a few categories: small, medium, large, for example.
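Since these four features are categorical rather than binary, encoding them takes one extra step. Below is a minimal sketch, with a handful of invented observations (these animals and feature values are assumptions for illustration, not data from the lecture), of one way to code each categorical value as an integer and fit a categorical naive Bayes model from scikit-learn to classify a new animal.

from sklearn.naive_bayes import CategoricalNB

# Hypothetical observations: features are (animal size, head size, noise,
# number of legs), all categorical, and Y is the animal's name.
observations = [
    (("large", "large", "roar", "4"), "lion"),
    (("small", "small", "meow", "4"), "cat"),
    (("large", "large", "trumpet", "4"), "elephant"),
    (("small", "small", "meow", "4"), "cat"),
]

# CategoricalNB expects each categorical value encoded as a small integer,
# so build a simple code book per feature column.
columns = list(zip(*[x for x, _ in observations]))
codebooks = [{v: i for i, v in enumerate(sorted(set(col)))} for col in columns]

X = [[codebooks[j][v] for j, v in enumerate(x)] for x, _ in observations]
y = [label for _, label in observations]

model = CategoricalNB()
model.fit(X, y)

# Classify a new animal from its observed categorical features.
x_new = [[codebooks[j][v] for j, v in enumerate(("small", "small", "meow", "4"))]]
print(model.predict(x_new)[0])   # expected: 'cat'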
And lastly, we consider another example where we have customers going to supermarkets and buying bunches of products together in transactions. A transaction might consist of milk, diapers, and some cola; another one might consist of some diapers and beer; yet more transactions will have different products. Here we have, interestingly, no output variable, only features, which are just the items that people buy. The items are again multi-valued categorical variables, but they form a variable-sized set, just like the comments that we had earlier: we could consider the set of all possible items and have binary variables, or we could have a variable number of features that are multi-valued categorical variables.

These examples are all cases where one could do various kinds of machine learning. In the first three examples, that is, queries, comments, and animals, there was a clear output variable which indicated the class of the particular instance, and so one could imagine a supervised machine learning scenario, using something like a naive Bayes classifier, to compute the likelihoods and estimate the a posteriori probability, or, as we have just seen, the expected value of the class given X. In the last example, transactions, there was no output variable, so the task there is a little bit different, which we shall come to very soon.

Do go back over the formalism and make sure that you're able to figure out how classification happens in each of these cases, that is, exactly how the Y and the X are formulated so that we get a formal representation of the problem in terms of features and an output variable. In the particular case of transactions, of course, there is no output variable, as we have already mentioned. And let's now get into the reason why we did that.
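Before we do, here is a minimal sketch of the transactions representation just described: flattening each transaction into binary variables over the set of all possible items. The transactions below are just the illustrative ones mentioned above, and there is no Y to supervise.

# Transactions have no output variable y -- only features, the items bought.
# One representation: a binary variable per possible item, as described above.
transactions = [
    {"milk", "diapers", "cola"},
    {"diapers", "beer"},
]

# The set of all possible items observed across the transactions.
items = sorted(set().union(*transactions))

# Flatten each transaction into a binary feature vector over that item set.
X = [[1 if item in t else 0 for item in items] for t in transactions]

print(items)   # ['beer', 'cola', 'diapers', 'milk']
print(X)       # [[0, 1, 1, 1], [1, 0, 1, 0]]

With no output variable attached to these vectors, the learning task here is different from the classification examples above, which is exactly the point we turn to next.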