Let's now revisit machine learning, as we studied in the Listen chapter, where the naive Bayes classifier was one method for machine learning. Let's try to do this a little more formally, in a manner that makes some of the material that comes later fall naturally into the same formal framework.

We have a bunch of data: in this case, features X1 to Xn, which we call X, capital X, a vector, a collection of features. Earlier we had different examples, you know, the query terms that somebody used, the words in a comment, and in general anything that we observe about instances which we then want to classify into different classes. Then we have some output variables, which are the labels that we would like to assign: whether a positive or a negative sentiment, or a buyer or a browser. In this case, we'll say the output variable is zero if it's a buyer and one if it's a browser, or zero if it's negative and one if it's positive, and so on. In general, we might have many different output variables, each of them 0/1.

The points in the space X could be shown as points in some high-dimensional space. I've shown them here as just points in the plane, but in general each of these will have many coordinates, n for example. Some of them are red, which means they have a zero assigned to them, and some of them are blue, which means they have a one assigned to them, in terms of the Y variable. The goal of classification is to figure out, given a new point, whether it's likely to be red or blue. In general this need not be a binary task, and you may want to classify into three, four, or many classes, but for the moment we stick to binary classification, as it makes things a bit simpler.

Now, to formally model the classification problem, given a bunch of features in a space, we define a function, let's say f(X), which is the expected value of the output variable Y given the input variable X. Now, this is a little bit different from what we had earlier, which was the probability of some variable Y being equal to zero or one given a particular combination of X's, but as we shall soon see it is really the same thing, at least for classification. For other types of machine learning we'll have different forms of f, and that's why this framework is useful: it unifies our whole concept of machine learning.

So what we're trying to do is figure out the expected value of Y. Is it a zero or a one? Is it closer to zero or closer to one, given a particular combination X? Well, since Y is zero or one, the expected value is nothing but one times the probability that Y equals one given X, plus zero times the probability that Y equals zero given X. Since the second term is multiplied by zero, it disappears, and we simply get the probability that Y equals one given X. And this is exactly what we had estimated earlier using various assumptions, like independence, and Bayes' rule. We figured out that all we needed was a training set from which we could compute the required likelihoods, which would enable us to compute the probability of Y equal to one given a combination X.

Now, that's fairly basic, but this formulation tells us a few things. First, it's not always necessary to compute this probability of Y equal to one given X directly; we did that through some approximations in naive Bayes. We might instead want to compute this function f(X).
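As a quick check of that derivation, here is a minimal sketch, using a hypothetical, made-up probability value, showing that for a binary Y the expected value collapses to the probability that Y equals one given X, and that classifying by whether f(X) is more than a half is just picking the more probable class.

# Minimal sketch: for a binary label Y in {0, 1}, the expected value
# E[Y | X] = 1 * P(Y=1 | X) + 0 * P(Y=0 | X) = P(Y=1 | X).
# The probability values passed in below are hypothetical, for illustration only.

def expected_y(p_y1_given_x):
    """E[Y | X] for a binary Y, given an estimate of P(Y = 1 | X)."""
    p_y0_given_x = 1.0 - p_y1_given_x
    return 1 * p_y1_given_x + 0 * p_y0_given_x   # the second term vanishes

def classify(p_y1_given_x):
    """Predict class 1 when f(X) = E[Y | X] exceeds a half, else class 0."""
    return 1 if expected_y(p_y1_given_x) > 0.5 else 0

print(expected_y(0.7))   # 0.7 -- identical to P(Y=1 | X)
print(classify(0.7))     # 1
print(classify(0.2))     # 0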
In this case it turns out to be just the same as this probability. In other cases it might not be the same; for example, if you had a classifier with more than two classes, you might have a different form of f. The second thing it tells us is that computing this probability is only one way of figuring out where a particular combination X belongs. Another way might be to somehow estimate f and estimate the boundary where f(X) is more than or less than a half. For example, in this case, if you could figure out that the boundary lay somewhere along this line over here, so that we minimize the number of false positives and false negatives, we could find a much easier way to classify the points as being positive or negative than actually computing this rather complex animal, which is the a posteriori probability.

Directly trying to find estimates of f is what more sophisticated machine learning techniques, like support vector machines, end up doing. They do other complicated things, like changing the nature of the space: instead of dealing with the space as it's given, they use various combinations of the x's, sometimes squaring them, sometimes doing other funny things to them, to project the space into a different space, and then find a line or a plane which nicely separates the positive and negative instances. So what they're really doing is estimating the boundary of f directly, without actually computing the a posteriori probability itself. Now, that's another long-winded way of defining machine learning, but it will serve to unify the different types of learning that we'll see as we go along this week.

Let's take a look at some examples using this new notation that we just introduced. Recall our machine learning task of deciding whether somebody is a buyer or a browser, based on the queries that they issue in a search box. In this case the query words that we had last time were red, flower, gift, or cheap, and we had various instances of people querying and then possibly buying or not buying. So our X was essentially the set of query words, and Y was whether or not one buys. So we have (Y, X) as our space: B, R, F, G, and C in the notation we had last time.

Similarly, we had machine learning of positive or negative comments. In this case, the sentiment was the Y, and the set of all words formed our X, which is an extremely high-dimensional space. A word could either be present or absent in a comment, and all of these would still be binary variables.

We turn to another example now. Imagine a baby observing various animals and wanting to figure out which animal has what name. So we have various features: the size of the animal, the size of its head, the noise the animal makes, the number of legs it has, and the animal itself. The baby observes many instances, and somehow it is able to discern these features, in terms of whether the size of the head is large, small, or medium, whether the animal itself is large or small, the noise that the animal makes, etc. The machine learning task is then to classify a new animal into the appropriate category: lion, cat, elephant, etc. So here, in (Y, X), the animal itself is Y, and the size of the animal, its head size, its noise, and its number of legs are the features; it's a fixed set. Here the four features are multi-valued categorical variables, so they take values in specific categories. They're not real numbers, for example; each takes one of a few categories: small, medium, large, for example.
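Since these four features are categorical rather than binary, encoding them takes one extra step. Below is a minimal sketch, with a handful of invented observations (these animals and feature values are assumptions for illustration, not data from the lecture), of one way to code each categorical value as an integer and fit a categorical naive Bayes model from scikit-learn to classify a new animal.

from sklearn.naive_bayes import CategoricalNB

# Hypothetical observations: features are (animal size, head size, noise,
# number of legs), all categorical, and Y is the animal's name.
observations = [
    (("large", "large", "roar", "4"), "lion"),
    (("small", "small", "meow", "4"), "cat"),
    (("large", "large", "trumpet", "4"), "elephant"),
    (("small", "small", "meow", "4"), "cat"),
]

# CategoricalNB expects each categorical value encoded as a small integer,
# so build a simple code book per feature column.
columns = list(zip(*[x for x, _ in observations]))
codebooks = [{v: i for i, v in enumerate(sorted(set(col)))} for col in columns]

X = [[codebooks[j][v] for j, v in enumerate(x)] for x, _ in observations]
y = [label for _, label in observations]

model = CategoricalNB()
model.fit(X, y)

# Classify a new animal from its observed categorical features.
x_new = [[codebooks[j][v] for j, v in enumerate(("small", "small", "meow", "4"))]]
print(model.predict(x_new)[0])   # expected: 'cat'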
And lastly, we consider another example where we have customers going to supermarkets and buying bunches of products together in transactions. A transaction might consist of milk, diapers, and some cola; another one might consist of some diapers and beer; yet more transactions will have different products. Here we have, interestingly, no output variable, only features, which are just the items that people buy. The items are again multi-valued categorical variables, but they form a variable-sized set, just like the comments that we had earlier: we could consider the set of all possible items and have binary variables, or we could have a variable number of features that are multi-valued categorical variables.

These examples are all cases where one could do various kinds of machine learning. In the first three examples, that is, queries, comments, and animals, there was a clear output variable which indicated the class of the particular instance, and so one could imagine a supervised machine learning scenario, using something like a naive Bayes classifier, to compute the likelihoods and estimate the a posteriori probability, or, as we have just seen, the expected value of the class given X. In the last example, transactions, there was no output variable, so the task there is a little bit different, which we shall come to very soon.

Do go back over the formalism and make sure that you're able to figure out how classification happens in each of these cases, that is, exactly how the Y and the X are formulated so that we get a formal representation of the problem in terms of features and an output variable. In the particular case of transactions, of course, there is no output variable, as we have already mentioned. And let's now get into the reason why we did that.
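Before we do, here is a minimal sketch of the transactions representation just described: flattening each transaction into binary variables over the set of all possible items. The transactions below are just the illustrative ones mentioned above, and there is no Y to supervise.

# Transactions have no output variable y -- only features, the items bought.
# One representation: a binary variable per possible item, as described above.
transactions = [
    {"milk", "diapers", "cola"},
    {"diapers", "beer"},
]

# The set of all possible items observed across the transactions.
items = sorted(set().union(*transactions))

# Flatten each transaction into a binary feature vector over that item set.
X = [[1 if item in t else 0 for item in items] for t in transactions]

print(items)   # ['beer', 'cola', 'diapers', 'milk']
print(X)       # [[0, 1, 1, 1], [1, 0, 1, 0]]

With no output variable attached to these vectors, the learning task here is different from the classification examples above, which is exactly the point we turn to next.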