So far, whatever we've done in machine learning has assumed that the data in the training set is labeled with some class: positive or negative, buy or browse, the type of animal, and so on. But in the real world, nobody is always telling you what the classes are. In some cases you can measure it, for example by figuring out which users actually bought something and which didn't. But in cases like sentiment, it's difficult to imagine how one would decide whether a sentence is positive or negative without some human actually doing the labeling. So the question does arise: how do classes emerge from the data without any external human input? In the real world, how do we figure out that an animal is an animal, a table is a table, a chair is a chair, and so on, without someone explicitly telling us what is an object and what is not?

Well, the answer is clustering. In clustering, groups of similar users or user queries are clumped, or clustered, together based on the terms that they contain. So if a large number of queries contained "red flowers", "yellow flowers", "cheap flowers", et cetera, then all of these queries would get clumped into a cluster which essentially talks about flowers and their color. Similarly, comments which often use words like "great", "love" and "excellent" would naturally go together and form a cluster of positive sentiment, whereas others with words like "hate", "uncomfortable" and "very difficult" would fall into another cluster, which would hopefully be the negative ones.

Now, those are ideal situations. In practice, the clusters that emerge directly from the data need not map so nicely onto the classes that we actually expect. In the real world, for example, we might be looking at observations of animals with features like the number of legs, the size of the head and things like that, and we might group the animals which appear similar in terms of head size, the number of legs they have, or the noises that they make. Perhaps we would get clusters which roughly correspond to the animal classes that we would naturally name, and that is probably how we actually assign classes to objects in the real world in the first place.

An important point in all of this, in figuring out what classes there are from the data, is that we assume the features on the basis of which classes emerge are given. This is very important to note, because clustering is highly dependent on which features one chooses. Later on this week, we will see how we might be able to find the features themselves from the data itself. But for the moment, let's assume that the features are given.
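Just to make this concrete before we go on, here is a minimal sketch of the kind of term-based query clustering described above. It is only an illustration: the query list, the bag-of-words term counts used as features, the choice of two clusters, and the use of scikit-learn are all assumptions made for this example, not something fixed in the lecture.

```python
# A minimal sketch of clustering queries by the terms they contain.
# Assumptions (not from the lecture): scikit-learn is installed, term counts
# are the features, and we ask for exactly two clusters.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans

queries = [
    "red flowers", "yellow flowers", "cheap flowers",
    "great hotel, loved it", "excellent food, love this place",
    "hate the room, very uncomfortable",
]

# Features: one dimension per term; each query becomes a vector of term counts.
X = CountVectorizer().fit_transform(queries)

# Group the queries purely by which terms they share.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

for label, query in zip(labels, queries):
    print(label, query)
```

Notice that the clusters depend entirely on the features we chose: with term counts, the flower queries end up together because they share vocabulary, and a different choice of features would give different clusters.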
So the goal of clustering is to find regions of our space X, remember our space X with features X1 to XN, that are more populated than random data would be. Let's look at this statement a little carefully. First, "regions" means that things in the same region are close together, or similar to each other, simply because their feature values, the X values, are close in some respects. Some features agree, some don't, and even where they don't agree the values are near one another; for example, "large" and "extra large" are closer to each other than either is to "small". Secondly, these regions are not just regions which have a few instances in them, but are more populated than one would otherwise expect if the data were totally random, in the sense that the X values for each of the features were chosen at random from all the possible values they could take.

So what we are trying to find are regions where the probability of X falling in that particular region, given the data that we have, is larger than the probability that X would fall there if the data were generated uniformly, that is, if it were completely random data with every element of X chosen at random.

Another way of looking at this is the following. Suppose we have all our data, which is the black dots in the space here. In clustering we don't have an output variable, since we don't have classes; we want to find the classes. So let's set y equal to one for all the dots which are actually there in our data, that is, the actual observations that we experience in the real world. Now imagine that we add to this data some other data which we choose at random: we just choose X values at random and throw them into the data set, and those are the lighter-colored dots. We throw a lot of them in, and we assign them the value y equal to zero.

With this definition, our data is no longer just the old data but the old data plus this random data, and we can ask how to figure out what the clusters are. If we define our F of X just like before, as the expected value of Y given X, where Y is one for the data that we actually observed and zero for the data that we added at random, then the expected value of Y given X is essentially the probability of X under our data divided by the probability of X under our data plus the probability of X under the random data. If you just work this out, it is the ratio R over one plus R, because you can divide the numerator and the denominator by P0, the probability under the random data, and you get R over one plus R, where R is the ratio of the two probabilities.

This function will have extreme values. It will be very large in some areas, because a lot of the dots in that part of the space have a one associated with them, and it will not be large in other areas, where there are no black dots but there are a lot of random dots. So finding the regions where this function is large gives us our clusters.

Now, this is quite important when we have big data, because with big data we can actually afford to do this kind of clustering, and it can sometimes even be efficient. But it is not normally done, because big data is quite new. The traditional means of clustering are k-means clustering, agglomerative or hierarchical clustering, and even the locality-sensitive hashing that we discussed in week one is a form of clustering, because it groups similar items together in an unsupervised manner. Of course, it doesn't care whether the clusters are big or small, so it often doesn't give us great clusters. But it can definitely be used as a first step towards more careful clustering, where we try to find regions which actually have large values of F, that is, which contain a large number of points from our data, all of which are close together.
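The picture just described, real observations labeled one and uniformly random points labeled zero, can be sketched in a few lines. This is only an illustration under assumptions of my own: NumPy and scikit-learn are available, the "real" data is two Gaussian clumps in two dimensions, a k-nearest-neighbour classifier stands in for whatever estimator of F of X one prefers, and 0.5 is an arbitrary threshold.

```python
# Sketch: find dense regions by contrasting real data (y = 1) against
# uniformly random data (y = 0) and estimating F(x) = E[Y | X = x].
# The data, the classifier, and the threshold are illustrative assumptions.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# "Real" observations: two dense clumps (the black dots in the picture).
real = np.vstack([
    rng.normal(loc=(2.0, 2.0), scale=0.3, size=(200, 2)),
    rng.normal(loc=(7.0, 7.0), scale=0.3, size=(200, 2)),
])

# Random background: X values drawn uniformly over the whole space
# (the lighter-colored dots), labeled zero.
background = rng.uniform(low=0.0, high=10.0, size=(400, 2))

X = np.vstack([real, background])
y = np.concatenate([np.ones(len(real)), np.zeros(len(background))])

# Any classifier that outputs probabilities estimates F(x) = E[Y | X = x],
# which is large exactly where the real data is denser than the background.
model = KNeighborsClassifier(n_neighbors=25).fit(X, y)

# Regions where F(x) is large are the clusters; here we just check that the
# real points, which lie in the two clumps, get high values of F.
F = model.predict_proba(real)[:, 1]
print("fraction of real points with F(x) > 0.5:", (F > 0.5).mean())
```

Thresholding F, or looking for connected regions where F stays large, would recover the two clumps as clusters without any labels ever being supplied, and the same estimate could equally serve as a first step towards the more careful clustering mentioned above.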
Once we have clusters, one can assign a class label to each cluster and say, okay, cluster one is class A, cluster two is class B. We can't really give them meaningful names, because we don't really know how to name them. As human beings, we figure out how to name them through language and collective agreement, but the computer doesn't know how to do that. So clustering is one way for classes to emerge from data in an unsupervised manner, and we have already seen one means of clustering, locality-sensitive hashing, even if it doesn't give a great set of clusters. We won't really study the other means in this course; there are other courses where you can get into all kinds of clustering algorithms, as well as supervised classification algorithms. We are going to move on to look at other kinds of learning.

The message here is that clustering allows us to get classes from data without having to do any explicit labeling, but an important aspect is that one needs to have the features, because otherwise you don't have any basis on which to cluster, and if you choose your features wrongly then you will get the wrong clusters.

Another nice point that I'd like you to note is that we used the same formulation, F of X equal to the expected value of Y given X, with an appropriate definition of our space as including not only the original data points but also some random data. So we can use the same mechanism of defining the problem in terms of the function F of X. While this doesn't normally yield any practical benefit, it certainly allows us to understand both classification and clustering, and some of the other techniques that we will study very soon, within the same formalism. It also allows us to imagine a situation where, if one were able to efficiently find decision boundaries of functions like F of X, or find regions where F of X is large or small, one could solve classification, clustering, and a whole bunch of other machine learning techniques with one set of methods. However, this remains a research area, even though much work has been done in this direction in the past, and it assumes even more importance now with big data, where learning directly from the data often yields much better results than if one had only very small amounts of data.
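For reference, the unified view can be written out in one line. This is just a restatement of the derivation from earlier in this lecture, with p1 denoting the density of the observed data, p0 the density of the random data we added, and the simplifying assumption that we add roughly as many random points as we have real ones:

```latex
F(x) = \mathbb{E}[Y \mid X = x] = P(Y = 1 \mid X = x)
     = \frac{p_1(x)}{p_1(x) + p_0(x)}
     = \frac{R(x)}{1 + R(x)},
\qquad \text{where } R(x) = \frac{p_1(x)}{p_0(x)}.
```

In supervised classification the same F of X is fit against labels a human provided; in clustering it is fit against labels we manufactured by adding random data, which is exactly the sense in which the two problems share one formalism.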