[MUSIC] So this idea of performing unsupervised learning sounds really crazy. We're given no labels ever, and somehow we're supposed to output labels. Well, there are two things that sometimes allow us to do this. One is the definition of what a cluster is: what is the structure that we're trying to extract from the data? For example, maybe we're looking for elliptical clusters. And the second thing is the structure of the data itself. In some cases, the problem is fairly easy. Here's an example of an easy problem, where the data are pretty well separated. I could go in and say, okay, I'm going to call this one cluster, this another one, and this another one. So you could imagine an algorithm that could uncover this structure from the data. But in other cases, it's basically impossible. Here, I really don't know how I would want to cluster this data if I had to choose three different clusters. Maybe I would do something like this if I had to, question mark, but it turns out that would be very wrong. If we look at the true labels, shown by the colors of the data points here, they're all mixed together, and it's just not clear how we could extract them from this data without some other information about what distinguishes the blue cluster from the green and the pink. But then there are cases where it's a little more plausible. Maybe my guess would have been fairly reasonable, something like this, and indeed that's what the data actually tell us. In most applications that we're interested in, we're in this in-between scenario, where the clustering isn't immediately clear just from visualizing the data. But we can't hope to successfully perform unsupervised learning in scenarios that look like the impossible case.
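To make the "easy" versus "impossible" scenarios concrete, here is a small sketch, not from the lecture itself, that quantifies how overlap limits what any clustering method can do. It draws two Gaussian clusters and labels each point by its nearest true cluster mean; even this oracle, which knows the true centers, mislabels many points once the clusters overlap, so no label-free algorithm could fully recover the labels either. All names and parameter choices here are my own for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000  # points per cluster

def oracle_accuracy(separation):
    """Draw n points from each of two unit Gaussians whose means are
    `separation` apart, then label each point by its nearest true mean."""
    means = np.array([[0.0, 0.0], [separation, 0.0]])
    X = np.vstack([rng.normal(m, 1.0, size=(n, 2)) for m in means])
    true = np.repeat([0, 1], n)
    # Distance from every point to each true mean; pick the closer one.
    d = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
    return np.mean(d.argmin(axis=1) == true)

easy = oracle_accuracy(8.0)   # well-separated clusters, like the easy example
hard = oracle_accuracy(0.5)   # heavily overlapping, like the impossible example
print(f"well separated: {easy:.2f}, overlapping: {hard:.2f}")
```

With a separation of 8 standard deviations the oracle agrees with the true labels almost perfectly, while at a separation of 0.5 it is wrong for a large fraction of points, which is exactly the "mixed together" situation described above.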
But regardless of the data structure, the thing that's really crucial to performance is how we think about defining a cluster. For example, here are six different, very popular toy examples of challenging clustering problems, where the grouping just jumps out at us visually, especially since in this case the points are colored. But even if they weren't colored, you could probably draw out which observations you would say go together in one group versus another. These clusters sure don't look like the simple ellipses we saw on the previous few slides. And if you provide this data to an algorithm that just measures the distance from each observation to a cluster center, you get results that don't match the clusters that jumped out at you visually. So for each of these examples, we're showing what a clustering might look like under a distance-to-nearest-cluster-center type of objective. The point of these slides is that clustering can be really challenging. Even when the data look really clean, nice, and well separated, it can be hard to define what it means to be a cluster and to devise algorithms that discover it. So you have to think very carefully about the implications of the data that you're providing and the model, or the algorithm, that you're specifying. As we go through the next set of algorithms in this module and the coming modules, it's useful to think about the challenges we might face based on the structure of our data. [MUSIC]
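The distance-to-nearest-cluster-center objective mentioned above can be sketched as Lloyd's algorithm (plain k-means). This is a minimal illustration, not the lecture's own code: run on two concentric rings, one of the classic toy examples, it splits the data with a straight line instead of recovering the rings, because each cluster is implicitly an elliptical blob around a center. The function names and data construction are my own.

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Lloyd's algorithm: assign points to nearest center, recompute centers."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each observation to its nearest cluster center (Euclidean).
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Move each center to the mean of its assigned observations.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

# Two concentric rings: inner ring = cluster 0, outer ring = cluster 1.
theta = np.linspace(0, 2 * np.pi, 200, endpoint=False)
inner = np.c_[np.cos(theta), np.sin(theta)]      # radius 1
outer = 3 * np.c_[np.cos(theta), np.sin(theta)]  # radius 3
X = np.vstack([inner, outer])
true = np.repeat([0, 1], 200)

pred = kmeans(X, k=2)
# Best agreement over the two possible label matchings.
acc = max(np.mean(pred == true), np.mean(pred != true))
print(f"agreement with ring labels: {acc:.2f}")  # near 0.5: the rings get cut by a line
```

With two centers, the decision boundary between clusters is always the perpendicular bisector of the centers, a straight line, so the visually obvious ring structure is unrecoverable under this objective no matter how the centers move.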