[MUSIC]

So this idea of performing unsupervised learning sounds really crazy. We're given no labels ever, and somehow we're supposed to output labels. Well, there are two things that sometimes allow us to do this. One is the definition of what a cluster is: what is the structure we're trying to extract from the data? For example, maybe we're looking for elliptical clusters. The second thing that sometimes allows us to do this is the structure of the data itself.

In some cases, the problem is fairly easy. Here's an example of an easy problem where the data are pretty well separated. I could go in and say, okay, I'm going to call this one cluster, this another one, and this another one. And so you could imagine an algorithm that could uncover this structure from the data.

But in other cases, it's basically impossible. Here, I really don't know how I would want to cluster this data if I had to choose three different clusters. Maybe I would do something like this if I had to, question mark; it turns out that would be very wrong. In this case, we can look at the true labels, shown by the colors of the data points here.
They're all mixed together, and it's just not clear how we could extract that structure from the data without some other information telling us what distinguishes the blue cluster from the green and the pink.

But then there are cases where it's a little more plausible. Maybe my guess would have been fairly reasonable, something like this. And indeed, that's what the data actually tell us. In most applications we're interested in, we're in this in-between scenario, where the answer isn't immediately clear just from visualizing the data. But we can't hope to successfully perform unsupervised learning in scenarios that look like the impossible case.

Regardless of the data's structure, though, the thing that's really crucial to performance is how we define a cluster. For example, here are six very popular toy examples of challenging clustering problems, where the grouping visually jumps out at us, especially since in this case the points are colored. But even without the colors, you could probably draw out which observations belong together in one group versus another. These clusters, though, sure don't look like the simple ellipses we saw in the previous few slides.
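To make the elliptical, distance-to-center notion of a cluster concrete, here is a minimal sketch of Lloyd's k-means loop in Python with NumPy. The three Gaussian blobs and the one-center-per-blob initialization are illustrative assumptions, not the lecture's actual data; on well-separated data like this, the nearest-center objective recovers the groups easily:

```python
import numpy as np

rng = np.random.default_rng(0)

# Three well-separated Gaussian blobs: the "easy" case, with
# made-up locations chosen far apart relative to their spread.
blob_centers = np.array([[0.0, 0.0], [10.0, 0.0], [5.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in blob_centers])

def kmeans(X, init_centers, n_iter=20):
    """Plain Lloyd's algorithm: alternate nearest-center assignment
    and recomputing each center as the mean of its assigned points."""
    centers = init_centers.astype(float).copy()
    for _ in range(n_iter):
        # Distance from every point to every center, then assign
        # each point to its nearest center.
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Move each center to the mean of its cluster.
        centers = np.array([X[labels == j].mean(axis=0)
                            for j in range(len(centers))])
    return labels, centers

# For simplicity, seed one center inside each blob; a real
# implementation would use k-means++ or multiple random restarts.
labels, centers = kmeans(X, X[[0, 50, 100]])
```

Because the blobs are compact and far apart, every point in a blob ends up with the same label, which is exactly the structure an algorithm could plausibly uncover here.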
And if you provide this data to an algorithm that just measures the distance from each observation to a cluster center, you get results that don't match the groupings that jumped out at you visually. For each of these examples, we're showing what a clustering might look like under a nearest-cluster-center type of objective.

So the point of these slides is that clustering can be really challenging. Even when the data look clean, nice, and well separated, it can be hard to define what it means to be a cluster and to devise algorithms that discover that structure. You have to think very carefully about the implications of the data you're providing and the model or algorithm you're specifying. As we go through the next set of algorithms in this module and the coming modules, it's useful to keep in mind the challenges we might face based on the structure of our data.

[MUSIC]
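To see that failure mode concretely, here is the same nearest-center loop applied to two concentric rings, a stand-in for the non-elliptical toy examples above (the ring data and the deterministic initialization are made up for illustration). The objective carves the plane into half-planes around the two centers, mixing the rings:

```python
import numpy as np

# Two concentric rings: a classic non-elliptical shape where
# clustering by distance to a cluster center breaks down.
theta = np.linspace(0.0, 2.0 * np.pi, 100, endpoint=False)
inner = np.c_[np.cos(theta), np.sin(theta)]          # radius 1
outer = 5.0 * np.c_[np.cos(theta), np.sin(theta)]    # radius 5
X = np.vstack([inner, outer])
y_true = np.repeat([0, 1], 100)  # true ring membership

# Plain Lloyd's k-means with k = 2 and a fixed initialization
# (two points on opposite sides, so the run is deterministic).
centers = np.array([[5.0, 0.0], [-5.0, 0.0]])
for _ in range(20):
    # Assign each point to its nearest center, then update centers.
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = d.argmin(axis=1)
    centers = np.array([X[labels == j].mean(axis=0) for j in range(2)])

# The recovered "clusters" are the left and right halves of the
# plane: each one contains points from both rings, so the ring
# structure is completely lost.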