1
00:00:00,000 --> 00:00:04,065
[MUSIC]

2
00:00:04,065 --> 00:00:08,000
So this leads directly into this
discussion that clustering is

3
00:00:08,000 --> 00:00:09,840
an unsupervised task.

4
00:00:09,840 --> 00:00:12,610
So let's talk a little bit
about what this means.

5
00:00:12,610 --> 00:00:16,760
Because so far in this specialization
we've assumed that we've been presented

6
00:00:16,760 --> 00:00:20,350
with a set of outputs
associated with our inputs.

7
00:00:20,350 --> 00:00:24,319
So for example in the regression course
when we're talking about predicting

8
00:00:24,319 --> 00:00:26,455
house prices, in our training data set,

9
00:00:26,455 --> 00:00:30,482
we assume that not only do we have the
features of the house which are the input,

10
00:00:30,482 --> 00:00:34,356
but we also have the price, the sales
price of the house, which is the output.

11
00:00:34,356 --> 00:00:38,721
In the classification course,
we assume that we had restaurant reviews,

12
00:00:38,721 --> 00:00:42,464
we had the text of those restaurant
reviews, which was our input, and

13
00:00:42,464 --> 00:00:46,828
now we also had the output, whether it
is a positive or a negative review for

14
00:00:46,828 --> 00:00:48,797
the reviews in our training set.

15
00:00:51,066 --> 00:00:54,980
So, if we were to imagine that we
had labels in this application,

16
00:00:54,980 --> 00:00:59,610
maybe we would have labels of
articles in terms of topics.

17
00:00:59,610 --> 00:01:04,410
So for these articles, the input again
would be the text of the article,

18
00:01:04,410 --> 00:01:09,960
just like with reviews but
the output would be a topic like sports or

19
00:01:09,960 --> 00:01:13,412
world news or entertainment or science.

20
00:01:13,412 --> 00:01:17,260
And so in this case,
we can think of the problem as a multiclass

21
00:01:17,260 --> 00:01:20,450
classification problem,
where imagine you get some new article and

22
00:01:20,450 --> 00:01:25,220
you want to label what topic
that article is about.

23
00:01:25,220 --> 00:01:31,000
Well then you just go
into your training set

24
00:01:31,000 --> 00:01:34,620
and based on the classifier that you've
learned from that training set where

25
00:01:34,620 --> 00:01:40,430
labels were provided, you can then think
about labeling this document as one of

26
00:01:40,430 --> 00:01:42,450
these five different
topics presented here.

27
00:01:44,570 --> 00:01:49,640
So this is an example of a supervised
learning problem, because outputs

28
00:01:49,640 --> 00:01:54,891
are provided or these class labels
are provided on our training examples.

29
00:01:54,891 --> 00:01:58,616
But when we're thinking about doing
clustering we're assuming there

30
00:01:58,616 --> 00:02:03,100
are no labels provided, or that
the labels are really, really unreliable.

31
00:02:03,100 --> 00:02:07,641
And the goal here is to uncover
this grouping structure

32
00:02:07,641 --> 00:02:11,793
directly from the data,
the input values alone.

33
00:02:11,793 --> 00:02:16,114
So in particular,
here we might imagine that we have

34
00:02:16,114 --> 00:02:20,250
documents with just two
words in the vocabulary.

35
00:02:20,250 --> 00:02:25,488
So let's say here's word one,

36
00:02:25,488 --> 00:02:30,726
counts, and here's word two,

37
00:02:30,726 --> 00:02:36,740
counts, and each one of these points

38
00:02:36,740 --> 00:02:42,570
is a given document, document xi.

39
00:02:46,250 --> 00:02:48,900
And so in particular, our input

40
00:02:48,900 --> 00:02:53,536
to the algorithm is going to be
documents represented as vectors, xi.

41
00:02:53,536 --> 00:02:59,210
But then the outputs are going
to be cluster labels.

42
00:03:01,050 --> 00:03:06,075
So in particular,
maybe we'll label this red cluster

43
00:03:06,075 --> 00:03:10,989
as cluster 1,
this green cluster as cluster 2, and

44
00:03:10,989 --> 00:03:16,579
this blue cluster as cluster 3,
and then, associated with xi,

45
00:03:16,579 --> 00:03:23,612
the output might be zi = 1,
that it's associated with cluster 1,

46
00:03:23,612 --> 00:03:31,300
and we are going to output this for
every observation in our data set.

47
00:03:31,300 --> 00:03:37,154
Just to be clear, maybe this red cluster
corresponds to articles about world news,

48
00:03:37,154 --> 00:03:41,606
this green cluster corresponds
to articles about science, and

49
00:03:41,606 --> 00:03:46,240
this blue cluster corresponds
to articles about entertainment.

50
00:03:47,600 --> 00:03:50,230
So we're not provided with those labels,
but

51
00:03:50,230 --> 00:03:55,830
we're going to group articles, color them,
as we're doing in this picture,

52
00:03:55,830 --> 00:04:00,320
based on similarities in
the observed input space.

53
00:04:00,320 --> 00:04:05,710
So in this simple 2D example, this
would be similarities between articles,

54
00:04:05,710 --> 00:04:09,000
based on a simple two word vocabulary and

55
00:04:09,000 --> 00:04:12,910
counting how many times those words
appear in each of these articles.
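As a minimal sketch of that representation, assuming a hypothetical two-word vocabulary and made-up article text (neither comes from the lecture), each document becomes a vector xi of word counts:
```python
from collections import Counter
# Hypothetical two-word vocabulary; the words are illustrative only.
VOCAB = ["game", "election"]
def count_vector(text, vocab=VOCAB):
    """Return [count of word 1, count of word 2] for one document."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocab]
# Made-up documents standing in for articles.
docs = ["the game was a close game", "election night and the election results"]
vectors = [count_vector(d) for d in docs]
```
Each entry of `vectors` is one point in the 2D plot of word one counts against word two counts.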

56
00:04:15,030 --> 00:04:19,080
Okay, well this is an example of
an unsupervised learning task because

57
00:04:19,080 --> 00:04:23,870
all we're presenting our
algorithm with are the inputs and

58
00:04:23,870 --> 00:04:27,760
the algorithm is supposed to produce
a set of cluster labels as the output.

59
00:04:29,400 --> 00:04:31,520
Well, how can we think
about learning such labels?

60
00:04:32,810 --> 00:04:37,020
Well a key component of that question is
thinking about what defines a cluster?

61
00:04:38,420 --> 00:04:41,085
And once we've defined
what a cluster means,

62
00:04:41,085 --> 00:04:45,064
then we need to define algorithms for
inferring these clusterings.

63
00:04:45,064 --> 00:04:50,465
Okay, so
a cluster is defined by its center,

64
00:04:50,465 --> 00:04:54,483
which is often called a centroid,

65
00:04:54,483 --> 00:04:58,650
as well as the shape of the cluster.

66
00:04:58,650 --> 00:05:04,440
And in this application here,
the clusters are defined by ellipses.

67
00:05:04,440 --> 00:05:08,972
So here, this red circle is
an example of an ellipse, and

68
00:05:08,972 --> 00:05:16,810
this green cluster has this elongated
ellipse that's at one angle and

69
00:05:16,810 --> 00:05:20,980
the blue cluster has another smaller
ellipse at a different rotation.

70
00:05:22,270 --> 00:05:26,700
So the family of shapes
that we're exploring is ellipses,

71
00:05:26,700 --> 00:05:31,940
different rotations and
stretchings of these ellipses, but

72
00:05:31,940 --> 00:05:34,590
that's what we're using
to define a cluster.

73
00:05:36,420 --> 00:05:40,032
And then,
when we have a given observation, again,

74
00:05:40,032 --> 00:05:45,130
observation xi,
which is a specific document.

75
00:05:45,130 --> 00:05:50,020
The way we think about assigning that
document to a given cluster is based on

76
00:05:50,020 --> 00:05:53,770
something we'll call
the score under each cluster.

77
00:05:53,770 --> 00:05:59,200
So often we just think about defining
this score as the distance to the cluster

78
00:05:59,200 --> 00:06:04,460
center, so we can think about,
let me choose another color just for

79
00:06:04,460 --> 00:06:08,660
fun to make this stand out
a little bit more on this slide.

80
00:06:09,790 --> 00:06:15,098
So the distances to
the cluster centers are these

81
00:06:15,098 --> 00:06:20,570
distances here, and

82
00:06:20,570 --> 00:06:26,910
what we see is that the smallest distance
is the distance to this red cluster.

83
00:06:26,910 --> 00:06:32,637
So if the distance to the cluster
center alone is what defines the score,

84
00:06:32,637 --> 00:06:37,961
then this document xi would be
assigned to this red cluster.
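A minimal sketch of this assignment rule, assuming Euclidean distance as the score and made-up centroid coordinates (the lecture gives no numbers; `assign_cluster` is a hypothetical name):
```python
import math
# Made-up centroids in (word 1 count, word 2 count) space, keyed by cluster label.
centroids = {1: (1.0, 5.0), 2: (6.0, 6.0), 3: (5.0, 1.0)}
def assign_cluster(x, centroids):
    """Return the label z of the centroid closest to x in Euclidean distance."""
    return min(centroids, key=lambda k: math.dist(x, centroids[k]))
# A made-up document vector xi; zi is its assigned cluster label.
x_i = (2.0, 4.0)
z_i = assign_cluster(x_i, centroids)
```
This uses distance to the center only, so it ignores the shape of each cluster's ellipse.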

85
00:06:37,961 --> 00:06:42,329
[MUSIC]