1
00:00:00,000 --> 00:00:03,493
[MUSIC]

2
00:00:08,207 --> 00:00:10,590
We're gonna assume that
no labels are provided.

3
00:00:12,380 --> 00:00:18,200
And we're gonna aim to infer groups of
related articles which are Clusters.

4
00:00:19,730 --> 00:00:23,420
Here the input is gonna be a vector.

5
00:00:23,420 --> 00:00:27,420
So every observation that we're
plotting here is our word count vector.

6
00:00:29,450 --> 00:00:33,430
And in this case, we're just looking at a
very simple example with a vocabulary that

7
00:00:33,430 --> 00:00:34,530
only has two words.

8
00:00:34,530 --> 00:00:41,780
So we have a vector,
we have word 1 and word 2.

9
00:00:41,780 --> 00:00:48,349
And this axis here is word 2 and
this axis is word 1.

10
00:00:48,349 --> 00:00:54,127
Of course, remember that in reality we
tend to have very large vocabularies and

11
00:00:54,127 --> 00:00:57,417
we have these big high
dimensional vectors.

12
00:00:57,417 --> 00:01:01,377
So, when we're plotting our observations,
they're really in this high dimensional

13
00:01:01,377 --> 00:01:05,410
space but for visualization,
let's just look at this 2D representation.

14
00:01:05,410 --> 00:01:10,240
So we have a whole bunch of
documents here, all represented

15
00:01:10,240 --> 00:01:14,330
by their word counts over these two
different words in the vocabulary.

16
00:01:14,330 --> 00:01:18,068
Okay, so that's the input to
a clustering algorithm, and

17
00:01:18,068 --> 00:01:20,690
output is gonna be cluster labels.

18
00:01:20,690 --> 00:01:23,891
And so
what I mean is that this observation and

19
00:01:23,891 --> 00:01:28,956
all of these observations here,
[SOUND] all of these get labeled as red.

20
00:01:28,956 --> 00:01:33,550
So maybe they get applied,
some cluster label one.

21
00:01:33,550 --> 00:01:34,396
So let's just call this.

22
00:01:38,715 --> 00:01:40,070
Cluster 1.

23
00:01:40,070 --> 00:01:44,790
So for every document,
it's gonna get some label.

24
00:01:44,790 --> 00:01:46,510
So that'll be labelled one.

25
00:01:46,510 --> 00:01:50,920
Then all of these observations here
they're gonna get some other label.

26
00:01:50,920 --> 00:01:53,590
And let's assume that this is cluster 2.

27
00:01:53,590 --> 00:01:55,885
So this observation gets the label 2.

28
00:01:57,500 --> 00:02:01,065
And all of these observations
here would get the label 3.

29
00:02:02,600 --> 00:02:05,200
So that's gonna be the output
of this algorithm.

30
00:02:06,850 --> 00:02:09,190
And maybe what you could do is post facto,

31
00:02:09,190 --> 00:02:12,830
you could go through and
look at some articles in cluster 1 and

32
00:02:12,830 --> 00:02:18,110
you could say that, this cluster
is really a cluster about sports.

33
00:02:18,110 --> 00:02:23,810
And I just wanna write down explicitly
that this label is provided post facto.

34
00:02:25,000 --> 00:02:27,770
Okay, well this is an example

35
00:02:27,770 --> 00:02:32,330
of an unsupervised learning task because
we're operating without any labels.

36
00:02:32,330 --> 00:02:34,300
All we have are observations and

37
00:02:34,300 --> 00:02:38,180
we're trying to uncover some
structure in these observations.

38
00:02:38,180 --> 00:02:42,298
So, again, just to reiterate,
the input are our word count vectors and

39
00:02:42,298 --> 00:02:45,133
the output is, for
every document in the corpus,

40
00:02:45,133 --> 00:02:48,802
we're gonna associate some
cluster label with that document.

41
00:02:51,321 --> 00:02:54,700
Okay, well, what defines a cluster?

42
00:02:54,700 --> 00:02:58,210
Well, every cluster is defined
by a cluster center, so

43
00:02:58,210 --> 00:03:00,595
maybe I'll mark the cluster
centers with Xs.

44
00:03:04,620 --> 00:03:07,079
And then there's the shape to the cluster,
and

45
00:03:07,079 --> 00:03:11,509
these ellipses are representing
the shapes of each of these clusters.

46
00:03:13,460 --> 00:03:16,940
And so, when we think about
whether this observation,

47
00:03:18,990 --> 00:03:25,440
this observation here should be assigned
to the green cluster, or the red cluster.

48
00:03:25,440 --> 00:03:30,940
What we're doing is we're
looking at how similar

49
00:03:30,940 --> 00:03:35,510
this article is, to other articles
based on the shape of this cluster.

50
00:03:36,510 --> 00:03:39,850
So we score every observation

51
00:03:39,850 --> 00:03:43,810
based on the cluster center as
well as the shape of that cluster.

52
00:03:43,810 --> 00:03:48,380
And in this case, because this cluster is,
has this kind of oblong-skewed shape,

53
00:03:48,380 --> 00:03:53,370
it actually gets assigned to the green
cluster instead of this red cluster.

54
00:03:53,370 --> 00:03:57,580
But another approach which is very common

55
00:03:57,580 --> 00:04:01,570
is instead of looking at the shape of the
cluster, we just look at cluster centers.

56
00:04:01,570 --> 00:04:06,120
So we just measure the distance
of this observation, so

57
00:04:06,120 --> 00:04:10,670
maybe let me change colors here,
so this alternative approach here.

58
00:04:10,670 --> 00:04:15,670
As we would just look at
the distance of this observation to

59
00:04:15,670 --> 00:04:20,570
the green cluster center
versus the red cluster center.

60
00:04:20,570 --> 00:04:24,129
And in this case, it would be very
challenging to decide whether that article

61
00:04:24,129 --> 00:04:27,240
should go with the green cluster or
the red cluster.

62
00:04:27,240 --> 00:04:31,492
But, there are other cases like this
observation here where pretty obvious

63
00:04:31,492 --> 00:04:35,087
with this metric that it would
get assigned to this red cluster.