1
00:00:00,164 --> 00:00:05,007
[MUSIC]

2
00:00:05,007 --> 00:00:09,302
So far we've been focusing on a document
retrieval task where somebody is reading

3
00:00:09,302 --> 00:00:13,410
an article and we just want to search over
all other articles in the corpus to find

4
00:00:13,410 --> 00:00:17,787
one that's most similar or very similar to
the article that the person is reading.

5
00:00:17,787 --> 00:00:21,691
But in a lot of cases, we're actually
interested in discovering a structured

6
00:00:21,691 --> 00:00:24,720
representation of our data,
of all the articles out there.

7
00:00:25,790 --> 00:00:28,690
So, one example of a structured
representation we might be

8
00:00:28,690 --> 00:00:32,040
interested in is discovering
a clustering of articles, so

9
00:00:32,040 --> 00:00:35,480
discovering groups of articles
that are related to one another.

10
00:00:35,480 --> 00:00:38,330
And we're going to discuss both
reasons why we would want to

11
00:00:38,330 --> 00:00:42,650
perform such clustering as well as
algorithms for performing this clustering.

12
00:00:43,980 --> 00:00:46,240
So let's dig into
the objective of clustering,

13
00:00:46,240 --> 00:00:48,000
as well as some motivating
applications for

14
00:00:48,000 --> 00:00:52,060
performing clustering within the context
of our document application.

15
00:00:52,060 --> 00:00:55,790
So the goal of clustering is to
discover groups of related articles.

16
00:00:55,790 --> 00:00:58,730
So, for example,
maybe there's a group where all

17
00:00:58,730 --> 00:01:03,190
the articles within this group
are on subjects related to sports.

18
00:01:03,190 --> 00:01:07,960
And maybe there's another group where
all the articles are about world news.

19
00:01:07,960 --> 00:01:12,170
If we can discover this type of structure
in our data, then when we're doing something

20
00:01:12,170 --> 00:01:15,480
like the document retrieval task
that we talked about before,

21
00:01:15,480 --> 00:01:18,110
and a person is reading
an article, we can look at

22
00:01:18,110 --> 00:01:20,270
what group that article falls in.

23
00:01:20,270 --> 00:01:22,410
And then when we go to
retrieve another article for

24
00:01:22,410 --> 00:01:26,930
that reader, then we can simply search
over the articles within that given group.

25
00:01:26,930 --> 00:01:28,950
So if a person's reading a sports article,

26
00:01:28,950 --> 00:01:33,030
maybe we can recommend another
sports article just by searching for

27
00:01:33,030 --> 00:01:37,170
the nearest neighbor within
this group of sports articles.

28
00:01:37,170 --> 00:01:40,330
But we can actually use clustering
to do even fancier things.

29
00:01:40,330 --> 00:01:44,700
So, as an example, maybe we're interested
in learning the preferences of a user

30
00:01:44,700 --> 00:01:46,980
over a set of different topics.

31
00:01:46,980 --> 00:01:52,800
In this case, we're assuming that the
person isn't just interested in sports.

32
00:01:52,800 --> 00:01:54,850
They might have other interests as well.

33
00:01:54,850 --> 00:01:58,040
And when we're going to present
an article to that user,

34
00:01:58,040 --> 00:02:01,300
maybe we'd want to explore some
of these other preferences.

35
00:02:01,300 --> 00:02:05,460
So in this case, if we imagine that we
have a clustering of our articles into

36
00:02:05,460 --> 00:02:07,650
these groups of related articles,

37
00:02:07,650 --> 00:02:12,780
then a user is going to read some
subset of the articles in the corpus,

38
00:02:12,780 --> 00:02:16,110
and so, these are the articles
that are not grayed out here.

39
00:02:16,110 --> 00:02:18,550
So these are all the articles
that the user read, and

40
00:02:18,550 --> 00:02:21,690
then we can imagine that the user gives
us some feedback about the articles,

41
00:02:21,690 --> 00:02:23,170
whether they like the article or not.

42
00:02:23,170 --> 00:02:26,110
So, a plus sign means
yes, they liked the article, and

43
00:02:26,110 --> 00:02:29,120
a minus sign means that
they did not like the article.

44
00:02:29,120 --> 00:02:33,920
So this user liked those
two articles in Cluster 1,

45
00:02:33,920 --> 00:02:38,470
the green cluster, as well as this article
in Cluster 4, the orange cluster,

46
00:02:38,470 --> 00:02:42,582
didn’t like the article that they
read in this blue Cluster, Cluster 2.

47
00:02:42,582 --> 00:02:46,225
Did not like this other
article read in Cluster 4,

48
00:02:46,225 --> 00:02:51,160
liked this article in this blue cluster,
Cluster 2 and so on.

49
00:02:52,550 --> 00:02:56,680
And after we get this feedback or
as we're getting it over time,

50
00:02:56,680 --> 00:03:03,050
we can use this to learn this type
of preference vector over topics.

51
00:03:03,050 --> 00:03:06,230
One thing that we're going to discuss
is that we don't actually have topic

52
00:03:06,230 --> 00:03:09,540
labels like sports, world news and so on.

53
00:03:09,540 --> 00:03:11,970
But we do know that there
are groups of articles and

54
00:03:11,970 --> 00:03:16,750
we can use that structure to select
articles from these groups.

55
00:03:16,750 --> 00:03:20,230
Or we can go in after the fact,
dig into these groups, and

56
00:03:20,230 --> 00:03:22,299
put labels on them ourselves.

57
00:03:22,299 --> 00:03:27,009
[MUSIC]