1
00:00:00,273 --> 00:00:04,003
[MUSIC]

2
00:00:04,003 --> 00:00:06,423
Okay, so that's one way to
retrieve a document of interest.

3
00:00:06,423 --> 00:00:09,262
Just take all articles out there,
scan over them, and

4
00:00:09,262 --> 00:00:13,103
find the one that's most similar
according to the metric that we define.

5
00:00:13,103 --> 00:00:17,987
But another thing we might be interested
in doing is clustering documents that

6
00:00:17,987 --> 00:00:20,440
are related. So, for example,

7
00:00:20,440 --> 00:00:25,400
you might have a whole bunch of
articles about sports or world news or

8
00:00:25,400 --> 00:00:31,200
different things like this, and if we
can structure our corpus in this way,

9
00:00:31,200 --> 00:00:34,890
if a person is reading
an article that's about sports,

10
00:00:34,890 --> 00:00:40,070
then we can very quickly search over
all the other articles about sports,

11
00:00:40,070 --> 00:00:44,260
instead of looking at every article
that's out there in the entire corpus.

12
00:00:44,260 --> 00:00:48,719
But the challenge here is the fact that
these articles aren't going to have

13
00:00:48,719 --> 00:00:49,300
labels.

14
00:00:49,300 --> 00:00:52,024
It's not gonna be like the New York Times,
where you go and

15
00:00:52,024 --> 00:00:54,260
somebody says,
this is an education article.

16
00:00:54,260 --> 00:00:57,778
Okay, all we have are articles,
and what we'd like to do is

17
00:00:57,778 --> 00:01:01,300
discover these underlying
groups of articles.

18
00:01:01,300 --> 00:01:06,270
Okay, so the goal is to discover these
groups or clusters of related articles,

19
00:01:06,270 --> 00:01:10,830
and like I mentioned, one might represent
a set of articles about sports, and

20
00:01:10,830 --> 00:01:13,290
another one a set of
articles about world news.

21
00:01:15,140 --> 00:01:19,350
For the time being, let's actually assume
that someone provides us with labels,

22
00:01:19,350 --> 00:01:23,130
so, somebody goes through and
reads every article or

23
00:01:23,130 --> 00:01:27,770
at least a large subset of articles
in our corpus and labels them and

24
00:01:27,770 --> 00:01:30,800
says, okay, these articles,
these are all about sports.

25
00:01:30,800 --> 00:01:33,260
And these ones,
these are about world news.

26
00:01:33,260 --> 00:01:35,060
And these are about entertainment.

27
00:01:35,060 --> 00:01:37,300
And these ones are about science.

28
00:01:37,300 --> 00:01:42,370
So we have some set of articles that
have labels associated with them.

29
00:01:42,370 --> 00:01:47,500
Well, in this case, when we have
our query article and we'd like to

30
00:01:47,500 --> 00:01:54,200
assign it to a cluster, this ends up being
just a multi-class classification problem.

31
00:01:54,200 --> 00:01:57,665
Because the question is just,
here is my query article,

32
00:01:57,665 --> 00:02:02,590
I don't know the label associated with it,
and I have a bunch of labeled documents.

33
00:02:02,590 --> 00:02:08,403
I have the labels world news, science,
sports, entertainment, technology,

34
00:02:08,403 --> 00:02:13,390
and I just wanna classify which
class this article belongs to, okay?

35
00:02:13,390 --> 00:02:14,850
So this is the question.

36
00:02:14,850 --> 00:02:17,600
It's just a multi-class
classification problem.

37
00:02:17,600 --> 00:02:23,970
So if that were the case, that would be an
example of a supervised learning problem.

38
00:02:23,970 --> 00:02:27,884
[MUSIC]
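
To make the labeled case concrete, here is a minimal sketch of assigning a query article to one of several classes. It is only an illustration under stated assumptions: the tiny labeled corpus, the word-count representation, the cosine similarity metric, the nearest-labeled-article rule, and every function name below are hypothetical choices, not the method used in the lecture.

```python
from collections import Counter
import math


def word_counts(text):
    # Represent an article as raw word counts (an assumed, very simple representation).
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    # Cosine similarity between two word-count vectors; 0.0 if either is empty.
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical labeled articles, standing in for the hand-labeled corpus.
labeled_docs = [
    ("the team won the match in the final minute", "sports"),
    ("the striker scored two goals for the team", "sports"),
    ("leaders met to discuss the peace treaty", "world news"),
    ("the election results were announced abroad", "world news"),
    ("the new film topped the box office", "entertainment"),
    ("researchers measured the particle in the lab", "science"),
]


def classify(query):
    # Multi-class classification by copying the label of the most similar labeled article.
    q = word_counts(query)
    best_label, best_score = None, -1.0
    for text, label in labeled_docs:
        score = cosine_similarity(q, word_counts(text))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


predicted = classify("the goalkeeper helped the team win the match")
print(predicted)
```

The point of the sketch is that every prediction relies on labels someone supplied, which is what makes this a supervised problem. Without those labels, the groups would have to be discovered from the articles alone.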