1
00:00:00,000 --> 00:00:04,598
[MUSIC]

2
00:00:04,598 --> 00:00:08,710
Welcome to the Clustering and Retrieval
course in the Machine Learning Specialization.

3
00:00:08,710 --> 00:00:12,680
And as we're going to see, clustering and
retrieval are both widely used and

4
00:00:12,680 --> 00:00:14,330
broadly applicable tools.

5
00:00:14,330 --> 00:00:15,440
These tools can be used for

6
00:00:15,440 --> 00:00:19,920
anything from finding similar products to
the one that you're currently browsing

7
00:00:19,920 --> 00:00:24,100
to discovering groups of patients
with related medical conditions.

8
00:00:24,100 --> 00:00:25,540
And in the Foundations course,

9
00:00:25,540 --> 00:00:29,630
we provided a high-level introduction to
the concepts of clustering and retrieval.

10
00:00:29,630 --> 00:00:32,600
But now in this course, we're going to
take a deep dive into the models and

11
00:00:32,600 --> 00:00:34,280
algorithms.

12
00:00:34,280 --> 00:00:37,420
This course is part of a specialization
that was designed to be taken

13
00:00:37,420 --> 00:00:38,910
in a specific sequence.

14
00:00:38,910 --> 00:00:41,890
So, although you can take this
course as a standalone course,

15
00:00:41,890 --> 00:00:43,790
to get the experience that we intended,

16
00:00:43,790 --> 00:00:47,370
we strongly suggest that you
take the entire sequence.

17
00:00:47,370 --> 00:00:51,570
In particular, in the Foundations course,
we provided a high-level overview of

18
00:00:51,570 --> 00:00:54,899
the concepts that we're going to cover
throughout the entire specialization.

19
00:00:56,020 --> 00:01:01,860
Then we went into much more detailed
courses on regression and classification.

20
00:01:01,860 --> 00:01:05,298
And in those courses, we learned
general machine learning concepts that

21
00:01:05,298 --> 00:01:08,472
are going to appear again in this
course on clustering and retrieval.

22
00:01:08,472 --> 00:01:12,307
And we'll go through a specific list of
the concepts that we expect you to know

23
00:01:12,307 --> 00:01:15,520
later on in this
introduction to the course.

24
00:01:15,520 --> 00:01:19,170
But first, let's provide an overview
of what this course is about.

25
00:01:19,170 --> 00:01:22,190
And remember that in the Machine
Learning Foundations course,

26
00:01:22,190 --> 00:01:26,800
we said that machine learning was about
extracting intelligence from data.

27
00:01:26,800 --> 00:01:30,040
Well, in the first part of this course,
the machine learning method that

28
00:01:30,040 --> 00:01:35,080
we're going to be going through is nearest
neighbor search for retrieval tasks.

29
00:01:35,080 --> 00:01:37,160
So in particular here,

30
00:01:37,160 --> 00:01:41,800
the inputs are going to be the features
of a specific query point,

31
00:01:41,800 --> 00:01:45,415
as well as the features of all other
datapoints in a specific dataset.

32
00:01:46,750 --> 00:01:50,680
And then the output is going to be
the nearest neighbor to the query point,

33
00:01:50,680 --> 00:01:57,330
being the point within this entire dataset
that's most similar to this query point.

34
00:01:57,330 --> 00:01:58,900
So, what's needed for

35
00:01:58,900 --> 00:02:02,540
our nearest neighbor search is to
search over all these other datapoints

36
00:02:02,540 --> 00:02:05,980
and compute the similarity
to these datapoints or

37
00:02:05,980 --> 00:02:10,220
their distances and
then return the one that's closest.
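
The search just described can be sketched in a few lines of Python. This is only a minimal brute-force sketch, not the course's implementation: the function names, the toy feature vectors, and the choice of Euclidean distance are all assumptions made for illustration.

```python
import math


def euclidean_distance(a, b):
    """Distance between two feature vectors of equal length (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_neighbors(query, dataset, k=1):
    """Return the indices of the k datapoints closest to the query point.

    Brute force: compute the distance from the query to every datapoint,
    sort by distance, and keep the k smallest.
    """
    distances = [
        (euclidean_distance(query, point), index)
        for index, point in enumerate(dataset)
    ]
    distances.sort()
    return [index for _, index in distances[:k]]


# Hypothetical feature vectors, e.g. word counts for three documents.
documents = [
    [1.0, 0.0, 2.0],
    [0.0, 3.0, 1.0],
    [1.0, 0.5, 2.0],
]
query = [1.0, 0.0, 1.5]

print(nearest_neighbors(query, documents, k=2))
```

Setting k=1 returns the single nearest neighbor, while a larger k returns a set of nearest neighbors.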

38
00:02:10,220 --> 00:02:12,970
So as a running example that we're
going to use throughout this course,

39
00:02:12,970 --> 00:02:16,290
let's imagine that we had
a whole bunch of documents and

40
00:02:16,290 --> 00:02:18,650
somebody's reading a specific document.

41
00:02:18,650 --> 00:02:20,950
So this is this query
article in gray here and

42
00:02:20,950 --> 00:02:23,321
let's say that the user
likes this article.

43
00:02:23,321 --> 00:02:28,010
And we want to retrieve another article
that's similar that they might also like.

44
00:02:28,010 --> 00:02:31,260
So this would be the task of
finding the nearest neighbor.

45
00:02:31,260 --> 00:02:34,520
But maybe we also want to retrieve
a set of nearest neighbors, so

46
00:02:34,520 --> 00:02:37,370
a set of similar document
to display to this user.

47
00:02:39,038 --> 00:02:42,940
And we're going to be performing this
search over all these other documents to

48
00:02:42,940 --> 00:02:47,030
retrieve these nearest neighbors just
based on the text of the document.

49
00:02:47,030 --> 00:02:49,295
So just based on the features
of our datapoints.

50
00:02:50,360 --> 00:02:53,590
And we see retrieval
applications almost everywhere.

51
00:02:53,590 --> 00:02:58,180
So perhaps we have some image and
want to find a set of similar images.

52
00:02:58,180 --> 00:03:02,090
Or maybe we're shopping for some product
and we want to be provided with a set of

53
00:03:02,090 --> 00:03:04,788
other similar items that we
might want to consider instead.

54
00:03:04,788 --> 00:03:08,650
Or like we've discussed,

55
00:03:08,650 --> 00:03:13,960
maybe we're reading some article and
want to read a related article on a topic.

56
00:03:13,960 --> 00:03:18,570
And this kind of idea appears in a lot
of streaming contexts where you might be

57
00:03:18,570 --> 00:03:21,738
listening to a song,
watching a movie or watching a TV show.

58
00:03:21,738 --> 00:03:25,200
And you want to be provided with
a set of other songs or movies or

59
00:03:25,200 --> 00:03:27,470
TV shows that you might
also be interested in.

60
00:03:29,440 --> 00:03:32,420
Or perhaps you're a user
on some social network and

61
00:03:32,420 --> 00:03:35,040
based on features of you as a user,

62
00:03:35,040 --> 00:03:38,040
you want to be presented with other people
that you might want to connect with.

63
00:03:39,170 --> 00:03:44,760
So in general, this idea of retrieving
similar items can be a really,

64
00:03:44,760 --> 00:03:46,460
really powerful tool.

65
00:03:46,460 --> 00:03:50,070
And throughout this course we're going to
discuss ways to provide this type of

66
00:03:50,070 --> 00:03:50,570
search.

67
00:03:51,690 --> 00:03:54,490
Then in the second part of
the course we're going to turn to

68
00:03:54,490 --> 00:03:56,740
clustering as our machine learning method.

69
00:03:56,740 --> 00:03:57,540
And here,

70
00:03:57,540 --> 00:04:02,000
our inputs are features associated with
each of our datapoints within a dataset.

71
00:04:02,000 --> 00:04:05,720
And the outputs are going to be
cluster labels associated with

72
00:04:05,720 --> 00:04:07,310
each of these datapoints,

73
00:04:07,310 --> 00:04:12,310
where the goal of the clustering method
is to discover these disjoint sets

74
00:04:12,310 --> 00:04:13,075
of datapoints.

75
00:04:14,390 --> 00:04:16,980
And we're going to look at a couple
case studies when we're talking about

76
00:04:16,980 --> 00:04:17,780
clustering.

77
00:04:17,780 --> 00:04:21,650
But one of them is going to be the same
idea of a document retrieval task

78
00:04:21,650 --> 00:04:25,060
where we have an entire
corpus of documents.

79
00:04:25,060 --> 00:04:29,060
And here, though, as opposed
to our simple retrieval

80
00:04:29,060 --> 00:04:31,710
setting where we're just performing
nearest neighbor search,

81
00:04:31,710 --> 00:04:36,340
our goal is to discover a structured
representation of this input space.

82
00:04:36,340 --> 00:04:41,210
So a structured grouping of
our corpus into groups that

83
00:04:41,210 --> 00:04:45,410
might relate to things like
topics present in this corpus.

84
00:04:46,610 --> 00:04:48,810
And just like we saw with retrieval,

85
00:04:48,810 --> 00:04:51,560
clustering has applications
almost everywhere as well.

86
00:04:52,630 --> 00:04:56,970
So maybe when we're doing image search, we
don't just want to take a given image and

87
00:04:56,970 --> 00:04:59,010
provide a set of related images,

88
00:04:59,010 --> 00:05:03,335
but we might want to actually structure
our dataset where we're going to discover

89
00:05:03,335 --> 00:05:05,600
groups of related images.

90
00:05:05,600 --> 00:05:09,440
Or maybe we want to discover groups
of Coursera learners like yourselves,

91
00:05:09,440 --> 00:05:13,610
where, based on features of the user and
past behavior on Coursera,

92
00:05:13,610 --> 00:05:16,290
we can find users who
are related to one another and

93
00:05:16,290 --> 00:05:20,770
might have similar interests and use this
to better target courses to the learners.

94
00:05:22,250 --> 00:05:25,830
So these very simple ideas of retrieval and
clustering

95
00:05:25,830 --> 00:05:29,250
actually have a very,
very significant impact in the world and

96
00:05:29,250 --> 00:05:31,120
are very much taken for granted.

97
00:05:31,120 --> 00:05:35,240
You just go on to your app or
your device and you search for something.

98
00:05:35,240 --> 00:05:38,720
And you just assume, okay, of course,
it's going to be able to provide a set of

99
00:05:38,720 --> 00:05:42,600
other things that might be related
to the given query that you've made.

100
00:05:43,650 --> 00:05:48,670
Or maybe we can very easily just
discover from data that we've seen,

101
00:05:48,670 --> 00:05:51,820
what are related groups of people or
items.

102
00:05:51,820 --> 00:05:53,720
Well, how do we actually do this?

103
00:05:53,720 --> 00:05:55,680
So this is what this course is about and

104
00:05:55,680 --> 00:05:58,230
we're also going to touch
upon how we do this at scale.

105
00:05:59,450 --> 00:06:03,370
And throughout this course,
like always we're going to learn general

106
00:06:03,370 --> 00:06:06,490
machine learning concepts
that are very useful.

107
00:06:06,490 --> 00:06:09,260
For example, in this course we're
going to talk about the task of

108
00:06:09,260 --> 00:06:11,080
unsupervised learning.

109
00:06:11,080 --> 00:06:15,913
And we're also going to talk about
things like the MapReduce framework for

110
00:06:15,913 --> 00:06:20,341
scaling up algorithms by
parallelizing their computations.

111
00:06:20,341 --> 00:06:24,409
[MUSIC]