Congratulations on getting to the end of the clustering and retrieval course. We've covered a lot of ground, including some really advanced material, but along the way you've learned some very practical algorithms for performing clustering and retrieval, and you should be able to go out there, implement these methods, and deploy them on real-world problems. Before we wrap up, let's spend a little time reflecting on what we've learned in this course and look ahead at what's left in this specialization.

Okay, so what have you learned in this course? To start with, in module one, we looked at the idea of retrieval and cast it as a problem of performing nearest neighbor search. The first algorithm we looked at was one nearest neighbor search, where we scan through all of our data points and return the data point that's most similar to the query point. We explored this specifically in the context of a document retrieval task: we have an entire corpus of documents, plus some query article, perhaps one that a person is currently reading and interested in, and we want to search over all the other articles to find the closest one. We presented the algorithmic details behind one nearest neighbor search, and then a very straightforward extension, k nearest neighbor search, where we return the k most similar articles, or, generically, data points, instead of just the single nearest neighbor.

In both one nearest neighbor and k nearest neighbor search, there are two really critical elements to the performance of the method. The first is how we represent our data; in our case study, how we're going to represent our documents. The second is how we measure the similarity, or distance, between two data points, so in our case, between two documents. In this course, we took a deep dive into both of these critical components of nearest neighbor search.
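As a baseline for everything that follows, here is a minimal sketch of brute-force k nearest neighbor search, assuming the documents have already been mapped to numeric feature vectors and using plain Euclidean distance as the measure. This is an illustrative sketch, not the course's reference implementation.

```python
import numpy as np

def knn_search(corpus, query, k=5):
    """Brute-force k nearest neighbor search.

    corpus : (n_docs, n_features) array of document vectors
    query  : (n_features,) vector for the query document
    Returns the indices of the k closest documents (nearest first)
    and their distances.
    """
    # Distance from the query to every document in the corpus
    distances = np.linalg.norm(corpus - query, axis=1)
    # Indices of the k smallest distances, sorted nearest to farthest
    nearest = np.argsort(distances)[:k]
    return nearest, distances[nearest]

# Toy usage: 4 documents with 3 features each
docs = np.array([[1.0, 0.0, 2.0],
                 [0.0, 1.0, 1.0],
                 [2.0, 0.0, 2.0],
                 [0.0, 0.0, 1.0]])
idx, dist = knn_search(docs, np.array([1.0, 0.0, 1.0]), k=2)
print(idx, dist)
```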
On the representation side, we first talked about a document representation based on TF-IDF, term frequency-inverse document frequency. The term frequency part simply counts words in the document: we take our entire vocabulary, build a vector exactly the length of that vocabulary, and in each entry count how many times the corresponding word appeared. But we talked about how this can bias our nearest neighbor search towards emphasizing very common words, which is typically not what we want. To account for this, we introduced the inverse document frequency term, which down-weights words that are used very commonly throughout the corpus. Our TF-IDF representation simply multiplies the tf vector elementwise by the idf vector, trading off local frequency against global rarity.

We then turned to how to compute the distance between two different articles. The first approach we talked about was scaled Euclidean distance, a straightforward variant of Euclidean distance: instead of putting equal weight on every component of the vector, we specify a set of weights. For example, for our documents we might have a representation that separates the title from the abstract, the main body, and the conclusion, and then choose weights that emphasize words appearing in the title and abstract more heavily than words in the main body. That might be appropriate for finding similar articles, because the titles and abstracts often contain the key words describing what the articles are about, whereas the main body can be bogged down with material that's tangential to the main point.

For text data, though, a really common choice is cosine similarity. We talked about how cosine similarity is just the inner product of two normalized vectors, one for each document, which is equivalent to measuring the similarity of two articles by the cosine of the angle between their vectors.
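To make that concrete, here is a small from-scratch sketch of TF-IDF weighting and cosine similarity in NumPy. The unsmoothed idf formula log(N / document frequency) used here is one common convention; the exact smoothing varies between implementations and is not taken from the course.

```python
import numpy as np

def tf_idf(counts):
    """Compute TF-IDF vectors from a matrix of raw word counts.

    counts : (n_docs, n_words) array; counts[d, w] = number of times
             word w appears in document d (the term frequencies).
    """
    n_docs = counts.shape[0]
    doc_freq = (counts > 0).sum(axis=0)        # how many docs contain each word
    idf = np.log(n_docs / doc_freq)            # down-weight globally common words
    return counts * idf                        # elementwise tf * idf

def cosine_similarity(x, y):
    """Inner product of the two normalized vectors."""
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

# Toy corpus: rows are documents, columns are vocabulary words
counts = np.array([[3, 0, 1, 0],
                   [0, 2, 0, 1],
                   [2, 1, 1, 0]], dtype=float)
vectors = tf_idf(counts)
print(cosine_similarity(vectors[0], vectors[2]))
```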
But remember, cosine similarity isn't actually a proper distance metric; it doesn't satisfy properties like the triangle inequality. We can still use it perfectly well to compare two documents in our nearest neighbor search, though. We then discussed the implications of using cosine similarity: because it normalizes the vectors being compared, it ignores their scale. That can be a desirable property; for example, it ignores the length of a document, so longer documents don't get emphasized more than shorter ones. But it can also be something we don't want, since it can place very dissimilar objects, such as a tweet and a very lengthy article, on the same footing. So we also talked about other ways of handling document length where we don't normalize at all and instead simply cap the maximum word count.

More generally, cosine similarity is commonly used for text features, or in other applications where the overall scale of the feature vector isn't critical. In contrast, in applications such as describing a house by its number of bedrooms, number of bathrooms, and square footage, the raw scale of those numbers really matters, so you wouldn't want cosine similarity there; you'd use Euclidean distance, or possibly scaled Euclidean distance, instead.

So in this module we talked a lot about the tradeoffs between the different options for our data representation and our distance computation. But there's one more really important aspect of nearest neighbor search that we spent quite a bit of time on: the complexity of search. We mentioned that brute-force search, where you take a query point and compute the distance to every other point in the entire dataset, can be really computationally expensive, especially if you're going to run many different queries against the same dataset.
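Before getting to the data structures that address that cost, here is a tiny numerical illustration of the scale point above: cosine similarity is completely unchanged when a vector is rescaled, while Euclidean distance is not, and capping word counts is one middle ground. This is a standalone illustration rather than code from the course.

```python
import numpy as np

def cosine_similarity(x, y):
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

tweet   = np.array([2.0, 1.0, 0.0, 1.0])   # word counts for a short text
article = 25 * tweet                        # same word proportions, 25x longer

print(cosine_similarity(tweet, article))    # 1.0 -- document length is ignored entirely
print(np.linalg.norm(tweet - article))      # large -- Euclidean distance sees the scale

# One alternative mentioned in the lectures: don't normalize,
# just cap each word count at some maximum value.
cap = 5
capped_article = np.minimum(article, cap)
print(np.linalg.norm(tweet - capped_article))
```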
So instead, the first idea we covered was the KD-tree, a data structure that recursively partitions our feature space to form a binary tree, with the data points stored at the leaves. We discussed how to build this tree and then how to use it to perform efficient nearest neighbor search. The really important insight is that we can often prune a very large portion of the space: if the distance to the nearest neighbor found so far is smaller than the distance to one of the boxes in the partitioning, then no point inside that box can be the nearest neighbor to our query, and we can skip it entirely. And if we don't insist on exact nearest neighbor search, KD-trees can give us even larger efficiency gains. Instead of pruning only when the distance to a bounding box exceeds the distance to the nearest neighbor found so far, we prune more aggressively: we shrink the current nearest neighbor distance and prune based on that shrunken distance. Depending on how much we shrink, this can yield significant savings in search time at fairly low cost in the quality of the returned neighbor.

But we did mention that KD-trees have some serious limitations. One is that they're not that straightforward to implement. The other, which may matter more to you if you're a really awesome hacker, is that KD-trees just don't scale well to high-dimensional feature spaces.
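In practice you'd usually reach for an existing implementation rather than coding the tree yourself. Here is a small sketch using SciPy's cKDTree; its eps parameter gives an approximate query in the spirit of the more aggressive pruning described above, guaranteeing the returned neighbor is within a factor of (1 + eps) of the true nearest distance.

```python
import numpy as np
from scipy.spatial import cKDTree

# Toy dataset: 10,000 points in a modest-dimensional space, where KD-trees work well
rng = np.random.default_rng(0)
data = rng.normal(size=(10_000, 8))
query = rng.normal(size=8)

tree = cKDTree(data)                        # recursively partitions the feature space

# Exact nearest neighbor search, pruning boxes that cannot contain the answer
dist_exact, idx_exact = tree.query(query, k=1)

# Approximate search: prune more aggressively; the returned neighbor is
# guaranteed to be within a factor of (1 + eps) of the true nearest distance
dist_approx, idx_approx = tree.query(query, k=1, eps=0.5)

print(dist_exact, idx_exact, dist_approx, idx_approx)
```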
As an alternative for performing approximate nearest neighbor search, we presented locality-sensitive hashing. The idea is to throw down a set of random lines (or, in more than two dimensions, random hyperplanes) that partition the feature space. Together, these random lines define a bin index for every data point, and we talked about how those bin indices can be used as keys into a hash table that stores the data points. For example, in the lecture illustration, the pink points all land in the same partition and are stored in the third entry of the hash table, based on the binary representation of their bin index. Given a query point, we first search the bin that the query falls into. But the true nearest neighbor doesn't have to lie in that particular partition, so to improve the quality of the search we also visit neighboring bins, which we find by flipping bits of the query's binary bin index, and we continue until we run out of computational budget. In contrast to KD-trees, locality-sensitive hashing is really straightforward to implement, and in your assignment you got some practice exploring the quality of its results, showing that you can get fairly high-quality neighbors back from this very simple and efficient approach.
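To tie those pieces together, here is a compact sketch of that scheme: random hyperplanes define a binary bin index, points are stored in a hash table keyed by that index, and a query probes its own bin plus bins reachable by flipping a few bits. The function names and the max_flips budget are illustrative choices, not the course's or any library's API.

```python
from itertools import combinations
import numpy as np

def build_lsh_table(data, n_planes=16, seed=0):
    """Hash each row of `data` into a bin defined by random hyperplanes."""
    rng = np.random.default_rng(seed)
    planes = rng.normal(size=(n_planes, data.shape[1]))  # random hyperplanes through the origin
    bits = (data @ planes.T) >= 0                        # which side of each plane a point falls on
    powers = 1 << np.arange(n_planes)
    bin_ids = bits.astype(np.int64) @ powers             # binary sign pattern -> integer bin index
    table = {}
    for row, b in enumerate(bin_ids):
        table.setdefault(int(b), []).append(row)
    return table, planes

def lsh_query(query, data, table, planes, max_flips=2, k=1):
    """Approximate k-NN: search the query's own bin, then bins whose
    indices differ by up to `max_flips` flipped bits."""
    n_planes = planes.shape[0]
    powers = 1 << np.arange(n_planes)
    base_bin = int(((planes @ query) >= 0).astype(np.int64) @ powers)

    candidates = []
    for n_flip in range(max_flips + 1):                  # n_flip = 0 is the query's own bin
        for flip in combinations(range(n_planes), n_flip):
            probe = base_bin
            for b in flip:
                probe ^= 1 << b                          # flip bit b to reach an adjacent bin
            candidates.extend(table.get(probe, []))
    if not candidates:
        return np.array([], dtype=int), np.array([])

    candidates = np.unique(candidates)                   # brute-force search within the candidates only
    dists = np.linalg.norm(data[candidates] - query, axis=1)
    order = np.argsort(dists)[:k]
    return candidates[order], dists[order]

# Toy usage: index 5,000 random points, then query with a slightly perturbed copy of one of them
rng = np.random.default_rng(1)
docs = rng.normal(size=(5_000, 32))
table, planes = build_lsh_table(docs)
idx, dist = lsh_query(docs[0] + 0.01 * rng.normal(size=32), docs, table, planes, max_flips=2, k=3)
print(idx, dist)
```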