Congratulations on getting to the end of the clustering and retrieval course. We've covered a lot of ground, including some really advanced material, but along the way you've learned some very practical algorithms for performing clustering and retrieval, and you should be able to go out there, implement these methods, and deploy them on real-world problems. Before we wrap up, let's spend a little time reflecting on what we've learned in this course and look ahead at what's left in this specialization.

Okay, so what have you learned in this course? To start with, in module one, we looked at the idea of retrieval and cast it as a problem of performing nearest neighbor search. The first algorithm we looked at was one nearest neighbor search, where we scan through all of our data points and return the data point that's most similar to the query point. We explored this specifically in the context of a document retrieval task: we have an entire corpus of documents, plus some query article, perhaps one that a person is currently reading and interested in, and we want to search over all the other articles to find the closest one. We presented the algorithmic details behind one nearest neighbor search, and then a very straightforward extension, k nearest neighbor search, where we return the k most similar articles, or, generically, data points, instead of just the single nearest neighbor.

In both one nearest neighbor and k nearest neighbor search, there are two really critical elements to the performance of the method. The first is how we represent our data; in our case study, how we're going to represent our documents. The second is how we measure the similarity, or distance, between two data points, so in our case, between two documents. In this course, we took a deep dive into both of these critical components of nearest neighbor search.
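As a baseline for everything that follows, here is a minimal sketch of brute-force k nearest neighbor search, assuming the documents have already been mapped to numeric feature vectors and using plain Euclidean distance as the measure. This is an illustrative sketch, not the course's reference implementation.

```python
import numpy as np

def knn_search(corpus, query, k=5):
    """Brute-force k nearest neighbor search.

    corpus : (n_docs, n_features) array of document vectors
    query  : (n_features,) vector for the query document
    Returns the indices of the k closest documents (nearest first)
    and their distances.
    """
    # Distance from the query to every document in the corpus
    distances = np.linalg.norm(corpus - query, axis=1)
    # Indices of the k smallest distances, sorted nearest to farthest
    nearest = np.argsort(distances)[:k]
    return nearest, distances[nearest]

# Toy usage: 4 documents with 3 features each
docs = np.array([[1.0, 0.0, 2.0],
                 [0.0, 1.0, 1.0],
                 [2.0, 0.0, 2.0],
                 [0.0, 0.0, 1.0]])
idx, dist = knn_search(docs, np.array([1.0, 0.0, 1.0]), k=2)
print(idx, dist)
```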
On the representation side, we first talked about a document representation based on TF-IDF, term frequency-inverse document frequency. The term frequency part simply counts words in the document: we take our entire vocabulary, build a vector exactly the length of that vocabulary, and in each entry count how many times the corresponding word appeared. But we talked about how this can bias our nearest neighbor search towards emphasizing very common words, which is typically not what we want. To account for this, we introduced the inverse document frequency term, which down-weights words that are used very commonly throughout the corpus. Our TF-IDF representation simply multiplies the tf vector elementwise by the idf vector, trading off local frequency against global rarity.

We then turned to how to compute the distance between two different articles. The first approach we talked about was scaled Euclidean distance, a straightforward variant of Euclidean distance: instead of putting equal weight on every component of the vector, we specify a set of weights. For example, for our documents we might have a representation that separates the title from the abstract, the main body, and the conclusion, and then choose weights that emphasize words appearing in the title and abstract more heavily than words in the main body. That might be appropriate for finding similar articles, because the titles and abstracts often contain the key words describing what the articles are about, whereas the main body can be bogged down with material that's tangential to the main point.

For text data, though, a really common choice is cosine similarity. We talked about how cosine similarity is just the inner product of two normalized vectors, one for each document, which is equivalent to measuring the similarity of two articles by the cosine of the angle between their vectors.
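To make that concrete, here is a small from-scratch sketch of TF-IDF weighting and cosine similarity in NumPy. The unsmoothed idf formula log(N / document frequency) used here is one common convention; the exact smoothing varies between implementations and is not taken from the course.

```python
import numpy as np

def tf_idf(counts):
    """Compute TF-IDF vectors from a matrix of raw word counts.

    counts : (n_docs, n_words) array; counts[d, w] = number of times
             word w appears in document d (the term frequencies).
    """
    n_docs = counts.shape[0]
    doc_freq = (counts > 0).sum(axis=0)        # how many docs contain each word
    idf = np.log(n_docs / doc_freq)            # down-weight globally common words
    return counts * idf                        # elementwise tf * idf

def cosine_similarity(x, y):
    """Inner product of the two normalized vectors."""
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

# Toy corpus: rows are documents, columns are vocabulary words
counts = np.array([[3, 0, 1, 0],
                   [0, 2, 0, 1],
                   [2, 1, 1, 0]], dtype=float)
vectors = tf_idf(counts)
print(cosine_similarity(vectors[0], vectors[2]))
```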
But remember, cosine similarity isn't actually a proper distance metric; it doesn't satisfy properties like the triangle inequality. We can still use it perfectly well to compare two documents in our nearest neighbor search, though. We then discussed the implications of using cosine similarity: because it normalizes the vectors being compared, it ignores their scale. That can be a desirable property; for example, it ignores the length of a document, so longer documents don't get emphasized more than shorter ones. But it can also be something we don't want, since it can place very dissimilar objects, such as a tweet and a very lengthy article, on the same footing. So we also talked about other ways of handling document length where we don't normalize at all and instead simply cap the maximum word count.

More generally, cosine similarity is commonly used for text features, or in other applications where the overall scale of the feature vector isn't critical. In contrast, in applications such as describing a house by its number of bedrooms, number of bathrooms, and square footage, the raw scale of those numbers really matters, so you wouldn't want cosine similarity there; you'd use Euclidean distance, or possibly scaled Euclidean distance, instead.

So in this module we talked a lot about the tradeoffs between the different options for our data representation and our distance computation. But there's one more really important aspect of nearest neighbor search that we spent quite a bit of time on: the complexity of search. We mentioned that brute-force search, where you take a query point and compute the distance to every other point in the entire dataset, can be really computationally expensive, especially if you're going to run many different queries against the same dataset.
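Before getting to the data structures that address that cost, here is a tiny numerical illustration of the scale point above: cosine similarity is completely unchanged when a vector is rescaled, while Euclidean distance is not, and capping word counts is one middle ground. This is a standalone illustration rather than code from the course.

```python
import numpy as np

def cosine_similarity(x, y):
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

tweet   = np.array([2.0, 1.0, 0.0, 1.0])   # word counts for a short text
article = 25 * tweet                        # same word proportions, 25x longer

print(cosine_similarity(tweet, article))    # 1.0 -- document length is ignored entirely
print(np.linalg.norm(tweet - article))      # large -- Euclidean distance sees the scale

# One alternative mentioned in the lectures: don't normalize,
# just cap each word count at some maximum value.
cap = 5
capped_article = np.minimum(article, cap)
print(np.linalg.norm(tweet - capped_article))
```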
So instead, the first idea we covered was the KD-tree, a data structure that recursively partitions our feature space to form a binary tree, with the data points stored at the leaves. We discussed how to build this tree and then how to use it to perform efficient nearest neighbor search. The really important insight is that we can often prune a very large portion of the space: if the distance to the nearest neighbor found so far is smaller than the distance to one of the boxes in the partitioning, then no point inside that box can be the nearest neighbor to our query, and we can skip it entirely. And if we don't insist on exact nearest neighbor search, KD-trees can give us even larger efficiency gains. Instead of pruning only when the distance to a bounding box exceeds the distance to the nearest neighbor found so far, we prune more aggressively: we shrink the current nearest neighbor distance and prune based on that shrunken distance. Depending on how much we shrink, this can yield significant savings in search time at fairly low cost in the quality of the returned neighbor.

But we did mention that KD-trees have some serious limitations. One is that they're not that straightforward to implement. The other, which may matter more to you if you're a really awesome hacker, is that KD-trees just don't scale well to high-dimensional feature spaces.
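In practice you'd usually reach for an existing implementation rather than coding the tree yourself. Here is a small sketch using SciPy's cKDTree; its eps parameter gives an approximate query in the spirit of the more aggressive pruning described above, guaranteeing the returned neighbor is within a factor of (1 + eps) of the true nearest distance.

```python
import numpy as np
from scipy.spatial import cKDTree

# Toy dataset: 10,000 points in a modest-dimensional space, where KD-trees work well
rng = np.random.default_rng(0)
data = rng.normal(size=(10_000, 8))
query = rng.normal(size=8)

tree = cKDTree(data)                        # recursively partitions the feature space

# Exact nearest neighbor search, pruning boxes that cannot contain the answer
dist_exact, idx_exact = tree.query(query, k=1)

# Approximate search: prune more aggressively; the returned neighbor is
# guaranteed to be within a factor of (1 + eps) of the true nearest distance
dist_approx, idx_approx = tree.query(query, k=1, eps=0.5)

print(dist_exact, idx_exact, dist_approx, idx_approx)
```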
As an alternative for performing approximate nearest neighbor search, we presented locality-sensitive hashing. The idea is to throw down a set of random lines (or, in more than two dimensions, random hyperplanes) that partition the feature space. Together, these random lines define a bin index for every data point, and we talked about how those bin indices can be used as keys into a hash table that stores the data points. For example, in the lecture illustration, the pink points all land in the same partition and are stored in the third entry of the hash table, based on the binary representation of their bin index. Given a query point, we first search the bin that the query falls into. But the true nearest neighbor doesn't have to lie in that particular partition, so to improve the quality of the search we also visit neighboring bins, which we find by flipping bits of the query's binary bin index, and we continue until we run out of computational budget. In contrast to KD-trees, locality-sensitive hashing is really straightforward to implement, and in your assignment you got some practice exploring the quality of its results, showing that you can get fairly high-quality neighbors back from this very simple and efficient approach.
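To tie those pieces together, here is a compact sketch of that scheme: random hyperplanes define a binary bin index, points are stored in a hash table keyed by that index, and a query probes its own bin plus bins reachable by flipping a few bits. The function names and the max_flips budget are illustrative choices, not the course's or any library's API.

```python
from itertools import combinations
import numpy as np

def build_lsh_table(data, n_planes=16, seed=0):
    """Hash each row of `data` into a bin defined by random hyperplanes."""
    rng = np.random.default_rng(seed)
    planes = rng.normal(size=(n_planes, data.shape[1]))  # random hyperplanes through the origin
    bits = (data @ planes.T) >= 0                        # which side of each plane a point falls on
    powers = 1 << np.arange(n_planes)
    bin_ids = bits.astype(np.int64) @ powers             # binary sign pattern -> integer bin index
    table = {}
    for row, b in enumerate(bin_ids):
        table.setdefault(int(b), []).append(row)
    return table, planes

def lsh_query(query, data, table, planes, max_flips=2, k=1):
    """Approximate k-NN: search the query's own bin, then bins whose
    indices differ by up to `max_flips` flipped bits."""
    n_planes = planes.shape[0]
    powers = 1 << np.arange(n_planes)
    base_bin = int(((planes @ query) >= 0).astype(np.int64) @ powers)

    candidates = []
    for n_flip in range(max_flips + 1):                  # n_flip = 0 is the query's own bin
        for flip in combinations(range(n_planes), n_flip):
            probe = base_bin
            for b in flip:
                probe ^= 1 << b                          # flip bit b to reach an adjacent bin
            candidates.extend(table.get(probe, []))
    if not candidates:
        return np.array([], dtype=int), np.array([])

    candidates = np.unique(candidates)                   # brute-force search within the candidates only
    dists = np.linalg.norm(data[candidates] - query, axis=1)
    order = np.argsort(dists)[:k]
    return candidates[order], dists[order]

# Toy usage: index 5,000 random points, then query with a slightly perturbed copy of one of them
rng = np.random.default_rng(1)
docs = rng.normal(size=(5_000, 32))
table, planes = build_lsh_table(docs)
idx, dist = lsh_query(docs[0] + 0.01 * rng.normal(size=32), docs, table, planes, max_flips=2, k=3)
print(idx, dist)
```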