Congratulations on getting to the end of the clustering and retrieval course. We've covered a lot of ground in this course, including a lot of really advanced material. But through this course, you've learned some very practical algorithms for performing clustering and retrieval, and you should be able to actually go out there, implement these methods, and deploy them on real-world problems. But before we wrap up, let's spend a little time reflecting on what we've learned in this course and look ahead at what's left in this specialization. Okay, well, what have you learned in this course? To start with, in module one, we looked at the idea of retrieval and cast that as a problem of performing nearest neighbor search. And the first algorithm we looked at was something that we called one nearest neighbor search, where we just look at all of our data points and look for the data point that's most similar to the query point. And we explored this specifically in the context of a document retrieval task, where there's a whole bunch of documents out there. There's an entire corpus of documents that we have. And we have some article that's our query article. It might be an article that a person is currently reading and is interested in. And we want to search over all the other articles to find the closest article. So we presented some algorithmic details behind performing one nearest neighbor search. And then we presented a very straightforward extension of one nearest neighbor search, which is k nearest neighbor search, where we simply return the k most similar articles or, generically, data points, to a given query point, instead of just the nearest neighbor. But in both one nearest neighbor and k nearest neighbor search, there are two really critical elements to the performance of the method. The first is how we think about representing our data.
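To make the idea concrete, here's a minimal sketch of brute-force k nearest neighbor search, assuming documents are already encoded as NumPy feature vectors and using plain Euclidean distance. The function and variable names are purely illustrative, not from the course materials:

```python
import numpy as np

def k_nearest_neighbors(query, corpus, k=1):
    """Brute-force k-NN: compute the distance from the query to every
    point in the corpus and return the indices of the k closest."""
    dists = np.linalg.norm(corpus - query, axis=1)  # distance to each data point
    return np.argsort(dists)[:k]                    # k smallest distances first

corpus = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
query = np.array([0.9, 1.1])
print(k_nearest_neighbors(query, corpus, k=2))  # nearest first: index 1, then 0
```

With k=1 this is exactly one nearest neighbor search; the only change for k-NN is keeping the k smallest distances instead of the single smallest.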
So in this case, in our case study that we're considering, the question is, how are we going to represent our documents? And then the second critical element is, how are we going to measure the similarity, or the distance, between two data points? So again, in our case, two documents. Well, in this course, we took a deep dive into these two critical components of nearest neighbor search. And to begin with, we talked about a document representation based on TF-IDF, term frequency-inverse document frequency, where the term frequency is simply counting words in the document. So we look at our entire vocabulary, and we have a data representation that's exactly the length of the vocabulary. And in each index of this vector, we simply count how many times we saw a given word. But we talked about the fact that this can bias our nearest neighbor search towards emphasizing very common words. And so that's typically not desired. So to account for this, we introduced this inverse document frequency term, which down-weights words that are very commonly used globally throughout the corpus. So our TF-IDF representation is simply multiplying our tf vector with this idf vector, and this trades off between local frequency and global rarity. We then turned to how we're going to compute the distance between two different articles. And the first approach that we talked about was scaled Euclidean distance, which is just a straightforward variant of Euclidean distance. Instead of having equal weights on every component of the vector, we can specify a set of weights. So for example, for our documents, we might have a representation that separates out the title from the abstract, from the main body, from the conclusion. And then based on this representation across these different components, we could put weights that more heavily emphasize the words appearing in the title and the abstract than those in the main body.
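The TF-IDF computation described above can be sketched in a few lines. This is only an illustration, not the course's reference implementation: it uses a simple idf of log(#documents / #documents containing the word), represents each document as a sparse word-to-weight dictionary rather than a full vocabulary-length vector, and all names are made up for the example:

```python
import math
from collections import Counter

def tf_idf(corpus):
    """TF-IDF for each document: tf is the raw word count (local frequency),
    idf = log(#docs / #docs containing the word) down-weights globally
    common words (global rarity)."""
    n_docs = len(corpus)
    tfs = [Counter(doc.split()) for doc in corpus]
    df = Counter()                       # document frequency of each word
    for tf in tfs:
        df.update(tf.keys())
    idf = {w: math.log(n_docs / df[w]) for w in df}
    return [{w: count * idf[w] for w, count in tf.items()} for tf in tfs]

docs = ["the cat sat", "the dog sat", "the cat ran"]
vectors = tf_idf(docs)
# 'the' appears in every document, so its idf is log(3/3) = 0 and it drops out
```

Notice how a word like "the", which appears in every document, gets weight zero, while a rare word like "dog" keeps a large weight; that's exactly the local-frequency versus global-rarity tradeoff.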
And so this might be appropriate in terms of finding similarities between articles, because maybe the titles and abstracts have the really key words for describing what those articles are about, whereas the main body might be bogged down with lots of things that are a bit tangential to the main point of the article. But when we're thinking about text data, a really common distance metric is cosine similarity. And so we talked about how cosine similarity is just the inner product of two normalized vectors, one for each of our documents. And this is equivalent to computing the similarity between two articles based just on the cosine of the angle between those two documents. But remember, this wasn't actually a proper distance metric. It doesn't satisfy things like the triangle inequality. But we can definitely use it to compute the distance between two different documents in our nearest neighbor search. We then discussed the implications of using cosine similarity: because it normalizes the vectors that we're looking at, it ignores the scale of the individual vectors. So that could be a desirable property. For example, we talked about how it ignores the length of the document, so longer documents don't get emphasized more than shorter documents. But that could also be something that we don't want. And we discussed how, for example, it could make dissimilar objects, like a tweet and a very lengthy article, look very similar, placing them on the same footing. So instead we talked about other ways of thinking about normalization, where we're not actually normalizing, we're simply capping the max word count, and using that to account for differences in document lengths. But I want to mention that cosine similarity is really commonly used when you're looking at features of text data, or in other applications where the overall scale of the feature vector that you're looking at isn't so critical.
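Here's a small sketch of cosine similarity and its scale-invariance, using toy NumPy vectors. The "tweet" and "article" vectors are invented for illustration:

```python
import numpy as np

def cosine_similarity(a, b):
    """Inner product of the two normalized vectors, i.e. the cosine of
    the angle between them. Vector lengths are divided out entirely."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

tweet = np.array([1.0, 2.0, 0.0])
article = np.array([10.0, 20.0, 0.0])  # same word proportions, 10x the counts
print(cosine_similarity(tweet, article))  # ~1.0 despite the huge length gap
```

This is exactly the tweet-versus-long-article effect from the lecture: because only the angle matters, a short document and a much longer one with the same word proportions come out maximally similar.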
But we mentioned, in contrast, that in certain applications, for example, if you're looking at features of a house, like the number of bedrooms, number of bathrooms, and square feet, the raw scale of those numbers is really, really critical. So you wouldn't want to do something like cosine similarity. Instead you'd want to use something like Euclidean distance, or possibly scaled Euclidean distance. Okay, so in this module we talked a lot about the tradeoffs between different choices made and different options for our data representations and our distance calculations. But there's still one really important aspect of nearest neighbor search that we spent quite a bit of time on in this module, which is the complexity of search. We mentioned that brute-force search, where you take a query point and then compute the distance to every other point in the entire dataset, can be really, really computationally intensive, especially if you're going to be doing many different queries on the same dataset. So instead, what we talked about first was the idea of using something called KD-trees as an efficient data structure, where KD-trees recursively partition our feature space to create a binary tree, and the data points are stored at the leaves of this tree. We discussed how to form this tree, and then we discussed how to use this tree to perform efficient nearest neighbor search. The really important insight here is the fact that we can often prune a very large amount of the space, because the distance to our nearest neighbor found so far is smaller than the distance to a lot of the boxes in this partitioning of the feature space. And that means that no point within any of these boxes could be a nearest neighbor to our query point. And if we don't care about performing exact nearest neighbor search, then we can get even larger efficiency gains using KD-trees.
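A minimal KD-tree sketch can make the pruning idea concrete. This toy version differs from the lecture's description in one small way, as a simplification: it stores a point at every node rather than only at the leaves, and it splits on alternating axes at the median. All names here are illustrative:

```python
import numpy as np

class Node:
    """One KD-tree node: a point, the splitting axis, and two subtrees."""
    def __init__(self, point, axis, left, right):
        self.point, self.axis, self.left, self.right = point, axis, left, right

def build_kd_tree(points, depth=0):
    # Recursively partition the space, cycling through the axes.
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return Node(points[mid], axis,
                build_kd_tree(points[:mid], depth + 1),
                build_kd_tree(points[mid + 1:], depth + 1))

def nearest(node, query, best=None):
    # 1-NN search with pruning: a subtree is skipped whenever the distance
    # from the query to its splitting plane exceeds the best distance so far,
    # since no point on that side can then be the nearest neighbor.
    if node is None:
        return best
    d = float(np.linalg.norm(np.array(node.point) - np.array(query)))
    if best is None or d < best[1]:
        best = (node.point, d)
    diff = query[node.axis] - node.point[node.axis]
    near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
    best = nearest(near, query, best)        # descend the query's side first
    if abs(diff) < best[1]:                  # far side could still hold a closer point
        best = nearest(far, query, best)
    return best

points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build_kd_tree(points)
print(nearest(tree, (9, 2))[0])  # → (8, 1)
```

The approximate variant the lecture mentions next is a one-line change: prune against a shrunken best distance, e.g. `best[1] / alpha` for some `alpha > 1`, trading a little accuracy for more aggressive pruning.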
Instead of pruning when the distance to a bounding box is greater than the distance to our nearest neighbor found so far, we prune more aggressively. So we're going to basically shrink the distance to our nearest neighbor so far, and prune based off of this shrunken distance. And depending on how much we're shrinking to do this more aggressive pruning, this can result in significant savings in terms of our search time, at fairly low cost in terms of the quality of our nearest neighbor. But we did mention that KD-trees can have some serious limitations. One is the fact that they're not that straightforward to implement. But the other, which might be more important to you if you're a really awesome hacker, is the fact that KD-trees just really don't scale well to high-dimensional feature spaces. So as an alternative for performing approximate nearest neighbor search, we presented something called locality-sensitive hashing, where what we do is simply throw down a set of lines, each line just randomly defined in the space. And if we're in more than two dimensions, we're really talking about hyperplanes that we're throwing down. But the point is that these random lines partition our feature space, and the collection of the lines defines what we're going to think of as bin indices for each of these data points. In particular, we talked about how these bin indices can be used within a hash table to store our data points. So for example, these pink points are stored in this third index of the hash table, based on the binary representation of this bin index. And if we have a query point in this same partition, then what we would do in our nearest neighbor search, as we discussed, is first search the same bin as the query point. But then, if we wanted to improve the quality of the nearest neighbor search performed, we know that the nearest neighbor to this point doesn't have to lie within that specific partition.
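The random-hyperplane hashing step can be sketched very compactly. This is an illustrative toy, not the assignment's code: hyperplane normals are drawn from a Gaussian, each plane contributes one bit (which side of the plane the point falls on), and the bit tuple is used directly as the hash-table key:

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the random planes are reproducible

def lsh_bins(data, n_planes=4):
    """Hash each row of `data` into a bin: one bit per random hyperplane,
    and the bits together form the bin index used as a hash-table key."""
    planes = rng.normal(size=(n_planes, data.shape[1]))  # random hyperplane normals
    bits = (data @ planes.T >= 0).astype(int)            # side of each plane, per point
    table = {}
    for i, b in enumerate(bits):
        table.setdefault(tuple(b), []).append(i)         # bin index -> data point ids
    return planes, table

data = np.array([[1.0, 0.0], [1.0, 0.1], [-1.0, -2.0]])
planes, table = lsh_bins(data)
```

At query time, the same `planes` are applied to the query point to get its bin, and only the points stored under that key are candidates, which is what makes the lookup so cheap.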
So we'd search neighboring bins, where to define these neighboring bins we simply flip bits on this binary representation to get the adjacent bins. And then we just continue this until we run out of our computational budget. So in contrast to KD-trees, locality-sensitive hashing is really, really, really straightforward to implement. And you got some practice in your assignment in exploring the quality of the results from locality-sensitive hashing, showing that you could get fairly high-quality neighbors returned using this very simple and efficient approach.
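The bit-flipping search order described above can be sketched as a small generator (names are illustrative): visit the query's own bin first, then every bin reachable by flipping one bit, then two bits, and so on, up to a chosen budget:

```python
from itertools import combinations

def probe_bins(bin_index, budget):
    """Yield bins in order of increasing Hamming distance from the query's
    bin, stopping once `budget` bit flips have been exhausted."""
    n = len(bin_index)
    for radius in range(budget + 1):
        for positions in combinations(range(n), radius):
            probed = list(bin_index)
            for p in positions:
                probed[p] = 1 - probed[p]  # flip this bit to reach an adjacent bin
            yield tuple(probed)

print(list(probe_bins((0, 1, 0), budget=1)))
# → [(0, 1, 0), (1, 1, 0), (0, 0, 0), (0, 1, 1)]
```

A nearest neighbor search would scan the points stored under each yielded bin in turn, keeping the best candidate found when the budget runs out.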