[MUSIC] So up to this point, we've described just the concept of partitioning our data, just using these simple lines. And I should have mentioned that in everything we're going to describe here, we're just going to assume that there's no intercept term to the line, so it's always just going to go through the origin. So it's a very simple structure, a very simple way to think about partitioning our data into two separate bins. And then it's a very simple way that we can just search along the points that fall into each of these partitions. But there are a number of questions that we might be faced with. So one of these questions is, how do we think of defining a good line to split the points? And associated with that question is, what's the resulting quality of this type of nearest neighbor search? We said that it provides an approximate nearest neighbor search, but what are the chances that we're actually going to find the nearest neighbor to a given query point? And then finally, what types of efficiencies are we actually going to get from this simple partitioning of data points? Well, let's start with the question of how we're going to define this line. And for this, let's propose something that sounds really, really crazy. What if we just draw the line randomly? Just randomly throw down a line that goes through the 0, 0 point. Well, how bad can this be? It sounds like that could actually be really, really bad, but let's think through this a bit. And remember what our objective is: if two points are close together, and here we're going to be focusing on cosine similarity when we're talking about the distance between two different points, well, our goal is that if they're close together according to cosine similarity, then we want those two points to fall into the same bin. To formalize this a bit, remember that cosine similarity is defined in terms of the cosine of the angle between two points.
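The binning scheme described above can be sketched in a few lines of NumPy. This is a minimal illustration, not code from the course: the data, the seed, and the variable names (`points`, `w`, `scores`, `bins`) are all hypothetical. A line through the origin is represented by a normal vector `w`, a point's "score" is its dot product with `w`, and the sign of the score decides the bin.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-D data points (one row per point).
points = rng.normal(size=(10, 2))

# A random line through the origin, represented by its normal vector w.
# The "score" of a point x is the dot product w . x.
w = rng.normal(size=2)
scores = points @ w

# Points with negative score go to bin 0, the rest to bin 1.
bins = (scores >= 0).astype(int)
print(bins)
```

A query point is scored and binned the same way, and the search then only visits the points that landed in the query's bin.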
So let's say that there are two points, x and y, and the angle between x and y is theta xy. Okay, now let's just start drawing lines randomly. So here's one line. And this line happens to put both x and y into bin 0, because the score of both of the points is less than 0. Here's another line, and another, and another, and another. And here's a whole bunch of lines that all place x and y into the same bin, bin 0. Well, here's a line that places both x and y into bin 1, since both of their scores are greater than 0. And here's another one, another one, another one, another one, another one. A whole bunch of lines that all place x and y into the same bin, bin 1. But we can make mistakes, of course. Here's one line that separates x and y, so that they will be put into different bins. Because the y point has a negative score, it's going to be in bin 0; the x point has a positive score, so it's going to be in bin 1. And here are some other lines that would do the same thing. Okay, so what we're saying is that any line in this whole range here that I'm showing in blue places both x and y into the same bin, bin 1. And all of these lines that fall into this orange region that I'm showing here place both x and y into bin 0. But then there's this green range where any line that falls in that space is going to put x and y into separate bins. But a question is, what's the chance of that happening? Well, if theta xy is really small, then there's just a small chance that a line whose angle is placed uniformly at random is going to fall within that theta xy range. So what that means is that there's just a small probability that x and y are going to be put in different bins, so if x were our query, only rarely would we fail to find its nearest neighbor y because it fell into some other bin. So in summary, this idea of just throwing down a random line, it's actually not that crazy.
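The geometric argument above can be checked empirically. The classic result for random lines through the origin is that two points at angle theta are separated with probability theta / pi, and the simulation below, a hedged sketch with hypothetical values (the angle 0.3 radians and the trial count are arbitrary choices), confirms it: we draw many random line directions and count how often x and y get opposite score signs.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two hypothetical points with a known angle theta between them.
theta = 0.3  # radians
x = np.array([1.0, 0.0])
y = np.array([np.cos(theta), np.sin(theta)])

# Draw many random lines through the origin, each represented by a
# normal vector at a uniformly random angle, and count how often
# x and y land in different bins (i.e., their scores differ in sign).
n_trials = 100_000
angles = rng.uniform(0.0, np.pi, size=n_trials)
normals = np.stack([np.cos(angles), np.sin(angles)], axis=1)
separated = ((normals @ x >= 0) != (normals @ y >= 0)).mean()

# The separation probability should be close to theta / pi.
print(separated, theta / np.pi)
```

So the smaller the angle between x and y (the more cosine-similar they are), the less likely a random line splits them apart, which is exactly the property we wanted.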
We're saying that it provides good performance in terms of a probabilistic statement: it will provide the exact nearest neighbor for lots and lots of queries. Okay, so far what we've done is we've talked about the first two points. We showed how we can very simply throw out a line, nothing challenging to how we have to think about defining that line. And we also showed that we can get very good performance in terms of nearest neighbor search with this very simple approach. However, there's this really, really problematic third point, which is, what are the efficiencies that we get from this approach? And if we think about just throwing down one line, then what happens is that we often end up with bins that still have lots and lots of data points. If we assume we have lots and lots and lots of data points to begin with, and we just partition them into two sets, we're probably going to end up with two bins, each of which has lots and lots of data points. And a query is going to have to search over lots and lots of points. Okay, so it seems like we've gained something in terms of efficiency, but not that much. So let's think about how we can address this. [MUSIC]
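The efficiency concern is easy to see concretely. In this hedged sketch (the dataset size and distribution are made up for illustration), a single random line splits 100,000 points into two bins of roughly 50,000 each, so a query assigned to either bin still has to scan about half the data.

```python
import numpy as np

rng = np.random.default_rng(2)

# A large collection of hypothetical data points.
points = rng.normal(size=(100_000, 2))

# One random line through the origin only halves the search set.
w = rng.normal(size=2)
bins = (points @ w >= 0).astype(int)

# Both bins still hold tens of thousands of points, so a query
# must still scan roughly N/2 candidates.
print(np.bincount(bins, minlength=2))
```

That N/2 candidate scan is the motivation for the next step: using more than one line to carve the data into many smaller bins.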