1 00:00:00,000 --> 00:00:05,155 [MUSIC] 2 00:00:05,155 --> 00:00:09,858 So up to this point, we've described just the concept of partitioning our data, 3 00:00:09,858 --> 00:00:11,710 just using these simple lines. 4 00:00:11,710 --> 00:00:15,680 And I should have mentioned that in everything we're going to describe here, 5 00:00:15,680 --> 00:00:19,540 we're just going to assume that there's no intercept term to the line, so 6 00:00:19,540 --> 00:00:21,848 it's always just going to go through the origin. 7 00:00:21,848 --> 00:00:26,312 So it's a very simple structure, a very simple way to think 8 00:00:26,312 --> 00:00:30,342 about partitioning our data into two separate bins. 9 00:00:30,342 --> 00:00:34,752 And then it’s a very simple way that we can just search along the points that 10 00:00:34,752 --> 00:00:36,929 fall into each of these partitions. 11 00:00:36,929 --> 00:00:39,930 But there are number of questions that we might be faced with. 12 00:00:39,930 --> 00:00:44,650 So one of these questions is how do we think of defining a good line 13 00:00:44,650 --> 00:00:46,320 to split the points? 14 00:00:46,320 --> 00:00:49,060 And associated with that question is, 15 00:00:49,060 --> 00:00:53,520 what's the resulting quality of this type of nearest neighbor search? 16 00:00:53,520 --> 00:00:57,030 We said that it provides an approximate nearest neighbor search, but 17 00:00:57,030 --> 00:01:00,460 what are the chances that we're actually going to find the nearest neighbor 18 00:01:00,460 --> 00:01:02,710 to a given query point? 19 00:01:02,710 --> 00:01:06,370 And then finally, what types of efficiencies are we actually going to get 20 00:01:06,370 --> 00:01:08,650 from the simple partitioning of data points? 21 00:01:10,420 --> 00:01:14,010 Well, let's start with the question on how are we going to define this line? 22 00:01:14,010 --> 00:01:17,800 And for this, let's propose something that sounds really, really crazy. 23 00:01:17,800 --> 00:01:19,920 What if we just draw the line randomly? 24 00:01:19,920 --> 00:01:23,940 Just randomly throw down a line that goes through the 0, 0 point. 25 00:01:25,620 --> 00:01:29,058 Well, how bad can this be? 26 00:01:29,058 --> 00:01:32,329 It sounds that could actually be really, really bad, but 27 00:01:32,329 --> 00:01:34,213 let's think through this a bit. 28 00:01:34,213 --> 00:01:38,530 And remember that our objective is if two points are close together, and 29 00:01:38,530 --> 00:01:41,858 here we're going to be focusing on cosine similarity when 30 00:01:41,858 --> 00:01:45,845 we're talking about the distance between two different points. 31 00:01:45,845 --> 00:01:51,560 Well, our goal is that if they're close together according to cosine similarity, 32 00:01:51,560 --> 00:01:55,300 then we want those two points to fall into the same bin. 33 00:01:55,300 --> 00:02:00,360 To formalize this a bit, let's just say that the angle, remember that cosine 34 00:02:00,360 --> 00:02:04,200 similarity is defined in terms of the cosine of the angle between two points. 35 00:02:04,200 --> 00:02:07,140 So let's say that there are two points, x and y. 36 00:02:07,140 --> 00:02:10,030 And the angle between x and y is theta xy. 37 00:02:11,120 --> 00:02:14,160 Okay, now let's just start drawing lines randomly. 38 00:02:14,160 --> 00:02:16,150 So here's one line. 39 00:02:16,150 --> 00:02:20,120 And this line happens to put both x and y into bin 0, 40 00:02:20,120 --> 00:02:25,480 because the score of both of the points is less than 0. 41 00:02:25,480 --> 00:02:27,550 Here's another line and another, and another, and another. 42 00:02:27,550 --> 00:02:32,085 And here's a whole bunch of lines that all place x and y into the same bin, bin 0. 43 00:02:34,040 --> 00:02:37,810 Well here's a line that places both x and 44 00:02:37,810 --> 00:02:42,000 y into bin 1 since both of their scores are greater than 0. 45 00:02:42,000 --> 00:02:45,760 And here's another one, another one, another one, another one, another one. 46 00:02:45,760 --> 00:02:50,275 Whole bunch of lines that all place x and y into the same bin, bin 1. 47 00:02:52,340 --> 00:02:55,780 But we can make mistakes, of course. 48 00:02:55,780 --> 00:03:02,007 Here's one line that separates x and y, so that they will be put into different bins. 49 00:03:02,007 --> 00:03:06,805 Because the y point has a negative score so that's going to be in bin 0, 50 00:03:06,805 --> 00:03:10,877 the x point has a positive score, it's going to be in bin 1. 51 00:03:10,877 --> 00:03:13,320 And here are some other lines that would do the same thing. 52 00:03:14,570 --> 00:03:19,666 Okay, so what we're saying is that this whole range of points here 53 00:03:19,666 --> 00:03:24,947 that I'm showing in blue place both x and y into the same bin, bin 1. 54 00:03:24,947 --> 00:03:30,211 And all of these points, or all of these lines that would fall into this 55 00:03:30,211 --> 00:03:35,405 orange region that I'm showing here, place both x and y into bin 0. 56 00:03:35,405 --> 00:03:39,465 But then there's this green range where any line that falls 57 00:03:39,465 --> 00:03:43,050 in that space is going to put x and y into separate bins. 58 00:03:44,200 --> 00:03:47,550 But a question is, what's the chance of that happening? 59 00:03:47,550 --> 00:03:50,770 Well, if theta x, y is really small, 60 00:03:50,770 --> 00:03:55,890 then there's just a small chance that a random line placed, let's say, 61 00:03:55,890 --> 00:04:00,560 uniformly at random from 0 to pi over 2, 62 00:04:00,560 --> 00:04:04,610 there's just a small chance that it's going to fall within that theta range. 63 00:04:05,840 --> 00:04:10,700 So what that means is that there's just a small probability that x and 64 00:04:10,700 --> 00:04:13,880 y are going to be put in different bins, so if x were our query, 65 00:04:13,880 --> 00:04:18,400 then we wouldn't find it's nearest neighbor y which was in some other bin. 66 00:04:19,710 --> 00:04:20,810 So in summary, 67 00:04:20,810 --> 00:04:25,880 this idea of just throwing down a random line, it's actually not that crazy. 68 00:04:26,950 --> 00:04:33,120 We're saying that it provides good performance in terms of a probabilistic 69 00:04:33,120 --> 00:04:37,949 statement of providing exact nearest neighbors for lots and lots of queries. 70 00:04:39,020 --> 00:04:41,956 Okay, so far what we've done is we've talked about the first two points. 71 00:04:41,956 --> 00:04:45,494 We showed how we can just very simply throw it out of line, 72 00:04:45,494 --> 00:04:49,866 nothing challenging to how we have to think about defining that line. 73 00:04:49,866 --> 00:04:53,402 And we also showed that we can get very good performance in terms of nearest 74 00:04:53,402 --> 00:04:56,780 neighbor search with this very simple approach. 75 00:04:56,780 --> 00:05:01,340 However, there's this really, really problematic last third point which is what 76 00:05:01,340 --> 00:05:04,440 are the efficiencies that we get from this approach? 77 00:05:04,440 --> 00:05:09,760 And if we think about just throwing down one line, then what happens is that 78 00:05:09,760 --> 00:05:13,140 we often end up with bins that still have lots and lots of data points. 79 00:05:13,140 --> 00:05:16,710 If we assume we have lots and lots and lots of data points to begin with, if we 80 00:05:16,710 --> 00:05:20,550 just partition them into two sets, we're probably going to end up with two bins, 81 00:05:20,550 --> 00:05:23,050 each of which have lots and lots of datapoints. 82 00:05:23,050 --> 00:05:27,410 And a query is going to have to search over lots and lots of points. 83 00:05:28,410 --> 00:05:29,550 Okay, so 84 00:05:29,550 --> 00:05:33,930 it seems like we've gained something in terms of efficiency but not that much. 85 00:05:33,930 --> 00:05:36,588 So let's think about how we can address this. 86 00:05:36,588 --> 00:05:40,849 [MUSIC]