So we mentioned that we're going to look at locality sensitive hashing, or LSH, as an alternative to KD-trees. Well, now let's dig into the details of what locality sensitive hashing does.

Just like KD-trees, we're going to think of partitioning our data into bins, but the way in which locality sensitive hashing does this is fundamentally different from the axis-aligned cuts made by KD-trees. In this section we're going to return to an example that we used in the classification course: imagine we have reviews of restaurants. We have the text from those reviews, and we're just going to count how many times we see the word "awesome" versus the word "awful". And when we're thinking about retrieving similar reviews, we're just going to look at these counts.

Okay, well, just like in classification, we can think of dividing this set of points using a simple line. This is like the decision boundary we saw in the classification course. Here I'm just writing the equation of that line, and we're defining a term that we call the score. It says that for any input, in terms of the count of "awesome" and the count of "awful", I'm going to score that review. Everything that falls below this line has a negative score, and everything above this line has a positive score.

But we're going to use these scores in a different way than we did in the classification course. Here, if the sign of the score is negative, then all of those points fall into what we're calling bin 0. But if the sign is positive, then those points are assigned to bin 1. So we're going to translate the sign of the score into a binary index.

Then we can use this for nearest neighbor search in the following way. If we get this green query point, we're only going to search for nearest neighbors in its bin, so all these other orange points. Here we're only searching for nearest neighbors among the points whose score is less than zero, which in terms of the bins corresponds to all the points that were labeled as having bin index 0.
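To make this concrete, here is a minimal Python sketch of the scoring and bin-assignment step. The weights on the two counts are illustrative placeholders (the narration doesn't give the exact coefficients of the line in the figure), and the helper names are hypothetical.

```python
import numpy as np

# Illustrative weights for the separating line: score = 1.0*#awesome - 1.5*#awful.
# These coefficients are placeholders, not the ones drawn in the lecture figure.
WEIGHTS = np.array([1.0, -1.5])

def score(counts):
    """Score a review represented as [#awesome, #awful]."""
    return np.dot(WEIGHTS, counts)

def bin_index(counts):
    """Translate the sign of the score into a binary index: negative -> 0, otherwise -> 1."""
    return 0 if score(counts) < 0 else 1

# A review with 1 "awesome" and 3 "awful"s scores negative, so it lands in bin 0.
print(score([1, 3]), bin_index([1, 3]))   # -3.5, 0
```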
Alternatively, if the query were to have fallen into this other region, the blue region, we would have looked over all the points that had bin index 1, because all of their scores were greater than 0.

In practice, the way we're going to do this search over all data points that have the same bin index as the query point is using something called a hash table. What a hash table does is, for each of our bin values, bin 0 or bin 1, we call that value the key, and then we associate with that key a set of values, a list of the indices of the data points that have that bin value. Said another way, we take all of the data points that have bin value 0 and list their indices in the first element of this table, and we take all the data points that have bin value 1 and list their indices in the second element of this table. Then, when we get a given query point, we look at the bin value of that query, based on the sign of its score, and we just search over the list of points that are in that bin.

So this provides a really, really simple way of doing nearest neighbor search. It's a very simple idea of throwing down a line and then just searching the points that fall on the same side of that line as the query, either above or below it. But a question is, does this return an exact nearest neighbor? And the answer is no. For example, here we have this green query point, and let's say that this orange point, which is just above the line and very close to that green point, is actually the true nearest neighbor. Well, if we look at this green point and we only search the blue points, which have the same bin value as the green point, then we're not going to find that orange point.
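Here is a rough sketch, continuing the assumptions above, of how the hash table can be built and queried: each bin index is a key, its value is the list of data-point indices in that bin, and a query is answered by a brute-force search over only the candidates in its own bin. The function names and the use of Euclidean distance are illustrative choices, not specifics from the course.

```python
from collections import defaultdict
import numpy as np

WEIGHTS = np.array([1.0, -1.5])          # same illustrative line as in the sketch above

def bin_index(point):
    """0 if the point scores negative under the line, 1 otherwise."""
    return 0 if np.dot(WEIGHTS, point) < 0 else 1

def build_hash_table(data):
    """Key: bin index. Value: list of indices of the data points that fall in that bin."""
    table = defaultdict(list)
    for i, point in enumerate(data):
        table[bin_index(point)].append(i)
    return table

def nearest_in_same_bin(query, data, table):
    """Brute-force nearest neighbor search restricted to the query's own bin."""
    best_idx, best_dist = None, np.inf
    for i in table[bin_index(query)]:
        dist = np.linalg.norm(data[i] - query)
        if dist < best_dist:
            best_idx, best_dist = i, dist
    return best_idx, best_dist

# Counts of (#awesome, #awful) for a few reviews, then one query
data = np.array([[5.0, 1.0], [0.0, 4.0], [2.0, 3.0], [4.0, 0.0]])
table = build_hash_table(data)
print(nearest_in_same_bin(np.array([1.0, 3.0]), data, table))
```

Restricting the search to one bin is exactly why the answer is only approximate: a true nearest neighbor sitting just on the other side of the line lands in the other bin and is never examined.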