1 00:00:00,022 --> 00:00:04,692 [MUSIC] 2 00:00:04,692 --> 00:00:08,668 To wrap up this module, we're going to talk about a way to perform even more 3 00:00:08,668 --> 00:00:12,843 efficient nearest neighbor search using localitive sense of hashing using 4 00:00:12,843 --> 00:00:15,340 something that are called multiple tables. 5 00:00:15,340 --> 00:00:17,730 So these are multiple hash tables. 6 00:00:17,730 --> 00:00:21,300 And this is going to be advanced material. 7 00:00:21,300 --> 00:00:23,390 I actually think you could watch the video and 8 00:00:23,390 --> 00:00:28,320 get out the high level concept of what's going on and the importance of it. 9 00:00:28,320 --> 00:00:33,000 But it's going to include some content that's quite advanced in terms of 10 00:00:33,000 --> 00:00:36,320 looking at probabilities of different events and 11 00:00:36,320 --> 00:00:39,540 assessing the relative probabilities of different setups. 12 00:00:39,540 --> 00:00:43,790 And so we're putting this as an optional video, but 13 00:00:43,790 --> 00:00:47,670 I would encourage you to at least look at it and think through these different ideas 14 00:00:47,670 --> 00:00:50,910 even if you don't follow all the details mathematically of what's going on here. 15 00:00:52,610 --> 00:00:56,880 Okay, so let's look at a specific example of locality sense of hashing, 16 00:00:56,880 --> 00:00:59,780 where I'm just going to throw down two lines. 17 00:00:59,780 --> 00:01:05,310 And for simplicity I'm going to assume an algorithm where we search the bin 18 00:01:05,310 --> 00:01:10,510 that the query point falls into, as well as just one bit off. 19 00:01:10,510 --> 00:01:16,200 So in the case of throwing down two lines, we would search, for example, 20 00:01:16,200 --> 00:01:21,390 let's say that this is our query point, here. 21 00:01:23,280 --> 00:01:29,110 And it has been index 0,0 then we would also search bins 0,1 and 1,0. 22 00:01:29,110 --> 00:01:35,130 So these are cases where we've taken, switching colors again. 23 00:01:37,413 --> 00:01:41,036 We've taken one bit and flipped it. 24 00:01:44,302 --> 00:01:46,520 And searched those two neighboring bins. 25 00:01:46,520 --> 00:01:52,220 And the other thing we're going to assume is that, we're just going to call delta 26 00:01:52,220 --> 00:01:59,170 the probability of putting a line between two points that are theta angles apart. 27 00:01:59,170 --> 00:02:00,789 So if had two points. 28 00:02:02,750 --> 00:02:09,386 Living in this space and they were theta radians apart, 29 00:02:09,386 --> 00:02:14,469 then we're going to say that the probability 30 00:02:14,469 --> 00:02:20,553 that a randomly placed line divides these points. 31 00:02:20,553 --> 00:02:22,412 So this happens 32 00:02:25,351 --> 00:02:29,858 With probability delta. 33 00:02:29,858 --> 00:02:32,229 And so we can think through what delta is. 34 00:02:32,229 --> 00:02:37,906 So if this line is just uniformly placed between, 35 00:02:37,906 --> 00:02:40,885 let's say, zero and pi. 36 00:02:40,885 --> 00:02:44,734 So once we get to the other side we start wrapping around if you think about 37 00:02:44,734 --> 00:02:45,740 visualizing it. 38 00:02:45,740 --> 00:02:50,790 But you can say that the probability of falling within some little region 39 00:02:50,790 --> 00:02:53,910 theta is just theta divided by pi. 40 00:02:55,660 --> 00:02:58,150 So that would be one example of this probability delta but 41 00:02:58,150 --> 00:03:05,830 let's just generically say that we have probability delta falling in this range. 42 00:03:05,830 --> 00:03:10,260 Okay, then what we're saying here is that in this case 43 00:03:10,260 --> 00:03:11,922 we're going to search three bins. 44 00:03:11,922 --> 00:03:14,030 We're going to search our query bin and 45 00:03:14,030 --> 00:03:16,640 the two neighboring bins where we flipped just one bit. 46 00:03:18,000 --> 00:03:22,299 And the chance that we're not going to find a nearest neighbor, 47 00:03:22,299 --> 00:03:26,170 within these three bins is, delta squared. 48 00:03:26,170 --> 00:03:29,570 Because it's the chance that two lines, 49 00:03:29,570 --> 00:03:33,940 not just one line, divide our query point from its nearest neighbor. 50 00:03:33,940 --> 00:03:38,100 And, what's the chance of getting two lines within a region theta? 51 00:03:39,360 --> 00:03:42,210 Well, these two lines were drawn independently and 52 00:03:42,210 --> 00:03:48,150 each had probability delta of falling between in that range, 53 00:03:48,150 --> 00:03:52,350 that theta range, so we have delta times delta which is delta squared. 54 00:03:53,900 --> 00:03:57,495 But what would happen if I just repeated this process multiple times of throwing 55 00:03:57,495 --> 00:03:59,190 down two lines at random? 56 00:03:59,190 --> 00:04:03,363 So here's the example that we showed before where we threw down our two lines. 57 00:04:03,363 --> 00:04:08,445 And here I'm using L0, L1, L2, L3 just generically 58 00:04:08,445 --> 00:04:15,306 to refer to the list of data points that fall into this Bin 0, 1, 2, and 3. 59 00:04:15,306 --> 00:04:19,030 Okay so now let's say we throw down two other random lines. 60 00:04:19,030 --> 00:04:23,570 So it's these two lines and that creates a different partitioning of our data. 61 00:04:23,570 --> 00:04:27,890 And then we throw down two more lines and we get a different partitioning. 62 00:04:27,890 --> 00:04:32,770 And so what we've done is we've ended up with three different hash tables 63 00:04:32,770 --> 00:04:35,450 if we repeat this process three times. 64 00:04:35,450 --> 00:04:43,570 And now let's make an assumption that we're only going to search the bin 65 00:04:43,570 --> 00:04:47,940 in which our query point falls, so in particular, 66 00:04:47,940 --> 00:04:53,440 here's our query point and for this first case, 67 00:04:54,730 --> 00:05:00,155 we know that it has bin index 0,0 so, we're going to search this bin 0. 68 00:05:01,260 --> 00:05:05,926 And in the second case, this is our same query point here, 69 00:05:05,926 --> 00:05:10,976 has bin index 0,1, so we're going to search the 0,1 bin and 70 00:05:10,976 --> 00:05:16,023 then, likewise, in this case, it also has bin index 0,1, 71 00:05:16,023 --> 00:05:19,670 so again we're going to search the 0,1 bin. 72 00:05:20,760 --> 00:05:23,200 So, I've constructed this example. 73 00:05:23,200 --> 00:05:25,070 I've created three different hash tables and 74 00:05:25,070 --> 00:05:30,120 said we're only searching the bin that the query point falls in, so 75 00:05:30,120 --> 00:05:34,800 that in this case we're still searching three bins. 76 00:05:34,800 --> 00:05:37,461 Just like we did before when we searched the query point and 77 00:05:37,461 --> 00:05:39,550 then the two neighboring bins. 78 00:05:39,550 --> 00:05:41,570 So we're still searching three bins. 79 00:05:41,570 --> 00:05:42,905 But the question is now, 80 00:05:42,905 --> 00:05:46,700 what's the probability that we're not going to find our nearest neighbor. 81 00:05:46,700 --> 00:05:48,600 So let's work through this. 82 00:05:48,600 --> 00:05:53,860 So what this statement is equivalent to is, what's the chance that the query point 83 00:05:53,860 --> 00:05:57,670 and the nearest neighbor are in different bins, 84 00:05:57,670 --> 00:06:00,900 that they get split in all of these different hash tables? 85 00:06:02,590 --> 00:06:06,090 So what's the probability that a line falls between the query point and its 86 00:06:06,090 --> 00:06:10,630 nearest neighbor in this first example, the second example and this third example. 87 00:06:11,705 --> 00:06:14,800 Well, we can work through this mathematically or probabilistically. 88 00:06:16,710 --> 00:06:19,780 So in particular the probability of this event if we look, 89 00:06:19,780 --> 00:06:23,940 let's just look at this for a setup here to begin with, and the probability that 90 00:06:23,940 --> 00:06:28,920 the nearest neighbor's in a different bin than the query point in this first example 91 00:06:28,920 --> 00:06:33,020 is one minus the probability that they fell into the same bin. 92 00:06:34,300 --> 00:06:37,190 And what's the probability that the query point and 93 00:06:37,190 --> 00:06:39,694 the nearest neighbor fall into the same bin? 94 00:06:40,810 --> 00:06:44,541 Well, it's the probability that neither the first line nor 95 00:06:44,541 --> 00:06:47,330 the second line separated these two points. 96 00:06:48,410 --> 00:06:52,810 So what's the probability that neither line or sorry, 97 00:06:52,810 --> 00:06:57,060 what's the probability that one line does not split the two points? 98 00:06:57,060 --> 00:06:59,720 We know that delta is the probability that splits it. 99 00:06:59,720 --> 00:07:04,180 So one minus delta is the probability that there's no split from one line. 100 00:07:04,180 --> 00:07:05,960 But we have two lines. 101 00:07:05,960 --> 00:07:08,660 So, I mean just write this down, so we have. 102 00:07:08,660 --> 00:07:16,392 One minus delta equals probability, that one line. 103 00:07:18,265 --> 00:07:21,059 Does not split. 104 00:07:22,534 --> 00:07:27,172 Query and nearest neighbor. 105 00:07:27,172 --> 00:07:32,666 And then the reason we get this squared is because 106 00:07:32,666 --> 00:07:37,759 these lines are thrown down independently, 107 00:07:37,759 --> 00:07:42,450 so one minus delta squared is probability 108 00:07:42,450 --> 00:07:47,292 that two lines don't split these points. 109 00:07:47,292 --> 00:07:52,180 So what we have here is we have one minus the probability that neither 110 00:07:52,180 --> 00:07:55,629 line split our query and its nearest neighbor. 111 00:07:55,629 --> 00:07:58,668 And if you rewrite this here, 112 00:07:58,668 --> 00:08:04,270 you can rewrite it as 2 delta- delta squared, okay. 113 00:08:06,440 --> 00:08:14,140 So now that was just the probability of having our query point, 114 00:08:14,140 --> 00:08:18,140 and the nearest neighbor in different bins for this one example here. 115 00:08:18,140 --> 00:08:22,070 Well, what's the probability that that happens in all three cases? 116 00:08:22,070 --> 00:08:27,331 Well, again all three cases are completely independent events and, 117 00:08:27,331 --> 00:08:31,788 so we can write the probability as just raised to the power of 118 00:08:31,788 --> 00:08:34,748 the number of hash tables that we have. 119 00:08:37,344 --> 00:08:42,313 So we have this 2 delta- delta squared probability of splitting in any one of 120 00:08:42,313 --> 00:08:46,780 these cases, and we have three cases that we're going to look at. 121 00:08:46,780 --> 00:08:49,110 So the probability of splitting in each of them 122 00:08:50,510 --> 00:08:54,910 is that probability we looked at before raised to the third power. 123 00:08:56,260 --> 00:09:01,127 Okay, so to recap, if we looked at three bins in one hash table, 124 00:09:01,127 --> 00:09:05,908 then the probability of not finding our nearest neighbor after 125 00:09:05,908 --> 00:09:10,616 completing the search of those three bins was delta squared. 126 00:09:10,616 --> 00:09:16,001 But if we defined three hash tables, and only in each one searched one bin, 127 00:09:16,001 --> 00:09:19,920 so still a total of three bins searched overall. 128 00:09:19,920 --> 00:09:25,140 The probability of not finding the nearest neighbor is 2 delta- delta squared cubed. 129 00:09:26,370 --> 00:09:29,550 Okay, well it's a little bit hard to compare 130 00:09:29,550 --> 00:09:32,390 these two functions just written as they are. 131 00:09:32,390 --> 00:09:34,970 So I made a plot of them for you. 132 00:09:34,970 --> 00:09:38,816 And what, the X-axis is, is delta. 133 00:09:38,816 --> 00:09:43,670 It's the probability of a split of points that are an angle theta apart. 134 00:09:45,138 --> 00:09:49,610 And the Y-axis is the probability of not finding the nearest neighbor 135 00:09:49,610 --> 00:09:52,830 after performing the search of three bins that I described. 136 00:09:52,830 --> 00:09:57,063 So again, let me just be clear that this is under the scenario of 137 00:09:57,063 --> 00:10:03,607 throwing down two lines, 138 00:10:03,607 --> 00:10:11,530 searching three bins just to make sure the setup is clear. 139 00:10:12,970 --> 00:10:18,271 Okay, and what we see is that here, this pink case, 140 00:10:18,271 --> 00:10:23,576 let me switch colors, maybe there's a pink color. 141 00:10:23,576 --> 00:10:26,640 No, nothing very pink will just go with red. 142 00:10:26,640 --> 00:10:32,531 So this case here is the multiple 143 00:10:32,531 --> 00:10:36,895 hash table scenario and 144 00:10:36,895 --> 00:10:41,913 this blue line Is the single 145 00:10:41,913 --> 00:10:46,068 hash table scenario. 146 00:10:49,685 --> 00:10:55,167 And what we see is that we have a lower probability of not finding 147 00:10:55,167 --> 00:11:00,030 our nearest neighbor when we use multiple hash tables, 148 00:11:00,030 --> 00:11:05,840 still searching the same number of bins in almost all of the cases. 149 00:11:05,840 --> 00:11:09,910 There is this crossover point, somewhere around here, 150 00:11:12,120 --> 00:11:15,640 where one method, a single hash table, 151 00:11:15,640 --> 00:11:20,520 starts outperforming the multiple hash table, but that's when the angle 152 00:11:20,520 --> 00:11:25,290 between our nearest neighbor, the query and its nearest neighbor becomes larger. 153 00:11:25,290 --> 00:11:28,350 Well, we're going to assume in a lot of cases 154 00:11:29,380 --> 00:11:33,730 that the distance between our query and its nearest neighbor is small, and so 155 00:11:33,730 --> 00:11:37,529 in these cases we definitely get a benefit from looking at multiple hash tables. 156 00:11:38,590 --> 00:11:42,960 But the benefit is even larger if I think about throwing down more 157 00:11:44,570 --> 00:11:45,698 lines than just two. 158 00:11:45,698 --> 00:11:53,191 So let's restrict exactly what the algorithm's going to be doing here. 159 00:11:53,191 --> 00:11:58,250 So in this case we're going to think about throwing down some H lines, 160 00:11:58,250 --> 00:12:01,445 and then we're going to search, as before, 161 00:12:01,445 --> 00:12:07,710 the bin that the query point falls into, as well as all bins that are one bit off. 162 00:12:07,710 --> 00:12:12,170 But remember when we have more lines we have more bits, and so 163 00:12:12,170 --> 00:12:16,190 flipping just one bit there are going to be more bins that we end up searching. 164 00:12:16,190 --> 00:12:20,980 So, in particular, even when we're still searching bins that are just one bit 165 00:12:20,980 --> 00:12:25,058 off from the query, we're going to have to look at 166 00:12:25,058 --> 00:12:32,650 the probability of our nearest neighbor being more than one bit away. 167 00:12:32,650 --> 00:12:37,730 Okay, so, this is equal to 1 minus the probability that 168 00:12:37,730 --> 00:12:42,220 the points fall into this same bin, because then we would of found them, minus 169 00:12:42,220 --> 00:12:47,670 the probability that they're more than this 1 bin away search that we're doing. 170 00:12:49,440 --> 00:12:50,310 Okay, and so 171 00:12:50,310 --> 00:12:54,500 this is equivalent to saying it's 1 minus the probability that there were no splits. 172 00:12:54,500 --> 00:12:59,104 So no line split are querying its nearest neighbor 173 00:12:59,104 --> 00:13:03,489 minus the probability that there was one line, 174 00:13:03,489 --> 00:13:07,778 exactly one line, that split the two points. 175 00:13:07,778 --> 00:13:13,994 And so here what we have, 176 00:13:13,994 --> 00:13:19,027 so probability of no 177 00:13:19,027 --> 00:13:24,355 split is 1- delta, 178 00:13:24,355 --> 00:13:29,979 and we have h lines that 179 00:13:29,979 --> 00:13:35,913 we're throwing down. 180 00:13:35,913 --> 00:13:43,111 So this implies the probability 181 00:13:43,111 --> 00:13:48,708 that none of h lines split 182 00:13:48,708 --> 00:13:54,873 is (1- delta) to the h. 183 00:13:54,873 --> 00:13:57,590 Okay, now, let's talk about this other one, 184 00:13:57,590 --> 00:13:59,490 which is a little bit more complicated. 185 00:13:59,490 --> 00:14:03,920 because we need to compute what's the probability that a single line, 186 00:14:03,920 --> 00:14:08,930 only a single line, splits our query and its nearest neighbor. 187 00:14:08,930 --> 00:14:14,740 Not two lines or three lines or four lines or each lines, just one. 188 00:14:14,740 --> 00:14:17,110 So that's the probability. 189 00:14:17,110 --> 00:14:24,037 So here delta is the probability of one line splitting. 190 00:14:27,670 --> 00:14:33,417 And then here, We have 191 00:14:33,417 --> 00:14:38,316 the probability of h-1 192 00:14:38,316 --> 00:14:44,440 lines, Not splitting the points. 193 00:14:47,390 --> 00:14:54,160 But there are h different ways we can have just one line and not the other lines. 194 00:14:54,160 --> 00:14:56,600 So that's why we have to multiply by the h here. 195 00:14:59,280 --> 00:15:05,793 So in summary, we're going to search h + 1 bins, and not find the nearest neighbor, 196 00:15:05,793 --> 00:15:12,340 with probability 1- (1- delta) to the h- h delta(1 -delta) to the h- 1. 197 00:15:13,590 --> 00:15:18,100 Okay, well now let's consider the alternative using multiple hash tables. 198 00:15:18,100 --> 00:15:23,173 So again, just to keep the number of comparisons the same, 199 00:15:23,173 --> 00:15:27,535 remember that if we throw down h different lines and 200 00:15:27,535 --> 00:15:32,808 we search the query bin, and all the bins that are one bit off, 201 00:15:32,808 --> 00:15:36,479 we're going to end up searching h + 1 bins. 202 00:15:36,479 --> 00:15:41,968 Well now, when we're thinking about multiple hash tables again 203 00:15:41,968 --> 00:15:47,752 to make it as much apples and apples type of comparison that we can do, 204 00:15:47,752 --> 00:15:53,270 we're going to think about defining h + 1 different hash tables. 205 00:15:53,270 --> 00:15:58,553 And then just searching the query bin in each of these different tables. 206 00:15:58,553 --> 00:16:03,882 Okay and so then what we're going to calculate is what's the probability that 207 00:16:03,882 --> 00:16:09,920 my nearest neighbor is in a different bin in all h+1 of these hash tables. 208 00:16:09,920 --> 00:16:15,970 Well, it's 1 minus the probability that the query point and 209 00:16:15,970 --> 00:16:22,800 the nearest neighbor are in the same bin for a given hash table but 210 00:16:22,800 --> 00:16:28,460 then we have that to the h+1 because we're looking at h+1 different hash tables. 211 00:16:28,460 --> 00:16:33,279 So it's 1 minus the probability that there's 212 00:16:33,279 --> 00:16:38,096 no split line and repeating that h+1 times and 213 00:16:38,096 --> 00:16:42,710 so that's 1- delta to the h, to the h+1. 214 00:16:42,710 --> 00:16:48,510 Okay just to be very clear 215 00:16:48,510 --> 00:16:57,075 this is probability of no split in any 216 00:17:00,794 --> 00:17:05,187 I'm sorry, split from any of the h 217 00:17:05,187 --> 00:17:09,892 different lines that we threw down. 218 00:17:09,892 --> 00:17:14,361 So 1 minus that is the probability that they fall 219 00:17:14,361 --> 00:17:18,503 into separate bins rather than the same bin. 220 00:17:18,503 --> 00:17:21,430 And we need this to hold for all hash tables. 221 00:17:25,430 --> 00:17:27,880 I want to say this one other way in case 222 00:17:29,160 --> 00:17:30,900 the way I was presenting that wasn't clear. 223 00:17:32,620 --> 00:17:37,649 So in particular, what we're computing here per hash table 224 00:17:37,649 --> 00:17:42,776 is what's the probability that 1 or 2 or 3 or all the way up to 225 00:17:42,776 --> 00:17:48,416 h different lines split are query point from its nearest neighbor. 226 00:17:48,416 --> 00:17:53,347 And one simple way of calculating that is just 1 minus the complement of that event 227 00:17:53,347 --> 00:17:56,348 which is the probability that there are no lines, 228 00:17:56,348 --> 00:18:00,013 none of the h lines that we throw down split these two points. 229 00:18:00,013 --> 00:18:04,582 Okay, and so that's what we have to calculate per hash table and we just 230 00:18:04,582 --> 00:18:09,992 multiply those together to get the overall probability that this doesn't happen, 231 00:18:09,992 --> 00:18:15,930 or rather, that there is a split of points in each one of our different tables. 232 00:18:15,930 --> 00:18:20,580 Okay, so now let's summarize in this slightly more generic setting. 233 00:18:20,580 --> 00:18:23,790 Where we're going to assume that we're throwing down h lines. 234 00:18:26,500 --> 00:18:29,870 And in the first case we're going to search h+1 bins and 235 00:18:29,870 --> 00:18:33,000 the probability of not finding a nearest neighbor is written here. 236 00:18:33,000 --> 00:18:35,650 I'm not going to repeat that equation. 237 00:18:35,650 --> 00:18:41,370 And in the case of boltable hash tables in particular are looking at h+1 different 238 00:18:41,370 --> 00:18:45,330 hash tables, but always searching just the bin where the query point falls in 239 00:18:45,330 --> 00:18:48,720 where you have this probability of not finding the nearest neighbor. 240 00:18:48,720 --> 00:18:51,590 Again, not very straightforward to think about comparing these two 241 00:18:51,590 --> 00:18:52,320 different equations. 242 00:18:52,320 --> 00:18:54,280 So let's look at some plots. 243 00:18:54,280 --> 00:18:57,170 And we're going to look at plots in two different scenarios. 244 00:18:57,170 --> 00:19:02,880 One is for throwing down three lines, the other one is for throwing down ten lines. 245 00:19:02,880 --> 00:19:06,960 And, what we see is here, this top case is 246 00:19:06,960 --> 00:19:12,400 the one hash table scenario. 247 00:19:12,400 --> 00:19:17,506 And this future line is that 248 00:19:17,506 --> 00:19:23,507 multiple hash table scenario. 249 00:19:23,507 --> 00:19:29,835 And what we see in these plots is that now there's a much larger difference between 250 00:19:29,835 --> 00:19:35,900 the benefits that we have using multiple hash tables over a single hash table. 251 00:19:35,900 --> 00:19:40,396 And that's especially true as we increase the number of lines that we 252 00:19:40,396 --> 00:19:41,720 are throwing down. 253 00:19:42,805 --> 00:19:47,110 And I want to mention that the set up that I created here was really for 254 00:19:47,110 --> 00:19:51,750 the sake of comparing looking at one hash table versus multiple hash tables but 255 00:19:51,750 --> 00:19:56,770 in general the way people really use this in practice is you don't have to 256 00:19:56,770 --> 00:20:01,920 create exactly as many hash tables as lines thrown down. 257 00:20:01,920 --> 00:20:03,710 You can actually create many, 258 00:20:03,710 --> 00:20:07,760 many more hash tables than the number of lines you throw down. 259 00:20:07,760 --> 00:20:12,843 So in particular, you can think of keeping the number of bits fixed, 260 00:20:12,843 --> 00:20:16,173 the number of lines you throw down fixed, and 261 00:20:16,173 --> 00:20:19,964 think about increasing the number of hash tables. 262 00:20:19,964 --> 00:20:25,168 So we're going to call m, the number of hash tables. 263 00:20:25,168 --> 00:20:30,941 And h, here is the number of lines or 264 00:20:30,941 --> 00:20:33,640 number of bits. 265 00:20:35,650 --> 00:20:39,830 And so people sometimes refer to this as the width and the depth of a table. 266 00:20:39,830 --> 00:20:41,515 So the width being the number of bits. 267 00:20:41,515 --> 00:20:45,300 The depth being the number of hash tables that we're looking at. 268 00:20:45,300 --> 00:20:48,200 And what you can show is that the probability 269 00:20:48,200 --> 00:20:50,820 of not finding the nearest neighbor. 270 00:20:50,820 --> 00:20:54,560 So, the probability that the query point in the nearest neighbor in different bins 271 00:20:54,560 --> 00:21:00,980 in all of these m different hash tables falls off exponentially fast. 272 00:21:00,980 --> 00:21:05,150 So in particular, the probability calculation is, exactly what we went 273 00:21:05,150 --> 00:21:10,150 through before but instead of having h+1 different hash tables, 274 00:21:10,150 --> 00:21:13,110 here we're looking at m+1. 275 00:21:13,110 --> 00:21:14,140 Or I guess, sorry. 276 00:21:14,140 --> 00:21:17,450 This really should just be n, not n+1, 277 00:21:17,450 --> 00:21:22,390 because we're not looking at that specific contrived example. 278 00:21:22,390 --> 00:21:27,610 But the point still holds that, as you increase the number of hash tables 279 00:21:27,610 --> 00:21:32,070 because you're raising some probability, so some number less than 1, to higher and 280 00:21:32,070 --> 00:21:35,600 higher powers, you're getting lower and lower probability at 281 00:21:35,600 --> 00:21:39,960 this exponentially fast rate of not finding our nearest neighbor. 282 00:21:41,090 --> 00:21:44,520 And typically in many different setups we're going to consider, 283 00:21:44,520 --> 00:21:51,160 you can show that the probability of finding a nearest neighbor after searching 284 00:21:51,160 --> 00:21:55,880 one bin in m different hash tables, so again in total m different bins, 285 00:21:55,880 --> 00:22:02,590 is higher than if we were to search m bins in a single hash table. 286 00:22:02,590 --> 00:22:06,340 Okay, so to summarize what we've shown in this optional video, 287 00:22:06,340 --> 00:22:11,230 is that when we think about searching multiple bins within a single hash table, 288 00:22:12,660 --> 00:22:15,180 the cost of binning the points is lower. 289 00:22:15,180 --> 00:22:17,780 You only have to form one hash table. 290 00:22:17,780 --> 00:22:22,710 But typically you tend to have to search more bins to have the same quality of 291 00:22:22,710 --> 00:22:29,100 approximation as if you performed multiple binnings of points. 292 00:22:29,100 --> 00:22:33,810 So, look at multiple hash tables which cost more. 293 00:22:33,810 --> 00:22:37,130 You have to create all of these different hash tables. 294 00:22:37,130 --> 00:22:40,482 But tends to provide a better approximation for 295 00:22:40,482 --> 00:22:42,669 our nearest neighbor search. 296 00:22:42,669 --> 00:22:46,870 [MUSIC]