Let's consider some other kinds of search now. So far, we have been indexing objects, such as documents, by the features they contain, which are words, or, in the case of records in a database, the contents of different columns. Our assumption has been that when we search for a document or a record, our query is a bag of words. In other words, features are what we search with.

But suppose our query is itself an object: for example, an image, as in Google Goggles or Google Image search, or a biometric, such as a fingerprint or an iris scan, as in the unique identification project of the Government of India. For those of you not familiar with this ambitious and equally controversial project, it aims to provide every resident of India with a unique biometric ID. We will return to this in a minute.

But coming back to the question: is searching for an object in an inverted index of features, whether they are words, image features, or other features in the case of fingerprints and irises, really the best way to search for objects? What do you think? I would say that no, it probably is not the best way to search for objects. As to why, I encourage you all to think about this and discuss it on the discussion forum.

Right now, though, we are going to discuss another very powerful method, called locality sensitive hashing, invented by Indyk and Motwani in the late '90s and described in fair detail in the Anand Rajaraman book, which is a reference for this course. This technique is particularly important in all kinds of big data applications because it somehow manages to compare, at least approximately, n objects in linear, that is, order-n, time. There are n squared pairs, but it manages to tell us which pairs are actually close together without actually comparing all pairs. Sounds like magic; let's see how.

The basic idea of locality sensitive hashing, or LSH as we shall refer to it going forward, is the following. It is hashing, so an object X is hashed to a value h(X), but in such a manner that if X and Y are two objects and X is close to Y, then the two hash values are equal with high probability. In this respect it is similar to normal hashing, where if X is equal to Y, then X and Y hash, with high probability, to the same value. But LSH deals with distances and closeness, not just equality, and that is what makes it interesting: it is hashing that is sensitive to things being close together, or being in the same locality.
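As a concrete illustration of that collision property, here is a minimal sketch in Python. The lecture does not commit to a particular locality-sensitive family at this point, so the sketch assumes minhashing of word sets, the standard family for Jaccard similarity covered in the same Rajaraman and Ullman reference; the helper make_minhash and the example documents are this sketch's own inventions, not anything from the lecture.

```python
import random

def make_minhash(seed, prime=4294967311):
    # One random hash function: h(S) = min over x in S of (a * hash(x) + b) mod prime.
    rng = random.Random(seed)
    a, b = rng.randrange(1, prime), rng.randrange(prime)
    return lambda s: min((a * hash(x) + b) % prime for x in s)

# Two close objects (they share most of their words) and one far object.
doc_x = {"locality", "sensitive", "hashing", "big", "data", "search"}
doc_y = {"locality", "sensitive", "hashing", "big", "data", "index"}
doc_z = {"completely", "unrelated", "words", "here"}

trials = 1000
close = sum(make_minhash(i)(doc_x) == make_minhash(i)(doc_y) for i in range(trials))
far = sum(make_minhash(i)(doc_x) == make_minhash(i)(doc_z) for i in range(trials))

# For minhashing, P[h(X) == h(Y)] is approximately the Jaccard similarity of X and Y,
# so the close pair collides often and the far pair essentially never does.
print("close pair collision rate:", close / trials)  # around 5/7, roughly 0.7
print("far pair collision rate:  ", far / trials)    # roughly 0.0
```

The closer two objects are under the chosen similarity measure, the more often their hash values agree, which is exactly the locality-sensitive property just described.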
Conversely, obviously, if X is not close to Y, that is, X is far from Y, then the two hash values will not be equal, with high probability.

Of course, constructing the hash functions themselves is quite tricky. Here we have to do interesting things, like take random functions from a locality-sensitive family and combine them in particular ways; a small sketch of this combining step follows below. Rajaraman and Ullman's chapter three describes this process fairly well, and in considerable detail.

An example is biometric matching. In the case of the UID project, for example, more than a billion people are going to get biometric IDs. Of these, more than 200 million have already been enrolled. At the same time, let me make the following disclaimer: what the UID actually uses internally is proprietary, and I am not intending to suggest that they use locality sensitive hashing. This is merely a motivating example.
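To make the combining step concrete, here is a minimal sketch of the banding construction described in Rajaraman and Ullman's chapter three, again assuming minhash signatures over word sets. The parameter choices (bands, rows) and names such as candidate_pairs are illustrative assumptions of this sketch, and, per the disclaimer above, none of this reflects what the UID system actually does.

```python
from collections import defaultdict
from itertools import combinations
import random

def minhash_signature(s, num_hashes, prime=4294967311):
    # Fixed seed so that every object is hashed with the same sequence of random functions.
    rng = random.Random(42)
    sig = []
    for _ in range(num_hashes):
        a, b = rng.randrange(1, prime), rng.randrange(prime)
        sig.append(min((a * hash(x) + b) % prime for x in s))
    return sig

def candidate_pairs(objects, bands=10, rows=2):
    """AND together `rows` hash values inside each band, OR across `bands` bands:
    only objects landing in the same bucket in at least one band become candidates."""
    sigs = {name: minhash_signature(s, bands * rows) for name, s in objects.items()}
    candidates = set()
    for band in range(bands):
        buckets = defaultdict(list)
        for name, sig in sigs.items():
            key = tuple(sig[band * rows:(band + 1) * rows])  # AND: all rows must agree
            buckets[key].append(name)
        for names in buckets.values():                        # OR: any shared bucket suffices
            candidates.update(combinations(sorted(names), 2))
    return candidates

objects = {
    "x": {"locality", "sensitive", "hashing", "big", "data", "search"},
    "y": {"locality", "sensitive", "hashing", "big", "data", "index"},
    "z": {"completely", "unrelated", "words", "here"},
}
# Almost certainly prints {("x", "y")}: the close pair surfaces as a candidate,
# while the far pairs are never compared at all.
print(candidate_pairs(objects))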