The power of locality sensitive hashing is closely related to the behaviour of this particular function: 1 - (1 - pq^k)^b. That is, pq raised to the power k, subtracted from one, the result raised to the power b, and then again subtracted from one. This function, as you can see, is a very steep one, with the result that the value pq, which is nothing but the probability of a match under one of the locality sensitive functions f, is amplified. So even if pq is fairly moderate, say something close to 0.1 or 0.15, it gets amplified to a very large value, giving a very high probability of a match. At the same time, the false-match probability is driven towards zero as long as it is a little smaller, say 0.07: fairly close to something like 0.15, yet it gets driven to a very small value as long as it is reasonably smaller than pq. The place where this steep rise happens depends on the choices of k and b, and by suitably adjusting these parameters, depending on our values of pq and the false-match probability, we can get whatever behaviour we want to achieve.
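To make this concrete, here is a minimal sketch in Python of the amplification curve; pq is written p in the code, and the particular values of k and b are assumptions chosen for illustration, not values from the lecture.

```python
# Sketch of the LSH amplification function 1 - (1 - p^k)^b.

def s_curve(p: float, k: int, b: int) -> float:
    """Probability that at least one of b bands of k hash
    functions matches, if a single function matches with
    probability p."""
    return 1.0 - (1.0 - p ** k) ** b

# k and b are illustrative assumptions, picked so that the steep
# rise of the curve falls between 0.07 and 0.15.
k, b = 6, 100_000

for p in (0.15, 0.07):
    print(f"p = {p:.2f} -> amplified probability = {s_curve(p, k, b):.3f}")

# Output:
#   p = 0.15 -> amplified probability = 0.680
#   p = 0.07 -> amplified probability = 0.012
```

A moderate gap between 0.15 and 0.07 becomes a wide gap after amplification; increasing k pushes the steep rise to the right, while increasing b pulls it back to the left.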
LSH has many important applications which fall under the big data category these days. For example, when you have hundreds of millions of tweets, how do we group similar tweets without having to compare all the pairs? Comparing all pairs is an impossible task: there are n² pairs, and if n is in the hundreds of millions, it just becomes impossible. LSH is one way out; a small sketch at the end of this section illustrates the idea.

In enterprise search, we saw the problem of finding near duplicates or versions of the same root document. This is another problem which can be addressed using LSH.

Finding patterns in time series from sensors is another: you have multiple sensors capturing many different parameters, say from a car, a piece of machinery, or a ship, and you want to find patterns in that time series. LSH-like concepts turn out to be useful there as well.

Another example, which is also discussed in the Anand Rajaraman book, is resolving the identities of people from different databases or multiple inputs. One such problem, described in the book, is reconciling records from different systems. Another is figuring out which Twitter IDs match which Facebook IDs, which LinkedIn IDs, and which e-mail IDs. Now, this is private information which people want to keep separate, but there are many companies engaged in figuring out, using concepts like LSH, which identities should be clumped together because they are very likely to belong to the same person.
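As mentioned above for the tweets example, here is a minimal sketch of how LSH groups candidates without comparing all pairs. The MinHash signatures, the word-shingle tokenizer, Python's built-in hash as the hash family, and the parameters K and B are all assumptions for illustration, not details from the lecture.

```python
# Minimal LSH-by-banding sketch: tweets that agree on all K hash
# values in at least one of B bands land in the same bucket and
# become candidate pairs; nothing else is ever compared.
import random
from collections import defaultdict

K, B = 2, 50                  # rows per band, number of bands (assumed)
SEEDS = [random.randrange(1 << 30) for _ in range(K * B)]

def shingles(text: str, n: int = 3):
    """Set of n-word shingles; falls back to the whole text if short."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)} or {text}

def minhash(text: str):
    """One MinHash value per seed, taken over the tweet's shingle set."""
    sh = shingles(text)
    return [min(hash((seed, s)) for s in sh) for seed in SEEDS]

buckets = defaultdict(list)   # (band index, band signature) -> tweet ids

def index(tweet_id: int, text: str) -> None:
    sig = minhash(text)
    for band in range(B):
        buckets[(band, tuple(sig[band * K:(band + 1) * K]))].append(tweet_id)

index(1, "the quick brown fox jumps over the lazy dog")
index(2, "the quick brown fox jumped over the lazy dog")
index(3, "completely unrelated text about something else entirely")

# Only tweets sharing a bucket become candidates: very likely {(1, 2)}.
print({tuple(ids) for ids in buckets.values() if len(ids) > 1})
```

Indexing is linear in the number of tweets, which is what makes the hundreds-of-millions scale feasible; the k and b tuning discussed earlier (written K and B here) controls how similar two tweets must be before they are likely to share a bucket.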