So in summary, we've covered a lot of ground in this module. We returned to the task of retrieving a document of interest, and formulated it in terms of performing nearest neighbor search. We reviewed algorithms for 1-nearest-neighbor search as well as k-nearest-neighbor search. But we also got into some of the nitty-gritty of these algorithms, in particular thinking about how we represent our data. We focused here on different representations of documents, but we also talked about the really critical question of how we compute distances between two different data points. We went through a set of different ways to compute these distances and discussed the trade-offs we're faced with, and the fact that there's really no one answer for what's the right data representation and what's the right distance to use. That's where a lot of domain knowledge comes in; it can have a lot of impact on the performance of the algorithms, and you really have to think critically there. Then, for the rest of the module, we talked about ways to perform really efficient nearest neighbor search, to help handle the computational cost of brute-force search.
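As one concrete illustration of the representation-plus-distance pairing discussed above (this particular combination is an assumption for the example, not the lecture's prescribed choice), a common setup represents each document as a sparse word-count vector and compares documents with cosine distance:

```python
import math

def cosine_distance(a, b):
    """Cosine distance between two sparse word-count vectors (dicts).

    Returns 1 - cos(angle between a and b): 0 for identical directions,
    approaching 1 as the documents share fewer weighted words.
    """
    dot = sum(count * b.get(word, 0) for word, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / (norm_a * norm_b)

# Hypothetical toy documents, purely for illustration.
doc1 = {"nearest": 2, "neighbor": 2, "search": 1}
doc2 = {"nearest": 1, "neighbor": 1, "hashing": 3}
print(round(cosine_distance(doc1, doc2), 3))  # prints 0.598
```

Swapping in TF-IDF weights instead of raw counts, or Euclidean distance instead of cosine, changes which documents look "close" — which is exactly the trade-off the lecture emphasizes.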
Brute-force search, when you have lots and lots of observations, can be really, really computationally intensive. So we talked about KD-trees, a very efficient data structure that we can use for performing our nearest neighbor search. We showed when it can be useful, but we also saw that it struggles in high dimensions. So then we turned to the idea of locality-sensitive hashing, which only provides approximate nearest neighbors. But we motivated that this can actually be reasonable in a really large set of different applications. And locality-sensitive hashing is really simple to implement, can be very efficient, and provides good approximations with probabilistic guarantees on the approximations made. So, coming out of this module, you really have some powerful tools that are useful in a huge number of retrieval tasks, going beyond document retrieval. There are lots and lots of applications where you have some query and you want to find some similar object, and all the tools we've discussed in this module generalize to these other applications as well.
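To make the "simple to implement" claim concrete, here is a minimal sketch of one standard flavor of locality-sensitive hashing, random-hyperplane hashing for cosine similarity (the dimensions, seed, and vectors below are illustrative assumptions, not the lecture's exact construction). Each random hyperplane contributes one bit: 1 if the vector falls on its positive side, 0 otherwise, so similar vectors tend to land in the same bucket:

```python
import random

def lsh_bucket(vec, hyperplanes):
    """Hash a dense vector to an integer bucket id.

    Bit i of the id is 1 if vec has a non-negative dot product with
    hyperplane i (i.e., lies on its positive side), else 0.
    """
    bits = 0
    for h in hyperplanes:
        score = sum(v * w for v, w in zip(vec, h))
        bits = (bits << 1) | (1 if score >= 0 else 0)
    return bits

# Illustrative setup: 3 random Gaussian hyperplanes in 5 dimensions,
# giving 2**3 = 8 possible buckets.
random.seed(0)
dim, n_planes = 5, 3
hyperplanes = [[random.gauss(0, 1) for _ in range(dim)]
               for _ in range(n_planes)]

query = [1.0, 0.2, 0.0, 0.5, 0.1]
near = [0.9, 0.3, 0.1, 0.4, 0.0]  # similar direction to query
print(lsh_bucket(query, hyperplanes), lsh_bucket(near, hyperplanes))
```

At query time you only compare against candidates in the query's bucket (and perhaps neighboring buckets), which is where the speedup over brute force, and the "approximate" in approximate nearest neighbors, comes from.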
And so, I'll just leave you with a summary of the set of things that you should be able to do now that you've seen this module.