1 00:00:00,000 --> 00:00:04,123 [MUSIC] 2 00:00:04,123 --> 00:00:07,861 We've set up how to perform nearest neighbor search in terms of describing 3 00:00:07,861 --> 00:00:10,400 the algorithm that we have to run through. 4 00:00:10,400 --> 00:00:13,190 And then spending some time discussing two really critical 5 00:00:13,190 --> 00:00:15,010 components of that algorithm. 6 00:00:15,010 --> 00:00:18,900 The first being, how do we represent our documents of interest? 7 00:00:18,900 --> 00:00:19,670 And the second being, 8 00:00:19,670 --> 00:00:22,730 how do we compute the distance between two different documents? 9 00:00:23,950 --> 00:00:27,500 And we showed that those two components are really, really important for 10 00:00:27,500 --> 00:00:29,760 the performance of the method. 11 00:00:29,760 --> 00:00:34,230 But now, I want to spend a little bit of time thinking about how computationally 12 00:00:34,230 --> 00:00:37,320 intensive doing nearest neighbor search can be. 13 00:00:38,610 --> 00:00:42,730 And we refer to this as the complexity of search. 14 00:00:42,730 --> 00:00:45,520 Okay, so what's the complexity of brute-force search? 15 00:00:45,520 --> 00:00:48,680 So let's say we have a given query point. 16 00:00:48,680 --> 00:00:49,830 What do we need to do? 17 00:00:49,830 --> 00:00:53,630 Well, we need to scan through all the documents in the corpus, 18 00:00:53,630 --> 00:00:57,360 compute the distance to each one, and then we have to say, 19 00:00:57,360 --> 00:00:59,770 okay, which is the smallest distance? 20 00:00:59,770 --> 00:01:01,840 So what's the complexity of doing that? 21 00:01:01,840 --> 00:01:08,680 Well, for a one nearest neighbor query, we have to compute N different distances. 22 00:01:08,680 --> 00:01:12,486 So the complexity is order capital N, number of documents in the corpus. 23 00:01:12,486 --> 00:01:15,829 And if we want to do k nearest neighbors and 24 00:01:15,829 --> 00:01:21,643 return the k nearest documents, then the complexity is order N log k. 25 00:01:21,643 --> 00:01:26,524 And the reason we get that log k term is if we do a smart implementation where 26 00:01:26,524 --> 00:01:28,888 we maintain this priority queue and 27 00:01:28,888 --> 00:01:32,620 insert into the priority queue in an efficient manner. 28 00:01:33,860 --> 00:01:37,280 Okay, but what if N is really, really large? 29 00:01:37,280 --> 00:01:39,153 And importantly, what if you have to do many, 30 00:01:39,153 --> 00:01:41,758 many queries on this corpus again and again and again and again? 31 00:01:41,758 --> 00:01:46,427 Is there something we can do that's more efficient than this brute-force search? 32 00:01:46,427 --> 00:01:50,899 [MUSIC]