1
00:00:00,000 --> 00:00:04,123
[MUSIC]

2
00:00:04,123 --> 00:00:07,861
We've set up how to perform nearest
neighbor search in terms of describing

3
00:00:07,861 --> 00:00:10,400
the algorithm that we have to run through.

4
00:00:10,400 --> 00:00:13,190
And then spending some time
discussing two really critical

5
00:00:13,190 --> 00:00:15,010
components of that algorithm.

6
00:00:15,010 --> 00:00:18,900
The first being, how do we represent
our documents of interest?

7
00:00:18,900 --> 00:00:19,670
And the second being,

8
00:00:19,670 --> 00:00:22,730
how do we compute the distance
between two different documents?

9
00:00:23,950 --> 00:00:27,500
And we showed that those two components
are really, really important for

10
00:00:27,500 --> 00:00:29,760
the performance of the method.

11
00:00:29,760 --> 00:00:34,230
But now, I want to spend a little bit of
time thinking about how computationally

12
00:00:34,230 --> 00:00:37,320
intensive doing nearest
neighbor search can be.

13
00:00:38,610 --> 00:00:42,730
And we refer to this as
the complexity of search.

14
00:00:42,730 --> 00:00:45,520
Okay, so what's the complexity
of brute-force search?

15
00:00:45,520 --> 00:00:48,680
So let's say we have a given query point.

16
00:00:48,680 --> 00:00:49,830
What do we need to do?

17
00:00:49,830 --> 00:00:53,630
Well, we need to scan through
all the documents in the corpus,

18
00:00:53,630 --> 00:00:57,360
compute the distance to each one,
and then we have to say,

19
00:00:57,360 --> 00:00:59,770
okay, which is the smallest distance?

20
00:00:59,770 --> 00:01:01,840
So what's the complexity of doing that?

21
00:01:01,840 --> 00:01:08,680
Well, for a one nearest neighbor query,
we have to compute N different distances.

22
00:01:08,680 --> 00:01:12,486
So the complexity is order capital N,
number of documents in the corpus.

23
00:01:12,486 --> 00:01:15,829
And if we want to do k
nearest neighbors and

24
00:01:15,829 --> 00:01:21,643
return the k nearest documents,
then the complexity is order N log k.

25
00:01:21,643 --> 00:01:26,524
And the reason we get that log k term is
if we do a smart implementation where

26
00:01:26,524 --> 00:01:28,888
we maintain this priority queue and

27
00:01:28,888 --> 00:01:32,620
insert into the priority
queue in an efficient manner.

28
00:01:33,860 --> 00:01:37,280
Okay, but what if N is really,
really large?

29
00:01:37,280 --> 00:01:39,153
And importantly,
what if you have to do many,

30
00:01:39,153 --> 00:01:41,758
many queries on this corpus again and
again and again and again?

31
00:01:41,758 --> 00:01:46,427
Is there something we can do that's more
efficient than this brute-force search?

32
00:01:46,427 --> 00:01:50,899
[MUSIC]