So in summary, we've covered a lot of ground in this module. We returned to the task of retrieving a document of interest, and formulated it in terms of performing nearest neighbor search. We reviewed algorithms for 1-nearest-neighbor search as well as k-nearest-neighbor search. But we also got into some of the nitty-gritty of these algorithms, in particular thinking about how we represent our data. We focused here on different representations of documents, but we also talked about the really critical question of how we compute distances between two different data points. We went through a set of different ways to compute these distances and discussed the trade-offs we're faced with, and the fact that there's really no one answer for what's the right data representation and what's the right distance to use. That's where a lot of domain knowledge comes in; it can have a lot of impact on the performance of the algorithms, and you really have to think critically there. Then, for the rest of the module, we talked about ways to perform really efficient nearest neighbor search, to help handle the computational cost of brute-force search.
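As one concrete illustration of the representation-plus-distance pairing discussed above (this particular combination is an assumption for the example, not the lecture's prescribed choice), a common setup represents each document as a sparse word-count vector and compares documents with cosine distance:

```python
import math

def cosine_distance(a, b):
    """Cosine distance between two sparse word-count vectors (dicts).

    Returns 1 - cos(angle between a and b): 0 for identical directions,
    approaching 1 as the documents share fewer weighted words.
    """
    dot = sum(count * b.get(word, 0) for word, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / (norm_a * norm_b)

# Hypothetical toy documents, purely for illustration.
doc1 = {"nearest": 2, "neighbor": 2, "search": 1}
doc2 = {"nearest": 1, "neighbor": 1, "hashing": 3}
print(round(cosine_distance(doc1, doc2), 3))  # prints 0.598
```

Swapping in TF-IDF weights instead of raw counts, or Euclidean distance instead of cosine, changes which documents look "close" — which is exactly the trade-off the lecture emphasizes.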
Brute-force search, when you have lots and lots of observations, can be really, really computationally intensive. So we talked about KD-trees, a very efficient data structure that we can use for performing our nearest neighbor search. We showed when it can be useful, but we also saw that it struggles in high dimensions. So then we turned to the idea of locality-sensitive hashing, which only provides approximate nearest neighbors. But we motivated that this can actually be reasonable in a really large set of different applications. And locality-sensitive hashing is really simple to implement, can be very efficient, and provides good approximations with probabilistic guarantees on the approximations made. So, coming out of this module, you really have some powerful tools that are useful in a huge number of retrieval tasks, going beyond document retrieval. There are lots and lots of applications where you have some query and you want to find some similar object, and all the tools we've discussed in this module generalize to these other applications as well.
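To make the "simple to implement" claim concrete, here is a minimal sketch of one standard flavor of locality-sensitive hashing, random-hyperplane hashing for cosine similarity (the dimensions, seed, and vectors below are illustrative assumptions, not the lecture's exact construction). Each random hyperplane contributes one bit: 1 if the vector falls on its positive side, 0 otherwise, so similar vectors tend to land in the same bucket:

```python
import random

def lsh_bucket(vec, hyperplanes):
    """Hash a dense vector to an integer bucket id.

    Bit i of the id is 1 if vec has a non-negative dot product with
    hyperplane i (i.e., lies on its positive side), else 0.
    """
    bits = 0
    for h in hyperplanes:
        score = sum(v * w for v, w in zip(vec, h))
        bits = (bits << 1) | (1 if score >= 0 else 0)
    return bits

# Illustrative setup: 3 random Gaussian hyperplanes in 5 dimensions,
# giving 2**3 = 8 possible buckets.
random.seed(0)
dim, n_planes = 5, 3
hyperplanes = [[random.gauss(0, 1) for _ in range(dim)]
               for _ in range(n_planes)]

query = [1.0, 0.2, 0.0, 0.5, 0.1]
near = [0.9, 0.3, 0.1, 0.4, 0.0]  # similar direction to query
print(lsh_bucket(query, hyperplanes), lsh_bucket(near, hyperplanes))
```

At query time you only compare against candidates in the query's bucket (and perhaps neighboring buckets), which is where the speedup over brute force, and the "approximate" in approximate nearest neighbors, comes from.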
And so, I'll just leave you with a summary of the set of things that you should be able to do now that you've seen this module.