1
00:00:00,000 --> 00:00:05,155
[MUSIC]

2
00:00:05,155 --> 00:00:09,858
So up to this point, we've described just
the concept of partitioning our data,

3
00:00:09,858 --> 00:00:11,710
just using these simple lines.

4
00:00:11,710 --> 00:00:15,680
And I should have mentioned that in
everything we're going to describe here,

5
00:00:15,680 --> 00:00:19,540
we're just going to assume that there's
no intercept term to the line, so

6
00:00:19,540 --> 00:00:21,848
it's always just going to
go through the origin.

7
00:00:21,848 --> 00:00:26,312
So it's a very simple structure,
a very simple way to think

8
00:00:26,312 --> 00:00:30,342
about partitioning our data
into two separate bins.

9
00:00:30,342 --> 00:00:34,752
And then it’s a very simple way that we
can just search along the points that

10
00:00:34,752 --> 00:00:36,929
fall into each of these partitions.

11
00:00:36,929 --> 00:00:39,930
But there are number of questions
that we might be faced with.

12
00:00:39,930 --> 00:00:44,650
So one of these questions is how do
we think of defining a good line

13
00:00:44,650 --> 00:00:46,320
to split the points?

14
00:00:46,320 --> 00:00:49,060
And associated with that question is,

15
00:00:49,060 --> 00:00:53,520
what's the resulting quality of this
type of nearest neighbor search?

16
00:00:53,520 --> 00:00:57,030
We said that it provides an approximate
nearest neighbor search, but

17
00:00:57,030 --> 00:01:00,460
what are the chances that we're actually
going to find the nearest neighbor

18
00:01:00,460 --> 00:01:02,710
to a given query point?

19
00:01:02,710 --> 00:01:06,370
And then finally, what types of
efficiencies are we actually going to get

20
00:01:06,370 --> 00:01:08,650
from the simple partitioning
of data points?

21
00:01:10,420 --> 00:01:14,010
Well, let's start with the question on
how are we going to define this line?

22
00:01:14,010 --> 00:01:17,800
And for this, let's propose something
that sounds really, really crazy.

23
00:01:17,800 --> 00:01:19,920
What if we just draw the line randomly?

24
00:01:19,920 --> 00:01:23,940
Just randomly throw down a line
that goes through the 0, 0 point.

25
00:01:25,620 --> 00:01:29,058
Well, how bad can this be?

26
00:01:29,058 --> 00:01:32,329
It sounds that could actually be really,
really bad, but

27
00:01:32,329 --> 00:01:34,213
let's think through this a bit.

28
00:01:34,213 --> 00:01:38,530
And remember that our objective is if
two points are close together, and

29
00:01:38,530 --> 00:01:41,858
here we're going to be focusing
on cosine similarity when

30
00:01:41,858 --> 00:01:45,845
we're talking about the distance
between two different points.

31
00:01:45,845 --> 00:01:51,560
Well, our goal is that if they're close
together according to cosine similarity,

32
00:01:51,560 --> 00:01:55,300
then we want those two points
to fall into the same bin.

33
00:01:55,300 --> 00:02:00,360
To formalize this a bit, let's just say
that the angle, remember that cosine

34
00:02:00,360 --> 00:02:04,200
similarity is defined in terms of the
cosine of the angle between two points.

35
00:02:04,200 --> 00:02:07,140
So let's say that there are two points,
x and y.

36
00:02:07,140 --> 00:02:10,030
And the angle between x and y is theta xy.

37
00:02:11,120 --> 00:02:14,160
Okay, now let's just start
drawing lines randomly.

38
00:02:14,160 --> 00:02:16,150
So here's one line.

39
00:02:16,150 --> 00:02:20,120
And this line happens to put both x and
y into bin 0,

40
00:02:20,120 --> 00:02:25,480
because the score of both of
the points is less than 0.

41
00:02:25,480 --> 00:02:27,550
Here's another line and another,
and another, and another.

42
00:02:27,550 --> 00:02:32,085
And here's a whole bunch of lines that all
place x and y into the same bin, bin 0.

43
00:02:34,040 --> 00:02:37,810
Well here's a line that places both x and

44
00:02:37,810 --> 00:02:42,000
y into bin 1 since both of their
scores are greater than 0.

45
00:02:42,000 --> 00:02:45,760
And here's another one, another one,
another one, another one, another one.

46
00:02:45,760 --> 00:02:50,275
Whole bunch of lines that all place x and
y into the same bin, bin 1.

47
00:02:52,340 --> 00:02:55,780
But we can make mistakes, of course.

48
00:02:55,780 --> 00:03:02,007
Here's one line that separates x and y, so
that they will be put into different bins.

49
00:03:02,007 --> 00:03:06,805
Because the y point has a negative
score so that's going to be in bin 0,

50
00:03:06,805 --> 00:03:10,877
the x point has a positive score,
it's going to be in bin 1.

51
00:03:10,877 --> 00:03:13,320
And here are some other lines
that would do the same thing.

52
00:03:14,570 --> 00:03:19,666
Okay, so what we're saying is that
this whole range of points here

53
00:03:19,666 --> 00:03:24,947
that I'm showing in blue place both x and
y into the same bin, bin 1.

54
00:03:24,947 --> 00:03:30,211
And all of these points, or all of
these lines that would fall into this

55
00:03:30,211 --> 00:03:35,405
orange region that I'm showing here,
place both x and y into bin 0.

56
00:03:35,405 --> 00:03:39,465
But then there's this green
range where any line that falls

57
00:03:39,465 --> 00:03:43,050
in that space is going to put x and
y into separate bins.

58
00:03:44,200 --> 00:03:47,550
But a question is,
what's the chance of that happening?

59
00:03:47,550 --> 00:03:50,770
Well, if theta x, y is really small,

60
00:03:50,770 --> 00:03:55,890
then there's just a small chance that
a random line placed, let's say,

61
00:03:55,890 --> 00:04:00,560
uniformly at random from 0 to pi over 2,

62
00:04:00,560 --> 00:04:04,610
there's just a small chance that it's
going to fall within that theta range.

63
00:04:05,840 --> 00:04:10,700
So what that means is that there's
just a small probability that x and

64
00:04:10,700 --> 00:04:13,880
y are going to be put in different bins,
so if x were our query,

65
00:04:13,880 --> 00:04:18,400
then we wouldn't find it's nearest
neighbor y which was in some other bin.

66
00:04:19,710 --> 00:04:20,810
So in summary,

67
00:04:20,810 --> 00:04:25,880
this idea of just throwing down a random
line, it's actually not that crazy.

68
00:04:26,950 --> 00:04:33,120
We're saying that it provides good
performance in terms of a probabilistic

69
00:04:33,120 --> 00:04:37,949
statement of providing exact nearest
neighbors for lots and lots of queries.

70
00:04:39,020 --> 00:04:41,956
Okay, so far what we've done is we've
talked about the first two points.

71
00:04:41,956 --> 00:04:45,494
We showed how we can just very
simply throw it out of line,

72
00:04:45,494 --> 00:04:49,866
nothing challenging to how we have
to think about defining that line.

73
00:04:49,866 --> 00:04:53,402
And we also showed that we can get very
good performance in terms of nearest

74
00:04:53,402 --> 00:04:56,780
neighbor search with this
very simple approach.

75
00:04:56,780 --> 00:05:01,340
However, there's this really, really
problematic last third point which is what

76
00:05:01,340 --> 00:05:04,440
are the efficiencies that
we get from this approach?

77
00:05:04,440 --> 00:05:09,760
And if we think about just throwing
down one line, then what happens is that

78
00:05:09,760 --> 00:05:13,140
we often end up with bins that still
have lots and lots of data points.

79
00:05:13,140 --> 00:05:16,710
If we assume we have lots and lots and
lots of data points to begin with, if we

80
00:05:16,710 --> 00:05:20,550
just partition them into two sets, we're
probably going to end up with two bins,

81
00:05:20,550 --> 00:05:23,050
each of which have lots and
lots of datapoints.

82
00:05:23,050 --> 00:05:27,410
And a query is going to have to
search over lots and lots of points.

83
00:05:28,410 --> 00:05:29,550
Okay, so

84
00:05:29,550 --> 00:05:33,930
it seems like we've gained something in
terms of efficiency but not that much.

85
00:05:33,930 --> 00:05:36,588
So let's think about how
we can address this.

86
00:05:36,588 --> 00:05:40,849
[MUSIC]