So that was one idea of using an inner product to compute a distance, but here's another really natural inner-product measure that we could use. And this is simply to look at the inner product between one article, which might be our query article xq, and another article xi: we multiply the two vectors element-wise, value by value, add the products up, and call that the similarity. So it's the inner product between xi and xq, which is simply the sum over all dimensions of the products of the corresponding values in the two vectors. We can think of this as measuring how much these articles overlap in terms of the vocabulary used, and how heavily weighted that overlap is.

Okay, so in this example it would give us a similarity of 13 between these two different articles about soccer. But if we looked at an article about soccer relative to an article about some world news event, then maybe there would be very little overlap. Actually, in this case there is no overlap at all in the vocabularies of these articles, so the similarity would be 0.

Now, the similarity that we talked about on the previous slide, where we just summed up the products of the different features, is very closely related to a popular similarity metric called cosine similarity. It looks exactly the same as what we had before, but in cosine similarity you divide by two additional terms, and I think it's a little more straightforward to write it as follows: each of those terms is just normalizing one of the vectors. So we sum the squares of the elements just within vector xi, and just within vector xq, and take the square root; by definition, that's the norm, which is what the double bars denote, the magnitude of the vector. And we can rewrite this further as xi divided by its norm, transposed, times xq divided by its norm. So relative to the example we had on the past two slides, instead of just computing the similarity as the inner product between the raw vectors, we're first going to normalize the vectors.
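To make this concrete, here's a minimal sketch in Python of both measures. The word counts are hypothetical (the actual vectors from the slide aren't reproduced in the transcript); they're chosen so the raw inner product comes out to 13, matching the soccer example.

```python
import math

def inner_product(x, y):
    """Unnormalized similarity: sum of element-wise products."""
    return sum(xj * yj for xj, yj in zip(x, y))

def cosine_similarity(x, y):
    """Inner product of the two vectors after dividing each by its norm."""
    norm_x = math.sqrt(sum(xj ** 2 for xj in x))
    norm_y = math.sqrt(sum(yj ** 2 for yj in y))
    return inner_product(x, y) / (norm_x * norm_y)

# Hypothetical word-count vectors over a shared vocabulary,
# chosen so the soccer-soccer inner product is 13.
soccer_1 = [1, 0, 3, 2]
soccer_2 = [2, 0, 1, 4]
world_news = [0, 5, 0, 0]   # no vocabulary overlap with soccer_1

print(inner_product(soccer_1, soccer_2))      # 13
print(inner_product(soccer_1, world_news))    # 0
print(cosine_similarity(soccer_1, soccer_2))  # ~0.76
```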
This normalization is a really, really critical difference, actually, which we'll discuss more.

Okay, and you can show that what we're doing here is equivalent to just looking at the angle between the two vectors, regardless of their magnitudes. The reason this normalized inner product is equivalent to the cosine of the angle between the vectors comes straightforwardly from the definition of an inner product: we know that A transpose B equals the magnitude of vector A, times the magnitude of vector B, times the cosine of the angle between A and B.

So if we have two different points here, two different articles, let's say there's just a vocabulary with two words, word one and word two. Then this is maybe the word-count vector for one article, and this is the word-count vector for the other article. What cosine similarity is doing is just looking at the cosine of the angle between the two vectors, regardless of their magnitudes.

Now, I want to highlight a couple of things about cosine similarity. One is the fact that it's not actually a proper distance metric, the way Euclidean distance is, because the triangle inequality doesn't hold. But it's also important to know that it's extremely efficient to compute for sparse vectors, because you only need to consider the nonzero elements when performing the calculation (there's a sketch of this below).

Okay, but now I want to run through an example of what I mean by this normalization, just to make sure it's very clear to everyone. So here's our standard word-count vector, or maybe TF-IDF vector, and when we normalize it, we're simply dividing by the square root of the sum of the squared counts in the vector. And if we do out this calculation, our normalized representation of this document would be as follows (see the second sketch below).

Okay, so let's talk a little bit more about this cosine similarity metric, and think about the values that it can take. Let's say there are two really, really similar articles, so they have a very small angle theta between them. Then the cosine of theta is going to be approximately equal to, you guys remember, 1, as you would hope: high similarity if they're very close together.
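Here's a minimal sketch of the sparsity point above, assuming articles are stored as dictionaries mapping words to their nonzero counts; the dot product then only touches words that appear in both articles.

```python
import math

def sparse_cosine(x, y):
    """Cosine similarity for sparse vectors stored as {word: count} dicts.
    Only nonzero entries are ever touched."""
    # Iterate over the smaller dict; words absent from either contribute 0.
    if len(x) > len(y):
        x, y = y, x
    dot = sum(cnt * y[word] for word, cnt in x.items() if word in y)
    norm_x = math.sqrt(sum(c ** 2 for c in x.values()))
    norm_y = math.sqrt(sum(c ** 2 for c in y.values()))
    return dot / (norm_x * norm_y)

a = {"goal": 3, "team": 2}
b = {"goal": 1, "match": 4}
print(sparse_cosine(a, b))  # only "goal" contributes to the dot product
```

Iterating over the smaller dictionary keeps the cost proportional to the number of nonzero entries rather than the vocabulary size, which is what makes this cheap for documents.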
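And here's the normalization step from a moment ago as code, with a hypothetical count vector standing in for the one on the slide.

```python
import math

def normalize(x):
    """Divide each count by the vector's norm
    (the square root of the sum of squared counts)."""
    norm = math.sqrt(sum(xj ** 2 for xj in x))
    return [xj / norm for xj in x]

counts = [3, 0, 4]          # hypothetical counts; norm = sqrt(9 + 16) = 5
print(normalize(counts))    # [0.6, 0.0, 0.8] -- a unit-length vector
```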
So let's just remember how we think about cosine by drawing the unit circle, a circle with radius 1. If we're looking at some angle theta, then this vertical length here is sine of theta, and this horizontal length here is cosine of theta. So as we're walking around the circle: when the angle is 0, we see that cosine of theta is 1. When we get up to pi over 2, or 90 degrees, so vertical, the distance along the x axis has dropped to 0, so cosine of theta is 0. And when we shift over to pi, or 180 degrees, we see that cosine of theta is -1.

Okay, so now that we've reviewed cosine a little bit, let's go back to these drawings here. This one was supposed to be roughly 90 degrees (it's a little bit more), and when we look at cosine of theta here, we know that it's approximately 0. And in this last case, cosine of theta is going to be approximately -1.

So in general, cosine similarity can range from -1 to 1. But if we restrict ourselves to having just nonnegative features, like we would if we were looking at a TF-IDF vector for a document, we could never have that last example. We're always going to be living in the positive quadrant, so our angles are going to range from 0 to 90 degrees, and our cosine similarity is going to range from 0 to 1. Okay, so this is going to be our focus.

And in these cases, the way we're going to define a distance is simply one minus the similarity. And remember, it's not a proper distance according to the formal definition of a distance metric, but we can use it as a measure of distance between articles.
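Finally, a self-contained sketch of the one-minus-similarity distance just described, reusing the same hypothetical soccer and world-news count vectors from the first example.

```python
import math

def cosine_similarity(x, y):
    """Inner product of unit-normalized vectors (as in the earlier sketch)."""
    dot = sum(xj * yj for xj, yj in zip(x, y))
    return dot / (math.sqrt(sum(v ** 2 for v in x)) *
                  math.sqrt(sum(v ** 2 for v in y)))

def cosine_distance(x, y):
    """1 - cosine similarity. For nonnegative vectors this lies in [0, 1],
    but it is not a proper distance metric: the triangle inequality can fail."""
    return 1.0 - cosine_similarity(x, y)

print(cosine_distance([1, 0, 3, 2], [1, 0, 3, 2]))  # ~0.0: identical articles
print(cosine_distance([1, 0, 3, 2], [0, 5, 0, 0]))  # 1.0: no vocabulary overlap
```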