[MUSIC] So that was one idea of computing an inner product to compute a distance, but here's another really natural inner product measure that we could have. And this is simply to look at the similarity between one article, which might be our query article xq, and another article xi, as the inner product between these two vectors: we multiply the values element-wise, add them up, and call that the similarity. So it's the inner product between xi and xq, which is simply the sum over all of our dimensions of the product of the values in these two vectors. We can think of this as measuring how much these articles overlap in terms of the vocabularies used, and what the weight of that overlap is. Okay, so in this example it would give us a similarity of 13 between these two different articles about soccer. But if we looked at an article about soccer relative to an article about some world news event, then maybe there would be very little overlap. Actually, in this case, there's no overlap in the vocabularies of these articles, so the similarity would be 0.

And the similarity that we talked about on the previous slide, where we just summed up the products of the different features, is very related to a popular similarity metric called cosine similarity. It looks exactly the same as what we had before, but in cosine similarity you're going to divide by the following two terms, and I think it's a little bit more straightforward to write it as follows, where what we see is that each one of these terms is just normalizing the vector. So we're summing over the square of each element, just within vector xi and just within vector xq, and that, by definition, is the norm; that's what these bars denote, the norm or magnitude of the vector. And we can rewrite this further as xi divided by the norm of xi, transposed, times xq divided by the norm of xq. So relative to the example we had on the past two slides, instead of just computing the similarity as the inner product between these two vectors, we're first going to normalize the vectors. And this is a really, really critical difference, which we'll discuss more.

Okay, and so you can show that what we're doing here is equivalent to just looking at the angle between the two vectors, regardless of their magnitude. And the reason this normalized inner product of vectors is equivalent to the cosine of the angle between the vectors comes straightforwardly from the definition of an inner product: we know that A transpose B is equal to the magnitude of vector A, times the magnitude of vector B, times the cosine of the angle between the vectors A and B. Okay, so if we have two different points here, two different articles, let's say there's just a vocabulary with two words, word one and word two. So this is maybe the word count vector for one article and this is the word count vector for the other article, and what cosine similarity is doing is just looking at the cosine of the angle between these two vectors, regardless of their magnitudes.

So I want to highlight a couple of things about cosine similarity. One is the fact that it's not actually a proper distance metric, like Euclidean distance is, because the triangle inequality doesn't hold. But it's also important to know that it's extremely efficient to compute for sparse vectors, because you only need to consider the nonzero elements when performing this calculation.
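To make these two measures concrete, here is a minimal Python sketch (illustrative code, not from the course materials) that computes both the raw inner product similarity and the normalized cosine similarity for sparse word-count or TF-IDF vectors stored as dictionaries. The function names and the example articles and counts are made up for illustration, not the ones on the slide.

```python
import math

def dot_product(x_i, x_q):
    """Unnormalized inner product similarity between two sparse
    word-count (or TF-IDF) vectors stored as {word: weight} dicts.
    Only words that appear in both articles contribute."""
    # Iterate over the smaller dict, which is what makes this cheap for sparse vectors.
    if len(x_i) > len(x_q):
        x_i, x_q = x_q, x_i
    return sum(weight * x_q.get(word, 0.0) for word, weight in x_i.items())

def cosine_similarity(x_i, x_q):
    """Normalized inner product: divide by the norms of both vectors."""
    norm_i = math.sqrt(sum(w * w for w in x_i.values()))
    norm_q = math.sqrt(sum(w * w for w in x_q.values()))
    if norm_i == 0.0 or norm_q == 0.0:
        return 0.0  # an empty article has no defined direction
    return dot_product(x_i, x_q) / (norm_i * norm_q)

# Two soccer articles share vocabulary, so both measures are positive.
soccer_1 = {"soccer": 3, "goal": 2, "team": 1}
soccer_2 = {"soccer": 2, "goal": 1, "player": 4}
print(dot_product(soccer_1, soccer_2))        # 3*2 + 2*1 = 8
print(cosine_similarity(soccer_1, soccer_2))  # between 0 and 1

# An article with no overlapping vocabulary gives similarity 0.
world_news = {"election": 5, "minister": 2}
print(dot_product(soccer_1, world_news))        # 0
print(cosine_similarity(soccer_1, world_news))  # 0.0
```

The key difference between the two functions is exactly the one described above: cosine similarity first rescales each vector by its own norm, so only the direction of the vectors matters, not how long the articles are.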
Okay, but now I want to run through an example of what I mean by this normalization, just to make sure it's very clear to everyone. So here's our standard word count vector, or maybe TF-IDF vector, and when we think about normalizing it, we're simply dividing by the square root of the sum of the squares of the counts in this vector. And if we do out this calculation, then our normalized representation of this document would be as follows.

Okay, so let's talk a little bit more about this cosine similarity metric, and let's think about the values that it can take. So let's say there are two really, really similar articles, so they have a very small angle theta. Then the cosine of theta is going to be approximately equal to, you guys remember, 1, as you would hope: high similarity if they're very close together. So let's just remember how we think about cosine. We can think about drawing this unit circle. That does not look at all like a circle. Let's try again. Okay, I'm not sure that is much better, but imagine our circle with radius 1. Well, if we're looking at some angle theta, then this length here is sine of theta and this length here is cosine of theta. Okay, so as we're walking around the circle, when the angle is zero, we see that cosine of theta is one. When we get up to 90 degrees, so vertical, we see that the cosine drops down to zero. Maybe I can switch colors and make this clear instead of waving my hands around. So let's see. When we're at angle 0, you see cosine is 1. When we're at angle pi over 2, or 90 degrees, we see that this distance along the x axis has dropped to 0. And when we shift over here to an angle of pi, or 180 degrees, we see that cosine of theta is -1.

Okay, so now that we've reviewed cosine a little bit, let's go back to these drawings here. This one was supposed to be roughly 90 degrees (it's a little bit more), and when we look at cosine of theta here, we know that it's approximately 0. And in this last case, cosine of theta is going to be approximately -1. Okay, so in general cosine similarity can range from -1 to 1. But if we restrict ourselves to having just positive features, like we would if we were looking at a TF-IDF vector for a document, then we could never have this last example. We're always going to be living in the positive quadrant, so our angles are going to range from 0 to 90 degrees, and our cosine similarity is going to range from 0 to 1. Okay, so this is going to be our focus. And in these cases, the way we're going to define a distance is simply one minus the similarity. And remember, it's not a proper distance according to formal definitions of a distance metric, but we can use it as a measure of distance between articles. [MUSIC]
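Here is a small, self-contained sketch of that normalization step and of defining distance as one minus cosine similarity, again as illustrative Python rather than anything from the course materials; the example word counts are hypothetical. Because word counts and TF-IDF weights are nonnegative, the resulting distance stays in the range 0 to 1.

```python
import math

def normalize(x):
    """Rescale a sparse word-count / TF-IDF vector ({word: weight})
    to unit length by dividing each entry by the vector's norm."""
    norm = math.sqrt(sum(w * w for w in x.values()))
    return {word: w / norm for word, w in x.items()} if norm > 0 else dict(x)

def cosine_distance(x_i, x_q):
    """Cosine 'distance' = 1 - cosine similarity. With nonnegative
    features this lies in [0, 1]; it is not a true metric because
    the triangle inequality can fail."""
    xi_n, xq_n = normalize(x_i), normalize(x_q)
    similarity = sum(w * xq_n.get(word, 0.0) for word, w in xi_n.items())
    return 1.0 - similarity

article = {"soccer": 3, "goal": 2, "team": 1}
print(normalize(article))                         # each count divided by sqrt(3^2 + 2^2 + 1^2)
print(cosine_distance(article, article))          # same direction -> essentially 0 (up to floating point)
print(cosine_distance(article, {"election": 5}))  # no vocabulary overlap -> 1.0
```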