[MUSIC] So let's take these TF-IDFs and do a couple of fun things with them. First, we're gonna manually compute distances between a few people. The goal here is just to show what those distances look like, just to get a sense of what we're learning from TF-IDF. So for example, let's take three people. We already talked about Obama, and we already have that obama variable in there. Let's create two new variables. First, clinton, which selects, out of the people, the person whose name is equal to Bill Clinton, the former US President. Let's select another person, for example Beckham. Beckham is a famous English footballer, so we're gonna select, out of the people, the person whose name is equal to David Beckham. Now that we've selected Clinton and Beckham, let's compute the similarity between those two people and the President, Barack Obama. So what we're gonna do is ask the question: is Obama closer to Clinton than to Beckham? Now, there are various ways to measure similarity or distances between two vectors, or in this case two documents, once we've computed the TF-IDFs. What we're gonna do is compute the distance between these different documents, the one about Clinton, the one about Obama, and so on. I'm gonna use distance metrics that are already implemented inside GraphLab Create, so we don't have to implement them ourselves. If we look at graphlab.distances and press Tab, you'll see several options here: the Euclidean distance that we talked about in class, cosine distance, Jaccard similarity, and so on, which we're gonna see throughout this specialization. We're gonna use cosine distance. And just as a little note: normally we think about cosine similarity, if you've heard of it, where the higher the number, the more similar two articles are.
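To see what cosine distance actually computes, here is a minimal plain-Python sketch (this is not the GraphLab Create implementation, and the tiny word-to-weight dictionaries below are made-up TF-IDF vectors purely for illustration):

```python
import math

def cosine_distance(a, b):
    """Cosine distance between two sparse TF-IDF vectors
    represented as {word: weight} dictionaries.
    0.0 means same direction; 1.0 means no shared words
    (for non-negative weights)."""
    # Dot product over the words the two vectors have in common.
    dot = sum(w * b[word] for word, w in a.items() if word in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return 1.0 - dot / (norm_a * norm_b)

# Hypothetical tiny TF-IDF vectors, for illustration only.
obama   = {'president': 3.2, 'law': 1.1, 'election': 2.0}
clinton = {'president': 2.9, 'law': 0.8, 'senate': 1.5}
beckham = {'football': 4.0, 'goal': 2.5}

print(cosine_distance(obama, clinton))  # small: shared political vocabulary
print(cosine_distance(obama, beckham))  # 1.0: no words in common
```

The overlap in words like "president" and "law" pulls the Obama-Clinton distance down, while Obama and Beckham share no words at all, which is the maximum distance of 1.0.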
Here, we have a distance version of this number, so the lower the better: the lower the cosine distance, the closer the articles are. So the question is, what is the cosine distance between Obama's tfidf and that of Clinton? Notice that I've selected the column tfidf, and I have the little [0] at the end, because we want the zeroth row of the table. The table only has one element in it, but we still have to say which row we're looking at. So we compare the Obama tfidf at [0] with the Clinton tfidf, also at [0], in terms of cosine distance, and you see that the distance is 0.83. Now the question is, what is the distance, in the same metric, the cosine distance, between Obama and Beckham? So I type the Obama tfidf at [0] with Beckham's tfidf, also at [0], and you'll see that this distance is 0.97. In fact, the biggest distance you can get is 1.0. So in this case, Obama is much closer to Clinton than he is to Beckham, which makes a lot of sense. But we've done this manually for just a few people. How do we automate this process of finding out how close an article is to other articles, and in this case, how close a person is to other people? In the lectures, Emily talked about nearest neighbor models and how they can be used for document retrieval. So now we're gonna actually do some cool document retrieval using a simple nearest neighbor model. [MUSIC]
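To make the nearest neighbor idea concrete before the next step, here is a brute-force sketch in plain Python (not the GraphLab Create nearest-neighbor model; the corpus of TF-IDF dictionaries is invented just to show the mechanics):

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity for sparse {word: tfidf} dictionaries."""
    dot = sum(w * b[word] for word, w in a.items() if word in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return 1.0 - dot / (norm_a * norm_b)

def nearest_neighbors(query, corpus, k=3):
    """Rank every document in `corpus` (name -> tfidf dict) by
    cosine distance to `query`; return the k closest names."""
    ranked = sorted(corpus, key=lambda name: cosine_distance(query, corpus[name]))
    return ranked[:k]

# Made-up TF-IDF vectors, purely illustrative.
corpus = {
    'Bill Clinton':  {'president': 2.9, 'law': 0.8, 'senate': 1.5},
    'David Beckham': {'football': 4.0, 'goal': 2.5},
    'Lionel Messi':  {'football': 3.5, 'goal': 3.0, 'club': 1.2},
}
obama = {'president': 3.2, 'law': 1.1, 'election': 2.0}

print(nearest_neighbors(obama, corpus, k=2))  # Bill Clinton ranks first
```

A real retrieval model avoids comparing the query against every document, but this exhaustive version captures exactly what "nearest neighbor under cosine distance" means.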