[MUSIC] In this module, Emily covered various
techniques for retrieving documents, exploring representations for
data like word counts and TF-IDF. Now we're gonna get a really cool notebook
where we put these ideas together and build a document retrieval system using TF-IDF. So, let's go ahead and do just that. As usual,
we're gonna be using a Python notebook, and in this one I'm gonna change
the title to Document retrieval, and here we go. And again, as usual,
I'm gonna hide the header and hide the toolbar to give
us a little bit more space. Okay, let's go ahead and
fire up GraphLab Create. So we're gonna do import graphlab, since we're gonna be using
it again in our notebook. Now, as the first step, we're
going to load some data. So let's load some text data. This is interesting text
data from Wikipedia: its pages on people. So, cool data set, and
I'm just going to load it. We're going to say people,
which is gonna be an SFrame, is gonna be graphlab.SFrame
from a file, and that file is called people wiki,
right here. And here we go, we're loading it. And the first thing that we're gonna do is
just look at the top few lines of that file. So I'm just gonna go ahead and do that, and
you should see what we have: this URI is basically the location
of that page on Wikipedia, this is the name of the person involved, and this is the text of that
page about that person. And you have this for
a pretty nice number of people here. So if you type len of people, the number of people in our data set, and press enter, you see that we are talking
about 59,000 people in this data set. It's a pretty good set, and you will see that with TF-IDF we are going to
do some really interesting document retrieval, even in this relatively large data set. So the first thing we are going
to do is just explore our data. So let's explore the data set and check out the text it contains. So let's start by looking at
a particular person in the data set. And we're gonna look at the page for Barack Obama, who is
the current US President. So out of this people SFrame I'm
going to select the one whose name, so that's the name column, is equal to Barack Obama. And so I pressed enter, and
I created this new variable called Obama. And if you take a quick look at it,
you see that it has the URL for the Obama page, the name Barack Obama,
and the text for that page. So let's dig in, and
see what that text looks like. So for Barack Obama, you'll see that Barack Hussein Obama was born
on August 4th, 1961, and is the 44th and current
President of the United States. So pretty natural text, what we'd really
expect for this kind of data. So we can also look at some
other person in the data set. So for example, there's an actor called George Clooney
who's been in a lot of movies. So if you look at the people
SFrame, we're gonna select, so this is again a filter operation, which
we've been doing in almost every notebook now, the one whose name is George Clooney. And then I'm just gonna go ahead and show you the text that we get for
George Clooney. And you see that George Timothy Clooney
was born in 1961. So, basically, he's the same age as
Barack Obama, but he's not president. He's an American actor, writer, producer, director and activist, right here. [MUSIC]
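
As a rough sketch of the filter operation described above, here is a plain-Python stand-in. It does not use GraphLab Create. The list of rows, the placeholder URIs, and the helper filter_by_name are assumptions made for illustration, and the two text values only paraphrase what the notebook shows on screen. In the notebook itself the same selection is done on the people SFrame by comparing the name column to the person's name.

```python
# Stand-in for the people table, with the three columns seen in the notebook:
# URI, name, and text. These rows are illustrative placeholders, not the
# real Wikipedia data set (which holds about 59,000 people).
people = [
    {
        "URI": "placeholder-uri-1",  # hypothetical value
        "name": "Barack Obama",
        "text": "barack hussein obama born august 4 1961 is the 44th "
                "and current president of the united states",
    },
    {
        "URI": "placeholder-uri-2",  # hypothetical value
        "name": "George Clooney",
        "text": "george timothy clooney born 1961 is an american actor "
                "writer producer director and activist",
    },
]


def filter_by_name(rows, name):
    """Return every row whose 'name' column equals `name` (hypothetical helper)."""
    return [row for row in rows if row["name"] == name]


# How many people are in the (stand-in) data set.
print(len(people))

# Select the row for one person, then look at the text of that page.
obama = filter_by_name(people, "Barack Obama")
print(obama[0]["text"])

# The same filter, applied to a different name.
clooney = filter_by_name(people, "George Clooney")
print(clooney[0]["text"])
```

The filter returns a list rather than a single row, which mirrors how selecting from a table by a column value gives back a smaller table, even when only one row matches.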