[MUSIC] And so, let's actually compute TF-IDF. Now, I can't just compute TF-IDF for the Obama article in isolation, and this is an important note, by the way: TF-IDF depends on the entire corpus. You need that normalizer, which depends on how many documents each word appears in. So I have to compute it for the entire corpus, so let's go ahead and do just that.

So here we go, I'm gonna compute TF-IDF for the corpus, and I'm gonna do this in two steps. First, I'm gonna compute the word counts for the entire corpus. So I'm gonna add a new column to the people table called word_count. Remember, we just did this only for Barack Obama, so now we're gonna do it for everyone. So I'm gonna call graphlab.text_analytics.count_words, and I'm gonna put in the input, which is gonna be the text column. In other words, I'm gonna count the words in the text column. And just so that we're clear, let's print the SFrame people after we do this, so I'm gonna print the first few lines of that SFrame. So here we are, we've executed. Now we have the URI column, the location, the webpage, the name of the person, the text, and now we have the new column on the right, the dictionary of word counts.

Good, so next we're gonna compute the TF-IDFs. Just like with word counts, you could implement your own TF-IDF system, but it would take you a little while to do. GraphLab Create has one already implemented, and we're just gonna use that to make this whole process pretty quick. So we're gonna call graphlab.text_analytics, and just like with word counts, there's a function here, .tf_idf. And all you need to do is give it an input, so we're gonna give it the word_count column, and it will output the TF-IDF. And let me just again show you what that looks like. Oops, I made a little typo here; that should be word_count, not word_counts. And now it's going through the whole corpus, with 50,000 documents, computing the frequency of words and normalizing. And what we end up with is a table, with a column called docs, where for every document we have a dictionary of TF-IDFs.

And just so that we get all this right, I'm gonna add a new column to the people table. This new column is gonna be called tfidf, and I'm just gonna store in there the TF-IDFs that I just computed, the docs column, so it's all in one table. Here we go, we just added it in. So now we have TF-IDFs for every document computed and stored in there.
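As a rough sketch, here's what those two steps look like in code, assuming GraphLab Create and the people SFrame from earlier in the lesson (the data file name is just an example):

    import graphlab

    # Load the corpus of Wikipedia people pages (example path).
    people = graphlab.SFrame('people_wiki.gl/')

    # Step 1: word counts for every document, stored as a dictionary column.
    people['word_count'] = graphlab.text_analytics.count_words(people['text'])
    people.head()

    # Step 2: TF-IDF over the whole corpus, computed from those word counts.
    # In the version used here, tf_idf returns an SFrame whose 'docs' column
    # holds the per-document TF-IDF dictionaries; newer versions return an
    # SArray directly, in which case the ['docs'] indexing isn't needed.
    tfidf = graphlab.text_analytics.tf_idf(people['word_count'])
    people['tfidf'] = tfidf['docs']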
Let's do some examination. So here's what we're gonna do: we're gonna examine the TF-IDF for the Obama article. Just like we examined and sorted the word counts, we're now gonna examine and sort the TF-IDFs. I'm gonna re-create the variable for Obama, because we've now added these two new columns in the latest version of that table. So I'm gonna take people, and I'm gonna select out of people the one whose name is equal to Barack Obama. Okay, I've done it, I've created this obama variable. And now, just like we did with word counts, I could create an obama_tfidf_table so that we can sort it too. It's a dictionary, and we're gonna sort it exactly the same way as we did before: we will stack it and then sort it. Actually, instead of creating a table, I'm just gonna do it in one line here. So I'm gonna write out what we did earlier: I'm gonna take the obama variable and select only the tfidf column, so it looks a little prettier. Then I'm gonna call the stack method, which takes a dictionary and just stacks it into two columns. So I'm gonna stack tfidf, and I'm gonna give new column names for the output, and those names are gonna be word and tfidf. Oops, I forgot to close the bracket here. And let me show you a neat little trick that you can use with Python in various ways: I'm just gonna chain the sort command at the end of this. So I'm just gonna type .sort, and I'm gonna sort this output on the tfidf column, and I'm gonna say ascending=False. So what I did in multiple lines before, now I'm doing in just one line: I'm taking the obama tfidf column, I'm gonna stack it into a word column and a tfidf column, and then I'm gonna sort it in descending order, so from highest to lowest (the full one-liner is sketched at the end of this section).

And if you remember, just before we run this: when we did it for just word counts, it looked like this. 'the' was the most popular word, then 'in', then 'and', then 'of', then 'to', 'his', then 'Obama', 'act', 'a', and 'he'. So those words are mostly uninformative, except for the word Obama. So let's execute it for TF-IDF. And what we see here, voila, the most informative word is Obama, which makes a lot of sense because the article is about him. But then you have act, Iraq, control, law, ordered, military, involvement, response, democratic, as in the Democratic Party. So you see, there's lots of action going on here around words that are important with respect to Obama. [MUSIC]
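For reference, a sketch of that chained one-liner, again assuming GraphLab Create's SFrame API and the people table built above:

    # Select the Obama article from the updated table.
    obama = people[people['name'] == 'Barack Obama']

    # Stack the tfidf dictionary into (word, tfidf) rows, then chain a sort
    # so the highest-scoring words come first.
    obama[['tfidf']].stack('tfidf', new_column_name=['word', 'tfidf']).sort('tfidf', ascending=False)

The same chaining works on the word_count column, which is how the earlier word-count ranking was produced, just spread over several lines.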