[MUSIC] Cool, so
this is the data we're working with. It's a pretty neat data set, and
let's explore it a little bit more. So, in the lectures,
Emily talked about building word counts, and some of the challenges
with word counts. So let's start by taking a quick
look at our word counts. So, for example, let's get the word counts for the Obama article. So we have an article on
Wikipedia about Barack Obama. And what we're gonna do first is just
take a quick look at the word counts for that article. So I'm going to take this Obama
variable that we created, and I'm going to add a new column
to it called word_count. So this is going to store
the word count for Barack Obama. And we can do that with just one call. You could write, as we showed in
the classification sentiment analysis notebook, you could write the function
yourself to compute the word counts. But we just have one ready for
us in the text analytics toolbox. So we're just gonna use it
to get started quickly. So we just call count_words. There's also count_ngrams, but we're just gonna
count words, which are the unigrams, single words. And as input, we're gonna give it the Obama text. So I've now done it. And let's take a quick look. Let's just print the Obama word count. And here we go. We've now printed the Obama word count. And you see that operations appears once,
represent appears once, office appears twice, unemployed appears
once, and so on for various words. And it's not super intuitive here. So we're gonna play with this a little
bit, and in the process, I'm gonna show you a cute little data engineering trick
that might be useful in other areas. So what we're gonna do next is sort the word counts for the Obama article. Now, here's something to understand a
little bit better: if you notice, the word count is really a dictionary,
the usual kind of Python dictionary. You are given a key,
in this case, it's the word. For example, Honolulu where he was born,
weather, marriage and so on. And then, it has a value of 1,
2, 3, 5, 30, which is the count, how often that word appeared. And so
what we're going to do is sort this. And to sort those words, we have to
turn it into a table where one column is the word, the key of the dictionary,
and the second column is the count. And then we're going to sort that table. So, the way to do this,
there are various ways to do it in Python, but let me show you a quick little
trick that you will find useful. So I'm gonna create a new table called
obama_word_count_table. And what I'm gonna do is take this
obama table, and out of all the columns in it,
I'm gonna select just word_count. Because this will make
the printing a little neater. But you could have done that
with the whole table too. And then,
I'm gonna call a function called stack. And this function stack
is extremely useful. It takes one column of an SFrame
that contains a dictionary, and stacks its entries one on top of each
other, as rows across multiple columns. In this case, two columns. One for word, and the other one for count. So we're gonna stack a particular
column called word_count, the one that we really care about. And it generates some new columns, so we have to give them names. So new_column_name,
in this case there are two of them. One I'm gonna call word. And another one,
I'm just going to call count. And if I execute this and
we take a look at this table, obama_word_count_table.head(),
just the first few lines, you'll see it shows words like normalize,
sought, combat, but it's not sorted. It's not a sorted table. So what we do next is just take this
table and just sort it by the count. And we've seen this before,
table and just sort it by the count. And we've seen this before,
but it's pretty simple. So I'm gonna take the Obama count table,
and I'm gonna call the sort function. And I'm gonna sort it by the count column. Sort by count. And I'm gonna say ascending=false. So, instead of sorting in
an ascending order, one, two, three, like we'd normally sort,
we're gonna sort in a descending order. Three, two, one. So, if I press Enter,
you'll see the most common word is "the", followed by "in", followed by "and",
"of", "to", "his", eventually "Obama". And so, "act", "a", "he". So those are not that informative. And in the lectures,
Emily covered the issue that these uninformative words can
drown out the important words. And that's why we introduced
the notion of TF-IDF. [MUSIC]
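
For readers following along without the notebook, the three steps described above (count the words, stack the dictionary into a two-column table, sort by count in descending order) can be sketched in plain Python. This is only an illustration under stated assumptions: it uses the standard library rather than the SFrame functions named in the video, the sample text is a made-up stand-in for the Wikipedia article, and the variable names are hypothetical.

```python
from collections import Counter

# Hypothetical stand-in text; the notebook uses the full Wikipedia article.
text = "the president of the united states and the senate"

# Step 1: count single words (unigrams) by lowercasing and splitting on whitespace.
word_count = dict(Counter(text.lower().split()))

# Step 2: "stack" the dictionary into rows, one row per key,
# with a word column and a count column.
word_count_table = [{"word": w, "count": c} for w, c in word_count.items()]

# Step 3: sort by the count column in descending order
# (the equivalent of ascending=False in the video).
sorted_table = sorted(word_count_table, key=lambda row: row["count"], reverse=True)

# Show the first few rows, most frequent word first.
for row in sorted_table[:3]:
    print(row["word"], row["count"])
```

Even in this tiny example the top row is a word like "the", which is the same problem of uninformative words dominating the counts that the video points out.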