[MUSIC] Cool, so this is the data we're working with. It's a pretty neat data set, so let's explore it a little bit more. In the lectures, Emily talked about building word counts and some of the challenges with word counts. So let's start by taking a quick look at our word counts. So, for example, let's #Get the word counts for the Obama article. We have an article on Wikipedia about Barack Obama, and what we're gonna do first is just take a quick look at the word counts for that article. So I'm going to take this Obama variable that we created, and I'm going to add a new column to it called word_count. This is going to store the word counts for the Barack Obama article. As we showed in the classification sentiment analysis notebook, you could write the function to compute the word counts yourself. But we have one ready for us in the text analytics toolbox, so we're just gonna use it to get started quickly. We just call count_words. You'll also see count_ngrams there, but we're just gonna count words, which are the unigrams, the single words. And as input, we're gonna give it the Obama text. So I've now done it. Let's take a quick look and just print the Obama word count. And here we go. You see that operations appears once, represent appears once, office appears twice, unemployed appears once, and so on for various words. It's not super intuitive here, so we're gonna play with this a little bit, and in the process I'm gonna show you a cute little data engineering trick that might be useful in other areas. So what we're gonna do next is ##Sort the word count for the Obama article. Now, here's something to understand a little bit better. If you notice, the word count is really a dictionary, a Python dictionary. You're given a key, which in this case is the word. For example, Honolulu, where he was born, weather, marriage, and so on.
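To make the idea of a word count dictionary concrete, here is a minimal sketch in plain Python. It is not the toolbox's count_words function. The helper name simple_count_words and the sample sentence are made up for illustration, and splitting on whitespace is a simplifying assumption.

```python
from collections import Counter


def simple_count_words(text):
    """Return a dictionary mapping each lowercased word to how often it appears.

    Hypothetical stand-in for a toolbox word counter. It splits on whitespace
    only, so punctuation handling is deliberately left out.
    """
    return dict(Counter(text.lower().split()))


# Made-up sample sentence, not the actual Wikipedia article text.
sample_text = "the president took office and the office changed"
word_count = simple_count_words(sample_text)

# Keys are words, values are counts.
print(word_count)
```

Each key in the result is a word and each value is its count, which is the same shape as the dictionary discussed here.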
Each key then has a value, like 1, 2, 3, 5, or 30, which is the count: how often that word appeared. What we're going to do is sort this. To sort those words, we have to turn the dictionary into a table where one column is the word, the key of the dictionary, and the second column is the count. And then we're going to sort that table. There are various ways to do this in Python, but let me show you a quick little trick that you will find useful. So I'm gonna create a new table called obama_word_count_table. I'm gonna take this Obama table and, out of all its columns, select just word_count, because this will make the printing a little neater. But you could have done that with the whole table too. Then I'm gonna call a function called stack. This function is extremely useful: it takes one column of an SFrame that contains a dictionary and stacks the entries on top of each other into multiple columns. In this case, two columns, one for the word and the other for the count. So we're gonna stack a particular column called word_count, the one that we really care about. It generates some new columns, so we have to give them names through new_column_name. In this case there are two of them: one I'm gonna call word, and the other I'm just going to call count. If I execute this and take a look at obama_word_count_table.head(), just the first few lines, you'll see it shows words like normalize, sought, and combat, but it's not a sorted table. So what we do next is just take this table and sort it by the count. We've seen this before, and it's pretty simple. I'm gonna take the obama_word_count_table, call the sort function, and sort it by the count column. And I'm gonna say ascending=False.
So, instead of sorting in ascending order, one, two, three, like we'd normally sort, we're gonna sort in descending order: three, two, one. If I press Enter, you'll see the most common word is the, followed by in, and, of, to, his, and eventually Obama, and then act, a, he. So those are not that informative. In the lectures with Emily, she covered the issue that these uninformative words can drown out the important words. And that's why we introduced the notion of TF-IDF. [MUSIC]
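The stack-then-sort trick from this walkthrough can be sketched in plain Python, without the SFrame library. The counts below are invented for illustration and are not the real numbers from the article. A list of row dictionaries stands in for the two-column table.

```python
# Hypothetical word counts, standing in for the dictionary in the word_count column.
word_count = {"the": 40, "in": 30, "obama": 9, "office": 2, "honolulu": 1}

# "Stack" the dictionary: one row per (word, count) pair, with two named columns.
word_count_table = [
    {"word": word, "count": count} for word, count in word_count.items()
]

# Sort by the count column in descending order, the plain-Python
# counterpart of passing ascending=False.
sorted_table = sorted(word_count_table, key=lambda row: row["count"], reverse=True)

# Print the top rows, most frequent word first.
for row in sorted_table[:3]:
    print(row["word"], row["count"])
```

With counts like these, the rows at the top are common words such as the and in, which is the pattern that motivates TF-IDF.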