1 00:00:00,000 --> 00:00:04,733 [MUSIC] 2 00:00:04,733 --> 00:00:08,220 Cool, so this is the data we're working with. 3 00:00:08,220 --> 00:00:14,160 It's a pretty neat data set, and let's explore it a little bit more. 4 00:00:14,160 --> 00:00:18,104 So, in the lectures, Emily talked about building word counts. 5 00:00:18,104 --> 00:00:21,020 And some of the challenges with word count. 6 00:00:21,020 --> 00:00:24,500 So let's start by taking a quick look at our word counts. 7 00:00:24,500 --> 00:00:30,840 So for example, let's #Get the word 8 00:00:30,840 --> 00:00:36,010 counts for the Obama article. 9 00:00:36,010 --> 00:00:38,810 So we have an article on Wikipedia about Barack Obama. 10 00:00:38,810 --> 00:00:42,280 And what we're gonna do first is just take a quick look at the word counts for 11 00:00:42,280 --> 00:00:43,790 that article. 12 00:00:43,790 --> 00:00:48,090 So I'm going to take this Obama variable that we created, and 13 00:00:48,090 --> 00:00:52,930 I'm going to add a new column to it called word_count. 14 00:00:52,930 --> 00:00:57,240 So this is going to store the word count for Barack Obama. 15 00:00:57,240 --> 00:00:59,840 And we can do that by just calling. 16 00:00:59,840 --> 00:01:05,770 You could write, as we showed in the classifications sentiment analysis 17 00:01:05,770 --> 00:01:08,960 notebook, you could write the function yourself to compute the word counts. 18 00:01:08,960 --> 00:01:12,940 But we just have one ready for us in the text analytics toolbox. 19 00:01:12,940 --> 00:01:16,080 So we just gonna use it to get started quickly. 20 00:01:16,080 --> 00:01:18,247 So we just call count_words. 21 00:01:18,247 --> 00:01:22,409 See this count_ngrams, we're just gonna count words, which in the unit grams, 22 00:01:22,409 --> 00:01:23,193 single words. 23 00:01:23,193 --> 00:01:28,109 And as input, we're gonna 24 00:01:28,109 --> 00:01:32,407 give it the Obama text. 25 00:01:32,407 --> 00:01:34,750 So I've now done it. 26 00:01:34,750 --> 00:01:37,048 And let's take a quick look. 27 00:01:37,048 --> 00:01:42,781 Let's just print the Obama word count. 28 00:01:44,770 --> 00:01:45,280 And here we go. 29 00:01:45,280 --> 00:01:48,230 We've now printed the Obama word count. 30 00:01:48,230 --> 00:01:53,630 And you see that operations appeared once, represent appears one, 31 00:01:53,630 --> 00:01:59,360 office appears two, unemployed appears one, and so on for various words. 32 00:01:59,360 --> 00:02:01,530 And it's not super intuitive here. 33 00:02:01,530 --> 00:02:05,570 So we're gonna play with this a little bit, and in the process, I'm gonna show 34 00:02:05,570 --> 00:02:12,010 you a cute little data engineering trick that might be useful in other areas. 35 00:02:12,010 --> 00:02:17,904 So let's, what we're gonna 36 00:02:17,904 --> 00:02:25,212 do next is ##Sort the word count for 37 00:02:25,212 --> 00:02:29,233 the Obama article. 38 00:02:30,850 --> 00:02:34,460 Now, here's something to understand a little bit better, if you notice the word 39 00:02:34,460 --> 00:02:37,620 counts is really a dictionary this is the kind of the Python dictionary. 40 00:02:37,620 --> 00:02:40,780 You are given a key, in this case, it's the word. 41 00:02:40,780 --> 00:02:46,400 For example, Honolulu where he was born, weather, marriage and so on. 42 00:02:46,400 --> 00:02:51,380 And then, he has a value of 1, 2, 3, 5, 30, which is the count, 43 00:02:51,380 --> 00:02:53,320 how often that word appeared. 44 00:02:53,320 --> 00:02:57,790 And so what we're going to do is sort this. 45 00:02:57,790 --> 00:03:02,400 And to sort those words, we have to turn it into a table where one column 46 00:03:02,400 --> 00:03:06,670 is the word, the key of the dictionary, and the second column is the count. 47 00:03:06,670 --> 00:03:08,550 And then we're going to sort that table. 48 00:03:08,550 --> 00:03:12,470 So, the way to do this, this famous way to learn Python, but 49 00:03:12,470 --> 00:03:16,600 let me show you a quick little trick that you will find useful. 50 00:03:16,600 --> 00:03:23,327 So I'm gonna createa new table called the, obama_word_count_table. 51 00:03:23,327 --> 00:03:29,258 And what I'm gonna do is take this Obama and I'm just gonna select here, 52 00:03:29,258 --> 00:03:35,883 out of all the columns in the Obama table, I'm gonna select the, word_count. 53 00:03:35,883 --> 00:03:40,030 Because this will make the printing a little neater. 54 00:03:40,030 --> 00:03:42,540 But you could have done that with the whole table too. 55 00:03:42,540 --> 00:03:45,240 And then, I'm gonna call a function called stack. 56 00:03:47,070 --> 00:03:48,880 And this function stack is extremely useful. 57 00:03:48,880 --> 00:03:53,700 It takes one column of an SFrame that contains a dictionary, and 58 00:03:53,700 --> 00:03:56,330 stacks it one of top of each other into multiple columns. 59 00:03:56,330 --> 00:03:57,290 In this case, two columns. 60 00:03:57,290 --> 00:03:59,460 One for word, and the other one for count. 61 00:03:59,460 --> 00:04:03,984 So we're gonna stack a particular column called word_count, 62 00:04:03,984 --> 00:04:07,240 the one that we really care about. 63 00:04:07,240 --> 00:04:10,710 And it generates some new columns. 64 00:04:10,710 --> 00:04:14,270 So we have to call them, give them a name. 65 00:04:14,270 --> 00:04:21,110 So new_column_name, in this case there are two of them. 66 00:04:21,110 --> 00:04:23,440 One I'm gonna call, word. 67 00:04:23,440 --> 00:04:26,900 And another one, I'm just going to call count. 68 00:04:28,320 --> 00:04:32,946 And if I execute this and we take a look at this table, 69 00:04:32,946 --> 00:04:38,870 Obama word count table.head just the first few lines. 70 00:04:38,870 --> 00:04:45,420 You'll see it shows words normalize, sought, combat, but it's not sorted. 71 00:04:45,420 --> 00:04:46,840 It's not a sorted table. 72 00:04:46,840 --> 00:04:51,820 So what we do next is just take this table and just sort it by the count. 73 00:04:51,820 --> 00:04:55,330 And we've seen this before, but it's pretty simple. 74 00:04:55,330 --> 00:05:01,320 So I'm gonna take the Obama count table, and I'm gonna call the sort function. 75 00:05:01,320 --> 00:05:04,502 And I'm gonna sort it by the count column. 76 00:05:04,502 --> 00:05:05,810 Sort by count. 77 00:05:05,810 --> 00:05:09,541 And I'm gonna say ascending=false. 78 00:05:09,541 --> 00:05:13,227 So, instead of sorting in an ascending order, one, two, three, 79 00:05:13,227 --> 00:05:16,998 like we'd normally sort, we're gonna sort in a descending order. 80 00:05:16,998 --> 00:05:18,564 Three, two, one. 81 00:05:18,564 --> 00:05:23,179 So, if I press Enter, you'll see the most common word is the, 82 00:05:23,179 --> 00:05:28,331 followed by in, followed by and, of, to, his, eventually Obama. 83 00:05:28,331 --> 00:05:31,222 And so, act, a, he. 84 00:05:31,222 --> 00:05:33,881 So those are not that informative. 85 00:05:33,881 --> 00:05:40,042 And in the lectures, when we were working with Emily on them, she covered the issue 86 00:05:40,042 --> 00:05:45,073 that these uninformative words can drown out the important words. 87 00:05:45,073 --> 00:05:48,695 And that's why we introduced the notion of tfidf. 88 00:05:48,695 --> 00:05:53,869 [MUSIC]