1
00:00:00,000 --> 00:00:04,128
[MUSIC]

2
00:00:04,128 --> 00:00:08,040
And so, let's actually compute TF/IDF.

3
00:00:08,040 --> 00:00:11,760
Now, I can't just compute TF/IDF and
this is an important note by the way.

4
00:00:11,760 --> 00:00:13,190
Can't just compute TF/IDF for

5
00:00:13,190 --> 00:00:18,560
the Obama article in isolation because
TF/IDF depends on the entire corpus.

6
00:00:18,560 --> 00:00:21,750
You need that normalizer which is
the number of times a word appears in

7
00:00:21,750 --> 00:00:22,780
every article.
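To see why the whole corpus is needed, here is a minimal sketch in plain Python. It assumes the common weighting tf * log(N / df), where df is the number of documents containing the word; the exact formula GraphLab Create uses is an assumption here, and the three-document corpus is made up for illustration.

```python
import math
from collections import Counter

def tf_idf(corpus):
    """Return one {word: score} dict per document in corpus (a list of strings)."""
    # Term frequency: raw word counts per document.
    counts = [Counter(doc.lower().split()) for doc in corpus]
    # Document frequency: how many documents contain each word.
    df = Counter()
    for c in counts:
        df.update(c.keys())
    n = len(corpus)
    # Assumed weighting: tf * log(N / df). A word in every document scores 0.
    return [{w: tf * math.log(n / df[w]) for w, tf in c.items()} for c in counts]

# Made-up toy corpus, not the real dataset.
corpus = [
    "obama signed the act",
    "the senate passed the act",
    "the team won the game",
]
scores = tf_idf(corpus)
print(scores[0])
```

The score for a word in the first document changes whenever the other documents change, which is why a single article cannot be scored in isolation.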

8
00:00:22,780 --> 00:00:25,620
So, I have to compute it for
the entire corpus.

9
00:00:25,620 --> 00:00:29,087
So let's go ahead and do just that.

10
00:00:29,087 --> 00:00:33,302
So here we go, so

11
00:00:33,302 --> 00:00:38,124
I'm gonna compute

12
00:00:38,124 --> 00:00:44,462
TF/IDF for the corpus.

13
00:00:44,462 --> 00:00:46,320
And I'm gonna do this in two steps.

14
00:00:47,660 --> 00:00:51,712
First, I'm gonna compute the word
counts for the entire corpus.

15
00:00:51,712 --> 00:00:57,330
So, I'm gonna add a new column to
the people table called word_count.

16
00:00:57,330 --> 00:01:01,660
Remember we just did this only for Barack
Obama, so now, we're gonna do it for

17
00:01:01,660 --> 00:01:02,160
everyone.

18
00:01:02,160 --> 00:01:10,494
So I'm gonna call,
graphlab.text_analytics.count_words and

19
00:01:10,494 --> 00:01:14,304
I'm going to put in the input,

20
00:01:14,304 --> 00:01:18,837
which is gonna be the text column.

21
00:01:18,837 --> 00:01:22,370
In other words, I'm gonna count
the words in the text column.
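As a rough picture of what counting the words in the text column does, here is a sketch in plain Python. The helper count_words below is a hypothetical stand-in, not the GraphLab Create function, and lowercasing plus whitespace splitting is an assumption about the tokenization. The two rows are invented.

```python
from collections import Counter

def count_words(text):
    """Map each lowercased, whitespace-separated token to its count."""
    return dict(Counter(text.lower().split()))

# Hypothetical stand-in for the people table: a list of row dicts.
people = [
    {"name": "Person A", "text": "The senator signed the act"},
    {"name": "Person B", "text": "The striker scored"},
]

# Add the new word_count column to every row, not just one person.
for row in people:
    row["word_count"] = count_words(row["text"])

print(people[0]["word_count"])
```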

22
00:01:23,500 --> 00:01:25,750
And just so that we're clear.

23
00:01:25,750 --> 00:01:31,010
I'll just print the SFrame
people after we do this,

24
00:01:31,010 --> 00:01:33,780
so I'm gonna print the first
few lines of that SFrame.

25
00:01:33,780 --> 00:01:35,430
So here we are, we've executed.

26
00:01:35,430 --> 00:01:40,634
Now we have the URI column, the location,
the webpage, the name of the person,

27
00:01:40,634 --> 00:01:46,254
text, and now we have this dictionary of
word counts on the right, the new column.

28
00:01:46,254 --> 00:01:50,860
Good, so
next we're gonna compute the TF/IDFs.

29
00:01:50,860 --> 00:01:55,364
So just like with word counts,
you can implement your own TF/IDF system,

30
00:01:55,364 --> 00:01:57,588
it will take you a little while to do.

31
00:01:57,588 --> 00:02:01,086
So, GraphLab Create has one already implemented,
and

32
00:02:01,086 --> 00:02:05,949
we're just gonna use that to make
this whole process pretty quick,

33
00:02:05,949 --> 00:02:09,730
so we're gonna call
graphlab.text_analytics.

34
00:02:09,730 --> 00:02:14,820
Just like with word count,
there's a function here, .tf_idf.

35
00:02:14,820 --> 00:02:17,450
And all you need to do is give an input,

36
00:02:17,450 --> 00:02:20,975
so we're just gonna give
as the input the word_count.

37
00:02:22,750 --> 00:02:26,296
word_count, and it will output the TF/IDF.

38
00:02:26,296 --> 00:02:29,780
And let me just again show
you what that looks like.

39
00:02:30,900 --> 00:02:33,560
Oops, I did a little typo here.

40
00:02:33,560 --> 00:02:35,671
That should be word_count.

41
00:02:35,671 --> 00:02:38,629
That's called word_count, not word_counts.

42
00:02:38,629 --> 00:02:43,390
And now I'm going through the whole
corpus, with 50,000 documents.

43
00:02:43,390 --> 00:02:47,440
Computing the frequency of words,
normalizing.

44
00:02:47,440 --> 00:02:52,150
And what we end up with is a table where
for every document, it calls the table

45
00:02:52,150 --> 00:02:59,150
docs, it has a dictionary of TF/IDF's for
each one of those documents.

46
00:02:59,150 --> 00:03:03,830
And just so that we get all this right,
I'm gonna add a new column

47
00:03:03,830 --> 00:03:09,220
to the people table, and
this new column is gonna be called tfidf,

48
00:03:09,220 --> 00:03:14,510
and I'm just gonna store in there
the tfidf's that I just computed.

49
00:03:14,510 --> 00:03:17,360
So it's all in one table,
and it's a docs column.

50
00:03:19,350 --> 00:03:20,930
Here we go.
We just added it in.

51
00:03:20,930 --> 00:03:25,880
So now we have tfidf's for every
document computed and stored in there.

52
00:03:25,880 --> 00:03:27,700
Let's do some examination.

53
00:03:27,700 --> 00:03:33,283
So here's what we're gonna do,

54
00:03:33,283 --> 00:03:39,565
we're gonna examine the TF-IDF for

55
00:03:39,565 --> 00:03:43,530
the Obama article.

56
00:03:47,500 --> 00:03:50,657
So just like we examined and
sorted the word counts,

57
00:03:50,657 --> 00:03:53,547
we're now gonna examine
and sort the TF-IDFs.

58
00:03:53,547 --> 00:03:56,703
I'm gonna reread the variable for Obama,

59
00:03:56,703 --> 00:04:02,530
because we now added these two new
columns in the latest version of that.

60
00:04:02,530 --> 00:04:07,800
So I'm gonna take people, and I'm gonna
select out of the people, the one whose

61
00:04:07,800 --> 00:04:14,320
name is equal to Barack Obama.

62
00:04:16,120 --> 00:04:20,700
Okay, I've done it, I've created this
Obama, and now, just like we did with word

63
00:04:20,700 --> 00:04:29,150
counts, I will create an obama_tfidf_table
so that we can sort it too.

64
00:04:29,150 --> 00:04:30,310
It's a dictionary.

65
00:04:30,310 --> 00:04:32,580
We're just gonna sort it exactly
the same way as we did before.

66
00:04:32,580 --> 00:04:35,150
We will stack it and then sort it.

67
00:04:35,150 --> 00:04:37,296
And we're gonna do this.

68
00:04:37,296 --> 00:04:41,400
Actually, instead of creating a table,
I'm just gonna do it in one line.

69
00:04:41,400 --> 00:04:43,837
Oops, I'm gonna do it in just one line here.

70
00:04:43,837 --> 00:04:47,469
So I'm gonna write out
what we did earlier, so

71
00:04:47,469 --> 00:04:50,824
I'm gonna just take the obama variable and

72
00:04:50,824 --> 00:04:56,388
I'm gonna select only the tfidf column so
it looks a little prettier.

73
00:04:56,388 --> 00:05:00,460
Then I'm gonna call the stack method,
which takes a dictionary and

74
00:05:00,460 --> 00:05:02,580
just stacks it into two columns.

75
00:05:02,580 --> 00:05:07,422
So I'm gonna stack tfidf.

76
00:05:07,422 --> 00:05:13,804
And I'm going to output
new column names and

77
00:05:13,804 --> 00:05:19,324
those names are going to be the word and

78
00:05:19,324 --> 00:05:25,547
the tfidf, and let me show you something.

79
00:05:25,547 --> 00:05:32,212
Oops, I forgot to close the bracket here.

80
00:05:32,212 --> 00:05:36,725
And let me show you a little neat
trick that you can use with Python in

81
00:05:36,725 --> 00:05:38,300
various ways.

82
00:05:38,300 --> 00:05:41,880
I'm just gonna chain the sort
command at the end of this.

83
00:05:41,880 --> 00:05:43,959
So I'm just gonna type .sort.

84
00:05:45,540 --> 00:05:52,730
And I'm gonna sort this
output on the tfidf column.

85
00:05:52,730 --> 00:05:59,590
And I'm gonna say ascending=False.

86
00:05:59,590 --> 00:06:02,920
So what I did in multiple lines before,
now I'm doing it in just one line.

87
00:06:02,920 --> 00:06:06,087
I'm taking the obama column tfidf.

88
00:06:06,087 --> 00:06:10,139
I'm gonna stack it into a word column,
tfidf column, and

89
00:06:10,139 --> 00:06:13,920
now I'm gonna sort it in descending order.

90
00:06:13,920 --> 00:06:16,070
So from highest to lowest.
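The stack-then-sort idea can be sketched in plain Python like this. The helper stack_and_sort is hypothetical and only mimics the steps described: turn a dictionary into rows with a word column and a tfidf column, then order the rows from highest to lowest. The scores are invented toy values, not real results.

```python
def stack_and_sort(scores):
    """Turn {word: score} into rows sorted by score, highest first."""
    # "Stack": one row per dictionary entry, with two named columns.
    rows = [{"word": word, "tfidf": value} for word, value in scores.items()]
    # "Sort" chained right after stacking, in descending order.
    return sorted(rows, key=lambda row: row["tfidf"], reverse=True)

# Invented toy scores for illustration only.
toy_tfidf = {"the": 0.0, "obama": 4.2, "act": 1.3, "in": 0.1}

table = stack_and_sort(toy_tfidf)
for row in table:
    print(row["word"], row["tfidf"])
```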

91
00:06:16,070 --> 00:06:19,530
And if you remember,
just before we run this,

92
00:06:19,530 --> 00:06:24,475
when we did it for
just word count, it looked like this.

93
00:06:24,475 --> 00:06:27,250
"The" was the most popular word,
then in, then and, then of,

94
00:06:27,250 --> 00:06:30,260
then to, his, then Obama, act, a, and he.

95
00:06:30,260 --> 00:06:35,010
So those words are mostly uninformative
except for the word Obama.

96
00:06:36,320 --> 00:06:39,980
So let's execute it for TF-IDF.

97
00:06:39,980 --> 00:06:44,180
And what we see here, voila,
the most informative word is Obama

98
00:06:44,180 --> 00:06:47,220
which makes a lot of sense
because the article is about him.

99
00:06:47,220 --> 00:06:50,630
But then you have act, Iraq,
control, law, ordered,

100
00:06:50,630 --> 00:06:54,690
military, involvement, response,
democratic, as in Democratic Party.

101
00:06:54,690 --> 00:06:59,669
So you see,
lots of action going on here around

102
00:06:59,669 --> 00:07:05,171
words that are important,
with respect to Obama.

103
00:07:05,171 --> 00:07:09,309
[MUSIC]