[MUSIC] To start with, let's talk about some choices we have for how to represent our document.

One really simple choice for representing the document is simply as a vector of word counts. In particular, this is an example of a bag of words model. We take our document, which has a bunch of words in it, and let's assume it's a really simple document whose only words are the following two sentences: "Carlos calls the sport futbol. Emily calls the sport soccer." In this bag of words model, we just ignore the order of the words, and we simply count the number of instances of each word. That's going to be our word count vector. So here's our vector over our entire vocabulary, with one index for Carlos, and then indices for Emily, the, soccer, futbol, sport, and calls. Counting up, we have one count of the word Carlos, two counts of the word the, one count of the word Emily, one count of soccer, two counts of calls, two counts of sport, and one count of futbol. That's our word count vector representation of this very simple document.

But an issue with this very simple representation has to do with rare words. For example, let's imagine this document had lots of common words like the, player, field, and goal, but only a few instances of some rare but important words like futbol and Messi. Messi is one of the players in that article Carlos was reading, and that really highlights something important and unique about the article. It might be of interest specifically to Carlos, more so than something like player or field, which might show up in lots of other types of sports and lots of other settings. So when we do distance calculations based just on raw word counts, these very common words with high counts can dominate the calculation.
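To make this concrete, here's a minimal sketch of the word count vector in Python. The tokenization is deliberately simplified (lowercase everything and strip periods):

```python
from collections import Counter

# The simple two-sentence document from the example.
document = "Carlos calls the sport futbol. Emily calls the sport soccer."

# Bag of words: ignore word order, just count instances of each word.
words = document.lower().replace(".", "").split()
word_counts = Counter(words)

print(word_counts)
# Counter({'calls': 2, 'the': 2, 'sport': 2,
#          'carlos': 1, 'futbol': 1, 'emily': 1, 'soccer': 1})
```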
And this is especially bad when there are words like the, of, and in: really, really common words that are basically meaningless for the sake of assessing similarity between documents. When you're doing that distance computation, they can totally swamp the more important but rarer words.

So an alternative representation we can consider is TF-IDF: Term Frequency-Inverse Document Frequency. This, again, is something we talked about in the first course, but it's important to review because it's a representation we're going to work with very commonly here.

TF-IDF emphasizes important words in the following way. It emphasizes words that appear frequently locally, meaning they appear often in the document the person is reading, but rarely globally, meaning it's uncommon to see those words across the entire corpus. If we see both of those things together, that's definitely an important word and something we should pick up on.

To quantify this: the first part, appearing frequently locally, is the term frequency. That is simply the word counts for the article we're reading, the query article. We go into the document, shake it up, and count all the words that appear in it. That's a representation of how often things appear locally.

Then we're going to scale that by how often those words appear globally. We go to our entire corpus of documents, so every article out there. One common way to compute the inverse document frequency is the following form: the log of the number of documents you're looking over, importantly divided by 1 plus the number of documents that use the given word. That denominator is the key thing that lets us down-weight words that appear frequently in many, many documents.

Okay, so we have these two terms, and we're faced with a trade-off between local frequency and global rarity. What term frequency-inverse document frequency does is simply multiply these two factors together to get our TF-IDF representation.
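Here's a minimal sketch of that computation, assuming a corpus represented as a list of tokenized documents (the helper name tf_idf is just for illustration):

```python
import math
from collections import Counter

def tf_idf(document_words, corpus):
    """TF-IDF as described above:
    tf    = count of the word in this document (local frequency),
    idf   = log(#docs / (1 + #docs containing the word)) (global rarity),
    score = tf * idf."""
    tf = Counter(document_words)
    n_docs = len(corpus)
    scores = {}
    for word, count in tf.items():
        docs_with_word = sum(1 for doc in corpus if word in doc)
        idf = math.log(n_docs / (1 + docs_with_word))
        scores[word] = count * idf
    return scores
```

With, say, 64 documents in the corpus, a word like "the" that appears in all of them gets idf = log(64/65), essentially zero, while a rare word appearing in only 3 documents gets idf = log(64/4) ≈ 2.77, so its count is amplified rather than drowned out.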
So just to reiterate: by using TF-IDF vectors as the representation for our documents, we up-weight rare words that appear often in the document we're looking at but not across the corpus. And we down-weight words like the and of, which probably appear in the document we're looking at but also appear in basically every other document out there. By doing this, when we go to do our distance calculations, the words that are important to the document we're actually reading will carry more weight in that distance computation. [MUSIC]
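To see that effect on a distance calculation, here's a small illustration with made-up values (the vocabulary, counts, and IDF weights below are hypothetical, chosen only to show the swamping effect):

```python
import math

# Hypothetical count vectors over the vocabulary [the, of, player, futbol, messi].
doc_a = [40, 30, 5, 3, 2]   # the soccer article Carlos is reading
doc_b = [50, 35, 6, 0, 0]   # an unrelated article with similar common-word counts

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# On raw counts, the differences in 'the' and 'of' dominate the distance.
print(euclidean(doc_a, doc_b))        # ~11.8, driven by the common words

# Illustrative IDF weights: near zero for common words, large for rare ones.
idf = [0.02, 0.03, 0.9, 2.8, 3.4]
a_tfidf = [c * w for c, w in zip(doc_a, idf)]
b_tfidf = [c * w for c, w in zip(doc_b, idf)]

# After weighting, 'futbol' and 'messi' drive the distance instead.
print(euclidean(a_tfidf, b_tfidf))    # ~10.9, driven by the rare words
```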