[MUSIC] To start with, let's talk about some choices we have for how to represent our document.

One really simple choice for representing the document is simply as a vector of word counts. In particular, this is an example of a bag of words model. We take our document, which has a bunch of words in it, and let's assume it's a really simple document whose only words are the following two sentences: "Carlos calls the sport futbol. Emily calls the sport soccer." In this bag of words model, we just ignore the order of the words, and we simply count the number of instances of each word. That's going to be our word count vector. So here's our vector over our entire vocabulary, with one index for Carlos, and then indices for Emily, the, soccer, futbol, sport, and calls. Counting up, we have one count of the word Carlos, two counts of the word the, one count of the word Emily, one count of soccer, two counts of calls, two counts of sport, and one count of futbol. That's our word count vector representation of this very simple document.

But an issue with this very simple representation has to do with rare words. For example, let's imagine this document had lots of common words like the, player, field, and goal, but only a few instances of some rare but important words like futbol and Messi. Messi is one of the players in that article Carlos was reading, and that really highlights something important and unique about the article. It might be of interest specifically to Carlos, more so than something like player or field, which might show up in lots of other types of sports and lots of other settings. So when we do distance calculations based just on raw word counts, these very common words with high counts can dominate the calculation.
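To make this concrete, here's a minimal sketch of the word count vector in Python. The tokenization is deliberately simplified (lowercase everything and strip periods):

```python
from collections import Counter

# The simple two-sentence document from the example.
document = "Carlos calls the sport futbol. Emily calls the sport soccer."

# Bag of words: ignore word order, just count instances of each word.
words = document.lower().replace(".", "").split()
word_counts = Counter(words)

print(word_counts)
# Counter({'calls': 2, 'the': 2, 'sport': 2,
#          'carlos': 1, 'futbol': 1, 'emily': 1, 'soccer': 1})
```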
And this is especially bad when there are words like the, of, and in: really, really common words that are basically meaningless for the sake of assessing similarity between documents. When you're doing that distance computation, they can totally swamp the more important but rarer words.

So an alternative representation we can consider is TF-IDF: Term Frequency-Inverse Document Frequency. This, again, is something we talked about in the first course, but it's important to review because it's a representation we're going to work with very commonly here.

TF-IDF emphasizes important words in the following way. It emphasizes words that appear frequently locally, meaning they appear often in the document the person is reading, but rarely globally, meaning it's uncommon to see those words across the entire corpus. If we see both of those things together, that's definitely an important word and something we should pick up on.

To quantify this: the first part, appearing frequently locally, is the term frequency. That is simply the word counts for the article we're reading, the query article. We go into the document, shake it up, and count all the words that appear in it. That's a representation of how often things appear locally.

Then we're going to scale that by how often those words appear globally. We go to our entire corpus of documents, so every article out there. One common way to compute the inverse document frequency is the following form: the log of the number of documents you're looking over, importantly divided by 1 plus the number of documents that use the given word. That denominator is the key thing that lets us down-weight words that appear frequently in many, many documents.

Okay, so we have these two terms, and we're faced with a trade-off between local frequency and global rarity. What term frequency-inverse document frequency does is simply multiply these two factors together to get our TF-IDF representation.
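Here's a minimal sketch of that computation, assuming a corpus represented as a list of tokenized documents (the helper name tf_idf is just for illustration):

```python
import math
from collections import Counter

def tf_idf(document_words, corpus):
    """TF-IDF as described above:
    tf    = count of the word in this document (local frequency),
    idf   = log(#docs / (1 + #docs containing the word)) (global rarity),
    score = tf * idf."""
    tf = Counter(document_words)
    n_docs = len(corpus)
    scores = {}
    for word, count in tf.items():
        docs_with_word = sum(1 for doc in corpus if word in doc)
        idf = math.log(n_docs / (1 + docs_with_word))
        scores[word] = count * idf
    return scores
```

With, say, 64 documents in the corpus, a word like "the" that appears in all of them gets idf = log(64/65), essentially zero, while a rare word appearing in only 3 documents gets idf = log(64/4) ≈ 2.77, so its count is amplified rather than drowned out.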
So just to reiterate: by using TF-IDF vectors as the representation for our documents, we up-weight rare words that appear often in the document we're looking at but not across the corpus. And we down-weight words like the and of, which probably appear in the document we're looking at but also appear in basically every other document out there. By doing this, when we go to do our distance calculations, the words that are important to the document we're actually reading will carry more weight in that distance computation. [MUSIC]
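To see that effect on a distance calculation, here's a small illustration with made-up values (the vocabulary, counts, and IDF weights below are hypothetical, chosen only to show the swamping effect):

```python
import math

# Hypothetical count vectors over the vocabulary [the, of, player, futbol, messi].
doc_a = [40, 30, 5, 3, 2]   # the soccer article Carlos is reading
doc_b = [50, 35, 6, 0, 0]   # an unrelated article with similar common-word counts

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# On raw counts, the differences in 'the' and 'of' dominate the distance.
print(euclidean(doc_a, doc_b))        # ~11.8, driven by the common words

# Illustrative IDF weights: near zero for common words, large for rare ones.
idf = [0.02, 0.03, 0.9, 2.8, 3.4]
a_tfidf = [c * w for c, w in zip(doc_a, idf)]
b_tfidf = [c * w for c, w in zip(doc_b, idf)]

# After weighting, 'futbol' and 'messi' drive the distance instead.
print(euclidean(a_tfidf, b_tfidf))    # ~10.9, driven by the rare words
```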