To start with, let's talk about some choices we have for how to represent our document. One really simple choice is to represent the document as a vector of word counts. This is an example of a bag of words model: we take our document, which has a bunch of words in it, ignore the order of those words, and simply count the number of instances of each word. Let's assume it's a really simple document whose only words are the following two sentences: Carlos calls the sport futbol. Emily calls the sport soccer. Those counts are going to be our word count vector.

So here's our vector over our entire vocabulary. Maybe this is the index for Carlos, here we have Emily, here we have the, then soccer and futbol, and we also need sport and calls. Counting things up, we have one count of the word Carlos, two counts of the word the, one count of the word Emily, one count of soccer, two counts of calls, two counts of sport, and one count of futbol. So this would be the word count vector representation of this very simple document.

But an issue with this very simple representation has to do with rare words. For example, imagine that this document had lots of common words like the, player, field, and goal, but only a few instances of some rare but important words like futbol and Messi. Messi is one of the players in the article that Carlos was reading, and that word really highlights something important and unique about the article, something that might be of interest specifically to Carlos, more than a word like player or field, which could relate to lots of other sports and lots of other settings. When we do distance calculations based just on raw word counts, the very common words with high counts can dominate the calculation. That's especially bad when the words are the, of, and in, really common words that are basically meaningless for assessing the similarity of documents, but that can totally swamp out the more important but rarer words in the distance computation.

So an alternative representation we can consider is TF-IDF, term frequency-inverse document frequency. Again, this is something we talked about in the first course, but it's important to review because it's a representation we're going to work with very commonly here. TF-IDF emphasizes important words in the following way: it emphasizes words that appear frequently locally, meaning they appear often in the document the person is reading, but rarely globally, so that across the entire corpus it's pretty unusual to see those words. If we see both of those things together, that's definitely an important word and something we should pick up on.

To quantify this, the first part, appearing frequently locally, is the term frequency. That is simply the word counts for the article we're reading, the query article: we go into the document, shake it up, and count up all the words that appear in it. That's a representation of how often things appear locally. But then we're going to scale that by how often those words appear globally.
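Before getting to the global part, here's a minimal Python sketch of the word-count representation we've been using; the tokenization details (lowercasing and stripping periods) are my own simplification, not something specified in the lecture. These raw counts are exactly the term frequencies just described.

```python
from collections import Counter

# Toy document from the example above: two sentences, word order ignored.
document = "Carlos calls the sport futbol. Emily calls the sport soccer."

# Very simple tokenization (lowercase, strip periods); a real pipeline
# would use a proper tokenizer.
tokens = document.lower().replace(".", "").split()

# Bag of words: the representation is just the count of each word.
word_counts = Counter(tokens)
print(word_counts)
# Counter({'calls': 2, 'the': 2, 'sport': 2, 'carlos': 1,
#          'futbol': 1, 'emily': 1, 'soccer': 1})
```

In practice these counts would typically be stored as a sparse vector over the whole vocabulary, since most entries are zero for any one document.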
For the global part, we go to our entire corpus of documents, so every article out there. One common way to compute the inverse document frequency takes the log of the number of documents you're looking over, divided, and this is the really key part, by 1 plus the number of documents that use the given word; that is, log(# documents / (1 + # documents containing the word)). That denominator is what allows us to down-weight words that appear frequently in many, many documents. So we have these two terms, and we're faced with a trade-off between local frequency and global rarity. What term frequency-inverse document frequency does is simply multiply these two factors together to get our TF-IDF representation.

Just to reiterate, by using TF-IDF vectors as the representation of our documents, we up-weight rare words that appear often in the document we're looking at but not in the rest of the corpus, and we down-weight words like the and of, words that probably appear in the document we're looking at but also appear in basically every other document out there. By doing this, when we go to do our distance calculations, the words that are important to the document we're actually reading will have more significance in that distance.
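To tie the two pieces together, here's a minimal sketch of the computation, assuming the idf form just described, log(# documents / (1 + # documents containing the word)). The function name tf_idf, the tiny three-document corpus, and the query are made up purely for illustration.

```python
import math
from collections import Counter

def tf_idf(document_tokens, corpus_token_sets):
    """TF-IDF weights for one document.
    tf  = raw word count in the document (local frequency).
    idf = log(#docs / (1 + #docs containing the word)) (global rarity)."""
    num_docs = len(corpus_token_sets)
    tf = Counter(document_tokens)
    weights = {}
    for word, count in tf.items():
        docs_with_word = sum(word in doc for doc in corpus_token_sets)
        idf = math.log(num_docs / (1.0 + docs_with_word))
        weights[word] = count * idf
    return weights

# Made-up three-document corpus: "the" appears in every document,
# while "messi" and "futbol" appear in only one.
corpus = [
    {"the", "player", "scored", "a", "goal"},
    {"the", "field", "was", "wet"},
    {"carlos", "calls", "the", "sport", "futbol", "messi"},
]
query = ["the", "the", "messi", "futbol"]
print(tf_idf(query, corpus))
# "the" is down-weighted (its idf is log(3/4), which is negative here)
# even though it has the highest raw count, while the rare words
# "messi" and "futbol" keep their weight (idf = log(3/2) each).
```

With this particular idf form, a word that appears in every document can even end up with a slightly negative weight; the point is simply that words like the contribute far less to the distance computation than rare but important words like Messi.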