1 00:00:00,000 --> 00:00:03,755 [MUSIC] 2 00:00:03,755 --> 00:00:09,005 So let's take this TF IDF's, and do a couple of fun things with them. 3 00:00:09,005 --> 00:00:14,398 So first, we're gonna manually, 4 00:00:14,398 --> 00:00:20,100 okay, so let me just type in here. 5 00:00:20,100 --> 00:00:25,150 So we're gonna manually compute 6 00:00:25,150 --> 00:00:30,770 distances between a few people. 7 00:00:32,230 --> 00:00:36,250 So the goal here is just to show what those distances look like, 8 00:00:36,250 --> 00:00:41,700 just to get a sense of what we're learning from TF IDF. 9 00:00:41,700 --> 00:00:45,830 So for example, let's take three people. 10 00:00:45,830 --> 00:00:48,090 So we already talked about Obama. 11 00:00:48,090 --> 00:00:50,020 We already have that Obama variable in there. 12 00:00:50,020 --> 00:00:51,770 Let's create two new variables. 13 00:00:51,770 --> 00:01:00,116 Let's say Clinton, which are the people I'm going to select one. 14 00:01:00,116 --> 00:01:06,050 The person whose name is equal 15 00:01:06,050 --> 00:01:12,290 to Bill Clinton, the former US President. 16 00:01:15,210 --> 00:01:20,034 Let's select another person, so for example, let's select Beckham. 17 00:01:20,034 --> 00:01:25,702 So, Beckham is a famous British, 18 00:01:25,702 --> 00:01:30,393 English footballer and so, 19 00:01:30,393 --> 00:01:36,843 we're gonna select out of the people, 20 00:01:36,843 --> 00:01:42,511 the person whose name is equal to, 21 00:01:42,511 --> 00:01:48,578 his name is actually David Beckham. 22 00:01:48,578 --> 00:01:53,570 Now that we selected Clinton and Beckham, let's compute the similarity between those 23 00:01:53,570 --> 00:01:56,320 two people, and the President, Barack Obama. 24 00:01:58,280 --> 00:02:06,078 So, what we gonna do is ask the question. 25 00:02:06,078 --> 00:02:10,649 Is Obama closer to 26 00:02:10,649 --> 00:02:16,740 Clinton than to Beckham? 27 00:02:22,593 --> 00:02:24,210 Okay a little mistake here. 28 00:02:25,280 --> 00:02:29,806 So is Obama closer to Clinton than to Beckham? 29 00:02:34,707 --> 00:02:38,252 Now there are various ways to measure similarity or 30 00:02:38,252 --> 00:02:42,710 distances between two vectors, or in this case two documents. 31 00:02:42,710 --> 00:02:44,770 We computed the TF IDF's. 32 00:02:44,770 --> 00:02:47,880 And what we're gonna do is compute the distance between 33 00:02:47,880 --> 00:02:53,320 these different documents, the one about Clinton, the one about Obama and so on. 34 00:02:53,320 --> 00:02:59,360 So I'm gonna use distance metrics that are already 35 00:02:59,360 --> 00:03:02,930 implemented inside the GraphLab Create so we don't have to implement them ourselves. 36 00:03:02,930 --> 00:03:07,220 So we need to look at graphlab.distances and press Tab, 37 00:03:07,220 --> 00:03:09,450 you'll see several options here. 38 00:03:09,450 --> 00:03:15,550 The Clinton distance that we talked about in class, cosine distance, 39 00:03:15,550 --> 00:03:20,850 jaccard similarity, and so on that we're gonna see throughout this specialization. 40 00:03:20,850 --> 00:03:24,300 We're gonna use cosine distance. 41 00:03:24,300 --> 00:03:28,120 And just as a little note, normally we think about cosine similarity, 42 00:03:28,120 --> 00:03:29,260 if you've heard of it. 43 00:03:29,260 --> 00:03:32,570 Where the higher the number the more similar two articles are. 44 00:03:32,570 --> 00:03:38,160 Here, we have a distance version of this number, so the lower the better. 45 00:03:38,160 --> 00:03:43,230 The lower the cosine distance, the closer the articles are. 46 00:03:43,230 --> 00:03:49,915 So the question is, what is the cosine distance between Obama's 47 00:03:49,915 --> 00:03:54,410 tfidf and that of Clinton? 48 00:03:54,410 --> 00:03:58,240 But notice that I have selected the column tfidf, and I have to have the little 0 49 00:03:58,240 --> 00:04:02,360 at the end here, because it is the zeroth row of this table. 50 00:04:02,360 --> 00:04:04,050 The table only has one element in it, but 51 00:04:04,050 --> 00:04:05,870 we still have to say what row of the table we're looking at. 52 00:04:05,870 --> 00:04:12,880 And so we're gonna compare the Obama tfidf with the Clinton tfidf. 53 00:04:15,308 --> 00:04:19,360 Also at 0, in terms of the cosine distance, and 54 00:04:19,360 --> 00:04:21,500 you see that the distance is 0.83. 55 00:04:21,500 --> 00:04:26,858 Now, the question is, what is the distance between, 56 00:04:26,858 --> 00:04:33,861 in the same metric in the cosine distance, between Obama and Beckham? 57 00:04:33,861 --> 00:04:38,470 So I'm gonna type Obama 58 00:04:38,470 --> 00:04:43,310 tfidf, computed at 0, 59 00:04:43,310 --> 00:04:50,470 with Beckham's tfidf, also at 0. 60 00:04:50,470 --> 00:04:53,770 And you'll see that this distance is 0.97. 61 00:04:53,770 --> 00:04:58,140 The biggest distance you can get is 1.0, in fact. 62 00:04:58,140 --> 00:05:03,202 And so in this case, Obama 63 00:05:03,202 --> 00:05:08,810 is much closer to Clinton than he is to Beckham, which makes a lot of sense. 64 00:05:08,810 --> 00:05:12,210 But we've done this just manually for a few people, how do we 65 00:05:12,210 --> 00:05:16,980 automate this process of finding out how close an article is to other articles. 66 00:05:16,980 --> 00:05:18,990 And in this case how close is a person to other people. 67 00:05:20,610 --> 00:05:26,390 In the lectures, Emily talked about nearest neighbor models and 68 00:05:26,390 --> 00:05:27,850 how they can be used for document retrieval. 69 00:05:27,850 --> 00:05:32,394 So today we're gonna actually do some cool document retrieval using 70 00:05:32,394 --> 00:05:34,679 a simple nearest neighbor model. 71 00:05:34,679 --> 00:05:35,179 [MUSIC]