1 00:00:00,000 --> 00:00:04,172 [MUSIC] 2 00:00:04,172 --> 00:00:09,319 In this module, Emily covered various techniques for retrieving documents, 3 00:00:09,319 --> 00:00:13,580 exploring representations for data like word counts and TF-IDF. 4 00:00:13,580 --> 00:00:17,550 Now we're gonna go through a really cool notebook where we put these ideas together and 5 00:00:17,550 --> 00:00:19,630 build a document retrieval system 6 00:00:19,630 --> 00:00:21,370 using TF-IDF. 7 00:00:21,370 --> 00:00:23,040 So, let's go ahead and do just that. 8 00:00:24,290 --> 00:00:28,020 As usual, we're gonna be using a Python notebook, and 9 00:00:28,020 --> 00:00:32,160 in this one I'm gonna change the title to Document 10 00:00:34,740 --> 00:00:38,530 retrieval, and here we go. 11 00:00:38,530 --> 00:00:40,830 And again, as usual, I'm gonna hide the header 12 00:00:42,040 --> 00:00:44,480 and hide the toolbar to give us a little bit more space. 13 00:00:45,590 --> 00:00:50,460 Okay, let's go ahead and fire up GraphLab Create. 14 00:00:50,460 --> 00:00:53,890 So we're gonna do import graphlab, 15 00:00:53,890 --> 00:00:57,350 since we're gonna be using it again in our notebook. 16 00:00:57,350 --> 00:01:00,610 Now, as the first step, we're going to load some data. 17 00:01:00,610 --> 00:01:04,900 So let's load some text data. 18 00:01:06,380 --> 00:01:10,460 This is interesting text data from Wikipedia, 19 00:01:12,520 --> 00:01:18,870 and it's pages on people. 20 00:01:18,870 --> 00:01:21,810 So, cool dataset, and I'm just going to load it, and 21 00:01:21,810 --> 00:01:26,860 we're going to say people, which is gonna be an SFrame, 22 00:01:26,860 --> 00:01:31,570 is gonna be graphlab.SFrame from a file, and 23 00:01:31,570 --> 00:01:37,074 that file is called people wiki, right here. 24 00:01:37,074 --> 00:01:40,630 And here we go, we're loading it. 25 00:01:40,630 --> 00:01:45,680 And the first thing that we're gonna do is just look at the top few lines of that file.
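As a rough stand-in for the load-and-peek step just described, here is a minimal sketch in plain Python rather than GraphLab Create. The CSV contents, the placeholder rows, and the `load_people` and `head` helpers are all assumptions made up for illustration; the notebook in the video reads the people wiki file into an SFrame instead.

```python
import csv
import io

# Hypothetical stand-in for the people wiki file: three columns and
# three placeholder rows (not real data from the course dataset).
SAMPLE_CSV = """URI,name,text
placeholder-uri-1,Barack Obama,placeholder text for the first page
placeholder-uri-2,George Clooney,placeholder text for the second page
placeholder-uri-3,Some Person,placeholder text for the third page
"""


def load_people(source):
    """Read rows from a file-like object into a list of dicts, one per page."""
    return list(csv.DictReader(source))


def head(rows, n=2):
    """Return the first n rows, like peeking at the top of a table."""
    return rows[:n]


people = load_people(io.StringIO(SAMPLE_CSV))
for row in head(people):
    print(row["name"], "|", row["text"])
```

Each row is a plain dict keyed by column name, so the same `row["name"]` lookup works for every page in the table.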
26 00:01:45,680 --> 00:01:51,200 So I'm just gonna go ahead, and you should see we have, 27 00:01:51,200 --> 00:01:55,730 so this URI is basically the location of that page on Wikipedia. 28 00:01:55,730 --> 00:01:58,760 This is the name of the person involved, 29 00:01:58,760 --> 00:02:04,380 and the text of that page about that person. 30 00:02:04,380 --> 00:02:09,660 And you have this for a pretty nice number of people here. 31 00:02:09,660 --> 00:02:13,340 So if you type len of people, 32 00:02:13,340 --> 00:02:15,060 our dataset, 33 00:02:15,060 --> 00:02:19,330 and press enter, you see that we are talking about 59,000 people in this dataset. 34 00:02:19,330 --> 00:02:21,790 It's a pretty good set, and 35 00:02:21,790 --> 00:02:26,040 you will see that with TF-IDF we are going to retrieve some really interesting documents, 36 00:02:26,040 --> 00:02:28,200 even in this relatively large dataset. 37 00:02:29,530 --> 00:02:34,901 So the first thing we are going to do is just explore our data. 38 00:02:34,901 --> 00:02:42,734 So let's type, Explore the dataset and 39 00:02:42,734 --> 00:02:49,789 check out the text it contains. 40 00:02:52,204 --> 00:02:56,360 So let's start by looking at a particular person in the dataset. 41 00:02:56,360 --> 00:02:59,150 And we're gonna look at the page for 42 00:02:59,150 --> 00:03:02,040 Barack Obama, who is the current US President. 43 00:03:04,240 --> 00:03:11,755 So out of this people SFrame I'm going to select the one whose name, 44 00:03:11,755 --> 00:03:16,786 so that's the name column, is equal 45 00:03:16,786 --> 00:03:22,737 to Barack Obama. 46 00:03:22,737 --> 00:03:29,610 And so I pressed enter, and I created this new variable called obama. 47 00:03:29,610 --> 00:03:34,450 And if you take a quick look at it, you see [COUGH] that it has the URL for 48 00:03:34,450 --> 00:03:39,200 the Obama page, the name Barack Obama, and the text for that page. 49 00:03:39,200 --> 00:03:42,560 So let's dig in, and see what that text looks like.
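The select-by-name step above can be sketched as a two-stage filter: first mark which rows match, then keep only those rows. This is a plain-Python illustration under stated assumptions, not the GraphLab Create code from the video; the placeholder rows and the `filter_by_name` helper are hypothetical.

```python
# Hypothetical placeholder rows; the table in the video has about 59,000 of them.
people = [
    {"URI": "placeholder-uri-1", "name": "Barack Obama",
     "text": "placeholder text for the first page"},
    {"URI": "placeholder-uri-2", "name": "George Clooney",
     "text": "placeholder text for the second page"},
    {"URI": "placeholder-uri-3", "name": "Some Person",
     "text": "placeholder text for the third page"},
]


def filter_by_name(rows, name):
    """Keep only the rows whose name column equals the given name."""
    # Stage 1: one True/False per row, marking where the name matches.
    mask = [row["name"] == name for row in rows]
    # Stage 2: keep the rows whose mask entry is True.
    return [row for row, keep in zip(rows, mask) if keep]


obama = filter_by_name(people, "Barack Obama")

print(len(people))        # number of rows in the whole table
print(obama[0]["name"])   # the single matching row
print(obama[0]["text"])   # the page text for that row
```

The result is itself a small table with the same three columns, which is why the text can be read straight off it by column name.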
50 00:03:42,560 --> 00:03:48,210 So for Barack Obama, you'll see that 51 00:03:48,210 --> 00:03:53,460 Barack Hussein Obama was born on August 4th, 1961, and 52 00:03:53,460 --> 00:03:58,670 is the 44th and current President of the United States. 53 00:04:00,100 --> 00:04:04,700 So, pretty natural text, what we'd really expect for this kind of data. 54 00:04:05,720 --> 00:04:09,240 So we can also look at some other person in the dataset. 55 00:04:09,240 --> 00:04:11,030 So for example, 56 00:04:11,030 --> 00:04:15,970 there's an actor called George Clooney who's been in a lot of movies. 57 00:04:15,970 --> 00:04:21,610 So if you look at the people SFrame, we're gonna select, 58 00:04:21,610 --> 00:04:26,043 so this is again a filter operation, which we've been doing in almost every notebook now, 59 00:04:26,043 --> 00:04:34,381 the one whose name is George Clooney. 60 00:04:34,381 --> 00:04:39,720 And then I'm just gonna go ahead and 61 00:04:39,720 --> 00:04:45,560 show you the text that we get for George Clooney. 62 00:04:45,560 --> 00:04:52,080 And you see that George Timothy Clooney was born in 1961. 63 00:04:52,080 --> 00:04:57,120 So, basically, he's the same age as Barack Obama, but he's not president. 64 00:04:58,180 --> 00:05:00,470 He's an American actor, writer, 65 00:05:00,470 --> 00:05:07,861 producer, director 66 00:05:07,861 --> 00:05:13,199 and activist, 67 00:05:13,199 --> 00:05:17,721 right here. 68 00:05:17,721 --> 00:05:21,879 [MUSIC]