1 00:00:00,000 --> 00:00:06,148 Now let's see how we could create a text index for a bunch of documents. 2 00:00:06,148 --> 00:00:09,820 Such as the three documents we had earlier. 3 00:00:10,260 --> 00:00:17,133 We go through each document one by one, reading each word in turn, when we read 4 00:00:17,133 --> 00:00:24,623 the word the, we insert an element against the in the final structure that we want to 5 00:00:24,623 --> 00:00:29,699 create. Similarly to read the word quick we enter 6 00:00:29,699 --> 00:00:34,241 A.com as the id of this document against quick. 7 00:00:34,241 --> 00:00:41,586 Similarly for brown, similarly for over and for other words in this document. 8 00:00:41,586 --> 00:00:46,612 Moving on to the next document, we do the same thing. 9 00:00:46,612 --> 00:00:54,633 Enter word with B.com this time instead of A.com because now we are processing the 10 00:00:54,633 --> 00:01:00,000 second document. Singularly for the third document. 11 00:01:00,000 --> 00:01:07,098 But this time since the is already present, in the index, we merely append 12 00:01:07,098 --> 00:01:14,196 the entry c dot com after a dot com against the, rather than create a new 13 00:01:14,196 --> 00:01:17,680 entry. Similarly for lazy. 14 00:01:19,180 --> 00:01:26,760 Bird, and finally for worm we enter a new entry with just c dot com against it. 15 00:01:28,220 --> 00:01:34,551 How would we write a program to do this? Now, we'll go a little bit slowly while 16 00:01:34,551 --> 00:01:40,642 describing this program because I'm not assuming everybody is familiar with 17 00:01:40,642 --> 00:01:46,973 programming Python or object-oriented programming, or even if you are I'm 18 00:01:46,973 --> 00:01:50,500 assuming that you need a bit of refreshment. 19 00:01:51,260 --> 00:01:57,318 So we'll create a class index which is going to be a data structure index that 20 00:01:57,318 --> 00:02:03,606 we, we are gonna create which will have a function which will create, which given a 21 00:02:03,606 --> 00:02:10,661 list of documents D. This function will essentially populate 22 00:02:10,661 --> 00:02:17,090 our index in this manner. So what we'll do is we'll go through each 23 00:02:17,090 --> 00:02:21,475 document in D. Now in Python what this means is that if D 24 00:02:21,475 --> 00:02:26,860 is a list, for D in D will essentially iterate over each document in D. 25 00:02:29,120 --> 00:02:34,586 One by one. And for each such document small d will go 26 00:02:34,586 --> 00:02:40,578 through each word w in that document. Once again we are assuming the document 27 00:02:40,578 --> 00:02:46,881 small, the document small d in this list of documents d is itself a list of words 28 00:02:46,881 --> 00:02:50,800 this time. So for each of these words, we'll go 29 00:02:50,800 --> 00:02:58,911 through the document small d. And you've to call another function, look 30 00:02:58,911 --> 00:03:05,160 up, also on the same beta structure index that we are creating. 31 00:03:05,160 --> 00:03:10,730 And look up and check if this word W is already in the index. 32 00:03:10,730 --> 00:03:18,239 Moreover if the word W is in the index then we assume that the look up function 33 00:03:18,239 --> 00:03:23,590 will return I, if you'll give us the position in the list. 34 00:03:23,590 --> 00:03:28,744 Where W is actually present, so we can access W directly. 35 00:03:28,744 --> 00:03:35,648 For example, if we're looking at the word "quick", then it will return I = zero, 36 00:03:35,648 --> 00:03:42,551 one, two, three, four, five, six, so that we can directly access this particular 37 00:03:42,551 --> 00:03:48,811 part of the index structure. If W is not in the index structure, then 38 00:03:48,811 --> 00:03:54,610 we assume that this function look up returns a negative number. 39 00:03:54,610 --> 00:04:01,035 If that is the case then we have to add W to the index, fresh just like we added 40 00:04:01,035 --> 00:04:05,839 worm in the end. And in that case the, the function add 41 00:04:05,839 --> 00:04:11,718 which does this will return the position where W was added, so for example here, it 42 00:04:11,718 --> 00:04:16,952 would return the position eight. Once its, once we have the position eight, 43 00:04:16,952 --> 00:04:23,424 we can append to this List against worm for example the ID of the document where 44 00:04:23,424 --> 00:04:26,659 the word "W" was found which is "D" dot ID. 45 00:04:26,659 --> 00:04:32,880 Not is that we don't append "D" directly because "D" itself is a list of words. 46 00:04:32,880 --> 00:04:39,433 So we don't want to append the entire document here we want to append only the 47 00:04:39,433 --> 00:04:48,247 name of the document of the ID of the web page or URL of the document Going back, 48 00:04:48,247 --> 00:04:54,461 there's an [inaudible] part, of course. If we did find W in the first place, we 49 00:04:54,461 --> 00:04:59,625 would simply append it. Append the ID at that position I, without 50 00:04:59,625 --> 00:05:03,580 having to ap, add a fresh entry into the index. 51 00:05:04,020 --> 00:05:10,079 So that's simple, so this program, if we, if we're able to write all this functions, 52 00:05:10,079 --> 00:05:14,942 look-up, add, append. Then we can, essentially go through this 53 00:05:14,942 --> 00:05:20,851 list of documents, where each document is a list of words and create this index 54 00:05:20,851 --> 00:05:23,459 structure. Very simply. 55 00:05:23,459 --> 00:05:28,620 We'll probably do some such assignment later on in the course. 56 00:05:28,620 --> 00:05:35,364 But at that time we will do it on many machines using parallel computing, the way 57 00:05:35,364 --> 00:05:40,360 it's actually done in Google and other large search engines.