Now let's ask how fast the procedure that we just described for index creation actually is. Let's assume there are n documents, m words, and on the average w words per document. For example, in the case of the web, n will be a really large number, maybe in the millions, billions, or trillions. m will be another large number, perhaps not as large as n, but quite large nevertheless. And w may or may not be very large, because every document may have a few thousand or even a few hundred thousand words, but certainly not billions of words. Let's see now what the answer is.

We have an index structure here, we have our documents, and we assume we have n documents, m total possible words, and on the average w words per document. The first thing we notice is that every word in each document needs to be read at least once, so the complexity is at least of order nw.

Additionally, as each word is read, we need to look it up in this sorted structure of words to figure out whether or not it has already been inserted before. The cost of looking up this kind of structure is log m, as we've seen earlier, for example if a balanced binary tree is used to store these words. Further, we must insert the URL of the document against the word if we happen to find it already in the index. Else, if we don't find it, we must insert a fresh entry for that word and then insert the document against it.

Now we claim that each of these steps is just a constant cost per word. Obviously the second part, which is inserting a blank entry for a word like "worm" with just one fresh document, is clearly a constant cost once we know where to insert the word "worm". But suppose the word already exists and there are thousands of documents already against it; then it is not so clear that inserting the fresh ID into a long list of IDs is a constant cost. So there is an important assumption here: that this is going to be a constant cost. The reason for that assumption we'll come to in a later lecture; for the moment we'll go ahead with it. If any of you missed this in the homework and got confused, that's okay, because you were thinking along the right lines. But for the moment we're going to make this assumption.

And therefore the complexity of our procedure is of order nw log m, using a balanced binary tree to store the words and assuming that entering a fresh document in an existing list against some word is a constant cost.
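To make the cost accounting concrete, here is a minimal sketch of the procedure in Python. The sample documents, the name build_index, and whitespace tokenisation are illustrative assumptions, not something given in the lecture. Note also that Python's built-in dict is a hash table rather than the balanced binary tree assumed above, so each lookup here is expected constant time instead of log m; the shape of the work, though, is the same: nw word reads, one dictionary lookup per word, and the assumed constant-cost append of a document against an already-known word.

    # Minimal sketch of the index-creation procedure analysed above.
    # Assumed input format (illustrative): a dict mapping each document's
    # URL to its text, with words separated by whitespace.

    def build_index(documents):
        """Build an inverted index: word -> list of URLs containing it."""
        index = {}
        for url, text in documents.items():        # n documents
            for word in text.lower().split():       # about w words per document
                postings = index.get(word)          # lookup: log m with a balanced tree
                if postings is None:
                    index[word] = [url]             # new word: one-entry list, constant cost
                elif postings[-1] != url:
                    postings.append(url)            # existing word: the assumed constant-cost append
        return index

    if __name__ == "__main__":
        docs = {
            "doc1": "the early bird catches the worm",
            "doc2": "a worm turned up in the apple",
        }
        idx = build_index(docs)
        print(idx["worm"])   # prints ['doc1', 'doc2']

If the dict were replaced by a balanced binary tree, the inner loop would perform nw lookups of cost log m each, which is exactly where the order nw log m bound quoted above comes from.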