Now let's ask how fast the procedure that we just described for index creation actually is. Let's assume there are n documents, m words, and on the average w words per document. For example, in the case of the web, n will be a really large number, maybe in the millions, billions, or trillions. m will be another large number, perhaps not as large as n, but quite large nevertheless. And w may or may not be very large, because every document may have a few thousand or even a few hundred thousand words, but certainly not billions of words. Let's see now what the answer is.

We have an index structure here, we have our documents, and we assume we have n documents, m total possible words, and on the average w words per document. The first thing we notice is that every word in each document needs to be read at least once, so the complexity is at least of order nw.

Additionally, as each word is read, we need to look it up in this sorted structure of words to figure out whether or not it has already been inserted before. The cost of looking up this kind of structure is log m, as we've seen earlier, for example if a balanced binary tree is used to store these words. Further, we must insert the URL of the document against the word if we happen to find it already in the index. Else, if we don't find it, we must insert a fresh entry for that word and then insert the document against it.

Now we claim that each of these steps is just a constant cost per word. Obviously the second part, which is inserting a blank entry for a word like "worm" with just one fresh document, is clearly a constant cost once we know where to insert the word "worm". But suppose the word already exists and there are thousands of documents already against it; then it is not so clear that inserting the fresh ID into a long list of IDs is a constant cost. So there is an important assumption here: that this is going to be a constant cost. The reason for that assumption we'll come to in a later lecture; for the moment we'll go ahead with it. If any of you missed this in the homework and got confused, that's okay, because you were thinking along the right lines. But for the moment we're going to make this assumption.

And therefore the complexity of our procedure is of order nw log m, using a balanced binary tree to store the words and assuming that entering a fresh document in an existing list against some word is a constant cost.
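To make the cost accounting concrete, here is a minimal sketch of the procedure in Python. The sample documents, the name build_index, and whitespace tokenisation are illustrative assumptions, not something given in the lecture. Note also that Python's built-in dict is a hash table rather than the balanced binary tree assumed above, so each lookup here is expected constant time instead of log m; the shape of the work, though, is the same: nw word reads, one dictionary lookup per word, and the assumed constant-cost append of a document against an already-known word.

    # Minimal sketch of the index-creation procedure analysed above.
    # Assumed input format (illustrative): a dict mapping each document's
    # URL to its text, with words separated by whitespace.

    def build_index(documents):
        """Build an inverted index: word -> list of URLs containing it."""
        index = {}
        for url, text in documents.items():        # n documents
            for word in text.lower().split():       # about w words per document
                postings = index.get(word)          # lookup: log m with a balanced tree
                if postings is None:
                    index[word] = [url]             # new word: one-entry list, constant cost
                elif postings[-1] != url:
                    postings.append(url)            # existing word: the assumed constant-cost append
        return index

    if __name__ == "__main__":
        docs = {
            "doc1": "the early bird catches the worm",
            "doc2": "a worm turned up in the apple",
        }
        idx = build_index(docs)
        print(idx["worm"])   # prints ['doc1', 'doc2']

If the dict were replaced by a balanced binary tree, the inner loop would perform nw lookups of cost log m each, which is exactly where the order nw log m bound quoted above comes from.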