Now let's ask how fast the procedure that
we just described for index creation
actually is.
Lets assume there are n documents, m words
and on the average, w words per document.
For example, in the case of the web, n
will be a really large number, maybe in
the millions, billions or trillions.
M will be another large number, perhaps
not as large as n.
But quite large nevertheless.
And W may or may not be very large because
every document may have a few thousand or
a few hundred thousand words but certainly
not billions of words.
Let's see now what the answer is.
We have an index structure here, we have
our documents and we assume we have N
documents M total possible words and on
the average twenty words per document.
First thing we notice is that every word
in each document needs to be read at-least
once, so the complexity is at-least order
nw.
Additionally as each word is read, we need
to look up this sort of structure of words
to figure out whether or not it has been
already inserted before.
The cost of looking this kind of a
structure up is, log in, as we've seen
earlier.
For example, if a balanced binary tree is
used to store these words.
Further, we must insert the URL of the
document against the word if we happen to
find it already in the index.
Else, we must insert an entry for that
word if we don't find it.
And then insert the document against it.
Now we claim that each of these is just a
constant cost per word.
Obviously the second part which is
inserting a blank entry for a word like
worm with...
With just one fresh document is clearly a
constant cost.
Once we know where to insert the word,
worm.
But suppose the word already exist and
there are thousands of document already
against it, then it is not that clear that
it is constant cost to insert.
The fresh ID in a long list of IDs.
So there's an important assumption.
That this is going be a constant cost.
And the reason for that assumption we'll
come to in a later lecture.
For the moment we'll go ahead with that
assumption.
If either of you missed it, in the
homework, and got confused that's okay
because you were thinking pretty well.
But for the moment we're gonna make this
assumption.
And therefore the complexity of our
procedure is order N W log M.
Using a balance bynatry destroy the words
and assuming that entering a fresh
document in an existing list against some
word is a constant cost.