Now let's see how we could create a text
index for a bunch of documents, such as
the three documents we had earlier.
We go through each document one by one,
reading each word in turn. When we read
the word "the", we insert an entry against
"the" in the final structure that we want
to create.
Similarly, when we read the word "quick",
we enter A.com as the ID of this document
against "quick".
Similarly for brown, similarly for over
and for other words in this document.
Moving on to the next document, we do the
same thing.
We enter each word with B.com this time
instead of A.com, because now we are
processing the second document.
Similarly for the third document.
But this time, since "the" is already
present in the index, we merely append
the entry C.com after A.com against
"the", rather than create a new entry.
Similarly for "lazy" and "bird", and
finally for "worm" we enter a new entry
with just C.com against it.
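The two cases just described (appending to an existing entry versus creating a new one) can be sketched in Python as below. This is only an illustration: the representation of the index as a list of [word, list-of-IDs] pairs is an assumption, and the starting contents are a made-up fragment rather than the full index from the lecture.

```python
# Assumed layout: each entry pairs a word with the IDs of documents containing it.
index = [
    ["the", ["A.com"]],
    ["quick", ["A.com"]],
]

# "the" is already present, so seeing it in C.com only extends its list.
index[0][1].append("C.com")

# "worm" has not been seen before, so it gets a fresh entry holding just C.com.
index.append(["worm", ["C.com"]])

print(index)
```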
How would we write a program to do this?
Now, we'll go a little bit slowly while
describing this program, because I'm not
assuming everybody is familiar with
programming in Python or object-oriented
programming, and even if you are, I'm
assuming that you need a bit of a
refresher.
So we'll create a class, index, which is
the data structure that we are going to
build. It will have a function, create,
which is given a list of documents D.
This function will essentially populate
our index in this manner.
So what we'll do is we'll go through each
document in D.
Now in Python what this means is that if D
is a list, "for d in D" will essentially
iterate over each document in D, one by
one.
And for each such document small d, we'll
go through each word w in that document.
Once again we are assuming that the
document small d in this list of documents
D is itself a list, of words this time.
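As a small refresher, the nested iteration might look like the sketch below. The Document class is a hypothetical helper, not something named in the lecture: it is assumed here only so that a document can be a list of words and still carry an ID. The sample words are made up.

```python
class Document(list):
    """A list of words that also carries an ID (hypothetical helper)."""

    def __init__(self, doc_id, words):
        super().__init__(words)
        self.id = doc_id


# D is the list of documents; each small d is itself a list of words.
D = [
    Document("A.com", ["the", "quick"]),
    Document("B.com", ["lazy", "bird"]),
]

seen = []
for d in D:          # each document, one by one
    for w in d:      # each word in that document
        seen.append((d.id, w))

print(seen)
```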
So for each of these words in the document
small d, we have to call another function,
lookup, also on the same data structure,
index, that we are creating, to check if
this word w is already in the index.
Moreover, if the word w is in the index,
then we assume that the lookup function
will return i, which will give us the
position in the list where w is actually
present, so we can access w directly.
For example, if we're looking at the word
"quick", then it will return I = zero,
one, two, three, four, five, six, so that
we can directly access this particular
part of the index structure.
If w is not in the index structure, then
we assume that this function lookup
returns a negative number.
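One way the lookup behaviour described here could be written is shown below, as a sketch only. It assumes the index is a list of [word, list-of-IDs] entries and uses a simple linear scan; the lecture does not say how lookup is implemented, and -1 is just one possible choice of negative number.

```python
def lookup(entries, w):
    """Return the position of w in entries, or -1 if w is not there."""
    for i, entry in enumerate(entries):
        if entry[0] == w:
            return i
    return -1  # a negative number signals "not in the index"


# Made-up fragment of an index, for illustration only.
entries = [["the", ["A.com"]], ["quick", ["A.com"]], ["brown", ["A.com"]]]

print(lookup(entries, "brown"))
print(lookup(entries, "worm"))
```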
If that is the case, then we have to add w
to the index afresh, just like we added
"worm" at the end.
And in that case, the function add, which
does this, will return the position where
w was added. So for example here, it
would return the position eight.
Once we have the position eight, we can
append to the list against "worm", for
example, the ID of the document where the
word w was found, which is d.id.
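The add and append steps could be sketched as follows, again assuming the index is a list of [word, list-of-IDs] entries. The function bodies are guesses at one reasonable implementation, and the starting entries are made up, so the position returned here is not the position eight from the lecture's example.

```python
def add(entries, w):
    """Add a fresh entry for w and return the position where it was added."""
    entries.append([w, []])
    return len(entries) - 1


def append(entries, i, doc_id):
    """Record doc_id against the word stored at position i."""
    entries[i][1].append(doc_id)


entries = [["the", ["A.com"]], ["quick", ["A.com"]]]

i = add(entries, "worm")      # "worm" is new, so it goes at the end
append(entries, i, "C.com")   # append only the document's ID, not the document

print(i, entries[i])
```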
Note that we don't append d directly,
because d itself is a list of words.
So we don't want to append the entire
document here; we want to append only the
name of the document, or the ID of the web
page, or the URL of the document.
Going back, there's an [inaudible] part,
of course.
If we did find w in the first place, we
would simply append the ID at that
position i, without having to add a fresh
entry into the index.
So that's simple. With this program, if
we're able to write all these functions,
lookup, add, and append, then we can
essentially go through this list of
documents, where each document is a list
of words, and create this index structure
very simply.
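Putting the pieces together, the whole program might look something like the sketch below. Several things here are assumptions rather than the lecture's actual code: the class names Index and Document, the storage of the index as a list of [word, list-of-IDs] entries, the linear-scan lookup, and the sample documents, which are made up.

```python
class Document(list):
    """A list of words that also carries an ID (hypothetical helper)."""

    def __init__(self, doc_id, words):
        super().__init__(words)
        self.id = doc_id


class Index:
    def __init__(self):
        # Each entry is [word, [IDs of documents containing that word]].
        self.entries = []

    def lookup(self, w):
        """Return the position of w, or -1 if it is not indexed yet."""
        for i, entry in enumerate(self.entries):
            if entry[0] == w:
                return i
        return -1

    def add(self, w):
        """Add a fresh entry for w and return its position."""
        self.entries.append([w, []])
        return len(self.entries) - 1

    def append(self, i, doc_id):
        """Record doc_id against the word stored at position i."""
        self.entries[i][1].append(doc_id)

    def create(self, D):
        """Populate the index from a list of documents D."""
        for d in D:
            for w in d:
                i = self.lookup(w)
                if i < 0:
                    i = self.add(w)   # new word: add it first
                self.append(i, d.id)  # then record the document's ID


D = [
    Document("A.com", ["the", "quick", "brown"]),
    Document("C.com", ["the", "worm"]),
]
index = Index()
index.create(D)
print(index.entries)
```

Note that this sketch records an ID once per occurrence, so a word repeated within one document would list that document's ID more than once; the lecture does not say how that case is handled.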
We'll probably do some such assignment
later on in the course.
But at that time we will do it on many
machines using parallel computing, the way
it's actually done in Google and other
large search engines.