Let's begin our exploration with the first
element in our cycle of look, listen,
learn, connect, predict, correct.
Looking is about finding stuff.
For example, finding stuff on the web,
finding stuff on one's computer or even
finding stuff when one looks around the
room, when one look at some data to find
pattern which may be hidden in that data.
Or even finding stuff when one tries to
recall things from ones memory.
So finding stuff is really essential, to
what we consider intelligent behavior.
But on the web, finding stuff is about
finding documents.
Let's see how that happens.
When you want to look for documents on the
web, we type in queries into a search box.
And some how miraculously those documents
which best match our query turn up.
The technique that makes this possible is
called text indexing.
Now let's illustrate that with an example.
The simplest of being these three
documents, at three different web sites,
'a.com', 'b.com' and'c.com'.
In all the cases these documents are very
simple and just consider a situation where
we have only three documents.
And when we type a query into a search box
such as "fox," we'd like the first
document to be returned.
On the other hand, when we type in "the
bird," we we'd like all three documents to
be returned, because the word "the" is
present in two of the documents and the
word "bird" is present in another two of
the documents, combined.
The combination, the bird is present in
one of these three documents.
But when we type the query word, we would
like only two of these documents to be
returned, because word exists in only two
documents.
To make this possible, what the web does,
or what a search engine does, is create
what is called an inverted index.
So against every possible word which might
occur in any possible document on the web.
A text index stores, which documents that
particular word is found in.
So, for example, the word "board" is found
in the documents B.com and C.com, whereas
the word "fox" is found only in A.com.
So when we type in the query "fox," we get
one result, which is A.com.
On the other hand, when we type in the
query, the word.
We get two documents, A.com and C.com from
the, and we get two more documents, B.com
and C.com from word, taking the union of
these two lists that we get back, we get
the results which are all three documents.
However, if we type in a query, bird,
alone then we only get two documents p dot
com and c dot com, a very simple idea but
it is fundamental to how the web works.
Well let's see what it takes to find a
document in such a text index.
The web consists of documents which have
1,000,000's of words, so the number of
words, on the left hand side of such text
index, is huge.
And when we, want, to search for a word,
we have to look up.
Some structure where all these words are
stored.
So assuming some sworded structure such as
binary tree, which you may have studied in
your data structures courses, is used to,
to store all these words, then looking up
a word takes log'n' time.
In fact you can do a little bit better if
we use hash tables, but we won't go into
that in detail right now, but it's not
enough to just lookup the words in your
query.
We also have to assemble the results.
That we get when we look up the words for
example, if you have few terms in your
query like, the bird.
Two terms in this case.
You have to look up the word the, and get
a.com and c.com, look up the word, word
and getting b.com and c.com.
And you have to assemble the results which
consist of all three documents and also
order them, so that the document c.com,
comes first because it matches both the
words, in your query, unlike a.com or
b.com, where only one word matches.
The process of assembling the results of a
cue term query.
Takes order RQ time.
If R is the total number of intermediate
results.
That is, the number before one takes the
union of the result.
So in this case of a query like the bird,
R would be four.
Because you get two results from word and
two results from the, even though c.com is
repeated you count it twice, because you
have to traverse.
Both this list as well as this list.
To show that this is the case.
Is something I'll leave to you as
homework.
And we'll come back to this as a homework
question later on.
The problem is.
If R is huge you really don't want to
assemble the entire result sets and you
want some other way of ordering.
The results of the query, apart from just
which query matters best.
And that's the second major invention on
the web and we will come to that shortly.
But before we do this, we will first look
at how text indexes created in the first
place.
This is something search engines do all
the times continuously and forms the heart
of what's going on inside a large search
engine like Google.