Okay, let's step back and reflect on what we've learned so far. We've got a basic idea of how documents are stored on the web and how they can be retrieved using text indexes. But we still haven't really understood how big the problem is. We don't know, for example, how many documents there are on the web. We don't know how best to order the documents when they are returned from a search. And we don't really understand what any of this has to do with intelligence, memory, and the kinds of things we're trying to understand about predictive intelligence. But bear with me for a while, and we'll get to these subjects very soon.

So, now that we know what an index is and how to create one, how many web pages do you think are actually indexed by a search engine like Google or Bing, for that matter? Is it two to five billion? 30 to 40 billion? 200 to 300 billion? Or in the trillions? Now, the number of possible URLs of web pages is probably in the trillions, so that's not really the answer; we can rule that one out immediately. Well, what about the rest? Remember, we're not asking how many web pages there are, but how many web pages are indexed by a search engine. Think about it. You can actually find this out by doing a few experiments with your browser.

So, how did you do? The correct answer, according to me, is between 30 and 40 billion. And this is how I reason: if you search for a common word such as "a", "in", or "the" on Google, you get around 20 to 25 billion results. Assuming that not all pages are in English, my guess is between 30 and 40 billion.

Now, let's return to our problem of arranging the results of a search query. As we have seen, the web has a lot of pages, billions and billions. What if the result set is very large? For example, when we searched for "a" or "the" on Google, we got 20 to 25 billion results.
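The back-of-the-envelope reasoning above can be written out as a tiny calculation. The specific numbers here are assumptions for illustration: 22.5 billion is just the midpoint of the observed 20 to 25 billion result counts, and the 60% English share of indexed pages is a rough guess, not a measured figure.

```python
# Back-of-the-envelope estimate of index size, following the lecture's reasoning:
# a search for a very common word matches essentially every English page, so
# scaling up by the (assumed) fraction of English pages estimates the whole index.

hits_for_common_word = 22.5e9   # midpoint of the observed 20-25 billion results
english_fraction = 0.6          # assumed share of English pages (illustrative)

estimated_index_size = hits_for_common_word / english_fraction
print(f"Estimated pages indexed: {estimated_index_size / 1e9:.1f} billion")
```

With these assumed numbers the estimate comes out around 37 billion, consistent with the 30 to 40 billion answer above.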
If you have q terms in your query, such as "the quick brown fox", you'll still get many hundreds of thousands of results, and assembling these results in the way we discussed earlier, in terms of which results match best, is still pretty costly if the number of results is very large.

Then there's another problem. Suppose we search for something like "Clinton plays India cards". Well, when I did this a little while back, I got results of two types. One was about Hillary Clinton visiting India but Islamabad not being on the cards, or, more recently, Mitt Romney playing the Hillary Clinton card, or something like that, about American politics. And another set of results was about a company called Clinton Cards, which was acquired, which shut down, which closed shop, and anything to do with that topic.

So, similarity from a search index is quite different from the importance of a web page. "Clinton cards India" may actually have many, many results which talk about the second topic, but clearly the first topic is more popular and more important. How does Google figure this out?
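To see why assembling multi-term results gets costly, here is a minimal sketch of intersecting the index's posting lists for a query like "the quick brown fox". The posting lists below are toy data I've made up; real lists for common words run to billions of document IDs, and this merge has to walk the full length of each list.

```python
# Minimal sketch of conjunctive query processing over an inverted index.
# Each term maps to a sorted list of document IDs (its posting list); a
# multi-term query is answered by intersecting those lists.

def intersect(p1, p2):
    """Merge-intersect two sorted docID lists in O(len(p1) + len(p2)) time."""
    i, j, out = 0, 0, []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out

# Toy index: sorted docID lists for each query term (illustrative only).
postings = {
    "quick": [1, 4, 7, 9, 12],
    "brown": [2, 4, 9, 11, 12],
    "fox":   [4, 6, 9, 12, 15],
}

# Intersect shortest list first to keep intermediate results small.
terms = sorted(postings, key=lambda t: len(postings[t]))
result = postings[terms[0]]
for t in terms[1:]:
    result = intersect(result, postings[t])

print(result)  # docIDs containing all three terms
```

Even with this shortest-first trick, when every posting list is huge the cost of producing and then ranking the matching set remains large, which is exactly the problem raised above.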