1 00:00:00,012 --> 00:00:06,077 Okay, let's step back and reflect on what we've learned so far. 2 00:00:06,078 --> 00:00:11,530 We've got a basic idea of how documents are stored on the web and how they can be 3 00:00:11,530 --> 00:00:17,664 retrieved using text indexes. But we still haven't really understood 4 00:00:17,664 --> 00:00:23,790 how big the problem is. We don't know, for example, how many 5 00:00:23,790 --> 00:00:29,236 documents there are on the web. We don't know how to best order the 6 00:00:29,236 --> 00:00:33,947 documents when they are returned from a search. 7 00:00:33,948 --> 00:00:40,400 And we don't really understand what any of this has to do with intelligence, memory, 8 00:00:40,400 --> 00:00:45,920 and the kinds of things that we're trying to understand about predictive 9 00:00:45,920 --> 00:00:49,669 intelligence. But bear with me for a while, and we'll 10 00:00:49,669 --> 00:00:55,426 get to these subjects very soon. So, now that we know what an index is, and 11 00:00:55,426 --> 00:01:02,554 how to create it, how many web pages do you think are actually indexed by a search 12 00:01:02,554 --> 00:01:06,930 engine like Google, or Bing for that matter? 13 00:01:06,930 --> 00:01:13,394 Is it two to five billion? Is it 30 to 40 billion? 14 00:01:13,394 --> 00:01:21,346 Is it 200 to 300 billion, or in the trillions? Now, the number of possible URLs of web 15 00:01:21,346 --> 00:01:27,754 pages is probably in the trillions. So that's not really the answer. 16 00:01:27,754 --> 00:01:32,162 So, we can rule one out immediately. Well, what about the rest? 17 00:01:32,163 --> 00:01:36,867 Remember, we're not asking how many web pages there are. 18 00:01:36,867 --> 00:01:42,372 We're asking how many web pages are indexed by a search engine. 19 00:01:42,372 --> 00:01:48,699 Think about it. You can actually find this out by doing a 20 00:01:48,699 --> 00:01:55,652 few experiments with your browser. So how did you do?
21 00:01:55,653 --> 00:02:01,362 The correct answer, according to me, is between 30 and 40 billion. 22 00:02:01,363 --> 00:02:08,208 And this is how I reason. If you search for a common word such as 23 00:02:08,208 --> 00:02:17,367 "a", "in", or "the" on Google, you get around 20 to 25 billion results. 24 00:02:17,368 --> 00:02:23,952 Assuming that not all pages are in English, my guess is between 30 and 40 25 00:02:23,952 --> 00:02:27,928 billion. Now, let's return to our problem of 26 00:02:27,928 --> 00:02:34,543 arranging the results of a search query. As we have seen, the web has a lot of 27 00:02:34,543 --> 00:02:40,916 pages, billions and billions. What if the result set is very large? 28 00:02:40,917 --> 00:02:49,895 For example, when we searched for "a" or "the" in Google, we got 20 to 25 billion 29 00:02:49,895 --> 00:02:54,384 results. If you have q terms in your query, such as 30 00:02:54,384 --> 00:03:01,050 "the quick brown fox", you'll still get many hundreds of thousands of results, and 31 00:03:01,050 --> 00:03:07,615 assembling these results in the way we had discussed earlier, in terms of which 32 00:03:07,615 --> 00:03:14,607 results match best, is still pretty costly if the number of results is very large. 33 00:03:14,608 --> 00:03:20,098 Then, another problem. Suppose we search for something like 34 00:03:20,099 --> 00:03:27,891 "Clinton plays India cards". Well, when I did this a little while back, 35 00:03:27,891 --> 00:03:33,990 I got results of two types. One about Hillary Clinton visiting India 36 00:03:33,990 --> 00:03:40,110 but Islamabad was not on the cards, or more recently Mitt Romney playing the Hillary 37 00:03:40,110 --> 00:03:44,940 Clinton card or something like that, about American politics.
38 00:03:44,940 --> 00:03:51,920 And then another set of results about a company called Clinton Cards, which was 39 00:03:51,920 --> 00:03:59,114 acquired, which shut down, which closed shop, and anything to do with that 40 00:03:59,114 --> 00:04:05,050 topic. So, similarity from a search index is quite 41 00:04:05,050 --> 00:04:13,004 different from the importance of a web page. "Clinton cards India" may actually have 42 00:04:13,004 --> 00:04:17,075 many, many results that talk about the second topic. 43 00:04:17,075 --> 00:04:21,776 But clearly the first topic is more popular and more important. 44 00:04:21,776 --> 00:04:24,443 How does Google figure this out?
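
To make the gap between similarity and importance concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the four page texts are invented for this example (they are not real search results), and the names PAGES, tokenize, build_index, and match_scores are hypothetical helpers, not anything a real search engine exposes. The sketch builds a tiny text index and scores each page by how many distinct query terms it contains, which is one simple reading of "which results match best".

```python
import re
from collections import defaultdict

# Hypothetical mini-corpus (page id -> text), invented for illustration only.
PAGES = {
    "politics-1": "Hillary Clinton visits India; Islamabad not on the cards",
    "politics-2": "Romney plays the Clinton card in the campaign",
    "retail-1": "Clinton Cards shuts down stores",
    "retail-2": "Greeting cards firm Clinton Cards acquired; India supplier affected",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def build_index(pages):
    """Build a simple text index: word -> set of page ids containing it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in tokenize(text):
            index[word].add(page_id)
    return index


def match_scores(index, query):
    """Score each page by the number of distinct query terms it contains."""
    scores = defaultdict(int)
    for term in set(tokenize(query)):
        for page_id in index.get(term, ()):
            scores[page_id] += 1
    return dict(scores)


index = build_index(PAGES)
scores = match_scores(index, "Clinton plays India cards")

# List pages from best match to worst, breaking ties by page id.
for page_id, score in sorted(scores.items(), key=lambda kv: (-kv[1], kv[0])):
    print(page_id, score)
```

In this toy corpus the best politics page and the best retail page match the same number of query terms, so term matching alone cannot say which topic should come first. Some separate signal of importance is needed to break the tie.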