1 00:00:00,000 --> 00:00:06,115 An equally interesting question might be how we figure out what is the first word 2 00:00:06,115 --> 00:00:10,142 that comes to mind. For example, what is the first word 3 00:00:10,142 --> 00:00:14,560 starting with an a? Many of you might say apple. 4 00:00:15,860 --> 00:00:21,976 Are some words more important than others. Is it just the common words which are more 5 00:00:21,976 --> 00:00:26,908 important than others. In this context if you are talking about 6 00:00:26,908 --> 00:00:32,631 this course in this lecture, what's the first word that comes to mind starting 7 00:00:32,631 --> 00:00:35,420 with G. I bet many of you think Google. 8 00:00:38,660 --> 00:00:43,180 Does this, have anything to do with how Google figures out? 9 00:00:43,680 --> 00:00:49,988 Which topics to include, in the top ten documents in displays matching the query 10 00:00:49,988 --> 00:00:56,218 Clinton plays in their carts. Importance of search results in google, as 11 00:00:56,218 --> 00:01:02,764 many of you may have read, is because of an algorithm called pagerank which we will 12 00:01:02,764 --> 00:01:07,811 describe in a minute. What we also want to ask is there anything 13 00:01:07,811 --> 00:01:11,360 deeper? Which we will come to just, after that. 14 00:01:11,860 --> 00:01:18,403 So let's look at page [inaudible]. The web consists of documents which are 15 00:01:18,403 --> 00:01:24,761 linked to each other, through hyper-links. And this was the initial structure of the 16 00:01:24,761 --> 00:01:29,586 web, in a way when it first became very popular in the late 90s. 17 00:01:29,586 --> 00:01:35,561 And it remains so today, but we will come to, come to that shortly, as to how it 18 00:01:35,561 --> 00:01:42,175 might be changing. So what Brandon Page at Google, the 19 00:01:42,175 --> 00:01:50,310 founders of Google imagined while they must have stand for was suppose there was 20 00:01:50,310 --> 00:01:58,224 a random surfer, who hops from page to page, hyperlink to hyperlink At random. 21 00:01:58,224 --> 00:02:04,903 So at a page the surfer chooses at random any of the links that go out from that 22 00:02:04,903 --> 00:02:08,696 page. So the surfer is going from page to page, 23 00:02:08,696 --> 00:02:15,376 and the question that Sergie brings and every page asks was; what is the relative 24 00:02:15,376 --> 00:02:21,890 probability of visiting a particular page? So of all the pages on the web, which 25 00:02:21,890 --> 00:02:27,580 pages are more likely to visited by such a random surfer than others? 26 00:02:28,100 --> 00:02:34,540 And that probability across all the pages on the web is the page rank of that page. 27 00:02:35,020 --> 00:02:40,553 No. It might appear that the number of 28 00:02:40,553 --> 00:02:47,400 hyperlinks going into a page. Is sufficient to compute its page rank. 29 00:02:48,000 --> 00:02:57,178 Obviously if more links point to a page. More likely it is that this random circle 30 00:02:57,178 --> 00:03:01,200 will reach there. Question is, is this enough? 31 00:03:02,080 --> 00:03:10,400 It turns out that this is not enough. And the answer is null because. 32 00:03:11,080 --> 00:03:16,560 Even if a page doesn't have many incoming links. 33 00:03:16,840 --> 00:03:25,405 A surfer can revisit a page because of cycles in the graph, so that The surfer 34 00:03:25,405 --> 00:03:31,306 will go and come back to the page through variety of different roots maybe 35 00:03:31,306 --> 00:03:37,915 traversing the same link again and again but because there are so many cycles which 36 00:03:37,915 --> 00:03:40,448 return back. To the same page. 37 00:03:40,448 --> 00:03:46,586 A particular page can become important even if it doesn't have a lot of incoming 38 00:03:46,586 --> 00:03:54,309 hyperlinks. The point is that page rank is a global 39 00:03:54,309 --> 00:04:02,323 property of this web graph and cannot be computed simply by looking at the number 40 00:04:02,323 --> 00:04:07,706 links of each page. This is the second major computation that 41 00:04:07,706 --> 00:04:13,313 a search engine like Google has to do, computing the page rank of each page, the 42 00:04:13,313 --> 00:04:16,600 first of course being indexing the web as it grows. 43 00:04:17,620 --> 00:04:24,856 This page rank of each page is computed iteratively continuously and in parallel 44 00:04:24,856 --> 00:04:29,680 on, as we shall see, thousands and thousands of servers. 45 00:04:30,600 --> 00:04:35,561 For those of you who are slightly more mathematically minded. 46 00:04:35,561 --> 00:04:41,255 The page rank is related to the eigenvector of a particular adjacency 47 00:04:41,255 --> 00:04:45,159 matrix. But we're not going to go into that math 48 00:04:45,159 --> 00:04:46,380 in this course.