1 00:00:00,000 --> 00:00:07,305 Let's begin our exploration with the first element in our cycle of look, listen, 2 00:00:07,305 --> 00:00:13,540 learn, connect, predict, correct. Looking is about finding stuff. 3 00:00:14,060 --> 00:00:20,881 For example, finding stuff on the web, finding stuff on one's computer or even 4 00:00:20,881 --> 00:00:27,967 finding stuff when one looks around the room, when one look at some data to find 5 00:00:27,967 --> 00:00:36,294 pattern which may be hidden in that data. Or even finding stuff when one tries to 6 00:00:36,294 --> 00:00:43,717 recall things from ones memory. So finding stuff is really essential, to 7 00:00:43,717 --> 00:00:50,665 what we consider intelligent behavior. But on the web, finding stuff is about 8 00:00:50,665 --> 00:00:54,780 finding documents. Let's see how that happens. 9 00:00:54,780 --> 00:01:02,460 When you want to look for documents on the web, we type in queries into a search box. 10 00:01:02,760 --> 00:01:11,080 And some how miraculously those documents which best match our query turn up. 11 00:01:11,480 --> 00:01:16,940 The technique that makes this possible is called text indexing. 12 00:01:17,860 --> 00:01:24,526 Now let's illustrate that with an example. The simplest of being these three 13 00:01:24,526 --> 00:01:29,356 documents, at three different web sites, 'a.com', 'b.com' and'c.com'. 14 00:01:29,356 --> 00:01:35,771 In all the cases these documents are very simple and just consider a situation where 15 00:01:35,771 --> 00:01:42,466 we have only three documents. And when we type a query into a search box 16 00:01:42,466 --> 00:01:47,162 such as "fox," we'd like the first document to be returned. 17 00:01:47,162 --> 00:01:53,787 On the other hand, when we type in "the bird," we we'd like all three documents to 18 00:01:53,787 --> 00:02:00,160 be returned, because the word "the" is present in two of the documents and the 19 00:02:00,160 --> 00:02:05,360 word "bird" is present in another two of the documents, combined. 20 00:02:06,140 --> 00:02:11,960 The combination, the bird is present in one of these three documents. 21 00:02:13,600 --> 00:02:18,876 But when we type the query word, we would like only two of these documents to be 22 00:02:18,876 --> 00:02:22,240 returned, because word exists in only two documents. 23 00:02:23,300 --> 00:02:30,424 To make this possible, what the web does, or what a search engine does, is create 24 00:02:30,424 --> 00:02:38,245 what is called an inverted index. So against every possible word which might 25 00:02:38,245 --> 00:02:47,244 occur in any possible document on the web. A text index stores, which documents that 26 00:02:47,244 --> 00:02:53,853 particular word is found in. So, for example, the word "board" is found 27 00:02:53,853 --> 00:03:01,449 in the documents B.com and C.com, whereas the word "fox" is found only in A.com. 28 00:03:01,449 --> 00:03:07,960 So when we type in the query "fox," we get one result, which is A.com. 29 00:03:08,260 --> 00:03:12,240 On the other hand, when we type in the query, the word. 30 00:03:12,500 --> 00:03:20,290 We get two documents, A.com and C.com from the, and we get two more documents, B.com 31 00:03:20,290 --> 00:03:27,890 and C.com from word, taking the union of these two lists that we get back, we get 32 00:03:27,890 --> 00:03:35,849 the results which are all three documents. However, if we type in a query, bird, 33 00:03:35,849 --> 00:03:43,505 alone then we only get two documents p dot com and c dot com, a very simple idea but 34 00:03:43,505 --> 00:03:50,451 it is fundamental to how the web works. Well let's see what it takes to find a 35 00:03:50,451 --> 00:03:57,038 document in such a text index. The web consists of documents which have 36 00:03:57,038 --> 00:04:04,396 1,000,000's of words, so the number of words, on the left hand side of such text 37 00:04:04,396 --> 00:04:10,860 index, is huge. And when we, want, to search for a word, 38 00:04:10,860 --> 00:04:16,842 we have to look up. Some structure where all these words are 39 00:04:16,842 --> 00:04:22,509 stored. So assuming some sworded structure such as 40 00:04:22,509 --> 00:04:30,726 binary tree, which you may have studied in your data structures courses, is used to, 41 00:04:30,726 --> 00:04:37,340 to store all these words, then looking up a word takes log'n' time. 42 00:04:38,940 --> 00:04:45,993 In fact you can do a little bit better if we use hash tables, but we won't go into 43 00:04:45,993 --> 00:04:52,703 that in detail right now, but it's not enough to just lookup the words in your 44 00:04:52,703 --> 00:04:57,250 query. We also have to assemble the results. 45 00:04:57,250 --> 00:05:04,046 That we get when we look up the words for example, if you have few terms in your 46 00:05:04,046 --> 00:05:07,700 query like, the bird. Two terms in this case. 47 00:05:07,700 --> 00:05:15,081 You have to look up the word the, and get a.com and c.com, look up the word, word 48 00:05:15,081 --> 00:05:21,714 and getting b.com and c.com. And you have to assemble the results which 49 00:05:21,714 --> 00:05:29,002 consist of all three documents and also order them, so that the document c.com, 50 00:05:29,002 --> 00:05:36,103 comes first because it matches both the words, in your query, unlike a.com or 51 00:05:36,103 --> 00:05:45,062 b.com, where only one word matches. The process of assembling the results of a 52 00:05:45,062 --> 00:05:49,447 cue term query. Takes order RQ time. 53 00:05:49,447 --> 00:05:53,924 If R is the total number of intermediate results. 54 00:05:53,924 --> 00:05:59,407 That is, the number before one takes the union of the result. 55 00:05:59,407 --> 00:06:04,615 So in this case of a query like the bird, R would be four. 56 00:06:04,615 --> 00:06:12,291 Because you get two results from word and two results from the, even though c.com is 57 00:06:12,291 --> 00:06:17,500 repeated you count it twice, because you have to traverse. 58 00:06:18,540 --> 00:06:27,060 Both this list as well as this list. To show that this is the case. 59 00:06:27,880 --> 00:06:30,954 Is something I'll leave to you as homework. 60 00:06:30,954 --> 00:06:35,460 And we'll come back to this as a homework question later on. 61 00:06:37,380 --> 00:06:43,361 The problem is. If R is huge you really don't want to 62 00:06:43,361 --> 00:06:50,450 assemble the entire result sets and you want some other way of ordering. 63 00:06:50,450 --> 00:06:55,913 The results of the query, apart from just which query matters best. 64 00:06:55,913 --> 00:07:03,198 And that's the second major invention on the web and we will come to that shortly. 65 00:07:03,198 --> 00:07:09,985 But before we do this, we will first look at how text indexes created in the first 66 00:07:09,985 --> 00:07:13,793 place. This is something search engines do all 67 00:07:13,793 --> 00:07:20,747 the times continuously and forms the heart of what's going on inside a large search 68 00:07:20,747 --> 00:07:22,320 engine like Google.