Let's begin our exploration with the first element in our cycle of look, listen, learn, connect, predict, correct. Looking is about finding stuff. For example, finding stuff on the web, finding stuff on one's computer or even finding stuff when one looks around the room, when one look at some data to find pattern which may be hidden in that data. Or even finding stuff when one tries to recall things from ones memory. So finding stuff is really essential, to what we consider intelligent behavior. But on the web, finding stuff is about finding documents. Let's see how that happens. When you want to look for documents on the web, we type in queries into a search box. And some how miraculously those documents which best match our query turn up. The technique that makes this possible is called text indexing. Now let's illustrate that with an example. The simplest of being these three documents, at three different web sites, 'a.com', 'b.com' and'c.com'. In all the cases these documents are very simple and just consider a situation where we have only three documents. And when we type a query into a search box such as "fox," we'd like the first document to be returned. On the other hand, when we type in "the bird," we we'd like all three documents to be returned, because the word "the" is present in two of the documents and the word "bird" is present in another two of the documents, combined. The combination, the bird is present in one of these three documents. But when we type the query word, we would like only two of these documents to be returned, because word exists in only two documents. To make this possible, what the web does, or what a search engine does, is create what is called an inverted index. So against every possible word which might occur in any possible document on the web. A text index stores, which documents that particular word is found in. So, for example, the word "board" is found in the documents B.com and C.com, whereas the word "fox" is found only in A.com. So when we type in the query "fox," we get one result, which is A.com. On the other hand, when we type in the query, the word. We get two documents, A.com and C.com from the, and we get two more documents, B.com and C.com from word, taking the union of these two lists that we get back, we get the results which are all three documents. However, if we type in a query, bird, alone then we only get two documents p dot com and c dot com, a very simple idea but it is fundamental to how the web works. Well let's see what it takes to find a document in such a text index. The web consists of documents which have 1,000,000's of words, so the number of words, on the left hand side of such text index, is huge. And when we, want, to search for a word, we have to look up. Some structure where all these words are stored. So assuming some sworded structure such as binary tree, which you may have studied in your data structures courses, is used to, to store all these words, then looking up a word takes log'n' time. In fact you can do a little bit better if we use hash tables, but we won't go into that in detail right now, but it's not enough to just lookup the words in your query. We also have to assemble the results. That we get when we look up the words for example, if you have few terms in your query like, the bird. Two terms in this case. You have to look up the word the, and get a.com and c.com, look up the word, word and getting b.com and c.com. And you have to assemble the results which consist of all three documents and also order them, so that the document c.com, comes first because it matches both the words, in your query, unlike a.com or b.com, where only one word matches. The process of assembling the results of a cue term query. Takes order RQ time. If R is the total number of intermediate results. That is, the number before one takes the union of the result. So in this case of a query like the bird, R would be four. Because you get two results from word and two results from the, even though c.com is repeated you count it twice, because you have to traverse. Both this list as well as this list. To show that this is the case. Is something I'll leave to you as homework. And we'll come back to this as a homework question later on. The problem is. If R is huge you really don't want to assemble the entire result sets and you want some other way of ordering. The results of the query, apart from just which query matters best. And that's the second major invention on the web and we will come to that shortly. But before we do this, we will first look at how text indexes created in the first place. This is something search engines do all the times continuously and forms the heart of what's going on inside a large search engine like Google.