1
00:00:00,000 --> 00:00:07,305
Let's begin our exploration with the first
element in our cycle of look, listen,

2
00:00:07,305 --> 00:00:13,540
learn, connect, predict, correct.
Looking is about finding stuff.

3
00:00:14,060 --> 00:00:20,881
For example, finding stuff on the web,
finding stuff on one's computer or even

4
00:00:20,881 --> 00:00:27,967
finding stuff when one looks around the
room, when one look at some data to find

5
00:00:27,967 --> 00:00:36,294
pattern which may be hidden in that data.
Or even finding stuff when one tries to

6
00:00:36,294 --> 00:00:43,717
recall things from ones memory.
So finding stuff is really essential, to

7
00:00:43,717 --> 00:00:50,665
what we consider intelligent behavior.
But on the web, finding stuff is about

8
00:00:50,665 --> 00:00:54,780
finding documents.
Let's see how that happens.

9
00:00:54,780 --> 00:01:02,460
When you want to look for documents on the
web, we type in queries into a search box.

10
00:01:02,760 --> 00:01:11,080
And some how miraculously those documents
which best match our query turn up.

11
00:01:11,480 --> 00:01:16,940
The technique that makes this possible is
called text indexing.

12
00:01:17,860 --> 00:01:24,526
Now let's illustrate that with an example.
The simplest of being these three

13
00:01:24,526 --> 00:01:29,356
documents, at three different web sites,
'a.com', 'b.com' and'c.com'.

14
00:01:29,356 --> 00:01:35,771
In all the cases these documents are very
simple and just consider a situation where

15
00:01:35,771 --> 00:01:42,466
we have only three documents.
And when we type a query into a search box

16
00:01:42,466 --> 00:01:47,162
such as "fox," we'd like the first
document to be returned.

17
00:01:47,162 --> 00:01:53,787
On the other hand, when we type in "the
bird," we we'd like all three documents to

18
00:01:53,787 --> 00:02:00,160
be returned, because the word "the" is
present in two of the documents and the

19
00:02:00,160 --> 00:02:05,360
word "bird" is present in another two of
the documents, combined.

20
00:02:06,140 --> 00:02:11,960
The combination, the bird is present in
one of these three documents.

21
00:02:13,600 --> 00:02:18,876
But when we type the query word, we would
like only two of these documents to be

22
00:02:18,876 --> 00:02:22,240
returned, because word exists in only two
documents.

23
00:02:23,300 --> 00:02:30,424
To make this possible, what the web does,
or what a search engine does, is create

24
00:02:30,424 --> 00:02:38,245
what is called an inverted index.
So against every possible word which might

25
00:02:38,245 --> 00:02:47,244
occur in any possible document on the web.
A text index stores, which documents that

26
00:02:47,244 --> 00:02:53,853
particular word is found in.
So, for example, the word "board" is found

27
00:02:53,853 --> 00:03:01,449
in the documents B.com and C.com, whereas
the word "fox" is found only in A.com.

28
00:03:01,449 --> 00:03:07,960
So when we type in the query "fox," we get
one result, which is A.com.

29
00:03:08,260 --> 00:03:12,240
On the other hand, when we type in the
query, the word.

30
00:03:12,500 --> 00:03:20,290
We get two documents, A.com and C.com from
the, and we get two more documents, B.com

31
00:03:20,290 --> 00:03:27,890
and C.com from word, taking the union of
these two lists that we get back, we get

32
00:03:27,890 --> 00:03:35,849
the results which are all three documents.
However, if we type in a query, bird,

33
00:03:35,849 --> 00:03:43,505
alone then we only get two documents p dot
com and c dot com, a very simple idea but

34
00:03:43,505 --> 00:03:50,451
it is fundamental to how the web works.
Well let's see what it takes to find a

35
00:03:50,451 --> 00:03:57,038
document in such a text index.
The web consists of documents which have

36
00:03:57,038 --> 00:04:04,396
1,000,000's of words, so the number of
words, on the left hand side of such text

37
00:04:04,396 --> 00:04:10,860
index, is huge.
And when we, want, to search for a word,

38
00:04:10,860 --> 00:04:16,842
we have to look up.
Some structure where all these words are

39
00:04:16,842 --> 00:04:22,509
stored.
So assuming some sworded structure such as

40
00:04:22,509 --> 00:04:30,726
binary tree, which you may have studied in
your data structures courses, is used to,

41
00:04:30,726 --> 00:04:37,340
to store all these words, then looking up
a word takes log'n' time.

42
00:04:38,940 --> 00:04:45,993
In fact you can do a little bit better if
we use hash tables, but we won't go into

43
00:04:45,993 --> 00:04:52,703
that in detail right now, but it's not
enough to just lookup the words in your

44
00:04:52,703 --> 00:04:57,250
query.
We also have to assemble the results.

45
00:04:57,250 --> 00:05:04,046
That we get when we look up the words for
example, if you have few terms in your

46
00:05:04,046 --> 00:05:07,700
query like, the bird.
Two terms in this case.

47
00:05:07,700 --> 00:05:15,081
You have to look up the word the, and get
a.com and c.com, look up the word, word

48
00:05:15,081 --> 00:05:21,714
and getting b.com and c.com.
And you have to assemble the results which

49
00:05:21,714 --> 00:05:29,002
consist of all three documents and also
order them, so that the document c.com,

50
00:05:29,002 --> 00:05:36,103
comes first because it matches both the
words, in your query, unlike a.com or

51
00:05:36,103 --> 00:05:45,062
b.com, where only one word matches.
The process of assembling the results of a

52
00:05:45,062 --> 00:05:49,447
cue term query.
Takes order RQ time.

53
00:05:49,447 --> 00:05:53,924
If R is the total number of intermediate
results.

54
00:05:53,924 --> 00:05:59,407
That is, the number before one takes the
union of the result.

55
00:05:59,407 --> 00:06:04,615
So in this case of a query like the bird,
R would be four.

56
00:06:04,615 --> 00:06:12,291
Because you get two results from word and
two results from the, even though c.com is

57
00:06:12,291 --> 00:06:17,500
repeated you count it twice, because you
have to traverse.

58
00:06:18,540 --> 00:06:27,060
Both this list as well as this list.
To show that this is the case.

59
00:06:27,880 --> 00:06:30,954
Is something I'll leave to you as
homework.

60
00:06:30,954 --> 00:06:35,460
And we'll come back to this as a homework
question later on.

61
00:06:37,380 --> 00:06:43,361
The problem is.
If R is huge you really don't want to

62
00:06:43,361 --> 00:06:50,450
assemble the entire result sets and you
want some other way of ordering.

63
00:06:50,450 --> 00:06:55,913
The results of the query, apart from just
which query matters best.

64
00:06:55,913 --> 00:07:03,198
And that's the second major invention on
the web and we will come to that shortly.

65
00:07:03,198 --> 00:07:09,985
But before we do this, we will first look
at how text indexes created in the first

66
00:07:09,985 --> 00:07:13,793
place.
This is something search engines do all

67
00:07:13,793 --> 00:07:20,747
the times continuously and forms the heart
of what's going on inside a large search

68
00:07:20,747 --> 00:07:22,320
engine like Google.