1
00:00:00,000 --> 00:00:06,148
Now let's see how we could create a text
index for a bunch of documents.

2
00:00:06,148 --> 00:00:09,820
Such as the three documents we had
earlier.

3
00:00:10,260 --> 00:00:17,133
We go through each document one by one,
reading each word in turn, when we read

4
00:00:17,133 --> 00:00:24,623
the word the, we insert an element against
the in the final structure that we want to

5
00:00:24,623 --> 00:00:29,699
create.
Similarly to read the word quick we enter

6
00:00:29,699 --> 00:00:34,241
A.com as the id of this document against
quick.

7
00:00:34,241 --> 00:00:41,586
Similarly for brown, similarly for over
and for other words in this document.

8
00:00:41,586 --> 00:00:46,612
Moving on to the next document, we do the
same thing.

9
00:00:46,612 --> 00:00:54,633
Enter word with B.com this time instead of
A.com because now we are processing the

10
00:00:54,633 --> 00:01:00,000
second document.
Singularly for the third document.

11
00:01:00,000 --> 00:01:07,098
But this time since the is already
present, in the index, we merely append

12
00:01:07,098 --> 00:01:14,196
the entry c dot com after a dot com
against the, rather than create a new

13
00:01:14,196 --> 00:01:17,680
entry.
Similarly for lazy.

14
00:01:19,180 --> 00:01:26,760
Bird, and finally for worm we enter a new
entry with just c dot com against it.

15
00:01:28,220 --> 00:01:34,551
How would we write a program to do this?
Now, we'll go a little bit slowly while

16
00:01:34,551 --> 00:01:40,642
describing this program because I'm not
assuming everybody is familiar with

17
00:01:40,642 --> 00:01:46,973
programming Python or object-oriented
programming, or even if you are I'm

18
00:01:46,973 --> 00:01:50,500
assuming that you need a bit of
refreshment.

19
00:01:51,260 --> 00:01:57,318
So we'll create a class index which is
going to be a data structure index that

20
00:01:57,318 --> 00:02:03,606
we, we are gonna create which will have a
function which will create, which given a

21
00:02:03,606 --> 00:02:10,661
list of documents D.
This function will essentially populate

22
00:02:10,661 --> 00:02:17,090
our index in this manner.
So what we'll do is we'll go through each

23
00:02:17,090 --> 00:02:21,475
document in D.
Now in Python what this means is that if D

24
00:02:21,475 --> 00:02:26,860
is a list, for D in D will essentially
iterate over each document in D.

25
00:02:29,120 --> 00:02:34,586
One by one.
And for each such document small d will go

26
00:02:34,586 --> 00:02:40,578
through each word w in that document.
Once again we are assuming the document

27
00:02:40,578 --> 00:02:46,881
small, the document small d in this list
of documents d is itself a list of words

28
00:02:46,881 --> 00:02:50,800
this time.
So for each of these words, we'll go

29
00:02:50,800 --> 00:02:58,911
through the document small d.
And you've to call another function, look

30
00:02:58,911 --> 00:03:05,160
up, also on the same beta structure index
that we are creating.

31
00:03:05,160 --> 00:03:10,730
And look up and check if this word W is
already in the index.

32
00:03:10,730 --> 00:03:18,239
Moreover if the word W is in the index
then we assume that the look up function

33
00:03:18,239 --> 00:03:23,590
will return I, if you'll give us the
position in the list.

34
00:03:23,590 --> 00:03:28,744
Where W is actually present, so we can
access W directly.

35
00:03:28,744 --> 00:03:35,648
For example, if we're looking at the word
"quick", then it will return I = zero,

36
00:03:35,648 --> 00:03:42,551
one, two, three, four, five, six, so that
we can directly access this particular

37
00:03:42,551 --> 00:03:48,811
part of the index structure.
If W is not in the index structure, then

38
00:03:48,811 --> 00:03:54,610
we assume that this function look up
returns a negative number.

39
00:03:54,610 --> 00:04:01,035
If that is the case then we have to add W
to the index, fresh just like we added

40
00:04:01,035 --> 00:04:05,839
worm in the end.
And in that case the, the function add

41
00:04:05,839 --> 00:04:11,718
which does this will return the position
where W was added, so for example here, it

42
00:04:11,718 --> 00:04:16,952
would return the position eight.
Once its, once we have the position eight,

43
00:04:16,952 --> 00:04:23,424
we can append to this List against worm
for example the ID of the document where

44
00:04:23,424 --> 00:04:26,659
the word "W" was found which is "D" dot
ID.

45
00:04:26,659 --> 00:04:32,880
Not is that we don't append "D" directly
because "D" itself is a list of words.

46
00:04:32,880 --> 00:04:39,433
So we don't want to append the entire
document here we want to append only the

47
00:04:39,433 --> 00:04:48,247
name of the document of the ID of the web
page or URL of the document Going back,

48
00:04:48,247 --> 00:04:54,461
there's an [inaudible] part, of course.
If we did find W in the first place, we

49
00:04:54,461 --> 00:04:59,625
would simply append it.
Append the ID at that position I, without

50
00:04:59,625 --> 00:05:03,580
having to ap, add a fresh entry into the
index.

51
00:05:04,020 --> 00:05:10,079
So that's simple, so this program, if we,
if we're able to write all this functions,

52
00:05:10,079 --> 00:05:14,942
look-up, add, append.
Then we can, essentially go through this

53
00:05:14,942 --> 00:05:20,851
list of documents, where each document is
a list of words and create this index

54
00:05:20,851 --> 00:05:23,459
structure.
Very simply.

55
00:05:23,459 --> 00:05:28,620
We'll probably do some such assignment
later on in the course.

56
00:05:28,620 --> 00:05:35,364
But at that time we will do it on many
machines using parallel computing, the way

57
00:05:35,364 --> 00:05:40,360
it's actually done in Google and other
large search engines.