1
00:00:00,000 --> 00:00:06,030
Let's now look at some structured data.
An example of a database that stores

2
00:00:06,030 --> 00:00:11,982
songs, their lyrics, the albums that
contain those songs, and the artists who

3
00:00:11,982 --> 00:00:16,944
may have sung them.
This example is taken from a SIGMOD

4
00:00:16,944 --> 00:00:23,948
paper, which talks about searching
relational databases, as opposed to using

5
00:00:23,948 --> 00:00:28,617
queries.
Let's see what it would take in SQL to

6
00:00:28,617 --> 00:00:34,780
get the albums with "world" in their title.
One might write a SQL query

7
00:00:34,780 --> 00:00:44,131
such as this, which is Oracle SQL.
You select star from the album table,

8
00:00:44,131 --> 00:00:49,300
where the title contains the string
"world".

9
00:00:50,060 --> 00:00:56,515
Things get a little more difficult
if one now wants more information from

10
00:00:56,515 --> 00:01:00,878
this complex schema.
So let's ask: how many SQL queries

11
00:01:00,878 --> 00:01:05,240
will it take
to retrieve the name of each artist

12
00:01:05,240 --> 00:01:11,260
and the lyrics of every song
in an album that has "world" in its title?

13
00:01:11,880 --> 00:01:16,891
Take a look at the schema and answer the
question.

14
00:01:16,891 --> 00:01:24,208
Please avoid a complex join that joins
every single table in the schema.

15
00:01:24,208 --> 00:01:32,526
You could do that, but we are trying to get
at how many simple queries it will take.

16
00:01:32,526 --> 00:01:38,240
Here is how we could do this.
We first retrieve the album

17
00:01:39,020 --> 00:01:46,412
from the album table. Then we have to
traverse this table to find out all the

18
00:01:46,412 --> 00:01:51,500
songs in that album and their lyrics from
this table.

19
00:01:52,380 --> 00:01:58,770
We also have to find out all the artists
that composed this album, B1,

20
00:01:58,770 --> 00:02:05,253
from the artist album table, and then
retrieve the actual artist names.

21
00:02:05,253 --> 00:02:12,384
Each of these can be done with a single
query, if we allow a one-table traversal

22
00:02:12,384 --> 00:02:17,253
join.
Otherwise each of these will require two

23
00:02:17,253 --> 00:02:22,292
separate queries.
Quite complicated for doing something

24
00:02:22,292 --> 00:02:28,548
which is easy to do if one just had a
Google-like search on this database.

25
00:02:28,548 --> 00:02:32,690
Unfortunately, that is quite difficult to
achieve.

26
00:02:32,690 --> 00:02:39,199
Imagine if we had a search interface so
that we could issue a query like "off

27
00:02:39,199 --> 00:02:45,540
the world", since we didn't really
remember the exact title of the album.

28
00:02:46,440 --> 00:02:54,689
The SQL approach would end up missing
partial matches such as the album title

29
00:02:54,689 --> 00:03:03,545
World or the album title Off The Wall.
Next, the schema needs to be understood

30
00:03:03,545 --> 00:03:10,074
quite carefully in order to issue the
multiple SQL queries needed to retrieve the

31
00:03:10,074 --> 00:03:17,640
information that we want.
Sometimes a complex join might be needed.

32
00:03:19,820 --> 00:03:26,579
But there's even more.
Suppose there were multiple databases,

33
00:03:26,579 --> 00:03:32,502
each with a different schema,
and partial or duplicated data across

34
00:03:32,502 --> 00:03:37,541
these databases.
And suppose the keys used in

35
00:03:37,541 --> 00:03:43,200
each database were different, with no
relationship to each other.

36
00:03:44,420 --> 00:03:50,483
Most importantly, suppose we have some
unstructured data in documents, like text

37
00:03:50,483 --> 00:03:54,858
files which contain the lyrics or
biographies of artists,

38
00:03:54,858 --> 00:03:59,770
and other data in structured databases
like the lyrics database.

39
00:03:59,770 --> 00:04:05,603
How do we search both of these together?
When you search a

40
00:04:05,603 --> 00:04:11,589
set of documents and you get an album, can
you find the songs in that album by

41
00:04:11,589 --> 00:04:15,120
looking at the lyrics database, and vice
versa?

42
00:04:17,220 --> 00:04:23,713
The point I'm trying to make is that
searching structured data well, in a

43
00:04:23,713 --> 00:04:31,717
query-like manner, remains a research problem.
The fact that so much structured data is

44
00:04:31,717 --> 00:04:39,001
being accessed using applications is only
because a lot of complicated programming

45
00:04:39,001 --> 00:04:42,949
using SQL goes into accessing that
data.

46
00:04:42,949 --> 00:04:49,180
Let us conclude now by asking whether
looking is the same as searching.

47
00:04:51,720 --> 00:04:58,047
When we look around a room, for example, and
recognize objects and people doing

48
00:04:58,047 --> 00:05:02,260
activities,
we're not really searching for anything.

49
00:05:03,020 --> 00:05:09,213
Or when we're browsing a bookshelf or
flipping the pages of a book or even

50
00:05:09,213 --> 00:05:15,569
looking at some data, like some time
series or a histogram or charts, to see if

51
00:05:15,569 --> 00:05:19,400
there might be any hidden patterns in the
data.

52
00:05:20,260 --> 00:05:27,160
In the first case, while seeing, we are
visualizing a scene.

53
00:05:27,780 --> 00:05:34,420
Computationally, this involves techniques
such as clustering and classification.

54
00:05:35,460 --> 00:05:42,807
In the second and third examples, we're
trying to get a feel for a document, a

55
00:05:42,807 --> 00:05:49,662
collection of documents, or some data,
which requires techniques such as

56
00:05:49,662 --> 00:05:56,580
automatically summarizing documents,
discovering the topics in documents,

57
00:05:57,000 --> 00:06:04,300
and discovering interesting correlations
in data without direct intervention.

58
00:06:05,700 --> 00:06:12,060
Each of these is a deep research area of
current interest.

59
00:06:12,440 --> 00:06:20,277
And we'll get into them as we go beyond
looking to listening, learning,

60
00:06:20,277 --> 00:06:25,540
connecting, and predicting.
So see you next week.