1 00:00:00,000 --> 00:00:06,030 Let's now look at some structured data: an example of a database that stores 2 00:00:06,030 --> 00:00:11,982 songs, their lyrics, the albums that contain those songs, and the artists who 3 00:00:11,982 --> 00:00:16,944 may have sung them. This example is taken from a SIGMOD 4 00:00:16,944 --> 00:00:23,948 paper, which talks about searching relational databases, as opposed to using 5 00:00:23,948 --> 00:00:28,617 queries. Let's see what it would take in SQL to 6 00:00:28,617 --> 00:00:34,780 get the albums with "world" in their title. One might write a SQL query 7 00:00:34,780 --> 00:00:44,131 such as this, which is Oracle SQL. You select star from the album table, 8 00:00:44,131 --> 00:00:49,300 where the title contains the string "world". 9 00:00:50,060 --> 00:00:56,515 Things get a little more difficult if one now wants more information from 10 00:00:56,515 --> 00:01:00,878 this complex schema. So let's ask: how many SQL queries 11 00:01:00,878 --> 00:01:05,240 will it take to retrieve the names of each artist 12 00:01:05,240 --> 00:01:11,260 and the lyrics of every song in an album that has "world" in its title? 13 00:01:11,880 --> 00:01:16,891 Take a look at the schema and answer the question. 14 00:01:16,891 --> 00:01:24,208 Please avoid a complex join which joins every single table in the schema. 15 00:01:24,208 --> 00:01:32,526 You could do that, but we are trying to get at how many simple queries it will take. 16 00:01:32,526 --> 00:01:38,240 Here is how we could do this. We first retrieve the album 17 00:01:39,020 --> 00:01:46,412 from the album table. Then we have to traverse this table to find out all the 18 00:01:46,412 --> 00:01:51,500 songs in that album and their lyrics from this table. 19 00:01:52,380 --> 00:01:58,770 We also have to find out all the artists that composed this album, B1, 20 00:01:58,770 --> 00:02:05,253 from the artist album table and then retrieve the actual artist names. 
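[Editor's sketch] The steps just described can be illustrated roughly as follows. This is a minimal sketch, not the schema or queries shown on the lecture slide: the table and column names (`album`, `song`, `lyrics`, `artist`, `artist_album`), the sample rows, and the helper names `build_demo_db` and `artists_and_lyrics` are all assumptions made for illustration. It uses SQLite's `LIKE` in place of the Oracle text predicate mentioned in the lecture, and it allows one single-table join per step, so it issues three kinds of query: one for the album, one for songs and lyrics, and one for artist names.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical music schema and made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE album (album_id TEXT PRIMARY KEY, title TEXT);
        CREATE TABLE song (song_id TEXT PRIMARY KEY, album_id TEXT, title TEXT);
        CREATE TABLE lyrics (song_id TEXT, text TEXT);
        CREATE TABLE artist (artist_id TEXT PRIMARY KEY, name TEXT);
        CREATE TABLE artist_album (artist_id TEXT, album_id TEXT);

        INSERT INTO album VALUES ('B1', 'Hello World'), ('B2', 'Night Drive');
        INSERT INTO song VALUES ('S1', 'B1', 'Song A'), ('S2', 'B1', 'Song B'),
                                ('S3', 'B2', 'Song C');
        INSERT INTO lyrics VALUES ('S1', 'la la la'), ('S2', 'na na na'),
                                  ('S3', 'do re mi');
        INSERT INTO artist VALUES ('A1', 'Artist One'), ('A2', 'Artist Two'),
                                  ('A3', 'Artist Three');
        INSERT INTO artist_album VALUES ('A1', 'B1'), ('A2', 'B1'), ('A3', 'B2');
        """
    )
    return conn


def artists_and_lyrics(conn, word):
    """Return artists and song lyrics for every album whose title contains `word`."""
    # Query 1: find the matching albums (LIKE is case-insensitive for ASCII in SQLite).
    albums = conn.execute(
        "SELECT album_id, title FROM album WHERE title LIKE ?",
        (f"%{word}%",),
    ).fetchall()

    results = []
    for album_id, title in albums:
        # Query 2: songs in this album and their lyrics (one single-table join).
        songs = conn.execute(
            "SELECT s.title, l.text FROM song s "
            "JOIN lyrics l ON l.song_id = s.song_id "
            "WHERE s.album_id = ? ORDER BY s.song_id",
            (album_id,),
        ).fetchall()

        # Query 3: artist ids from the link table, resolved to names (one join).
        artist_rows = conn.execute(
            "SELECT a.name FROM artist_album aa "
            "JOIN artist a ON a.artist_id = aa.artist_id "
            "WHERE aa.album_id = ? ORDER BY a.name",
            (album_id,),
        ).fetchall()

        results.append(
            {
                "album": title,
                "artists": [name for (name,) in artist_rows],
                "songs": songs,
            }
        )
    return results


if __name__ == "__main__":
    for row in artists_and_lyrics(build_demo_db(), "world"):
        print(row)
```

Without the two joins, steps 2 and 3 would each split into two separate lookups, which is the extra cost the lecture refers to.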
21 00:02:05,253 --> 00:02:12,384 Each of these can be done with a single query if we allow a one-table traversal 22 00:02:12,384 --> 00:02:17,253 join. Otherwise each of these will require two 23 00:02:17,253 --> 00:02:22,292 separate queries. Quite complicated for doing something 24 00:02:22,292 --> 00:02:28,548 which is easy to do if one just had a Google-like search on this database. 25 00:02:28,548 --> 00:02:32,690 Unfortunately, that is quite difficult to achieve. 26 00:02:32,690 --> 00:02:39,199 Imagine if we had a search interface, so that we could issue a query like "off 27 00:02:39,199 --> 00:02:45,540 the world", since we didn't really remember the exact title of the album. 28 00:02:46,440 --> 00:02:54,689 The SQL approach would end up missing partial matches such as the album title 29 00:02:54,689 --> 00:03:03,545 "World" or the album title "Off the Wall". Next, the schema needs to be understood 30 00:03:03,545 --> 00:03:10,074 quite carefully in order to issue the multiple SQL queries needed to retrieve the 31 00:03:10,074 --> 00:03:17,640 information that we want. Sometimes a complex join might be needed. 32 00:03:19,820 --> 00:03:26,579 But there's even more. Suppose there were multiple databases, 33 00:03:26,579 --> 00:03:32,502 each with a different schema, and partial or duplicated data across 34 00:03:32,502 --> 00:03:37,541 these databases. And suppose the keys used in 35 00:03:37,541 --> 00:03:43,200 each database were different, with no relationship to each other. 36 00:03:44,420 --> 00:03:50,483 Most importantly, suppose we have some unstructured data in documents, like text 37 00:03:50,483 --> 00:03:54,858 files which contain the lyrics or biographies of artists, 38 00:03:54,858 --> 00:03:59,770 and other data in structured databases like the lyrics database. 39 00:03:59,770 --> 00:04:05,603 How do we search both of these together? 
When you [inaudible], when you search a 40 00:04:05,603 --> 00:04:11,589 set of documents and you get an album, can you find the songs in that album by 41 00:04:11,589 --> 00:04:15,120 looking at the lyrics database, and vice versa? 42 00:04:17,220 --> 00:04:23,713 The point I'm trying to make is that searching structured data well, in a query-like 43 00:04:23,713 --> 00:04:31,717 manner, remains a research problem. The fact that so much structured data is 44 00:04:31,717 --> 00:04:39,001 being accessed using applications is only because a lot of complicated programming 45 00:04:39,001 --> 00:04:42,949 using SQL goes into accessing that data. 46 00:04:42,949 --> 00:04:49,180 Let us conclude now by asking whether looking is the same as searching. 47 00:04:51,720 --> 00:04:58,047 When we look around a room, for example, and recognize objects and people doing 48 00:04:58,047 --> 00:05:02,260 activities, we're not really searching for anything. 49 00:05:03,020 --> 00:05:09,213 Or when we're browsing a bookshelf or flipping the pages of a book or even 50 00:05:09,213 --> 00:05:15,569 looking at some data, like some time series or a histogram or charts, to see if 51 00:05:15,569 --> 00:05:19,400 there might be any hidden patterns in the data. 52 00:05:20,260 --> 00:05:27,160 In the first case, while seeing, we are visualizing a scene. 53 00:05:27,780 --> 00:05:34,420 Computationally, this involves techniques such as clustering and classification. 54 00:05:35,460 --> 00:05:42,807 In the second and third examples, we're trying to get a feel for a document, a 55 00:05:42,807 --> 00:05:49,662 collection of documents, or some data, which requires techniques such as 56 00:05:49,662 --> 00:05:56,580 automatically summarizing documents, discovering the topics in documents, 57 00:05:57,000 --> 00:06:04,300 and discovering interesting correlations in data without direct intervention. 58 00:06:05,700 --> 00:06:12,060 Each of these is a deep research area of current interest. 
59 00:06:12,440 --> 00:06:20,277 And we'll get into them as we go beyond looking to listening, learning, 60 00:06:20,277 --> 00:06:25,540 connecting, and predicting. So see you next week.