1 00:00:01,280 --> 00:00:06,244 Now, given this RDF graph, what kind of queries, I mean, not necessarily an RDF graph, 2 00:00:06,244 --> 00:00:10,528 again you can come back to the previous setting of just looking at 3 00:00:10,528 --> 00:00:16,110 graph databases. What kind of queries do you have? 4 00:00:16,110 --> 00:00:18,450 What kind of query languages do you have? 5 00:00:18,450 --> 00:00:21,360 In the original logic database world, we had Datalog. 6 00:00:21,360 --> 00:00:27,912 Which, for most settings, looked like a relational, SQL kind of query, except 7 00:00:27,912 --> 00:00:35,004 for the recursive reasoning part. Which was in Datalog, and which was not at 8 00:00:35,004 --> 00:00:40,194 that time in SQL, okay? And this has generated, I don't know, 9 00:00:40,194 --> 00:00:44,795 probably 20 or 30 really top-class research papers. 10 00:00:44,795 --> 00:00:47,319 Okay? And this is just a very low 11 00:00:47,319 --> 00:00:52,690 underestimate of the papers that have come out. 12 00:00:52,690 --> 00:00:58,085 And there is a huge amount of work done on how to do recursive reasoning in Datalog. 13 00:00:58,085 --> 00:01:00,092 Okay? And after that, again in the XPath setting, 14 00:01:00,092 --> 00:01:04,118 with the XML wave I was talking about, which came into the database world. 15 00:01:04,118 --> 00:01:08,343 XPath also resulted in huge numbers of papers, again for exactly the same kind 16 00:01:08,343 --> 00:01:12,182 of problems. Like take the first query, 17 00:01:12,182 --> 00:01:17,468 wikimedia//editions. So what you're saying is, start from the 18 00:01:17,468 --> 00:01:22,714 root of wikimedia and look at any reachable node from wikimedia, and look 19 00:01:22,714 --> 00:01:28,920 for those nodes which have editions as their type. 20 00:01:30,200 --> 00:01:34,826 So look at all of them and return those.
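The descendant step in wikimedia//editions can be tried out on a toy document. This is only a sketch using Python's standard xml.etree.ElementTree, which supports a limited XPath subset; the document below and every element or attribute name other than wikimedia and editions are made up for illustration and do not come from the lecture slides.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: only "wikimedia" and "editions" are named in the
# lecture; "projects", "project", "edition", "name" and "language" are
# illustrative assumptions.
DOC = """
<wikimedia>
  <projects>
    <project name="Wikipedia">
      <editions>
        <edition language="English"/>
        <edition language="German"/>
      </editions>
    </project>
    <project name="Wiktionary">
      <editions>
        <edition language="French"/>
      </editions>
    </project>
  </projects>
</wikimedia>
"""

root = ET.fromstring(DOC)  # the root element is <wikimedia>

# ".//editions" selects every <editions> element reachable from the root
# at any depth, which mirrors the wikimedia//editions query.
editions_nodes = root.findall(".//editions")

# An extra constraint on a property of a node along the path: only
# editions that sit under the project whose name attribute is "Wikipedia".
wikipedia_editions = root.findall(".//project[@name='Wikipedia']/editions")

print(len(editions_nodes), len(wikipedia_editions))
```

The second query shows the kind of additional constraint mentioned in the lecture: the path is spelled out, and a name property of a node on that path is fixed.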
So essentially this XPath returns all 21 00:01:34,826 --> 00:01:41,072 those paths which start from wikimedia and end with editions, okay. 22 00:01:41,072 --> 00:01:44,740 Editions is something which you have on Wikipedia. 23 00:01:44,740 --> 00:01:47,650 For example, different editions of the same page, right? 24 00:01:47,650 --> 00:01:51,476 So that's one annotation. That's a type that you can add to the 25 00:01:51,476 --> 00:01:54,610 node. And you can actually have more 26 00:01:54,610 --> 00:01:58,394 constraints on it. You can specify the path, and you can say 27 00:01:58,394 --> 00:02:05,350 I want a specific name property of that node, which is down there, as well. 28 00:02:05,350 --> 00:02:10,564 So these are more general database-type queries. Once the graph databases, the 29 00:02:10,564 --> 00:02:18,003 real graph databases, came into being, there was an API called Blueprints. 30 00:02:18,003 --> 00:02:23,059 And on top of it, a language, Gremlin, which is quite popular now for many graph 31 00:02:23,059 --> 00:02:27,528 databases. So you can almost look at it 32 00:02:27,528 --> 00:02:32,948 as a JDBC for graph traversals, okay? So you pretty much have reachability 33 00:02:32,948 --> 00:02:36,370 queries that you can ask. You can ask for paths, you can ask for 34 00:02:36,370 --> 00:02:41,630 graph structure queries. All of them have the same kind of JDBC style. 35 00:02:41,630 --> 00:02:44,985 Okay, you have a cursor, you can get the next one, without really 36 00:02:44,985 --> 00:02:50,100 materializing. So, all these are supported in the Gremlin 37 00:02:50,100 --> 00:02:56,127 interface. But one disadvantage of Gremlin has 38 00:02:56,127 --> 00:03:01,521 always been that, if you want to merge relational queries with the 39 00:03:01,521 --> 00:03:07,310 graph traversals, you needed some hacking. 40 00:03:07,310 --> 00:03:09,470 Right?
You need to write your own programs for 41 00:03:09,470 --> 00:03:13,020 this. So you essentially think of this as, you 42 00:03:13,020 --> 00:03:19,120 want to add graph traversal on top of your standard SQL. 43 00:03:19,120 --> 00:03:23,301 It requires some extra effort, right? You need to put in different JDBC 44 00:03:23,301 --> 00:03:27,076 statements. Similarly, if you want to move from 45 00:03:27,076 --> 00:03:31,834 Gremlin into some kind of SQL-like query, then you need to write this extra 46 00:03:31,834 --> 00:03:36,541 program. Which was not all that appreciated by the 47 00:03:36,541 --> 00:03:40,878 SPARQL world, which is mainly for RDF data. 48 00:03:40,878 --> 00:03:43,660 Right? So there, they said, okay. 49 00:03:43,660 --> 00:03:48,149 Most of the queries that we focus on in SPARQL... which has a recursive 50 00:03:48,149 --> 00:03:54,680 definition, we will not get into that. Okay, SPARQL is a query language. 51 00:03:54,680 --> 00:03:57,705 Right, that's basically what SPARQL is, a query language. 52 00:03:57,705 --> 00:04:03,980 So the focus there was, now, given that there's this RDF graph, 53 00:04:03,980 --> 00:04:07,300 most of my queries are going to be query by patterns. 54 00:04:07,300 --> 00:04:11,899 So I give you a pattern of the graph that I'm interested in, the subgraph that I'm 55 00:04:11,899 --> 00:04:17,625 interested in, okay? And retrieve all instances of this 56 00:04:17,625 --> 00:04:24,040 sub-pattern in my data set. So this could be extended to have 57 00:04:24,040 --> 00:04:27,940 templates of patterns, as in, you can have variables, saying okay, I'm leaving 58 00:04:27,940 --> 00:04:33,519 some things unbound. Okay, that's one thing that SPARQL, at 59 00:04:33,519 --> 00:04:39,659 least version 1.0, focused on. And now, there are extensions which are 60 00:04:39,659 --> 00:04:44,515 trying to move toward graph traversal support as well.
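Query by pattern with unbound variables can be sketched in a few lines of plain Python. This is a naive illustration of the idea, not how a SPARQL engine is implemented: the triples, the predicate names (type, actedIn, bornIn) and the helper functions is_var, match and solve are all hypothetical.

```python
# Toy data set of (subject, predicate, object) triples. All identifiers
# here are illustrative assumptions, not taken from the lecture slides.
TRIPLES = [
    ("Bruce_Willis", "type", "ActionHero"),
    ("Bruce_Willis", "actedIn", "The_Expendables"),
    ("Bruce_Willis", "actedIn", "Die_Hard"),
    ("Dolph_Lundgren", "actedIn", "The_Expendables"),
    ("Dolph_Lundgren", "bornIn", "Stockholm"),
]


def is_var(term):
    """Terms starting with '?' are variables, i.e. left unbound in the pattern."""
    return term.startswith("?")


def match(pattern, triple, binding):
    """Try to extend `binding` so that `pattern` equals `triple`.

    Returns the extended binding, or None if the triple does not fit.
    """
    new_binding = dict(binding)
    for p, t in zip(pattern, triple):
        if is_var(p):
            # Bind the variable, or check it against its earlier value.
            if new_binding.setdefault(p, t) != t:
                return None
        elif p != t:
            return None
    return new_binding


def solve(patterns, triples):
    """Return every variable binding that satisfies all triple patterns."""
    bindings = [{}]
    for pattern in patterns:
        extended = []
        for binding in bindings:
            for triple in triples:
                candidate = match(pattern, triple, binding)
                if candidate is not None:
                    extended.append(candidate)
        bindings = extended
    return bindings


# Pattern: an action hero and every movie that person acted in.
query = [
    ("?actor", "type", "ActionHero"),
    ("?actor", "actedIn", "?movie"),
]

results = solve(query, TRIPLES)
for row in results:
    print(row["?actor"], row["?movie"])
```

Each pattern either binds a variable or checks it against an earlier binding, so a variable shared between two patterns acts like a join condition between them.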
61 00:04:44,515 --> 00:04:49,230 So, actually providing the paths and trying to find reachability and so on. 62 00:04:49,230 --> 00:04:54,040 That, hopefully, is going to come through in SPARQL 1.1 and 1.2. 63 00:04:54,040 --> 00:04:57,117 People are working on this. So what does a SPARQL query look like? If 64 00:04:57,117 --> 00:05:01,807 you go back to the original Dolph Lundgren and Bruce Willis graph, you can ask a 65 00:05:01,807 --> 00:05:08,490 query like this: SELECT ?name ?movie. 66 00:05:09,540 --> 00:05:13,860 Which means I want the actor's name and the movies, where certain 67 00:05:13,860 --> 00:05:18,016 conditions hold. These conditions can be seen as 68 00:05:18,016 --> 00:05:22,310 subgraph conditions, right? So you want this guy to be the action hero 69 00:05:22,310 --> 00:05:27,286 that you're interested in. And he should have acted in a movie, 70 00:05:27,286 --> 00:05:32,010 so this movie is what you're looking for. 71 00:05:32,010 --> 00:05:38,070 And the name is what you're looking for. The constraint that you're adding is that 72 00:05:38,070 --> 00:05:42,606 this person you're looking at, ?name, should have worked with 73 00:05:42,606 --> 00:05:48,820 someone who was born in Stockholm. So if you look at the RDF graph that we 74 00:05:48,820 --> 00:05:51,025 explained, clearly all you get is Bruce Willis for this, Bruce Willis and 75 00:05:51,025 --> 00:05:53,970 all his movies, not just The Expendables, all the movies. 76 00:05:53,970 --> 00:05:56,970 That's basically what you're getting out of this. 77 00:05:56,970 --> 00:06:09,293 So these are the query languages, yeah? >> All these movies, provided 78 00:06:09,293 --> 00:06:15,240 the last two conditions were also satisfied? >> No, the last two conditions are largely 79 00:06:15,240 --> 00:06:20,132 on just a name, so I'm not providing.
>> No, I mean, is it a must that 80 00:06:20,132 --> 00:06:25,141 he would have to work with a person and the person has to be born in Stockholm? 81 00:06:25,141 --> 00:06:27,411 >> Yes. >> So, only if a set of people are 82 00:06:27,411 --> 00:06:32,933 there who have worked with Bruce Willis and were born in Stockholm? 83 00:06:32,933 --> 00:06:36,609 Only those conditions... >> Exactly. 84 00:06:36,609 --> 00:06:39,560 Exactly. So that's why I said in the previous 85 00:06:39,560 --> 00:06:42,661 example it was only Bruce Willis who was there. 86 00:06:42,661 --> 00:06:46,083 But you can have, I mean, if anybody has seen The Expendables, you 87 00:06:46,083 --> 00:06:50,740 know there is a huge list of people that will come out of this. 88 00:06:50,740 --> 00:06:53,958 Right? Pretty much Sylvester Stallone and Arnold 89 00:06:53,958 --> 00:06:56,298 Schwarzenegger. Every one of them. 90 00:06:56,298 --> 00:06:59,450 >> Who satisfies all these conditions? >> Who satisfies all these conditions. 91 00:06:59,450 --> 00:07:03,609 So you can find all the movies and their names. 92 00:07:03,609 --> 00:07:06,835 >> [INAUDIBLE]. >> Not significantly different, 93 00:07:06,835 --> 00:07:10,710 right? So at least when SPARQL started, it 94 00:07:10,710 --> 00:07:16,499 seemed very much like a SQL query. All right. 95 00:07:16,499 --> 00:07:21,374 But the point is that, if you look at it from the SQL angle, this is 96 00:07:21,374 --> 00:07:27,960 a bunch of joins, a huge bunch of joins. And SQL tries to avoid, or at least 97 00:07:27,960 --> 00:07:33,000 whenever you're in the world of relational databases, you want to minimize the 98 00:07:33,000 --> 00:07:36,190 joins. Right? 99 00:07:36,190 --> 00:07:42,340 So you come up with strategies for materializing these and reusing them.
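To see why the action-hero query is a bunch of joins from the SQL angle, here is a sketch that assumes the graph is kept in one three-column table and runs the query with Python's built-in sqlite3. The table layout, the predicate names (type, name, actedIn, workedWith, bornIn) and the sample rows are assumptions made for illustration, not the lecture's actual schema or data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Assumed layout: one row per edge, with the predicate stored as data.
conn.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        ("Bruce_Willis", "type", "ActionHero"),
        ("Bruce_Willis", "name", "Bruce Willis"),
        ("Bruce_Willis", "actedIn", "The_Expendables"),
        ("Bruce_Willis", "actedIn", "Die_Hard"),
        ("Bruce_Willis", "workedWith", "Dolph_Lundgren"),
        ("Dolph_Lundgren", "name", "Dolph Lundgren"),
        ("Dolph_Lundgren", "actedIn", "The_Expendables"),
        ("Dolph_Lundgren", "bornIn", "Stockholm"),
    ],
)

# Every triple pattern of the query becomes one more self-join on the
# same table: five patterns, hence four joins.
SQL = """
SELECT n.o AS name, m.o AS movie
FROM triples AS t
JOIN triples AS n ON n.s = t.s AND n.p = 'name'
JOIN triples AS m ON m.s = t.s AND m.p = 'actedIn'
JOIN triples AS w ON w.s = t.s AND w.p = 'workedWith'
JOIN triples AS b ON b.s = w.o AND b.p = 'bornIn' AND b.o = 'Stockholm'
WHERE t.p = 'type' AND t.o = 'ActionHero'
ORDER BY movie
"""

rows = conn.execute(SQL).fetchall()
for name, movie in rows:
    print(name, movie)
```

With this sample data the only person who is an action hero and has worked with someone born in Stockholm is Bruce Willis, so the result lists all of his movies, not only the one he shares with that co-star.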
100 00:07:42,340 --> 00:07:46,488 Similar ideas can be applied here, but the key difference is the kind of 101 00:07:46,488 --> 00:07:51,209 predicates that you have. They could be huge in number, which is 102 00:07:51,209 --> 00:07:54,435 again not the case in relational databases. 103 00:07:54,435 --> 00:07:57,679 Right? You typically have, how many tables are 104 00:07:57,679 --> 00:08:01,540 there in your database? Not more than probably 200 in the extreme 105 00:08:01,540 --> 00:08:03,588 setting. Okay. 106 00:08:03,588 --> 00:08:05,540 >> This, this looks like a SQL query. We're suppose that. 107 00:08:05,540 --> 00:08:08,580 >> This looks like. >> You take the, particularly 108 00:08:08,580 --> 00:08:10,871 [INAUDIBLE]. >> Yes. 109 00:08:10,871 --> 00:08:14,968 >> Then that's not there in SQL. >> That's not there in SQL, but you can 110 00:08:14,968 --> 00:08:18,748 always turn it around and say, okay, I will also model predicates as another 111 00:08:18,748 --> 00:08:22,452 column. All right? 112 00:08:22,452 --> 00:08:25,196 >> You have to have a database which stores all possible predicates. 113 00:08:25,196 --> 00:08:27,184 >> Predicates as well. >> You have that in a separate table. 114 00:08:27,184 --> 00:08:29,139 >> Yes. >> And then query that. 115 00:08:29,139 --> 00:08:30,320 >> And then get back. >> Yeah. 116 00:08:30,320 --> 00:08:32,598 >> So there are some efforts for that kind of thing also. 117 00:08:32,598 --> 00:08:36,194 So you have this metadata, you can query the 118 00:08:36,194 --> 00:08:39,780 metadata, get the table names and then query. 119 00:08:39,780 --> 00:08:44,299 So, you can do that also, but these are never the preferred model in relational databases. 120 00:08:44,299 --> 00:08:46,536 >> [CROSSTALK]. >> Yeah, so you're breaking the 121 00:08:46,536 --> 00:08:51,497 relational ideas there. >> [INAUDIBLE]. 122 00:08:51,497 --> 00:08:52,823 >> Yes. >> All of that data?
123 00:08:52,823 --> 00:08:57,060 >> You are making a universal database and then firing queries on it. 124 00:08:57,060 --> 00:08:58,690 Right? That's one way of looking at it from a 125 00:08:58,690 --> 00:09:02,120 relational point of view. You don't do all these normalizations. 126 00:09:02,120 --> 00:09:08,923 You don't do anything. You just make it one big huge universal relation. 127 00:09:08,923 --> 00:09:11,894 >> [INAUDIBLE]. >> As you state, 128 00:09:11,894 --> 00:09:14,982 [INAUDIBLE]. 129 00:09:14,982 --> 00:09:16,494 >> Yes. >> [INAUDIBLE]. 130 00:09:16,494 --> 00:09:28,571 >> Yeah. Everything falls apart. 131 00:09:28,571 --> 00:09:30,586 >> [INAUDIBLE]. >> Okay. 132 00:09:30,586 --> 00:09:30,980 >> [INAUDIBLE]. >> Yes. 133 00:09:30,980 --> 00:09:35,037 >> Do you [INAUDIBLE]. I think you want to get to that, right? 134 00:09:35,037 --> 00:09:37,416 [INAUDIBLE] >> Yes, so that's the meat of 135 00:09:37,416 --> 00:09:41,383 the challenge, right? So whenever you look 136 00:09:41,383 --> 00:09:46,540 at these query languages, they don't talk about how they have to be evaluated. 137 00:09:46,540 --> 00:09:49,888 So at this point I have no clue how this SPARQL query has to be evaluated, and I don't 138 00:09:49,888 --> 00:09:53,060 care either. These are declarative, right? 139 00:09:53,060 --> 00:09:57,229 I don't really care about it. But you really have to worry about these 140 00:09:57,229 --> 00:10:02,002 kinds of issues, like, should I go for a universal relation? 141 00:10:02,002 --> 00:10:06,538 Which means I have to deal with null values, storage issues, minimizing the 142 00:10:06,538 --> 00:10:12,447 search space, a whole bunch of things, or should I do some other trick? 143 00:10:12,447 --> 00:10:17,260 Should I normalize only in certain cases, not normalize it, right?
144 00:10:17,260 --> 00:10:24,748 These are the challenges which you will find if you try directly translating 145 00:10:24,748 --> 00:10:32,081 these graph-like queries into the relational setting. 146 00:10:32,081 --> 00:10:34,678 >> [INAUDIBLE]. >> Yes. 147 00:10:34,678 --> 00:10:37,062 >> [INAUDIBLE]. >> Yes. 148 00:10:37,062 --> 00:10:40,955 >> [INAUDIBLE]. >> I mean, see, all these normal forms 149 00:10:40,955 --> 00:10:46,523 help not only the efficiency but also to keep the consistency in 150 00:10:46,523 --> 00:10:53,100 some sense, right? The same requirements hold here also. 151 00:10:53,100 --> 00:10:58,527 In fact, as we will see in a couple of slides, one extreme way, which you 152 00:10:58,527 --> 00:11:05,648 might already have seen in this example RDF graph that I was looking at. 153 00:11:05,648 --> 00:11:09,080 You can look at this part, the triple part. 154 00:11:09,080 --> 00:11:12,172 So, I have Bruce Willis, born in Idar-Oberstein. 155 00:11:12,172 --> 00:11:14,160 The edge can be my table. Just a triple pattern table. 156 00:11:14,160 --> 00:11:21,115 So, if you go back to your normal form setting, it's almost like 157 00:11:21,115 --> 00:11:24,786 BCNF. Right? 158 00:11:24,786 --> 00:11:30,395 Where you have just key and value, nothing else. 159 00:11:30,395 --> 00:11:35,353 And the predicate is encoded in the table name in the left-hand setting, but now 160 00:11:35,353 --> 00:11:40,834 you are explicitly storing it. That's, right? 161 00:11:40,834 --> 00:11:45,751 That's basically the way in which graphs can be stored. 162 00:11:45,751 --> 00:11:47,561 >> The edge list. >> Which list? 163 00:11:47,561 --> 00:11:50,208 >> Yeah, that list with the [INAUDIBLE]. 164 00:11:50,208 --> 00:11:53,829 >> Exactly. >> [INAUDIBLE]. 165 00:11:53,829 --> 00:11:59,764 >> Yeah. >> On the SPARQL queries.
166 00:11:59,764 --> 00:12:03,922 What kind of queries are better suited for SPARQL, the ones which are deeper, 167 00:12:03,922 --> 00:12:09,707 in the sense that you have a related to b and then c and then d? 168 00:12:09,707 --> 00:12:10,916 Or is it when b, c, d are all connected to a directly? 169 00:12:10,916 --> 00:12:12,396 Which kinds of queries are better suited for SQL? 170 00:12:12,396 --> 00:12:14,275 >> So that depends entirely on the application. 171 00:12:14,275 --> 00:12:22,163 So let's not worry about SPARQL, let's worry about SQL, okay? 172 00:12:22,163 --> 00:12:29,953 So, what kind of queries are better suited for SQL? 173 00:12:29,953 --> 00:12:37,504 >> Not SQL, SQL is only one side of it. >> But that's only because your 174 00:12:37,504 --> 00:12:42,950 performance is weak. Suppose I go for main-memory 175 00:12:42,950 --> 00:12:47,661 databases which support SQL. Probably deeper, whatever, it's perfectly 176 00:12:47,661 --> 00:12:52,770 fine, right? So in the same setting, in the same way. 177 00:12:52,770 --> 00:12:57,450 You cannot ask the question of whether SPARQL is suited for a kind of queries. 178 00:12:57,450 --> 00:13:00,360 Rather, your database which really implements SPARQL, 179 00:13:00,360 --> 00:13:04,120 is it better suited for this? So in terms of power of 180 00:13:04,120 --> 00:13:08,377 the language, SPARQL is no more powerful than SQL. 181 00:13:08,377 --> 00:13:12,025 I mean, SQL is already Turing complete, so you cannot get anything more 182 00:13:12,025 --> 00:13:17,120 powerful than that. So once you have that, SPARQL is no more 183 00:13:17,120 --> 00:13:20,646 powerful. So the reason why you came up with a new 184 00:13:20,646 --> 00:13:26,475 language rather than just reusing SQL is the ease of use and the way you think. 185 00:13:26,475 --> 00:13:29,270 Right?
You go to RDF and you look at 186 00:13:29,270 --> 00:13:33,953 it from the relational setting, so the way in which people think is different. 187 00:13:33,953 --> 00:13:38,771 So, in XML, people thought in trees. While in relational tables, they looked 188 00:13:38,771 --> 00:13:44,804 at it in table format. While in SPARQL, that is, in the RDF 189 00:13:44,804 --> 00:13:54,834 world, people always look at it as graphs. So, it's just the ease of use, not which 190 00:13:54,834 --> 00:13:59,270 is more powerful. Everything is equally powerful. 191 00:13:59,270 --> 00:14:03,480 It just simplifies your life, right?