1
00:00:01,280 --> 00:00:06,244
Now, given this RDF graph, what kind of 
queries I mean not necessarily RDF graph, 

2
00:00:06,244 --> 00:00:10,528
again you can look at it, come back to 
the previous setting of just looking at 

3
00:00:10,528 --> 00:00:16,110
graph databases. 
What kind of queries do you have? 

4
00:00:16,110 --> 00:00:18,450
What kind of query languages that you 
have? 

5
00:00:18,450 --> 00:00:21,360
In the original logic database world, we 
had Datalog. 

6
00:00:21,360 --> 00:00:27,912
Which for most settings, it seemed like a 
relational related SQL kind of query, but 

7
00:00:27,912 --> 00:00:35,004
for the recursive reasoning part. 
Which was in Datalog, which was not at 

8
00:00:35,004 --> 00:00:40,194
that time in SQL, okay? 
And this has generated, I don't know, 

9
00:00:40,194 --> 00:00:44,795
probably 20 or 30 really top class 
research papers. 

10
00:00:44,795 --> 00:00:47,319
Okay? 
And this is just a very low, 

11
00:00:47,319 --> 00:00:52,690
underestimate of the papers that have 
come out. 

12
00:00:52,690 --> 00:00:58,085
And there is a huge work done on how to 
do recursive reasoning in Datalog.
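The recursive reasoning mentioned here can be sketched as a naive fixpoint over two Datalog-style rules. This is only an illustration: the edge facts and the function name `reachable` are made up for the example and are not from the lecture.

```python
# Hypothetical edge(X, Y) facts; not taken from the lecture's graph.
edges = {("a", "b"), ("b", "c"), ("c", "d")}


def reachable(edges):
    """Naive fixpoint evaluation of the rules
         reach(X, Y) :- edge(X, Y).
         reach(X, Y) :- reach(X, Z), edge(Z, Y).
    """
    reach = set(edges)  # first rule: every edge is a reach fact
    while True:
        # second rule: extend each known reach fact by one more edge
        derived = {(x, y) for (x, z) in reach for (z2, y) in edges if z == z2}
        if derived <= reach:  # nothing new was derived, fixpoint reached
            return reach
        reach |= derived


print(sorted(reachable(edges)))
```

Non-recursive SQL of that time could not express this query, because the number of joins needed depends on the data.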

13
00:00:58,085 --> 00:01:00,092
Okay? 
And after that again in XPath setting, 

14
00:01:00,092 --> 00:01:04,118
which I was talking about XML wave, which 
came into the database world. 

15
00:01:04,118 --> 00:01:08,343
XPath also resulted in huge numbers of 
papers, again for exactly the same kind 

16
00:01:08,343 --> 00:01:12,182
of problems. 
Like take the first query, 

17
00:01:12,182 --> 00:01:17,468
wikimedia//editions. 
So what you're saying is, start from the 

18
00:01:17,468 --> 00:01:22,714
root of wikimedia and look at any 
reachable node from wikimedia, and look

19
00:01:22,714 --> 00:01:28,920
for those nodes which have editions as
their type. 

20
00:01:30,200 --> 00:01:34,826
So look at all of them and return those. 
So essentially this xpath returns all 

21
00:01:34,826 --> 00:01:41,072
those paths which start from wikipedia, 
wikimedia and end with editions, okay. 

22
00:01:41,072 --> 00:01:44,740
Editions is something which you have on 
Wikipedia. 

23
00:01:44,740 --> 00:01:47,650
For example, different editions of the 
same page, right? 

24
00:01:47,650 --> 00:01:51,476
So that's one annotation.
That's a type that you can add to the 

25
00:01:51,476 --> 00:01:54,610
node.
And you can actually have more 

26
00:01:54,610 --> 00:01:58,394
constraints on it. 
You can specify the path, and you can say 

27
00:01:58,394 --> 00:02:05,350
I want a specific name property, of that 
node, which is down there, that's also. 
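A descendant query of the form wikimedia//editions, and a variant with an extra name constraint, can be tried with Python's standard library. This is a sketch: the XML document is invented for illustration, and ElementTree implements only a subset of XPath.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; the element and attribute names are assumptions.
doc = ET.fromstring("""
<wikimedia>
  <project name="Wikipedia">
    <editions>
      <edition language="en"/>
    </editions>
  </project>
  <project name="Wiktionary">
    <editions/>
  </project>
</wikimedia>
""")

# wikimedia//editions: every editions node reachable from the root.
all_editions = doc.findall(".//editions")

# Same path, with a constraint on the name property of an intermediate node.
wikipedia_editions = doc.findall(".//project[@name='Wikipedia']/editions")

print(len(all_editions), len(wikipedia_editions))
```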

28
00:02:05,350 --> 00:02:10,564
So these are more general database type 
queries, once the graph databases, the 

29
00:02:10,564 --> 00:02:18,003
real graph databases came into being, 
there was an API called Blueprints.

30
00:02:18,003 --> 00:02:23,059
And on top of it, a language Gremlin, 
which is quite popular now for many graph 

31
00:02:23,059 --> 00:02:27,528
databases. 
So, which is, almost, you can look at it 

32
00:02:27,528 --> 00:02:32,948
as a JDBC for graph traversals, okay? 
So you pretty much have reachability

33
00:02:32,948 --> 00:02:36,370
queries that you can ask. 
You can ask for paths, you can ask for

34
00:02:36,370 --> 00:02:41,630
graph structure queries. 
All of them have same kind of JDBC style. 

35
00:02:41,630 --> 00:02:44,985
Okay, you have a cursor, you can get the 
next one, if you talk really 

36
00:02:44,985 --> 00:02:50,100
materializing. 
So, all these are supported in Gremlin 

37
00:02:50,100 --> 00:02:56,127
interface. 
But, one disadvantage of Gremlin has 

38
00:02:56,127 --> 00:03:01,521
always been that, if you want to merge 
the relational queries along with the 

39
00:03:01,521 --> 00:03:07,310
graph traversals, you needed some 
hacking. 

40
00:03:07,310 --> 00:03:09,470
Right? 
You need to write your own programs for 

41
00:03:09,470 --> 00:03:13,020
this. 
So you essentially think of this as you 

42
00:03:13,020 --> 00:03:19,120
want to add graph traversal on top of 
your standard SQL.

43
00:03:19,120 --> 00:03:23,301
It requires some extra effort, right? 
You need to put in different JDBC 

44
00:03:23,301 --> 00:03:27,076
statements. 
Similarly, if you want to move from 

45
00:03:27,076 --> 00:03:31,834
Gremlin into some kind of SQL-like
query, then you need to do this extra 

46
00:03:31,834 --> 00:03:36,541
program. 
Which was not all that appreciated by 

47
00:03:36,541 --> 00:03:40,878
SPARQL world, which is mainly for RDF
data. 

48
00:03:40,878 --> 00:03:43,660
Right? 
So there, they said, okay. 

49
00:03:43,660 --> 00:03:48,149
Most of the queries that we focus on in 
SPARQL, which is, which has a recursive 

50
00:03:48,149 --> 00:03:54,680
definition, we will not, get into that. 
Okay, SPARQL is a query language. 

51
00:03:54,680 --> 00:03:57,705
Right, that's basically what SPARQL query 
language. 

52
00:03:57,705 --> 00:04:03,980
So the focus there was, now given that 
there's an RDF graph.

53
00:04:03,980 --> 00:04:07,300
Most of my queries are going to be query 
by patterns. 

54
00:04:07,300 --> 00:04:11,899
So I give you a pattern of the graph that 
I'm interested in, subgraph that I'm 

55
00:04:11,899 --> 00:04:17,625
interested in, okay? 
And retrieve all instances of this 

56
00:04:17,625 --> 00:04:24,040
sub-pattern in my data set. 
So this could be extended to have 

57
00:04:24,040 --> 00:04:27,940
templates of patterns, as in, you can 
have variables saying okay, I'm leaving 

58
00:04:27,940 --> 00:04:33,519
some things unbound.
Okay, that's one thing that SPARQL, at 

59
00:04:33,519 --> 00:04:39,659
least the version 1.0 focused on. 
And now, there are extensions which are 

60
00:04:39,659 --> 00:04:44,515
trying to move toward the graph traversal 
support as well. 

61
00:04:44,515 --> 00:04:49,230
So, actually providing the paths and 
trying to find reachability and so on.

62
00:04:49,230 --> 00:04:54,040
That's hopefully, it's going to come 
through in SPARQL 1.1 and 1.2. 

63
00:04:54,040 --> 00:04:57,117
People are working on this. 
So what does a SPARQL query look like? If

64
00:04:57,117 --> 00:05:01,807
you go back to the original Dolph Lundgren
and Bruce Willis graph, you can ask for a 

65
00:05:01,807 --> 00:05:08,490
query like this: select ?name
?movie.

66
00:05:09,540 --> 00:05:13,860
So, which means I want the actor's name 
and the movies where certain

67
00:05:13,860 --> 00:05:18,016
conditions hold.
These conditions can be seen as a 

68
00:05:18,016 --> 00:05:22,310
subgraph conditions, right? 
So you want this guy to be the action hero

69
00:05:22,310 --> 00:05:27,286
that you're interested in. 
And he should have acted in a movie and 

70
00:05:27,286 --> 00:05:32,010
it should, so this movie is what you're 
looking for. 

71
00:05:32,010 --> 00:05:38,070
And the name that you're looking for. 
The constraint that you're adding is that

72
00:05:38,070 --> 00:05:42,606
this person you're looking at, ?name,
should have worked with

73
00:05:42,606 --> 00:05:48,820
someone who was born in Stockholm. 
So if you look at the RDF graph that we 

74
00:05:48,820 --> 00:05:51,025
explained, so clearly all you get is 
Bruce Willis for this, Bruce Willis and 

75
00:05:51,025 --> 00:05:53,970
all his movies, not just the Expendables, 
all the movies. 

76
00:05:53,970 --> 00:05:56,970
That's basically what you're getting out 
of this. 
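The query just described can be mimicked with a small triple-pattern matcher. This is a sketch under stated assumptions: the identifiers, the predicate names (type, name, actedIn, bornIn) and the Die_Hard triple are invented for illustration, and the real lecture graph may differ.

```python
# Hypothetical triples standing in for the lecture's example graph.
triples = {
    ("Bruce_Willis", "type", "ActionHero"),
    ("Bruce_Willis", "name", "Bruce Willis"),
    ("Bruce_Willis", "actedIn", "The_Expendables"),
    ("Bruce_Willis", "actedIn", "Die_Hard"),
    ("Dolph_Lundgren", "actedIn", "The_Expendables"),
    ("Dolph_Lundgren", "bornIn", "Stockholm"),
}


def match(patterns, binding=None):
    """Yield every variable binding that satisfies all triple patterns."""
    binding = binding or {}
    if not patterns:
        yield binding
        return
    first, rest = patterns[0], patterns[1:]
    for triple in triples:
        new = dict(binding)
        ok = True
        for term, value in zip(first, triple):
            if term.startswith("?"):
                # a variable binds once and must agree with earlier bindings
                if new.setdefault(term, value) != value:
                    ok = False
                    break
            elif term != value:  # a constant must match exactly
                ok = False
                break
        if ok:
            yield from match(rest, new)


# Roughly: SELECT ?name ?movie WHERE { ...the six patterns below... }
query = [
    ("?actor", "type", "ActionHero"),
    ("?actor", "name", "?name"),
    ("?actor", "actedIn", "?movie"),
    ("?actor", "actedIn", "?shared"),
    ("?costar", "actedIn", "?shared"),
    ("?costar", "bornIn", "Stockholm"),
]
results = sorted({(b["?name"], b["?movie"]) for b in match(query)})
print(results)
```

Note that ?movie and ?shared are separate variables, which is why every movie of the matching actor comes back, not only the one shared with the co-star.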

77
00:05:56,970 --> 00:06:09,293
So these are the query languages, yeah? 
 >> All these movies provided for, the 

78
00:06:09,293 --> 00:06:15,240
last two conditions were also satisfied? 
 >> No, last two conditions are largely

79
00:06:15,240 --> 00:06:20,132
on just a name, so I'm not providing. 
 >> No, I mean, Is it a must that, that 

80
00:06:20,132 --> 00:06:25,141
he would have to work with a person and 
the person has to be born in Stockholm? 

81
00:06:25,141 --> 00:06:27,411
 >> Yes. 
 >> So, only if a set of people are 

82
00:06:27,411 --> 00:06:32,933
there and you have worked with Bruce 
Willis and was born in Stockholm? 

83
00:06:32,933 --> 00:06:36,609
Only those conditions... 
 >> Exactly. 

84
00:06:36,609 --> 00:06:39,560
Exactly. 
So in, that's why I said in a previous 

85
00:06:39,560 --> 00:06:42,661
example it was only Bruce Willis was 
there. 

86
00:06:42,661 --> 00:06:46,083
But you can have, I mean if you, if 
anybody has seen the Expendables, you 

87
00:06:46,083 --> 00:06:50,740
know there is a huge list of people that 
will come out of this. 

88
00:06:50,740 --> 00:06:53,958
Right? 
Pretty much Sylvester Stallone and Arnold

89
00:06:53,958 --> 00:06:56,298
Schwarzenegger. 
Every one of them. 

90
00:06:56,298 --> 00:06:59,450
 >> Who satisfies all these conditions? 
 >> Who satisfies all these conditions? 

91
00:06:59,450 --> 00:07:03,609
So you can find all the movies and their 
names. 

92
00:07:03,609 --> 00:07:06,835
 >> [INAUDIBLE]. 
 >> Not as significantly different, 

93
00:07:06,835 --> 00:07:10,710
right? 
So at least when SPARQL started, it 

94
00:07:10,710 --> 00:07:16,499
seemed very much like a SQL query.
All right. 

95
00:07:16,499 --> 00:07:21,374
But the point is that, in SQL, if you
look at it from the SQL angle, this is

96
00:07:21,374 --> 00:07:27,960
a bunch of joins, a huge bunch of joins.
And SQL tries to avoid, or at least

97
00:07:27,960 --> 00:07:33,000
whenever you're in the world of relational
databases, you want to minimize the 

98
00:07:33,000 --> 00:07:36,190
joins.
Right? 

99
00:07:36,190 --> 00:07:42,340
So you come up with strategies for 
materializing these and reuse this. 

100
00:07:42,340 --> 00:07:46,488
Similar ideas can be applied here, but 
the key differences are the kind of 

101
00:07:46,488 --> 00:07:51,209
predicates that you have. 
They could be huge in number, which is

102
00:07:51,209 --> 00:07:54,435
again not the case in relational 
databases. 

103
00:07:54,435 --> 00:07:57,679
Right? 
You typically have, how many tables are 

104
00:07:57,679 --> 00:08:01,540
there in your database? 
Not more than probably 200 in the extreme 

105
00:08:01,540 --> 00:08:03,588
setting. 
Okay. 

106
00:08:03,588 --> 00:08:05,540
 >> This, this looks like a SQL query. 
We're suppose that. 

107
00:08:05,540 --> 00:08:08,580
 >> This looks like. 
 >> You take the, particularly 

108
00:08:08,580 --> 00:08:10,871
[INAUDIBLE]. 
 >> Yes. 

109
00:08:10,871 --> 00:08:14,968
 >> Then that's not there in SQL. 
 >> That's not there in SQL, but you can 

110
00:08:14,968 --> 00:08:18,748
always turn it around and say, okay, I 
will also model predicates as another 

111
00:08:18,748 --> 00:08:22,452
column. 
All right? 

112
00:08:22,452 --> 00:08:25,196
 >> You have to have a database which
stores all possible predicates. 

113
00:08:25,196 --> 00:08:27,184
 >> Predicates as well. 
 >> You have that on a separate table. 

114
00:08:27,184 --> 00:08:29,139
 >> Yes. 
 >> And then query that. 

115
00:08:29,139 --> 00:08:30,320
 >> And then get back. 
 >> Yeah. 

116
00:08:30,320 --> 00:08:32,598
 >> So there are some efforts for that 
kind of thing also. 

117
00:08:32,598 --> 00:08:36,194
So, some kind of you have these metadata 
that you have, you can query the 

118
00:08:36,194 --> 00:08:39,780
metadata, get the table names and then 
query. 

119
00:08:39,780 --> 00:08:44,299
So, you can do that also, but these are 
never the preferred model in relational databases.

120
00:08:44,299 --> 00:08:46,536
 >> [CROSSTALK]. 
 >> Yeah, so you're breaking the 

121
00:08:46,536 --> 00:08:51,497
relational ideas there. 
 >> [INAUDIBLE]. 

122
00:08:51,497 --> 00:08:52,823
 >> Yes. 
 >> All of that data? 

123
00:08:52,823 --> 00:08:57,060
 >> You are making a universal database 
and then firing queries on it.

124
00:08:57,060 --> 00:08:58,690
Right? 
That's one way of looking at it from a 

125
00:08:58,690 --> 00:09:02,120
relational point. 
You don't do all these normalizations. 

126
00:09:02,120 --> 00:09:08,923
You don't do anything. 
You just make it one big huge universal. 

127
00:09:08,923 --> 00:09:11,894
 >> [INAUDIBLE]. 
 >> As you state, 

128
00:09:11,894 --> 00:09:14,982
[INAUDIBLE]. 

129
00:09:14,982 --> 00:09:16,494
 >> Yes. 
 >> [INAUDIBLE]. 

130
00:09:16,494 --> 00:09:28,571
 >> Yeah. 
Everything falls apart. 

131
00:09:28,571 --> 00:09:30,586
 >> [INAUDIBLE]. 
 >> Okay. 

132
00:09:30,586 --> 00:09:30,980
 >> [INAUDIBLE]. 
 >> Yes. 

133
00:09:30,980 --> 00:09:35,037
 >> Do you [INAUDIBLE]. 
I think you want to get to that, right? 

134
00:09:35,037 --> 00:09:37,416
[INAUDIBLE] 
 >> Yes, so that's, that's the meat of 

135
00:09:37,416 --> 00:09:41,383
the, challenge, right? 
So if you look at, so whenever you look 

136
00:09:41,383 --> 00:09:46,540
at these query languages they don't talk 
about how they have to be evaluated. 

137
00:09:46,540 --> 00:09:49,888
So at this point I have no clue how this 
SPARQL query has to be evaluated, I don't

138
00:09:49,888 --> 00:09:53,060
care also. 
These are declarative, right? 

139
00:09:53,060 --> 00:09:57,229
I don't really care about it. 
But you really have to worry about these 

140
00:09:57,229 --> 00:10:02,002
kind of issues, like, should I go for 
universal relation? 

141
00:10:02,002 --> 00:10:06,538
Which means I have to deal with null 
values, storage issues, minimizing the 

142
00:10:06,538 --> 00:10:12,447
search space, whole bunch of things, or 
should I do some other trick? 

143
00:10:12,447 --> 00:10:17,260
Should I normalize only in certain cases, 
not normalize it, right? 

144
00:10:17,260 --> 00:10:24,748
These are the challenges which you will 
find if you try directly translating 

145
00:10:24,748 --> 00:10:32,081
these graph-like queries into relational 
setting. 

146
00:10:32,081 --> 00:10:34,678
 >> [INAUDIBLE]. 
 >> Yes. 

147
00:10:34,678 --> 00:10:37,062
 >> [INAUDIBLE]. 
 >> Yes. 

148
00:10:37,062 --> 00:10:40,955
 >> [INAUDIBLE]. 
 >> I mean see all these normal forms 

149
00:10:40,955 --> 00:10:46,523
help not only the efficiency but
also in order to keep the consistency in 

150
00:10:46,523 --> 00:10:53,100
some sense, right? 
The same requirements hold here also. 

151
00:10:53,100 --> 00:10:58,527
In fact, as we will see in a couple of 
slides, one extreme way which you already 

152
00:10:58,527 --> 00:11:05,648
might have seen in this example RDF
graph that I was looking at.

153
00:11:05,648 --> 00:11:09,080
You can look at this part, the triple 
part. 

154
00:11:09,080 --> 00:11:12,172
So, I have Bruce Willis, born in
Idar-Oberstein.

155
00:11:12,172 --> 00:11:14,160
The edge can be my table. 
Just triple pattern table. 

156
00:11:14,160 --> 00:11:21,115
So if you go back to your
normal form setting, it's almost like

157
00:11:21,115 --> 00:11:24,786
BCNF.
Right? 

158
00:11:24,786 --> 00:11:30,395
Where you have just key and value, 
nothing else. 

159
00:11:30,395 --> 00:11:35,353
And the predicate is encoded in the table 
name, in the relational setting, but now

160
00:11:35,353 --> 00:11:40,834
you are explicitly storing it. 
That's, right? 

161
00:11:40,834 --> 00:11:45,751
That's, that's basically the way in which 
graphs can be stored. 
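The triple-table layout, where the predicate is stored as an explicit column instead of being encoded in a table name, can be sketched with SQLite. The table name, column names and rows are assumptions made for this example.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# One table for the whole graph: the predicate is just another column.
con.execute("CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)")
con.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        # Hypothetical rows standing in for the lecture's example graph.
        ("Bruce_Willis", "bornIn", "Idar-Oberstein"),
        ("Bruce_Willis", "actedIn", "The_Expendables"),
        ("Dolph_Lundgren", "actedIn", "The_Expendables"),
        ("Dolph_Lundgren", "bornIn", "Stockholm"),
    ],
)

# Who worked with someone born in Stockholm?
# Every triple pattern turns into one more self-join on the same table.
rows = con.execute(
    """
    SELECT DISTINCT a.subject
    FROM triples AS a
    JOIN triples AS b ON a.object = b.object AND a.subject <> b.subject
    JOIN triples AS c ON c.subject = b.subject
    WHERE a.predicate = 'actedIn'
      AND b.predicate = 'actedIn'
      AND c.predicate = 'bornIn'
      AND c.object = 'Stockholm'
    """
).fetchall()
print(rows)
```

Three patterns already need three copies of the table, which is the bunch of joins mentioned earlier.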

162
00:11:45,751 --> 00:11:47,561
 >> The edge list. 
 >> Which list? 

163
00:11:47,561 --> 00:11:50,208
 >> Yeah, that list with the 
[INAUDIBLE]. 

164
00:11:50,208 --> 00:11:53,829
 >> Exactly. 
 >> [INAUDIBLE]. 

165
00:11:53,829 --> 00:11:59,764
 >> Yeah. 
 >> On the SPARQL queries. 

166
00:11:59,764 --> 00:12:03,922
What kind of queries are better suited 
for SPARQL, the ones which are deeper 

167
00:12:03,922 --> 00:12:09,707
in the sense that you have table a related
to b and then c and then d? 

168
00:12:09,707 --> 00:12:10,916
Or is it b, c, d are all connected to a
directly? 

169
00:12:10,916 --> 00:12:12,396
Which, what, where, which kinds of 
queries are better suited for SQL? 

170
00:12:12,396 --> 00:12:14,275
 >> It's, so that depends on entirely 
the application. 

171
00:12:14,275 --> 00:12:22,163
So it's, so let's not worry about SPARQL, 
let's worry about SQL okay, I'm going to, 

172
00:12:22,163 --> 00:12:29,953
so, what kind of queries are better 
suited for SQL? 

173
00:12:29,953 --> 00:12:37,504
 >> Not SQL, SQL is only one side of it. 
 >> But that's only because your 

174
00:12:37,504 --> 00:12:42,950
performance is weak. 
Suppose if I go from main memory 

175
00:12:42,950 --> 00:12:47,661
databases, which support SQL. 
Probably deeper, whatever, it's perfectly 

176
00:12:47,661 --> 00:12:52,770
fine, right? 
So in the same setting, in the same way. 

177
00:12:52,770 --> 00:12:57,450
You cannot ask the question of whether SPARQL
is suited for some kind of queries.

178
00:12:57,450 --> 00:13:00,360
Whether your database which really 
implements SPARQL. 

179
00:13:00,360 --> 00:13:04,120
Is it better suited for this? 
So that's the, so in terms of power of 

180
00:13:04,120 --> 00:13:08,377
the language, SPARQL is no more powerful 
than SQL. 

181
00:13:08,377 --> 00:13:12,025
I mean SQL is already Turing complete, so
you cannot get anything more

182
00:13:12,025 --> 00:13:17,120
powerful than that. 
So once you have that, SPARQL is no more 

183
00:13:17,120 --> 00:13:20,646
powerful. 
So the reason why you came up with a new 

184
00:13:20,646 --> 00:13:26,475
language rather than just reusing SQL is
the ease of use and the way you think.

185
00:13:26,475 --> 00:13:29,270
Right? 
You go to RDF and you look at

186
00:13:29,270 --> 00:13:33,953
it from the relational setting so the way 
in which people think is different. 

187
00:13:33,953 --> 00:13:38,771
So, in XML, people thought in trees.
While in relational tables, they looked

188
00:13:38,771 --> 00:13:44,804
at it in table format.
While in SPARQL, that is in the RDF

189
00:13:44,804 --> 00:13:54,834
world, people always look at it as graphs.
So, it's just the ease of use, not which 

190
00:13:54,834 --> 00:13:59,270
is more powerful. 
Everything is equally powerful. 

191
00:13:59,270 --> 00:14:03,480
It just simplifies your life, right?