1
00:00:00,025 --> 00:00:06,025
 >> [COUGH] Another thing many of these
scalable Graph Data management solutions 

2
00:00:06,025 --> 00:00:11,145
try, except one which is on the board,
is that let's put, since we know graphs 

3
00:00:11,145 --> 00:00:16,585
are really complicated beasts which have 
very diverse kinds of relationships

4
00:00:16,585 --> 00:00:23,604
between them. 
And we also know that RAM is getting

5
00:00:23,604 --> 00:00:30,640
cheaper, let's assume that I can 
distribute this graph over many machines. 

6
00:00:30,640 --> 00:00:34,816
And each machine has gigabytes of RAM, so let
me just put all the database in, 

7
00:00:34,816 --> 00:00:39,568
completely in memory, when you actually 
need to perform transactions on them, 

8
00:00:39,568 --> 00:00:44,570
Right? 
So, if you look at Neo4j, it loads the

9
00:00:44,570 --> 00:00:50,270
entire graph into memory when you first
fire any one query. 

10
00:00:51,930 --> 00:00:54,110
Until then, it's peacefully sitting on 
the disk. 

11
00:00:54,110 --> 00:01:00,710
So your first query can easily take a few
hours; it loads everything into memory.

12
00:01:00,710 --> 00:01:03,206
And then after that, every query is like, 
super fast. 

13
00:01:03,206 --> 00:01:05,810
Zoop, zoop, zoop, zoop, it comes through. 
Right? 

14
00:01:05,810 --> 00:01:11,335
So, average query response times are 
extremely good, but the first cold-start

15
00:01:11,335 --> 00:01:17,626
queries are pretty bad. 
But on the other hand, if you are trying 

16
00:01:17,626 --> 00:01:23,996
to look at billion-node graphs, Neo4j
really, really struggles even on fairly 

17
00:01:23,996 --> 00:01:32,415
large servers that we have tried on. 
So, the ultimate solution is let's put 

18
00:01:32,415 --> 00:01:36,100
everything on disk, and just do it super 
fast. 

19
00:01:37,380 --> 00:01:41,860
Through some means, we don't know how, right?
So there is one solution, RDF-3X

20
00:01:41,860 --> 00:01:46,640
(RDF Triple eXpress),
which is basically what most of my work

21
00:01:46,640 --> 00:01:50,978
depends on. 
The idea is, essentially, look at triples and

22
00:01:50,978 --> 00:01:55,400
look at what kinds of access patterns
you have.

23
00:01:55,400 --> 00:02:00,930
So you can access by looking at the source,
and look for the predicate and object.

24
00:02:00,930 --> 00:02:03,800
You can look at predicate, and look for 
source and object. 

25
00:02:03,800 --> 00:02:07,120
You can look at object, and look for predicate
and source, right?

26
00:02:07,120 --> 00:02:12,194
So let's create multiple indexes which 
are storing the same redundant 

27
00:02:12,194 --> 00:02:18,590
information in some sense, right? 
I just store it really compactly in a 

28
00:02:18,590 --> 00:02:23,316
very compressed format. 
And then I add some more indexes just to 

29
00:02:23,316 --> 00:02:29,052
make life easier for a few other queries
which do not follow this kind of pattern. 

30
00:02:29,052 --> 00:02:33,995
 >> [INAUDIBLE] No, BLINKS does
everything in memory. 

31
00:02:33,995 --> 00:02:39,224
BLINKS, BANKS, everything.
 >> [INAUDIBLE] Even the index is in 

32
00:02:39,224 --> 00:02:40,025
memory. 
 >> Huh. 

33
00:02:40,025 --> 00:02:42,990
Link. 
So there, the focus is that even

34
00:02:42,990 --> 00:02:46,609
computing the [INAUDIBLE] in memory is
expensive. 

35
00:02:46,609 --> 00:02:50,509
So, you really want to speed that up, so 
they maintain these fringe nodes and try 

36
00:02:50,509 --> 00:02:54,260
to do it [CROSSTALK] really fast. 
Exactly. 

37
00:02:54,260 --> 00:02:57,138
But everything is in memory. 
It doesn't ever go into the database, I 

38
00:02:57,138 --> 00:03:00,475
mean, on disk. 
 >> This [INAUDIBLE]. 

39
00:03:00,475 --> 00:03:03,592
 >> Yeah, it's not mine. 
I just use it. 

40
00:03:03,592 --> 00:03:05,060
 >> [INAUDIBLE]. 
 >> So RDF-3X was done

41
00:03:05,060 --> 00:03:10,326
at the Max Planck Institute.
And it's perhaps the fastest RDF store I

42
00:03:10,326 --> 00:03:15,680
have seen so far. 
So that's why it's called Triple Express, 

43
00:03:15,680 --> 00:03:18,605
right? 
And they have added transactional 

44
00:03:18,605 --> 00:03:21,000
support. 
And what I'm working on is to add 

45
00:03:21,000 --> 00:03:26,734
analytic support on top of RDF-3X.
So they did the transactional side, and we

46
00:03:26,734 --> 00:03:30,950
are adding the analytic side.
Trying to do mining operations, or even

47
00:03:30,950 --> 00:03:37,243
recursive reasoning kind of things. 
 >> [INAUDIBLE] 

48
00:03:37,243 --> 00:03:42,232
Yes. 
 >> [INAUDIBLE] 

49
00:03:42,232 --> 00:03:55,657
It will surely make it faster, all right?
But, again, things are so messy, that 

50
00:03:55,657 --> 00:04:01,486
constructing such communities [NOISE] is
not as easy; they are not as clean as

51
00:04:01,486 --> 00:04:06,218
you would see on social networks. 
 >> [INAUDIBLE]. 

52
00:04:06,218 --> 00:04:12,404
 >> Yes. 
 >> And then 

53
00:04:12,404 --> 00:04:16,825
[INAUDIBLE]. 

54
00:04:16,825 --> 00:04:17,724
Yes? 
Yes 

55
00:04:17,724 --> 00:04:23,314
 >> [CROSSTALK] [INAUDIBLE] Exactly, so 
you are basically trying to implicitly 

56
00:04:23,314 --> 00:04:29,100
derive some types. 
I'm trying to move it up the hierarchy.

57
00:04:29,100 --> 00:04:33,356
You can do that, but at least I have 
tried it on YAGO and it fails quite 

58
00:04:33,356 --> 00:04:37,904
miserably. 
Because the communities turn out to be

59
00:04:37,904 --> 00:04:42,881
very small in size and then you have 
really high type hierarchy, really deep 

60
00:04:42,881 --> 00:04:48,865
type hierarchy and each one has very 
little filtering. 

61
00:04:48,865 --> 00:04:57,748
 >> [INAUDIBLE] And then we will go 
[UNKNOWN] [COUGH] [COUGH] [UNKNOWN] could 

62
00:04:57,748 --> 00:05:07,735
they use that [UNKNOWN] just like they 
used the one-way diagram? 

63
00:05:07,735 --> 00:05:09,990
 >> Yes. 
 >> [UNKNOWN] Yeah, yeah. 

64
00:05:09,990 --> 00:05:15,022
 >> [UNKNOWN] You can do that, you can 
do that. 

65
00:05:15,022 --> 00:05:18,361
But I can assure you, the performance
will not be any better.

66
00:05:18,361 --> 00:05:20,730
I can assure you this. 
 >> [INAUDIBLE]. 

67
00:05:20,730 --> 00:05:23,237
 >> Yeah. 
 >> Standard graphs. 

68
00:05:23,237 --> 00:05:27,123
 >> You can run some community detection
mining algorithms and try to just group 

69
00:05:27,123 --> 00:05:28,414
them. 
 >> [INAUDIBLE]. 

70
00:05:28,414 --> 00:05:33,140
 >> But your queries will not be of that 
nature. 

71
00:05:33,140 --> 00:05:37,070
So let's again go back to our good
old Expendables example.

72
00:05:37,070 --> 00:05:41,687
At some point Bruce Willis and Dolph 
Lundgren were in two different

73
00:05:41,687 --> 00:05:47,050
communities. 
Dolph was always in martial arts and 

74
00:05:47,050 --> 00:05:51,594
largely European settings, right, and he was
not in the big hits like Bruce Willis 

75
00:05:51,594 --> 00:05:55,809
was. 
So, in any definition of your community 

76
00:05:55,809 --> 00:05:58,695
structure, you would keep these two 
separated. 

77
00:05:58,695 --> 00:06:04,790
And you add one node, which connects
these two. 

78
00:06:04,790 --> 00:06:08,970
And your queries now are all about The
Expendables.

79
00:06:08,970 --> 00:06:12,996
So, every time you have to touch this 
community and that community [SOUND] then 

80
00:06:12,996 --> 00:06:16,490
your community is not really useful, 
right? 

81
00:06:16,490 --> 00:06:27,398
So you can easily extrapolate it to even 
more complicated settings. 

82
00:06:27,398 --> 00:06:30,272
 >> [INAUDIBLE] This is a, yeah. 
 >> [INAUDIBLE] [INAUDIBLE] Yeah. 

83
00:06:30,272 --> 00:06:34,279
 >> [INAUDIBLE] So, we can discuss this 
further. 

84
00:06:34,279 --> 00:06:38,990
I mean, let me just proceed; I can
see that the time is way past. 

85
00:06:38,990 --> 00:06:43,337
And then we will try, now that we have
convinced ourselves that Linked Open Data

86
00:06:43,337 --> 00:06:49,881
is actually a graph with some additional 
little details like labels and so on. 

87
00:06:49,881 --> 00:06:53,930
Let's try to look at, what are the 
standard graph problems? 

88
00:06:53,930 --> 00:06:57,634
And how they are related to the Linked Open
Data setting.

89
00:06:57,634 --> 00:07:00,461
Okay. 
I mean, these are, at least the

90
00:07:00,461 --> 00:07:05,670
things that are in blue, your
undergrad math, right? 

91
00:07:05,670 --> 00:07:08,970
So you have reachability queries. 
You are given two nodes, you want to

92
00:07:08,970 --> 00:07:13,100
find if these two are connected. 
Okay. 

93
00:07:13,100 --> 00:07:18,720
So we have studied different solutions 
for it in graph algorithms courses.

94
00:07:19,900 --> 00:07:23,996
And then you actually have under the 
material, not looking for connectivity 

95
00:07:23,996 --> 00:07:28,530
alone, you're actually looking for how 
they're connected. 

96
00:07:28,530 --> 00:07:31,540
Give me all the nodes that are in 
between. 

97
00:07:31,540 --> 00:07:35,100
The shortest path between these two 
nodes. 

98
00:07:35,100 --> 00:07:40,510
Clearly, if two nodes are reachable, then
a shortest path can also be found.

99
00:07:40,510 --> 00:07:43,040
That's guaranteed. 
Right? 

100
00:07:43,040 --> 00:07:47,256
So these are, in some sense, in
somewhat the same setting, but just that

101
00:07:47,256 --> 00:07:52,383
shortest path is a little more 
informative than reachability. 

102
00:07:53,390 --> 00:07:56,915
And then you have totally arbitrary 
pattern queries. 

103
00:07:56,915 --> 00:07:59,119
Right? 
I can form any kind of pattern in my 

104
00:07:59,119 --> 00:08:04,250
query template which could have wild 
cards, which may not have wild cards. 

105
00:08:04,250 --> 00:08:07,755
And then I want to find out all instances
of this in the graph. 

106
00:08:07,755 --> 00:08:12,748
So these are the three main graph 
problems in the query. 

107
00:08:12,748 --> 00:08:19,280
And many other problems that you see can 
be decomposed into some variant of these, 

108
00:08:19,280 --> 00:08:23,720
or some group of these.
Right? 

109
00:08:23,720 --> 00:08:27,689
PageRank, for example, you can look at
it as some variant of computing many, 

110
00:08:27,689 --> 00:08:32,010
many reachability and shortest path queries.
Right? 

111
00:08:32,010 --> 00:08:36,390
For Steiner trees, for example, a well-known
solution is to use shortest path 

112
00:08:36,390 --> 00:08:40,934
queries, okay.
And if you link them back to SPARQL and 

113
00:08:40,934 --> 00:08:46,428
RDF, you can see that reachability
queries by themselves don't exist in SPARQL

114
00:08:46,428 --> 00:08:51,594
except when you are looking at the 
extensions like property paths which are 

115
00:08:51,594 --> 00:08:58,088
coming through. 
It says that okay, if a can reach b, then 

116
00:08:58,088 --> 00:09:02,228
pick that b as your variable binding and
then start processing something 

117
00:09:02,228 --> 00:09:09,911
underneath, looking for patterns. 
So, one of the first examples I gave was, 

118
00:09:09,911 --> 00:09:15,730
find me all friends of, or all connected
nodes from Shea Chantal. 

119
00:09:15,730 --> 00:09:19,623
This was one of the examples I gave. 
I could add an additional constraint,

120
00:09:19,623 --> 00:09:24,039
additional structure on the query saying 
find all reachable nodes from Shea 

121
00:09:24,039 --> 00:09:28,800
Chantal, and return me only those nodes
which have certain graph structure around 

122
00:09:28,800 --> 00:09:33,922
them. 
Those who work in DCS and. 

123
00:09:33,922 --> 00:09:37,988
 >> [INAUDIBLE] Why don't we just say, 
from A question mark X? 

124
00:09:37,988 --> 00:09:41,745
Y and it. 
 >> No, no, you also need the star.

125
00:09:43,060 --> 00:09:46,136
If you just have question mark X, it is
still one hop.

126
00:09:46,136 --> 00:09:49,430
No, but there is no star.
 >> There is no star. 

127
00:09:49,430 --> 00:09:53,580
Star is coming in the property paths; the
same thing existed in XML.

128
00:09:53,580 --> 00:09:57,240
So in XPath and XQuery, in the
beginning they did not have a star but 

129
00:09:57,240 --> 00:10:01,080
later they added a star. 
Stars are always a pain.

130
00:10:01,080 --> 00:10:01,540
Okay. 
[COUGH]. 

131
00:10:01,540 --> 00:10:07,684
So, since reachability is being added, 
clearly shortest path wouldn't have 

132
00:10:07,684 --> 00:10:16,375
existed before but now again people have 
realized shortest paths are required. 

133
00:10:16,375 --> 00:10:20,653
So there are efforts for adding shortest
paths as part of these SPARQL extensions,

134
00:10:20,653 --> 00:10:25,460
and pattern queries, always.
All of SPARQL can be seen as a pattern 

135
00:10:25,460 --> 00:10:26,580
query.