1 00:00:00,025 --> 00:00:06,025 >> [COUGH] Another thing many of these scalable graph data management solutions 2 00:00:06,025 --> 00:00:11,145 try, except one which is on the board, is this: since we know graphs 3 00:00:11,145 --> 00:00:16,585 are really complicated beasts which have very diverse kinds of relationships 4 00:00:16,585 --> 00:00:23,604 in them, and we also know that RAM is getting 5 00:00:23,604 --> 00:00:30,640 cheaper, let's assume that I can distribute this graph over many machines. 6 00:00:30,640 --> 00:00:34,816 And each machine has gigabytes of RAM, so let me just put the whole database 7 00:00:34,816 --> 00:00:39,568 completely in memory when you actually need to perform transactions on it, 8 00:00:39,568 --> 00:00:44,570 right? So, if you look at Neo4j, it loads the 9 00:00:44,570 --> 00:00:50,270 entire graph into memory when you first fire any one query. 10 00:00:51,930 --> 00:00:54,110 Until then, it's peacefully sitting on the disk. 11 00:00:54,110 --> 00:01:00,710 So your first query can easily take a few hours while it loads everything into memory. 12 00:01:00,710 --> 00:01:03,206 And then after that, every query is, like, super fast. 13 00:01:03,206 --> 00:01:05,810 Zoop, zoop, zoop, zoop, it comes through. Right? 14 00:01:05,810 --> 00:01:11,335 So, average query response times are extremely good, but the first cold-start 15 00:01:11,335 --> 00:01:17,626 queries are pretty bad. But on the other hand, if you are trying 16 00:01:17,626 --> 00:01:23,996 to look at billion-node graphs, Neo4j really, really struggles, even on the fairly 17 00:01:23,996 --> 00:01:32,415 large servers that we have tried it on. So, the ultimate solution is: let's put 18 00:01:32,415 --> 00:01:36,100 everything on disk, and just do it super fast, 19 00:01:37,380 --> 00:01:41,860 by some means, we don't know how, right? So there is one solution, the idea of 20 00:01:41,860 --> 00:01:46,640 RDF-3X, 
which is basically what most of my work 21 00:01:46,640 --> 00:01:50,978 depends on. The idea is, essentially, look at triples and 22 00:01:50,978 --> 00:01:55,400 look at what kind of access patterns you have. 23 00:01:55,400 --> 00:02:00,930 So you can access by source and look for the predicate and object. 24 00:02:00,930 --> 00:02:03,800 You can look at the predicate and look for source and object. 25 00:02:03,800 --> 00:02:07,120 You can look at the object and look for predicate and source, right? 26 00:02:07,120 --> 00:02:12,194 So let's create multiple indexes which are storing the same redundant 27 00:02:12,194 --> 00:02:18,590 information in some sense, right? I just store it really compactly in a 28 00:02:18,590 --> 00:02:23,316 very compressed format. And then I add some more indexes just to 29 00:02:23,316 --> 00:02:29,052 make life easier for a few other queries which do not follow this kind of pattern. 30 00:02:29,052 --> 00:02:33,995 >> [INAUDIBLE] No, BLINKS does everything in memory. 31 00:02:33,995 --> 00:02:39,224 BLINKS, BANKS, everything. >> [INAUDIBLE] Even the index is in 32 00:02:39,224 --> 00:02:40,025 memory. >> Huh. 33 00:02:40,025 --> 00:02:42,990 BLINKS. So, there, the focus is that even 34 00:02:42,990 --> 00:02:46,609 computing the [INAUDIBLE] in memory is expensive. 35 00:02:46,609 --> 00:02:50,509 So, you really want to speed that up, so they maintain these fringe nodes and try 36 00:02:50,509 --> 00:02:54,260 to do it [CROSSTALK] really fast. Exactly. 37 00:02:54,260 --> 00:02:57,138 But everything is in memory. It doesn't ever go into the database, I 38 00:02:57,138 --> 00:03:00,475 mean, on disk. >> This [INAUDIBLE]. 39 00:03:00,475 --> 00:03:03,592 >> Yeah, it's not mine. I just use it. 40 00:03:03,592 --> 00:03:05,060 >> [INAUDIBLE]. >> So the idea of RDF-3X was developed 41 00:03:05,060 --> 00:03:10,326 at the Max Planck Institute. And it's perhaps the fastest RDF store I 42 00:03:10,326 --> 00:03:15,680 have seen so far. 
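The multiple-index idea described above can be sketched roughly as follows. This is only a toy illustration under stated assumptions, not the actual storage layout of the system mentioned in the talk: the sample triples and the names `build_indexes` and `lookup` are hypothetical, and the compression the speaker mentions is left out entirely. The point is just that keeping the same triples sorted in every component order turns any pattern with bound components into a prefix range scan.

```python
from bisect import bisect_left
from itertools import permutations

# Hypothetical toy data: (source, predicate, object) triples.
TRIPLES = [
    ("bruce_willis", "acted_in", "the_expendables"),
    ("dolph_lundgren", "acted_in", "the_expendables"),
    ("bruce_willis", "acted_in", "die_hard"),
    ("die_hard", "type", "movie"),
]

# All six component orders: spo, sop, pso, pos, osp, ops.
ORDERS = ["".join(p) for p in permutations("spo")]
POSITION = {"s": 0, "p": 1, "o": 2}


def build_indexes(triples):
    """Store every triple once per ordering, sorted (redundant on purpose)."""
    return {
        order: sorted(tuple(t[POSITION[c]] for c in order) for t in triples)
        for order in ORDERS
    }


def lookup(indexes, s=None, p=None, o=None):
    """Answer one triple pattern; None means the component is unbound."""
    bound = {k: v for k, v in (("s", s), ("p", p), ("o", o)) if v is not None}
    # Choose an ordering whose leading components are exactly the bound ones.
    order = next(x for x in ORDERS if set(x[:len(bound)]) == set(bound))
    prefix = tuple(bound[c] for c in order[:len(bound)])
    index = indexes[order]
    results = []
    i = bisect_left(index, prefix)  # first entry that is >= the prefix
    while i < len(index) and index[i][:len(prefix)] == prefix:
        row = dict(zip(order, index[i]))
        results.append((row["s"], row["p"], row["o"]))
        i += 1
    return results


indexes = build_indexes(TRIPLES)
cast = lookup(indexes, p="acted_in", o="the_expendables")
```

Here `cast` holds the two triples whose predicate and object match, found by scanning a contiguous range of the `pos` ordering rather than the whole data set.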
So that's why it's called Triple Express, 43 00:03:15,680 --> 00:03:18,605 right? And they have added transactional 44 00:03:18,605 --> 00:03:21,000 support. And what I'm working on is to add 45 00:03:21,000 --> 00:03:26,734 analytic support on top of RDF-3X. So they did the transactional side and we 46 00:03:26,734 --> 00:03:30,950 are adding the analytic side, trying to do mining operations or even 47 00:03:30,950 --> 00:03:37,243 recursive-reasoning kinds of things. >> [INAUDIBLE] 48 00:03:37,243 --> 00:03:42,232 Yes. >> [INAUDIBLE] 49 00:03:42,232 --> 00:03:55,657 It will surely make it faster, all right? But, again, things are so messy that 50 00:03:55,657 --> 00:04:01,486 constructing such communities [NOISE] is not as easy; they are not as clean as 51 00:04:01,486 --> 00:04:06,218 you would see on social networks. >> [INAUDIBLE]. 52 00:04:06,218 --> 00:04:12,404 >> Yes. >> And then 53 00:04:12,404 --> 00:04:16,825 [INAUDIBLE]. 54 00:04:16,825 --> 00:04:17,724 Yes? Yes. 55 00:04:17,724 --> 00:04:23,314 >> [CROSSTALK] [INAUDIBLE] Exactly, so you are basically trying to implicitly 56 00:04:23,314 --> 00:04:29,100 derive some types, and trying to move it up the hierarchy. 57 00:04:29,100 --> 00:04:33,356 You can do that, but at least I have tried it on YAGO and it fails quite 58 00:04:33,356 --> 00:04:37,904 miserably. Because the communities turn out to be 59 00:04:37,904 --> 00:04:42,881 very small in size, and then you have a really high type hierarchy, a really deep 60 00:04:42,881 --> 00:04:48,865 type hierarchy, and each one has very little filtering. 61 00:04:48,865 --> 00:04:57,748 >> [INAUDIBLE] And then we will go [UNKNOWN] [COUGH] [COUGH] [UNKNOWN] could 62 00:04:57,748 --> 00:05:07,735 they use that [UNKNOWN] just like they used the one-way diagram? 63 00:05:07,735 --> 00:05:09,990 >> Yes. >> [UNKNOWN] Yeah, yeah. 64 00:05:09,990 --> 00:05:15,022 >> [UNKNOWN] You can do that, you can do that. 
65 00:05:15,022 --> 00:05:18,361 But I can assure you, the performance will not be any better. 66 00:05:18,361 --> 00:05:20,730 I can assure you this. >> [INAUDIBLE]. 67 00:05:20,730 --> 00:05:23,237 >> Yeah. >> Standard graphs. 68 00:05:23,237 --> 00:05:27,123 >> You can run some community detection mining algorithms and try to just group 69 00:05:27,123 --> 00:05:28,414 them. >> [INAUDIBLE]. 70 00:05:28,414 --> 00:05:33,140 >> But your queries will not be of that nature. 71 00:05:33,140 --> 00:05:37,070 So let's again go back to our good old The Expendables example. 72 00:05:37,070 --> 00:05:41,687 At some point Bruce Willis and Dolph Lundgren were in two different 73 00:05:41,687 --> 00:05:47,050 communities. Dolph was always in martial arts and 74 00:05:47,050 --> 00:05:51,594 largely European settings, right, and he was not in the big hits like Bruce Willis 75 00:05:51,594 --> 00:05:55,809 was. So, in any definition of your community 76 00:05:55,809 --> 00:05:58,695 structure, you would keep these two separated. 77 00:05:58,695 --> 00:06:04,790 And then you add one node, which connects these two. 78 00:06:04,790 --> 00:06:08,970 And your queries now are all about The Expendables. 79 00:06:08,970 --> 00:06:12,996 So, if every time you have to touch this community and that community, [SOUND] then 80 00:06:12,996 --> 00:06:16,490 your community is not really useful, right? 81 00:06:16,490 --> 00:06:27,398 So you can easily extrapolate it to even more complicated settings. 82 00:06:27,398 --> 00:06:30,272 >> [INAUDIBLE] This is a, yeah. >> [INAUDIBLE] [INAUDIBLE] Yeah. 83 00:06:30,272 --> 00:06:34,279 >> [INAUDIBLE] So, we can discuss this further. 84 00:06:34,279 --> 00:06:38,990 I mean, let me just proceed; I can see that the time is way past. 85 00:06:38,990 --> 00:06:43,337 And now we have convinced ourselves that Linked Open Data 86 00:06:43,337 --> 00:06:49,881 is actually a graph with some additional little details like labels and so on. 
87 00:06:49,881 --> 00:06:53,930 Let's try to look at what the standard graph problems are, 88 00:06:53,930 --> 00:06:57,634 and how they are related to the Linked Open Data setting. 89 00:06:57,634 --> 00:07:00,461 Okay. I mean, these are, at least the 90 00:07:00,461 --> 00:07:05,670 things that are in blue are your undergrad math, right? 91 00:07:05,670 --> 00:07:08,970 So you have reachability queries. You are given two nodes, and you want to 92 00:07:08,970 --> 00:07:13,100 find out if these two are connected. Okay. 93 00:07:13,100 --> 00:07:18,720 So we have studied different solutions for it in graph algorithms courses. 94 00:07:19,900 --> 00:07:23,996 And then you actually have another variant: not looking for connectivity 95 00:07:23,996 --> 00:07:28,530 alone, you're actually looking for how they're connected. 96 00:07:28,530 --> 00:07:31,540 Give me all the nodes that are in between, 97 00:07:31,540 --> 00:07:35,100 the shortest path between these two nodes. 98 00:07:35,100 --> 00:07:40,510 Clearly, if two nodes are reachable, then the shortest path can also be found. 99 00:07:40,510 --> 00:07:43,040 That's guaranteed. Right? 100 00:07:43,040 --> 00:07:47,256 So these, in some sense, answer somewhat the same question, just that 101 00:07:47,256 --> 00:07:52,383 shortest path is a little more informative than reachability. 102 00:07:53,390 --> 00:07:56,915 And then you have totally arbitrary pattern queries. 103 00:07:56,915 --> 00:07:59,119 Right? I can form any kind of pattern in my 104 00:07:59,119 --> 00:08:04,250 query template, which could have wildcards or may not have wildcards. 105 00:08:04,250 --> 00:08:07,755 And then I want to find all instances of this in the graph. 106 00:08:07,755 --> 00:08:12,748 So these are the three main graph problems in the query setting. 
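The relationship between the first two problems can be made concrete with a small sketch. It assumes an unweighted directed graph stored as an adjacency list; the graph `GRAPH` and the function names are hypothetical and only serve the illustration. Breadth-first search answers the shortest-path question, and reachability falls out as the yes/no version of the same search, which is why shortest path is the more informative of the two.

```python
from collections import deque

# Hypothetical directed graph as an adjacency list.
GRAPH = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d"],
    "d": ["e"],
    "e": [],
    "f": ["a"],
}


def shortest_path(graph, source, target):
    """Return one shortest path (fewest edges) as a node list, or None."""
    parent = {source: None}  # doubles as the visited set
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            # Walk the parent pointers back to the source.
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in graph.get(node, []):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None


def reachable(graph, source, target):
    """Reachability is the yes/no answer to the same search."""
    return shortest_path(graph, source, target) is not None


path = shortest_path(GRAPH, "a", "e")
```

With this graph, `path` lists the in-between nodes from `a` to `e`, while `reachable(GRAPH, "a", "f")` is false because the only edge involving `f` points away from it.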
107 00:08:12,748 --> 00:08:19,280 And many other problems that you see can be decomposed into some variant of these, 108 00:08:19,280 --> 00:08:23,720 or some group of these. Right? 109 00:08:23,720 --> 00:08:27,689 PageRank, for example, you can look at as some variant of computing many, 110 00:08:27,689 --> 00:08:32,010 many reachability and shortest paths. Right? 111 00:08:32,010 --> 00:08:36,390 For Steiner trees, for example, a well-known solution is to use shortest path 112 00:08:36,390 --> 00:08:40,934 queries, okay. And if you link them back to SPARQL and 113 00:08:40,934 --> 00:08:46,428 RDF, you can see that reachability queries by themselves don't exist in SPARQL, 114 00:08:46,428 --> 00:08:51,594 except when you are looking at the extensions like property paths which are 115 00:08:51,594 --> 00:08:58,088 coming through. It says that, okay, if a can reach b, then 116 00:08:58,088 --> 00:09:02,228 pick that b as your binding variable and then start processing something 117 00:09:02,228 --> 00:09:09,911 underneath, looking for patterns. So, one of the first examples I gave was, 118 00:09:09,911 --> 00:09:15,730 find me all friends of, all connected nodes from, Shea Chantal. 119 00:09:15,730 --> 00:09:19,623 This was one of the examples I gave. I could add an additional constraint, 120 00:09:19,623 --> 00:09:24,039 additional structure on the query, saying find all reachable nodes from Shea 121 00:09:24,039 --> 00:09:28,800 Chantal, and return me only those nodes which have a certain graph structure around 122 00:09:28,800 --> 00:09:33,922 them. Those who work in DCS and. 123 00:09:33,922 --> 00:09:37,988 >> [INAUDIBLE] Why don't we just say, from A, question mark X? 124 00:09:37,988 --> 00:09:41,745 Y and it. >> No, no, you also need the star. 125 00:09:43,060 --> 00:09:46,136 If you just have question mark X, it is still one hop. 126 00:09:46,136 --> 00:09:49,430 No, but there is no star. >> There is no star. 
127 00:09:49,430 --> 00:09:53,580 The star is coming in the property paths; the same thing existed in XPath. 128 00:09:53,580 --> 00:09:57,240 So, in XPath and XQuery, in the beginning they did not have a star, but 129 00:09:57,240 --> 00:10:01,080 later they added a star. Stars are always pink. 130 00:10:01,080 --> 00:10:01,540 Okay. [COUGH]. 131 00:10:01,540 --> 00:10:07,684 So, since reachability is only now being added, clearly shortest paths wouldn't have 132 00:10:07,684 --> 00:10:16,375 existed before, but now again people have realized shortest paths are required. 133 00:10:16,375 --> 00:10:20,653 So there are efforts for adding shortest paths as part of these SPARQL extensions. 134 00:10:20,653 --> 00:10:25,460 And pattern queries, always: all of SPARQL can be seen as a pattern 135 00:10:25,460 --> 00:10:26,580 query.
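The difference between a plain one-hop pattern and a starred path, and the "reach first, then match a pattern around the result" query from the discussion, can be sketched as follows. This is a plain-Python illustration, not SPARQL and not any engine's implementation: the triples, the node `alice`, and the functions `match`, `one_hop`, `star`, and `reachable_matching` are all hypothetical. The `star` function follows zero or more edges with the given predicate, which is the meaning of `*` in SPARQL 1.1 property paths.

```python
# Hypothetical labelled graph as a set of (source, predicate, object) triples.
TRIPLES = {
    ("alice", "knows", "bob"),
    ("bob", "knows", "carol"),
    ("carol", "knows", "dave"),
    ("bob", "works_in", "dcs"),
    ("dave", "works_in", "dcs"),
    ("carol", "works_in", "math"),
    ("erin", "works_in", "dcs"),
}


def match(triples, s=None, p=None, o=None):
    """Match a single triple pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def one_hop(triples, start, predicate):
    """Without a star, a pattern with one variable only reaches direct neighbours."""
    return {o for _, _, o in match(triples, s=start, p=predicate)}


def star(triples, start, predicate):
    """Nodes reachable from start over zero or more `predicate` edges."""
    seen = {start}
    frontier = [start]
    while frontier:
        node = frontier.pop()
        for nxt in one_hop(triples, node, predicate):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen


def reachable_matching(triples, start, predicate, p, o):
    """Bind each reachable node, then keep those matching the pattern (?x p o)."""
    return sorted(
        x for x in star(triples, start, predicate)
        if match(triples, s=x, p=p, o=o)
    )


direct = one_hop(TRIPLES, "alice", "knows")
in_dcs = reachable_matching(TRIPLES, "alice", "knows", "works_in", "dcs")
```

In this toy data `direct` contains only the immediate neighbour, while `in_dcs` keeps the nodes anywhere along the `knows` chain that also work in `dcs`; `erin` is excluded because she matches the pattern but is not reachable.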