1 00:00:00,025 --> 00:00:04,880 Yeah, thanks Gautam for inviting me and it's great to be here. 2 00:00:04,880 --> 00:00:07,690 Nice to see that there are so many people for this talk. 3 00:00:07,690 --> 00:00:11,281 I thought this talk would be one of those fringe talks in which only three, four 4 00:00:11,281 --> 00:00:16,270 people would be interested, but great to see the room is getting full. 5 00:00:16,270 --> 00:00:21,630 So, what I'm going to talk about is largely efficiency issues when one is 6 00:00:21,630 --> 00:00:26,750 dealing with large scale graphs, and, for me the graphs are typically not 7 00:00:26,750 --> 00:00:31,870 necessarily the social network style graphs, but largely from the linked data 8 00:00:31,870 --> 00:00:38,410 community. So we will see what these linked data 9 00:00:38,410 --> 00:00:43,090 graphs are, how they are different from social network graphs. 10 00:00:43,090 --> 00:00:47,374 And how big are these, and what are some problems that you see while [NOISE] 11 00:00:47,374 --> 00:00:51,590 you're trying to deal with these graphs, and [NOISE] how we can go about 12 00:00:51,590 --> 00:00:56,680 solving them. And much of the work that may come later 13 00:00:56,680 --> 00:01:01,740 in the talk is clearly not just me alone who has worked on it. 14 00:01:01,740 --> 00:01:04,740 I just haven't listed all my collaborators. 15 00:01:04,740 --> 00:01:08,400 But we can talk about it. I can send you the papers if required. 16 00:01:08,400 --> 00:01:13,552 So, first, I don't need to give an introduction about what a graph is, 17 00:01:13,552 --> 00:01:17,407 right? Everybody knows what a graph structure 18 00:01:17,407 --> 00:01:20,240 looks like. It's one of the most general ways of 19 00:01:20,240 --> 00:01:24,840 representing information, right. One of the most flexible 20 00:01:24,840 --> 00:01:26,499 forms. But let's focus on what is a graph 21 00:01:26,499 --> 00:01:30,234 database. 
As compared to, you can look at, 22 00:01:30,234 --> 00:01:34,850 actually, any relational database as a graph database, 23 00:01:34,850 --> 00:01:37,778 in some sense. So you have foreign keys which link one 24 00:01:37,778 --> 00:01:40,730 record to another. So, no issues. 25 00:01:40,730 --> 00:01:44,030 So you can actually form a graph structure using these foreign key 26 00:01:44,030 --> 00:01:48,208 relationships. And within a tuple, between two values in 27 00:01:48,208 --> 00:01:52,688 the tuple, you can actually form a relationship, because they're actually 28 00:01:52,688 --> 00:01:56,414 related. That's why they are called Relational 29 00:01:56,414 --> 00:01:59,864 Structures. So how is it different from a graph 30 00:01:59,864 --> 00:02:03,570 database, which is getting more popular nowadays? 31 00:02:03,570 --> 00:02:07,065 The key difference is in the way the data gets accessed. 32 00:02:07,065 --> 00:02:10,658 Right. In relational databases, the focus is 33 00:02:10,658 --> 00:02:15,955 always on indexed access to some tuple. So you know this group of values is a 34 00:02:15,955 --> 00:02:22,310 relation, so I just want to access this entire group and then process it. 35 00:02:22,310 --> 00:02:25,720 On the other hand, in graph databases, the focus is more on: I'm currently on 36 00:02:25,720 --> 00:02:30,610 this particular value, tell me all the values that are related to it. 37 00:02:30,610 --> 00:02:33,480 It need not be through a single tuple relationship. 38 00:02:33,480 --> 00:02:37,460 It could be normalized, de-normalized, across foreign keys. 39 00:02:37,460 --> 00:02:41,690 I really have no control on what kind of relationship I'm looking for. 40 00:02:41,690 --> 00:02:47,800 But I could traverse these relationships from a given node. 41 00:02:47,800 --> 00:02:52,856 So this is the key difference, and this. 
If you, again, put it back in the 42 00:02:52,856 --> 00:02:56,996 relational world, it can be seen as a huge number of joins that you need to 43 00:02:56,996 --> 00:03:02,176 perform, right? Which is not a fun task to do on 44 00:03:02,176 --> 00:03:08,590 relational databases, so you want to avoid these joins. 45 00:03:08,590 --> 00:03:13,010 On the other hand, this is the only mode of traversal access that you are allowed 46 00:03:13,010 --> 00:03:18,807 to do on graph databases. Although we will relax it a little later 47 00:03:18,807 --> 00:03:22,288 and say that we want to do both relational style as well as navigational 48 00:03:22,288 --> 00:03:26,241 style of access, but this is the main difference between standard databases and 49 00:03:26,241 --> 00:03:31,693 graph databases. So some examples of what kind of 50 00:03:31,693 --> 00:03:36,760 questions you could get on graph databases: find all friends of Gautam. 51 00:03:36,760 --> 00:03:42,328 So you know a node, Gautam, and you want to find all the relationships, pick out 52 00:03:42,328 --> 00:03:51,440 only those relationships which say it's friendship, and locate the other end. 53 00:03:51,440 --> 00:03:55,220 So this is a one-hop BFS with a certain kind of restriction, right, from the graph 54 00:03:55,220 --> 00:03:58,590 world. You can actually look at a little more 55 00:03:58,590 --> 00:04:02,718 complicated issues. So there is another node called 56 00:04:02,718 --> 00:04:06,372 Srikanta, and I want to find out all the connections, not just individual 57 00:04:06,372 --> 00:04:12,200 single-hop connections, but anyone who can be reached from Srikanta. 58 00:04:12,200 --> 00:04:15,962 So I want to contact him and probably send out a resume to him and say hey, 59 00:04:15,962 --> 00:04:21,860 please, propagate it in your network. So you should know how valuable that 60 00:04:21,860 --> 00:04:25,476 person is in terms of how many people he can reach out to. 
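The two example questions above (friends of Gautam as a one-hop BFS restricted by edge label, and everyone reachable from Srikanta) can be sketched roughly as follows. This is a minimal illustration, not the speaker's system: the edge list, the labels, and every name other than Gautam and Srikanta are made up, and `build_adjacency`, `neighbors_by_label` and `reachable` are hypothetical helper names.

```python
from collections import deque

# Hypothetical labeled, directed edges: (source, label, target).
EDGES = [
    ("Gautam", "friend", "Asha"),
    ("Gautam", "friend", "Ravi"),
    ("Gautam", "colleague", "Srikanta"),
    ("Srikanta", "friend", "Meera"),
    ("Meera", "friend", "Kiran"),
]


def build_adjacency(edges):
    """Index edges by source node so traversal needs no joins."""
    adjacency = {}
    for source, label, target in edges:
        adjacency.setdefault(source, []).append((label, target))
    return adjacency


def neighbors_by_label(adjacency, node, label):
    """One-hop BFS restricted to edges carrying the given label."""
    return sorted(t for l, t in adjacency.get(node, []) if l == label)


def reachable(adjacency, start):
    """All nodes reachable from start over any number of hops (plain BFS)."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for _, target in adjacency.get(node, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    seen.discard(start)  # report only the other nodes, not the start itself
    return seen


adjacency = build_adjacency(EDGES)
friends_of_gautam = neighbors_by_label(adjacency, "Gautam", "friend")
network_of_srikanta = reachable(adjacency, "Srikanta")
```

The size of `network_of_srikanta` is one simple reading of "how many people he can reach", under the assumption that every edge label counts as a connection.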
61 00:04:25,476 --> 00:04:28,244 Right? So look at Srikanta and look at all the 62 00:04:28,244 --> 00:04:30,841 reachable nodes from there. Right? 63 00:04:30,841 --> 00:04:36,080 These are some of the queries which you don't often find on relational 64 00:04:36,080 --> 00:04:39,498 databases. But these are extremely common when we 65 00:04:39,498 --> 00:04:43,162 are dealing with graph databases. >> Essentially they are recursive 66 00:04:43,162 --> 00:04:44,781 joins. >> They are recursive joins. 67 00:04:44,781 --> 00:04:47,925 >> Including [INAUDIBLE] everything. >> Yeah, right so, recursive join is 68 00:04:47,925 --> 00:04:51,690 one way of looking at it if you have a self join. 69 00:04:51,690 --> 00:04:55,775 But it could simply be a stream of joins. 70 00:04:55,775 --> 00:04:57,556 Okay? So it could be on the same table or 71 00:04:57,556 --> 00:05:00,578 across tables. You really don't have a requirement to 72 00:05:00,578 --> 00:05:05,952 stop that. So, it's not, again, something very new. 73 00:05:05,952 --> 00:05:10,510 In fact, I haven't put the dates, but you can easily guess. 74 00:05:10,510 --> 00:05:17,395 Logic databases were around well before, I think, probably 80% of this room was even born. 75 00:05:17,395 --> 00:05:20,287 Right? People have been working on it, and if 76 00:05:20,287 --> 00:05:25,915 you have met Ullman or whoever visited TCS you can easily see that these guys 77 00:05:25,915 --> 00:05:27,200 >> Database. >> Yeah. 78 00:05:27,200 --> 00:05:31,620 They worked on Datalog, and even before that there was Prolog which was from the AI 79 00:05:31,620 --> 00:05:36,650 side, and these things have a 30-year-long history, right? 80 00:05:36,650 --> 00:05:40,234 So that's where the real grounds of graph databases were 81 00:05:40,234 --> 00:05:44,022 sown, right. 
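The "recursive joins" from the exchange above, which Datalog-style systems express as recursive rules, can be sketched as a repeated self-join run until no new pairs appear. This is only an illustrative fixpoint loop under simple assumptions (unlabeled edges held in memory as a set of pairs); `join_step` and `transitive_closure` are hypothetical names, not part of any system mentioned in the talk.

```python
def join_step(paths, edges):
    """One join: extend every known pair (a, b) by an edge (b, c) to get (a, c)."""
    by_source = {}
    for source, target in edges:
        by_source.setdefault(source, set()).add(target)
    return {(a, c) for a, b in paths for c in by_source.get(b, ())}


def transitive_closure(edges):
    """Repeat the join until a fixpoint: all (x, y) with y reachable from x."""
    closure = set(edges)
    frontier = set(edges)  # only pairs found in the last round are re-joined
    while frontier:
        frontier = join_step(frontier, edges) - closure
        closure |= frontier
    return closure


# Hypothetical chain a -> b -> c -> d.
chain = {("a", "b"), ("b", "c"), ("c", "d")}
all_pairs = transitive_closure(chain)
```

Re-joining only the newly found pairs in each round keeps the work per round small; the number of rounds is bounded by the longest shortest path in the graph, which is why the number of joins cannot be fixed ahead of time.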
And then, of course, once the web came, 82 00:05:44,022 --> 00:05:49,287 it's very natural to see the web as a big graph and you're all familiar with 83 00:05:49,287 --> 00:05:55,250 PageRank and Google's mode of trying to rank pages. 84 00:05:55,250 --> 00:05:59,158 Which is largely a graph query, right, as we will see soon. 85 00:05:59,158 --> 00:06:08,040 And then the XML wave hit the database community. And XML is both a tree and, if you relax 86 00:06:08,040 --> 00:06:14,954 it a little, a graph, all right? So, again, a huge amount of work came 87 00:06:14,954 --> 00:06:22,690 out during the XML activity, right? I mean XML database research. 88 00:06:22,690 --> 00:06:28,020 So whenever you look at many graph papers from about 10, 20 years, 10 to 15 years 89 00:06:28,020 --> 00:06:32,694 old, then you will see they all refer back to XML, XQuery, XPath kind of 90 00:06:32,694 --> 00:06:38,045 settings. So, although they are not strictly 91 00:06:38,045 --> 00:06:41,364 graphs, graph database research flourished 92 00:06:41,364 --> 00:06:45,320 during that time. But now there is an even bigger beast called 93 00:06:45,320 --> 00:06:48,886 Linked Data. Okay, which is what I'm really excited 94 00:06:48,886 --> 00:06:53,015 about, which I'm going to talk about, in this talk. 95 00:06:53,015 --> 00:06:58,333 There the linked data graphs are really graphs. 96 00:06:58,333 --> 00:07:05,950 As in, XML was mostly tree-structured, with some deviations from the tree. 97 00:07:05,950 --> 00:07:10,500 But linked data is mostly graph structures with very little deviating from the graph 98 00:07:10,500 --> 00:07:15,670 structure, as in moving into the tree. Very little. 99 00:07:15,670 --> 00:07:21,520 So, we are really seeing true graph requirements in databases now. 100 00:07:24,400 --> 00:07:28,670 So, just to give a brief introduction of what these graphs would look like in a linked 101 00:07:28,670 --> 00:07:32,772 data setting, right? 
So I was just thinking about which are 102 00:07:32,772 --> 00:07:36,820 the good examples to give. And then I happened to catch the DVD of 103 00:07:36,820 --> 00:07:40,845 The Expendables. So I thought I will talk about 104 00:07:40,845 --> 00:07:45,054 the good old Bruce Willis and Dolph Lundgren who are heroes in this movie, 105 00:07:45,054 --> 00:07:49,106 right? So you can form such a graph where Bruce 106 00:07:49,106 --> 00:07:54,695 Willis is well known and Dolph Lundgren, and you can have relationships, like both 107 00:07:54,695 --> 00:08:02,210 were born in different cities, Idar-Oberstein and Stockholm. 108 00:08:02,210 --> 00:08:05,895 And then both of them have worked in one single movie, The Expendables, and it's a 109 00:08:05,895 --> 00:08:09,690 movie, and you can even construct further relationships like, this is a movie made 110 00:08:09,690 --> 00:08:14,830 in Hollywood and it's an action movie and these kinds of things. 111 00:08:14,830 --> 00:08:18,730 And both of them are action heroes, out of which Dolph knows martial arts while 112 00:08:18,730 --> 00:08:22,780 Bruce Willis knows how to handle a gun. Right. 113 00:08:22,780 --> 00:08:28,341 So this is a small snapshot of a fairly large graph that you can 114 00:08:28,341 --> 00:08:35,922 construct just from the IMDB data set, just from movies that are out there. 115 00:08:35,922 --> 00:08:38,552 Yeah? >> This looks like it's very similar to 116 00:08:38,552 --> 00:08:42,172 Google Knowledge Graph, that they created. 117 00:08:42,172 --> 00:08:43,002 >> Yes. [COUGH] 118 00:08:43,002 --> 00:08:45,594 >> First two, three sentences of Wikipedia [INAUDIBLE]. 119 00:08:45,594 --> 00:08:47,990 >> Yes. In fact, Google Knowledge Graph is 120 00:08:47,990 --> 00:08:52,840 another idea of a linked data graph that you can see. 
121 00:08:52,840 --> 00:08:57,313 And, in fact, they have not just used Wikipedia, they have used IMDB as 122 00:08:57,313 --> 00:09:03,925 well; they used many of these almost-structured sources in order to extract. 123 00:09:03,925 --> 00:09:08,320 >> So I'm not going to talk much about how exactly they build the graph. 124 00:09:08,320 --> 00:09:12,878 But this is exactly what you are saying. This is what Google Knowledge Graph 125 00:09:12,878 --> 00:09:15,765 also looks like. >> But the problem with Google 126 00:09:15,765 --> 00:09:21,499 Knowledge Graph is, they relate each node with some static node. 127 00:09:21,499 --> 00:09:27,449 They cannot operate on a dynamically changing environment, [INAUDIBLE] 128 00:09:27,449 --> 00:09:33,059 >> [COUGH] [INAUDIBLE] So if the query is written in dynamically changing events 129 00:09:33,059 --> 00:09:38,271 or some time series type of data change related. 130 00:09:38,271 --> 00:09:40,029 >> Yeah. >> They clearly [INAUDIBLE]. 131 00:09:40,029 --> 00:09:42,170 >> They have a problem using the knowledge graph correctly. 132 00:09:42,170 --> 00:09:44,526 You are perfectly right. In fact this is actually an active research 133 00:09:44,526 --> 00:09:47,030 area. I'm not going to talk much about it. 134 00:09:47,030 --> 00:09:50,769 But we can discuss this offline. In fact this is also one of my research 135 00:09:50,769 --> 00:09:57,150 areas, coincidentally. So, looking at these event kind of 136 00:09:58,200 --> 00:10:03,366 data streams that are coming in: how you can represent them as graphs, how 137 00:10:03,366 --> 00:10:07,732 you can query them. And clearly the problems only compound from what I'm 138 00:10:07,732 --> 00:10:13,480 talking about here, right? So they're much harder problems. 139 00:10:13,480 --> 00:10:18,282 Google is still catching up with us on that front. 140 00:10:18,282 --> 00:10:20,128 >> This one. >> Yeah. 
141 00:10:20,128 --> 00:10:26,288 >> When I'm looking right here [INAUDIBLE] in one sentence and 142 00:10:26,288 --> 00:10:34,768 [INAUDIBLE] makes [INAUDIBLE] so I would like to [INAUDIBLE] [COUGH] So given such 143 00:10:34,768 --> 00:10:44,340 a graph that you have constructed from known facts, right? 144 00:10:44,340 --> 00:10:48,430 So you can actually look at the graph as simply a bag of edges. 145 00:10:48,430 --> 00:10:51,630 So you can allow for multiple edges between nodes, right, with the same 146 00:10:51,630 --> 00:10:54,730 label, also. It's perfectly okay. 147 00:10:54,730 --> 00:10:57,640 RDF does not prevent it. So. 148 00:10:57,640 --> 00:11:01,546 And every edge is represented as a triple, which essentially says, what is 149 00:11:01,546 --> 00:11:05,389 the source of the edge, what is the target of the edge, and what is the label 150 00:11:05,389 --> 00:11:11,300 on the edge? And edges are always directed in the RDF 151 00:11:11,300 --> 00:11:15,864 setting. And again, in the semantic web 152 00:11:15,864 --> 00:11:21,137 definition, when you look at it. So, in RDF graphs, every node and edge has a 153 00:11:21,137 --> 00:11:25,873 unique identifier in the form of a URI, but it's only a minor detail which is not 154 00:11:25,873 --> 00:11:32,450 really important for looking at RDF data as a graph, right? 155 00:11:32,450 --> 00:11:35,760 It just gives you some way of identifying the nodes and edges.
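The view described above, a graph as a bag of directed, labeled edges where each edge is a (source, label, target) triple identified by URIs, might look like this in code. It is a sketch under stated assumptions: the `http://example.org/` namespace, the label names `bornIn`, `actedIn` and `type`, and the helpers `uri`, `outgoing` and `sources_with` are all invented for illustration, and a plain list stands in for the bag.

```python
from collections import namedtuple

# One directed edge: where it starts, what it is labeled, where it ends.
Triple = namedtuple("Triple", ["source", "label", "target"])

EX = "http://example.org/"  # hypothetical namespace, not a real vocabulary


def uri(name):
    """Build an identifier for a node or an edge label."""
    return EX + name


# The graph is just a collection of triples (facts from the movie example).
graph = [
    Triple(uri("Bruce_Willis"), uri("bornIn"), uri("Idar-Oberstein")),
    Triple(uri("Dolph_Lundgren"), uri("bornIn"), uri("Stockholm")),
    Triple(uri("Bruce_Willis"), uri("actedIn"), uri("The_Expendables")),
    Triple(uri("Dolph_Lundgren"), uri("actedIn"), uri("The_Expendables")),
    Triple(uri("The_Expendables"), uri("type"), uri("Movie")),
]


def outgoing(triples, source):
    """All edges leaving the given node."""
    return [t for t in triples if t.source == source]


def sources_with(triples, label, target):
    """All nodes that have an edge with this label pointing at target."""
    return sorted(t.source for t in triples if t.label == label and t.target == target)


cast = sources_with(graph, uri("actedIn"), uri("The_Expendables"))
```

A linear scan per lookup is fine for a five-edge example; at the scale the talk is concerned with, some index over the triples would be needed, which is a separate design question.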