So now, since we talked about RDF and we said this is large, we really have to see how large this is. Okay, is "large" just a saying, or just because I want to get more grants, or are there really large data sets? So the big thing in linked data and linked open data is that it is really seen as a bag of edges, right? That's basically what we said. Now we can throw in any kind of edge, right? You can just keep putting edges in; you don't have a constraint on schema. You don't have to follow some structure. The least that you need is a subject, some label on the edge, and the target. That's basically all you want. So given that, you can go across various domains, starting from biology to movies, right, and in fact you do go from biology to movies. You look at the LOD, that is, the Linked Open Data consortium. What it is trying to do is look for such different linked data sources from different domains, put them all together, link them up, right? You may have a situation where one of the actors happened to write a PubMed article. It's perfectly possible, although very rare. You want to link this up, okay. You want to link up if there is a politician who is a chemistry scientist and has written many journal papers. You may want to know this.
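The "bag of edges" view described here can be sketched in a few lines: a dataset is just a set of (subject, predicate, object) triples with no schema, so edges from completely different domains coexist freely. All the identifiers below are made up for illustration.

```python
# A minimal sketch of the "bag of edges" view of RDF: a dataset is just
# a set of (subject, predicate, object) triples, with no imposed schema.
# All identifiers here are hypothetical, for illustration only.

triples = {
    ("ex:HarveyHoward", "ex:actedIn", "ex:SomeMovie"),        # movie domain
    ("ex:HarveyHoward", "ex:authorOf", "ex:PubMedArticle1"),  # biology domain
    ("ex:PubMedArticle1", "ex:publishedIn", "ex:PubMed"),
}

# Because there is no schema, linking in a fact from another domain
# is just another set insertion:
triples.add(("ex:SomePolitician", "ex:authorOf", "ex:ChemJournalPaper7"))

# Querying is pattern matching over the bag, e.g. all outgoing edges:
def outgoing(subject):
    return [(p, o) for (s, p, o) in triples if s == subject]

print(sorted(outgoing("ex:HarveyHoward")))
```

This is exactly why cross-domain linking is cheap here: nothing about the structure has to be declared in advance.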
>> Right?

>> [INAUDIBLE] Construction issues at that level.

>> Right? So, at this moment, what people are trying is something where they're not really worried about that. So there are two types of construction issues. You are right. One is, you have these textual sources where people mostly report things, and you want to move from there to this kind of structure. That's one kind of construction issue that people are facing. In fact Yago, DBpedia, Freebase, TextRunner, Cyc, these are all efforts which have focused on that aspect, where you have text and you want to extract it into some structured form. But what this LOD is really all about is to not really care about how they have been derived, but to put them together.

>> Oh, in that case, [INAUDIBLE].

>> [COUGH] [INAUDIBLE] x, random y, which is like a big chunk in, for example.

>> Yes.

>> So the subject is x, random y, [INAUDIBLE] now that [INAUDIBLE]

>> Precisely, yes.

>> So, even after extraction.

>> Yes.

>> It is problematic.

>> Yeah. So that's the second kind of construction problem I was talking about. So, in fact, even the example that you gave is already a complex one. I can take a much simpler example. So somebody in PubMed is called Mr. Howard.
Right? Hetch Howard. Okay, his name is Hetch Howard. I know there is another guy in IMDB who is Harvey Howard. How do I know whether this Harvey Howard and Hetch Howard are the same? Whether they are different, I have no idea, right? So there is a requirement that every such entity, whether it's a node or an edge, has to be uniquely identified. That's the basis of the way the semantic web can function. That's the way LOD can function. But how do I get, and assign, such a unique identifier? How do I know that these two are the same, so I shouldn't just give them separate identifiers? So, that's a huge problem, and, yeah?

>> [INAUDIBLE]

>> No, Google Knowledge Base has a much simpler solution for this. It minimizes the number of sources from where they take data, right, and they use some standard machine learning techniques. Through the web, relatively easy matching. All right? At this moment, at least. Okay? And there is a huge amount of human curation also going on. In fact I was just talking about that, even.

>> [INAUDIBLE]

>> Yeah, yeah. In fact, many of these sources, right, the yellow things that I have here.

>> [INAUDIBLE]

>> Yeah. Yeah. Okay. So Yago, Freebase, and DBpedia. All of them are largely built on Wikipedia.
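The entity-resolution question posed here, are "Hetch Howard" and "Harvey Howard" the same person, can be sketched with a toy string-similarity heuristic. The threshold and the similarity measure below are illustrative choices only; real systems combine many such signals with machine learning, not a single string score.

```python
# A toy sketch of the entity-resolution problem: decide whether two name
# strings from different sources refer to the same entity. The 0.85
# threshold and the use of plain string similarity are assumptions made
# for illustration, not how any production system actually works.
from difflib import SequenceMatcher

def same_entity(name_a, name_b, threshold=0.85):
    """Crude heuristic: high string similarity => probably the same entity."""
    score = SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio()
    return score >= threshold, round(score, 2)

print(same_entity("Hetch Howard", "Harvey Howard"))
print(same_entity("Hetch Howard", "Hetch Howard"))
```

Even this tiny example shows why the problem is hard: the two Howards share a surname, so naive similarity scores land in an ambiguous middle zone where neither "same" nor "different" is safe to conclude.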
In fact they wouldn't have existed; if Wikipedia did not exist in such a rich form, then you wouldn't have many of these resources. But LOD is beyond just these. Right? Not just the things that have come from Wikipedia, but way beyond it, because there is a whole bunch of biological facts, DBLP citation graphs; these are not part of Wikipedia. These have come from different sources.

>> [INAUDIBLE]

>> Yes, that's what I was about to tell him. That although Google Knowledge Base, the knowledge graph, does so many things nicely, it can trip up quite badly because of exactly this problem. So I was giving the example of fff. You fire the query; all of you can go back and try it out. You'll find the query for fff. Right? He is one of the well-known figures in computer science, in database work. So you look at the kind of snippet that you get on the right side, which is powered by the knowledge graph. It's perfectly matching what you know of fff.

>> But you look at the picture that is associated with it, it's someone else. He's from chemistry. He's also called fff.

>> In fact, [LAUGH] [INAUDIBLE] he doesn't look like [FOREIGN] at all, right? So there is no, I mean, how do you believe that everything that you get is perfect? Right?
They try their best, but these are research challenges which are still open, and there are many, many such research challenges.

>> iii It is not a small problem.

>> It is not at all a small problem. No, no, no. Entity resolution. People know that entity resolution is not a small problem in noisy text, like Twitter. Yeah, sure, there we don't know. But I'm talking about a perfectly well handcrafted source like Wikipedia. Even there things are not easy. Extremely messy. Okay? So that's only one part of the problem that you will face when you go to the linked open data world. And the size of the linked open data at this moment, with all these errors in it, is about 30 billion edges. Okay? Fairly large. Just compare it with something like what Facebook already announced: one billion users. One billion is already very exciting for many people. And this is 30 billion triples, with so much noise already, and clearly all those one billion have to be part of such a graph, right; every human being in principle can be part of the LOD. All right. So we have a.

>> [INAUDIBLE].

>> There is no superimposed schema, but most of these ontologies are already part of LOD.

>> [INAUDIBLE] What do you mean by that?
>> I mean, for example, [INAUDIBLE] you can write any [INAUDIBLE] you want [INAUDIBLE].

>> Yes.

>> [INAUDIBLE].

>> Yeah.

>> [INAUDIBLE] are a research problem.

>> Another research problem, right?

>> [INAUDIBLE]

>> Yeah.

>> [INAUDIBLE]

>> Yeah, that's another research problem which is not yet solved. Again, these things have to be solved in order to make LOD even more valuable, right? So in fact none of these am I going to cover in this talk at all. But I'm going to keep agreeing with all these problems that you raise. Yes, yes, these are all research problems. Right.

>> We have to move on.

>> Yeah, we'll move on, all right? So clearly one of the big challenges is that these graphs really are big. Okay? Big, massive, really massive. You can add any of these fancy adjectives. And more importantly, beyond just the graphs getting larger, the queries that you fire on these graphs are getting more and more interesting and exciting, and harder to evaluate. So one is, you can have the standard pattern-like queries.
Which, as I think some of you already said, are similar to SQL with a few joins; yeah, SQL is really tuned for join queries. But you can actually go into a little more complicated things in these graphs, and say I want to do recursive traversals with unbounded recursion, right? I just want to keep computing reachabilities, right? Then I have no good solution which is really efficient. Right? This is allowed in the SPARQL extensions people are talking about. And then you have, on these graphs, many analytic queries that you can run; I mean, all of you are already familiar with PageRank. PageRank is an analytic query which runs on the entire graph and tries to rank, based on random walk potentials, right? [NOISE] You can also think of: first fire one query, get a subgraph out, some ad hoc subset of the entire LOD, and then from this I will extract subgraphs, or some kind of k-core decomposition of this, or compute Steiner trees, right? If you're familiar with keyword search on graphs, this is basically what they do. They fire some initial set of queries which will filter out certain nodes, and then try to connect them up using Steiner trees. Again, these are not so easy to solve using relational databases.
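The "unbounded recursion" reachability query mentioned here can be sketched directly on the bag-of-edges view: follow edges from a start node until no new nodes appear, with no bound on path length. The graph below is made up for illustration; in SPARQL 1.1 this corresponds to a property path like `ex:cites+`.

```python
# Sketch of unbounded reachability over a bag of triples: find every node
# reachable from a start node by any number of edge hops. The tiny graph
# here is hypothetical, just to show the shape of the traversal.
from collections import deque

edges = [
    ("a", "cites", "b"), ("b", "cites", "c"),
    ("c", "cites", "d"), ("x", "cites", "y"),
]

def reachable(start):
    """Breadth-first traversal over (subject, predicate, object) triples."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for s, _p, o in edges:
            if s == node and o not in seen:
                seen.add(o)
                queue.append(o)
    return seen - {start}

print(sorted(reachable("a")))  # -> ['b', 'c', 'd']
```

The difficulty the speaker points at is that this is trivial on a toy graph but has no efficient general solution at 30-billion-edge scale, which is why it strains both relational engines and proposed SPARQL extensions.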
The solution most of the current approaches use is: let's take this graph and put it in memory and deal with it there, okay? While it is a valid solution, given that memory is getting cheaper, it's still not cheap enough to hold 30 billion triples. Okay? So we need to come up with better strategies for dealing with it.

>> Isn't this a concept of, uh, [INAUDIBLE] core assets and roots, in the [INAUDIBLE] graph?

>> Yes. So you can always look at frequent subgraphs.

>> Right?

>> So in fact there's another, as they say, some kind of.

>> [INAUDIBLE] [SOUND] Can you relax the edges also?

>> So you can relax, you can relax the labels on the edges and say I don't care about the edges, just any kind of structure. Or you can say, I will include just the labels, but I relax on the subjects and objects that are there. So people have worked on these kinds of.

>> [INAUDIBLE] subgraphs, and you can also have interesting subgraphs.

>> Interesting subgraphs. Right?
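The relaxation idea discussed here, keep the structure but ignore edge labels, or keep the labels but ignore the endpoints, can be sketched as wildcard matching over triple patterns. The triples and the `?` wildcard convention below are illustrative assumptions, not a real query language.

```python
# Sketch of pattern relaxation over triples: any position of a
# (subject, label, object) pattern can be the wildcard '?', so you can
# "relax the edges" or relax the subjects and objects. The data and the
# '?' convention are made up for illustration.

triples = [
    ("alice", "knows", "bob"),
    ("alice", "cites", "bob"),
    ("bob", "knows", "carol"),
]

def match(pattern):
    """Return triples matching a pattern; '?' matches anything."""
    return [t for t in triples
            if all(p == "?" or p == v for p, v in zip(pattern, t))]

print(match(("alice", "?", "bob")))  # relax the edge label
print(match(("?", "knows", "?")))    # keep the label, relax the endpoints
```

Frequent-subgraph mining applies the same idea one level up: instead of matching one relaxed pattern, it counts which relaxed patterns recur often across the graph.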