So now, since we talked about RDF and said it is large, we really have to see how large it is. Is "large" just a claim, just something you say to get more grants, or are there really large data sets? The big thing in linked data and Linked Open Data is that it is really seen as a bag of edges; that's basically what we said. You can throw in any kind of edge. You can just keep adding edges; there is no constraint on schema, no structure you have to follow. The least you need is a subject, some label on the edge, and a target. That's basically all you want. Given that, you can go across various domains, starting from biology to movies, and in fact you do go from biology to movies. Look at LOD, that is, the Linked Open Data consortium. What it is trying to do is take such linked data sources from different domains, put them all together, and link them up. You may have a situation where one of the actors happened to write a PubMed article. It's perfectly possible, although very rare. You want to link this up. You want to link it up if there is a politician who is also a chemistry scientist and has written many journal papers. You may want to know this.
>> Right?
>> [INAUDIBLE] Construction issues at that level. Right? So, at this moment, people are not really worried about that. There are two types of construction issues. You are right. One is that you have these textual sources, where people mostly report things, and you want to move from there to this kind of structure. That's one kind of construction issue people are facing. In fact Yago, DBpedia, Freebase, TextRunner, Cyc: these are all efforts which have focused on that aspect, where you have text and you want to extract it into some structured form. But what LOD is really all about is not caring how the sources have been derived, but putting them together.
>> Oh, in that case, [INAUDIBLE].
>> [COUGH] [INAUDIBLE] x, random y, which is like a big chunk in, for example.
>> Yes.
>> So the subject is x, random y, [INAUDIBLE] now that [INAUDIBLE] Precisely, yes.
>> So, even after extraction.
>> Yes.
>> It is problematic.
>> Yeah. So that's the second kind of construction problem I was talking about. In fact, the example you gave is already a complex one; I can take a much simpler example. Somebody is listed in PubMed as Mr. Howard. Right? H. Howard. Okay, his name is H. Howard. I know there is another person in IMDB who is Harvey Howard. How do I know whether this Harvey Howard and that H. Howard are the same? Whether they are different, I have no idea. So there is a requirement that every such entity, whether it's a node or an edge, has to be uniquely identified. That's the basis on which the semantic web can function; that's the way LOD can function. But how do I assign such a unique identifier? How do I know that these two are the same, so that I shouldn't give them separate identifiers? That's a huge problem. Yeah?
>> [INAUDIBLE]
>> No, the Google Knowledge Graph uses a much simpler solution for this. It minimizes the number of resources it draws from, and it uses some standard machine learning techniques for relatively easy matching over the web. At this moment, at least. And there is a fair amount of human curation also going on.
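A minimal sketch of the "bag of edges" view described above: each fact is just a (subject, edge label, object) triple, and facts from completely different domains (movies, PubMed) can sit in the same bag with no schema. All names and identifiers here are made-up placeholders, not real LOD IRIs.

```python
# Toy "bag of edges": cross-domain triples with no fixed schema.
triples = [
    ("ex:Harvey_Howard", "ex:actedIn",    "ex:Some_Movie"),
    ("ex:Harvey_Howard", "ex:authorOf",   "pubmed:1234567"),
    ("pubmed:1234567",   "ex:about",      "ex:Gene_BRCA1"),
    ("ex:Some_Movie",    "ex:releasedIn", "2009"),
]

def match(pattern, data):
    """Return all triples matching a pattern; None acts as a wildcard."""
    return [t for t in data
            if all(p is None or p == v for p, v in zip(pattern, t))]

# "What is Harvey Howard connected to?" -- no schema needed, just edges.
print(match(("ex:Harvey_Howard", None, None), triples))
```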
In fact I was just talking about that, even.
>> [INAUDIBLE]
>> Yeah, yeah. In fact, many of these sources, the yellow ones that I have here.
>> [INAUDIBLE]
>> Yeah. Okay. So Yago, Freebase, and DBpedia are all largely built on Wikipedia. In fact, if Wikipedia did not exist in such a rich form, you wouldn't have many of these resources. But LOD goes beyond just these: not just the things that have come from Wikipedia, but way beyond it, because there is a whole bunch of biological facts, DBLP citation graphs; these are not part of Wikipedia. They have come from different sources.
>> [INAUDIBLE]
>> Yes, that's what I was about to tell him. Although the Google Knowledge Graph does so many things nicely, it can trip up quite badly because of exactly this problem. I was giving the example of fff; all of you can go back and try it out. Fire the query for fff. He is one of the well-known figures in computer science, in database work. Look at the kind of snippet you get on the right side, which is powered by the knowledge graph. It perfectly matches what you know of fff.
>> But if you look at the picture associated with it, it's someone else. He's from chemistry. He's also called fff.
>> In fact, [LAUGH] [INAUDIBLE] he doesn't look like [FOREIGN] at all, right? So how do you believe that everything you get is perfect? They try their best, but these are recent challenges which are still open, and there are many, many such research challenges.
>> iii It is not a small problem.
>> It is not at all a small problem. No, no, no. Entity resolution. People know that entity resolution is not a small problem in noisy text, like Twitter. Yeah, sure. But I'm talking about a perfectly well handcrafted source like Wikipedia. Even there things are not easy. Extremely messy. Okay? So that's only one part of the problem you will face when you go into the linked open data world. And the size of the linked open data at this moment, with all these errors in it, is about 30 billion edges. Fairly large. Just compare it with what Facebook recently announced: one billion users. One billion is already very exciting for many people, and this is 30 billion triples, with so much noise already in them. And clearly all of those one billion users could be part of such a graph; every human being, in principle, can be part of the LOD. All right. So we have a.
>> [INAUDIBLE]
>> There is no superimposed schema, but most of these ontologies are already part of LOD.
>> [INAUDIBLE] What do you mean by that?
>> I mean, for example, [INAUDIBLE] you can write any [INAUDIBLE] you want [INAUDIBLE].
>> Yes.
>> [INAUDIBLE]
>> Yeah.
>> [INAUDIBLE] That is a research problem. Another research problem, right?
>> [INAUDIBLE]
>> Yeah.
>> [INAUDIBLE]
>> Yeah, that's another research problem which is not yet solved. Again, these things have to be solved in order to make LOD even more valuable. In fact, none of these am I going to cover in this talk at all, but I'm going to keep agreeing with all these problems that you raise: yes, yes, these are all research problems. Right.
>> We have to move on.
>> Yeah, we'll move on, all right? So clearly one of the big challenges is that these graphs are really big. Massive. Really massive. You can add any of these fancy adjectives.
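To make the entity-resolution point concrete, here is a toy illustration (not how Google or any production system actually does it) of deciding whether the hypothetical "H. Howard" in PubMed and "Harvey Howard" in IMDB refer to the same person. The names and the 0.6 threshold are arbitrary placeholders for this sketch; real systems combine many more signals than string similarity.

```python
# Crude name-matching sketch: string similarity alone is nowhere near enough,
# which is exactly why entity resolution stays a hard, open problem.
from difflib import SequenceMatcher

def name_similarity(a: str, b: str) -> float:
    """Return a similarity score in [0, 1] between two name strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

pubmed_name = "H. Howard"     # hypothetical author name from PubMed
imdb_name = "Harvey Howard"   # hypothetical actor name from IMDB

score = name_similarity(pubmed_name, imdb_name)
same_entity = score > 0.6     # arbitrary threshold for the sketch
print(score, same_entity)
```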
And more important than just the graphs getting larger, the queries that you fire on these graphs are getting more interesting, more exciting, and harder to evaluate. One kind is the standard pattern-like queries, which, as I think some of you already said, are similar to SQL with a few joins; yes, essentially SQL join queries. But you can actually go into somewhat more complicated things on these graphs and say, I want recursive traversals with unbounded recursion; I just want to keep computing reachabilities. For that I have no good solution which is really efficient. This is allowed in the SPARQL extensions people are talking about. And then there are many analytic queries on these graphs. All of you are already familiar with PageRank. PageRank is an analytic query which runs on the entire graph and tries to rank nodes based on random walk probabilities. [NOISE] You can also think of this: first fire one query and get a subgraph out, some ad hoc subset of the entire LOD. Then, from this, extract subgraphs, or do some kind of k-core decomposition of it, or compute Steiner trees. If you're familiar with keyword search on graphs, this is basically what they do: they fire some initial set of queries which filter out certain nodes, and then they try to connect those nodes up using Steiner trees. Again, these are not so easy to solve using relational databases. The solution most current approaches use is: let's take this graph, put it in memory, and deal with it there. While that is a valid solution, given that memory is getting cheaper, it's still not cheap enough to hold 30 billion triples. So we need to come up with better strategies for dealing with it.
>> Isn't this the concept of, uh, [INAUDIBLE] core assets and roots, in the [INAUDIBLE] graph?
>> Yes. You can always look at frequent subgraphs.
>> Right? So in fact there's another, as they say, some kind of.
>> [INAUDIBLE] [SOUND] Do you always consider the edges also?
>> So you can relax the labels on the edges and say, I don't care about the edge labels, just the structure. Or you can say, I will keep just the labels but relax on the subjects and objects that are there. People have worked on these kinds of.
>> [INAUDIBLE] subgraphs, and you can also have interesting subgraphs.
>> Interesting subgraphs. Right?
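A minimal sketch of the analytic queries mentioned above (reachability, PageRank, k-core decomposition, and Steiner-tree-style connection of keyword hits), using networkx on a tiny toy graph. The node names are made up for illustration, and this is exactly the in-memory approach the talk says does not scale to 30 billion triples.

```python
import networkx as nx
from networkx.algorithms.approximation import steiner_tree

# Toy directed graph standing in for a small slice of an RDF/LOD graph.
D = nx.DiGraph()
D.add_edges_from([
    ("paper1", "author_a"), ("paper1", "author_b"),
    ("author_a", "movie_x"), ("movie_x", "actor_c"),
    ("author_b", "actor_c"), ("actor_c", "city_y"),
])

# 1. Reachability: everything transitively reachable from "paper1"
#    (the unbounded-recursion style of query).
reachable = nx.descendants(D, "paper1")

# 2. PageRank: rank nodes by random-walk probability over the whole graph.
ranks = nx.pagerank(D, alpha=0.85)

# 3. k-core decomposition on the undirected view of the graph.
U = D.to_undirected()
core2 = nx.k_core(U, k=2)

# 4. Steiner tree connecting a set of "keyword hit" nodes, as in keyword
#    search over graphs (approximation algorithm, unit edge weights here).
terminals = ["paper1", "actor_c", "city_y"]
tree = steiner_tree(U, terminals)

print(sorted(reachable))
print(max(ranks, key=ranks.get))
print(list(core2.nodes()), list(tree.edges()))
```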