So now, since we talked about RDF and said it is large, we really have to see how large it is. Is "large" just a claim, just something you say to get more grants, or are there really large data sets? The big thing in linked data and Linked Open Data is that it is really seen as a bag of edges; that's basically what we said. You can throw in any kind of edge. You can just keep adding edges; there is no constraint on schema, no structure you have to follow. The least you need is a subject, some label on the edge, and a target. That's basically all you want. Given that, you can go across various domains, starting from biology to movies, and in fact you do go from biology to movies. Look at LOD, that is, the Linked Open Data consortium. What it is trying to do is take such linked data sources from different domains, put them all together, and link them up. You may have a situation where one of the actors happened to write a PubMed article. It's perfectly possible, although very rare. You want to link this up. You want to link it up if there is a politician who is also a chemistry scientist and has written many journal papers. You may want to know this.
>> Right?
>> [INAUDIBLE] Construction issues at that level. Right? So, at this moment, people are not really worried about that. There are two types of construction issues. You are right. One is that you have these textual sources, where people mostly report things, and you want to move from there to this kind of structure. That's one kind of construction issue people are facing. In fact Yago, DBpedia, Freebase, TextRunner, Cyc: these are all efforts which have focused on that aspect, where you have text and you want to extract it into some structured form. But what LOD is really all about is not caring how the sources have been derived, but putting them together.
>> Oh, in that case, [INAUDIBLE].
>> [COUGH] [INAUDIBLE] x, random y, which is like a big chunk in, for example.
>> Yes.
>> So the subject is x, random y, [INAUDIBLE] now that [INAUDIBLE] Precisely, yes.
>> So, even after extraction.
>> Yes.
>> It is problematic.
>> Yeah. So that's the second kind of construction problem I was talking about. In fact, the example you gave is already a complex one; I can take a much simpler example. Somebody is listed in PubMed as Mr. Howard. Right? H. Howard. Okay, his name is H. Howard. I know there is another person in IMDB who is Harvey Howard. How do I know whether this Harvey Howard and that H. Howard are the same? Whether they are different, I have no idea. So there is a requirement that every such entity, whether it's a node or an edge, has to be uniquely identified. That's the basis on which the semantic web can function; that's the way LOD can function. But how do I assign such a unique identifier? How do I know that these two are the same, so that I shouldn't give them separate identifiers? That's a huge problem. Yeah?
>> [INAUDIBLE]
>> No, the Google Knowledge Graph uses a much simpler solution for this. It minimizes the number of resources it draws from, and it uses some standard machine learning techniques for relatively easy matching over the web. At this moment, at least. And there is a fair amount of human curation also going on.
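A minimal sketch of the "bag of edges" view described above: each fact is just a (subject, edge label, object) triple, and facts from completely different domains (movies, PubMed) can sit in the same bag with no schema. All names and identifiers here are made-up placeholders, not real LOD IRIs.

```python
# Toy "bag of edges": cross-domain triples with no fixed schema.
triples = [
    ("ex:Harvey_Howard", "ex:actedIn",    "ex:Some_Movie"),
    ("ex:Harvey_Howard", "ex:authorOf",   "pubmed:1234567"),
    ("pubmed:1234567",   "ex:about",      "ex:Gene_BRCA1"),
    ("ex:Some_Movie",    "ex:releasedIn", "2009"),
]

def match(pattern, data):
    """Return all triples matching a pattern; None acts as a wildcard."""
    return [t for t in data
            if all(p is None or p == v for p, v in zip(pattern, t))]

# "What is Harvey Howard connected to?" -- no schema needed, just edges.
print(match(("ex:Harvey_Howard", None, None), triples))
```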
In fact I was just talking about that, even.
>> [INAUDIBLE]
>> Yeah, yeah. In fact, many of these sources, the yellow ones that I have here.
>> [INAUDIBLE]
>> Yeah. Okay. So Yago, Freebase, and DBpedia are all largely built on Wikipedia. In fact, if Wikipedia did not exist in such a rich form, you wouldn't have many of these resources. But LOD goes beyond just these: not just the things that have come from Wikipedia, but way beyond it, because there is a whole bunch of biological facts, DBLP citation graphs; these are not part of Wikipedia. They have come from different sources.
>> [INAUDIBLE]
>> Yes, that's what I was about to tell him. Although the Google Knowledge Graph does so many things nicely, it can trip up quite badly because of exactly this problem. I was giving the example of fff; all of you can go back and try it out. Fire the query for fff. He is one of the well-known figures in computer science, in database work. Look at the kind of snippet you get on the right side, which is powered by the knowledge graph. It perfectly matches what you know of fff.
>> But if you look at the picture associated with it, it's someone else. He's from chemistry. He's also called fff.
>> In fact, [LAUGH] [INAUDIBLE] he doesn't look like [FOREIGN] at all, right? So how do you believe that everything you get is perfect? They try their best, but these are recent challenges which are still open, and there are many, many such research challenges.
>> iii It is not a small problem.
>> It is not at all a small problem. No, no, no. Entity resolution. People know that entity resolution is not a small problem in noisy text, like Twitter. Yeah, sure. But I'm talking about a perfectly well handcrafted source like Wikipedia. Even there things are not easy. Extremely messy. Okay? So that's only one part of the problem you will face when you go into the linked open data world. And the size of the linked open data at this moment, with all these errors in it, is about 30 billion edges. Fairly large. Just compare it with what Facebook recently announced: one billion users. One billion is already very exciting for many people, and this is 30 billion triples, with so much noise already in them. And clearly all of those one billion users could be part of such a graph; every human being, in principle, can be part of the LOD. All right. So we have a.
>> [INAUDIBLE]
>> There is no superimposed schema, but most of these ontologies are already part of LOD.
>> [INAUDIBLE] What do you mean by that?
>> I mean, for example, [INAUDIBLE] you can write any [INAUDIBLE] you want [INAUDIBLE].
>> Yes.
>> [INAUDIBLE]
>> Yeah.
>> [INAUDIBLE] That is a research problem. Another research problem, right?
>> [INAUDIBLE]
>> Yeah.
>> [INAUDIBLE]
>> Yeah, that's another research problem which is not yet solved. Again, these things have to be solved in order to make LOD even more valuable. In fact, none of these am I going to cover in this talk at all, but I'm going to keep agreeing with all these problems that you raise: yes, yes, these are all research problems. Right.
>> We have to move on.
>> Yeah, we'll move on, all right? So clearly one of the big challenges is that these graphs are really big. Massive. Really massive. You can add any of these fancy adjectives.
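To make the entity-resolution point concrete, here is a toy illustration (not how Google or any production system actually does it) of deciding whether the hypothetical "H. Howard" in PubMed and "Harvey Howard" in IMDB refer to the same person. The names and the 0.6 threshold are arbitrary placeholders for this sketch; real systems combine many more signals than string similarity.

```python
# Crude name-matching sketch: string similarity alone is nowhere near enough,
# which is exactly why entity resolution stays a hard, open problem.
from difflib import SequenceMatcher

def name_similarity(a: str, b: str) -> float:
    """Return a similarity score in [0, 1] between two name strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

pubmed_name = "H. Howard"     # hypothetical author name from PubMed
imdb_name = "Harvey Howard"   # hypothetical actor name from IMDB

score = name_similarity(pubmed_name, imdb_name)
same_entity = score > 0.6     # arbitrary threshold for the sketch
print(score, same_entity)
```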
And more important than just the graphs getting larger, the queries that you fire on these graphs are getting more interesting, more exciting, and harder to evaluate. One kind is the standard pattern-like queries, which, as I think some of you already said, are similar to SQL with a few joins; yes, essentially SQL join queries. But you can actually go into somewhat more complicated things on these graphs and say, I want recursive traversals with unbounded recursion; I just want to keep computing reachabilities. For that I have no good solution which is really efficient. This is allowed in the SPARQL extensions people are talking about. And then there are many analytic queries on these graphs. All of you are already familiar with PageRank. PageRank is an analytic query which runs on the entire graph and tries to rank nodes based on random walk probabilities. [NOISE] You can also think of this: first fire one query and get a subgraph out, some ad hoc subset of the entire LOD. Then, from this, extract subgraphs, or do some kind of k-core decomposition of it, or compute Steiner trees. If you're familiar with keyword search on graphs, this is basically what they do: they fire some initial set of queries which filter out certain nodes, and then they try to connect those nodes up using Steiner trees. Again, these are not so easy to solve using relational databases. The solution most current approaches use is: let's take this graph, put it in memory, and deal with it there. While that is a valid solution, given that memory is getting cheaper, it's still not cheap enough to hold 30 billion triples. So we need to come up with better strategies for dealing with it.
>> Isn't this the concept of, uh, [INAUDIBLE] core assets and roots, in the [INAUDIBLE] graph?
>> Yes. You can always look at frequent subgraphs.
>> Right? So in fact there's another, as they say, some kind of.
>> [INAUDIBLE] [SOUND] Do you always consider the edges also?
>> So you can relax the labels on the edges and say, I don't care about the edge labels, just the structure. Or you can say, I will keep just the labels but relax on the subjects and objects that are there. People have worked on these kinds of.
>> [INAUDIBLE] subgraphs, and you can also have interesting subgraphs.
>> Interesting subgraphs. Right?
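A minimal sketch of the analytic queries mentioned above (reachability, PageRank, k-core decomposition, and Steiner-tree-style connection of keyword hits), using networkx on a tiny toy graph. The node names are made up for illustration, and this is exactly the in-memory approach the talk says does not scale to 30 billion triples.

```python
import networkx as nx
from networkx.algorithms.approximation import steiner_tree

# Toy directed graph standing in for a small slice of an RDF/LOD graph.
D = nx.DiGraph()
D.add_edges_from([
    ("paper1", "author_a"), ("paper1", "author_b"),
    ("author_a", "movie_x"), ("movie_x", "actor_c"),
    ("author_b", "actor_c"), ("actor_c", "city_y"),
])

# 1. Reachability: everything transitively reachable from "paper1"
#    (the unbounded-recursion style of query).
reachable = nx.descendants(D, "paper1")

# 2. PageRank: rank nodes by random-walk probability over the whole graph.
ranks = nx.pagerank(D, alpha=0.85)

# 3. k-core decomposition on the undirected view of the graph.
U = D.to_undirected()
core2 = nx.k_core(U, k=2)

# 4. Steiner tree connecting a set of "keyword hit" nodes, as in keyword
#    search over graphs (approximation algorithm, unit edge weights here).
terminals = ["paper1", "actor_c", "city_y"]
tree = steiner_tree(U, terminals)

print(sorted(reachable))
print(max(ranks, key=ranks.get))
print(list(core2.nodes()), list(tree.edges()))
```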