So now, since we talked about RDF and we said this is large, we really have to see how large this is. Okay, is "large" just a saying, or just because I want to get more grants, or are there really large data sets? So the big thing in linked data and linked open data is that it is really seen as a bag of edges, right? That's basically what we said. Now we can throw in any kind of edge, right? You can just keep putting edges in; you don't have a constraint on schema. You don't have to follow some structure. The least that you need is a subject, some label on the edge, and the target. That's basically all you want. So given that, you can go across various domains, starting from biology to movies, right, and in fact you do go from biology to movies. You look at the LOD, that is, the Linked Open Data consortium. What it is trying to do is look for such different linked data sources from different domains, put them all together, link them up, right? You may have a situation where one of the actors happened to write a PubMed article. It's perfectly possible, although very rare. You want to link this up, okay. You want to link up if there is a politician who is a chemistry scientist and has written many journal papers. You may want to know this.
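The "bag of edges" view described here can be sketched in a few lines: a dataset is just a set of (subject, predicate, object) triples with no schema, so edges from completely different domains coexist freely. All the identifiers below are made up for illustration.

```python
# A minimal sketch of the "bag of edges" view of RDF: a dataset is just
# a set of (subject, predicate, object) triples, with no imposed schema.
# All identifiers here are hypothetical, for illustration only.

triples = {
    ("ex:HarveyHoward", "ex:actedIn", "ex:SomeMovie"),        # movie domain
    ("ex:HarveyHoward", "ex:authorOf", "ex:PubMedArticle1"),  # biology domain
    ("ex:PubMedArticle1", "ex:publishedIn", "ex:PubMed"),
}

# Because there is no schema, linking in a fact from another domain
# is just another set insertion:
triples.add(("ex:SomePolitician", "ex:authorOf", "ex:ChemJournalPaper7"))

# Querying is pattern matching over the bag, e.g. all outgoing edges:
def outgoing(subject):
    return [(p, o) for (s, p, o) in triples if s == subject]

print(sorted(outgoing("ex:HarveyHoward")))
```

This is exactly why cross-domain linking is cheap here: nothing about the structure has to be declared in advance.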
>> Right?

>> [INAUDIBLE] Construction issues at that level.

>> Right? So, at this moment, what people are trying is something where they're not really worried about that. So there are two types of construction issues. You are right. One is, you have these textual sources where people mostly report things, and you want to move from there to this kind of structure. That's one kind of construction issue that people are facing. In fact Yago, DBpedia, Freebase, TextRunner, Cyc, these are all efforts which have focused on that aspect, where you have text and you want to extract it into some structured form. But what this LOD is really all about is to not really care about how they have been derived, but to put them together.

>> Oh, in that case, [INAUDIBLE].

>> [COUGH] [INAUDIBLE] x, random y, which is like a big chunk in, for example.

>> Yes.

>> So the subject is x, random y, [INAUDIBLE] now that [INAUDIBLE]

>> Precisely, yes.

>> So, even after extraction.

>> Yes.

>> It is problematic.

>> Yeah. So that's the second kind of construction problem I was talking about. So, in fact, even the example that you gave is already a complex one. I can take a much simpler example. So somebody in PubMed is called Mr. Howard.
Right? Hetch Howard. Okay, his name is Hetch Howard. I know there is another guy in IMDB who is Harvey Howard. How do I know whether this Harvey Howard and Hetch Howard are the same? Whether they are different, I have no idea, right? So there is a requirement that every such entity, whether it's a node or an edge, has to be uniquely identified. That's the basis of the way the semantic web can function. That's the way LOD can function. But how do I get, and assign, such a unique identifier? How do I know that these two are the same, so I shouldn't just give them separate identifiers? So, that's a huge problem, and, yeah?

>> [INAUDIBLE]

>> No, Google Knowledge Base has a much simpler solution for this. It minimizes the number of sources from where they take data, right, and they use some standard machine learning techniques. Through the web, relatively easy matching. All right? At this moment, at least. Okay? And there is a huge amount of human curation also going on. In fact I was just talking about that, even.

>> [INAUDIBLE]

>> Yeah, yeah. In fact, many of these sources, right, the yellow things that I have here.

>> [INAUDIBLE]

>> Yeah. Yeah. Okay. So Yago, Freebase, and DBpedia. All of them are largely built on Wikipedia.
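The entity-resolution question posed here, are "Hetch Howard" and "Harvey Howard" the same person, can be sketched with a toy string-similarity heuristic. The threshold and the similarity measure below are illustrative choices only; real systems combine many such signals with machine learning, not a single string score.

```python
# A toy sketch of the entity-resolution problem: decide whether two name
# strings from different sources refer to the same entity. The 0.85
# threshold and the use of plain string similarity are assumptions made
# for illustration, not how any production system actually works.
from difflib import SequenceMatcher

def same_entity(name_a, name_b, threshold=0.85):
    """Crude heuristic: high string similarity => probably the same entity."""
    score = SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio()
    return score >= threshold, round(score, 2)

print(same_entity("Hetch Howard", "Harvey Howard"))
print(same_entity("Hetch Howard", "Hetch Howard"))
```

Even this tiny example shows why the problem is hard: the two Howards share a surname, so naive similarity scores land in an ambiguous middle zone where neither "same" nor "different" is safe to conclude.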
In fact they wouldn't have existed; if Wikipedia did not exist in such a rich form, then you wouldn't have many of these resources. But LOD is beyond just these. Right? Not just the things that have come from Wikipedia, but way beyond it, because there is a whole bunch of biological facts, DBLP citation graphs; these are not part of Wikipedia. These have come from different sources.

>> [INAUDIBLE]

>> Yes, that's what I was about to tell him. That although Google Knowledge Base, the knowledge graph, does so many things nicely, it can trip up quite badly because of exactly this problem. So I was giving the example of fff. You fire the query; all of you can go back and try it out. You'll find the query for fff. Right? He is one of the well-known figures in computer science, in database work. So you look at the kind of snippet that you get on the right side, which is powered by the knowledge graph. It's perfectly matching what you know of fff.

>> But you look at the picture that is associated with it, it's someone else. He's from chemistry. He's also called fff.

>> In fact, [LAUGH] [INAUDIBLE] he doesn't look like [FOREIGN] at all, right? So there is no, I mean, how do you believe that everything that you get is perfect? Right?
They try their best, but these are research challenges which are still open, and there are many, many such research challenges.

>> iii It is not a small problem.

>> It is not at all a small problem. No, no, no. Entity resolution. People know that entity resolution is not a small problem in noisy text, like Twitter. Yeah, sure, there we don't know. But I'm talking about a perfectly well handcrafted source like Wikipedia. Even there things are not easy. Extremely messy. Okay? So that's only one part of the problem that you will face when you go to the linked open data world. And the size of the linked open data at this moment, with all these errors in it, is about 30 billion edges. Okay? Fairly large. Just compare it with something like what Facebook already announced: one billion users. One billion is already very exciting for many people. And this is 30 billion triples, with so much noise already, and clearly all those one billion have to be part of such a graph, right; every human being in principle can be part of the LOD. All right. So we have a.

>> [INAUDIBLE].

>> There is no superimposed schema, but most of these ontologies are already part of LOD.

>> [INAUDIBLE] What do you mean by that?
>> I mean, for example, [INAUDIBLE] you can write any [INAUDIBLE] you want [INAUDIBLE].

>> Yes.

>> [INAUDIBLE].

>> Yeah.

>> [INAUDIBLE] are a research problem.

>> Another research problem, right?

>> [INAUDIBLE]

>> Yeah.

>> [INAUDIBLE]

>> Yeah, that's another research problem which is not yet solved. Again, these things have to be solved in order to make LOD even more valuable, right? So in fact none of these am I going to cover in this talk at all. But I'm going to keep agreeing with all these problems that you raise. Yes, yes, these are all research problems. Right.

>> We have to move on.

>> Yeah, we'll move on, all right? So clearly one of the big challenges is that these graphs really are big. Okay? Big, massive, really massive. You can add any of these fancy adjectives. And more importantly, beyond just the graphs getting larger, the queries that you fire on these graphs are getting more and more interesting and exciting, and harder to evaluate. So one is, you can have the standard pattern-like queries.
Which, as I think some of you already said, are similar to SQL with a few joins; yeah, SQL is really tuned for join queries. But you can actually go into a little more complicated things in these graphs, and say I want to do recursive traversals with unbounded recursion, right? I just want to keep computing reachabilities, right? Then I have no good solution which is really efficient. Right? This is allowed in the SPARQL extensions people are talking about. And then you have, on these graphs, many analytic queries that you can run; I mean, all of you are already familiar with PageRank. PageRank is an analytic query which runs on the entire graph and tries to rank, based on random walk potentials, right? [NOISE] You can also think of: first fire one query, get a subgraph out, some ad hoc subset of the entire LOD, and then from this I will extract subgraphs, or some kind of k-core decomposition of this, or compute Steiner trees, right? If you're familiar with keyword search on graphs, this is basically what they do. They fire some initial set of queries which will filter out certain nodes, and then try to connect them up using Steiner trees. Again, these are not so easy to solve using relational databases.
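The "unbounded recursion" reachability query mentioned here can be sketched directly on the bag-of-edges view: follow edges from a start node until no new nodes appear, with no bound on path length. The graph below is made up for illustration; in SPARQL 1.1 this corresponds to a property path like `ex:cites+`.

```python
# Sketch of unbounded reachability over a bag of triples: find every node
# reachable from a start node by any number of edge hops. The tiny graph
# here is hypothetical, just to show the shape of the traversal.
from collections import deque

edges = [
    ("a", "cites", "b"), ("b", "cites", "c"),
    ("c", "cites", "d"), ("x", "cites", "y"),
]

def reachable(start):
    """Breadth-first traversal over (subject, predicate, object) triples."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for s, _p, o in edges:
            if s == node and o not in seen:
                seen.add(o)
                queue.append(o)
    return seen - {start}

print(sorted(reachable("a")))  # -> ['b', 'c', 'd']
```

The difficulty the speaker points at is that this is trivial on a toy graph but has no efficient general solution at 30-billion-edge scale, which is why it strains both relational engines and proposed SPARQL extensions.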
The solution most of the current approaches use is: let's take this graph and put it in memory and deal with it there, okay? While it is a valid solution, given that memory is getting cheaper, it's still not cheap enough to hold 30 billion triples. Okay? So we need to come up with better strategies for dealing with it.

>> Isn't this a concept of, uh, [INAUDIBLE] core assets and roots, in the [INAUDIBLE] graph?

>> Yes. So you can always look at frequent subgraphs.

>> Right?

>> So in fact there's another, as they say, some kind of.

>> [INAUDIBLE] [SOUND] Can you relax the edges also?

>> So you can relax, you can relax the labels on the edges and say I don't care about the edges, just any kind of structure. Or you can say, I will include just the labels, but I relax on the subjects and objects that are there. So people have worked on these kinds of.

>> [INAUDIBLE] subgraphs, and you can also have interesting subgraphs.

>> Interesting subgraphs. Right?
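The relaxation idea discussed here, keep the structure but ignore edge labels, or keep the labels but ignore the endpoints, can be sketched as wildcard matching over triple patterns. The triples and the `?` wildcard convention below are illustrative assumptions, not a real query language.

```python
# Sketch of pattern relaxation over triples: any position of a
# (subject, label, object) pattern can be the wildcard '?', so you can
# "relax the edges" or relax the subjects and objects. The data and the
# '?' convention are made up for illustration.

triples = [
    ("alice", "knows", "bob"),
    ("alice", "cites", "bob"),
    ("bob", "knows", "carol"),
]

def match(pattern):
    """Return triples matching a pattern; '?' matches anything."""
    return [t for t in triples
            if all(p == "?" or p == v for p, v in zip(pattern, t))]

print(match(("alice", "?", "bob")))  # relax the edge label
print(match(("?", "knows", "?")))    # keep the label, relax the endpoints
```

Frequent-subgraph mining applies the same idea one level up: instead of matching one relaxed pattern, it counts which relaxed patterns recur often across the graph.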