And so, clearly one of the big challenges is that graphs are really big. Big, massive, really massive; you can add any of these fancy adjectives. And more important than the graphs just getting larger, the queries that you fire on these graphs are getting more and more interesting and exciting, and harder to evaluate.
So one kind is the standard pattern-like queries, which, as some of you already said, are similar to SQL with a few joins. Yeah, SQL has joins, but here it is really done well for graph queries. But you can actually go into more complicated things on these graphs. Say I want traversals with unbounded recursion: I just want to keep following edges transitively, and there is no good solution for that which is really efficient, right? This is the flavor of the SPARQL regular-expression, or property-path, extensions people are talking about.
And then you have, on these graphs, many analytic queries. You are all already familiar with PageRank: basically an analytic where we traverse the entire graph to rank nodes based on random-walk probabilities, right? You can also think of: I'll first fire one query, get a subgraph out, some ad hoc subset of the entire LOD cloud. And then from this I will extract dense subgraphs, or some kind of k-core decomposition of it, or compute Steiner trees, right? If you are familiar with keyword search on graphs, this is basically what they do: they fire some keyword queries which figure out certain nodes, and then try to connect those nodes up using Steiner trees. Again, these are not so easy to solve using relational databases.
The solution most of the current approaches use is: let's take this graph, put it in memory, and deal with it. Okay? While it is a valid solution, given that memory is getting cheaper,
>> It's still not cheap enough to hold 30 billion triples. Okay, so we need to come up with better strategies for dealing with it.
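As a minimal sketch of what such an unbounded-recursion query looks like, here is Python with rdflib, whose SPARQL engine supports property paths; the toy graph and the ex: namespace are invented for illustration.

from rdflib import Graph, Namespace

EX = Namespace("http://example.org/")
g = Graph()
g.add((EX.alice, EX.knows, EX.bob))
g.add((EX.bob, EX.knows, EX.carol))
g.add((EX.carol, EX.knows, EX.dave))

# "knows+" matches a chain of one or more knows-edges: an unbounded
# transitive traversal that a fixed number of SQL self-joins cannot express
q = """
PREFIX ex: <http://example.org/>
SELECT ?reachable WHERE { ex:alice ex:knows+ ?reachable }
"""
for row in g.query(q):
    print(row.reachable)  # bob, carol, dave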
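And a sketch of the extract-then-analyze pipeline just described, assuming networkx; the graph is a stand-in for an extracted subgraph and the "keyword-matched" nodes are made up, but k_core and the approximate steiner_tree are real networkx calls.

import networkx as nx
from networkx.algorithms.approximation import steiner_tree

G = nx.karate_club_graph()      # stand-in for an ad hoc extracted subgraph

# k-core decomposition: keep the maximal subgraph in which every node
# has degree >= 3, a common notion of a "dense" region
core = nx.k_core(G, k=3)
print("3-core nodes:", sorted(core.nodes()))

# keyword search in graphs: keyword queries identify some nodes,
# and a Steiner tree connects them as cheaply as possible
terminals = [0, 16, 33]         # pretend these nodes matched the keywords
tree = steiner_tree(G, terminals)
print("connecting edges:", sorted(tree.edges()))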
>> Is there a concept of, uh, [INAUDIBLE] sets and rules, in the [INAUDIBLE]?
>> Yes. So you can always look at frequent subgraphs.
>> Like, so in fact, there is another, as they say, some kind of
>> [INAUDIBLE]
>> So, you can relax, you can relax the labels on the edges and say: I don't care about the edge labels, just any kind of structure. Or we can say: I will include just the labels, but relax on the subjects and objects that are there. So people have worked on these kinds of...
>> Frequent subgraphs, and then also interesting subgraphs.
>> Interesting subgraphs, okay.
>> [INAUDIBLE]
>> Just to reiterate my point: so far we have never talked about how exactly these are evaluated. So far I have never even mentioned how they're evaluated.
>> [INAUDIBLE]
>> So, on how they get created: you can create a graph in the good old way, where you take your relational database, perform all your joins, and store the result as a graph. Right, I mean, that's one cheap way of answering it. But beyond that, there are so many efforts, coming both from automated methods as well as from manual hand-coding. If you go back and look at DBLP, for example: they maintain huge records of which paper appeared in which conference, which journal, with which authors, and so on. And they, again, wrote scripts over this structured relational table they had, to turn it into triples, and then gave it out as an RDF data set, which is essentially a graph. Right? So that's one method of doing it.
YAGO and Freebase kinds of efforts, sorry, YAGO and DBpedia, are focusing more on: let's take some almost-structured data like Wikipedia, and apply a whole bunch of machine learning tools and natural language processing tools, so that you can extract facts of the form "x is related to y in some way", and then put them into graph form. Right? And Cyc, again, hand-coded much of this knowledge. So, there are many efforts by which these graphs were created. Again, I'm not going too much into any of those. I agree that these are challenges, but I'm looking at a slightly different challenge.
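A minimal sketch of that DBLP-style conversion: a script walks a relational table and emits one triple per fact. The rows and the example.org URI scheme are invented; real DBLP uses its own identifiers.

rows = [  # (paper_id, title, author, venue): an invented relational table
    ("p1", "A Study of Graphs", "Jane Doe", "VLDB"),
    ("p2", "Querying Linked Data", "John Roe", "ISWC"),
]

def uri(kind, value):
    # mint a URI for an entity; the scheme here is purely illustrative
    return f"<http://example.org/{kind}/{value.replace(' ', '_')}>"

for pid, title, author, venue in rows:
    p = uri("paper", pid)
    # each cell of the relational row becomes an edge in the graph
    print(f'{p} <http://example.org/title> "{title}" .')
    print(f"{p} <http://example.org/author> {uri('person', author)} .")
    print(f"{p} <http://example.org/venue> {uri('venue', venue)} .")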
[COUGH] So, just to continue with this line and try to finish it quickly: queries are getting bigger, graphs are getting bigger. And you can say, hey, why not use some magic bullet like Hadoop, right? Google uses it to crunch terabytes, so 30 billion triples should be easy, right? But the problem is that your graphs are not the same as text or tables. Text is what Google uses it on.
So your graphs have complex interconnectivity; who knows what the nodes are, right? What kinds of relationships are there? That's one. And today some connections may not exist, but tomorrow you can just add one edge. It seems like a very harmless little edge, saying Dolph Lundgren and Bruce Willis acted together in The Expendables. They never acted together before. Suddenly, you put these two together and form a connection. So assume you are doing Hadoop, and you have nicely partitioned things: all the guys who work with Bruce Willis here, and I will put these two groups down separately. And suddenly The Expendables comes, and all the partitioning falls apart, right? That is not the case in the application settings where Hadoop is usually used: there, your partitioning is fine, you don't need to worry about it anymore, your partitioning logically stays the same. But here things can get really messy, even on a single static snapshot, as well as when you have this dynamism in your data set. So therefore data partitioning solutions will also need rethinking.
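A toy illustration of that partitioning point, in plain Python; the actors, edges, and two-block split are all made up. The blocks start with zero cross-partition edges, and one new movie ruins it.

edges = [("BruceWillis", "SamuelLJackson"),
         ("DolphLundgren", "SylvesterStallone")]   # invented co-acting edges
partition = {"BruceWillis": 0, "SamuelLJackson": 0,
             "DolphLundgren": 1, "SylvesterStallone": 1}

def edge_cut(es):
    # number of edges whose endpoints live on different machines
    return sum(1 for u, v in es if partition[u] != partition[v])

print(edge_cut(edges))                            # 0: the split looks perfect
edges.append(("DolphLundgren", "BruceWillis"))    # The Expendables is released
print(edge_cut(edges))                            # 1: traversals now cross machines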
Okay, so given all that, it's not like I'm the only guy who has thought about this and is teaching you. There is a whole bunch of people who have worked on it. Okay. There is huge research as well as industry work going on to develop graph data management tools for very different kinds of applications, both generic as well as in the analytics world. So, for example, there are transactional graph database systems, graph data management systems, like Neo4j, Jena, HyperGraphDB, and RDF-3X. They really focus on this LOD kind of setting. Until now, you never had analytics queries there; you just had a pattern coming in, and you match it. Right. But we are saying now that SPARQL is getting richer, with recursive, reasoning-style queries, so transactional GDM has to evolve into something which supports a little more than just transactional pattern matching.
And then you have analytic GDM, analytic graph data management, like Pregel and Giraph, which can be seen as Hadoop-style processing for graphs, with this flavor of reasoning. You can compute PageRank using Pregel. In fact, the original Pregel paper also makes that point: they designed Pregel so that they can do PageRank on massive graphs.
>> [INAUDIBLE]
>> Pregel is not open source. But there is an open-source implementation of it on top of Hadoop, called Giraph.
>> [INAUDIBLE]
>> Giraph is open.
>> [INAUDIBLE]
>> Yeah, it's actually in beta at this moment. But it's not as...
>> [INAUDIBLE]
>> GraphLab, yes. GraphLab I haven't put up there yet because it's not really a GDM, a graph data management system. It is more designed for doing machine learning applications on top of some graphs, without really worrying about... exactly. So that's basically the focus.
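A minimal single-machine sketch of that vertex-centric, superstep-plus-messages model, computing PageRank in plain Python. Real Pregel or Giraph would distribute the vertices across workers; the toy graph and constants here are invented.

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}   # vertex -> out-neighbors
rank = {v: 1.0 / len(graph) for v in graph}
DAMPING, SUPERSTEPS = 0.85, 30

for _ in range(SUPERSTEPS):
    # superstep, part 1: every vertex sends rank/out-degree along its out-edges
    inbox = {v: [] for v in graph}
    for v, out in graph.items():
        for n in out:
            inbox[n].append(rank[v] / len(out))
    # superstep, part 2: every vertex folds its incoming messages into a new value
    rank = {v: (1 - DAMPING) / len(graph) + DAMPING * sum(msgs)
            for v, msgs in inbox.items()}

print(rank)   # converges to the PageRank vector of the toy graph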