So we've now put in a lot of work designing and analyzing super fast algorithms for reasoning about graphs. So in this optional video, what I want to do is show you why you might want such a primitive, especially for computation on extremely large graphs. Specifically, we're going to look at the results of a famous study that computed the strongly connected components of the web graph. So what is the web graph? Well, it's the graph in which the vertices correspond to web pages. So for example, I have my own web page where I, you know, list my research papers and also link to courses such as this one. The edges are directed, and they correspond precisely to hyperlinks, the links that bring you from one web page to another. Note, of course, that these are directed edges, where the tail is the page that contains the hyperlink and the head is the page that you go to if you click the hyperlink, so this is a directed graph. So from my home page you can get to my papers, you can get to my courses, and sometimes I have random links up to things I like, like, say, my favorite record store. And of course for many of these web pages there are additional links going out and going in. So for example, from my papers page I might link to some of my co-authors, some of my co-authors might be linking from their home pages to me, and of course there are web pages out there that list the currently available free online courses, and so on. So obviously this is just a tiny, tiny piece of a massive web graph. The origins of the web were probably around 1990 or so, but it started to really explode in the mid-'90s, and by the year 2000 it was already beyond comprehension. Even though in internet years the year 2000 is sort of the Stone Age relative to right now, relative to 2012, even by 2000 people were so overwhelmed with the massive scale of the web graph that they wanted to understand anything about it, even the most basic things. Of course, one issue with understanding what the graph looks like is that you don't even have it locally, right? It's distributed over all these different servers over the entire world. So the first thing people really focused on when they wanted to answer this question was techniques for crawling: having software which just follows lots of hyperlinks and reports back to home base, from which you can assemble at least some kind of sketch of what this graph actually is. But then the question is, even once you have this crawled information, even once you've accessed a good chunk of the nodes and the edges of this network, what does it look like? So what makes this a more difficult question than for, say, any other directed graph you might encounter? Well, it's simply the massive scale of the web graph; it's just so big. So the graph used in the particular study I'm going to discuss was, like we said, from the year 2000, which was sort of the Stone Age compared to 2012, so the graph was relatively small, but still, the graph was really, really big: something like 200 million nodes and one and a half billion edges. The reference for the work I'm going to discuss is a paper by a number of authors; the first author is Andrei Broder, and he has many co-authors. This was a paper that appeared in the Dub Dub Dub conference of the year 2000.
That's the World Wide Web Conference, and I encourage those of you who are interested to go track down the paper online and read the original source. So Andrei Broder is the lead author. At this time he was at a company called AltaVista. How many of you remember a company called AltaVista? Well, some of you, especially the youngest ones among you, maybe have never heard of AltaVista, and the youngest ones among you maybe can't even conceive of a world in which we didn't have Google. But in fact there was a time when we had web search but Google did not yet exist. That was, you know, maybe 1997 or so. So this was in the very embryonic years of Google, and this data set actually came out of AltaVista instead. So Broder et al. wanted to get some answers to this question: what does the web graph look like? And they approached it in a few ways, but the one I'm going to focus on here is that they asked, well, what's the most detailed structure we can get about this web graph without doing an infeasible amount of computation, really just sticking to linear-time algorithms at the worst? And what have we seen? We've seen that in a directed graph you can get full connectivity information just using depth-first search. You can compute strongly connected components in linear time with small constants, and that's one of the major things that they did in this study. Now, if you wanted to do this same computation today, you'd have one thing going against you and one thing going for you. The obvious thing that you'd have going against you is that the web is still very much bigger than it was here, certainly by an order of magnitude. The thing that you'd have going for you is that now there are specialized systems which are meant to operate on massive data sets, and in particular they can do things like compute connectivity information on graph data. What you have to remember, for those of you who are aware of these terms, is that in 2000 there was no MapReduce, there was no Hadoop, there were no tools for automatically processing large data sets. These guys really had to do it from scratch. So let me tell you about what Broder et al. found when they did strong connectivity computations on the web graph. They explained their results in what they called the bow tie picture of the web. So let's begin with the center of the knot of the bow tie. In the middle we have what we're going to call a giant strongly connected component, with the interpretation being that this is the core of the web in some sense. Alright, so all of you know what an SCC is at this point: a strongly connected component is a region from which you can get from any point to any other point along a directed path. So in the context of the web graph, with this giant SCC, what this means is that from any web page inside this component you can get to any other web page inside this component just by traversing a sequence of hyperlinks. And hopefully it doesn't strike you as too surprising that a big chunk of the web is strongly connected, is well connected in this sense, right? So think about all the different universities in the world, and all of the web pages corresponding to all of those different universities.
You can get from any one of them to any other. For example, the page on which I list my papers often includes links to my co-authors, who very commonly are at other universities, so that already provides a web link from some Stanford page to some page at, say, Berkeley or Cornell or wherever. And of course I'm just one person, just one of many faculty members at Stanford. So you put all of these together, and you would expect all of the different SCCs corresponding to different universities to collapse into a single one, and so on for other sectors as well. And then, of course, once you knew that a huge chunk of the web was in the same strongly connected component, let's say ten percent of the web, which would be tens of millions of web pages, you wouldn't expect there to be a second one, right? It would be super weird if there were two different blobs of ten million web pages each that somehow were not mutually reachable from each other. All it takes to collapse two SCCs into one is a lone arc going from one to the other, and then a lone arc in the reverse direction, and then those two SCCs collapse into one. So we do expect a giant SCC, just sort of thinking anecdotally about what the web looks like, and then once we realize there's one giant SCC, we don't expect there to be more than one. Alright, so is that the whole story? Is the web graph just one big SCC? Well, one of the perhaps interesting findings of this Broder et al. paper is that, yes, there is a giant SCC, but it doesn't actually take up the whole web, or anything really that close to the entire web. So what else would there be in such a picture? Well, there are the other two ends of the bow tie, which are called the in and the out regions. In the out region, you have a bunch of strongly connected components, not giant SCCs (we've established there shouldn't be any other giant SCCs), but small SCCs which you can reach from the giant strongly connected component, but from which you cannot get back to the giant strongly connected component. I encourage you to think about what types of web sites you would expect to see in this out part of the bow tie. I'll give you one example: very often, if you look at a corporate site, including those of well-known corporations, which you would certainly expect to be reachable from the giant SCC, there's actually a corporate policy that no hyperlinks can go from something in the corporate site to something outside the corporate site. So that means the corporate site is going to be a collection of web pages which are certainly strongly connected, and because it's a major corporation you can certainly get there from the giant SCC, but because of this corporate policy you can't get back out. Symmetrically, in the in part of the bow tie, you have a collection of strongly connected components, generally small ones, from which you can reach the giant SCC, but which you cannot get to from the giant SCC. Again, I encourage you to think about all the different types of web pages you might expect to see in this in part of the bow tie. Certainly one really obvious example would be brand new web pages. If I just created a web page and pointed it at Stanford University, that page would immediately be in this in collection of components. Now, if you think about it, this does not exhaust all of the possibilities of where nodes can live. There are a few other cases. They frankly are pretty weird, but they're there.
You can have paths of hyperlinks which bypass the giant SCC and go straight from the in part of the bow tie to the out part; Broder et al. suggested calling these tubes. And then there are also these very curious outgrowths going out of the in component which don't make it all the way to the giant SCC, and similarly there's stuff which goes into the out component; Broder et al. recommended calling these strange creatures tendrils. And then in fact you can also just have some weird isolated islands of SCCs that are not connected, even weakly, to the giant SCC. So this is the picture that emerged from Broder et al.'s strong component computation on the web graph, and here, qualitatively, are some of the main findings that they came up with. First of all, that picture on the previous slide I drew roughly to scale, in the sense that all four parts, the giant SCC, the in part, the out part, and then the residual stuff, the tubes and tendrils, have roughly the same size, more or less 25 percent of the nodes in the graph each. I think this surprised some people. Some people might have thought that the core, the giant SCC, would be a little bit bigger than just 25 or 28 percent, but there is a lot of other stuff outside of the strongly connected core. You might wonder if this is just an artifact of this data set being from the Stone Age, being from 2000 or so, but people have rerun this experiment on the web graph in later years. And of course the numbers are changing because the graph is growing rapidly, but these qualitative findings have seemed pretty stable throughout subsequent re-evaluations of the structure of the web.
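If you'd like to see what a computation in this spirit looks like in code, here is a minimal sketch in Python, assuming a graph small enough to fit in memory as a dictionary of adjacency lists (which the real web graph of course was not; Broder et al. had to do this from scratch at a much larger scale). The names kosaraju_scc and bow_tie are just my labels for this illustration: the first computes strongly connected components with Kosaraju's two-pass, depth-first-search algorithm, and the second takes the largest component as the "giant SCC" and classifies the remaining nodes into the in part, the out part, and everything else.

```python
from collections import defaultdict, deque

def kosaraju_scc(graph):
    """Strongly connected components of a directed graph given as a
    dict {node: list of successors}. Returns a list of sets of nodes."""
    # Gather all nodes (keys plus anything that only appears as a target).
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)

    # Build the reverse graph.
    reverse = defaultdict(list)
    for u, targets in graph.items():
        for v in targets:
            reverse[v].append(u)

    # Pass 1: iterative DFS on the reverse graph, recording finishing order.
    order, visited = [], set()
    for start in nodes:
        if start in visited:
            continue
        visited.add(start)
        stack = [(start, iter(reverse[start]))]
        while stack:
            node, it = stack[-1]
            advanced = False
            for nxt in it:
                if nxt not in visited:
                    visited.add(nxt)
                    stack.append((nxt, iter(reverse[nxt])))
                    advanced = True
                    break
            if not advanced:
                order.append(node)
                stack.pop()

    # Pass 2: DFS on the forward graph in decreasing finishing order;
    # each search from an unvisited start discovers exactly one SCC.
    sccs, visited = [], set()
    for start in reversed(order):
        if start in visited:
            continue
        component, stack = set(), [start]
        visited.add(start)
        while stack:
            node = stack.pop()
            component.add(node)
            for nxt in graph.get(node, ()):
                if nxt not in visited:
                    visited.add(nxt)
                    stack.append(nxt)
        sccs.append(component)
    return sccs


def bow_tie(graph):
    """Split nodes into the giant SCC, the in part (can reach the core),
    the out part (reachable from the core), and everything else
    (tubes, tendrils, and isolated islands lumped together)."""
    reverse = defaultdict(list)
    all_nodes = set(graph)
    for u, targets in graph.items():
        for v in targets:
            reverse[v].append(u)
            all_nodes.add(v)

    # The "core" is simply the largest strongly connected component.
    giant = max(kosaraju_scc(graph), key=len)

    def reachable(adj, sources):
        # Plain BFS from a set of source nodes.
        seen, queue = set(sources), deque(sources)
        while queue:
            u = queue.popleft()
            for v in adj.get(u, ()):
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        return seen

    out_part = reachable(graph, giant) - giant      # reachable from the core
    in_part = reachable(reverse, giant) - giant     # can reach the core
    other = all_nodes - giant - in_part - out_part  # tubes, tendrils, islands
    return giant, in_part, out_part, other
```

Everything here is just depth-first or breadth-first search, so the whole decomposition runs in time linear in the number of nodes and edges, which is exactly the kind of budget Broder et al. were restricting themselves to.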
On the other hand, while the core of the web is not as big as you might have expected, it's extremely well connected, perhaps better connected than you might have expected. Now, you'd be right to ask what I could mean by unusually well connected. We've already established that this giant core of the web is strongly connected: you can get from any one place to any other place via a sequence of hyperlinks. What else could you want? Well, in fact, it has a much richer notion of connectivity called the small world property. So let me tell you about the small world property, or the phenomenon colloquially known as six degrees of separation. This is an idea which had been in the air at least since the early twentieth century, but it was really studied in a major way, and popularized, by Stanley Milgram, a social scientist, back in 1967. Milgram was interested in understanding whether people at great distance are in fact connected by short chains of intermediaries. The way he evaluated this was to run the following experiment. He identified a friend in Boston, Massachusetts, a doctor I believe, and this was going to be the target. Then he identified a bunch of people who were thought to be far away, both culturally and geographically, specifically in Omaha. For those of you who don't live in the US, just take it on faith that many people in the US would regard Boston and Omaha as being fairly far apart, geographically and otherwise. What he did is he wrote each of these residents of Omaha the following letter: look, here's the name and address of this doctor who lives in Boston; your job is to get this letter to this doctor in Boston. Now, you're not allowed to mail the letter directly to the doctor; instead you need to mail it to an intermediary, someone who you know on a first-name basis. So of course, if you knew the doctor on a first-name basis you could mail it straight to him, but that was very unlikely. So, you know, what people would do in Omaha is they'd say, well, I don't know any doctors, and I don't know anyone in Boston, but at least I know somebody in Pittsburgh, and at least that's closer to Boston than Omaha, that's further eastward. Or maybe someone would say, well, I don't really know anyone on the East Coast, but at least I do know some doctors here in Omaha, and so they'd give the letter to somebody they know on a first-name basis in Omaha. And then the situation would repeat. Whoever got the letter would be given the same instructions: if you know this doctor in Boston on a first-name basis, send him the letter; otherwise, pass the letter on to somebody who seems likely to be closer to him than you are. Now, of course, many of these letters never reached their destination, but, shocking at least to me, a lot of them did. Something like 25 percent of the letters that they started with made it all the way to Boston, which I think says something about people in the late sixties just having more free time on their hands than they do in the early 21st century. I find this hard to imagine, but it's a fact. So you had dozens and dozens of letters reaching this doctor in Boston, and they were able to trace exactly which path of individuals each letter went along before it eventually reached the doctor. So then what they did is they looked at the distribution of chain lengths: how many intermediaries were required to get from some random person in Omaha to this doctor in Boston? Some chains were as short as two; some were as long as nine. But the average number of hops, the average number of intermediaries, was in the range of five and a half or six. And this is what has given rise to the colloquialism, even the name of a popular play, "six degrees of separation." So that's the origin myth; that's where this phrase comes from, these sorts of experiments with physical letters. But now, in network science, the small world property is meant to describe a network which, on the one hand, is richly connected, but which also, in some sense, has enough cues about which links are likely to get you closer to some target that if you need to route information from point A to point B, not only is there a short path, but if you, in some sense, follow your nose, then you'll actually exhibit a short path. So in some sense, routing information is easy in small world networks. And this is exactly the property that Broder et al. identified within the giant SCC: it's very rich with short paths, and if you want to get from point A to point B you can just follow your nose and you'll do great. You don't need a very sophisticated shortest-path algorithm to find a short path.
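If you wanted to sanity-check this kind of claim on a graph you have locally, one crude approach is to estimate the average shortest-path length by running breadth-first search from a sample of source nodes. Here is a minimal sketch, again in Python; the function average_distance and its sampling parameters are just my own illustration, not anything from the Broder et al. paper, and the number it returns is only really meaningful inside a strongly connected piece, since unreachable pairs are simply skipped.

```python
import random
from collections import deque

def average_distance(graph, sample_size=100, seed=0):
    """Estimate the average shortest-path length (in hops) of a directed
    graph {node: list of successors} by BFS from a random sample of
    sources. Sources are drawn from the keys of the adjacency dict."""
    rng = random.Random(seed)
    nodes = list(graph)
    sources = rng.sample(nodes, min(sample_size, len(nodes)))
    total_hops, pairs = 0, 0
    for s in sources:
        # Standard BFS computing hop counts from s.
        dist = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in graph.get(u, ()):
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total_hops += sum(dist.values())
        pairs += len(dist) - 1  # reachable targets other than s itself
    return total_hops / pairs if pairs else float("inf")
```

Even from a small sample of sources, this gives a rough sense of whether typical paths are a handful of hops, in the spirit of six degrees of separation, or much longer.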
Some of you may have heard of Stanley Milgram not for this small world experiment but for another famous, or maybe infamous, experiment he did earlier in the sixties, which consisted of tricking volunteers into thinking they were subjecting other human beings to massive doses of electric shocks. That wound up causing a rewrite of certain standards of ethics in experimental psychology. You don't hear about that so much when people are talking about networks, but that's another reason why Milgram's work is well known. And just as a point of contrast, outside of this giant component, which has this rich small world structure, there is very poor connectivity in other parts of the web graph. So there's a lot of cool research going on these days about the study of information networks like the web graph. I don't want you to get the impression that the entire interaction between algorithms and thinking about information networks is just this one strong component computation in 2000. Of course, there are all kinds of interactions; I have just singled out one that was easy to explain and also highly influential and interesting back in the day. But these days lots of stuff is going on. People are thinking about information networks in different ways, and of course algorithms play a very fundamental role in almost everything. So let me just dash off a few examples, to whet your appetite in case you want to go explore this topic in greater depth outside of this course. One super interesting question is, rather than looking at a static snapshot of the web, like we were doing so far in this video, to remember that the web is changing all the time: new pages are getting created, new links are getting created and destroyed, and so on. How does this evolution proceed? Can we have a mathematical model which faithfully reproduces the most important first-order properties of this evolutionary process? A second issue is to think not just about the dynamics of the graph itself, but about the dynamics of the information that gets carried by the graph. And you could ask this both about the web graph and about other social networks like, say, Facebook or Twitter. Another really important topic, on which there's been a lot of work but which we still don't fully understand by any means, is getting at the finer-grained structure of networks, including the web graph. In particular, what we'd really like is a foolproof method for identifying communities, meaning groups of nodes (web pages in the web graph, or individuals in a social network) which we should think of as grouped together. We discussed this when we talked about applications of cuts: one motivation for cuts is to identify communities as regions that are relatively densely connected inside and sparsely connected to the outside. But that's just a baby step; really we need much better techniques for both defining and computing communities in these kinds of networks. So I think these questions are super interesting, both on a mathematical and technical level, but they're also very timely; answering them really helps us understand our world better. Unfortunately, they're going to be well outside the scope of this course, which is just the bread and butter design and analysis of algorithms, which is what we're tasked with covering here. But I will leave you with a reference to a book that I recommend if you want to read more about these topics, namely the quite recent book by David Easley and Jon Kleinberg, called "Networks, Crowds, and Markets."