Now let's look at some examples of algorithms implemented in parallel using map-reduce. For those of you used to relational databases, here's how one can implement a join using map-reduce. Suppose we are given two tables, sales by address and cities by address, and we are interested in joining these tables to find the sales by city. In SQL, this is how you would do it: you would simply select the sum of sales from these two tables, matching rows where the addresses are equal, and group the result by city. In map-reduce, however, this requires two map-reduce jobs. In the first job you essentially sort the data by address, so that each reducer is able to emit the city for every sale; in other words, the first map-reduce job only joins the sales and city tables but doesn't do the grouping. In the second map-reduce job we sort on the city key, so that all the records for a particular city come to the same reducer, where the addition takes place. Notice that we couldn't have used a single map-reduce job, because at the end of the first phase it's not possible to guarantee that all the data for a particular city resides in the same reducer, and so the summation can't be done city-wise in a single phase.
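To make this concrete, here is a minimal sketch of such a two-phase join, written as plain Python over in-memory lists rather than for a real cluster. The record layouts (sales as (address, amount) pairs, cities as (address, city) pairs) and the tiny run_job driver that stands in for the shuffle-and-sort between map and reduce are illustrative assumptions, not details from the lecture.

```python
from collections import defaultdict

def job1_map(record):
    # Tag each record with the table it came from and key it by address.
    table, fields = record
    if table == "sales":
        address, amount = fields
        yield address, ("sale", amount)
    else:  # "cities"
        address, city = fields
        yield address, ("city", city)

def job1_reduce(address, values):
    # All records for one address arrive together: attach the city to
    # every sale, i.e. do the join but not the grouping.
    city, amounts = None, []
    for tag, value in values:
        if tag == "city":
            city = value
        else:
            amounts.append(value)
    for amount in amounts:
        yield city, amount

def job2_map(record):
    # Identity map: the joined record is already keyed by city.
    city, amount = record
    yield city, amount

def job2_reduce(city, amounts):
    # Only now are all sales for one city guaranteed to be in one reducer.
    yield city, sum(amounts)

def run_job(map_fn, reduce_fn, records):
    # A tiny in-memory stand-in for the shuffle-and-sort between phases.
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    output = []
    for key in sorted(groups):
        output.extend(reduce_fn(key, groups[key]))
    return output

if __name__ == "__main__":
    data = [("sales", ("12 Main St", 100)),
            ("sales", ("9 Oak Ave", 50)),
            ("sales", ("12 Main St", 25)),
            ("cities", ("12 Main St", "Springfield")),
            ("cities", ("9 Oak Ave", "Shelbyville"))]
    joined = run_job(job1_map, job1_reduce, data)
    print(run_job(job2_map, job2_reduce, joined))
    # [('Shelbyville', 50), ('Springfield', 125)]
```

Each city's total is only computed after the second shuffle has brought all of that city's sales together, which is exactly why one map-reduce phase is not enough.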
I'll relate to you a real-world story that illustrates the difference between the map-reduce way of thinking and a traditional database approach. This is a real example where an organization had lots of data which could be abstracted as data about papers, their authors, and their contents. The actual nature of the data is proprietary, so I won't relate it in detail, but it's pretty much the same problem as this: there are millions of papers, written by many millions of authors, containing millions of possible technical terms, phrases, multi-word terms. One of the tasks was to figure out the top ten technical terms used by each author, the top ten authors that used a particular term, and so on. You can see that this problem is quite close to our word-counting example, and it possibly indicates a real-life requirement as well.

This is how a database person actually proposed a solution. You start out with a bunch of documents represented as a table with, say, a paper ID, the content of the document itself, and its author. Then you get rid of the actual content and replace it by words. Notice what has happened here: instead of one row per paper, you now have a row for every word and author pair that occurs in a particular paper, so one row in the original table produces many, many rows in the new table. No matter, says our database expert: all we need to do now is a group-by, so that we can count how many times a particular word is used by a particular author. The resulting table would let you answer questions about the top ten technical terms used by a particular author by simply finding all the entries for that author, sorting them by word count, and reading off the answer.

Problems start to occur when you consider how large these tables can get. If the paper table has a million or a few million documents, take a look at the grouped table: at worst you'll have an entry for every word and every author, that is, if you allow zero word counts. That works out to a million times a million, or a trillion entries. Even if you don't have zeros, and you get rid of them by suitable modifications to your joining and counting process, you'll still have many billions of entries in this table merely to solve this kind of problem.

Now suppose you wanted to find the top ten authors for a combination of two terms, or the top ten terms for two authors who wrote together. This solution doesn't work anymore. So we see that the database mindset doesn't quite work in such high-dimensional spaces, where you have millions of terms and millions of authors and you are trying to find counts of combinations between them. This is really a case for map-reduce, and it is exactly these kinds of challenges which led the web companies to develop new technologies and new programming paradigms. It was not just that the old technologies didn't scale; the relational paradigm itself needed to be looked at afresh when dealing with large volumes of unstructured data.

Okay, so let's try doing top-K words per author in map-reduce. Suppose we use a map function that emits a word and its author every time we see the word in a document, just as we emitted a word and its count earlier, and the reduce key is the word-author combination, so that every reducer gets all the records for one such combination and can simply count them. Well, this doesn't quite solve the problem, does it? Actually it's the same problem as before: a database mindset being used with map-reduce. The programming paradigm is not a solution by itself; the approach really needs to change from a database-oriented one to something else.

Let's try it again. Again we have a map function, but this time, instead of emitting the author and the word, we emit the author and the contents. Essentially all the map phase is doing is extracting the author field and emitting it along with the contents. In the reduce phase we group everything by author, so that all the documents for a particular author end up in the same reducer, and the reduce function is a custom one which works as follows. For each author, it grabs all the elements in that reducer for that author, computes the word counts, inserts them into some temporary array W, sorts it, gives out the top K words for that author, deletes W, and starts again for the next author. Essentially you're computing the top K for each author, not storing the count for every word-author combination, because you're only interested in the top K: the top ten, the top five, the top twenty, whatever. Notice what we've done here: we are bringing some element of programming back into the reduce function, and not relying entirely on the system, map-reduce in this case, or the database in the previous case, to do all our work.

Now let's look at some more examples, such as those we've considered in the past few weeks: indexing; locality-sensitive hashing, in particular how to assemble the results from many different hash functions; how to compute the likelihoods while coming up with a Bayesian classifier; and how to evaluate a Bayesian classifier to compute the likelihood ratio. Do you actually need parallelism to do that last one? How about computing the TF-IDF, or the joint probabilities for computing the information gain or mutual information? I'll leave these last three as homework for you, and we'll just do the first three right now.

Let's start with indexing. In the map phase we produce a partial index for a partial set of documents; that is, we emit, for every word, a postings list over the documents that we see. This is not a full index, so we have to reduce, again by word, and merge the partial indexes, that is, merge these postings lists per word into a common postings list.
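Going back to the top-K-terms-per-author job for a moment, here is a minimal sketch of the custom reduce function described earlier. The (paper_id, author, contents) record format, whitespace tokenization, the value K = 10, and the small driver standing in for the shuffle are all illustrative assumptions.

```python
from collections import Counter, defaultdict

K = 10

def map_fn(document):
    # The map phase just extracts the author field and passes the
    # contents along with it.
    paper_id, author, contents = document
    yield author, contents

def reduce_fn(author, all_contents):
    # All documents by one author arrive at the same reducer.  Count the
    # words in a temporary structure, emit only the top K, then discard
    # it before moving on to the next author.
    word_counts = Counter()
    for contents in all_contents:
        word_counts.update(contents.lower().split())
    yield author, word_counts.most_common(K)

def top_k_per_author(documents):
    # In-memory stand-in for the shuffle-and-sort between map and reduce.
    groups = defaultdict(list)
    for doc in documents:
        for author, contents in map_fn(doc):
            groups[author].append(contents)
    for author, all_contents in groups.items():
        yield from reduce_fn(author, all_contents)
```

Notice that only the top K entries per author ever leave the reducer; the full word-by-author table from the database approach is never materialized.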
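The indexing job described above can be sketched in the same style. The (doc_id, text) record format and whitespace tokenization are again assumptions made only for illustration; each call to map_fn plays the role of one mapper building a partial index over its own chunk of documents.

```python
from collections import defaultdict

def map_fn(chunk):
    # chunk: the list of (doc_id, text) pairs handled by one mapper.
    partial = defaultdict(list)          # word -> partial postings list
    for doc_id, text in chunk:
        for word in set(text.lower().split()):
            partial[word].append(doc_id)
    for word, postings in partial.items():
        yield word, postings

def reduce_fn(word, partial_postings_lists):
    # All partial postings lists for one word arrive together; merge them
    # into a single postings list, here simply sorted by document id.
    merged = sorted(d for postings in partial_postings_lists for d in postings)
    yield word, merged

def build_index(chunks):
    # In-memory stand-in for the shuffle-and-sort between map and reduce.
    groups = defaultdict(list)
    for chunk in chunks:
        for word, postings in map_fn(chunk):
            groups[word].append(postings)
    for word, lists in groups.items():
        yield from reduce_fn(word, lists)
```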
An interesting question is: what if you have to sort the postings lists, either by document ID or by page rank, as we had in our first homework assignment? What do you think you would need to do then? Would you need another map-reduce phase? Think about it.

Now let's consider locality-sensitive hashing. In the map phase you start with documents and compute K hash values for every document. These could be the K hashes we used for fingerprints, or they could be others, as described in the Rajaraman and Ullman book for documents. The reduce key is now the set of K hash values, so each reducer gets all the documents that have the same K hash values and figures out all the pairs they could possibly form: if there are three such documents it emits three pairs, if there are four it emits six pairs, and so on. As a result, the map-reduce LSH produces a list of document pairs which are potentially similar. An interesting question is: will a document pair be emitted by more than one reducer? And suppose we wanted each pair to be emitted only once; what would we do then?

Lastly, let's come to our Bayesian classifier, where we need to compute the likelihoods, or rather the conditional probabilities of a yes or a no occurring for each possible feature. The map phase merely produces the counts of (feature, yes) or (feature, no), which are then grouped by feature in the reduce phase, so that all the counts emitted for a particular feature arrive at one reducer. There we merely sum up the respective counts and divide by the total number of instances we've seen for that feature. Finally we emit the log of this likelihood, so that we get the elements of our Bayesian classifier. Notice that once we have the log likelihoods for each feature, do we really need parallelism for testing new documents or new objects using the naive Bayes classifier? Think about it. Is that computation really so large as to require parallel computing?

Well, we've seen a number of examples of map-reduce. Admittedly the last few have been covered rather quickly. Please review the material so that you understand it, and discuss it in the forum if you don't.
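To help with that review, here is a minimal sketch of the LSH job described above: the map phase computes K hash values per document, the reduce key is that set of K values, and each reducer emits every pair of documents that agree on all of them. The placeholder min-hash-style signature and K = 3 below are illustrative assumptions, not the hash functions from the lectures.

```python
from itertools import combinations
from collections import defaultdict

K = 3

def signature(text, k=K):
    # Placeholder "K hash values": for each of k salted hash functions,
    # take the minimum hash over the document's words (documents are
    # assumed non-empty).
    words = text.lower().split()
    return tuple(min(hash((i, w)) for w in words) for i in range(k))

def map_fn(document):
    doc_id, text = document
    yield signature(text), doc_id

def reduce_fn(sig, doc_ids):
    # Documents with identical signatures land in the same reducer; emit
    # all pairs they can form (3 documents -> 3 pairs, 4 -> 6 pairs, ...).
    for pair in combinations(sorted(set(doc_ids)), 2):
        yield pair

def candidate_pairs(documents):
    # In-memory stand-in for the shuffle-and-sort between map and reduce.
    groups = defaultdict(list)
    for doc in documents:
        for sig, doc_id in map_fn(doc):
            groups[sig].append(doc_id)
    for sig, doc_ids in groups.items():
        yield from reduce_fn(sig, doc_ids)
```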
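And finally, a similarly minimal sketch of the likelihood computation for the Bayesian classifier, with counts of (feature, label) pairs grouped by feature and turned into log likelihoods using the per-feature normalization described above. The input format, each training instance given as a list of features plus a "yes" or "no" label, is an assumption made only for illustration.

```python
import math
from collections import defaultdict

def map_fn(instance):
    # Emit a count of 1 for every (feature, label) observed.
    features, label = instance            # label is "yes" or "no"
    for feature in features:
        yield feature, (label, 1)

def reduce_fn(feature, labelled_counts):
    # Sum the yes and no counts for this feature, divide by the total
    # number of instances seen for it, and emit the log likelihoods.
    totals = {"yes": 0, "no": 0}
    for label, count in labelled_counts:
        totals[label] += count
    n = totals["yes"] + totals["no"]
    for label in ("yes", "no"):
        if totals[label] > 0:              # a real job would smooth zeros
            yield (feature, label), math.log(totals[label] / n)

def likelihoods(instances):
    # In-memory stand-in for the shuffle-and-sort between map and reduce.
    groups = defaultdict(list)
    for instance in instances:
        for feature, labelled in map_fn(instance):
            groups[feature].append(labelled)
    for feature, counts in groups.items():
        yield from reduce_fn(feature, counts)
```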