Now let's look at some examples of algorithms implemented in parallel using map-reduce.

For those of you used to relational databases, here's how one can implement a join using map-reduce. Suppose we are given two tables, sales by address and cities by address, and we are interested in joining these tables to find the sales by city. In SQL, this is how you would do it: you would simply select the sum of sales from these two tables, where the addresses match, and group the results by city.

In map-reduce, however, this requires two map-reduce jobs. In the first job you join, or essentially sort, the data by address, so that each reducer is able to emit the city for every sale. In other words, the first map-reduce job only joins the sales and cities tables; it does not do the grouping. In the second map-reduce job we sort on the city key, so that all the records for a particular city come to the same reducer, where the addition takes place. Notice that we could not have used a single map-reduce phase, because at the end of the first phase it is not possible to guarantee that all the data for a particular city resides in the same reducer, and so the city-wise summation cannot be done in a single phase.
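As an illustration (not from the lecture itself), here is a minimal Python sketch of the two-job join. The run_mapreduce helper, the table contents, and the field layout are all made up for the example; it is a tiny in-memory stand-in for a real map-reduce run, not an actual framework.

    from collections import defaultdict

    def run_mapreduce(records, mapper, reducer):
        # Tiny in-memory stand-in for a map-reduce run: map, shuffle by key, reduce.
        groups = defaultdict(list)
        for rec in records:
            for key, value in mapper(rec):
                groups[key].append(value)
        output = []
        for key, values in groups.items():
            output.extend(reducer(key, values))
        return output

    # Hypothetical inputs; "address" is the join key.
    sales  = [("addr1", 100), ("addr2", 250), ("addr1", 75)]   # (address, sale amount)
    cities = [("addr1", "Paris"), ("addr2", "Lyon")]           # (address, city)

    # Job 1: join on address. Each reducer sees the city plus all the sales for one
    # address and emits (city, sale) pairs: the join, but no grouping yet.
    def map_join(record):
        table, row = record                  # row is (address, payload)
        yield row[0], (table, row[1])

    def reduce_join(address, tagged_values):
        # Assumes every address has exactly one matching city row.
        city = next(v for tag, v in tagged_values if tag == "cities")
        for tag, v in tagged_values:
            if tag == "sales":
                yield (city, v)

    # Job 2: group by city and add up the sales.
    def map_by_city(record):
        city, sale = record
        yield city, sale

    def reduce_sum(city, sale_amounts):
        yield (city, sum(sale_amounts))

    tagged = [("sales", r) for r in sales] + [("cities", r) for r in cities]
    joined = run_mapreduce(tagged, map_join, reduce_join)      # [('Paris', 100), ...]
    print(run_mapreduce(joined, map_by_city, reduce_sum))      # [('Paris', 175), ('Lyon', 250)]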
I'll relate to you a real-world story that illustrates the difference between the map-reduce way of thinking and a traditional database approach. This is a real example where an organization had lots of data which could be abstracted as data about papers, their authors, and their contents. The actual nature of the data is proprietary, so I won't relate it in detail, but it is pretty much the same problem as this one. There are millions of papers, written by many millions of authors, containing millions of possible technical terms: phrases, multi-word terms. One of the tasks was to figure out the top ten technical terms used by each author, the top ten authors that used a particular term, and so on. You can see that this problem is quite close to our word-counting example, and it probably reflects a real-life requirement as well.

This is how a database person actually proposed to solve it. You start out with a bunch of documents represented as a table with, say, a paper ID, the content of the document itself, and its author. Then you get rid of the actual content and replace it by words. Notice what has happened here: instead of one row per paper, you now have a row for every word and author pair that occurs in a particular paper, so one row in the original table produces many, many rows in the new one.

No matter, says our database expert. All we need to do now is a group-by, so that we can count how many times a particular word is used by a particular author. The resulting table would let you answer questions like which are the top ten technical terms used by a particular author: simply find all the entries for that author, sort them by word count, and you have your answer.

Problems start to occur when you consider how large these tables can get. If the paper table has a million, or a few million, documents, then take a look at the word-count table. At worst you will have an entry for every word and every author, that is, if you allow zero word counts. That works out to a million times a million, or a trillion entries. Even if you avoid the zeros by suitably modifying your joining and counting process, you will still have many billions of entries in this table merely to solve this kind of problem.

Now suppose you wanted to find the top ten authors for two terms, that is, for a combination of terms, or you wanted to find the top ten terms for two authors who wrote together. This solution doesn't work any more. So we see that the database mindset doesn't quite work in such high-dimensional spaces, where you have millions of terms and millions of authors and you are trying to count combinations between them. This is really a case for map-reduce, and it is exactly these kinds of challenges that led the web companies to develop new technologies and new programming paradigms. It was not just that the old technologies didn't scale; the relational paradigm itself needed a fresh look when dealing with large volumes of unstructured data.

Okay, so let's try doing top-k words per author in map-reduce. Suppose we use a map function that emits a word and its author every time we see the word in a document, just as we emitted a word and its count earlier, and the reduce key is the word-author combination, so that every reducer gets all the records for one such combination and can simply count them.
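In code, that first attempt would look roughly like this. It is a minimal sketch that reuses the illustrative run_mapreduce helper from the join example above and assumes each document is a dict with "author" and "content" fields; both the helper and the field names are mine, not from the lecture.

    def map_word_author(doc):
        # Emit ((word, author), 1) for every word occurrence, assuming a document
        # looks like {"author": ..., "content": ...}.
        for word in doc["content"].split():
            yield (word, doc["author"]), 1

    def reduce_count(word_author_pair, ones):
        yield (word_author_pair, sum(ones))

    # run_mapreduce(docs, map_word_author, reduce_count)
    # The output has one row per (word, author) pair that occurs anywhere,
    # which is exactly the blow-up discussed above.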
Well, does this solve the problem? Actually, it's the same problem as before: a database mindset being used with map-reduce. The programming paradigm is not a solution by itself; the approach really needs to change from a database-oriented one to something else.

Let's try it again. Again we have a map function, but this time, instead of emitting the author and the word, we emit the author and the contents. So essentially all the map phase is doing is extracting the author field and emitting it along with the contents. In the reduce phase we group everything by author, so that all the documents for a particular author end up in the same reducer, and the reduce function is a custom function which works as follows. For each author, it grabs all the elements in that reducer for that author, computes the word counts, inserts them into some temporary array W, sorts it, emits the top K words for that author, deletes W, and starts again for the next author.

Essentially, you are computing the top k for each author, not storing the count for every word-author combination, because you are only interested in the top K: the top ten, the top five, the top twenty, whatever. Notice what we've done here. We are bringing some element of programming back into the reduce function, and not relying entirely on the system, map-reduce in this case or the database in the previous case, to do all our work.
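Here is a minimal sketch of this second approach, in the same illustrative style as the earlier examples and with the same assumed document layout. Using a Counter in place of the temporary array W is my own simplification.

    from collections import Counter

    TOP_K = 10    # top ten terms per author, say

    def map_by_author(doc):
        # The map phase only extracts the author field and passes the contents along.
        yield doc["author"], doc["content"]

    def reduce_top_k(author, contents):
        # All documents for one author arrive at the same reducer. Count the words
        # in a temporary structure, keep only the top K, then discard the counts.
        counts = Counter()
        for text in contents:
            counts.update(text.split())
        yield (author, counts.most_common(TOP_K))

    # run_mapreduce(docs, map_by_author, reduce_top_k)
    # yields one (author, [(word, count), ...]) record per author, top K words only.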
Now let's look at some more examples, such as those we've considered in the past few weeks: indexing; locality-sensitive hashing, in particular how to assemble the results from many different hash functions; how to compute the likelihoods while building a Bayesian classifier; how to evaluate a Bayesian classifier, that is, compute the likelihood ratio (do you actually need parallelism to do this?); how to compute TF-IDF; and how to compute the joint probabilities needed for information gain or mutual information. I'll leave the last three as homework for you, and we'll just do the first three right now.

Let's start with indexing. In the map phase we produce a partial index for a partial set of documents; that is, for every word we emit a postings list built from the documents that we see. This is not a full index, so we have to reduce, again by word, and merge the partial indexes, that is, merge these per-word postings lists into a common postings list. An interesting question is what happens if you have to sort each postings list, either by document ID or by page rank, as we had in our first homework assignment. What do you think you would need to do then? Would you need another map-reduce phase? Think about it.

Now let's consider locality-sensitive hashing. In the map phase you start with documents and compute K hash values for every document. These could be the K hash values we used for fingerprints, or they could be others, as described for documents in the Rajaraman and Ullman book. The reduce key is now the set of K hash values: each reducer gets all the documents that have the same K hash values and figures out all the pairs they could possibly form. So if there are three such documents it emits three pairs; if there are four documents it emits six pairs, and so on. As a result, the map-reduce LSH produces a list of document pairs which are potentially similar. An interesting question is whether a document pair can be emitted by more than one reducer. And suppose we wanted each pair to be emitted only once; what would we do then?

Lastly, let's come to our Bayesian classifier, where we need to compute the likelihoods, or rather the conditional probabilities of a yes or a no occurring for each possible feature. The map phase merely produces counts of (feature, yes) or (feature, no), which are then grouped by feature in the reduce phase, so that all the counts emitted for a particular feature arrive at one reducer. There we merely sum up the respective counts and divide by the total number of instances we have seen for that feature. Finally, we emit the log of this likelihood, so that we get the elements of our Bayesian classifier. Notice that once we have the log likelihoods for each feature, do we really need parallelism for testing new documents or new objects using the naive Bayes classifier? Think about it: is that computation really so large as to require parallel computing?

Well, we've seen a number of examples of map-reduce. Admittedly, the last few have been covered rather quickly, so please review the material to make sure you understand it, and discuss it in the forum if you don't.
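To help with that review, here is one last minimal sketch, of the likelihood job for the Bayesian classifier described above, in the same illustrative style as the earlier examples. It assumes each training instance is a (label, features) pair with label "yes" or "no", and it normalises exactly as described in the lecture, by the total number of records seen for each feature; the input layout and the handling of zero counts are my assumptions.

    import math

    def map_feature_label(instance):
        # instance is assumed to be (label, features): label is "yes" or "no",
        # features is the list of feature names present in that instance.
        label, features = instance
        for f in features:
            yield f, label

    def reduce_log_likelihood(feature, labels):
        # All the yes/no records for one feature arrive here. Sum the respective
        # counts, divide by the total seen for this feature, and emit the logs.
        total = len(labels)
        yes = labels.count("yes")
        no = total - yes
        log_yes = math.log(yes / total) if yes else float("-inf")
        log_no = math.log(no / total) if no else float("-inf")
        yield (feature, (log_yes, log_no))

    # run_mapreduce(training_instances, map_feature_label, reduce_log_likelihood)
    # yields one (feature, (log P(yes), log P(no))) record per feature.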