As we all know, microprocessors are becoming cheaper and faster every day. Computing power doubles roughly every eighteen months for the same price, and this has been going on for more than 40 years. So in the early days of parallel computing, such as the 80s, by the time one figured out how to use many processors to do the work of one processor faster, that one processor itself had become twice as fast. So parallel computing really had to play a catch-up game.

However, in recent years you may have noticed that all your PCs and servers come with multiple cores: four cores, eight cores, sixteen cores, and soon 32 or 64 cores on every chip. What that really means is that physical limits, such as the speed of light, are forcing parallel computing into everyday life, within every computer, which was never the case twenty or 30 years ago.

On the web, parallel computing is used at a macro level, with hundreds of thousands or even millions of servers powering the large search engines and social media platforms, such as Google and Facebook. So in two different ways, within chips on the one hand and on the web on the other, parallel computing has seen a new life in the past five years.

So it's important to understand some basic principles when one tries to do things faster using more machines. If one uses p machines, one would expect to do the same task, which took time T1 on one machine, in a smaller time, call it Tp. The speedup, T1 over Tp, will obviously be greater than one and will increase as you increase the number of processors. For example, with twenty processors you might get a speedup of somewhere around fifteen or sixteen; with 40 processors, somewhere around 29 or 30. But as the number of processors increases, for a variety of reasons, the speedup doesn't go on forever.

Another important concept is efficiency. The ideal speedup with p processors is p, so the efficiency measures how close we are to achieving that speedup, and it is something less than one. When you have a small number of processors, like two or four or eight, you can get an efficiency close to one. But as the number of processors goes up, we will see that the efficiency actually comes down. There are reasons for this, and the primary reason is communication, or coordination, between processors.
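To make these two quantities concrete, here is a minimal sketch in Python. The overhead model is purely an assumption for illustration (the logarithmic coordination cost and the constant 0.3 are not numbers from the lecture); it simply shows speedup staying close to p when p is small and efficiency falling as p grows.

```python
# A minimal sketch, not from the lecture: speedup and efficiency under an
# assumed coordination-cost model (log-shaped overhead, constant 0.3 are
# purely illustrative choices).
import math

def parallel_time(t1, p, coord_cost=0.3):
    """Hypothetical Tp: perfectly divided work plus a coordination
    overhead that grows slowly with the number of processors p."""
    return t1 / p + coord_cost * math.log2(p)

def speedup(t1, p):
    """S_p = T1 / Tp: greater than 1, but it does not grow forever."""
    return t1 / parallel_time(t1, p)

def efficiency(t1, p):
    """E_p = S_p / p: close to 1 for small p, falls as p grows."""
    return speedup(t1, p) / p

if __name__ == "__main__":
    t1 = 100.0  # time on one processor, in arbitrary units
    for p in (2, 8, 20, 40, 200):
        print(f"p={p:4d}  speedup={speedup(t1, p):5.1f}  efficiency={efficiency(t1, p):.2f}")
```

Running this prints efficiencies near one for two or eight processors and much smaller values for hundreds of processors, which is exactly the trend described above.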
Think of it like this: if you have two or three people performing a task, coordinating them is probably easy, but if you put a hundred people on the task, it doesn't necessarily get done a hundred times faster, because the people have to talk to each other and coordinate their actions.

So, do we lose hope? Is there no way we can really exploit thousands and millions of processors? Well, there is actually. And the trick is not to solve the same problem, but to solve bigger problems when you have more processors. There's no point indexing a one-megabyte file with a million machines; you only need a million machines if you have billions and billions of web pages to index.

A scalable algorithm is one which allows you to increase the problem size without suffering any loss in efficiency. On the contrary, because you have a large number of processors and a large problem size, the coordination costs should actually go down as a fraction of the overall work being performed. So your efficiency should actually increase as the ratio n over p increases. A scalable algorithm, then, is one where, as the problem size goes up in proportion to the number of processors that you throw at the problem, you get better and better efficiency.

Parallel programming is considerably more difficult than normal serial programming, because one has to deal with coordinating the actions of many different processors. There are two fundamental paradigms that have been used historically for parallel programming. One is shared memory. For example, one might partition the work using shared memory in this way: each processor does a particular piece of work, call it W sub p, and different processors do different work by locking the part of the data they are working on, doing their work, and then unlocking that piece of data. As long as every processor accesses a different part of the data from the parts other processors are accessing, this is reasonably fast. Of course, the problem arises if they try to access the same piece of data, in which case they might find it locked by somebody else. And that's where the costs of coordination, or communication overhead, slow down the procedure and reduce the efficiency; a rough sketch of this lock-based style appears below.

Another paradigm which has been used historically is message passing, for example by partitioning the data, so that each processor does the same computational work but on different parts of the data.
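Before continuing with the data-partitioning example, here is the lock-based shared-memory sketch referred to above. It is a minimal illustration, assuming Python threads, a list standing in for shared memory, and one lock per piece of data; these choices are mine, not the lecture's.

```python
# A minimal sketch of the shared-memory paradigm: lock a piece of data,
# do the work W_p on it, then unlock it.
import threading

shared_data = [0] * 8                                   # "shared memory": one cell per work item
locks = [threading.Lock() for _ in shared_data]         # one lock per piece of data

def worker(indices):
    """Each 'processor' locks the part of the data it works on,
    performs its piece of work, then releases the lock."""
    for i in indices:
        with locks[i]:          # lock that part of the data
            shared_data[i] += 1 # the work on that piece; lock released on block exit

threads = [threading.Thread(target=worker, args=([0, 1, 2, 3],)),
           threading.Thread(target=worker, args=([4, 5, 6, 7],))]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(shared_data)  # fast when workers touch different pieces; contention arises if they overlap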
Returning to message passing: for example, a processor might get one slice of the data. The data is divided into n over p slices, and each processor gets one slice but performs the same work on its slice of the data. Of course, if that were all there was to it, the problem would be what is called embarrassingly parallel, and it becomes fairly trivial. However, in most cases the problem is a bit more difficult, because the work done on one slice of data needs to use the results from the work done on another slice. Think of adding up n numbers: you couldn't simply add up each small slice and expect to get the answer; you'd have to exchange the results of these partial sums with at least one processor, or maybe many of them in a tree structure, so that you could finally compute the sum of all the numbers in parallel. Therefore, in the message-passing model, processors need to exchange data after performing some work on their slice, so that they can perform more work based on the results of the work done by other processors on other slices. This makes message passing a bit more difficult to program than shared memory, but it's also much more scalable, because things like shared memory itself are really difficult to implement when you have thousands or millions of processors.

Of course, you can have shared memory and also partition the data, just as you could have message passing and partition the work. The MapReduce programming paradigm is a higher-level abstraction than pure message passing or shared memory, but it is a message-passing and data-parallel paradigm.
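As a concrete illustration of the adding-up-n-numbers example, here is a minimal sketch that uses Python's multiprocessing pool as a stand-in for separate machines; the pool-based collection of partial sums is an assumption made for illustration, since a real message-passing system would exchange the partial results explicitly, for example over MPI.

```python
# A minimal sketch of the data-parallel sum: each worker sums its own
# slice, and the partial sums are then combined, standing in for the
# exchange step of the message-passing model.
from multiprocessing import Pool

def partial_sum(slice_of_data):
    """The same work, performed on one slice of the data."""
    return sum(slice_of_data)

if __name__ == "__main__":
    data = list(range(1_000_000))   # the n numbers to add
    p = 4                           # number of workers ("processors")
    n = len(data)
    slices = [data[i * n // p:(i + 1) * n // p] for i in range(p)]  # n/p items per slice

    with Pool(p) as pool:
        partials = pool.map(partial_sum, slices)  # each worker handles its slice

    total = sum(partials)           # combine the exchanged partial results
    print(total == sum(data))       # True
```

In MapReduce terms, the per-slice work plays the role of map and the combination of the partial results plays the role of reduce, which is why the paradigm can be described as both message passing and data parallel.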