As we all know, microprocessors are becoming cheaper and faster every day. Computing power doubles roughly every eighteen months for the same price, and this has been going on for more than 40 years. So in the early days of parallel computing, such as the 80s, by the time one figured out how to use many processors to do the work of one processor faster, that one processor itself had become twice as fast. So parallel computing really had to play a catch-up game.

However, in recent years you may have noticed that all your PCs and servers come with multiple cores: four cores, eight cores, sixteen cores, and soon 32 or 64 cores on every chip. What that really means is that physical limits, such as the speed of light, are forcing parallel computing into everyday life, within every computer, which was never the case twenty or 30 years ago.

On the web, parallel computing is used at a macro level, with hundreds of thousands or even millions of servers powering the large search engines and social media platforms, such as Google and Facebook. So in two different ways, within chips on the one hand and on the web on the other, parallel computing has seen a new life in the past five years.

So it's important to understand some basic principles when one tries to do things faster using more machines. If one uses p machines, one would expect to do the same task, which took time T1 on one machine, in a smaller time, call it Tp. The speedup, T1 over Tp, will obviously be greater than one and will increase as you increase the number of processors. For example, with twenty processors you might get a speedup of somewhere around fifteen or sixteen; with 40 processors, somewhere around 29 or 30. But as the number of processors increases, for a variety of reasons, the speedup doesn't go on forever.

Another important concept is efficiency. The ideal speedup with p processors is p, so the efficiency measures how close we are to achieving that speedup, and it is something less than one. When you have a small number of processors, like two or four or eight, you can get an efficiency close to one. But as the number of processors goes up, we will see that the efficiency actually comes down. There are reasons for this, and the primary reason is communication, or coordination, between processors.
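To make these two quantities concrete, here is a minimal sketch in Python. The overhead model is purely an assumption for illustration (the logarithmic coordination cost and the constant 0.3 are not numbers from the lecture); it simply shows speedup staying close to p when p is small and efficiency falling as p grows.

```python
# A minimal sketch, not from the lecture: speedup and efficiency under an
# assumed coordination-cost model (log-shaped overhead, constant 0.3 are
# purely illustrative choices).
import math

def parallel_time(t1, p, coord_cost=0.3):
    """Hypothetical Tp: perfectly divided work plus a coordination
    overhead that grows slowly with the number of processors p."""
    return t1 / p + coord_cost * math.log2(p)

def speedup(t1, p):
    """S_p = T1 / Tp: greater than 1, but it does not grow forever."""
    return t1 / parallel_time(t1, p)

def efficiency(t1, p):
    """E_p = S_p / p: close to 1 for small p, falls as p grows."""
    return speedup(t1, p) / p

if __name__ == "__main__":
    t1 = 100.0  # time on one processor, in arbitrary units
    for p in (2, 8, 20, 40, 200):
        print(f"p={p:4d}  speedup={speedup(t1, p):5.1f}  efficiency={efficiency(t1, p):.2f}")
```

Running this prints efficiencies near one for two or eight processors and much smaller values for hundreds of processors, which is exactly the trend described above.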
Think of it like this: if you have two or three people performing a task, coordinating them is probably easy, but if you put a hundred people on the task, it doesn't necessarily get done a hundred times faster, because the people have to talk to each other and coordinate their actions.

So, do we lose hope? Is there no way we can really exploit thousands and millions of processors? Well, there is actually. And the trick is not to solve the same problem, but to solve bigger problems when you have more processors. There's no point indexing a one-megabyte file with a million machines; you only need a million machines if you have billions and billions of web pages to index.

A scalable algorithm is one which allows you to increase the problem size without suffering any loss in efficiency. On the contrary, because you have a large number of processors and a large problem size, the coordination costs should actually go down as a fraction of the overall work being performed. So your efficiency should actually increase as the ratio n over p increases. A scalable algorithm, then, is one where, as the problem size goes up in proportion to the number of processors that you throw at the problem, you get better and better efficiency.

Parallel programming is considerably more difficult than normal serial programming, because one has to deal with coordinating the actions of many different processors. There are two fundamental paradigms that have been used historically for parallel programming. One is shared memory. For example, one might partition the work using shared memory in this way: each processor does a particular piece of work, call it W sub p, and different processors do different work by locking the part of the data they are working on, doing their work, and then unlocking that piece of data. As long as every processor accesses a different part of the data from the parts other processors are accessing, this is reasonably fast. Of course, the problem arises if they try to access the same piece of data, in which case they might find it locked by somebody else. And that's where the costs of coordination, or communication overhead, slow down the procedure and reduce the efficiency; a rough sketch of this lock-based style appears below.

Another paradigm which has been used historically is message passing, for example by partitioning the data, so that each processor does the same computational work but on different parts of the data.
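Before continuing with the data-partitioning example, here is the lock-based shared-memory sketch referred to above. It is a minimal illustration, assuming Python threads, a list standing in for shared memory, and one lock per piece of data; these choices are mine, not the lecture's.

```python
# A minimal sketch of the shared-memory paradigm: lock a piece of data,
# do the work W_p on it, then unlock it.
import threading

shared_data = [0] * 8                                   # "shared memory": one cell per work item
locks = [threading.Lock() for _ in shared_data]         # one lock per piece of data

def worker(indices):
    """Each 'processor' locks the part of the data it works on,
    performs its piece of work, then releases the lock."""
    for i in indices:
        with locks[i]:          # lock that part of the data
            shared_data[i] += 1 # the work on that piece; lock released on block exit

threads = [threading.Thread(target=worker, args=([0, 1, 2, 3],)),
           threading.Thread(target=worker, args=([4, 5, 6, 7],))]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(shared_data)  # fast when workers touch different pieces; contention arises if they overlap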
Returning to message passing: for example, a processor might get one slice of the data. The data is divided into n over p slices, and each processor gets one slice but performs the same work on its slice of the data. Of course, if that were all there was to it, the problem would be what is called embarrassingly parallel, and it becomes fairly trivial. However, in most cases the problem is a bit more difficult, because the work done on one slice of data needs to use the results from the work done on another slice. Think of adding up n numbers: you couldn't simply add up each small slice and expect to get the answer; you'd have to exchange the results of these partial sums with at least one processor, or maybe many of them in a tree structure, so that you could finally compute the sum of all the numbers in parallel. Therefore, in the message-passing model, processors need to exchange data after performing some work on their slice, so that they can perform more work based on the results of the work done by other processors on other slices. This makes message passing a bit more difficult to program than shared memory, but it's also much more scalable, because things like shared memory itself are really difficult to implement when you have thousands or millions of processors.

Of course, you can have shared memory and also partition the data, just as you could have message passing and partition the work. The MapReduce programming paradigm is a higher-level abstraction than pure message passing or shared memory, but it is a message-passing and data-parallel paradigm.
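As a concrete illustration of the adding-up-n-numbers example, here is a minimal sketch that uses Python's multiprocessing pool as a stand-in for separate machines; the pool-based collection of partial sums is an assumption made for illustration, since a real message-passing system would exchange the partial results explicitly, for example over MPI.

```python
# A minimal sketch of the data-parallel sum: each worker sums its own
# slice, and the partial sums are then combined, standing in for the
# exchange step of the message-passing model.
from multiprocessing import Pool

def partial_sum(slice_of_data):
    """The same work, performed on one slice of the data."""
    return sum(slice_of_data)

if __name__ == "__main__":
    data = list(range(1_000_000))   # the n numbers to add
    p = 4                           # number of workers ("processors")
    n = len(data)
    slices = [data[i * n // p:(i + 1) * n // p] for i in range(p)]  # n/p items per slice

    with Pool(p) as pool:
        partials = pool.map(partial_sum, slices)  # each worker handles its slice

    total = sum(partials)           # combine the exchanged partial results
    print(total == sum(data))       # True
```

In MapReduce terms, the per-slice work plays the role of map and the combination of the partial results plays the role of reduce, which is why the paradigm can be described as both message passing and data parallel.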