But before we do that, I wanted to talk about programming, and about a competitor to using shared memory for programming.

So far, we have looked at a shared memory architecture to communicate from one core to another core. What that means is that one core does a store to some memory address, and sometime in the future another core goes and reads that memory address and gets the data.

What's interesting about that paradigm is that the sender of the data, the core which is writing the data or doing the store, does not need to know which core is going to read the data in the future. It's even possible that no one will ever go read that data, and in shared memory architectures that doesn't affect anything. One core writes a piece of data to memory at some address, and given this implicit name we effectively have an associative match: you can look up the piece of data based on the address and go find it sometime in the future. Now, you probably need to use locking to guarantee causality between one processor writing the data and another processor reading the data in shared memory, but you don't need to know the destination.

In contrast, we're going to talk about explicit message passing as a programming model. There are lots of hardware implementations you can use to build this, but let's start by talking about explicit message passing as a programming model for communicating between different threads, processes, or cores.

What we're going to have here is an API, a programming interface, of send, which names the destination and passes in a pointer to some data. Our messaging API is going to take this data and somehow get it to the receiver, the destination. On the receive side, the most basic thing you can do is receive data. Note that this receive does not denote who it's receiving data from. So a common thing to do is just receive in order, that is, receive the next piece of data that comes in. Now, that may not be super useful, so you may want to extend this so that the receive also takes a source. But it's not required.
The reason I'm talking about this in an abstract fashion is that there are many different explicit message passing programming interfaces, because it's just software that people have implemented over the years. But at the fundamental level, there's a send and a receive. The send requires a destination. The receive can just receive from the input queue, or it can try to receive from a named source. Another way to do it is what they do in MPI, the Message Passing Interface, which is probably the most common programming interface for messaging: the receive won't take a source, but instead it takes a tag, which is effectively a number that names the traffic flow. A tag is sort of a more generalized form of what TCP/IP and UDP call ports.

Okay. So, questions so far about this basic programming model and why it's different from loads and stores through shared addresses? They are very different; they are duals of each other. I'll tell you right now that if you can implement shared memory, you can implement messaging, and vice versa. There have been lots of fights over this, and at some point the whole community realized that you can do either with either: any programming model you can do in one, you can do in the other.

Okay. So, the default message type we're going to talk about is unicast. That's one-to-one: if you send to a destination, the destination is one other node, or one other process, or one other thread. The reason I say process, thread, and node is that it depends on how the API is structured. You could use an explicit message passing system on something like a uniprocessor, where you have different threads or different processes running and you're only messaging from one process to another process, even though it's one core. But there's a pretty natural extension of that: you can take those two processes, put them on different cores, and have them communicate using the same interface. We could also do this between two threads, but that's less common. Okay.
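To make that abstract interface concrete, here is a minimal sketch of what such an API might look like. The names and signatures are hypothetical, not from any particular library; they are only meant to show what each call has to name.

```c
#include <stddef.h>

/* A minimal sketch of the abstract send/receive interface described
 * above. These names and signatures are hypothetical. */

typedef int node_id;   /* a destination: a core, process, or thread */

/* The sender must name the destination; the data travels with the message. */
void msg_send(node_id dest, const void *data, size_t len);

/* Simplest receive: take the next message that arrives, from anyone. */
void msg_recv(void *buf, size_t len);

/* Optional refinements: receive only from a given source, or only
 * messages carrying a given tag (the MPI-style flow name). */
void msg_recv_from(node_id src, void *buf, size_t len);
void msg_recv_tag(int tag, void *buf, size_t len);
```

The key asymmetry is that send always names a destination, while receive can be anonymous, filtered by source, or filtered by tag.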
So, unicast is one-to-one. Some messaging networks and some message passing programming models will have a couple of other choices here. We can have multicast. This is still communication from one node, but instead of the destination denoting one other location, it can denote a set of other nodes to communicate with. It's still messaging, except we've expanded our notion of destination so that it can be a set, not just a single thing. We can also have broadcast, where instead of a destination you have some magic flag which says: communicate this with every other node in the system, or every other process in the system, or every other thread in the system, or every other process in that process group or thread group. There are ways to say that in these software protocols.

Just a quick question here: has anyone here programmed in MPI? Okay, good. So, we'll talk briefly. I'm going to show a brief code example of one process communicating with another process via a very common messaging library called the Message Passing Interface, or MPI for short.

So, let's look at this piece of code. It's C code, and what's happening is that we start our program and we have myid. The programming model of MPI is that it will actually start the exact same program on multiple cores or as multiple processes; it's not threads, but multiple processes on multiple cores will all start up the exact same program. This is sometimes called SPMD, the single program, multiple data programming model. So in the SPMD model here, what you're going to notice is that we're actually going to execute this program twice, once on one core and once on the second core. Now you might ask: well, if they're executing the exact same program, are they going to do the exact same thing? Let's take a look.

Okay. So, we have this thing called myid. This is what tells us which core we are. And there's something here called numprocs, which is going to tell us how many processors there are in the computer. And we fill these in.
So, you see here, we start off by calling MPI_Init. MPI_Init sets up the message passing system from a software perspective. Then we call MPI_Comm_size with some magic parameters, and it fills in this field, which tells us how many processes there are in this MPI program. When you launch an MPI program, there's a special MPI launch command you have to run, which takes the program you're trying to run and the number of processes you're trying to run it on. So, our program can dynamically detect how many processes it's running on. It can also detect what's called our rank, which is our ID. So now we can detect whether we were the first process or the second process launched in a two-process system, which one we are, and we fill in myid with our rank. This is a library that figures it out for us, in this MPI implementation.

Okay. So, now we assert that numprocs equals two. This is just to make sure we're not running on three processors or ten processors or something like that. We want strictly two, not one and not three, so the assert will fail if there's a problem there.

Okay. So now, this is where we can do different things. Single program, multiple data: the reason it's called this is that we can actually make decisions based on our processor ID, myid here. So we'll say: is myid zero? If so, do this; else, do this other piece of code. Okay? So, does anyone want to hazard a guess at what this program does so far? One processor is going to be executing this branch, and the other processor is going to be executing that one.

The first processor executes a send. Let's look at this first line and see what happens. This is the data we're trying to send: we're sending x, which is an integer, which we load with the number 475, our course number. And we send that with a particular tag. MPI is structured around this notion of tags, which is how you can effectively connect up a sender and a receiver. Or, sorry, that's how you can connect up multiple pieces of traffic between senders and receivers.
The other way you match things up is by a number saying who you're sending to and receiving from; this is the core number, the rank. So, what this line says is: send x to rank one with a tag, and we're going to say the tag is 475 also. Don't worry about the rest of the MPI arguments here. And this argument is a length, so this says: send one word of size integer to processor one with a particular tag. And MPI_COMM_WORLD is basically an extra argument which says whether you want to involve all of the MPI processes, or you can make subgroups; MPI has this notion of groups, but that's more complicated.

Okay. So, at the same time, this other program is executing over here. The first thing it does is a receive. If it gets here first, it's just going to block and wait in this receive. Conveniently, the first processor did a send, and there's a matched receive here. So, it's going to receive, with tag 475, length one, of an MPI integer, from rank zero, and return a status code. It fills in y with the data that got sent in this message. That's pretty fun: we just communicated information between two processes.

Okay. So, let's walk through what's happening here. Process zero does the send, so it's going to send 475 to processor one. Processor one could have gotten here early, to this receive. Processor one does not execute any of the code in the other branch, and it has a different address space; its x and y are different variables, and we have no shared memory here. It gets here and does the receive, and at some point the message shows up at the receiver, gets received, and is put into y. The receive just fills it in, which is why we pass the address of y. Now we increment y by 105, so we've done some computation on this node. At the same time, while we're doing this receive and this increment and this send, process zero is basically just sitting there waiting, trying to receive. So, we have process zero send some data to process one; process one does some math on it, incrementing the number, and then it's going to send that number back.
So, it does the send back here, with y again. We're going to send y to process zero, of length one, and when the message shows up, we receive it in process zero, and process zero prints out a message. Okay, what does it say? Received number 580. Yes, and you should go take 580A, not 580 without an A, because 580 without an A is not the parallel programming class; that's the security class next to it. But 580A, you should go to it. So, it prints out the number. So, we can communicate from one process to the other process and back with messaging.

One of the things we should note here is that we are both moving data and synchronizing at the same time in this message. It's doing two different operations at once.
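Putting the whole walkthrough together, here is a minimal sketch of the ping-pong program described above, written against the standard MPI C API. The variable names and the choice of tag follow the description in the lecture, but treat this as an illustration rather than the exact code on the slide.

```c
#include <assert.h>
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int myid, numprocs;
    int x, y;
    MPI_Status status;

    MPI_Init(&argc, &argv);                   /* set up the message passing system */
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs); /* how many processes are running    */
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);     /* which one am I (my rank)?         */

    assert(numprocs == 2);                    /* this example wants exactly two    */

    if (myid == 0) {
        x = 475;                              /* the course number                 */
        /* send one int to rank 1 with tag 475 */
        MPI_Send(&x, 1, MPI_INT, 1, 475, MPI_COMM_WORLD);
        /* block until the incremented value comes back */
        MPI_Recv(&y, 1, MPI_INT, 1, 475, MPI_COMM_WORLD, &status);
        printf("Received number %d\n", y);    /* prints 580 */
    } else {
        /* rank 1: receive, do some math, send the result back */
        MPI_Recv(&y, 1, MPI_INT, 0, 475, MPI_COMM_WORLD, &status);
        y += 105;
        MPI_Send(&y, 1, MPI_INT, 0, 475, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}
```

Both ranks run the same binary and branch on their rank, which is exactly the SPMD model; you would launch it with something like mpirun -np 2 ./pingpong.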
Now, that's a good question: how does this actually get implemented? This is just a programming model right now, and there are many different ways to implement MPI, and many ways people have implemented the interface of MPI send and MPI receive. People have implemented it over a hardware message passing network, which does not have to go into the OS: you send, and there's effectively hardware there which implements MPI send and receive, sends it out over the network, and receives it on the receive side, with some interconnection network in the middle. That's one way. Another way is that you run MPI over a shared memory machine, at which point MPI send (this is going to sound kind of funny) basically takes the data and copies it into RAM somewhere, and when the receiver goes to receive it, it looks it up in a hash table, finds the location in RAM, and reads from that pointer. That's an option, and it is actually one of the most common ways that people run MPI today; it doesn't have to go into the OS because it's just shared memory. On small machines, people run MPI that way, and it has typically had better performance than the equivalent shared memory program. The reason is that you've made the communication explicit. The coherence protocol can basically be optimized for producer-consumer: you have a write to some location and then someone else reads it, and you're not going to have random false sharing problems or lots of other problems. So, that's pretty common.

Where MPI has its biggest use today is in massively parallel computations, the supercomputers of the world: all the supercomputers minus the vector supercomputers, which typically don't use MPI. The biggest computers in the world right now, the Roadrunner computer at Los Alamos, for instance, and similar machines, will have special network cards that effectively implement MPI, and they do it in user space. Another way to do it is to implement MPI over TCP/IP going through the operating system, and that's pretty common when people run small clusters of computers: MPI send and receive go into the library first, the library reads and writes sockets, which trap into the operating system, and at some point that goes out over the network card. All of those are possible, and all of them have very optimized MPI implementations, because this is the most widely used parallel messaging library for high-performance applications. People have optimized it quite a bit.

Okay. So, this brings us to the question (there are a lot of words on this slide, but I do want to talk about it) of message passing versus shared memory. We have two different parallel programming models: message passing, where you have to explicitly name your destinations, and shared memory, where you write to some location and at some point in the future someone goes and reads from it, and you don't know who the reader is. It could be any of the processors in the system. It could even be literally random: you read some random number and it tells you who is going to read that location.
But if you think about it, because anyone can write to a location and anyone else can go read it, shared memory is effectively one-to-all communication, whereas message passing allows your microarchitecture and your system to optimize around point-to-point communication.

Okay. So, let's compare these two things. In message passing, memory is typically private, per node or per process or per core; memory is not shared. In shared memory, by definition, memory is shared. Message passing has explicit sends and receives in our software code: we actually write an MPI send and an MPI receive. Conversely, in shared memory we have implicit communication via loads and stores, and a load or a store names an address, but it does not name a core number or a process number or a thread number. Message passing moves data: we do a send and we put some data in it. And, by definition of receiving a message on the receive side, there is some synchronization there; there's a producer-consumer relationship from the sender to the receiver. In shared memory, you need to add explicit synchronization via fences, locks, and flags. You need to add something in there to do the synchronization, and if you don't, you can have race conditions: you can pick up incorrect data, which is a pretty common programming error in shared memory programming. And in shared memory you don't need to know the destination, whereas in message passing you do.

From a programming perspective, what I wanted to point out is that message passing is very natural for producer-consumer style computations. If you have one node producing some value and another node reading that value, it's very natural: you set up a send from one node to the other, receive it at the other node, and you have a channel between them that you can communicate over. You can send data down it, it's all in order, we'll say, and stuff comes out the other side. Very natural for producer-consumer.
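As a sketch of that channel idea, here is what a producer-consumer stream might look like in MPI, assuming the same two-rank setup as the earlier example. The functions produce_item and consume_item are hypothetical placeholders for whatever work each side does, and the tag and item count are arbitrary.

```c
#include <mpi.h>

int  produce_item(int i);     /* hypothetical: make the i-th value       */
void consume_item(int item);  /* hypothetical: do something with a value */

#define NITEMS     100
#define STREAM_TAG 7

/* Rank 0 produces values in order; rank 1 consumes them in order.
 * Assumes MPI_Init has already been called and there are exactly two ranks. */
void stream(int myid) {
    if (myid == 0) {                          /* producer side of the channel */
        for (int i = 0; i < NITEMS; i++) {
            int item = produce_item(i);
            MPI_Send(&item, 1, MPI_INT, 1, STREAM_TAG, MPI_COMM_WORLD);
        }
    } else {                                  /* consumer side of the channel */
        for (int i = 0; i < NITEMS; i++) {
            int item;
            MPI_Recv(&item, 1, MPI_INT, 0, STREAM_TAG, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            consume_item(item);
        }
    }
}
```

Because the channel delivers messages in order, the consumer sees the items in exactly the order they were produced, with no extra synchronization in the program.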
With shared memory, you have to implement producer-consumer the way we did last lecture, with a FIFO in memory and locks on that structure. But what is easy in shared memory, and hard to do in message passing, is a large shared data structure. Take, for example, a big table that you're trying to process in parallel. Say you have lots of little files, and you give each of these files to a different processor. Each processor reads its file, and what we're trying to do is build a histogram: it reads some number out of the file and, based on that number, it goes to a shared data structure and increments that location by one. Now, because it's shared memory, you want to make sure that two processors or two threads aren't trying to increment the same number at the same time. So you probably want to lock that location in the table, or lock the whole table, increment it, and then unlock it, so that you don't have two people incrementing it at the same time. But it's very natural to build a shared table in shared memory, have locks on that shared table, and have different processors operating on that shared piece of data at the same time.
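Here is a minimal sketch of that shared-table histogram using POSIX threads, with one mutex protecting the whole table. The helper read_next_value is hypothetical, standing in for reading numbers out of a file, and a lock per bucket would work just as well as the single table lock.

```c
#include <pthread.h>
#include <stddef.h>

#define BINS 256

/* Hypothetical helper: reads the next number from a file, returns 0 at EOF. */
int read_next_value(const char *filename, int *out);

/* One shared histogram table, one lock, shared by all threads. */
static int histogram[BINS];
static pthread_mutex_t histo_lock = PTHREAD_MUTEX_INITIALIZER;

static void histogram_add(int value) {
    pthread_mutex_lock(&histo_lock);    /* lock the table (or just this bin) */
    histogram[value % BINS]++;          /* increment that location by one    */
    pthread_mutex_unlock(&histo_lock);  /* unlock so others can update       */
}

/* Each worker thread processes its own file and updates the shared table. */
void *histogram_worker(void *arg) {
    const char *filename = (const char *)arg;
    int v;
    while (read_next_value(filename, &v))
        histogram_add(v);
    return NULL;
}
```

Without the lock, two threads incrementing the same bin at the same time would race and lose updates, which is exactly the kind of shared memory bug mentioned above.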
So, the interesting thing here is that you can actually tunnel shared memory over messaging, and you can tunnel messaging over shared memory.

Let's look at the first example: how to implement shared memory on top of messaging. You could do this in software. In software, we effectively try to turn all of our loads and stores into sends and receives. Maybe there's one centralized node which holds the big notion of memory, that's one way to do it, or you distribute it somehow. It's going to be pretty painful, but people have implemented systems like this, where your compiler goes and picks out loads and stores and turns them into messages. A more common thing is to have hardware that automatically turns the communication into messages. So, we take this notion of a programming model and instead turn it into a broader notion of entities communicating via explicit messages sent over some sort of interconnect. This is actually the most common way these days that people implement large shared memory machines: the memory traffic gets packetized (we'll talk about packetization in a second), gets sent over a network, gets received, and then some receiver does something with it.

As an example, let's say this core here wants to do a load, and the data is in main memory. On a bus, the core could just shout, "I need address five," and the memory would shout back, "Got it, address five has the value six." If we have a switched interconnect, we put switches in there, and the core actually packetizes the memory request: it takes the load, makes a message out of it, and sends it over the network, and it shows up at the memory. The memory reads the message, makes a response, and sends it back over the network. So effectively, we can tunnel shared memory over a message network. For large cache coherence systems, this is actually the most common thing that happens. We'll be talking about this in greater depth in two classes, when we get to directory-based protocols, which effectively implement shared memory over these switched interconnection networks.

You can also implement messaging over shared memory, and this is pretty common in small systems that have shared memory and no other way to communicate. This is exactly what we talked about last lecture: we had a FIFO with head and tail pointers, the producer can enqueue onto that queue in main memory, the consumer can go read from it, and you can implement messaging that way (there's a small sketch of such a queue below).

So, I guess what I'm trying to get at here is that messaging and shared memory are duals of each other. One may be more natural for a given problem than the other, but you can implement any algorithm in either, and you can implement each one on top of the other. And that's been shown at this point.
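To make that last point concrete, here is a minimal sketch of a message channel tunneled over shared memory: a single-producer, single-consumer FIFO with head and tail pointers, like the one from last lecture. This version uses C11 atomics for the acquire/release ordering that plays the role of the fences discussed earlier; the fixed queue size and the int payload are just for illustration.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define QSIZE 64   /* capacity of the ring */

/* A "message channel" built out of shared memory: one producer enqueues,
 * one consumer dequeues, and head/tail index into the ring buffer. */
typedef struct {
    int buf[QSIZE];
    _Atomic unsigned head;   /* total items enqueued (producer advances) */
    _Atomic unsigned tail;   /* total items dequeued (consumer advances) */
} channel_t;

/* "send": returns false if the queue is currently full. */
bool ch_send(channel_t *ch, int data) {
    unsigned h = atomic_load_explicit(&ch->head, memory_order_relaxed);
    unsigned t = atomic_load_explicit(&ch->tail, memory_order_acquire);
    if (h - t == QSIZE)
        return false;                             /* full */
    ch->buf[h % QSIZE] = data;                    /* write the payload first   */
    atomic_store_explicit(&ch->head, h + 1,
                          memory_order_release);  /* then publish it           */
    return true;
}

/* "receive": returns false if there is nothing to read yet. */
bool ch_recv(channel_t *ch, int *data) {
    unsigned t = atomic_load_explicit(&ch->tail, memory_order_relaxed);
    unsigned h = atomic_load_explicit(&ch->head, memory_order_acquire);
    if (t == h)
        return false;                             /* empty */
    *data = ch->buf[t % QSIZE];                   /* read the payload          */
    atomic_store_explicit(&ch->tail, t + 1,
                          memory_order_release);  /* free the slot             */
    return true;
}
```

The release on head and the matching acquire on head are what guarantee the consumer sees the payload before it sees the updated pointer, which is the causality ordering that the locking discussion earlier was about.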