But before we do that, I wanted to talk about programming and a competitor to using shared memory for programming. So far, we have looked at a shared memory architecture to communicate from one core to another core. What that means is that one core does a store to some memory address, and sometime in the future, another core goes to read that memory address and gets the data. What's interesting about that paradigm is that the sender of the data, the core which is writing the data or doing the store, does not need to know which core is going to read the data in the future. It's also possible that no one will ever read that data, and in shared memory architectures, that doesn't affect anything. One core writes a piece of data to some address, and given this implicit name, we effectively have an associative match: you can look up the piece of data by its address and go find it sometime in the future. Now, you probably need to use locking to guarantee causality between one processor writing the data and another processor reading it in shared memory, but you don't need to know the destination. In contrast, we're going to talk about explicit message passing as a programming model. There are lots of hardware implementations you can use for this, but let's start off by talking about explicit messaging as a programming model to communicate between different threads, processes, or cores. What we're going to have here is an API, a programming interface, of send, which names the destination and passes in a pointer to some data. Our messaging API is going to take this data and somehow get it to the receiver, the destination. On the receive side, the most basic thing you can do is just receive data. Note that this receive does not denote who it's receiving data from. So a common thing to do is just receive in order: receive the next piece of data that comes in. Now, that may not be super useful, so you may want to extend this so that the receive also takes a source, but it's not required. The reason I'm talking about this in such an abstract fashion is that there are many different explicit message passing programming interfaces, because it's just software that people have implemented over the years. But at the fundamental level, there's a send and a receive. The send requires a destination, and the receive can just receive off of the input queue, or it can try to receive from a pre-declared source. Another way to do this is what they do in MPI, the Message Passing Interface, which is probably the most common programming interface for messaging: the receive takes a tag along with the source. A tag is effectively a number that names not the message type but the traffic flow; you sort of name the flow. A tag is sort of a more generalized form of what TCP/IP has in ports, for instance; UDP has ports too.
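To make this concrete, here is a minimal sketch of what such an interface could look like in C. The names (msg_send, msg_recv, msg_recv_from) are hypothetical, not any real library's API; they just illustrate the shape of the primitives described above.

    /* Hypothetical explicit message passing primitives (sketch only). */
    #include <stddef.h>

    /* Send 'len' bytes starting at 'data' to the process, thread, or core
       named by 'dest'. The sender must name its destination. */
    void msg_send(int dest, const void *data, size_t len);

    /* Receive the next incoming message, whoever sent it, into 'buf'. */
    void msg_recv(void *buf, size_t len);

    /* Optional extension: receive the next message from a particular source. */
    void msg_recv_from(int src, void *buf, size_t len);

MPI's receive, which shows up in the example later, matches on both a source rank and a tag.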
Okay. So, questions so far about this basic programming model, and why it's different than loads and stores through shared addresses? They are very different, and yet they are duals of each other. I'm going to tell you right now that if you can implement one of shared memory and messaging, you can implement the other, and vice versa. There have been lots of fights over this, and at some point the whole community realized that you can do either with either: you can take any program and write it in either programming model. Okay. So, the default message type we're going to talk about is unicast; that's one-to-one. If you send to a destination, the destination is one other node, or one other process, or one other thread. The reason I say process, thread, and node is that it depends on how the API is structured. You could use an explicit message passing system on something like a uniprocessor, where you have different threads or different processes running and you're only messaging from one process to another process even though it's one core. But there's a pretty natural extension of that: you can take those two processes and put them on different cores and have them communicate using the same interface. You could also do this between two threads, but that's less common. Okay. So, unicast is one-to-one. Some messaging networks and some messaging primitives, some programming models, will have a couple of other choices here. We can have multicast. This is communication from one node, but instead of a destination which denotes one other location, it can denote a set of other people to communicate with. It's still messaging, except we've expanded our notion of destination so it can be a set, not just a single thing. We can also have broadcast, where one node, instead of naming a destination, passes some magical flag which says: communicate this with every other node in the system, or every other process in the system, or every other thread in the system, or every other process in that process group or thread group. There are ways to say that in these software protocols. Just a quick question here: has anyone here programmed in MPI? Okay. And some of you have done parallel programming. Good. So, we'll talk briefly; I'm going to show a brief code example of one process communicating with another process via a very common messaging library called the Message Passing Interface, or MPI for short. So, let's look at this piece of code. It's C code, and what's happening here is, okay, we start our program and we have my ID. The programming model of MPI is that it will actually start the same program on multiple cores or multiple processes; not threads, but multiple processes or multiple cores will start up the exact same program. This is sometimes called SPMD, the single program, multiple data programming model. In the SPMD model here, what you're going to notice is that we actually execute this program twice: once on one core, and once on the second core. Now, you're going to ask, well, if they're executing the exact same program, are they going to do the exact same thing? Well, let's take a look. Okay. So, we have this thing called my ID; this is what tells us which core we are. And there's something here called numprocs, which is going to tell us how many processors there are in the computer. And we fill these in. So, you see here, we start off by calling MPI_Init. MPI_Init sets up the message passing system from a software perspective. Then we call MPI_Comm_size with some magic parameters.
And it's going to fill in this field, which tells us how many processors there are in this MPI program. When you launch an MPI program, there's a special MPI launch command that you have to run, which takes as parameters the program you're trying to run and the number of processes you're trying to run it on. So, our program can now dynamically detect how many processors we're running on. It can also detect what's called our rank, which is our ID. So now we can detect whether we're the first process or the second process launched in a two-process system, which one we are, and we fill in my ID with the rank. And this is a library which will figure this out for us, in this MPI implementation. Okay. So now we assert that the number of procs equals two. This is just to make sure we're not running on three processors or ten processors or something like that. We want strictly two, not one and not three, so the assert will fail if we have a problem there. Okay. So now, this is where we can do different things. Single program, multiple data: the reason it's called that is we can make decisions based on our processor ID, or my ID here. So, we do something which says: is my ID zero? Do this; else, do this other piece of code, okay? So, does anyone want to hazard a guess here about what this program does so far? One processor is going to be executing this, and the other processor is going to be executing that. The first processor executes a send. Let's look at this first line here and see what happens. This is the data we're trying to send: we're sending x, which is an integer, which we load with the number 475, which is our class, or course, number. And we send that with a particular tag. MPI is structured around this notion of tags, which is how you can effectively connect up a sender and a receiver, or rather, how you can connect up multiple pieces of traffic between senders and receivers. The other way you match things up is by the number of who you're sending to and receiving from, so that's the core number. So, what this says here is: send x to rank number one with a tag, and we're going to say the tag is 475 also; don't worry about the rest of the MPI stuff here. And this is a length, so this is going to say: send one word of size integer to processor one with a particular tag. And this MPI_COMM_WORLD is basically an extra flag you can pass which says whether you want to talk to all other MPI sub-processes, or you can make subgroups; there's a notion of groups that MPI has, but that's more complicated. Okay. At the same time, this other program is executing here. The first thing it's going to do is receive. So, if it gets here first, it's just going to block and wait in this receive. Conveniently, the first processor did a send, and there's a matched receive here. So, it's going to receive, with tag 475, a message of length one of an MPI integer from rank zero, and return a status code. It's going to fill in y with the data that got sent in this message. That's pretty fun: we just communicated information between two processes. Okay. So, let's look through what's happening here. Process zero does the send, so it's going to send 475 to processor one. Processor one could have gotten here early, at this receive.
And processor one does not execute any of this code. It has a different address space; its x's and y's are different, and we have no shared memory here. It gets here and it does the receive. At some point, the message shows up at the receiver, gets received, and is put into y; it just fills it in, which is why we pass the address of y. Now, we increment y by 105, so we did some computation on this node. While we're doing this receive, and this increment, and this send, process zero is basically just sitting there waiting, trying to receive. So, we have process zero send some data to process one; process one does some math on it, incrementing the number, and then sends that number back. So, it does the send back here with y again: we send y to process zero, of length one, and when the message shows up, we receive it in process zero, and process zero prints out a message. Okay. What does it say? Received number 580. Yes, and you should go take 580A, not 580 without the A, because 580 without the A is not the parallel programming class, that's the security class next to it. But 580A, you should go to it. So, it prints out the number. So, we can communicate from one process to the other process and back with messaging. One of the things we should note here is that we are both moving data and synchronizing at the same time in this message; it's doing two different operations at once.
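Pulled together, the program described above looks roughly like the following sketch. The MPI calls are the standard C API; the variable names and argument values follow the walkthrough, but the details on the original slide may differ.

    /* Two-process MPI example: rank 0 sends 475 to rank 1,
       rank 1 adds 105 and sends the result back, rank 0 prints 580. */
    #include <assert.h>
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[]) {
        int my_id, numprocs;
        MPI_Status status;

        MPI_Init(&argc, &argv);                   /* set up the messaging system     */
        MPI_Comm_size(MPI_COMM_WORLD, &numprocs); /* how many processes are running? */
        MPI_Comm_rank(MPI_COMM_WORLD, &my_id);    /* which one am I?                 */
        assert(numprocs == 2);                    /* this example wants exactly two  */

        if (my_id == 0) {
            int x = 475, y;
            MPI_Send(&x, 1, MPI_INT, 1, 475, MPI_COMM_WORLD);          /* send 475 to rank 1 */
            MPI_Recv(&y, 1, MPI_INT, 1, 475, MPI_COMM_WORLD, &status); /* wait for the reply */
            printf("Received number %d\n", y);                         /* prints 580         */
        } else {
            int y;
            MPI_Recv(&y, 1, MPI_INT, 0, 475, MPI_COMM_WORLD, &status); /* block until rank 0 sends */
            y = y + 105;                                               /* do some local computation */
            MPI_Send(&y, 1, MPI_INT, 0, 475, MPI_COMM_WORLD);          /* send the result back      */
        }

        MPI_Finalize();
        return 0;
    }

With a typical MPI installation, this would be built with mpicc and launched on two processes with something like mpirun -np 2 ./a.out; that launcher is the special MPI launch command mentioned above.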
So, this is a great question. What we have so far is a programming model; there are many different ways to implement MPI, and people have implemented the MPI send and MPI receive interface in many different ways. People have implemented it over a hardware message passing network, which does not have to go into the OS: you send, and there's effectively hardware there which implements MPI send and receive, sends the message out over the network, and receives it on the receive side, with some interconnection network in the middle. That's one way. Another way is to run MPI over a shared memory machine, at which point MPI send, and this is going to sound kind of funny, basically takes the data and copies it into RAM somewhere, and when the receiver goes to receive it, it looks it up in a hash table, finds the location in RAM, and does a read from that pointer. That's an option, and it's actually one of the most common ways people run MPI today; it doesn't have to go into the OS because it's just shared memory. On small machines, people run MPI that way, and it has typically had better performance than writing the equivalent shared memory program directly. The reason is that you've made the communication explicit, so the coherence protocol can effectively be optimized for producer-consumer: you write to some location, then someone else reads it, and you're not going to have random false sharing problems or lots of other problems. So, that's pretty common. Where MPI has its biggest use today is in massively parallel computations, the supercomputers of the world: all the supercomputers, minus the vector supercomputers, which typically don't use MPI. The biggest computers in the world right now, the Roadrunner computer at Los Alamos, for instance, and similar sorts of computers, will have special network cards that effectively implement MPI, and they do it in user space. Now, another way to do it is to implement MPI over TCP/IP going through the operating system. That's pretty common when people just run small clusters of computers: MPI send and receive go into the library first, the library reads and writes to sockets, which trap into the operating system, and at some point that goes out over the network card. All of those are possible, and all of those have very optimized MPI implementations, because this is the most widely used parallel messaging library for high performance applications and has the largest user base. So, people have optimized it quite a bit. Okay. So, this brings us to the question, and there are a lot of words on this slide, but I do want to talk about it, of message passing versus shared memory. We have two different parallel programming models. Message passing, where you have to explicitly name your destinations. And shared memory, where you write to some location and at some point in the future someone reads from it, and you don't know who the reader is; it could be any of the processors in the system. It could be literally random: you could read some random number and have it tell you who is going to read that location. But if you think about it, because you can write to a location and anyone else can go read it in shared memory, you effectively have one-to-all communication, whereas message passing allows your microarchitecture and your system to optimize around point-to-point communication. Okay. So, let's compare these two. In message passing, memory is typically private per node, per process, or per core; memory is not shared. In shared memory, by definition, memory is shared. Message passing has explicit sends and receives in the software: we actually write an MPI send and an MPI receive. Conversely, in shared memory we have implicit communication via loads and stores, and a load or store names an address but does not name a core number or a process number or a thread number. Message passing moves data: we do a send and we put some data in it. Also, by definition of receiving a message on the receive side, there's some synchronization there; there's a producer-consumer relationship from the sender to the receiver. In shared memory, you need to add explicit synchronization via fences, locks, and flags. You need to add something in there to do synchronization, and if you don't, you can have race conditions: you can get the wrong data, you can pick up incorrect data, which is a pretty common programming error in shared memory programming. And in shared memory, you don't need to know the destination.
From a programming perspective, what I want to point out here is that message passing is very natural for producer-consumer style computations. If you have one node producing some value and another node reading that value, it's very natural: you send from one node to the other node, receive it at the other node, and you have a channel between them that you can communicate over. You can send data in, it's all in order, we'll say, and whatever you send down the channel comes out the other side. Very natural for producer-consumer. In shared memory, you have to implement producer-consumer the way we did last lecture, with a FIFO in memory and locks on that structure. But what is easy in shared memory, and hard to do in message passing, is a large shared data structure. Take, for example, a big table. You're trying to process this table in parallel, and you have lots of little files, and you give each of these files to a different processor. Each processor reads its file, and what we're trying to do is build a histogram: the processor reads some number out of a file and, based on that number, looks at a shared data structure and increments that location by one. Now, because it's shared memory, you want to make sure that two processors, or two threads, or two processes, are not trying to increment the number at the same time, so you probably want to lock that location in the table, or lock the whole table, increment it, and then unlock it, so you don't have two people incrementing it at the same time. But it's very natural to build a shared table in shared memory, put locks on that shared table, and have different people operating on that shared piece of data at the same time, as in the sketch below.
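Here is a minimal sketch of that shared-table histogram pattern using POSIX threads and a mutex. The bin count, thread count, and the hard-coded arrays standing in for the "lots of little files" are all made up for illustration.

    /* Shared histogram built by several threads, protected by one lock. */
    #include <pthread.h>
    #include <stdio.h>

    #define BINS     10
    #define NTHREADS 4

    static int histogram[BINS];                  /* the shared data structure */
    static pthread_mutex_t hist_lock = PTHREAD_MUTEX_INITIALIZER;

    /* Each thread's "file": a few numbers to tally. */
    static int data[NTHREADS][3] = {
        {1, 2, 2}, {3, 1, 0}, {2, 2, 9}, {0, 1, 3}
    };

    static void *worker(void *arg) {
        int id = *(int *)arg;
        for (int i = 0; i < 3; i++) {
            int bin = data[id][i];
            pthread_mutex_lock(&hist_lock);      /* lock the shared table ... */
            histogram[bin]++;                    /* ... increment the bin ... */
            pthread_mutex_unlock(&hist_lock);    /* ... and unlock            */
        }
        return NULL;
    }

    int main(void) {
        pthread_t threads[NTHREADS];
        int ids[NTHREADS];
        for (int i = 0; i < NTHREADS; i++) {
            ids[i] = i;
            pthread_create(&threads[i], NULL, worker, &ids[i]);
        }
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(threads[i], NULL);
        for (int b = 0; b < BINS; b++)
            printf("bin %d: %d\n", b, histogram[b]);
        return 0;
    }

A finer-grained variant would keep one lock per bin, or use atomic increments, instead of one lock for the whole table.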
So, the interesting thing here is that you can actually tunnel shared memory over messaging, and you can tunnel messaging over shared memory. Let's look at the first example: how to implement shared memory on top of messaging. You could do this in software, where effectively you try to turn all of your loads and stores into sends and receives, maybe with one centralized node which holds the whole notion of memory, that's one way to do it, or you distribute it somehow. It's going to be pretty painful to do, but people have implemented systems like this, where you implement loads and stores, or your compiler goes and picks out loads and stores, and turns them into messages. A more common approach is to have hardware that automatically turns communication into messages. So, we take this notion of a programming model and turn it into a broader notion of entities communicating via explicit messages sent over some interconnect. And this is actually the most common way these days that people implement large shared memory machines: the memory traffic gets packetized, and we'll talk about packetization in a second, sent over a network, received, and then some receiver does something with it. As an example, let's say we have a core here that wants to do a load, and the data is in main memory. On a bus, the core could just shout, I need address five, and the memory would shout back, got it, address five has the value six. With a switched interconnect, we put switches in here, and the core can actually packetize the memory request: it takes the load, makes a message out of it, and sends it over the network, and it shows up at the memory. The memory reads the message, makes a response, and sends it back over the network. So effectively, we can tunnel shared memory over a message network. And for large cache coherence systems, this is the most common thing that happens. We'll be talking about this in greater depth in two classes, when we cover directory-based protocols, which effectively implement shared memory over these switched interconnection networks. You can also implement messaging over shared memory, and this is pretty common in small systems that have shared memory and no other way to communicate. It's exactly what we talked about last lecture: we had a FIFO with head and tail pointers, the producer effectively enqueues onto this queue in main memory, the consumer goes and reads from it, and you can implement messaging this way. So, I guess what I'm trying to get at here is that messaging and shared memory are duals of each other. One may be more natural for one thing or the other, but you can implement any algorithm in either one, and you can implement each one in terms of the other. That's been shown at this point.
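For the messaging-over-shared-memory direction, here is a minimal sketch of that head-and-tail-pointer FIFO for a single producer and a single consumer. The names and sizes are made up, and C11 atomics stand in for the fences a real implementation on a weakly ordered machine would need.

    /* Single-producer, single-consumer message queue in shared memory. */
    #include <stdatomic.h>
    #include <stdbool.h>

    #define QSIZE 16                    /* number of slots; keep it a power of two */

    typedef struct {
        int buf[QSIZE];
        atomic_uint head;               /* next slot to write (advanced by the producer) */
        atomic_uint tail;               /* next slot to read  (advanced by the consumer) */
    } spsc_queue;

    /* Producer side: the "send". Returns false if the queue is full. */
    static bool queue_send(spsc_queue *q, int msg) {
        unsigned head = atomic_load_explicit(&q->head, memory_order_relaxed);
        unsigned tail = atomic_load_explicit(&q->tail, memory_order_acquire);
        if (head - tail == QSIZE)
            return false;                                    /* full */
        q->buf[head % QSIZE] = msg;                          /* write the data first ... */
        atomic_store_explicit(&q->head, head + 1,
                              memory_order_release);         /* ... then publish it      */
        return true;
    }

    /* Consumer side: the "receive". Returns false if nothing has arrived yet. */
    static bool queue_recv(spsc_queue *q, int *msg) {
        unsigned tail = atomic_load_explicit(&q->tail, memory_order_relaxed);
        unsigned head = atomic_load_explicit(&q->head, memory_order_acquire);
        if (head == tail)
            return false;                                    /* empty */
        *msg = q->buf[tail % QSIZE];                         /* read the data ...        */
        atomic_store_explicit(&q->tail, tail + 1,
                              memory_order_release);         /* ... then free the slot   */
        return true;
    }

    int main(void) {
        spsc_queue q = { .head = 0, .tail = 0 };
        int y = 0;
        queue_send(&q, 475);            /* producer enqueues a message        */
        queue_recv(&q, &y);             /* consumer dequeues it; y is now 475 */
        return y == 475 ? 0 : 1;
    }

The data is written before the head pointer is advanced and read before the tail pointer is advanced, which is exactly the producer-consumer synchronization that a message send and receive give you for free.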