Okay. So, today we're talking about parallelism and synchronization. From a parallelization perspective, we're worried here about parallel programming and how to build computer architectures that can simultaneously run multiple programs, or multiple threads within one program. And today, we're going to talk about some models of that synchronization, and some primitives to help you solve it, like synchronization primitives and memory fences.

Here's a little bit of background, or motivation, for why we're going to multiple processors nowadays. Shown here is Moore's Law plotted on a log plot: the number of transistors going up exponentially. Our sequential performance has gone up, and then stopped at this point. If you go look at the new Core i7 or something like that, it's kind of flattened out here; it isn't continuing to go up. And in fact, the numbers that they actually publish for sequential SPECint are actually parallel numbers these days, because they're using multiple cores to make SPECint and SPECfp go a little bit faster. And clock frequency has also flattened out; we didn't get to our 10GHz processor. Largely, that is due to us hitting some power challenges here, but it's also just incredibly hard to build and design these very high-frequency processors.

So instead, we started to use these more and more transistors to implement multiple cores. We started off with two cores, then four cores, and we've gone up from there. Intel, I think, is now selling an eight-core part, and a ten-core part is coming out soon. AMD has a sixteen-core part, but it is not a true sixteen-core part; I think it is two eight-core dies glued together on a chip. And there's some other, weirder stuff up here: some actual manycore, multicore processors with more cores. But people built multiprocessors before the multicore revolution. Back in those days, there were multiprocessor systems: you could take multiple chips and put them together somehow into small systems or large systems. And for the rest of class, between now and the end of the term, we're going to be talking about parallel computing systems, both systems that have multiple cores on one chip and systems with multiple chips. There are some subtle differences between those two things, but a lot of commonality.

In today's lecture, we're going to focus on symmetric multiprocessors and on memory systems for sharing of data, using shared memory to share data. There are other ways to share data, and we'll be touching on that in two lectures when we talk about messaging in more detail. I bring up symmetric multiprocessors because it's the simplest model to reason about. All processors are up here, and memory is down here. Let's assume there are no caches in the system to begin with, everyone is equidistant from memory, and it's one big memory. And you have multiple processors, which could either be running multiple threads in one program or multiple programs. A couple of other interesting things: these concurrency challenges don't come up just from having multiple processors. There are other things that can communicate with memory, or communicate via memory. We have all these different I/O controllers down here, network cards, disk controllers, graphics cards, and they also want to read and write memory.
So, in your memory system, even on a uniprocessor system, there are multiple agents that want to read and write memory simultaneously.

Okay. So, let's talk about synchronization. The dining philosophers problem is an example of a synchronization challenge, and we're going to talk about two major synchronization ideas, but there are more than the ones shown on the slide, namely sort of broadcast models and other things like that. For right now, when I say synchronization, it's some way to synchronize communication, or arbitrate communication, in a restricted fashion. We're going to look at two here. The first is producer-consumer; that's the figure on the right here. A producer, as the name implies, produces some values, and a consumer consumes the values. This is the most basic form here: one producer, one consumer. You could think about having one producer and multiple consumers, multiple producers and multiple consumers, or multiple producers and one consumer; a lot of the same ideas hold. So, that's one model that people like to use.

Another model that people like to use is that there's some shared resource, and you want to make sure that no more than, let's say, one processor is trying to access that shared resource. This resource could be a disk, a graphics card, or it could actually be a location in memory. We're going to call this mutual exclusion, and the reason it's called mutual exclusion is that it's exclusive who can access the resource at one time. Now, the most basic form is that only one processor, or one entity, can access the resource at a time. A more general form is that some number of processors can access the resource at one time. This is the more general semaphore question, which we'll be talking about a little bit later. What I mean by that is, you could have P processors, let's say P is twenty, and you could have a resource that no more than two processors can access at the same time. You might say, well, why would you want to do that? Well, a good example is something like a network card that has two outbound queues, we'll say. So, you can actually have multiple processors using it, but if you try to have three processors use that one network card at the same time, well, there are only two outbound queues; it has to decide, somehow, how to share those. So that's an example of a true semaphore versus just a mutex.

Okay. So, let's go through some code examples here, because these are going to be instructive. Actually, before I do that: you can have synchronization challenges even on a uniprocessor system. How is that possible? Well, as I said, even in a uniprocessor system, you sometimes have multiple concurrent entities. It could be your disk drive doing DMA into memory at the same time as a process is trying to read from that block, for instance, and you need to guarantee that the block has been completely read in before the processor goes and tries to read it. That's one way it can happen in a uniprocessor system. Another way it can happen is that uniprocessor systems can many times be multiprogrammed: the operating system will actually time-slice between different programs, and in a preemptive environment, at least, you'll get interleaving of instructions from different processes. So, you might be running one process, and the timer tick goes off.
You stop that process, you switch to another process, and you start executing code from that. At some point, the timer tick goes off again, and you switch back to the first process. So, even on a uniprocessor system, you can have synchronization challenges.

Okay. So, let's look at a producer-consumer example. Let's look at it abstractly first. You have a producer here and a consumer there. We're going to use memory, one big symmetric block of memory, for all of the communication at this point. We have a queue between the producer and the consumer, shown here, and then we have a head and a tail pointer. It's a FIFO, a first-in, first-out queue: the head and the tail can sort of chase each other, and it's circular, so they wrap around at some point. You have a register here on the producer, and you can move the tail pointer into this register. And then, we have three registers over here: the tail pointer register, the head pointer register, and a register for the value we receive. The reason I point out these different registers is just to show that the producer owns this register, and the consumer owns these three registers. And, as you can see here, this says Rtail and that says Rtail; they're going to have different copies of Rtail, and those could potentially get out of sync. We're going to see some fun hijinks happening.

Okay. So, let's look at the basic sequence of code here, where the producer wants to put an item x into the queue for a consumer to read sometime in the future. The first thing it does is read the tail pointer into its register. Then, since this is a load-store architecture, it does a store of value x to the tail, which puts it here. Now, we have to bump the tail. We can't do this atomically in memory, so we add one to it in our own register, and then we store that back into the tail pointer in memory. Seems simple enough. Let's look at what the consumer does. The consumer loads the head pointer, and the first thing it really wants to figure out is whether the head equals the tail, because that would mean the queue is empty and it just needs to block. So it loads the tail pointer; now the head pointer is in this register and the tail pointer is in that register, and it checks whether they're equal. If they are equal, it jumps back up here, reloads the tail pointer to see if it's been updated by the producer, and just spins here forever until the tail pointer is not equal to the head pointer, which means there's some data available. Then we fall through. In the fall-through case, we load from the head into R, into this register; that's the read. Then we update the head pointer to the next location, save it off, and do something with the value we got; "process" here just means do something. Now, this program is a little naive, because we've been talking about out-of-order processors and out-of-order memory systems, and this is assuming that instructions are executed in order and that there's no interleaving in the memory system between the consumer and the producer, and vice versa. So, does anyone think there are any problems with this? Does this work fine? Yup. Okay.
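Before digging into what can go wrong, here is a minimal C sketch of the naive queue just walked through; it's my own rendering, since the lecture's version is load-store assembly on slides. The names (queue, head, tail, QSIZE) and the fixed-size circular buffer are illustrative assumptions, and the numbered comments match the four instructions we're about to single out. Like the lecture code, it never checks for a full queue. Note that volatile only forces the compiler to actually issue the loads and stores; it gives no inter-processor ordering guarantees, which is exactly the problem coming up.

```c
#define QSIZE 16                       /* assumed queue capacity */

/* Shared state: one big symmetric memory, no caches assumed. */
volatile int queue[QSIZE];
volatile int head = 0;                 /* advanced only by the consumer */
volatile int tail = 0;                 /* advanced only by the producer */

/* Producer: write the item at the tail, then bump the tail pointer. */
void produce(int x) {
    int rtail = tail;                  /* load tail into the producer's register */
    queue[rtail] = x;                  /* (1) store x into the queue slot */
    rtail = (rtail + 1) % QSIZE;       /* bump in a private register... */
    tail = rtail;                      /* (2) ...then store the new tail */
}

/* Consumer: spin until the queue is non-empty, then read and bump head. */
int consume(void) {
    int rhead = head;                  /* load head into the consumer's register */
    while (rhead == tail)              /* (3) reload tail; spin while "empty" */
        ;
    int r = queue[rhead];              /* (4) load the value at the head */
    head = (rhead + 1) % QSIZE;        /* publish the new head pointer */
    return r;
}
```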
So, you're saying that if the consumer runs before the producer does anything, it might just try to take values off the queue. Well, let's guard against that and say the start condition is head pointer equals tail pointer, so that won't happen; you'll sit spinning here. Now, because nobody is jumping up and down screaming at this point, you're all assuming that these operations happen in order. But we've talked about out-of-order machines that reorder loads relative to stores. We've talked about very interesting interleavings. We've talked about out-of-order memory systems. I've been jumping up and down here asking: what do we guarantee about loads and stores to different addresses? And we said all we know is that a store and a load, serialized in that order, on the same processor, to the same address, will happen in order. Our out-of-order superscalar processors and our out-of-order memory systems, as we've discussed them, have defined nothing about ordering between processors, or even between memory operations to different addresses inside one processor. So, that's what we're really going to be talking about today: how those instructions get reordered, and how to prevent that from causing havoc.

Okay. So, let's look at this and number these purple instructions; we're going to number the two stores and the two loads. The reason we look at those in particular is that these are the two instructions that update the state the other thread is reading, and these are the two instructions that read that state. The programmer assumes there's some level of causality here. Incorrectly, the programmer is assuming, if you look at this code, that if three, this load, happens after this store here, then there's some relation between one and four: namely, that one has happened by the time four happens. Not a good assumption in our memory systems, and especially not in our out-of-order memory system. Because, if we recall, the out-of-order memory system said nothing about store ordering; the only thing we talked about was having a load to the same address as a store and not moving that load past that store. Otherwise, these two stores can very easily be reordered by an out-of-order processor or an out-of-order memory system. Likewise, over here, these are just two loads, and these loads can happen completely out of order in an out-of-order processor or an out-of-order memory system. No guarantees there. So, this assumption is bad in a multiprocessor system. In a uniprocessor system it might make sense; in a multiprocessor system, bad assumption.

So, let's look at some sequences that are problematic. Namely, let's say two happens, then three, then four, then one. What happens then? Well, two updated the tail. Three sees the value of that new tail pointer. But at this point, we haven't actually stored the value into the queue, because one never happened. We go execute four, and it just reads garbage from memory. And then, finally, the value gets updated. Totally valid in the out-of-order memory systems we've talked about up to this point; nothing wrong there. Okay, now let's say four and then one happen. Four goes and does the load, reading the value at the head. Then, the value gets written. And then, two and then three happen. This is just completely out of order; there are no good semantics here.
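To make the first bad sequence concrete against the sketch above, here is the 2, 3, 4, 1 interleaving written out as a timeline; the numbering is the same (1 = store x, 2 = store tail, 3 = load tail, 4 = load the value):

```c
/* Initially head == tail == 0, and queue[0] holds stale garbage.
 *
 *   Producer (stores reordered)      Consumer
 *   ---------------------------      ------------------------------------
 *                                    rhead = head          // rhead == 0
 *   (2) tail = 1
 *                                    (3) rhead == tail ?   // 0 != 1: "not empty"
 *                                    (4) r = queue[0]      // reads GARBAGE
 *   (1) queue[0] = x                 // the value arrives after the read
 *
 * (1) and (2) go to different addresses, so nothing we have defined so far
 * stops the processor or the memory system from performing (2) before (1).
 * The 4, 1, 2, 3 sequence is broken the same way: the consumer's two loads
 * are reordered, so (4) reads the slot before the producer has written it.
 */
```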
So, this piece of code, which the programmer would assume to have good semantics, basically has no useful semantics unless we define what those semantics are.
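Looking ahead: one way to pin those semantics down is with memory fences, which were flagged at the start as one of today's topics. As a hedged sketch, not the lecture's own code, here is the same queue written with C11 acquire/release atomics, which impose exactly the ordering the programmer was assuming: (1) must be visible before (2), and (3) must complete before (4). Like the naive version, it still omits the full-queue check; it only fixes the ordering bug.

```c
#include <stdatomic.h>

#define QSIZE 16
int queue[QSIZE];
atomic_int head = 0, tail = 0;

void produce(int x) {
    int rtail = atomic_load_explicit(&tail, memory_order_relaxed);
    queue[rtail] = x;                            /* (1) fill the slot first...   */
    atomic_store_explicit(&tail, (rtail + 1) % QSIZE,
                          memory_order_release); /* (2) ...then publish; release
                                                    keeps (1) ordered before (2) */
}

int consume(void) {
    int rhead = atomic_load_explicit(&head, memory_order_relaxed);
    while (rhead == atomic_load_explicit(&tail, memory_order_acquire))
        ;                                        /* (3) acquire pairs with the
                                                    release on tail...           */
    int r = queue[rhead];                        /* (4) ...so this load is
                                                    guaranteed to see (1)        */
    atomic_store_explicit(&head, (rhead + 1) % QSIZE, memory_order_relaxed);
    return r;
}
```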