And Dijkstra came up with this naming nomenclature. So, this idea of implementing semaphores is all well and good, but we still need some way to get mutual exclusion. Believe it or not, you can actually do this with just loads and stores, and we're going to talk about that a little bit later in class. But it's really pretty slow. There are algorithms that can do this with just loads and stores, but from an implementation perspective, people typically try to go faster. So, if you want to go faster, it might be helpful for your computer architecture, your instruction set architecture, to provide a special instruction which allows you to do something atomically. When we say do something atomically, we mean nothing else can be interleaved in the meantime. And the simple solution here is that you add an atomic operation which can do a read, a modify, and a write, all atomically, with nothing else interleaved. So, going back to our locks and semaphores here: inside of this P and inside of this V, you need to somehow atomically modify S. You need to be able to read S, and you need to be able to write S, inside the implementation of P and inside the implementation of V. So, the primitive here that makes this faster is having some read-modify-write operation, and we're going to look at a couple of different choices of this. One of the most basic ones is test-and-set. x86 has this instruction; it's one of the most basic atomic operations on x86, and lots of other architectures have an instruction which does this. We write it here as a piece of pseudocode, but in reality, this is an instruction which will actually execute atomically relative to the whole rest of the system. So, let's look at what test-and-set does.
The basic idea of test-and-set is that it's probably one of the most primitive atomic operations you can do. I've seen something slightly simpler than this; there's a test-and-clear which is a little bit easier to implement. But largely the idea is that you're going to look at a memory address, and if that memory address holds what you expect it to, then you write some value. And implicit in this piece of code here is that the code which executed the test-and-set needs to know whether it succeeded and was able to write the value or not. So there's a sort of status code, and that's returned in this register R here. So, if we look at this piece of code, you pass into it a memory reference, a memory address, which is the address of the semaphore or the address of the lock, and you pass into it a register name. We do a load into that register, and that happens unconditionally. This part here, though, is the test part. The test part is that we check whether the thing we just read from this memory address is zero, and if so, we write a one to the memory address. Sometimes these test-and-set operations will actually allow you to write something else besides just one, but for right now, this is the more basic one: it writes a one to that memory address. And what's important here is that the architecture guarantees all of this happens atomically. But what's interesting is, if you think about the processors that we built, this requires a load and a store. We will talk about how to implement this a little bit in a few slides, but you basically have to stop everything else, stop all other loads and stores happening in the system, do this load, do the test, and possibly do the store, in hardware.
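As a sketch of the semantics just described, here is the test-and-set pseudocode written as a small C function. The names are mine, not from the lecture; I use C11's `atomic_exchange` to emulate the atomic read-modify-write, which is observationally equivalent when the only value ever stored is one.

```c
#include <stdatomic.h>

/* Sketch of test-and-set: atomically read *m into R; if the value read
 * was 0, write 1 to *m. The returned R acts as the status code: 0 means
 * the write happened (the location was free), nonzero means it did not.
 * Unconditionally exchanging in a 1 gives the same observable behavior,
 * since writing 1 over an existing 1 changes nothing. */
static inline int test_and_set(atomic_int *m) {
    /* atomically: R = *m; *m = 1; return R */
    return atomic_exchange(m, 1);
}
```

Used on a fresh location, the first call returns zero (success) and every later call returns one until someone stores a zero back.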
One of the more basic ways this is done, and one of the original ways it was done on some of the older bus-based architectures, is something called the lock bit on the bus. So what they did is, when one processor was going to execute an atomic operation like this, a test-and-set for instance, it would actually broadcast to all the other processors in the system saying: I'm about to do this, do not touch anything, do not issue any bus-based memory instructions. It would do the load, it would do the test, it would do the set, and then it would release the broadcast bit. This is a wire on a multi-drop bus. That's a pretty big hammer, and things have gotten a little better since then, but it gives you an idea of a basic hardware implementation of this. Things got a little more sophisticated, and these fancier atomic operations showed up in the 70s. So, this whole box here is still an atomic operation. All the code in here happens atomically, and what it basically does is take a memory address and add something to it. We add Rv to R and put that into memory at the same memory address, so we do the load and the store here atomically. So, it's a read-modify-write operation also, all atomic. Note, this still returns the original value into register R here, and sometimes that's actually very useful: if you're trying to implement a semaphore based on this, you kind of want to see what was there before. There are some jokes around fetch-and-add, though. The atomic-operation community kind of pushed this to the extreme at some point, and people started having fetch-and-something-else: fetch-and-multiply, fetch-and-divide, and things like that, and people started building some stranger machines that had fetch-and-whatever.
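The fetch-and-add primitive just described can be sketched directly with C11 atomics; the function name here is my own, but the semantics match the lecture's box: add to memory atomically and return the original value.

```c
#include <stdatomic.h>

/* Sketch of fetch-and-add: atomically read the old value at *m, add rv
 * to it, store the sum back, and return the ORIGINAL value, which is
 * what you want to see if you're building a semaphore on top of it. */
static inline int fetch_and_add(atomic_int *m, int rv) {
    return atomic_fetch_add(m, rv);  /* returns the value before the add */
}
```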
There's this old joke that someone was going to implement an instruction called fetch-and-FFT, so it would be an atomic Fourier transform. I don't think it ever got to that point. But in the early microcoded microprocessor days, there were some pretty sophisticated atomic operations. That's largely been decided not to be a good direction to go. You want a little bit of complexity in your atomic operations, so maybe test-and-set is a little too simple, but some of these fancier fetch-and-FFTs are probably a little bit too complex. Here's another interesting operation that is atomic, and this one you can actually build some pretty interesting semaphores out of. It's called swap. It takes a register value and a memory location, and it swaps what's in the memory location with what's in the register. And note, we use a temporary here, because you can't actually swap two things without a temporary. This one is actually not very common, because it's kind of hard to use. What's a little bit more common is something called compare-and-swap, which sits somewhere between a test-and-set and a swap operation. That's what's commonly used, and we'll talk about it in a second. Okay, so now we go back to the multiple-consumer problem that we had, where multiple threads were trying to dequeue from a queue. Let's introduce test-and-set, and we'll look at our critical section here. We're going to use this term mutex to mean a semaphore that only one thread can enter at a time. What happens at the beginning here is there's a tight loop, and this tight loop does a test-and-set on the memory location of the mutex into a temporary register and compares the result to zero. Remember, we defined test-and-set as writing a one to memory when the location held a zero; the value it returns tells you whether you got the lock.
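The swap operation described above can be sketched like this; the temporary the lecture mentions is exactly what `atomic_exchange` hides for you in C11. The function name is illustrative.

```c
#include <stdatomic.h>

/* Sketch of atomic swap: exchange the contents of a register (modeled
 * here by the variable *r) with a memory location, atomically. In
 * hardware a temporary holds the old memory value during the exchange;
 * atomic_exchange performs that for us in one atomic step. */
static inline void swap(atomic_int *m, int *r) {
    *r = atomic_exchange(m, *r);  /* temp = *m; *m = *r; *r = temp */
}
```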
So if the test failed, the code just sits and spins here. We call this busy waiting, or a traditional spinlock. Let's get the polarity straight: if the memory location was a one coming into the test-and-set, that means someone else already locked it, so we read a one and keep spinning. If it was a zero coming in, that means no one had locked it; we read a zero and atomically set it to one, so we now hold the lock. So if we acquired the lock, we drop into our critical section and start executing it. Inside the critical section, we can do stores, we can do loads, we can do whatever we want, because we're guaranteed that no other processor, no other thread, is touching the shared state. And then at some point, we need to do the release. For that we actually don't need a test-and-set; we just need a store. A store of zero clears that memory location, and some other thread is then safe to fall into the critical section. Okay. This can be done with lots of different implementations. We showed test-and-set here, but it can be done with swap, and it can be done with fetch-and-add; the code just gets a little more complex. But one of the interesting questions that comes up is: what happens if, as you're running along, some thread acquires the lock, falls into the critical section, and, let's say, terminates, or gets swapped out for a very long period of time? What's going to happen here? Well, not quite a system crash; we don't reference bad memory here. This is more of a hang. The system is going to freeze.
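Putting the acquire loop and the store-based release together, here is a minimal sketch of the spinlock just described. I use C11's `atomic_flag`, whose `atomic_flag_test_and_set` is precisely this primitive; the type and function names are mine.

```c
#include <stdatomic.h>

/* Sketch of a spinlock built from test-and-set. atomic_flag's
 * test-and-set returns the OLD value: false (0) means the flag was
 * clear, so we acquired the lock; true (1) means someone else holds it
 * and we keep spinning. */
typedef struct { atomic_flag held; } spinlock;

static inline void spin_lock(spinlock *l) {
    while (atomic_flag_test_and_set(&l->held)) {
        /* busy-wait: the lock was already held, keep retrying */
    }
}

static inline void spin_unlock(spinlock *l) {
    atomic_flag_clear(&l->held);  /* the release is just a store of zero */
}
```

In a real program, multiple threads would call `spin_lock`/`spin_unlock` around the critical section; only one at a time can win the test-and-set.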
So this failure mode was actually pretty common in older non-preemptive operating systems. Back in the day of the original Mac OS, before Mac OS became Mac OS X, so versions nine and earlier, it was a cooperative operating system. If someone went and grabbed a lock and then died, or hit an error case, the whole machine would crash. If you're preemptive, meaning there's a timer going off, it's still possible you won't be able to make forward progress, but at least the OS can interrupt your process and try to do something about it; it can detect that one process is hung. But in a cooperative multithreading environment, if one process just hogs the processor and never gives it back, or grabs the lock and dies, and then another process comes along and tries to acquire the lock, that process just sits there spinning in this little spin section forever. You're basically going to hang the processor; you're not going to see anything else run. So, you've got to be pretty careful, especially in cooperative multithreading environments. One of the tricks here is that in a cooperative multithreaded environment, you typically put something inside the spin loop which yields to a different thread, so someone else can try to do something in that time. Because if you just sit there spinning, it's very possible that you won't be fair, or that no one else will ever get to run and unlock the lock. So you've got to be a little careful there. Okay. So, we're going to move forward here and look at a slightly different version of the same piece of code, but now with compare-and-swap. Before we do that, let's talk about the semantics of compare-and-swap. Compare-and-swap takes a memory location and two registers.
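The yield-inside-the-spin-loop trick can be sketched like this. I use POSIX `sched_yield` as the stand-in for the yield; a cooperative OS would use its own yield primitive, and the function name is mine.

```c
#include <stdatomic.h>
#include <sched.h>

/* Sketch of a yielding spin loop: instead of burning the whole
 * timeslice busy-waiting, give the CPU back on every failed
 * test-and-set, so the lock holder gets a chance to run and release. */
static void spin_lock_yielding(atomic_flag *held) {
    while (atomic_flag_test_and_set(held)) {
        sched_yield();  /* let another thread or process run */
    }
}
```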
And it's going to compare memory with one of the registers. If they're equal, then it takes the other register, puts it in memory, effectively doing the swap, and returns a success status. If the compare fails and we come to this else statement here, we don't do any swap. So this is effectively taking a register and a memory location and swapping them, conditional on another register. So, it's a little bit more powerful than plain swap. It's a little harder to implement, as you might imagine, but probably not as hard as something like fetch-and-add. The reason is that if you want to do this operation out in the memory system, you can send it as a single atomic operation: go to main memory, put a comparator there, do the compare, and swap the two values. If you instead try to have these increments or adds or arbitrary instructions, you need to put an adder out there for sure. This only needs a comparator. Still painful to implement, but it might be a little better than putting a full adder in your memory controller, because that's almost like duplicating your ALU out in your memory subsystem, which doesn't make a whole lot of sense. That's why people used to joke about fetch-and-FFT. Okay. So, let's take a look at the same multi-reader piece of code. But instead, we're not going to have a spin lock at the beginning. We're going to have a critical section, but we can concurrently execute critical sections from different threads here, so we're not going to call it a critical section anymore. We'll say we concurrently execute what was in the other thread's critical section, and then we atomically try to swap out the head pointer and replace it. So, we're effectively doing speculative work here, and this is called non-blocking synchronization.
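The compare-and-swap semantics above map directly onto C11's `atomic_compare_exchange_strong`; this sketch wraps it to match the lecture's description, with illustrative names of my own.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Sketch of compare-and-swap: if *m equals the "expected" register,
 * store the "newval" register into memory and report success;
 * otherwise leave memory untouched and report failure. (The C11
 * primitive also writes the observed value back into its expected
 * argument on failure; we take expected by value, so callers don't
 * see that.) */
static inline bool compare_and_swap(atomic_int *m, int expected, int newval) {
    return atomic_compare_exchange_strong(m, &expected, newval);
}
```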
So, in this version, we do the loads, as we talked about before, and we check to make sure there's at least something available. We pull the head pointer off. This is all speculative work at this point, because we have not yet checked that we can validly pull off the head of the queue. We increment the head pointer into a temporary here that we'll call new head, a register holding the new head. And then we try to compare-and-swap the new head with the old head atomically. It takes the register that we think is the new head and the memory location where the head pointer lives, and tries to swap these two things, conditional on the fact that what we think is the old head still is the old head. And if something bad happened in the meantime, we can try again. So the idea here is that we spin in this effectively bigger loop, speculatively trying to change the head. One question that comes up here is: what if the head was dequeued from, added back onto, and it just so happens that the value matches again? This is called an ABA problem: the head pointer was A, was changed to B, and later changed back to A, all in this short period of time. Yeah, that might be a problem in this piece of code. You need to think about that and carefully reason about it. Usually, you can protect against it in some other way, for example when the pointer wraps around. If you think about a normal array-based queue, when it wraps around, you can use something which is not a compare-and-swap: you put a spin lock, a spin-loop sort of lock, on it. So you only have to worry about this in these circular buffers when something wraps around, because otherwise there's no other way the head pointer can go back to where it was before.
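The speculative dequeue step above can be sketched as a CAS retry loop over the head index of an array-backed queue. The queue layout and names here are illustrative, not from the lecture.

```c
#include <stdatomic.h>

/* Sketch of the non-blocking dequeue step: speculatively read the head
 * index, compute the new head, and try to install it with
 * compare-and-swap. If another thread moved the head first, the CAS
 * fails and we retry the whole read-modify step. Returns the slot index
 * this thread successfully claimed, or -1 if the queue was empty. */
static int claim_head(atomic_int *head, int tail) {
    for (;;) {
        int old_head = atomic_load(head);      /* speculative read */
        if (old_head == tail) return -1;       /* nothing available */
        int new_head = old_head + 1;           /* temporary "new head" */
        if (atomic_compare_exchange_strong(head, &old_head, new_head))
            return old_head;                   /* we own slot old_head */
        /* CAS failed: someone else dequeued first; loop and retry */
    }
}
```

Each concurrent consumer calling `claim_head` is guaranteed a distinct slot, which is exactly the multi-consumer property the blocking version used the mutex for.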
But you do need to worry about the case where you wrapped all the way around the array in a very short period of time. So, there are some race conditions that can happen with these non-blocking synchronization operations. Another non-blocking synchronization primitive, one that I think is actually really cool, goes by lots of different names: load-locked, load-linked, or load-reserve, paired with store-conditional. You take your ISA, your instruction set, and you add two new instructions: a special load, and a store-conditional which checks that no one else wrote to that same address in the intervening time. Let me explain that one more time, because it's a little confusing. The basic idea is a flag register. In hardware, when you do a load-link, you load the value into your register, and on the side, somewhere next to the register file, a bit gets set. It may not be architecturally visible, and probably shouldn't be. You execute your critical section, and when you come back to do the store, for these read-modify-write operations, the store-conditional checks that the flag is still set to one. If it's still one, that means no one else did a store to that address, or a store-conditional to that address, in the intervening time. Actually, let me say that more precisely: no one else did a store-conditional during that time, so no one else did a store update. Other threads may have done plain loads, but no one else could have stored, or store-conditional'd, to that address. Because if someone else did a store-conditional, what happens is that it actually cancels the other processors' reservations, setting their flag bits to zero atomically.
This is a pretty powerful primitive: you can basically do a load, then try to do a store later in the future. If no one else tried to store to that same address in the meantime, or do a store-conditional to it, your store goes through; everybody else's loads in the meantime either get invalidated, or those threads will find out the reservation was invalidated when they go to do their own store-conditional. So, if we take the same example here and implement it with load-link and store-conditional, we look at the head pointer where we normally do the update, and we do a load-link into this register. When we go to do the store-conditional, if someone else had changed this value in the meantime, we get a fail, and we jump back around and have to redo the load-link. And this actually gets rid of the ABA problem, too. Because if you did have that sort of race, someone else would have done a store-conditional to that address, including in the wrap-around cases, and that would invalidate your flag bit, the reservation on that memory address. One of the reasons people like this is that load-link, or load-reserve, and store-conditional can often piggyback on a cache coherence protocol. When you do a load-link, it adds an extra bit: the coherence protocol pulls the line into your cache, and you tag that cache line as special. That's effectively the flag bit, because this whole line was load-linked. Then, if no one else pushes it out of your cache or fetches it away from your cache in the meantime, which would mean no one else stored to it, by the time you go to do the store, you know that no one did anything.
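C has no direct load-link/store-conditional instruction, but C11's `atomic_compare_exchange_weak` is allowed to fail spuriously precisely so that compilers can implement it as a single LL/SC pair on architectures like ARM, POWER, and RISC-V. Under that assumption, the head-pointer update from the slide can be sketched like this (names mine):

```c
#include <stdatomic.h>

/* Sketch of the LL/SC head-pointer update: load the head (playing the
 * role of the load-link), compute the new head, then attempt the
 * conditional store; if the store-conditional fails, jump back and redo
 * the load-link. Returns the old head value we successfully advanced
 * from. */
static int ll_sc_increment_head(atomic_int *head) {
    int old_head, new_head;
    do {
        old_head = atomic_load(head);   /* the "load-link" */
        new_head = old_head + 1;
    } while (!atomic_compare_exchange_weak(head, &old_head, new_head));
    return old_head;
}
```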
And what's really nice about this is that in the common case, with an uncontended lock, there's no communication that has to happen between the processors. The spin locks, by contrast, are generating memory traffic all the time: they're doing loads, doing stores, trying to grab the lock, trying to grab the lock, over and over, generating all this traffic all over the place. This, instead, is very quiet. You just pull the line into your local cache, and if by the time you get back to it, it's still in your cache and still has its bit set, great, you get to do the store. In the contended case, you may actually get traffic ping-ponging between caches, but you can still guarantee correctness. So, this brings us to performance. As I said, some of these blocking atomic operations can generate a lot of memory traffic. Threads spinning on memory will at least be doing lots of loads, and those loads may be going out to main memory. Non-blocking atomic operations, like these compare-and-swap operations or load-link/store-conditional, and especially load-link/store-conditional, which as I was alluding to you can often do against your own cache, can sometimes have better performance, because you don't have to sit there spinning in tight loops; you can try to do useful work in the meantime. The downside, at least of the compare-and-swap implementation, is that you have to do speculative work in the meantime. You're sitting there redoing some work, and if you didn't win the race, that performance can be bad. There are also, as I alluded to at the beginning of class, algorithms that implement locking with just ordinary loads and stores. Their performance is not very good, and we're going to look at an example of that, called Dekker's Algorithm, today.
Finally, the performance of these isn't just dependent on the operations themselves. These hardware primitives have better or worse performance depending on how heavily they are contended. What I mean by contended is that you have multiple threads trying to acquire the lock at the same time. If you have lots of threads fighting for the lock, that's high contention. If you have a lock which almost never sees contention, you just go to get the lock and you get it, and contention is a really rare case; different locking primitives have better or worse protocols for each of these regimes. And, of course, we're going to talk in a little bit about how to make these integrate well with caches, out-of-order processors, and out-of-order memory systems.