And Dijkstra came up with this naming nomenclature. So, this idea of implementing semaphores is all well and good, but we still need some way to get mutual exclusion. Believe it or not, you can actually do this with just loads and stores, and we're going to talk about that a little bit later in class. But it's really pretty slow. There are algorithms that can do this with just loads and stores, but from an implementation perspective, people typically try to go faster. So, if you want to go faster, it might be helpful for your computer architecture, your instruction set architecture, to provide a special instruction which allows you to do something atomically. When we say do something atomically, we mean nothing else can be interleaved in the meantime. And the simple solution here is that you add an atomic operation which can do a read, a modify, and a write, all atomically, with nothing else interleaved. So, going back to our locks and semaphores here: inside of this P and inside of this V, you need to somehow atomically modify S. You need to be able to read S, and you need to be able to write S, inside the implementation of P and inside the implementation of V. So, the primitive here that makes this faster is having some read-modify-write operation, and we're going to look at a couple of different choices of this. One of the most basic ones is test-and-set. x86 has this instruction; it's one of the most basic atomic operations on x86, and lots of other architectures have an instruction which does this. We write it here as a piece of pseudocode, but in reality, this is an instruction which will actually execute atomically relative to the whole rest of the system. So, let's look at what test-and-set does.
The basic idea of test-and-set is that it's probably one of the most primitive atomic operations you can do. I've seen something slightly simpler than this; there's a test-and-clear which is a little bit easier to implement. But largely the idea is that you're going to look at a memory address, and if that memory address holds what you expect it to, then you write some value. And implicit in this piece of code here is that the code which executed the test-and-set needs to know whether it succeeded and was able to write the value or not. So there's a sort of status code, and that's returned in this register R here. So, if we look at this piece of code, you pass into it a memory reference, a memory address, which is the address of the semaphore or the address of the lock, and you pass into it a register name. We do a load into that register, and that happens unconditionally. This part here, though, is the test part. The test part is that we check whether the thing we just read from this memory address is zero, and if so, we write a one to the memory address. Sometimes these test-and-set operations will actually allow you to write something else besides just one, but for right now, this is the more basic one: it writes a one to that memory address. And what's important here is that the architecture guarantees all of this happens atomically. But what's interesting is, if you think about the processors that we built, this requires a load and a store. We will talk about how to implement this a little bit in a few slides, but you basically have to stop everything else, stop all other loads and stores happening in the system, do this load, do the test, and possibly do the store, in hardware.
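As a sketch of the semantics just described, here is the test-and-set pseudocode written as a small C function. The names are mine, not from the lecture; I use C11's `atomic_exchange` to emulate the atomic read-modify-write, which is observationally equivalent when the only value ever stored is one.

```c
#include <stdatomic.h>

/* Sketch of test-and-set: atomically read *m into R; if the value read
 * was 0, write 1 to *m. The returned R acts as the status code: 0 means
 * the write happened (the location was free), nonzero means it did not.
 * Unconditionally exchanging in a 1 gives the same observable behavior,
 * since writing 1 over an existing 1 changes nothing. */
static inline int test_and_set(atomic_int *m) {
    /* atomically: R = *m; *m = 1; return R */
    return atomic_exchange(m, 1);
}
```

Used on a fresh location, the first call returns zero (success) and every later call returns one until someone stores a zero back.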
One of the more basic ways this is done, and one of the original ways it was done on some of the older bus-based architectures, is something called the lock bit on the bus. So what they did is, when one processor was going to execute an atomic operation like this, a test-and-set for instance, it would actually broadcast to all the other processors in the system saying: I'm about to do this, do not touch anything, do not issue any bus-based memory instructions. It would do the load, it would do the test, it would do the set, and then it would release the broadcast bit. This is a wire on a multi-drop bus. That's a pretty big hammer, and things have gotten a little better since then, but it gives you an idea of a basic hardware implementation of this. Things got a little more sophisticated, and these fancier atomic operations showed up in the 70s. So, this whole box here is still an atomic operation. All the code in here happens atomically, and what it basically does is take a memory address and add something to it. We add Rv to R and put that into memory at the same memory address, so we do the load and the store here atomically. So, it's a read-modify-write operation also, all atomic. Note, this still returns the original value into register R here, and sometimes that's actually very useful: if you're trying to implement a semaphore based on this, you kind of want to see what was there before. There are some jokes around fetch-and-add, though. The atomic-operation community kind of pushed this to the extreme at some point, and people started having fetch-and-something-else: fetch-and-multiply, fetch-and-divide, and things like that, and people started building some stranger machines that had fetch-and-whatever.
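The fetch-and-add primitive just described can be sketched directly with C11 atomics; the function name here is my own, but the semantics match the lecture's box: add to memory atomically and return the original value.

```c
#include <stdatomic.h>

/* Sketch of fetch-and-add: atomically read the old value at *m, add rv
 * to it, store the sum back, and return the ORIGINAL value, which is
 * what you want to see if you're building a semaphore on top of it. */
static inline int fetch_and_add(atomic_int *m, int rv) {
    return atomic_fetch_add(m, rv);  /* returns the value before the add */
}
```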
There's this old joke that someone was going to implement an instruction called fetch-and-FFT, so it would be an atomic Fourier transform. I don't think it ever got to that point. But in the early microcoded microprocessor days, there were some pretty sophisticated atomic operations. That's largely been decided not to be a good direction to go. You want a little bit of complexity in your atomic operations, so maybe test-and-set is a little too simple, but some of these fancier fetch-and-FFTs are probably a little bit too complex. Here's another interesting operation that is atomic, and this one you can actually build some pretty interesting semaphores out of. It's called swap. It takes a register value and a memory location, and it swaps what's in the memory location with what's in the register. And note, we use a temporary here, because you can't actually swap two things without a temporary. This one is actually not very common, because it's kind of hard to use. What's a little bit more common is something called compare-and-swap, which sits somewhere between a test-and-set and a swap operation. That's what's commonly used, and we'll talk about it in a second. Okay, so now we go back to the multiple-consumer problem that we had, where multiple threads were trying to dequeue from a queue. Let's introduce test-and-set, and we'll look at our critical section here. We're going to use this term mutex to mean a semaphore that only one thread can enter at a time. What happens at the beginning here is there's a tight loop, and this tight loop does a test-and-set on the memory location of the mutex into a temporary register and compares the result to zero. Remember, we defined test-and-set as writing a one to memory when the location held a zero; the value it returns tells you whether you got the lock.
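The swap operation described above can be sketched like this; the temporary the lecture mentions is exactly what `atomic_exchange` hides for you in C11. The function name is illustrative.

```c
#include <stdatomic.h>

/* Sketch of atomic swap: exchange the contents of a register (modeled
 * here by the variable *r) with a memory location, atomically. In
 * hardware a temporary holds the old memory value during the exchange;
 * atomic_exchange performs that for us in one atomic step. */
static inline void swap(atomic_int *m, int *r) {
    *r = atomic_exchange(m, *r);  /* temp = *m; *m = *r; *r = temp */
}
```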
So if the test failed, the code just sits and spins here. We call this busy waiting, or a traditional spinlock. Let's get the polarity straight: if the memory location was a one coming into the test-and-set, that means someone else already locked it, so we read a one and keep spinning. If it was a zero coming in, that means no one had locked it; we read a zero and atomically set it to one, so we now hold the lock. So if we acquired the lock, we drop into our critical section and start executing it. Inside the critical section, we can do stores, we can do loads, we can do whatever we want, because we're guaranteed that no other processor, no other thread, is touching the shared state. And then at some point, we need to do the release. For that we actually don't need a test-and-set; we just need a store. A store of zero clears that memory location, and some other thread is then safe to fall into the critical section. Okay. This can be done with lots of different implementations. We showed test-and-set here, but it can be done with swap, and it can be done with fetch-and-add; the code just gets a little more complex. But one of the interesting questions that comes up is: what happens if, as you're running along, some thread acquires the lock, falls into the critical section, and, let's say, terminates, or gets swapped out for a very long period of time? What's going to happen here? Well, not quite a system crash; we don't reference bad memory here. This is more of a hang. The system is going to freeze.
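Putting the acquire loop and the store-based release together, here is a minimal sketch of the spinlock just described. I use C11's `atomic_flag`, whose `atomic_flag_test_and_set` is precisely this primitive; the type and function names are mine.

```c
#include <stdatomic.h>

/* Sketch of a spinlock built from test-and-set. atomic_flag's
 * test-and-set returns the OLD value: false (0) means the flag was
 * clear, so we acquired the lock; true (1) means someone else holds it
 * and we keep spinning. */
typedef struct { atomic_flag held; } spinlock;

static inline void spin_lock(spinlock *l) {
    while (atomic_flag_test_and_set(&l->held)) {
        /* busy-wait: the lock was already held, keep retrying */
    }
}

static inline void spin_unlock(spinlock *l) {
    atomic_flag_clear(&l->held);  /* the release is just a store of zero */
}
```

In a real program, multiple threads would call `spin_lock`/`spin_unlock` around the critical section; only one at a time can win the test-and-set.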
So this failure mode was actually pretty common in older non-preemptive operating systems. Back in the day of the original Mac OS, before Mac OS became Mac OS X, so versions nine and earlier, it was a cooperative operating system. If someone went and grabbed a lock and then died, or hit an error case, the whole machine would crash. If you're preemptive, meaning there's a timer going off, it's still possible you won't be able to make forward progress, but at least the OS can interrupt your process and try to do something about it; it can detect that one process is hung. But in a cooperative multithreading environment, if one process just hogs the processor and never gives it back, or grabs the lock and dies, and then another process comes along and tries to acquire the lock, that process just sits there spinning in this little spin section forever. You're basically going to hang the processor; you're not going to see anything else run. So, you've got to be pretty careful, especially in cooperative multithreading environments. One of the tricks here is that in a cooperative multithreaded environment, you typically put something inside the spin loop which yields to a different thread, so someone else can try to do something in that time. Because if you just sit there spinning, it's very possible that you won't be fair, or that no one else will ever get to run and unlock the lock. So you've got to be a little careful there. Okay. So, we're going to move forward here and look at a slightly different version of the same piece of code, but now with compare-and-swap. Before we do that, let's talk about the semantics of compare-and-swap. Compare-and-swap takes a memory location and two registers.
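The yield-inside-the-spin-loop trick can be sketched like this. I use POSIX `sched_yield` as the stand-in for the yield; a cooperative OS would use its own yield primitive, and the function name is mine.

```c
#include <stdatomic.h>
#include <sched.h>

/* Sketch of a yielding spin loop: instead of burning the whole
 * timeslice busy-waiting, give the CPU back on every failed
 * test-and-set, so the lock holder gets a chance to run and release. */
static void spin_lock_yielding(atomic_flag *held) {
    while (atomic_flag_test_and_set(held)) {
        sched_yield();  /* let another thread or process run */
    }
}
```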
And it's going to compare memory with one of the registers. If they're equal, then it takes the other register, puts it in memory, effectively doing the swap, and returns a success status. If the compare fails and we come to this else statement here, we don't do any swap. So this is effectively taking a register and a memory location and swapping them, conditional on another register. So, it's a little bit more powerful than plain swap. It's a little harder to implement, as you might imagine, but probably not as hard as something like fetch-and-add. The reason is that if you want to do this operation out in the memory system, you can send it as a single atomic operation: go to main memory, put a comparator there, do the compare, and swap the two values. If you instead try to have these increments or adds or arbitrary instructions, you need to put an adder out there for sure. This only needs a comparator. Still painful to implement, but it might be a little better than putting a full adder in your memory controller, because that's almost like duplicating your ALU out in your memory subsystem, which doesn't make a whole lot of sense. That's why people used to joke about fetch-and-FFT. Okay. So, let's take a look at the same multi-reader piece of code. But instead, we're not going to have a spin lock at the beginning. We're going to have a critical section, but we can concurrently execute critical sections from different threads here, so we're not going to call it a critical section anymore. We'll say we concurrently execute what was in the other thread's critical section, and then we atomically try to swap out the head pointer and replace it. So, we're effectively doing speculative work here, and this is called non-blocking synchronization.
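The compare-and-swap semantics above map directly onto C11's `atomic_compare_exchange_strong`; this sketch wraps it to match the lecture's description, with illustrative names of my own.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Sketch of compare-and-swap: if *m equals the "expected" register,
 * store the "newval" register into memory and report success;
 * otherwise leave memory untouched and report failure. (The C11
 * primitive also writes the observed value back into its expected
 * argument on failure; we take expected by value, so callers don't
 * see that.) */
static inline bool compare_and_swap(atomic_int *m, int expected, int newval) {
    return atomic_compare_exchange_strong(m, &expected, newval);
}
```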
So, in this version, we do the loads, as we talked about before, and we check to make sure there's at least something available. We pull the head pointer off. This is all speculative work at this point, because we have not yet checked that we can validly pull off the head of the queue. We increment the head pointer into a temporary here that we'll call new head, a register holding the new head. And then we try to compare-and-swap the new head with the old head atomically. It takes the register that we think is the new head and the memory location where the head pointer lives, and tries to swap these two things, conditional on the fact that what we think is the old head still is the old head. And if something bad happened in the meantime, we can try again. So the idea here is that we spin in this effectively bigger loop, speculatively trying to change the head. One question that comes up here is: what if the head was dequeued from, added back onto, and it just so happens that the value matches again? This is called an ABA problem: the head pointer was A, was changed to B, and later changed back to A, all in this short period of time. Yeah, that might be a problem in this piece of code. You need to think about that and carefully reason about it. Usually, you can protect against it in some other way, for example when the pointer wraps around. If you think about a normal array-based queue, when it wraps around, you can use something which is not a compare-and-swap: you put a spin lock, a spin-loop sort of lock, on it. So you only have to worry about this in these circular buffers when something wraps around, because otherwise there's no other way the head pointer can go back to where it was before.
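The speculative dequeue step above can be sketched as a CAS retry loop over the head index of an array-backed queue. The queue layout and names here are illustrative, not from the lecture.

```c
#include <stdatomic.h>

/* Sketch of the non-blocking dequeue step: speculatively read the head
 * index, compute the new head, and try to install it with
 * compare-and-swap. If another thread moved the head first, the CAS
 * fails and we retry the whole read-modify step. Returns the slot index
 * this thread successfully claimed, or -1 if the queue was empty. */
static int claim_head(atomic_int *head, int tail) {
    for (;;) {
        int old_head = atomic_load(head);      /* speculative read */
        if (old_head == tail) return -1;       /* nothing available */
        int new_head = old_head + 1;           /* temporary "new head" */
        if (atomic_compare_exchange_strong(head, &old_head, new_head))
            return old_head;                   /* we own slot old_head */
        /* CAS failed: someone else dequeued first; loop and retry */
    }
}
```

Each concurrent consumer calling `claim_head` is guaranteed a distinct slot, which is exactly the multi-consumer property the blocking version used the mutex for.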
But you do need to worry about the case where you wrapped all the way around the array in a very short period of time. So, there are some race conditions that can happen with these non-blocking synchronization operations. Another non-blocking synchronization primitive, one that I think is actually really cool, goes by lots of different names: load-locked, load-linked, or load-reserve, paired with store-conditional. You take your ISA, your instruction set, and you add two new instructions: a special load, and a store-conditional which checks that no one else wrote to that same address in the intervening time. Let me explain that one more time, because it's a little confusing. The basic idea is a flag register. In hardware, when you do a load-link, you load the value into your register, and on the side, somewhere next to the register file, a bit gets set. It may not be architecturally visible, and probably shouldn't be. You execute your critical section, and when you come back to do the store, for these read-modify-write operations, the store-conditional checks that the flag is still set to one. If it's still one, that means no one else did a store to that address, or a store-conditional to that address, in the intervening time. Actually, let me say that more precisely: no one else did a store-conditional during that time, so no one else did a store update. Other threads may have done plain loads, but no one else could have stored, or store-conditional'd, to that address. Because if someone else did a store-conditional, what happens is that it actually cancels the other processors' reservations, setting their flag bits to zero atomically.
This is a pretty powerful primitive: you can basically do a load, then try to do a store later in the future. If no one else tried to store to that same address in the meantime, or do a store-conditional to it, your store goes through; everybody else's loads in the meantime either get invalidated, or those threads will find out the reservation was invalidated when they go to do their own store-conditional. So, if we take the same example here and implement it with load-link and store-conditional, we look at the head pointer where we normally do the update, and we do a load-link into this register. When we go to do the store-conditional, if someone else had changed this value in the meantime, we get a fail, and we jump back around and have to redo the load-link. And this actually gets rid of the ABA problem, too. Because if you did have that sort of race, someone else would have done a store-conditional to that address, including in the wrap-around cases, and that would invalidate your flag bit, the reservation on that memory address. One of the reasons people like this is that load-link, or load-reserve, and store-conditional can often piggyback on a cache coherence protocol. When you do a load-link, it adds an extra bit: the coherence protocol pulls the line into your cache, and you tag that cache line as special. That's effectively the flag bit, because this whole line was load-linked. Then, if no one else pushes it out of your cache or fetches it away from your cache in the meantime, which would mean no one else stored to it, by the time you go to do the store, you know that no one did anything.
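C has no direct load-link/store-conditional instruction, but C11's `atomic_compare_exchange_weak` is allowed to fail spuriously precisely so that compilers can implement it as a single LL/SC pair on architectures like ARM, POWER, and RISC-V. Under that assumption, the head-pointer update from the slide can be sketched like this (names mine):

```c
#include <stdatomic.h>

/* Sketch of the LL/SC head-pointer update: load the head (playing the
 * role of the load-link), compute the new head, then attempt the
 * conditional store; if the store-conditional fails, jump back and redo
 * the load-link. Returns the old head value we successfully advanced
 * from. */
static int ll_sc_increment_head(atomic_int *head) {
    int old_head, new_head;
    do {
        old_head = atomic_load(head);   /* the "load-link" */
        new_head = old_head + 1;
    } while (!atomic_compare_exchange_weak(head, &old_head, new_head));
    return old_head;
}
```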
And what's really nice about this is that in the common case, with an uncontended lock, there's no communication that has to happen between the processors. The spin locks, by contrast, are generating memory traffic all the time: they're doing loads, doing stores, trying to grab the lock, trying to grab the lock, over and over, generating all this traffic all over the place. This, instead, is very quiet. You just pull the line into your local cache, and if by the time you get back to it, it's still in your cache and still has its bit set, great, you get to do the store. In the contended case, you may actually get traffic ping-ponging between caches, but you can still guarantee correctness. So, this brings us to performance. As I said, some of these blocking atomic operations can generate a lot of memory traffic. Threads spinning on memory will at least be doing lots of loads, and those loads may be going out to main memory. Non-blocking atomic operations, like these compare-and-swap operations or load-link/store-conditional, and especially load-link/store-conditional, which as I was alluding to you can often do against your own cache, can sometimes have better performance, because you don't have to sit there spinning in tight loops; you can try to do useful work in the meantime. The downside, at least of the compare-and-swap implementation, is that you have to do speculative work in the meantime. You're sitting there redoing some work, and if you didn't win the race, that performance can be bad. There are also, as I alluded to at the beginning of class, algorithms that implement locking with just ordinary loads and stores. Their performance is not very good, and we're going to look at an example of that, called Dekker's Algorithm, today.
Finally, the performance of these isn't just dependent on the operations themselves. These hardware primitives have better or worse performance depending on how heavily they are contended. What I mean by contended is that you have multiple threads trying to acquire the lock at the same time. If you have lots of threads fighting for the lock, that's high contention. If you have a lock which almost never sees contention, you just go to get the lock and you get it, and contention is a really rare case; different locking primitives have better or worse protocols for each of these regimes. And, of course, we're going to talk in a little bit about how to make these integrate well with caches, out-of-order processors, and out-of-order memory systems.