Okay, so now we get to talk about sequential consistency and how to actually go about implementing these things. Going back to our valid interleavings, we start to ask ourselves: how do we build a sequentially consistent machine with out-of-order processors, or out-of-order execution of memory operations? Well, we already said in our out-of-order system that we can reorder loads to different addresses. We can reorder loads to the same address; it doesn't really matter within one thread. But if you go do that, all of a sudden it breaks sequential consistency. We said we could reorder a store and a later load in our out-of-order processor if they were to different addresses. We said we could reorder a load and a later store the other way around. We said we could reorder stores with stores. So out-of-order memory systems and out-of-order processors effectively break sequential consistency. So how do you implement a processor which is useful and is out of order? And we do want to go out of order for performance reasons.

Also, caches start to become a big problem here. If you want true sequential consistency, you're going to have to somehow figure out how to not store data in your cache. Because the second you store data in your cache, let's say it's a write-back cache, if you do a store to your local cache and write a one to it, and some other processor tries to read that location, it's not going to be able to see that data. So we need to be very careful about when data is actually visible to other processors once you have caches involved in a multiprocessor system.

So we are going to move towards somewhat weaker, or relaxed, memory models. As I said last class, almost no processor actually implements a truly sequentially consistent memory system. They're all weaker than that. Now the question is how weak you want to go, and this comes up because it's a trade-off between what the programmer can wrap their head around versus, effectively, performance in an out-of-order system. So what we can do is think about going to weaker memory systems, and when we do that, we are going to make the programmer decide when a load and a store can be reordered, or when a store and a store can be reordered, or when a load and a load can be reordered.

So let's look at these four different cases independently: reordering a load with a load, reordering a load with a store after it, reordering a store with a load after it, and reordering two stores. And note we're not talking about the same memory address here at all. We are talking about two different addresses, but sequential consistency says that even for different addresses we need to worry about this. So how many of these extra dependencies can we remove? Here we're going to introduce the notion of a memory fence, or what is sometimes called a memory barrier operation, which is a serialization point. It says that all of the previous instructions before... well, I will be careful what I say there, actually. A memory fence is a serialization operation which says that all of the memory accesses, or at least the named memory accesses, before a certain point have completed before the memory fence completes, and that none of the operations after it have started or completed before the memory fence completes.
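To make that picture concrete, here is a minimal sketch in C11 atomics of a full fence in action. This is my own illustration, not code from the lecture; the variable names (`data`, `flag`) and the flag-passing pattern are assumptions, and real code would more likely use acquire/release operations directly.

```c
#include <stdatomic.h>

int data;              /* ordinary (non-atomic) payload                  */
atomic_int flag = 0;   /* signal the other thread watches                */

void writer(void)
{
    data = 42;                                              /* earlier store    */
    /* Full fence: the store to 'data' above must complete before anything
       after this point, including the store to 'flag', is allowed to
       become visible.                                                     */
    atomic_thread_fence(memory_order_seq_cst);
    atomic_store_explicit(&flag, 1, memory_order_relaxed);  /* later store      */
}

void reader(void)
{
    while (atomic_load_explicit(&flag, memory_order_relaxed) == 0)
        ;                                                   /* wait for writer  */
    /* Full fence again: the load of 'flag' above must complete before the
       load of 'data' below starts, so the reader cannot see the flag set
       but miss the data.                                                  */
    atomic_thread_fence(memory_order_seq_cst);
    int x = data;                                           /* guaranteed 42    */
    (void)x;
}
```

Without the two fences (and with a weak enough memory model), the reader could observe `flag == 1` and still read a stale value of `data`; the fences are exactly the serialization points described above.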
So this really is a barrier on your memory operations. The reason I'm being a little hesitant here is that the basic memory fence says that all memory operations before it have completed, and that none after it have started, when the fence completes. But people have introduced even weaker versions of memory fences over time, of course, for performance. You start to see things like load-only memory fences, which act as a barrier against all of the previous loads. Or you have directional memory fences, which say: only look at the stores that happen after this point and don't allow any of them to happen before this point. So people have actually introduced instructions in modern architectures that do all of these things. And if you actually want to go look at a really weak memory model, go read about the Itanium memory model; it's extremely weak, and they need to have fences all over the place. The trade-off here, the performance trade-off, is how many fences you need to put in versus how much performance you get back from reordering memory operations.

So let's name a few of these things, because everyone likes naming, and these are computer architecture research names that are common across the industry at this point. One of them is called total store ordering. Total store ordering is not a big step away from sequential consistency. In total store ordering, loads do not get reordered with other loads, loads do not get reordered with later stores, and stores don't get reordered with other stores. But a store followed by a load can be reordered; so a load can be moved above a store here. And if you want to enforce the order between that store and load, you need to write it explicitly: you do the store word, then a fence, then the load word. That guarantees the load is going to happen after that store in the processor.

Partial store ordering is a little weaker than that. In partial store ordering, loads with other loads don't get reordered, and loads followed by stores don't need fences, but stores followed by loads, which is up here, and stores followed by stores both need fences to guarantee ordering. And then finally we get this class of weak ordering, and frankly, most processors these days fall into some category of weak ordering; there are different questions about how weak it is, but we're not going into that level of detail here. In weak ordering you actually need a fence to enforce all of these orderings, so it's interesting to think about.

The nice thing about fences is that while these operations are expensive and are big serialization points, you only need to pay for them when you care about the ordering. When you don't care about the ordering, the computer architecture can reorder things and make everything go faster. It's only exactly when you care about it that you need to introduce the fences.
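As one concrete case of paying only when you care, here is a sketch of the classic two-flag handshake (a simplified Dekker-style check, not an example from the lecture) where the store-followed-by-load ordering that total store ordering relaxes is exactly the one you have to buy back with a fence. The names and the C11 framing are mine; the rest of the mutual-exclusion protocol is omitted.

```c
#include <stdatomic.h>
#include <stdbool.h>

atomic_int want0 = 0, want1 = 0;   /* "I want to enter" flags, one per thread */

/* Thread 0's entry check; thread 1 runs the mirror image with the
   flags swapped.                                                      */
bool may_enter_thread0(void)
{
    atomic_store_explicit(&want0, 1, memory_order_relaxed);  /* store word */

    /* Under TSO the store above can sit in the store buffer while the
       load below executes early; then both threads can read 0 and both
       enter.  This fence forbids exactly that store->load reordering.  */
    atomic_thread_fence(memory_order_seq_cst);                /* fence      */

    return atomic_load_explicit(&want1, memory_order_relaxed) == 0;  /* load word */
}
```

With the fences in place, at most one of the two threads can see the other's flag as zero; remove them on a TSO (or weaker) machine and both can slip through.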
One thing I was going to say is that some compilers will do some of this work for you and introduce the fences for you. Usually one of the ways that's done is when the compiler sees an atomic operation. So if a compiler sees something like a lock operation, or a P or a V, in your code, that's sort of a special way to signal back to the compiler that it needs to make sure the memory operations have all been serialized at that point. Usually, also, atomic operations like test-and-set are fences in their own right, which helps, but sometimes compilers can even figure out where other fences are needed for other memory addresses.

One of the things I want to point out here is that a fence doesn't actually take an argument. It's not going to take an address to fence on, and the reason is that what we really want is to make sure all the previous memory operations have completed, because in the question of consistency we haven't said anything about addresses. We just said that previous memory operations in one thread don't move after a subsequent memory operation.

Okay, so let's look at an example of where we might need to introduce different levels of fences. Here we have our producer-consumer model. One of the things we were worried about in our original example was these two stores getting reordered. Well, we can enforce that order by putting a memory fence operation in here and saying it's a store-store fence, so no stores can pass any other stores, and this will guarantee that these happen in order. But we don't really care in this case... well, this one actually already has an arc, so that's not going to get reordered, because this register is dependent there.

But let's go over here and take a look at the consumer side. If we had full sequential consistency, we would not be able to reorder these two loads. For performance, maybe we do want to reorder those loads, because one of the input registers isn't available yet, say it's the result of a long multiply, or something like that. So you might want to reorder them in your out-of-order processor for performance reasons. In true sequential consistency you wouldn't be able to do that. But with this weaker memory model we can reorder them and then guarantee correctness by introducing a fence operation here. And this fence operation is going to say: these loads need to complete before this load starts. So you can guarantee that when the comparison of your head and your tail says there's data in the queue, you're safe to go execute the rest of this code here. Any questions about that so far, about the fences and why it's kind of pay as you go, if you will?
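Here is a rough C11 sketch of that producer-consumer queue with the two fences in the places just described. The layout, the names, and the single-producer/single-consumer assumption are mine, so treat it as an illustration of where the fences go rather than the lecture's exact code; the store-store ordering is expressed with a release fence and the load-load ordering with an acquire fence.

```c
#include <stdatomic.h>
#include <stddef.h>

#define QSIZE 64

int buffer[QSIZE];                  /* payload slots                              */
atomic_size_t head = 0, tail = 0;   /* consumer advances head, producer tail      */

/* Producer: write the item, then publish it by bumping tail.
   Returns 0 if the queue is full.                                      */
int produce(int item)
{
    size_t t = atomic_load_explicit(&tail, memory_order_relaxed);
    size_t h = atomic_load_explicit(&head, memory_order_relaxed);
    if (t - h == QSIZE)
        return 0;                                   /* queue full                  */

    buffer[t % QSIZE] = item;                       /* store the data              */

    /* store->store fence: the data store above must be visible before the
       tail store below, so the consumer never sees the new tail without
       the new data.                                                       */
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&tail, t + 1, memory_order_relaxed);
    return 1;
}

/* Consumer: returns 1 and fills *out if an item was available. */
int consume(int *out)
{
    size_t h = atomic_load_explicit(&head, memory_order_relaxed);
    size_t t = atomic_load_explicit(&tail, memory_order_relaxed);
    if (h == t)
        return 0;                                   /* queue empty                 */

    /* load->load fence: the tail load above must complete before the data
       load below starts, so we never read a slot ahead of the producer's
       published data.                                                     */
    atomic_thread_fence(memory_order_acquire);
    *out = buffer[h % QSIZE];                       /* load the data               */

    atomic_store_explicit(&head, h + 1, memory_order_relaxed);
    return 1;
}
```

The point to notice is that both fences sit exactly at the two orderings we said a weak model gives up: the producer's store-store edge and the consumer's load-load edge. Everything else in the two routines is free to be reordered by the hardware.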