So, today we're going to continue our adventure in computer architecture and talk more about parallel computer architecture. Last time we talked about memory coherence and cache coherence systems, and how to differentiate those from memory consistency models: a consistency model is a model of how memory is supposed to work, versus the underlying coherence algorithms that try to keep memory consistent and implement that model. We left off last time talking about MESI, also known as the Illinois protocol, and we walked through all of the different arcs through here. If you recall, what we did was split the shared state from the MSI protocol into two states, shared and exclusive. The insight here is that it's very common for programs to read a memory address, which pulls it into your cache, and then go modify that same address. For instance, if you want to increment a number, you do a load, which brings it into your register set but also into your cache; you increment the number; and then you do a write back to the exact same location. That's pretty common in imperative programming languages. Declarative languages like Scheme may at times copy everything, but in imperative languages it's pretty common to change state in place. Because of that, you can bring the line right into this exclusive state, and then when you go to modify it, you don't have to talk to anybody; you effectively lose this intent-to-write message that you would otherwise have to send across the bus while waiting for that address to be snooped, or seen, by all the other entities on the bus. (There's a small sketch of this below.)

Note that I say entities on the bus. We've been talking primarily about processors, but there can be other entities that want to snoop the bus. One example is coherent IO devices. This isn't very popular right now, but I think it will become much more popular once we have GPUs, or Graphics Processing Units, or general-purpose GPUs, sitting effectively very close to the processor on the same bus and wanting to take part in the processor's coherence traffic. A device like that wants to read and write the same memory addresses the processor is reading and writing, and take part in the cache coherence protocol. At a minimum, your IO devices usually need to tell the processor when they're doing a memory transaction the processor should know about. So typically, when you move data from an IO device to main memory, that transfer has to go across the bus, and everyone has to invalidate their caches; they all have to snoop that memory traffic coming from the IO device.

So we had talked about MESI as an enhancement to MSI. We left off last time about to talk about two more enhancements that are pretty common. The first has been used widely in AMD Opterons, and my understanding is that AMD still uses something similar today. The idea is that you add an extra state, called ownership, or the owned state. Effectively, the protocol looks just like our MESI protocol from before.
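Before we dig into the owned state, here is a minimal sketch of the exclusive-state write upgrade mentioned above, in C. The state names, the `bus_send_invalidate` helper, and the `write_hit` function are made up purely for illustration; a real controller is hardware, not a switch statement.

```c
#include <stdio.h>

/* Hypothetical MESI line states -- names are illustrative only. */
typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } line_state_t;

/* Stand-in for putting an "intent to write" (invalidate) on the bus. */
static void bus_send_invalidate(unsigned long addr) {
    printf("bus: invalidate 0x%lx\n", addr);
}

/* Write hit: what the local cache controller does to the line's state. */
static line_state_t write_hit(line_state_t state, unsigned long addr) {
    switch (state) {
    case MODIFIED:
        return MODIFIED;            /* already writable, nothing to say on the bus */
    case EXCLUSIVE:
        return MODIFIED;            /* silent upgrade: no one else has a copy */
    case SHARED:
        bus_send_invalidate(addr);  /* others may hold read-only copies: invalidate them first */
        return MODIFIED;
    default:
        return state;               /* INVALID would be a write miss, handled elsewhere */
    }
}

int main(void) {
    /* Read-then-increment pattern: the load brought the line in EXCLUSIVE,
       so the store that follows upgrades to MODIFIED with no bus traffic. */
    line_state_t s = write_hit(EXCLUSIVE, 0x1000);
    printf("state after write: %d (MODIFIED)\n", s);
    return 0;
}
```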
Now, with the owned state, when data is sitting in the modified state and, let's say, another processor needs to go access that data, instead of having to write all that data back to main memory, invalidate the line out to main memory, and go fetch it back from main memory, you can do a direct cache-to-cache transfer. This is basically an optimization: you don't have to write the data back to main memory, and in fact you allow main memory to be stale. You just transfer the data across the bus from the one cache to the cache that needs it. So in this example, we're going to look at this edge here. Another processor wants to read the data, so we snoop its intent to read a particular cache line that our processor currently has in the modified state. We're going to provide the data out of our cache, not write it back to main memory, and transition the line in our cache to this owned state. The other processors can now take it in the shared state, so they'll have a read-only copy. Note this is only for reads; we'll talk in a second about what happens if another processor wants to write the data.

So we have the line in the owned state, and what's happening is that this processor is tracking that the data needs to be written back to main memory at some point. That's the whole purpose of this state: we've designated one processor which owns the data and owns responsibility for the modified state. The processors that took read-only copies hold the line in the shared state, and if they need to invalidate it, they don't need to contact anybody; they only have a read-only copy, so they don't need to make any bus transactions. Think about what would go wrong otherwise: if one core reads this dirty state from another core, and at some point it just invalidates it, and the data is not up to date in main memory, you lose the changes. By keeping the line in the owned state, one processor keeps track that, if ownership never gets passed along, it eventually needs to write the line out to main memory to keep memory up to date.

Now, there are a couple of other arcs here. You can transition from the owned state back to the modified state if the processor which has it in the owned state wants to do a write. It can't do that directly while it's in the owned state, because other processors may have shared copies of the line. So if P1 wants to do a write here, it needs to invalidate everyone else's copies across the bus: it sends an intent to write for that line, everyone else snoops that traffic and transitions to the invalid state, and then this processor transitions to the modified state and is able to actually modify the data. We also have this arc here, which we sort of already talked about: while you're in the owned state, anyone else can get a read-only shared copy. They can't get an exclusive copy, because that would violate the whole scheme; they would then be able to upgrade to modified without telling anybody, and we don't want that.
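Here is a similar sketch of the two owned-state arcs just walked through. Again, the state names and helper functions are invented for illustration only, not taken from any real coherence controller.

```c
#include <stdio.h>

/* Hypothetical MOESI states -- illustrative names only. */
typedef enum { INVALID, SHARED, EXCLUSIVE, OWNED, MODIFIED } moesi_t;

/* Stand-in for a cache-to-cache transfer; main memory is left stale. */
static void supply_data_cache_to_cache(unsigned long addr) {
    printf("supplying 0x%lx directly to the requesting cache\n", addr);
}

/* Stand-in for broadcasting an intent-to-write (invalidate) on the bus. */
static void bus_send_invalidate(unsigned long addr) {
    printf("bus: intent to write 0x%lx, sharers must invalidate\n", addr);
}

/* Arc 1: we hold the line MODIFIED and snoop another processor's read.
   We provide the data ourselves instead of writing it back to memory,
   and drop to OWNED so we still remember the write-back obligation. */
static moesi_t snoop_remote_read(moesi_t state, unsigned long addr) {
    if (state == MODIFIED || state == OWNED) {
        supply_data_cache_to_cache(addr);
        return OWNED;
    }
    return state;
}

/* Arc 2: we hold the line OWNED and our own processor wants to write.
   Other caches may hold SHARED copies, so invalidate them first. */
static moesi_t local_write_while_owned(unsigned long addr) {
    bus_send_invalidate(addr);
    return MODIFIED;
}

int main(void) {
    moesi_t s = snoop_remote_read(MODIFIED, 0x2000);  /* M -> O, data supplied */
    s = local_write_while_owned(0x2000);              /* O -> M after invalidating sharers */
    printf("final state: %s\n", s == MODIFIED ? "MODIFIED" : "?");
    return 0;
}
```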
So other caches can get shared, read-only copies of the data. And then there's this arc here from owned to invalid, for when some other processor wants to write the data. Processor P1 here will see the intent to write from another processor, it will snoop that traffic, and at that point it will transition to the invalid state. Note that on this intent to write, we may need to provide the data across the bus while we're in the owned state: if we're the only cache that has the up-to-date data, and the other processor is going straight into the modified state via a write miss, we're going to need to provide the data.

Okay, so, questions about MOESI so far? It's a basic extra optimization: we can transfer data around between caches, one cache can have a cache line in the owned state, and later some other cache can have the exact same line in the owned state, and the line can bounce around without ever having to go out to main memory. That decreases our bandwidth out to the main memory system.

Okay. Next we're going to talk about MESIF, which is actually used in the Core i7, in the most up-to-date Intel processors. It looks very similar to MESI, except we're going to see an extra little letter in this one bubble here. Effectively, what's going on is that we add an extra state called the forward state. This is similar to the optimization we saw in MOESI, except the forwarding cache can't keep the data in a dirty, writable form; main memory stays up to date. What happens in this protocol is that the first cache which does a read miss on a line of widely shared data gets elected and gets the data in this forward state. Then, if other caches want read-only copies and bring the line in shared, instead of having to go out to main memory, the cache that has it in the forward state provides the data across the bus. So this effectively decreases our bandwidth to main memory by providing the data out of another processor's cache.

Now, this is a little bit of a simplification, because there's a question here: if you're in this forward state and you invalidate the data, who has it then? Does anyone provide the data? There are two choices. One choice is that, since no one has it in the forward state anymore, a snooped request for the line just has to go out to main memory. That's the easy case. The other choice is to try to build a protocol where, when one cache invalidates its forwarded copy, it chooses another cache to take over. But probably the simplest thing to do is, when the forwarding core invalidates the data for whatever reason, you just go back out to main memory, because there's always a copy in main memory; effectively you're only keeping read-only copies around.

[Student question.] Yeah, you're right, you'd probably enter the exclusive state; that's a good question. I've read two different versions of this in different books, so I'm not quite sure Intel actually documents what they do here. But you're probably right: if you're the only one with a read-only copy, you probably enter straight into the exclusive state, and then where you would have transitioned from E to S, you instead transition from E to F.
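Here is a small sketch of the forwarding idea under the simple policy described above: the first cache to downgrade keeps the line in F and answers later read misses, and if the forwarder has invalidated its copy the request just falls back to main memory. Everything here (the state names, `service_read_miss`, the toy array of caches) is made up for illustration, and dirty lines are ignored since this only deals with clean, widely shared data.

```c
#include <stdio.h>

/* Hypothetical MESIF states and a toy set of caches -- illustrative names only. */
typedef enum { INV, SHR, EXC, MOD, FWD } mesif_t;

#define NCACHES 4

/* Service a read miss from cache `requester` for one widely shared, clean line.
   The cache holding the line in E or F supplies the data (downgrading E to F),
   the requester takes it Shared, and if nobody holds F the request simply
   falls through to main memory. */
static int service_read_miss(mesif_t state[NCACHES], int requester) {
    for (int i = 0; i < NCACHES; i++) {
        if (i == requester) continue;
        if (state[i] == FWD || state[i] == EXC) {
            state[i] = FWD;          /* the first downgrader keeps the forward role */
            state[requester] = SHR;
            return i;                /* data supplied cache-to-cache */
        }
    }
    state[requester] = SHR;          /* no forwarder left: memory provides the data */
    return -1;
}

int main(void) {
    mesif_t st[NCACHES] = { EXC, INV, INV, INV };  /* cache 0 read the line first */
    printf("cache 1 served by %d\n", service_read_miss(st, 1));  /* 0 supplies, E -> F */
    printf("cache 2 served by %d\n", service_read_miss(st, 2));  /* 0 (now F) supplies */
    st[0] = INV;                                                 /* forwarder drops its copy */
    printf("cache 3 served by %d\n", service_read_miss(st, 3));  /* -1: back to main memory */
    return 0;
}
```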
So the first cache that actually downgrades always ends up in the F state. Like I said, though, I've seen other references where people implement something similar but hold some kind of election to figure out which node becomes the forwarder; probably the easiest thing to do is just to downgrade from E to F.

For the rest of the course, we're going to look at how to scale beyond these broadcast-and-invalidate protocols that have to snoop on a bus. One of the problems with building these snooping systems is that they really affect how you design your processor. First of all, you're going to have to add more bandwidth into your cache, or at least more bandwidth into your tag array. One choice is to dual-port your tags. Another choice is to steal cycles for snoops. What I mean by stealing cycles is that if a bus transaction is happening and you need to check it against your tags, you block the processor associated with that cache from accessing the cache that cycle; you generate a stall signal to the cache, or to the main pipeline. One thing that gets a little tricky, and this affects your design, is multilevel caches. Usually you want to put your L2 tag array on the bus and snoop against that. But if the snoop hits there and you figure out you have to invalidate something, you have to invalidate down the entire cache hierarchy, all the way to the level-one cache. So this can affect the throughput of your level-one cache, and it's also annoying to do, because it effectively has to reach down and touch the tag array of your L1. And as I mentioned briefly last time, if you're thinking about something like an exclusive cache hierarchy, where the L2 does not hold the lines that are in the L1, you're going to have to check both sets of tags for every snoop transaction, and that can be pretty painful. Or you keep a copy of the L1 tags, which is effectively the same as having an inclusive cache, just with a little less data storage.

Okay, so what limits our performance? Why can't we just build 1,000 processors on a big bus? Well, it's the same idea as having 1,000 people in this room all trying to shout at each other at the same time. At some point you run out of bandwidth, and more importantly you need some way to coordinate them. Because bus transactions are required to serialize, the occupancy of the bus goes up. If you have one bus with two people talking on it, and each talks, say, 10% of the time, you have a 20% utilized bus. If you have ten people on that bus, you have a 100% utilized bus, and if you have 1,000 people, you have a badly oversubscribed bus. So you have to worry about both bandwidth and occupancy, because we do need these bus transactions to be atomic; it's not just a bandwidth problem. You could make the bus wider to increase the bandwidth, but that alone doesn't solve the problem, because there's also an occupancy challenge: you effectively need atomic transactions to happen across the bus in order to keep the cache coherence protocol correct.
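Just to put numbers on that occupancy argument, here is a back-of-the-envelope loop. The 10% figure is only the number used in the example above, not a measurement.

```c
#include <stdio.h>

int main(void) {
    /* Every bus transaction has to serialize, so the occupancies of the
       individual processors simply add up. */
    double per_proc = 0.10;                 /* each processor needs the bus 10% of the time */
    int counts[] = { 2, 10, 100, 1000 };

    for (int i = 0; i < 4; i++) {
        double occupancy = counts[i] * per_proc;
        printf("%4d processors -> %7.1f%% of one bus%s\n",
               counts[i], occupancy * 100.0,
               occupancy > 1.0 ? "  (oversubscribed)" : "");
    }
    return 0;
}
```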
Okay, so before we move off this topic onto the interconnection networks we'll be talking about today, I want to talk about one more challenge that comes up in simple cache coherence systems, and that's false sharing. Caches like to track information at a particular block size. We've talked about caches with 64-byte lines, or 64-byte block sizes; they can be bigger or smaller than that. Now, one thing that happens in these coherence protocols that is pretty unpleasant is the following. Say you take a piece of data which is shared, needs to be kept coherent between two different processors, and gets communicated relatively often, and you put some other piece of critical data right next to it on the same cache line. Because they're packed into one cache line, and we only track coherence information on a per-cache-line basis, whenever the first piece of data, let's say a four-byte integer used as a lock or something like that, gets bounced around between caches, you're going to bounce the other data around with it, even though that second four-byte integer is not shared at all. So this can hurt your common-case performance for non-shared data, just from this false sharing happening. And this is not something that typically happens in a uniprocessor cache system: there you bring the line in, everything comes in together, and you get spatial locality. You can get conflicts that kick the line out, which is sort of equivalent, but it's a different idea; it's not about two things sharing the same line. With false sharing, though, we really do see this.

Now, false sharing is interesting because people have come up with a whole host of techniques to avoid it. Anyone have an idea of one really good technique to avoid false sharing? What we can do, and this is pretty common, is have either the programmer or the compiler detect that this is happening and pad the data out. So we waste memory for highly contended pieces of data, and co-locate them with nothing that is shared. One of the better examples of why you really have to care about this is your stack. There's a lot of data on your stack that you use often, and it's all local; stacks are local to their threads. But if you have some variable that you pass to someone else, which is a struct, and inside that struct is a lock or something like that, all of a sudden you're going to be bouncing around a line that is part of your stack, and other people are going to be invalidating your stack. One way to solve this, and the compiler can sometimes recognize it because some languages have special keywords for designating memory addresses as locks, is for the compiler to say: don't put this with anything else, or maybe only co-locate it with other locks, since those may have bad sharing behavior anyway. Really, what you want to do here is avoid the false sharing case altogether.
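Here is a minimal sketch of the padding idea, assuming 64-byte lines. The struct layouts and field names are invented for illustration; the point is only the contrast between the packed and the padded layout.

```c
#include <stdalign.h>
#include <stdatomic.h>
#include <stdio.h>

/* Bad layout: a heavily contended flag (think: a lock word) sits right next
   to a piece of private, per-thread data.  Both land in the same 64-byte
   line, so every update of `flag` on one core invalidates the other core's
   copy of `local_counter` even though `local_counter` is never shared. */
struct packed_layout {
    atomic_int flag;
    int        local_counter;
};

/* Padded layout: give the contended field its own cache line, wasting some
   memory so that nothing unshared ever sits next to it. */
struct padded_layout {
    alignas(64) atomic_int flag;           /* gets a 64-byte line to itself */
    alignas(64) int        local_counter;  /* starts on the next line */
};

int main(void) {
    printf("packed: %zu bytes, padded: %zu bytes\n",
           sizeof(struct packed_layout), sizeof(struct padded_layout));
    return 0;
}
```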
Now, the analog of false sharing is actually true sharing. There are cases where you have multiple pieces of data on the same line that really are shared, and widely shared, just by different processors at different times. An example of this is an array of locks, where different processors will be grabbing these locks more or less randomly. You can use techniques similar to the ones for false sharing here. You probably don't want all of those locks on one cache line, because that line is going to be bouncing around, with everyone contending for it so they can get it modified, in the M state, in their own cache. So what you can think about doing is applying the same technique and putting each of those locks on a separate cache line, as in the sketch below.
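A sketch of that, again assuming 64-byte lines and POSIX threads; the array size and names are made up for illustration. The padded version trades memory for the guarantee that contention on one lock never drags a neighboring lock's line around.

```c
#include <stdalign.h>
#include <stdio.h>
#include <pthread.h>

/* Densely packed locks: several of them share each 64-byte line, so threads
   contending on *different* locks still fight over the same cache line. */
static pthread_mutex_t dense_locks[16];

/* One lock per cache line: more memory, but each lock's line only bounces
   between the caches of the threads actually using that particular lock. */
struct padded_lock {
    alignas(64) pthread_mutex_t lock;
};
static struct padded_lock spread_locks[16];

int main(void) {
    for (int i = 0; i < 16; i++) {
        pthread_mutex_init(&dense_locks[i], NULL);
        pthread_mutex_init(&spread_locks[i].lock, NULL);
    }
    printf("dense: %zu bytes per lock, padded: %zu bytes per lock\n",
           sizeof(pthread_mutex_t), sizeof(struct padded_lock));
    return 0;
}
```

Okay, so let's switch gears here.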