So let's think about the snoopy, bus-based protocols we've talked about, and their asymptotic performance requirements. What are the challenges of a snooping protocol? As we discussed before, as you add more processors to the system, you have more entities shouting on one shared medium, and you have to hear all of the shouting. You can't just ignore a shout, because whenever another core takes a cache miss you need to snoop that transaction against your local cache and make sure you don't have a copy, or invalidate it, or write the data back, or make whatever other adjustment the coherence protocol requires. What's annoying about this is the bandwidth it requires on the bus: every cache miss has to go across that bus, everyone needs to look at it, and everyone needs a snoop port with enough bandwidth to check all of those transactions against their cache. So if we want to keep up with the same amount of cache-miss bandwidth per core as we add more cores to the system, the bus bandwidth has to grow as order N, where N is the number of processors. You can compute this directly: if every core generates the same amount of miss traffic, and you want to sustain the same miss rate, you just multiply the per-core traffic by N, because each core is going to contribute that much. That will be fine when N is eight, but if N goes to a thousand or a million, you're going to have serious problems; that's a very, very big bus. And it's not just raw bandwidth: you also need to arbitrate for the bus somewhere, and you need atomic transactions going across it, so even if you have a very high-bandwidth bus you may not have enough bus cycles to operate it.
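To make the order-N growth concrete, here is a minimal back-of-the-envelope sketch in C. The per-core numbers (access rate, miss rate, line size) are assumptions picked purely for illustration, not figures from the lecture; the only point is that the traffic every snoop port must absorb grows linearly with the number of cores.

```c
#include <stdio.h>

int main(void) {
    /* Assumed, illustrative per-core numbers -- not from the lecture. */
    double accesses_per_sec = 1e9;   /* memory accesses issued per core per second */
    double miss_rate        = 0.02;  /* fraction of accesses that miss */
    double line_bytes       = 64.0;  /* cache line size in bytes */

    /* Traffic one core puts on the shared bus, in bytes per second. */
    double per_core_bw = accesses_per_sec * miss_rate * line_bytes;

    /* On a snooping bus every cache must observe every other core's misses,
       so the bus and each snoop port must sustain N times that traffic. */
    for (int n = 8; n <= 1024; n *= 2) {
        double total_bw = per_core_bw * n;   /* grows as order N */
        printf("N = %4d cores -> %7.1f GB/s of snoop traffic\n", n, total_bw / 1e9);
    }
    return 0;
}
```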
So the solution to this is something we're going to call directory cache coherence, or directory protocols. The key idea in a directory protocol is that instead of broadcasting your invalidations to every other core, or every other cache, in the system, you go talk to a location we're going to call the directory, and this directory keeps track of which caches have that data. What's nice about this is that now, when you take a cache miss, you can go ask the directory: who has this cache line? And if only one other core has the cache line, say readable, and you're trying to take it into exclusive access because you want to write to it, you only need to invalidate that one location instead of sending messages to all N processors in your system. So we've cut down what was a broadcast system into a point-to-point system, and we can use point-to-point interconnects for it. The overhead we have to keep now is that the directory has to track all the caches which could have a particular cache line in them. We'll go through a much more detailed example of that later, but that's the overall key idea. Another good point of scalability here is that you don't have to have one big monolithic directory. Instead you can segment the address space somehow, and depending on the address you go to a different directory; by spreading requests across these different directories, you can increase the bandwidth to your directories.

Okay, so let's see how this fits into a block diagram. We have CPUs trying to communicate with other CPUs via shared memory, and they check their cache first. Before, on a miss, they would have had to cross a bus and everyone would have to look at that traffic. Instead, with directory cache coherence, the cache sends a message to the directory controller associated with that address. The basic directory controller keeps track, for every single line in memory, of the list of other caches which could potentially have that piece of data; we're going to call that the sharer list. Now, this might still be a uniform communication network. Let's say in here you have some omega network: anyone can talk to anyone else, and the latency through it is fixed. So this is still a uniform memory access system; we didn't have to go non-uniform. No cache is necessarily closer to or farther from any piece of memory in a system like this. This is kind of our naive directory cache coherence protocol, but what's still nice is that we don't have to broadcast on our omega network, or mesh network, or whatever else is on the inside. We just send from this cache directly to this directory controller. If no other cache has the line readable or writable, the controller can simply respond with the data from memory. If other caches do have it, then instead of broadcasting an invalidate to all other cores, the directory can say: this cache and that cache have copies, so I need to send two messages, one to each of those caches, invalidate them, wait for the responses, and then reply back with the data. So in the common case we decrease the bandwidth we use across our interconnection network.
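To make the sharer list and the point-to-point invalidation path concrete, here is a minimal sketch in C. The entry layout (a small state enum plus a bit vector of sharers), the function names, and the printf stand-ins for interconnect messages are all my assumptions for illustration; a real directory protocol has more states and has to wait for invalidation acknowledgements and handle races.

```c
#include <stdint.h>
#include <stdio.h>

#define MAX_CACHES 64   /* assumed system size: one sharer bit per cache */

typedef enum { DIR_UNCACHED, DIR_SHARED, DIR_EXCLUSIVE } dir_state_t;

/* One directory entry per memory line: who might hold a copy, and in what state. */
typedef struct {
    dir_state_t state;
    uint64_t    sharers;   /* bit i set => cache i may hold this line */
} dir_entry_t;

/* Stand-ins for the interconnect: a real system sends point-to-point messages. */
static void send_invalidate(int cache_id, uint64_t addr) {
    printf("  invalidate line 0x%llx in cache %d\n", (unsigned long long)addr, cache_id);
}
static void send_data_reply(int cache_id, uint64_t addr) {
    printf("  data reply for line 0x%llx to cache %d\n", (unsigned long long)addr, cache_id);
}

/* A requester wants the line exclusive (it is about to write). Instead of
   broadcasting, the directory invalidates only the caches on the sharer list. */
static void handle_write_miss(dir_entry_t *e, uint64_t addr, int requester) {
    for (int i = 0; i < MAX_CACHES; i++)
        if ((e->sharers & (1ULL << i)) && i != requester)
            send_invalidate(i, addr);            /* point-to-point, not broadcast */
    /* (A real protocol waits for invalidation acks before replying.) */
    e->sharers = 1ULL << requester;              /* only the writer holds it now */
    e->state   = DIR_EXCLUSIVE;
    send_data_reply(requester, addr);
}

int main(void) {
    /* Line currently shared by caches 2 and 5; cache 0 now wants to write it. */
    dir_entry_t line = { DIR_SHARED, (1ULL << 2) | (1ULL << 5) };
    handle_write_miss(&line, 0x1000, 0);
    return 0;
}
```

Run on a line shared by caches 2 and 5, with cache 0 taking the write miss, this sends exactly two invalidations rather than a broadcast to every cache in the system.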
Now I'm going to show a slightly different picture, which is pretty similar to the previous one, except you'll notice that the memory and the directory are now connected to an individual CPU. So why do we do this? Well, if you're building one of these scalable systems, some sort of supercomputer, it's a nice property that as you add more CPUs to the system you also add more RAM, and maybe more directory storage. Another positive of a design like this is that each CPU is now actually close to its own memory bank, and we can try to take advantage of that. So one question that comes up is: how can we take advantage of that? Anyone have any thoughts? Okay, so for shared data we don't know where it's going to be accessed; it could be accessed by all six CPUs and all six caches here. But it's very common that the stack for your program is only going to be accessed locally, and the instruction memory for your program is only going to be accessed locally. So you can potentially get a performance benefit by putting the instructions, the stack, and maybe even some portion of the heap close to this core, because then you can access them really quickly, and only shared data has to go across the interconnect. In fact, that has a fancy name: systems where some data is close and some data is far away are called Non-Uniform Memory Access, or NUMA, systems. You might see this even in your desktop processors, which are actually moving toward NUMA designs. I believe AMD's chips today are already NUMA systems even within a single package: one chip built from multiple dies, with two NUMA nodes inside, one per memory controller. So if you go into something like Linux and look under /proc and /sys, you can actually see the NUMA configuration of the different memories. The OS can then take advantage of this, so it can place, for instance, the stack and the instruction memory for a program that's being run by a particular core close to that core, and make some other placement choice for other data. Now, I want to make a point here: just because the latency to memory is different does not mean that your system is a directory-based cache-coherent NUMA system. You can still have non-directory-based systems where some memory is close and some memory is far away: basically a bus, or maybe some other interconnection network in there which is still effectively a snooping protocol, but where some data is close and some data is far away. When you see this in the literature, people talking about directory-based cache-coherent NUMA systems will usually call them CC-NUMA, or cache-coherent NUMA, systems; that usually means a cache-coherent non-uniform memory access architecture and usually implies it is directory based, though there may be other protocols people use out there as well.
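As a concrete taste of how software can exploit this kind of locality on Linux, here is a minimal sketch using libnuma. The build line and every call assume the libnuma development package is installed; this illustrates the idea of keeping private data on the local node, and is not a description of how the OS actually places your stack or instructions.

```c
/* Build with: gcc -D_GNU_SOURCE numa_local.c -o numa_local -lnuma */
#include <stdio.h>
#include <sched.h>
#include <numa.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not available on this system\n");
        return 1;
    }

    int cpu  = sched_getcpu();           /* which core is this thread running on? */
    int node = numa_node_of_cpu(cpu);    /* which NUMA node owns that core? */
    printf("running on CPU %d, NUMA node %d (highest node: %d)\n",
           cpu, node, numa_max_node());

    /* Private, core-local data: allocate it on the node we are running on,
       so accesses stay close instead of crossing the interconnect. */
    size_t bytes = 1 << 20;
    char *buf = numa_alloc_local(bytes);
    if (!buf) {
        perror("numa_alloc_local");
        return 1;
    }
    buf[0] = 1;                          /* first touch actually places the page */
    numa_free(buf, bytes);
    return 0;
}
```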
Okay, so I want to go back one slide, because I want to finish off by talking about one topology which is interesting. The difference between these two slides is that we went from a CPU here to CPUs: this is a multi-core chip now. Where this gets interesting is that you might have a directory-based cache coherence system connecting multiple chips, but inside each chip you may have something like a bus-based snooping protocol, so we actually mix and match these two things. The way this works is that the cores inside one chip, when they go get data from each other, can just snoop on each other; but outside of that, your cache controller, or maybe the L3 cache for that particular chip, responds to messages coming from the directories, like invalidation requests, and does something about them. So there's basically a transducer there between a directory-based cache coherence protocol and a bus-based snoopy protocol. And this is pretty common these days, especially given the number of multi-core chips showing up and being used in these directory-based cache coherence systems. We'll talk about one of them at the end: the SGI UV, or UV 1000, which uses off-the-shelf, modern-day Intel Core i7-class parts, mixed together with a NUMA, directory-based coherence system to connect all the chips together. So there's a transducer from each chip's external snoop bus protocol to the directory-based coherence protocol.