Okay, so: multiple logical communication channels. All of these different message types we just talked about, the ten or so not drawn here, can't all go on the same network at the same time. If you have one interconnection network, or one logical network here, you can end up with cases where responses get queued behind requests, and that creates more traffic. In these protocols we have multiple phases of communication: we send messages from point A to point B, then from B to C, then from C back to B, and then B back to A, or something like that. So if you try to overlay this onto an interconnection network and you do head-of-line processing, meaning you process the message at the head of the queue first, it's very common to have a new request come in that gets queued in front of a response, where that response depends on some other transaction happening somewhere else in the system, and all of a sudden you're going to have a deadlock. So one way people go about solving this is to look at all the different message types you have and figure out which types can coexist on the network at the same time. A good example: if you look at a relatively basic cache coherence protocol, there are maybe twelve different message types. You can try to shrink that down to maybe three or four different classes of traffic, such that those classes, when intermixed, will never introduce a deadlock. If you wanted to be really safe, you'd just have twelve separate networks, or twelve different logical networks, but that's expensive to build, so people try to come up with equivalence classes of traffic and then only have that number of networks. Either way, you're going to have to segregate these flows into different logical or physical channels to make sure you don't have deadlock happening.

[COUGH] Another thing I wanted to point out here is that we just distributed our memory, so we need to start worrying about the memory ordering point: for a given memory address, where do you go, and in what order do memory operations happen? Just like in a bus-based protocol, where you have some sort of arbiter that determines the transaction order, or who wins the bus, in a directory-based system the directory for a given address is typically the ultimate arbiter. So if two messages are coming in from different cores, both going for write access to a particular address, one of them is going to get to the directory first, and whichever gets to the directory first wins, or ultimately will win. Sometimes this can be unfair, so in some of these systems you try to have something that prevents the core that happens to be close to the directory from always winning access to that contended line, we'll say; you may have some sort of back-off protocol coming into effect there. But this is what effectively guarantees our atomicity: the directory guarantees that a particular line can only be transitioning from one state to another state at a given time. So we have the directory as the ordering point, and whichever message gets to the home directory first, let's say, wins. Subsequent requests for that line are going to lose. So what do you do on a loss?
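Here is a minimal sketch, in C, of the directory acting as the ordering point for one line, including what a request that loses the race gets back. The state names, the busy flag, and the helper structure are illustrative assumptions, not the exact protocol from the lecture.

/* Minimal sketch: the home directory as the ordering point for one line.
 * Whichever request reaches it first wins; while the line is mid-transition,
 * a later request is refused with a NACK (retry).  State names, the busy
 * state, and the reply encoding are illustrative assumptions. */

enum dir_state { DIR_IDLE, DIR_SHARED, DIR_MODIFIED, DIR_BUSY };

struct dir_line {
    enum dir_state state;
    int owner;                  /* core that currently holds the line modified */
};

enum reply { REPLY_DATA, REPLY_NACK };

/* Handle a write (get-modified) request arriving at the home directory. */
static enum reply dir_handle_getm(struct dir_line *d, int requester)
{
    if (d->state == DIR_BUSY)
        return REPLY_NACK;      /* line already transitioning: lose, retry later */

    d->state = DIR_BUSY;        /* this request won the race; later ones lose */
    /* ...send invalidations/forwards, collect acks, then hand the data over... */
    d->owner = requester;
    d->state = DIR_MODIFIED;
    return REPLY_DATA;
}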
Well, the directory is probably going to want to send a negative acknowledgement, or NACK; or it sends a retry, which is the same thing here. It's going to say: I can't handle this right now, this line is currently being transitioned already, someone else won the transitioning of this line, go retry that memory transaction again in the future. Now, this gets pretty complicated back at the cache controller, because it's going to get a retry, or a NACK, back, and it could potentially have a message coming in for that exact same cache line. It needs to give the directory's message priority in this case over the pipeline of the main processor, so it's going to have to order its own request after that, because the directory is the ultimate ordering point for the memory location. Finally, you have to worry about forward progress guarantees. What I mean by this is, it's pretty easy to build a memory system where your cache controller pulls the data into your cache, but before you're able to do a load or store to that actual address, the line gets bumped out of your cache. All of a sudden you have a cache line just sort of bouncing between two caches, and that's a livelock scenario. So in these directory-based coherence systems, and this also happens with bus-based protocols although there it's usually less likely, you have to have some sort of forward progress guarantee. And the reason it's less likely there is that in a directory-based system the communication time is usually longer, so the window of vulnerability is longer. A forward progress guarantee means that once the line gets into your cache, you need to do at least one memory operation to it before you relinquish the line back to the directory controller. So if you're doing a load, you actually bring the line into your cache shared, do that one load, and only then are you allowed to cough it up; you don't respond back with a reply to, let's say, an invalidation request from the directory controller until you've done your one memory transaction. Likewise, if you bring the line in modified, do that one store before you cough the data back up, or release the data back to the memory controller. That's really important, to make sure you have some forward progress in the system and don't end up with a livelock.
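Here is a minimal sketch of that forward-progress rule: the cache controller holds off on answering an invalidation from the directory until at least one local load or store has completed on the newly filled line. The field names and the deferred-acknowledgement mechanism are illustrative assumptions rather than the exact hardware described in the lecture.

/* Minimal sketch of the forward-progress guarantee described above: once a
 * line is filled into the cache, at least one local load or store must
 * complete on it before the controller gives the line back to the directory.
 * Field names and the deferred-ack mechanism are illustrative assumptions. */

#include <stdbool.h>

struct cache_line {
    bool valid;
    bool used_once;    /* set after the first local load/store to this line */
    bool inv_pending;  /* the directory has asked for the line back */
};

static void send_inv_ack(struct cache_line *l)
{
    /* stub: reply to the home directory and relinquish the line */
    l->valid = false;
    l->inv_pending = false;
}

/* Called when an invalidation (or other recall) arrives from the directory. */
static void on_invalidate(struct cache_line *l)
{
    if (l->valid && !l->used_once)
        l->inv_pending = true;   /* defer the reply until our one access happens */
    else
        send_inv_ack(l);
}

/* Called when the local pipeline completes a load/store that hits this line. */
static void on_local_access(struct cache_line *l)
{
    l->used_once = true;
    if (l->inv_pending)
        send_inv_ack(l);         /* now it is safe to cough the line back up */
}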
Okay, so now we're going to get into the more future-looking stuff here. We've been talking about what's called a full-map directory, where there's a bit per processor core. If you have 1,000 cores in the system, that's a very large bit map, and it's pretty uncommon that a thousand cores will all be reading one particular cache line. So that can be wasteful, and a full-map directory grows at order N, which is not particularly good. So people have looked into different protocols here; I just want to touch on this, because this is largely an area of ongoing research and future directions. One idea is to have a limited-pointer-based directory. So instead of having a bit mask for the sharer list (right, this is the sharer list, sort of a bit mask of the sharer list), you have base-two-encoded node numbers, up to some point. This is why it's called a limited directory, or limited pointer directory: you can't have all the nodes in the system on the list, there's only a limited number of entries, but you can name them, because you have a base-two encoding of the actual node number. And if you outgrow this list, so let's say you have one, two, three, four, five entries and all of a sudden a sixth node wants a copy of the line, usually there's an overflow bit which says there are more sharers than you can name. When that bit is set and the line has to be invalidated, say on a transition to modified, you're just going to send a broadcast message, an invalidate to every single cache in the system. But usually this can be a good trade-off, because it's pretty uncommon to have extremely widely shared lines. So it's an interesting trade-off: storage space versus sending more messages in the system. Likewise, there's an interesting protocol here called LimitLESS, where, same idea, it's a limited directory with an overflow bit, but if it overflows you start to keep the directory, or rather the sharer list, in software, in a structure in main memory. And this requires you to basically have very fast traps, such that when this happens, because you're servicing a cache line here, you interrupt the main processor and the main processor provides the rest of the sharer list, for instance, if the sharer list has overflowed. And there's a bunch of stuff in between, and some future research that's still being done actively in this space.

Beyond simple directory coherence, we have some pretty cool on-chip coherence. The reason this is being studied a lot right now is that people built these massively parallel directory-based machines in the past; they got some use and were very good for some applications. But now we're starting to put more and more cores onto a given chip, so we start to see on-chip coherence, and people are figuring out how to leverage the fast on-chip communication along with directories to make more interesting coherence protocols. There's also something called COMA, or Cache Only Memory Architectures, where instead of having the data fixed in main memory, you don't really have main memory at all and the data and directories move around. That's beyond the scope of this course; go look at the KSR-1 if you want to read about that kind of research. And research on how to scale the sharer list is still active.

Briefly, I wanted to talk about the most up-to-date versions of these things. We have the SGI UV 1000, which is a descendant of the Origin and Origin 2000 machines from SGI. Lots of cores here, 2,560 cores, all kept cache coherent using a directory-based coherence protocol. It's very non-uniform, and it's all connected together by a multi-chassis 2D torus. So this is one chassis, and there are actually up to eight of these. Princeton has one of these in the HPCRC center, which I think is four chassis, so about half the maximum size. [COUGH] An on-chip example here is the TILEPro64, which has 64 cores. Each of the cores is itself a directory home, and it runs a directory-based protocol. And remember I was talking about dividing communication traffic into different flows: well, there are three different memory networks here, so we had to come up with three different classes of traffic that would not deadlock among themselves, if you will. There are also four memory controllers connected into the interconnect, and because of this the communication latency is different: a core here talking to this memory controller is very fast, whereas from a core over there it takes longer, though maybe that core is close to a different memory controller. So there's non-uniformity here.
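To tie this back to the earlier discussion of splitting protocol traffic into a few classes that cannot deadlock each other, here is a minimal sketch of mapping message types onto a small number of logical networks. The message names and the three-way request/forward/response split are illustrative assumptions, not the TILE chip's actual protocol.

/* Minimal sketch of grouping coherence message types into a small number of
 * deadlock-free traffic classes, each carried on its own logical (virtual)
 * network.  The message names and the three-class split are hypothetical
 * illustrations of the idea described in the lecture. */

enum msg_type {
    /* requests from a cache controller to the home directory */
    MSG_GETS, MSG_GETM, MSG_PUTS, MSG_PUTM,
    /* requests forwarded by the directory to an owner or sharer */
    MSG_FWD_GETS, MSG_FWD_GETM, MSG_INV,
    /* responses, which must never wait behind requests */
    MSG_DATA, MSG_DATA_EXCL, MSG_ACK, MSG_NACK, MSG_WB_ACK
};

enum traffic_class { CLASS_REQUEST, CLASS_FORWARD, CLASS_RESPONSE };

/* Map each message type to the logical network it travels on.  Responses get
 * their own class so a reply can always drain even if the request networks
 * are backed up, which is what breaks the head-of-line deadlock. */
static enum traffic_class classify(enum msg_type t)
{
    switch (t) {
    case MSG_GETS: case MSG_GETM: case MSG_PUTS: case MSG_PUTM:
        return CLASS_REQUEST;
    case MSG_FWD_GETS: case MSG_FWD_GETM: case MSG_INV:
        return CLASS_FORWARD;
    default:
        return CLASS_RESPONSE;
    }
}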
Okay, so this is our last slide of the term here: Beyond ELE 475. If you want to go on and do more, [COUGH] well, start reading some papers from the different computer architecture conferences. The proceedings of the International Symposium on Computer Architecture, ISCA, are a good place to start; that's probably the top major architecture conference in the field. The International Symposium on Microarchitecture, MICRO, is the top microarchitecture conference, so look there if you're interested in what goes on inside a processor and some of the smaller microarchitectural details. ASPLOS, Architectural Support for Programming Languages and Operating Systems, is a conference that covers a lot of work at the boundary between software and hardware. And HPCA looks at, or used to look at, bigger, higher-performance computer systems, but now a lot of mainstream computer architecture work ends up in that conference as well. You can also go do research: build chips, build some test [INAUDIBLE] on FPGAs, and learn more about parallel computer architecture. And indeed, you can come back in the Fall and take ELE 580A, which is going to be a graduate-level, primary-sources paper-reading course, a more traditional graduate course where you'll learn about all the different parallel computing systems. It's called parallel computation because it covers both parallel computer architecture and the programming models that hook into it, the parallel programming together with the architectures, because they go very much hand in hand. So let's stop here for today, and stop here for the course.