Okay, now we get to move on to the meat of today: we're going to talk about non-blocking caches, also known as out-of-order memory systems, also known as lockup-free caches. I think the first paper actually published on this called it the lockup-free cache. Most people call these things non-blocking caches today, and if you think about it from a memory perspective, it's an out-of-order memory system.

What does a non-blocking cache allow you to do? It enables subsequent memory operations from the main processor pipeline to keep going even when you have a miss that was earlier in your instruction sequence. All the pipelines we've looked at up to this point, even our out-of-order pipelines, basically said that on a cache miss you just stop the pipe, because we couldn't deal with memory coming back out of order, even in our out-of-order processor pipelines; we didn't have enough bits to track all that. Now we're going to talk about structures that allow us to track out-of-order memory.

Two major things this allows you to do: it allows you to have a cache hit under a miss, and it allows you to have a miss under a miss. What do I mean by miss under miss? Well, you do an access, say a load, to some address, and it takes a cache miss. You keep executing your program, you do another load, and that takes a cache miss also, and you're able to process both of these if you have a non-blocking cache that allows miss under miss.

One of the big points I want to get across today is that this is not only for out-of-order processors. We've talked a lot about out-of-order processors, but believe it or not, you can actually hook up a non-blocking cache to an in-order processor, or even a VLIW in-order processor. How do you go about doing that? There are a couple of ways; one is that when you take the cache miss, you mark the destination register as not being there, so when you go to actually read the data, you block. We'll show an example of that a few slides from now. What I'm really trying to get across is that you can have in-order processors with out-of-order memory systems, and you can have out-of-order processors without out-of-order memory systems. Both of these are possible.

Okay, so a couple of big challenges here. If you have multiple outstanding misses, the memory system is going to return data out of order, and you have to deal with that somehow. The misses might end up in different memory banks, and the data comes back in a different order than you sent it out in, and this gets hard to deal with. You sent out cache misses for x, y, and z, and they come back in z, y, x order, or something like that, and you need to make sure you're delivering the right data to the right location in the cache, and the right data to the right instruction. So we're going to need some big associative table to figure this out.

The second major challenge is that lots of times you're going to have a load or a store, and then another load or store that misses to the same address, or the same line, later. What do I mean by that? Well, it's pretty common that you're going to be doing loads sequentially through memory, and if the first load in a cache line misses, the second load is also going to miss, but they're on the same cache line.
You don't want to send two of those out to the memory system at the same time, because you might confuse the memory system. Worse, say you have a load to one address and a store to another address, but they're both in the same cache line. The load might go out to memory, then you do the store, the store goes out to memory and updates main memory somehow, but you bring in the original load data, and they sort of pass in transit: the load data passes the store data, and all of a sudden you have the wrong data in your cache. It never got updated, and the reason that ends up happening is the store didn't have any place to merge into; it couldn't actually deposit its data into the cache, for instance.

Okay, so how do we go about handling this? Before we do that, let's look at a timeline. Time goes from left to right on this graph. At the top we have a blocking cache. With a blocking cache, you're happily running the CPU, you do a load (or a store, it doesn't matter in these systems), and you take a cache miss. In the most basic blocking cache you wait for the cache line to get filled in, then you return the data and keep running the CPU. There's no overlap happening here.

But we want to go faster. If we have a non-blocking cache, we can take a hit under a miss. What that means is we're running along, we take a cache miss, and this goes out to main memory, but we don't stop the CPU from executing. Say it's a load to register five. As long as no one looks at register five, does the processor care that we took a cache miss on register five? Probably not. Now, there might be some complexity if that load to register five takes a trap or an interrupt; then it might care, but let's say it doesn't. One of the reasons this is usually safe to do is that you've already done all the memory checks and you're sort of past the commit point, because you're already pretty late in the processor pipe by the time you know you have a cache miss. So it's not too harmful. Then you keep executing, you access your cache again with a different load, you get a hit, that's great, you just keep executing. It's only when you go to use the missing data that you stall. Effectively, we've overlapped computation with our miss penalty, and this is pretty nice.

Something else we can do is misses under misses. With a miss under miss, we're executing along, we take a cache miss and send a request out to the main memory system to go get the data, but we don't stop executing; we keep executing, we take another miss, and we send that out to the main memory system too. At some point, maybe much later, we actually go to look at the data, and the CPU just keeps executing the whole time. It overlaps multiple memory accesses with computation, so this can be really powerful. And you can do this, as I said, with an in-order processor.

One thing I do want to say is that you usually have a limited number of outstanding memory accesses, some small integer, maybe four or eight. There are diminishing returns as you add more and more, so this usually isn't thousands of outstanding memory accesses.
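To make that overlap concrete, here's a minimal C sketch; the function and pointer names are made up for illustration, and I'm assuming `a` and `b` land on different cache lines. The point is that neither load result is consumed until later, so a miss-under-miss cache can have both misses in flight at once, while a blocking cache would serialize the two miss penalties.

```c
/* Hypothetical illustration of miss under miss: two independent loads
 * whose cache misses a non-blocking cache can overlap.               */
long sum_two(const long *a, const long *b, long n)
{
    long x = a[0];     /* miss #1 goes out to main memory             */
    long y = b[0];     /* miss #2 issues while miss #1 is in flight   */
    long t = 3 * n;    /* independent work overlaps with both misses  */
    return x + y + t;  /* only here does the pipeline stall for data  */
}
```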
So let's look at the structure here. There are a couple of different names for this thing, depending on which school of computer architecture design you come from. If you're from the Alpha, or Digital Equipment Corporation, design philosophy, you're going to call this thing a miss address file. If you come from the Intel school, you're going to call it a miss status handling register. "Miss status handling register" actually predates Intel; it goes back to, I believe, Control Data Corporation. There's a paper from Control Data Corp. on miss status handling registers from the late '60s or early '70s.

Let's look inside this thing and understand what's going on. You have some small number of miss status handling registers, probably in a register file. Each one has a valid bit, and a block address. Now, this block address is not the address of the load or the store; it's the address of the cache line that the load or the store lands in. When we use this structure, we're going to use it to check subsequent memory accesses, or subsequent misses, against previous ones that are still in flight, and we may have multiple in flight here, so we don't actually need the address of the load or the store. We need the address of the entire line, because we need to check the entire line; that's what's in play.

We have a bit here that says whether the miss has been issued or not. Why do we have that? Just because you took a miss doesn't mean the request is actually out to the memory system, or even going to the memory system, yet. You fill in this entry, and it sits around until you have some time to go talk to main memory, and hopefully that happens quickly. In some architectures you may not even need this bit; you might just stall until the request actually goes out to main memory. But once it's issued, you set that bit. What this allows you to do is have multiple misses which are not yet issued. So if you have a miss under a miss under a miss, and they happen really quickly, you can fill up this table quickly without having to wait for each request to stream out to main memory.

That's half of it. Then we have a bunch of load/store entries. These load/store entries also have a valid bit, and they have a pointer back to one of the miss status handling registers: a number which says which entry in the miss status handling register file they belong to. These are for the individual loads or stores that are occurring. What this allows to happen is, if you have a load miss to address zero and a load miss to address, I don't know, ten, and these are both in the same cache line, you can fill in this table with two entries, and they'll both point back to the same miss status handling register location, but with different offsets. And in the destination field, you fill in which register on the processor the data is destined for.

So what we're going to do is use these tables such that when a cache line comes back from main memory, we'll be able to check these two tables and do two things. One, we'll be able to fill the line into the cache somewhere. And two, we'll be able to return the data to the correct destinations, finding which piece of data each one needs given its offset in the cache line. That's a mouthful to say.
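Here's a minimal C sketch of the two tables as just described. The field names, widths, and table sizes are my own assumptions (the lecture doesn't pin them down); the miss path below shows the associative check against pending block addresses, the merge case, and the stall when either table is full.

```c
#include <stdint.h>
#include <stdbool.h>

#define NUM_MSHRS  8    /* small integer; diminishing returns beyond this */
#define NUM_LSES   16   /* load/store entries; size is an assumption      */
#define LINE_BYTES 64

typedef struct {
    bool     valid;
    bool     issued;      /* has the request gone out to main memory?     */
    uint64_t block_addr;  /* address of the whole cache line, not of the
                             individual load or store                     */
} Mshr;

typedef enum { LS_LOAD, LS_STORE } LsKind;

typedef struct {
    bool    valid;
    uint8_t mshr_index;   /* pointer back into the MSHR file              */
    uint8_t offset;       /* where in the line this access lands          */
    uint8_t size;         /* byte / half-word / word, in bytes            */
    LsKind  kind;
    uint8_t dest;         /* physical register (loads) or store-buffer
                             slot holding the data to merge (stores)      */
} LsEntry;

Mshr    mshr[NUM_MSHRS];
LsEntry lse[NUM_LSES];

/* On a cache miss: merge into a pending miss if that line is already in
 * flight, else allocate a new MSHR; either way add a load/store entry.
 * Returns false when a table is full -- the pipeline just stalls.       */
bool handle_miss(uint64_t addr, LsKind kind, uint8_t size, uint8_t dest)
{
    int e = -1;
    for (int i = 0; i < NUM_LSES; i++)        /* need a free entry        */
        if (!lse[i].valid) { e = i; break; }
    if (e < 0) return false;                  /* out of entries: stall    */

    uint64_t block = addr / LINE_BYTES;
    int m = -1;
    for (int i = 0; i < NUM_MSHRS; i++)       /* associative match        */
        if (mshr[i].valid && mshr[i].block_addr == block) { m = i; break; }
    if (m < 0) {                              /* line not in flight yet   */
        for (int i = 0; i < NUM_MSHRS; i++)
            if (!mshr[i].valid) { m = i; break; }
        if (m < 0) return false;              /* out of MSHRs: stall      */
        mshr[m] = (Mshr){ true, false, block };  /* issued to memory later */
    }
    lse[e] = (LsEntry){ true, (uint8_t)m, (uint8_t)(addr % LINE_BYTES),
                        size, kind, dest };
    return true;
}
```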
So this is a little bit of a complicated data structure to actually work out. Now I want to point out where the associative matches versus the indexed matches happen in these tables. One thing you might notice is that we have to check every subsequent miss against this block address. We take the higher-order bits of the address and check them against it, and if it matches, we don't add a new entry here. Instead, we merge it into the currently pending one. But we do have to add a new load/store entry for it, which points back over here.

When a memory transaction comes back from main memory, we look in this table and say, here's the one that I issued; I need to clear it out of the table. And given the index of the entry I'm clearing out of the table, I associatively check that index against all of the load/store entries and wake up the ones that match. So if this is entry number one and there are multiple entries pointing at number one, all of the ones pointing at number one need their data returned to their registers. Now, when I return the data to a register, I have to mark the register as being available again. And the type field here just says, you know, is it a word, half-word, or byte, and is it a load versus a store.

One other thing I wanted to say is about the destination field. For loads it's going to be a register identifier. Now, this might be an architectural register identifier, or a physical register identifier if you have register renaming. So if you have an out-of-order processor, this might get more complicated; there it's typically a physical register identifier, not an architectural register identifier. For stores, you also effectively need to track something in this table, because you need to merge the store with the background data. So a store miss is going to add an entry in here and send a request to main memory to go get the background data, and when that comes back, it's going to get deposited into our cache. The destination field needs to point at some other buffer; this is very similar to the store buffer we had before, and you can have an array of those, for instance. It tells you which entry to go play the store against the cache line with, so you can merge the returned data with the store data that we've stored locally.

So this is kind of fun. We can have memory coming back out of order, we can have memory being issued out of order, lots of fun toys here. One of the fun things we can do, because of the associative check here, is make sure subsequent requests to a line that's already in flight just merge into the pending miss. So we don't generate more memory traffic, or cause strange things to happen where you have responses coming back and new requests going out for the same line and you might lose some data in flight.

I should point out this is only one implementation. Lots of times, what people will do is put a tag field in the miss status handling register, because main memory will not necessarily keep track of the entire address; when you get a response back from main memory, it'll just carry a tag. So instead of checking against the block address, you might check against a smaller tag. That's one way people make this a little bit easier.
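Here's a hedged C sketch of the fill path just described, continuing the structures from the previous block. It assumes the tag returned by memory is simply the MSHR index (the optimization mentioned above), so the MSHR lookup is indexed, while the sweep over the load/store entries is the associative wake-up. `write_register`, `store_buffer`, and `install_line` are stand-ins for machinery the lecture only gestures at.

```c
/* Pending store data lives in a small store buffer; dest indexes it.  */
extern uint8_t store_buffer[NUM_LSES][8];
extern void write_register(uint8_t preg, const uint8_t *data, uint8_t size);
extern void install_line(uint64_t addr, const uint8_t *line);

/* Called when the line for MSHR index m comes back from main memory.  */
void handle_fill(int m, uint8_t line[LINE_BYTES])
{
    /* Associatively wake every load/store entry pointing at MSHR m.   */
    for (int i = 0; i < NUM_LSES; i++) {
        if (!lse[i].valid || lse[i].mshr_index != m)
            continue;
        if (lse[i].kind == LS_LOAD) {
            /* Return the right bytes of the line to the destination
               register and mark that register available again.        */
            write_register(lse[i].dest, &line[lse[i].offset], lse[i].size);
        } else {
            /* Merge the buffered store data into the background line. */
            for (int b = 0; b < lse[i].size; b++)
                line[lse[i].offset + b] = store_buffer[lse[i].dest][b];
        }
        lse[i].valid = false;               /* deallocate the entry    */
    }
    install_line(mshr[m].block_addr * LINE_BYTES, line);
    mshr[m].valid = false;                  /* free the MSHR           */
}
```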
Another optimization most people do: if you have a small number of entries, the tag might actually just be which entry you are in this table, so you don't even have to do the associative check on the response. You still need the associative check when future loads and stores that miss go out and check this table.

I think I've walked through most of this already. On a cache miss, you check the table for a matching block address. If it's found, you allocate a new load/store entry that points to the existing miss status handling register entry. If not, you have to allocate both a load/store entry and a miss status handling register entry. One thing I did want to point out here is that if you run out of miss status handling registers or load/store entries, that's not the end of the world. You can just stall the processor. Say you have eight of them and you run out of all eight; you have eight outstanding memory transactions, so you can just stall for a while. One of those main memory accesses is going to come back at some point, so you only stall for a little bit.

When memory returns, you need to find the load or store that's waiting for it. Going back to what Berchun said, it's very possible that the load or store that was waiting on it might have actually disappeared in that time period, at least for a load, because you might have had a write-after-write occur. That's okay; you still want to fill the line into your cache at that point. And of course you can have multiple loads and stores waiting. When the cache line has completely returned and you've finished checking against all the load and store entries, you deallocate both the load/store entries and the miss status handling register.

Okay, so a little bit of fun with in-order machines. You can sort of see how this logically fits into an out-of-order pipeline. If you want to fit it into an in-order pipeline, it's not too hard: you can actually add a scoreboard bit for each individual register. And when I say scoreboard, you're not tracking where the data is coming from; instead there's a special bit saying, this register is out to lunch, this register is out in the memory system; if you try to go access it, just stall. And it's a variable-latency sort of thing, so your scoreboard can't say "this will be ready in five cycles"; you don't know, it's out in the memory system. On a load miss, you mark the destination register as busy. When the data comes back, you mark it available and unstall the processor. But if no one actually went to use that register in the meantime, while it was out at main memory, the processor never stalls and no one's ever the wiser.

Okay, we're almost out of time, so I want to wrap this up. Non-blocking caches can effectively increase the bandwidth to your lower levels of cache, your L1s, because they can merge misses to your cache. Now, you probably would have gotten some of that anyway with a blocking cache, but the miss status handling registers basically allow multiple cache misses to merge into one transaction. And your miss penalty is obviously lower because, going back to this picture, we've overlapped the miss penalty with other useful work.
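As a closing sketch, here's roughly how that in-order scoreboard trick could look in C, under assumed names: one busy bit per register, set when a load miss goes out, cleared by the fill path, and checked (with a stall) only when an instruction actually reads the register.

```c
#include <stdbool.h>

#define NUM_REGS 32

/* One scoreboard bit per register: "this register is out to lunch".
 * No latency tracking -- the memory system is variable-latency, so we
 * can't say when the data will be back, only whether it's back yet.  */
static bool busy[NUM_REGS];

/* On a load miss: don't stall, just mark the destination busy.       */
void on_load_miss(int dest_reg) { busy[dest_reg] = true; }

/* From the fill path: data arrived, register is usable again.        */
void on_fill(int dest_reg) { busy[dest_reg] = false; }

/* In decode/operand read: stall only if a source is actually busy.
 * If nobody reads the register before the fill, nobody ever waits.   */
bool must_stall(int src_reg) { return busy[src_reg]; }
```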