So, a few things on our agenda for today: we're going to finish up our review of caches, and then we're going to start talking about in-order superscalars, which is something new. And I want to say this now, though I'll reiterate it when we get to it: this other book, the [unknown] book, is much more useful for the superscalar portion of our class than the Hennessy and Patterson book. I recommend either acquiring it, borrowing a friend's copy, or using the library's copy, and at least reading the two to three relevant chapters in there. But that's not until the second half of today's lecture. For the first half, we have to finish up caches. So, going back to review, let's talk about what caches can look like on the inside and a couple of different ways to classify them. We've already talked about why caches are good, so a little bit of review here: you have big, slow memories over here, and you have the processor, and you can think about how to build small, fast memories. If we can put a small, fast memory close to the processor, and have a high probability that the things we need are actually located in that small, fast memory, then we don't have to go out to the big, slow memory as often. We can save power, get better performance, and so on. Okay. So, how do we classify caches, and what does a cache look like on the inside? Processor, cache, main memory; addresses flow that way. Data can go in either direction, because you can either do a store, which pushes data out from left to right, or you can do a load, which brings data back the other way. So, let's look at our view of what a cache looks like on the inside. Caches contain copies of what is being stored in main memory. If you look inside here, let's say this first line holds address 100. So here we have address 100, and it has some data in there. Typically, you introduce this notion of a line in a cache because we want to reduce the tracking information for a particular cache line, sometimes called a block. It is possible to build something which tracks data on, let's say, a bit or byte basis, but then the tracking information is probably larger than the amount of data you're actually trying to store, so that's not really a great idea. In the cache here, what we store is an address tag. It's not the full address. It could be the full address, but then you're probably storing extra bits. You only need to store enough of the address to differentiate the different things that could be stored at the same location. So, in a fully associative cache, where you can put any data in any location, you have to store, for instance, the whole address in the tag. But if you have, say, a direct mapped cache, where each address can only go to one location, you only need to store a subset of the address bits. And I wanted to point out that sometimes people call this whole thing here a line. A line typically includes the tag information and possibly valid bits, and the block is the actual data that we're storing. Okay.
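To make those pieces concrete, here's a minimal sketch in C of one cache line as just described: a valid bit and a tag as the tracking information, plus the data block itself. The field names and the 64-byte block size are just illustration choices, not any particular machine's layout.

#include <stdint.h>

#define BLOCK_BYTES 64                /* data block size per line; 64 bytes is a common choice */

struct cache_line {
    int      valid;                   /* 1 if this line currently holds real data      */
    uint32_t tag;                     /* upper address bits identifying what is stored */
    uint8_t  block[BLOCK_BYTES];      /* the cached copy of the memory data            */
};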
So, this is really a review of what the cache algorithm looks like; let's go through it briefly. The processor issues a load, so we're going to look at the load case here. We take the address, we compare the subset of the bits that could be different among the different things that could be stored at that location in the cache, and we check them against the tag. One thing not shown here is that we also need to check that the line is valid, because it is possible to have an invalid cache line. There's typically one bit which says whether the cache line is valid or not, and you have to make sure it is valid. If so, you get a cache hit and just return the data (there's a small code sketch of just this check after this paragraph). If not, you go out to main memory. Ooh, this is a tricky one here: you have to find someplace to put that data, because this new data comes in, and if you want to store it in your cache, there's probably something already in the cache at that same location that just gets bumped out. We're going to talk briefly about replacement policies, and later, in the advanced caches lecture, we'll talk more about that and other cache policies, including much more advanced replacement policies. But you kick something out, you fill in the new data, and then finally you return the data. You could also think about doing this step and this step in parallel, effectively, and a lot of more advanced cache systems will actually return the data before they finish the eviction, to reduce the latency of the load. Okay, so let's do some classification, a little taxonomy of what a cache looks like. What could we talk about? We could talk about block placement: the locations where a particular block, or a particular address, can be found in the cache. Block identification: how to actually find something in the cache given an address, which is kind of the inverse of the placement question. Block replacement: figuring out what should be replaced on a miss. And then finally, what happens on a write. The first ones are mostly read questions; well, block replacement is not really a read question, but the write strategy is a good question of what happens when a write occurs: do we update all the caches, only the main memory, do we evict just out of the cache? All of these are possible strategies to choose. Okay, so let's look at a block placement figure here and look at a couple of different types of caches. Up here, we have a 32-word memory; let's say this is all of the memory in the machine. And depending on how you want to look at it, each block might be one byte, or it could be some number of bytes. To give an example, something like your Core 2 Duo, I believe, has a 64-byte cache line, or cache block size. Some people have pushed that up, and there are reasons to make it larger or smaller, but we'll talk about that in a second. So, let's take this highlighted location here, block number twelve, and let's see where it can fit into different types of caches. We have three different types of caches here; let's start on the right side. On the right side, we have what's called a direct mapped cache. The direct mapped cache is literally an array, and each block can fit in only one place in the cache.
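Here's a minimal sketch, in C, of just that hit check; the function and parameter names are only for illustration, and the fill and eviction path on a miss is not shown.

#include <stdint.h>

/* A load hits only if the selected line is valid AND the tag stored in
 * the line matches the tag bits of the requested address. */
static int is_hit(int valid, uint32_t stored_tag, uint32_t addr_tag) {
    return valid && (stored_tag == addr_tag);
}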
For the direct mapped cache, the indexing function you use is: you take the block number mod the number of lines in the cache. So, in this case, we have block number twelve, or byte number twelve here; we take twelve mod eight, and that says block four. So, this is the only place to go look for the data. Okay, that's nice and simple. It may not give the best performance, but it's easy to build; it's all index-based. Then, one step up from there, we can start to look at things like set associativity. What set associative means is that instead of having only one unique location where a block can be placed, you allow it to be in some number of locations. In this drawing here, there's a two-way set associative cache, so each block can fit in one of two places. Now, you can actually have machines with much higher associativity. An example of this is the modern Core i7 processors: their L1 caches are 8-way set associative, and farther out they're even more associative; I think they have a 16-way set associative cache in their L3, or farther-out, caches. So, let's look through the basic example here. We want twelve mod four, so how do we do this? Well, there are only four buckets, if you will, and inside a bucket the block can fit into either of the two sub-slots, which are called ways. So, we take twelve mod four, and twelve mod four is zero, so it has to fall into this bucket here. It can be in either one of those two ways, but it can't be in both; that's an important point. Finally, there are caches which we call fully associative caches. What this means is that the associativity, or the number of ways, is equal to the number of elements in the cache. So here, there are eight block locations in this cache, and it's fully associative, which means, in effect, that all eight locations form a single set with eight ways. So, this same address, block number twelve, can fit anywhere in this cache. (There's a tiny sketch of these mod computations right after this paragraph.) Okay. So now, the next question is: how do we actually find the block in the cache? You're going to take an address, and you want to figure out where it goes in the cache. Let's take a look at the example of a direct mapped cache first. Here's our information: we have the data block, then we have some tag information, and we have a valid bit. If we look at the address that we're sticking in here, let's say it's a 32-bit address. Some portion of that address is just going to determine where in the data block the load or store is going. So, if you have, let's say, 64-byte cache lines, or 64-byte data blocks, you're basically going to index into some sub-word in there based off the offset bits. Then, we start to look at the block address. The traditional way to do this is to use these middle-order bits as the index. There are more complicated cache systems which move this around or use much more complicated hashing functions, but the basic hash function here is that we take the index. So, in this cache here, which has four entries, how many bits are in the index? Two, yes, okay. Two to the two is four.
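And here's a tiny C program, just as an illustration of the arithmetic above, computing which set block twelve maps to in each of the three organizations: 8 sets of 1 way (direct mapped), 4 sets of 2 ways (two-way set associative), and 1 set of 8 ways (fully associative).

#include <stdio.h>

int main(void) {
    unsigned block = 12;
    /* set index = block number mod number of sets */
    printf("direct mapped, 8 sets of 1 way:      set %u\n", block % 8);  /* 4 */
    printf("2-way set associative, 4 sets:       set %u\n", block % 4);  /* 0 */
    printf("fully associative, 1 set of 8 ways:  set %u\n", block % 1);  /* 0 */
    return 0;
}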
So, this index field is in here and chooses one of the four locations. And then finally, this tag information here is compared against the tag that we store, and that tells us whether this address is actually in the cache right now or whether we have to go out to main memory. So, let's work a brief example. We have 32-bit addresses. We have a four-set direct mapped cache, and we have 64 bytes in our block size. How many bits are in the offset? Let's start with that. The offset has to index a byte within the block, so if we have byte-addressable memory and there are 64 different byte positions, log base two of 64 is six. Okay, so we have six bits of offset. What did we say about the index? We said it was two bits. Okay, so that's eight bits so far. So, how many bits of tag do we have? The tag information is 24 bits. (The bit-field arithmetic is worked out in a short sketch after this paragraph.) And as you can see, as you make the data block size larger, or the cache line size larger, you need fewer tag bits, because the offset eats into the address from that side and the index gets shifted up. So, that can be a good thing. Finally, there's this V column here. This is what I was alluding to before; it's really important. You need a valid bit to see whether the line is valid. It's very possible you could have old data, or the cache might not be initialized, at which point these V bits would all be zero, or not valid. Okay, so let's do a slightly more advanced example. We have a two-way set associative cache, still with four lines, and we want to go find our data. Hm. How do we go about doing that? What's interesting here is that, when we did this before, we only had to check one tag location. For a cache like this, the index tells us which of the two sets it's in, but it can be in either of the two ways of that set. So, we actually need to do a tag check against this tag and that tag, and check the two valid bits. Many caches do this in parallel, but that's how you figure it out, and it's possible that it's in neither of those two, in which case we just say it's a cache miss. Okay. So, let's talk about when you need to take things out of the cache; let's talk about block replacement. We need to figure out what to kick out of the cache when you take a cache miss and bring some new data in. An important point here is that in a direct mapped cache, this question makes no sense, because the block can only go to one location. In set associative or fully associative caches, we need to make a decision. And hopefully, we only need to make this decision when a set becomes full, because otherwise we'll just choose the empty location in that set. So, if you have a two-way set associative cache, and one of the ways is full of data and the other one is just empty, its valid bit not set, we shouldn't even be looking at different choices of what to replace, because there's an open spot to put the data, and we should put the data there. Okay. So, what are some good block replacement policies, or good cache replacement policies? The first one is random. You can just choose randomly out of a hat. Actually, it doesn't work too badly.
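Here's the bit-field arithmetic from the example above as a small C program; the example address is arbitrary, purely for illustration.

#include <stdint.h>
#include <stdio.h>

/* The worked example: 32-bit addresses, 64-byte blocks, 4 sets.
 * offset = log2(64) = 6 bits, index = log2(4) = 2 bits, tag = 32 - 6 - 2 = 24 bits. */
#define OFFSET_BITS 6
#define INDEX_BITS  2
#define TAG_BITS    (32 - OFFSET_BITS - INDEX_BITS)

int main(void) {
    uint32_t addr   = 0x12345678u;                               /* arbitrary example address */
    uint32_t offset =  addr                 & ((1u << OFFSET_BITS) - 1);
    uint32_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS)  - 1);
    uint32_t tag    =  addr >> (OFFSET_BITS + INDEX_BITS);
    printf("tag=0x%06x index=%u offset=%u (tag is %d bits)\n",
           (unsigned)tag, (unsigned)index, (unsigned)offset, TAG_BITS);
    return 0;
}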
Random is easy to implement: you just have some linear feedback shift register, or you just pick some random number, and you replace that way. Not so bad. Well, now if we put our thinking caps on, we can start thinking about some better ways to do this. We can think about least recently used. The idea behind this strategy is that you are trying to preserve temporal locality. If you've used some data recently, the heuristic is that it's likely to be used again in the not-too-distant future. If you look back at that time-versus-address plot we had, along the time axis we think temporal locality happens relatively often. So, one really good strategy is to pick the least recently used block as the block to get rid of: if something has not been used recently, we kick it out of the cache. Now, a problem with this is that it gets really hard to do. Well, there are two problems. One, to implement true least recently used, every single load or store to the cache needs to update the tracking information. That's not great for power, and it's also not great because you basically need very fine-grained tracking information on each cache line, updated on every single cycle that does a load or a store. So, you have some cache metadata that needs to be updated pretty often. Plenty of people do that, though: if you have a least recently used bit, an extra bit on each cache line, you can think about just flipping it back and forth, having one bit of information that every single load and store sets, and lots of processors do that (a one-bit sketch of this appears after this paragraph). What starts to get hard is when you try to go above 2-way caches. For 2-way, you have one bit that can record which way is the least recently used. If you have, let's say, an 8-way set associative cache and you want to do true least recently used, you've got to somehow keep something like timestamps for each way. You have to say, oh, well, this one was just recently accessed, and you can timestamp it, or you can try to keep a list of way numbers reordered by recency. These are the things that start to get harder to do in hardware, especially on the critical path of your processor. So, a lot of times what people do is pseudo least recently used for higher-associativity caches, where you keep maybe accurate information about the two most recently used lines, and among the others you choose randomly, or something like that. Or there are some more complex things which we'll talk about later in the class. Another one: first in, first out. Whatever gets brought into the cache first sits in the cache for some period of time and then gets kicked out. So, you're sort of guaranteed that a piece of data you bring in will sit in your cache for some period of time, and this is a lot easier to implement in highly associative caches. But it's certainly not implementing least recently used, because if you bring a cache line into, let's say, a four-way set associative cache, then four fills of that set later it's at the top of the list to get kicked out.
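Here's a minimal sketch of that one-bit, 2-way LRU trick; the struct and function names are mine, for illustration only.

/* Per-set LRU state for a 2-way set associative cache: a single bit
 * records which way was touched last. The other way is, by definition,
 * the least recently used one and is the victim on a miss. */
struct lru2 {
    unsigned char last_used_way;      /* 0 or 1 */
};

/* Called on every load or store hit (and on every fill). */
static void lru2_touch(struct lru2 *s, int way) {
    s->last_used_way = (unsigned char)way;
}

/* Which way to evict when the set is full. */
static int lru2_victim(const struct lru2 *s) {
    return 1 - s->last_used_way;
}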
But what's nice about FIFO is that you're guaranteed the line will sit in your cache for some period of time and won't just get randomly kicked out, which is a problem random has. Finally, here's another acronym: not most recently used. Let's think about that one for a second. We're going to kick out something that is not the most recently used block. So, if we have a four-way cache, we can think of having two bits per set that record which line was the most recently used, and among the rest of them we pick more or less at random. That way we know we are never kicking out the most recently used line. So, write strategies: you're going to do a store, and you've got to decide where to put the data. Important, important thing here. Caches have lots of different allocation policies, and policies about whether you keep the data or throw out the data at the different levels of the cache hierarchy when a store, or a write, happens. So, let's take a look at a couple of different choices. Let's say you are doing a store and you have a cache hit, so it hits in your cache. Should you put that in main memory also? If you do a store, and you put every single store into your cache and into main memory even when it's already in your cache, then, as was pointed out, it uses bandwidth out to the main memory. That can be a downside. There are some positives to this, though. When you go to evict the line, you don't have to write anything back to main memory; you've sort of pre-put the data in main memory. So, that's a positive. You could also think about a multilevel cache hierarchy, where at some levels it may make sense to put the data in the next level out if you have enough bandwidth. So, this is a trade-off. As I said, computer architecture is all about trade-offs; neither of these is simply correct. The first one is called write through. Write through means that if you get a cache hit, you put the data in your cache and you also put it in memory. A write back cache means you just put it in your local cache, and when that block gets evicted and it has dirty data, you have to put it back in main memory. (A sketch of both policies appears after this paragraph.) So, those are the downsides and upsides there. One thing you do need to do if you have write back, as we say here, is keep track of whether the data is dirty or not. If you go look at more complicated cache coherence protocols, sometimes you'll actually favor things like write through, because write back gets really complicated when you have data that has to be invalidated and sent out to other places. But if you know the data is clean, i.e., there is no dirty data, you never have to do a write back when an invalidation comes in, in a multiprocessor scenario. So, it's a tough trade-off, but generally write back is considered more advanced and saves bandwidth. Still, there are reasons you might want to do one or the other, because write through is definitely simpler to design. Okay. And, oh, I do want to point out that you may not actually go all the way out to main memory. Plenty of architectures have small L1 caches, for instance, where you don't necessarily want to deal with invalidations coming in and complicated things happening.
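Here's a sketch of the two write-hit policies side by side. The helper functions are assumed stand-ins for the real data paths, not any particular machine's interface, and the line picks up the dirty bit that write back needs.

#include <stdint.h>

/* A cache line with the extra dirty bit that write back requires. */
struct wb_line { int valid, dirty; uint32_t tag; uint8_t block[64]; };

/* Assumed stand-ins for the real data paths. */
void write_block(struct wb_line *line, uint32_t addr, uint32_t data);
void write_memory(uint32_t addr, uint32_t data);

/* Write through: on a store hit, update the cache and main memory.
 * Costs memory bandwidth on every store, but evictions are free. */
void store_hit_write_through(struct wb_line *line, uint32_t addr, uint32_t data) {
    write_block(line, addr, data);
    write_memory(addr, data);
}

/* Write back: on a store hit, update only the cache and mark the line
 * dirty; memory is updated later, when the dirty line is evicted. */
void store_hit_write_back(struct wb_line *line, uint32_t addr, uint32_t data) {
    write_block(line, addr, data);
    line->dirty = 1;
}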
In that case, you might do write through from your L1 to your L2, but then you want to save, let's say, off-chip bandwidth, so you write back out of your L2 to the DRAM. That's a pretty common thing to do. And as I was saying, the real motivation for this is that if you have a multiprocessor system and you get invalidation requests, so another processor wants that same cache line, and the data is dirty here but not in the DRAM, you'd actually have to reach down into the L1 cache if you have write back out of the L1, and that gets pretty complicated. It gets especially complicated if your L1 cache is very tightly integrated into your CPU, because it's in your main pipeline and you have to somehow bubble the pipe, or take a structural hazard, for any invalidation that comes into that L1 cache. So typically, people will consider write through caches at least from the L1 to the L2. But you definitely want to save bandwidth from the L2 out to DRAM, so you don't write through every time there. Okay, let's take a look at what happens on a cache miss. On a cache miss, we're doing a store, or a write to memory. Do we need to put it in our cache at all? Maybe yes, maybe no; there's no requirement that you put it in your cache. A lot of times, and this comes down to heuristics, you'll do a store and just never read that data again. So, some architectures actually have store instructions with locality hints, or temporal locality hints, saying: I'm doing this store, I'm doing this write right now, and I'm not going to read it again for a very long period of time, so don't put it in my cache. Some designs also just decide not to do write allocate. Let me define these two things. No write allocate means you do the store, or the write, without pulling the data into your cache to do the merging locally, and you don't store any data into your cache. You don't have to; you're not forced to keep a local copy. You can always send it out to your canonical main memory. Write allocate says: go fetch the block from memory, merge the store into it in the cache, because usually the block size is larger than the amount of data you're writing, and then either put it out to main memory if you're write through, or just keep it in the cache if you're write back. (A sketch of both appears after this paragraph.) So, you can think about having these two policies, and they both have their places. Typically, people try to write allocate unless they have some locality information, because data, let's say at the top of your stack, in your software stack, is something you're going to be writing and reading relatively often, so it makes sense to allocate it. The other thing I did want to point out is that there's a heuristic going on with no write allocate: if you later read that block, it will get into your cache then. So, maybe the first store you do to the address won't allocate, but the next load you do will. So, you're not going to lose that much by doing no write allocate; it's not like you've lost everything forever.
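And here's a sketch of the two write-miss policies just defined, again with assumed helper functions standing in for the memory and merge data paths, not a real API.

#include <stdint.h>

struct wb_line;  /* the sketch line type from the previous snippet */

/* Assumed stand-ins for the real data paths. */
void fetch_block_from_memory(struct wb_line *line, uint32_t addr);
void merge_store_into_block(struct wb_line *line, uint32_t addr, uint32_t data);
void send_store_to_memory(uint32_t addr, uint32_t data);

/* Write allocate: on a store miss, fetch the whole block, merge the
 * store into it (the block is bigger than the data being written),
 * and keep the line in the cache. */
void store_miss_write_allocate(struct wb_line *line, uint32_t addr, uint32_t data) {
    fetch_block_from_memory(line, addr);
    merge_store_into_block(line, addr, data);
}

/* No write allocate: just send the store out to memory; the line only
 * comes into the cache if a later load touches it. */
void store_miss_no_write_allocate(uint32_t addr, uint32_t data) {
    send_store_to_memory(addr, data);
}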
The only cost is that the first write has to go out to the memory system; but the next load I do from, let's say, my stack, or some local variable, will pull that line in anyway. Okay. So, some common combinations here. Write back with write allocate: this is the full-blown solution that a lot of people do, kind of the baseline high-performance one. Write through with no write allocate is also possible; this is the cheap one to build, since you don't need the logic to merge and allocate lines on a write miss. But all possible mixes are valid.