Okay. So let's start talking about performance, because the whole reason you build a cache is to get lower power and higher performance. Let's go back to the iron law. What is the cache trying to do? Well, if you look at that iron law of processor performance, when you do a load or a store, the cache is trying to decrease the clocks per instruction needed to process that load or store. If you have to go all the way out to main memory on a cache miss, or if you don't have a cache at all, it's going to take a long time. But if you shrink the cost per instruction of a load or a store, everything gets faster.

As this diagram shows, we have some loop that does some loads and some adds, and here it takes a cache miss. If we can somehow do things to the cache that reduce the probability of a cache miss, we can shrink the amount of time it takes, and the whole program will run faster. So reducing the number of cache misses is good: we use the cache to keep the data local. You can also see in this diagram that the first load hits in the cache, so it doesn't have to go out to main memory; if it's a properly pipelined cache, it just returns the data for that load. So the diagram is showing two things: processor to cache and back is a hit in the cache, and a miss just takes more cycles.

Okay, this is an important slide because it gives us some important ways to think about caches. We want to categorize the types of misses a cache can take. This is why we end up with different heuristics inside caches, based on the different policies and the different ways you can actually take a cache miss. A lot of times people call this the three Cs of caches.

The first C is a compulsory cache miss. That's the first reference to a block: you're going to take that cache miss even if you have an infinite-size cache, because you can't get things into the cache unless you try to access them for the first time ever. That's what a compulsory cache miss is, that first reference. You can try really hard to reduce compulsory misses: you could have a prefetcher, for example, that thinks ahead and says, I think I'm going to access this data sometime in the future, I should go get it. Then when you actually access it, you won't take a cache miss. So that's a possibility.

Okay, the second C that contributes to cache performance is the capacity of the cache. A larger cache will be able to fit more data; that's always true. And traditionally, a larger cache will have a lower miss rate. So let's think about that, because this is an important question: will a larger cache always have a lower miss rate than any smaller cache? Let's think about what kicks data out. Say you have a cache that is eight lines and a cache that is sixteen lines. By definition, the addresses that alias in the bigger cache are also going to alias in the smaller cache when you go from, say, a sixteen-entry cache down to an eight-entry cache. So as long as you're going up by factors of two, your miss rate with the larger cache is never going to be worse, as the little sketch below shows.
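Just to make that aliasing claim concrete, here is a tiny sketch. The 64-byte block size, the address range, and the 8-set and 16-set cache sizes are all illustrative assumptions; the point is only that any pair of addresses that collides in the bigger direct-mapped cache also collides in the smaller one.

```python
import random

BLOCK = 64  # assumed block size in bytes

def index(addr, num_sets):
    # Direct-mapped index: low-order bits of the block address.
    return (addr // BLOCK) % num_sets

# Any pair of addresses that maps to the same set in the 16-set cache
# necessarily maps to the same set in the 8-set cache as well, because
# (x mod 16 == y mod 16) implies (x mod 8 == y mod 8).
for _ in range(100_000):
    a = random.randrange(1 << 20)
    b = random.randrange(1 << 20)
    if index(a, 16) == index(b, 16):
        assert index(a, 8) == index(b, 8)

print("every collision in the 16-set cache is also a collision in the 8-set cache")
```

So the set of conflicts in the power-of-two larger cache is a subset of the conflicts in the smaller one, which is why the larger cache can't miss more often.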
In other words, the miss rate with the larger cache will never be higher. The one place this could come up is if you change the hashing function. Say you have a cache that is two-thirds the size: one cache is eight entries and the other is twelve entries. There, the locations that alias are going to be different, so you can imagine patterns that behave differently, but it's likely the larger cache will still be better. So large is good. That's assuming something like a direct-mapped cache; depending on your LRU strategy, or your replacement policy in general, there are some other caveats you have to think a little harder about.

Okay, so finally, what's the last thing that can cause a cache miss to occur? A conflict in the cache. This means we don't have high enough associativity, and two different pieces of data, two different blocks, alias to the same location and fight for that location. There's a conflict, and one of the pieces of data gets kicked out before you have time to go read it back. So you have a load, and then another load to that address later, but in the middle you have a load to a different address which happens to alias to the same location in the cache. The hash function points to the same location, and they fight for that resource. That's a conflict miss.

Okay, so let's put some data behind this and look at some ways to make caches go faster. Let's look at what we're plotting on this graph. On the x-axis we have different cache sizes: sixteen kilobytes, 32 kilobytes, 64, going up by factors of two all the way to one megabyte. On the y-axis we have access time, how long it takes to access the cache. If we go back to the average memory latency equation, we take the miss rate times the miss penalty, plus the hit time, which is how long it takes to actually look something up and find it in our cache in the first place. Well, if we can reduce the hit time, that's good. Take a cache which takes, say, two nanoseconds to access; on a one-gigahertz machine, that's two clock cycles. If we can somehow replace it with a cache that takes half a nanosecond to access, that's good: it means we can fit the access in one clock cycle on a one-gigahertz machine. In fact, we can probably go all the way up to about one nanosecond on a one-gigahertz machine before we spill over into the next clock cycle. So you can think about it that way.

So the first thing is that small and simple caches can actually be good. Everyone thinks capacity, capacity, capacity, but in your processor core, if you can reduce the hit time, that's a good thing. Just to give you some idea, as you scale back to smaller caches, the time it takes to access the cache goes down.
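To write that average memory access time equation down, it's

$$
\text{AMAT} \;=\; \text{hit time} \;+\; \text{miss rate} \times \text{miss penalty}.
$$

Putting some numbers on the hit-time example above (the miss rate and miss penalty here are made-up illustrative values, not from the slide): with an assumed 5% miss rate and a 100-cycle miss penalty, cutting the hit time from 2 cycles (a 2 ns cache at 1 GHz) to 1 cycle (anything up to about 1 ns) takes the AMAT from $2 + 0.05 \times 100 = 7$ cycles down to $1 + 0.05 \times 100 = 6$ cycles, and, just as importantly, lets a hit fit in a single clock cycle.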
So the next thing we do is try to reduce the miss rate, and there are a couple of different ways to do that. These are just some basic optimizations; later in the class we'll go through a lot more optimizations for caches. One thing you can do is look at the block size. In the examples I've been giving, we've been talking about a 64-byte block size, but you could think about having either a smaller block size or a larger block size, and in fact this happens. In this graph here, 64 bytes looks like roughly the lowest miss rate, a good spot, but this is really dependent on your application space. If you have applications that just stream through memory, you probably want a bigger block size. If you have more random patterns and you're not getting any reuse, it might make sense to have a smaller block size. This data is from your textbook, actually; it's plotting miss rates against block size, and each of these lines is a different cache size: a four-kilobyte cache, a sixteen-kilobyte cache, a 64-kilobyte cache, and a 256-kilobyte cache. You can plot those two things against each other and see where the sweet spot in the curve is. We're trying to minimize the miss rate, how often we actually take that cache miss.

Okay, so there are some positive things about having larger block sizes; let's talk about those. If we have a larger block size, we need less tag overhead. We talked about that already in the tag-lookup slide. With longer blocks you can also think about doing more burst transfers from your DRAM. DRAM typically likes to give you large chunks of data at a time rather than little chunks, because there's an overhead cost in firing up the DRAM: there's what's called the column address strobe and the row address strobe, the RAS and CAS timing, and you pay that overhead each time you go to memory. So if you can pull in larger chunks of memory at a time, you only pay that once for the large amount of data you bring in. That pushes you toward larger block sizes. You could even think about similar ideas for on-chip buses: with larger block sizes you'll probably use that on-chip bus more effectively, because there are some overheads and turnarounds, usually for bus arbitration.

Okay, on the right side we have the downsides of larger block sizes. With a larger block size, you might be pulling in data you're not using. If I have a 256-byte cache block versus a 64-byte block, we're pulling in four times as much data, and if we're only trying to access, say, one byte of that data, we just wasted a lot of main-memory bandwidth to pull that byte in. We need to be cognizant of that, and that's why this curve doesn't just keep going in one direction or the other. When I first took a computer architecture class, I thought, oh, as you increase the block size, shouldn't performance go up, or shouldn't the cache miss rate go down in this graph? It's not true, because at some point you start to waste bandwidth. Also, if you have a larger block size, by definition you have fewer blocks. If we have 256-byte blocks versus 64-byte blocks, that's four times fewer blocks in the cache for the same amount of data. If we have a four-kilobyte cache, we still have the same amount of data, it's still a four-kilobyte cache, but there are fewer blocks in it, so you're not going to be able to hold as much random data in your cache at one time. So this is one technique to reduce the miss rate; there's a quick back-of-the-envelope sketch of the tradeoff below.
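Here is a back-of-the-envelope sketch of that "fewer blocks, less tag overhead" point, assuming a 4 KB direct-mapped cache and 32-bit addresses (both just illustrative choices):

```python
def cache_geometry(cache_bytes, block_bytes, addr_bits=32):
    """Return (number of blocks, tag bits per block) for a direct-mapped cache."""
    num_blocks = cache_bytes // block_bytes
    offset_bits = block_bytes.bit_length() - 1   # log2(block size)
    index_bits = num_blocks.bit_length() - 1     # log2(number of blocks)
    tag_bits = addr_bits - index_bits - offset_bits
    return num_blocks, tag_bits

# Same 4 KB of data storage, three different block sizes.
for block in (16, 64, 256):
    blocks, tag = cache_geometry(4 * 1024, block)
    print(f"{block:>3}-byte blocks: {blocks:>3} blocks, "
          f"{blocks * tag} total tag bits of overhead")
```

Bigger blocks cut the total tag storage, but, as the graph shows, the miss rate eventually climbs back up because the cache holds fewer distinct blocks and you start wasting bandwidth on data you never touch.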
Another way, and in a perfect world this fights against small and simple caches, is to just build big caches. If you build big caches, then when you go to access the data, there's a very high probability the data is close to you. That sounds good. So here we have miss rate plotted against cache size, and of course the different associativities, the different types of caches. There's one thing I wanted to point out here: an empirical rule of thumb that if you double your cache size, your miss rate usually drops by about root two. Sometimes people call this the square-root rule. How do we derive this? Well, sorry [inaudible] you guys, it says right here that it's an empirical rule of thumb; it is just a rule of thumb. Obviously, if it were perfectly true, this line would be nicely curved instead of having some bumps in it. But it actually works out surprisingly well as a rule of thumb. It doesn't work very well for very small caches, so typically right in here it doesn't hold; possibly it also breaks down for very large caches; and for high associativity, this rule of thumb starts to break down too. But it's a good rule of thumb to think about.

Okay, so how else do we reduce the miss rate? Same graph, also from Hennessy and Patterson, your book. You can increase the associativity. You can take a four-way cache and turn it into an eight-way cache. There's some power cost associated with that, and typically a clock-cycle cost as well. So let's look at the rule of thumb here, which basically says a direct-mapped cache of size N has about the same miss rate as a two-way set-associative cache of size N over two. That sounds crazy good; how is that possible? It would say we should always build at least two-way set-associative caches. But let's look at this graph and see if it actually works. We're going to look at a point: a sixteen-kilobyte cache that is two-way set associative, versus a 32-kilobyte cache that is direct mapped. What you're trying to see is whether this point equals that point. Well, it's not equal, but it's actually not a horrible approximation. We're saying that if we take a sixteen-kilobyte cache with higher associativity, our miss rate goes down, right? So it's going to be somewhere in here, versus the 32-kilobyte direct-mapped cache, which is this point here. Okay, it doesn't quite hold there, so let's go look at another point: a 32-kilobyte two-way set-associative cache, which is that point there, versus a 64-kilobyte cache that is direct mapped. Those are almost on a straight line with each other, so it's almost exactly equal. This is just an empirical rule of thumb that people have figured out: as you double your associativity, you can almost halve your cache size and still have the same miss rate. And likewise, as I said, this is just empirical; there's no reason why it has to be this way, and it's really dependent on your data access patterns. But I found that pretty interesting. Both rules of thumb are written out below.
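Written compactly, the two empirical rules of thumb from this slide are, with $m(\cdot)$ standing for miss rate:

$$
m(2C) \;\approx\; \frac{m(C)}{\sqrt{2}}
\qquad\text{and}\qquad
m_{\text{direct-mapped}}(N) \;\approx\; m_{\text{2-way}}\!\left(\tfrac{N}{2}\right).
$$

For example, applying the square-root rule twice, quadrupling a cache from 16 KB to 64 KB would be expected to roughly halve the miss rate; again, that's a rough empirical trend, not a derived result.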
Okay, so what's the problem with always building a two-way set-associative cache? Why don't we just do that? Well, the area for the data store shouldn't actually change very much. What changes is your tag store, the tag data: for the higher-associativity cache you're going to need more tag-check logic at least, because you'll have to check, say, two tags in parallel. There's a little sketch of that lookup below.
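Here is a minimal sketch of a two-way lookup, just to show where the extra comparison comes from. The geometry, 8 sets of two 64-byte blocks, is an arbitrary assumption, not the cache from the lecture:

```python
BLOCK, SETS, WAYS = 64, 8, 2            # assumed geometry: 8 sets x 2 ways x 64-byte blocks

# Tag store: one tag entry per way in every set.
tag_store = [[None] * WAYS for _ in range(SETS)]

def lookup(addr):
    block = addr // BLOCK               # block address
    set_idx = block % SETS              # which set the block maps to
    tag = block // SETS                 # remaining bits form the tag
    # Hardware does these WAYS comparisons in parallel, so a two-way cache needs
    # two comparators (plus a way-select mux) where a direct-mapped cache of the
    # same total size needs only one.
    return any(tag_store[set_idx][way] == tag for way in range(WAYS))

def fill(addr, way):
    block = addr // BLOCK
    tag_store[block % SETS][way] = block // SETS

fill(0x1040, 0)                         # bring one block in...
print(lookup(0x1040), lookup(0x2040))   # ...hit on it, miss on an address that aliases to the same set
```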