So the next memory optimization, or cache optimization, we are going to look at is adding multi-level caches. And you probably see this in just about any modern-day processor: there are level one caches, level two caches, maybe level three caches, maybe even level four caches. So why might this be a good idea, and what is the effect of this on the different parameters we've been looking at throughout this lecture? So what's the basic idea here? Well, the basic idea is you have a CPU, and instead of just having one cache we say let's add two caches. And why do we want to do this? Well, it comes from the insight that it is difficult to have a cache that is both very large and very fast. So how do you solve this problem? Well, you can think about trying to fill this entire room with RAM. That would be a very large, let's say, cache for something. And if you think about it, systems really do have notional caches like this. If you go look at something like an internet-scale data center, there will be basically a room full of mostly just RAM, and they'll use this to cache things like lookups. For example, you might see something like memcached, which is a key-value store that typically caches queries to databases, and which people keep in large amounts of RAM. The problem with having a very large cache is, by definition, if it's large and you don't want to violate the laws of physics, to get to the farthest extent of the cache you might have to go very, very far. And if you run a wire very far, it's by definition going to be slow. You can't violate the laws of physics here: the fastest a signal can travel is the speed of light, so the farther you go, the distance becomes a factor. So our problem here is that you can't have memory that's both large and fast. So what can we do? Well, we can add multiple levels of cache. So if you have a certain-sized working set, you can try to store that, let's say, in a local cache which is both small and fast. And then you can have different levels of cache here, say a level two cache which has a larger capacity, but is a little bit slower and a little bit farther away from your CPU. And this can mitigate the cost of having to go all the way out to DRAM. Okay, so I wanted to introduce some nomenclature here, because this is an important thing you're going to see over and over again with caches, and it's important that we introduce it now when we start to talk about multi-level caches. Because just because you see, let's say, a low cache miss rate does not mean a cache is performing well, especially when you get into the multi-level cache domain. And why is that? Well, let's say we have this level two cache here, and I say it has a very low miss rate. That sounds good, but this level one cache is filtering accesses to it. So just because level two has a low miss rate doesn't necessarily mean that the level two cache is performing well. Or vice versa: let's say the level two cache has a very high miss rate. You might say that level two cache is doing a really bad job. Well, to some extent all of the easy accesses are being filtered out by the level one cache, so all of a sudden the miss rate out of your level two cache might look very different. So we need to come up with some way to discuss these different misses with correct nomenclature. So I'm going to introduce three notions here.
The first is something we're going to call the local cache miss rate. And this is just going to be the number of misses in a cache divided by the number of accesses to that cache. This is local per cache, so if you were to look at, let's say, a level two cache here, you have a certain number of accesses coming into the level two cache and a certain number of misses coming out of it; it is all local to the level two cache. So this is going to give you the actual miss rate of a particular cache. Now, that's not the same thing as something like a global miss rate. The global cache miss rate takes the number of misses in the cache relative to the number of CPU memory accesses. And this might be a better metric for a multi-level cache, because you might want to say, in our level two cache here, what is our miss rate out of the last level of cache, or out of our L2 cache, relative to the number of actual accesses that the CPU is doing? And this helps you encapsulate the level one and level two together. So that might be a better metric, and you'll see your book sometimes use these two different metrics in different ways. And then finally, something that's useful to think about is the number of misses per instruction. Now, why do we bring this up? It sounds pretty similar to something like a global miss rate. Well, the difference here is that the denominator changes. Instead of dividing by accesses to a particular cache or by CPU memory accesses, we divide by the number of instructions. And why is this useful? Well, misses per instruction takes out of the equation how many loads and stores you have as a percentage of instructions in your program. So, if you have, let's say, a program which has very few loads and stores but has a relatively high miss rate per access, you might say this is bad for performance if you just look at the cache miss rate, like the local cache miss rate. But in reality, it may not be so bad for performance, because if you rarely do a load or a store in that particular program, maybe it doesn't actually affect your performance very much. So this last metric, misses per instruction, encapsulates that. It encapsulates both the percentage of instructions that are loads and stores and the number of cache misses in the program. And this can either be local or global, but usually it is considered to be sort of global-ish. It can be either, though; you could also use this metric for a particular cache. So we can say this cache misses once every 1,000 instructions. That would be pretty good, depending on the number of loads and stores per set of instructions. So let's say you have a load or a store once every, I don't know, five instructions, so maybe 20%, or a little bit less than 20%, of the instructions are loads or stores, which might be typical for a typical program on a typical processor. Then you can make some statement about how the cache is performing in aggregate. And you'll see that a lot of the numbers in the Patterson-Hennessy book actually use misses per 1,000 instructions, which is typically a more useful metric. Okay, so let's take a look at how adding a level two cache influences the design of a level one cache. And this is pretty common: when you add multiple levels to your cache hierarchy, it actually influences the lower levels of your cache hierarchy. So how does adding a level two cache potentially influence the level one design?
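To make these three metrics concrete, here is a minimal sketch. The counts are made-up numbers, not figures from the lecture; they just show how the three metrics relate and why an L2's local miss rate can look bad while its global miss rate looks fine.

```python
# Minimal sketch of the three miss-rate metrics, using made-up example counts.
instructions     = 1_000_000   # total instructions executed
cpu_mem_accesses = 200_000     # loads + stores issued by the CPU (~20% of instructions)
l1_misses        = 10_000      # misses out of the L1 (these become L2 accesses)
l2_misses        = 2_000       # misses out of the L2

# Local miss rate: misses in a cache / accesses to THAT cache.
l1_local = l1_misses / cpu_mem_accesses   # L1 sees every CPU memory access
l2_local = l2_misses / l1_misses          # L2 only sees what the L1 missed on

# Global miss rate: misses in a cache / total CPU memory accesses.
l2_global = l2_misses / cpu_mem_accesses

# Misses per instruction (often quoted per 1,000 instructions).
l2_mpki = 1000 * l2_misses / instructions

print(f"L1 local miss rate:  {l1_local:.1%}")    # 5.0%
print(f"L2 local miss rate:  {l2_local:.1%}")    # 20.0% -- looks bad in isolation
print(f"L2 global miss rate: {l2_global:.1%}")   # 1.0%  -- but globally it's fine
print(f"L2 misses per 1,000 instructions: {l2_mpki:.1f}")  # 2.0
```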
Well, one interesting thing you can do here is that just by having a level 2 cache, you might be able to have a smaller level 1 cache. So if you have a relatively close level 2 cache, for the same performance you can actually have a much smaller level 1 cache and have the level 2 cache take care of the rest. And this can actually even help performance, because you can potentially move the level 1 cache closer now that it's smaller, or increase the speed of the level 1 cache. But the miss rate out of the level 1 cache is going to go up, because you made it smaller. So what about energy? This can actually be a really good energy win. How is this an energy win? You are able to take your level one cache and make it smaller because you now have a level two cache. So in the common case, your accesses hit in this smaller level one cache, and because it's smaller, each access to it takes less energy. So something to think about here is that our level two design is going to influence our lower-level designs; it's not in a vacuum, so it's not like we can just slap on a level two cache or a level three cache and say the lower levels of the cache hierarchy don't change. Another way that the level two design, or the presence of a level two, can really influence level one is that you might be able to have a much simpler level one cache design because you have a level two cache. So what does this look like? Well, let's say in your old design the level one cache was a write-back cache. So it stored dirty data, and when a cache conflict occurred it had to find a victim, evict something out of the level one cache, and wait for that eviction to occur, or at least find the bandwidth for that eviction to occur. With something like a level 2 cache that backs a level 1 cache, you can actually move to a write-through design. Now how is this possible? Why would you be able to do a write-through design when you weren't able to do one before? Well, if you only had one level of cache and you did write-through, every write you did would have to go out to main memory, and largely you don't have enough bandwidth to do that out to your main memory, out to your DRAM. But because you now have a level two cache, you can use that as a buffer, if you will, to absorb the write-through traffic, and that traffic doesn't have to go off chip. So you could have a write-back L2 which tracks all the dirty data and makes sure the evictions of dirty data actually happen, but the level 1 cache can just write through all data. Now, this requires you to have enough bandwidth between your level 1 and your level 2, but that's typically a lot easier to come by, because your level 1 cache and level 2 cache are usually located near each other on chip in a modern-day processor. Let's see, other reasons that this is good. Well, it really simplifies your level one design. If you have a write-through cache in your level one design, you don't have to worry about dirty victim evictions, the control becomes easier, and a lot of times it becomes easier to integrate into your pipeline, which is a good thing.
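Here is a minimal, hypothetical sketch of that arrangement; the class names and structure are my own, not from the lecture, and associativity, replacement, and timing are all omitted. The point is just that the L1 never holds dirty data, while the write-back L2 absorbs the write-through traffic and only sends dirty lines off chip on eviction.

```python
# Hypothetical sketch: write-through (no-write-allocate) L1 backed by a write-back L2.
# Caches are modeled as plain dictionaries keyed by block address.

class WriteThroughL1:
    def __init__(self, l2):
        self.lines = {}          # block_addr -> data (never dirty)
        self.l2 = l2

    def store(self, addr, data):
        if addr in self.lines:
            self.lines[addr] = data   # update the L1 copy if present
        self.l2.store(addr, data)     # always write through to the L2

    def invalidate(self, addr):
        # Safe to just drop the line: the L1 never holds the only copy.
        self.lines.pop(addr, None)

class WriteBackL2:
    def __init__(self, memory):
        self.lines = {}          # block_addr -> (data, dirty)
        self.memory = memory

    def store(self, addr, data):
        self.lines[addr] = (data, True)   # absorb write-through traffic, mark dirty

    def evict(self, addr):
        data, dirty = self.lines.pop(addr)
        if dirty:
            self.memory[addr] = data      # only dirty lines go out to DRAM

# Tiny usage example:
memory = {}
l1 = WriteThroughL1(WriteBackL2(memory))
l1.store(0x40, "data")    # lands in the L2 immediately; the L1 stays clean
```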
Something we haven't talked about yet, but will be talking about at the end of this course, is how something like this multi-level cache can actually simplify your cache coherence issues. By having a smaller L1 which is write-through, backed by an L2, you can really simplify the cache coherence problem. So what is cache coherence? Well, we've talked about compulsory misses, capacity misses, and conflict misses; when we get to the end of this course, we'll be talking about cache coherence, which is about having multiple processors, each with caches, and keeping those caches coherent so that the data does not become stale between them. Well, one of the questions that comes up is whether having write-through and simplifying the L1 makes things easier from that perspective, and it does. Now, why is that? Well, something we're going to learn about is coherence misses, where basically a different cache reaches across the bus, or the interconnect, and tells this cache to invalidate something. And if you only have one cache per processor, let's say just a level one cache, and that cache is tightly integrated into your pipeline, you now have to deal with these external requests coming in to do invalidations, or to do something like snoops, which we'll also be talking about in a few classes when we get to the end of this course. But if you have a level one backed by a level two, you can have the level two service a lot of that complexity. So you can take complexity and push it from the level one into the level two. Now, you're still going to have to figure out how to do invalidations in the level one, but before, when you actually had to do an invalidation, you could potentially have had dirty data: you would have had to pick a victim and evict that line, so the invalidation would naturally generate an eviction. But now, because the level 1 is, let's say, a simpler write-through level 1, you don't have to generate the eviction; instead you just have to invalidate the line. It's a lot easier to just blast away lines than it is to blast away lines and also figure out how to take that data and evict it. It's a lot less disruptive to your main processor pipeline, because you don't have to stall to do the eviction, and it doesn't block further loads and stores coming from the main processor pipeline. Last, if you have a write-through cache at level one, this can really simplify error recovery. And when I say error recovery, I mean from a soft error perspective. So you have your chip and it gets struck by radiation, and this is actually a relatively common occurrence: an alpha particle, something coming out of the sky, some highly energetic piece of radiation, hits your chip. Well, that flips a bit. And how do we protect against this? Well, there are a couple of different solutions. We can use error-correcting codes, or we can use some simpler ideas like parity. And if you have something like a write-through cache, because you never have dirty data in the cache, you can get away with maybe just having parity and not full ECC or something like that.
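To make the parity point concrete, here is a tiny, hypothetical sketch, not from the lecture: a single parity bit can detect a flipped bit but cannot correct it, and that is good enough for a write-through L1 because recovery is just "invalidate the line and refetch it from the L2."

```python
# Hypothetical sketch: even parity over a cache line, and the recovery
# action a write-through L1 can take when the check fails.

def parity(line_bytes: bytes) -> int:
    """Even parity bit over all bits of the line."""
    p = 0
    for b in line_bytes:
        p ^= bin(b).count("1") & 1
    return p

stored_line   = bytes([0x12, 0x34, 0x56, 0x78])
stored_parity = parity(stored_line)

# Suppose an alpha particle flips one bit while the line sits in the L1.
corrupted = bytes([0x12, 0x34, 0x56, 0x79])

if parity(corrupted) != stored_parity:
    # Detected but not correctable with parity alone. Because the L1 is
    # write-through, the L2 is guaranteed to hold an up-to-date copy,
    # so the L1 can simply invalidate the line and refetch it later.
    print("parity error: invalidate L1 line, refetch from L2")
```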
And if you were to detect a parity error in this write-through cache, you just have to invalidate the line. You don't have to declare the data corrupt, because you know that the L2 has an up-to-date copy. So the L2 might have more protection on it, but the L1 can basically just invalidate if it gets struck by an alpha particle. So, something to think about there: the presence of a level 2 can really influence our level 1 designs. Okay, how else does this occur? How else do the level one and level two designs commingle? Well, there's the question of what the inclusion policy is. Now that we have multiple levels of cache, we can think about the lower-down level, let's say the level one, having a certain piece of data, and the level two either having that same piece of data or not. Caches where anything in the level one cache is also in the level two cache, or more generally anything in a lower-level cache is also in the higher-level cache, we're going to call inclusive caches. So the inner cache only holds copies of data that is also in the farther-out cache, and external snoop accesses only need to check the outer cache, as we were talking about before. You can contrast that with an exclusive cache. In an exclusive cache, the different layers of your cache, for instance your level two cache, are not going to have, or may not have, the same data that is in your level one cache. What does that mean? Well, it means if you evict something out of your level one cache, you probably want to put it back into the level two cache in an exclusive design, because that's kind of the idea behind caches: you want to keep a certain amount of working set in your cache. So if you evict something out of level one, it's probably still a relatively useful thing; it would probably still fit in your larger working set at level two. So you just don't want to throw it away, you want to keep it around. So in exclusive caches there's typically a swap operation that occurs: you actually swap lines between the lower-level and the higher-level cache when you move data. The higher-level cache, the cache farther away from the processor, goes and accesses main memory and brings the data into itself, and then when the level one cache goes to bring it in, you actually swap two lines, and this adds complexity to your hardware design. You have to add something which actually does the swap. It could potentially be bad for power, but people have actually built these. If you go look at the original AMD Athlon, it had a 64-kilobyte primary cache and a 256-kilobyte secondary cache, and they were exclusive. Now you might say, this sounds like a lot of headache, why would you ever go build an exclusive cache? Well, it has one really big benefit: you get more storage area. Because you're not keeping two copies of the same piece of data, you can now store effectively more data in your cache. So with this AMD Athlon, with a 64-kilobyte primary cache backed by a 256-kilobyte secondary cache, you have the sum of those two sizes to store data, and it's all unique data. In contrast, if we had the same cache hierarchy but it were inclusive, you would only have strictly 256 kilobytes, the storage space of the larger, farther-away cache, because in the inclusive design the lower-level cache, the primary cache here, only keeps copies of what is already in the farther-out cache.
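As a rough illustration, and this is my own sketch rather than anything from the lecture, here is the swap an exclusive hierarchy performs on an L1 miss that hits in the L2, plus the effective-capacity difference implied by the Athlon numbers above:

```python
# Hypothetical sketch: the swap in an exclusive L1/L2 pair on an L1 miss
# that hits in the L2. Caches are dicts keyed by block address.

def exclusive_fill(addr, l1, l2, l1_victim_addr):
    """Move the requested line from L2 to L1, and drop the L1 victim into L2."""
    line = l2.pop(addr)                                  # exclusive: only one copy exists
    if l1_victim_addr is not None:
        l2[l1_victim_addr] = l1.pop(l1_victim_addr)      # victim stays in the hierarchy
    l1[addr] = line                                      # requested line lands in L1

# Effective capacity, using the AMD Athlon sizes from the lecture:
L1_KB, L2_KB = 64, 256
print("exclusive:", L1_KB + L2_KB, "KB of unique data")  # 320 KB
print("inclusive:", L2_KB, "KB of unique data")          # 256 KB
```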
Let's take a look at a few examples of caches in modern-day systems and see what trade-offs people have made. So, let's start off with something like the Itanium 2 processor. The first thing you're going to notice in this die photograph is that the level three cache is very large. So this has a very large level three cache, but this is a big-iron processor. The Itanium 2 was an Intel chip, or really an Intel and HP chip, I guess, because they were collaborating on this project at the time. It has a big level three cache, and it's kind of funny shaped. And the reason it's funny shaped is that they just took all the extra space they had on their die and filled it with cache. That was fine, because the level three cache was so far away from the processor that it didn't have to be a regular shape. But let's take a look at the different levels here. We have a level one cache: it's small, 16 kilobytes, 4-way set associative, with a 64-byte line size. It's heavily ported, so they can do a lot of loads and stores at the same time, two loads and two stores concurrently, and it's fast, single-cycle access. Now, 16 kilobytes is relatively small by today's standards. In 2002, that was probably a little on the large side, and this was a large processor, but by 2013 that's not that big for a level one cache. If we step up here, we can see that typically as you step up, you increase your size; otherwise, what would be the point? But the latency also gets worse. So we have single-cycle latency for level one, five cycles of latency for level two, and level two is a lot bigger: 256 kilobytes. The associativity stays the same, still four-way set associative. And as we get farther away, we have a 3-megabyte cache with twelve cycles of latency. It's this big cache around the outside of the chip, but we actually increase the associativity: it's a 12-way set associative cache. And the line size in both the level 2 and level 3 caches is larger, 128 bytes instead of 64 bytes, and that's pretty common. You'll see that as you get farther away, people use larger line sizes to move larger chunks of memory at a time. And to some extent this is because you have more capacity in these caches, so it doesn't hurt you as much. If you recall back to our earlier lecture where we plotted line size, or block size, versus performance, a large line size hurts you more in a smaller cache, because there are just not that many lines you can fit into something like a 4-kilobyte cache. But once you get to something like a 3-megabyte cache, you can be a little bit wasteful with storage if it helps you exploit spatial locality. One point I wanted to make here is a rule of thumb that people typically use for cache design, and this is just an empirical rule of thumb: usually, when you go from one level of cache to the next level of cache, you want to step up in size by a minimum of a factor of eight. Now, where does that come from?
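As a sanity check on how these latencies compose, here is a back-of-the-envelope average memory access time (AMAT) calculation using the Itanium 2 hit latencies above. The local miss rates and the main-memory latency are made-up numbers for illustration, not figures from the lecture.

```python
# AMAT for a three-level hierarchy:
#   AMAT = L1_hit + L1_miss * (L2_hit + L2_miss * (L3_hit + L3_miss * mem))
# Hit latencies are the Itanium 2 numbers from the lecture;
# the (local) miss rates and the memory latency are assumed.

l1_hit, l2_hit, l3_hit = 1, 5, 12              # cycles (Itanium 2)
mem_latency = 200                               # cycles, assumed
l1_miss, l2_miss, l3_miss = 0.10, 0.25, 0.30    # local miss rates, assumed

amat = l1_hit + l1_miss * (l2_hit + l2_miss * (l3_hit + l3_miss * mem_latency))
print(f"AMAT ~ {amat:.2f} cycles")   # 1 + 0.10*(5 + 0.25*(12 + 60)) = 3.30
```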
Well, it's totally empirical, but it's based on how much extra latency you add when you add another level of cache hierarchy. If you add another level to the hierarchy, it has some extra latency, and while it's going to decrease your cache miss rate, it's also going to make your time out to main memory a little bit worse. So there's a trade-off: does it make sense to add the extra cache layer and go check that extra layer, or is it better to just go out to main memory at that point? And the question is how much larger that cache has to be to provide any useful benefit. Empirically, when people have built these processors, usually you have to step up by something like a factor of eight. If you step up by only a factor of four, the benefit you get from the extra size does not outweigh the extra complexity, cost, and time added by the extra cache level. But that's just an empirical rule of thumb; if you go analyze enough processors, you'll see that basically every cache level steps up by a minimum of about a factor of eight. So we'll call this the 8X rule, the 8-times rule if you will. This cache here steps up by a factor of 16, so it's a little bit more than a factor of eight. But let's look at another, somewhat more modern processor: the IBM Power 7. The IBM Power 7 is an eight-core machine. There are private level one caches per core, and then there is a big level three cache sitting in the middle of the processor. What does this look like? Well, we have relatively small level one caches: a 32-kilobyte L1 instruction cache and a 32-kilobyte L1 data cache, and the latency is higher than in our previous example, a three-cycle cache latency. As we go farther out, we can see we actually have an 8X step up in cache size, at least from a data cache perspective: we go from 32 to 256 kilobytes, and the latency is worse, eight cycles of latency to check the level two. And then finally, we have eight cores sharing a 32-megabyte unified level 3 cache, so it's a lot of cores sharing this cache, and it's actually built out of embedded DRAM in this processor. IBM has some pretty impressive technology here where they can actually embed DRAM on a logic process. This is not common in most fabrication processes; this is typically an IBM, an IBM-foundry, sort of trait, though some other people also have embedded DRAM now, but IBM is really the leader at the forefront there. The latency here is quite a bit higher: a 25-cycle latency to the Power 7's level three cache. Let's pull up the scoreboard. We've been building a scoreboard throughout this lecture, seeing how adding these different advanced cache techniques helps with performance. So what is our scoreboard going to look like here? Well, we have a multilevel cache, and our first question is what happens from the level one perspective. Well, the miss penalty: what does adding a level two cache do to the level one miss penalty? Well, when you take a miss, you have to go out to the next level, and before, we had to go out to main memory, which could be far, far away, so the miss penalty was quite high. But if we have a multilevel cache with, let's say, a level two cache just a few cycles away, maybe five cycles away, your miss penalty as seen from the level one cache now goes down when we add this multilevel cache. So that's the level one. What about if we draw a box around all of the levels of our cache?
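Just to make the rule of thumb concrete, here is a quick check of the step-up factors for the two hierarchies mentioned above; this is my own illustration, using the sizes quoted in the lecture (and ignoring that the Power 7's L3 is shared by eight cores).

```python
# Step-up factors between adjacent cache levels (sizes in KB, from the lecture).
hierarchies = {
    "Itanium 2": [16, 256, 3 * 1024],    # L1 D, L2, L3
    "Power 7":   [32, 256, 32 * 1024],   # L1 D, L2, shared L3
}

for name, sizes in hierarchies.items():
    factors = [nxt / cur for cur, nxt in zip(sizes, sizes[1:])]
    print(name, "step-up factors:", factors)
# Itanium 2 step-up factors: [16.0, 12.0]  -- both above the 8x rule of thumb
# Power 7   step-up factors: [8.0, 128.0]
```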
So let's say we have level one, level two, level three: what happens? Well, in aggregate, the miss penalty for each particular level goes down, so that's a plus. And if you look at how often you have to go all the way out to main memory, what you're actually going to see is that the miss rate, in aggregate, also goes down, because you don't miss out of your last-level cache as often. So you don't have to go out to main memory as much, because you now have a larger cache, not as big as your main memory or your DRAM, but sitting in front of it. So the miss rate goes down, and you have a lower overall miss rate.
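To close out this scoreboard entry, here is a small before-and-after comparison with assumed numbers, just to illustrate the direction of the effect: adding an L2 drops the miss penalty seen by the L1 and the rate at which requests go all the way out to DRAM.

```python
# Before/after sketch for the scoreboard, with assumed numbers.
l1_hit, l2_hit, mem = 1, 5, 200        # cycles; memory latency assumed
l1_miss, l2_local_miss = 0.10, 0.25    # assumed miss rates

# Single-level hierarchy: every L1 miss pays the full trip to DRAM.
amat_l1_only = l1_hit + l1_miss * mem                              # 1 + 0.10*200 = 21 cycles

# Two-level hierarchy: the L1's miss penalty is now mostly the L2 hit time.
amat_two_level = l1_hit + l1_miss * (l2_hit + l2_local_miss * mem) # 1 + 0.10*55 = 6.5 cycles

# Global rate of going out to DRAM also drops: 10% -> 2.5% of accesses.
print(amat_l1_only, amat_two_level, l1_miss, l1_miss * l2_local_miss)
```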