So the next memory optimization, or cache optimization, we are going to look at is adding multi-level caches. And you probably see this in just about any modern-day processor: there are level one caches, level two caches, maybe level three caches, maybe even level four caches. So why might this be a good idea, and what is the effect of this on the different parameters we've been looking at throughout this lecture? So what's the basic idea here? Well, the basic idea is you have a CPU, and instead of just having one cache we say let's add two caches. And why do we want to do this? Well, it comes from the insight that it is difficult to have a cache that is both very large and very fast. So how do you solve this problem? Well, you can think about trying to fill this entire room with RAM. That would be a very large, let's say, cache for something. And if you think about it, systems really do have notional caches like this. If you go look at something like an internet-scale data center, there will be basically a room full of mostly just RAM, and they'll use this to cache things like lookups. For example, you might see something like memcached, which is a key-value store that typically caches queries to databases, and which people keep in large amounts of RAM. The problem with having a very large cache is, by definition, if it's large and you don't want to violate the laws of physics, to get to the farthest extent of the cache you might have to go very, very far. And if you run a wire very far, it's by definition going to be slow. You can't violate the laws of physics here: the fastest a signal can travel is the speed of light, so the farther you go, the distance becomes a factor. So our problem here is that you can't have memory that's both large and fast. So what can we do? Well, we can add multiple levels of cache. So if you have a certain-sized working set, you can try to store that, let's say, in a local cache which is both small and fast. And then you can have different levels of cache here, say a level two cache which has a larger capacity, but is a little bit slower and a little bit farther away from your CPU. And this can mitigate the cost of having to go all the way out to DRAM. Okay, so I wanted to introduce some nomenclature here, because this is an important thing you're going to see over and over again with caches, and it's important that we introduce it now when we start to talk about multi-level caches. Because just because you see, let's say, a low cache miss rate does not mean a cache is performing well, especially when you get into the multi-level cache domain. And why is that? Well, let's say we have this level two cache here, and I say it has a very low miss rate. That sounds good, but this level one cache is filtering accesses to it. So just because level two has a low miss rate doesn't necessarily mean that the level two cache is performing well. Or vice versa: let's say the level two cache has a very high miss rate. You might say that level two cache is doing a really bad job. Well, to some extent all of the easy accesses are being filtered out by the level one cache, so all of a sudden the miss rate out of your level two cache might look very different. So we need to come up with some way to discuss these different misses with correct nomenclature. So I'm going to introduce three notions here.
The first is something we're going to call the local cache miss rate. And this is just going to be the number of misses in a cache divided by the number of accesses to that cache. This is local per cache, so if you were to look at, let's say, a level two cache here, you have a certain number of accesses coming into the level two cache and a certain number of misses coming out of it; it is all local to the level two cache. So this is going to give you the actual miss rate of a particular cache. Now, that's not the same thing as something like a global miss rate. The global cache miss rate takes the number of misses in the cache relative to the number of CPU memory accesses. And this might be a better metric for a multi-level cache, because you might want to say, in our level two cache here, what is our miss rate out of the last level of cache, or out of our L2 cache, relative to the number of actual accesses that the CPU is doing? And this helps you encapsulate the level one and level two together. So that might be a better metric, and you'll see your book sometimes use these two different metrics in different ways. And then finally, something that's useful to think about is the number of misses per instruction. Now, why do we bring this up? It sounds pretty similar to something like a global miss rate. Well, the difference here is that the denominator changes. Instead of dividing by accesses to a particular cache or by CPU memory accesses, we divide by the number of instructions. And why is this useful? Well, misses per instruction takes out of the equation how many loads and stores you have as a percentage of instructions in your program. So, if you have, let's say, a program which has very few loads and stores but has a relatively high miss rate per access, you might say this is bad for performance if you just look at the cache miss rate, like the local cache miss rate. But in reality, it may not be so bad for performance, because if you rarely do a load or a store in that particular program, maybe it doesn't actually affect your performance very much. So this last metric, misses per instruction, encapsulates that. It encapsulates both the percentage of instructions that are loads and stores and the number of cache misses in the program. And this can either be local or global, but usually it is considered to be sort of global-ish. It can be either, though; you could also use this metric for a particular cache. So we can say this cache misses once every 1,000 instructions. That would be pretty good, depending on the number of loads and stores per set of instructions. So let's say you have a load or a store once every, I don't know, five instructions, so maybe 20%, or a little bit less than 20%, of the instructions are loads or stores, which might be typical for a typical program on a typical processor. Then you can make some statement about how the cache is performing in aggregate. And you'll see that a lot of the numbers in the Patterson-Hennessy book actually use misses per 1,000 instructions, which is typically a more useful metric. Okay, so let's take a look at how adding a level two cache influences the design of a level one cache. And this is pretty common: when you add multiple levels to your cache hierarchy, it actually influences the lower levels of your cache hierarchy. So how does adding a level two cache potentially influence the level one design?
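To make these three metrics concrete, here is a minimal sketch. The counts are made-up numbers, not figures from the lecture; they just show how the three metrics relate and why an L2's local miss rate can look bad while its global miss rate looks fine.

```python
# Minimal sketch of the three miss-rate metrics, using made-up example counts.
instructions     = 1_000_000   # total instructions executed
cpu_mem_accesses = 200_000     # loads + stores issued by the CPU (~20% of instructions)
l1_misses        = 10_000      # misses out of the L1 (these become L2 accesses)
l2_misses        = 2_000       # misses out of the L2

# Local miss rate: misses in a cache / accesses to THAT cache.
l1_local = l1_misses / cpu_mem_accesses   # L1 sees every CPU memory access
l2_local = l2_misses / l1_misses          # L2 only sees what the L1 missed on

# Global miss rate: misses in a cache / total CPU memory accesses.
l2_global = l2_misses / cpu_mem_accesses

# Misses per instruction (often quoted per 1,000 instructions).
l2_mpki = 1000 * l2_misses / instructions

print(f"L1 local miss rate:  {l1_local:.1%}")    # 5.0%
print(f"L2 local miss rate:  {l2_local:.1%}")    # 20.0% -- looks bad in isolation
print(f"L2 global miss rate: {l2_global:.1%}")   # 1.0%  -- but globally it's fine
print(f"L2 misses per 1,000 instructions: {l2_mpki:.1f}")  # 2.0
```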
Well, one interesting thing you can do here is that just by having a level 2 cache, you might be able to have a smaller level 1 cache. So if you have a relatively close level 2 cache, for the same performance you can actually have a much smaller level 1 cache and have the level 2 cache take care of the rest. And this can actually even help performance, because you can potentially move the level 1 cache closer now that it's smaller, or increase the speed of the level 1 cache. But the miss rate out of the level 1 cache is going to go up, because you made it smaller. So what about energy? This can actually be a really good energy win. How is this an energy win? You are able to take your level one cache and make it smaller because you now have a level two cache. So in the common case, your accesses hit in this smaller level one cache, and because it's smaller, each access to it takes less energy. So something to think about here is that our level two design is going to influence our lower-level designs; it's not in a vacuum, so it's not like we can just slap on a level two cache or a level three cache and say the lower levels of the cache hierarchy don't change. Another way that the level two design, or the presence of a level two, can really influence level one is that you might be able to have a much simpler level one cache design because you have a level two cache. So what does this look like? Well, let's say in your old design the level one cache was a write-back cache. So it stored dirty data, and when a cache conflict occurred it had to find a victim, evict something out of the level one cache, and wait for that eviction to occur, or at least find the bandwidth for that eviction to occur. With something like a level 2 cache that backs a level 1 cache, you can actually move to a write-through design. Now how is this possible? Why would you be able to do a write-through design when you weren't able to do one before? Well, if you only had one level of cache and you did write-through, every write you did would have to go out to main memory, and largely you don't have enough bandwidth to do that out to your main memory, out to your DRAM. But because you now have a level two cache, you can use that as a buffer, if you will, to absorb the write-through traffic, and that traffic doesn't have to go off chip. So you could have a write-back L2 which tracks all the dirty data and makes sure the evictions of dirty data actually happen, but the level 1 cache can just write through all data. Now, this requires you to have enough bandwidth between your level 1 and your level 2, but that's typically a lot easier to come by, because your level 1 cache and level 2 cache are usually located near each other on chip in a modern-day processor. Let's see, other reasons that this is good. Well, it really simplifies your level one design. If you have a write-through cache in your level one design, you don't have to worry about dirty victim evictions, the control becomes easier, and a lot of times it becomes easier to integrate into your pipeline, which is a good thing.
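Here is a minimal, hypothetical sketch of that arrangement; the class names and structure are my own, not from the lecture, and associativity, replacement, and timing are all omitted. The point is just that the L1 never holds dirty data, while the write-back L2 absorbs the write-through traffic and only sends dirty lines off chip on eviction.

```python
# Hypothetical sketch: write-through (no-write-allocate) L1 backed by a write-back L2.
# Caches are modeled as plain dictionaries keyed by block address.

class WriteThroughL1:
    def __init__(self, l2):
        self.lines = {}          # block_addr -> data (never dirty)
        self.l2 = l2

    def store(self, addr, data):
        if addr in self.lines:
            self.lines[addr] = data   # update the L1 copy if present
        self.l2.store(addr, data)     # always write through to the L2

    def invalidate(self, addr):
        # Safe to just drop the line: the L1 never holds the only copy.
        self.lines.pop(addr, None)

class WriteBackL2:
    def __init__(self, memory):
        self.lines = {}          # block_addr -> (data, dirty)
        self.memory = memory

    def store(self, addr, data):
        self.lines[addr] = (data, True)   # absorb write-through traffic, mark dirty

    def evict(self, addr):
        data, dirty = self.lines.pop(addr)
        if dirty:
            self.memory[addr] = data      # only dirty lines go out to DRAM

# Tiny usage example:
memory = {}
l1 = WriteThroughL1(WriteBackL2(memory))
l1.store(0x40, "data")    # lands in the L2 immediately; the L1 stays clean
```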
Something we haven't talked about yet, but will be talking about at the end of this course, is how something like this multi-level cache can actually simplify your cache coherence issues. By having a smaller L1 which is write-through, backed by an L2, you can really simplify the cache coherence problem. So what is cache coherence? Well, we've talked about compulsory misses, capacity misses, and conflict misses; when we get to the end of this course, we'll be talking about cache coherence, which is about having multiple processors, each with caches, and keeping those caches coherent so that the data does not become stale between them. Well, one of the questions that comes up is whether having write-through and simplifying the L1 makes things easier from that perspective, and it does. Now, why is that? Well, something we're going to learn about is coherence misses, where basically a different cache reaches across the bus, or the interconnect, and tells this cache to invalidate something. And if you only have one cache per processor, let's say just a level one cache, and that cache is tightly integrated into your pipeline, you now have to deal with these external requests coming in to do invalidations, or to do something like snoops, which we'll also be talking about in a few classes when we get to the end of this course. But if you have a level one backed by a level two, you can have the level two service a lot of that complexity. So you can take complexity and push it from the level one into the level two. Now, you're still going to have to figure out how to do invalidations in the level one, but before, when you actually had to do an invalidation, you could potentially have had dirty data: you would have had to pick a victim and evict that line, so the invalidation would naturally generate an eviction. But now, because the level 1 is, let's say, a simpler write-through level 1, you don't have to generate the eviction; instead you just have to invalidate the line. It's a lot easier to just blast away lines than it is to blast away lines and also figure out how to take that data and evict it. It's a lot less disruptive to your main processor pipeline, because you don't have to stall to do the eviction, and it doesn't block further loads and stores coming from the main processor pipeline. Last, if you have a write-through cache at level one, this can really simplify error recovery. And when I say error recovery, I mean from a soft error perspective. So you have your chip and it gets struck by radiation, and this is actually a relatively common occurrence: an alpha particle, something coming out of the sky, some highly energetic piece of radiation, hits your chip. Well, that flips a bit. And how do we protect against this? Well, there are a couple of different solutions. We can use error-correcting codes, or we can use some simpler ideas like parity. And if you have something like a write-through cache, because you never have dirty data in the cache, you can get away with maybe just having parity and not full ECC or something like that.
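To make the parity point concrete, here is a tiny, hypothetical sketch, not from the lecture: a single parity bit can detect a flipped bit but cannot correct it, and that is good enough for a write-through L1 because recovery is just "invalidate the line and refetch it from the L2."

```python
# Hypothetical sketch: even parity over a cache line, and the recovery
# action a write-through L1 can take when the check fails.

def parity(line_bytes: bytes) -> int:
    """Even parity bit over all bits of the line."""
    p = 0
    for b in line_bytes:
        p ^= bin(b).count("1") & 1
    return p

stored_line   = bytes([0x12, 0x34, 0x56, 0x78])
stored_parity = parity(stored_line)

# Suppose an alpha particle flips one bit while the line sits in the L1.
corrupted = bytes([0x12, 0x34, 0x56, 0x79])

if parity(corrupted) != stored_parity:
    # Detected but not correctable with parity alone. Because the L1 is
    # write-through, the L2 is guaranteed to hold an up-to-date copy,
    # so the L1 can simply invalidate the line and refetch it later.
    print("parity error: invalidate L1 line, refetch from L2")
```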
And if you were to detect a parity error in this write-through cache, you just have to invalidate the line. You don't have to declare the data corrupt, because you know that the L2 has an up-to-date copy. So the L2 might have more protection on it, but the L1 can basically just invalidate if it gets struck by an alpha particle. So, something to think about there: the presence of a level 2 can really influence our level 1 designs. Okay, how else does this occur? How else do the level one and level two designs commingle? Well, there's the question of what the inclusion policy is. Now that we have multiple levels of cache, we can think about the lower-down level, let's say the level one, having a certain piece of data, and the level two either having that same piece of data or not. Caches where anything in the level one cache is also in the level two cache, or more generally anything in a lower-level cache is also in the higher-level cache, we're going to call inclusive caches. So the inner cache only holds copies of data that is also in the farther-out cache, and external snoop accesses only need to check the outer cache, as we were talking about before. You can contrast that with an exclusive cache. In an exclusive cache, the different layers of your cache, for instance your level two cache, are not going to have, or may not have, the same data that is in your level one cache. What does that mean? Well, it means if you evict something out of your level one cache, you probably want to put it back into the level two cache in an exclusive design, because that's kind of the idea behind caches: you want to keep a certain amount of working set in your cache. So if you evict something out of level one, it's probably still a relatively useful thing; it would probably still fit in your larger working set at level two. So you just don't want to throw it away, you want to keep it around. So in exclusive caches there's typically a swap operation that occurs: you actually swap lines between the lower-level and the higher-level cache when you move data. The higher-level cache, the cache farther away from the processor, goes and accesses main memory and brings the data into itself, and then when the level one cache goes to bring it in, you actually swap two lines, and this adds complexity to your hardware design. You have to add something which actually does the swap. It could potentially be bad for power, but people have actually built these. If you go look at the original AMD Athlon, it had a 64-kilobyte primary cache and a 256-kilobyte secondary cache, and they were exclusive. Now you might say, this sounds like a lot of headache, why would you ever go build an exclusive cache? Well, it has one really big benefit: you get more storage area. Because you're not keeping two copies of the same piece of data, you can now store effectively more data in your cache. So with this AMD Athlon, with a 64-kilobyte primary cache backed by a 256-kilobyte secondary cache, you have the sum of those two sizes to store data, and it's all unique data. In contrast, if we had the same cache hierarchy but it were inclusive, you would only have strictly 256 kilobytes, the storage space of the larger, farther-away cache, because in the inclusive design the lower-level cache, the primary cache here, only keeps copies of what is already in the farther-out cache.
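As a rough illustration, and this is my own sketch rather than anything from the lecture, here is the swap an exclusive hierarchy performs on an L1 miss that hits in the L2, plus the effective-capacity difference implied by the Athlon numbers above:

```python
# Hypothetical sketch: the swap in an exclusive L1/L2 pair on an L1 miss
# that hits in the L2. Caches are dicts keyed by block address.

def exclusive_fill(addr, l1, l2, l1_victim_addr):
    """Move the requested line from L2 to L1, and drop the L1 victim into L2."""
    line = l2.pop(addr)                                  # exclusive: only one copy exists
    if l1_victim_addr is not None:
        l2[l1_victim_addr] = l1.pop(l1_victim_addr)      # victim stays in the hierarchy
    l1[addr] = line                                      # requested line lands in L1

# Effective capacity, using the AMD Athlon sizes from the lecture:
L1_KB, L2_KB = 64, 256
print("exclusive:", L1_KB + L2_KB, "KB of unique data")  # 320 KB
print("inclusive:", L2_KB, "KB of unique data")          # 256 KB
```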
Let's take a look at a few examples of caches in modern-day systems and see what trade-offs people have made. So, let's start off with something like the Itanium 2 processor. The first thing you're going to notice in this die photograph is that the level three cache is very large. So this has a very large level three cache, but this is a big-iron processor. The Itanium 2 was an Intel chip, or really an Intel and HP chip, I guess, because they were collaborating on this project at the time. It has a big level three cache, and it's kind of funny shaped. And the reason it's funny shaped is that they just took all the extra space they had on their die and filled it with cache. That was fine, because the level three cache was so far away from the processor that it didn't have to be a regular shape. But let's take a look at the different levels here. We have a level one cache: it's small, 16 kilobytes, 4-way set associative, with a 64-byte line size. It's heavily ported, so they can do a lot of loads and stores at the same time, two loads and two stores concurrently, and it's fast, single-cycle access. Now, 16 kilobytes is relatively small by today's standards. In 2002, that was probably a little on the large side, and this was a large processor, but by 2013 that's not that big for a level one cache. If we step up here, we can see that typically as you step up, you increase your size; otherwise, what would be the point? But the latency also gets worse. So we have single-cycle latency for level one, five cycles of latency for level two, and level two is a lot bigger: 256 kilobytes. The associativity stays the same, still four-way set associative. And as we get farther away, we have a 3-megabyte cache with twelve cycles of latency. It's this big cache around the outside of the chip, but we actually increase the associativity: it's a 12-way set associative cache. And the line size in both the level 2 and level 3 caches is larger, 128 bytes instead of 64 bytes, and that's pretty common. You'll see that as you get farther away, people use larger line sizes to move larger chunks of memory at a time. And to some extent this is because you have more capacity in these caches, so it doesn't hurt you as much. If you recall back to our earlier lecture where we plotted line size, or block size, versus performance, a large line size hurts you more in a smaller cache, because there are just not that many lines you can fit into something like a 4-kilobyte cache. But once you get to something like a 3-megabyte cache, you can be a little bit wasteful with storage if it helps you exploit spatial locality. One point I wanted to make here is a rule of thumb that people typically use for cache design, and this is just an empirical rule of thumb: usually, when you go from one level of cache to the next level of cache, you want to step up in size by a minimum of a factor of eight. Now, where does that come from?
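As a sanity check on how these latencies compose, here is a back-of-the-envelope average memory access time (AMAT) calculation using the Itanium 2 hit latencies above. The local miss rates and the main-memory latency are made-up numbers for illustration, not figures from the lecture.

```python
# AMAT for a three-level hierarchy:
#   AMAT = L1_hit + L1_miss * (L2_hit + L2_miss * (L3_hit + L3_miss * mem))
# Hit latencies are the Itanium 2 numbers from the lecture;
# the (local) miss rates and the memory latency are assumed.

l1_hit, l2_hit, l3_hit = 1, 5, 12              # cycles (Itanium 2)
mem_latency = 200                               # cycles, assumed
l1_miss, l2_miss, l3_miss = 0.10, 0.25, 0.30    # local miss rates, assumed

amat = l1_hit + l1_miss * (l2_hit + l2_miss * (l3_hit + l3_miss * mem_latency))
print(f"AMAT ~ {amat:.2f} cycles")   # 1 + 0.10*(5 + 0.25*(12 + 60)) = 3.30
```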
Well, it's totally empirical, but it's based on how much extra latency you add when you add another level of cache hierarchy. If you add another level to the hierarchy, it has some extra latency, and while it's going to decrease your cache miss rate, it's also going to make your time out to main memory a little bit worse. So there's a trade-off: does it make sense to add the extra cache layer and go check that extra layer, or is it better to just go out to main memory at that point? And the question is how much larger that cache has to be to provide any useful benefit. Empirically, when people have built these processors, usually you have to step up by something like a factor of eight. If you step up by only a factor of four, the benefit you get from the extra size does not outweigh the extra complexity, cost, and time added by the extra cache level. But that's just an empirical rule of thumb; if you go analyze enough processors, you'll see that basically every cache level steps up by a minimum of about a factor of eight. So we'll call this the 8X rule, the 8-times rule if you will. This cache here steps up by a factor of 16, so it's a little bit more than a factor of eight. But let's look at another, somewhat more modern processor: the IBM Power 7. The IBM Power 7 is an eight-core machine. There are private level one caches per core, and then there is a big level three cache sitting in the middle of the processor. What does this look like? Well, we have relatively small level one caches: a 32-kilobyte L1 instruction cache and a 32-kilobyte L1 data cache, and the latency is higher than in our previous example, a three-cycle cache latency. As we go farther out, we can see we actually have an 8X step up in cache size, at least from a data cache perspective: we go from 32 to 256 kilobytes, and the latency is worse, eight cycles of latency to check the level two. And then finally, we have eight cores sharing a 32-megabyte unified level 3 cache, so it's a lot of cores sharing this cache, and it's actually built out of embedded DRAM in this processor. IBM has some pretty impressive technology here where they can actually embed DRAM on a logic process. This is not common in most fabrication processes; this is typically an IBM, an IBM-foundry, sort of trait, though some other people also have embedded DRAM now, but IBM is really the leader at the forefront there. The latency here is quite a bit higher: a 25-cycle latency to the Power 7's level three cache. Let's pull up the scoreboard. We've been building a scoreboard throughout this lecture, seeing how adding these different advanced cache techniques helps with performance. So what is our scoreboard going to look like here? Well, we have a multilevel cache, and our first question is what happens from the level one perspective. Well, the miss penalty: what does adding a level two cache do to the level one miss penalty? Well, when you take a miss, you have to go out to the next level, and before, we had to go out to main memory, which could be far, far away, so the miss penalty was quite high. But if we have a multilevel cache with, let's say, a level two cache just a few cycles away, maybe five cycles away, your miss penalty as seen from the level one cache now goes down when we add this multilevel cache. So that's the level one. What about if we draw a box around all of the levels of our cache?
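Just to make the rule of thumb concrete, here is a quick check of the step-up factors for the two hierarchies mentioned above; this is my own illustration, using the sizes quoted in the lecture (and ignoring that the Power 7's L3 is shared by eight cores).

```python
# Step-up factors between adjacent cache levels (sizes in KB, from the lecture).
hierarchies = {
    "Itanium 2": [16, 256, 3 * 1024],    # L1 D, L2, L3
    "Power 7":   [32, 256, 32 * 1024],   # L1 D, L2, shared L3
}

for name, sizes in hierarchies.items():
    factors = [nxt / cur for cur, nxt in zip(sizes, sizes[1:])]
    print(name, "step-up factors:", factors)
# Itanium 2 step-up factors: [16.0, 12.0]  -- both above the 8x rule of thumb
# Power 7   step-up factors: [8.0, 128.0]
```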
So let's say we have level one, level two, level three: what happens? Well, in aggregate, the miss penalty for each particular level goes down, so that's a plus. And if you look at how often you have to go all the way out to main memory, what you're actually going to see is that the miss rate, in aggregate, also goes down, because you don't miss out of your last-level cache as often. So you don't have to go out to main memory as much, because you now have a larger cache, not as big as your main memory or your DRAM, but sitting in front of it. So the miss rate goes down, and you have a lower overall miss rate.
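To close out this scoreboard entry, here is a small before-and-after comparison with assumed numbers, just to illustrate the direction of the effect: adding an L2 drops the miss penalty seen by the L1 and the rate at which requests go all the way out to DRAM.

```python
# Before/after sketch for the scoreboard, with assumed numbers.
l1_hit, l2_hit, mem = 1, 5, 200        # cycles; memory latency assumed
l1_miss, l2_local_miss = 0.10, 0.25    # assumed miss rates

# Single-level hierarchy: every L1 miss pays the full trip to DRAM.
amat_l1_only = l1_hit + l1_miss * mem                              # 1 + 0.10*200 = 21 cycles

# Two-level hierarchy: the L1's miss penalty is now mostly the L2 hit time.
amat_two_level = l1_hit + l1_miss * (l2_hit + l2_local_miss * mem) # 1 + 0.10*55 = 6.5 cycles

# Global rate of going out to DRAM also drops: 10% -> 2.5% of accesses.
print(amat_l1_only, amat_two_level, l1_miss, l1_miss * l2_local_miss)
```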