Welcome to our next installment of ELE475. Today we are going to be continuing our discussion of advanced cache techniques. And one of the big things we are going to get into is talking about how to build out-of-order memory systems. What I mean by out-of-order memory systems is, basically, you can have memory operations outstanding in the memory system, and they execute out of order. And you can mix this with an in-order processor pipeline, or an out-of-order processor pipeline. It's the designer's choice; both sorts of designs have been built, have been successfully built. But, and this is important, this allows you to effectively get more memory parallelism, because you can have transactions going out to the memory system in parallel, and the processor can continue and overlap computation with the memory.

So, just to recap from last lecture, we have a revised agenda here, now with all of our different cache optimization techniques, or advanced optimization techniques. We talked about pipelining caches, and we talked about adding a write buffer, last lecture. We talked about the reason and the motivation for multi-level caches: two-level caches, three-level caches, maybe even four-level caches, backed by main memory. We talked about victim caches, which are extra little highly associative caches that you can put next to a low-associativity cache or even a direct-mapped cache, and effectively this can increase your performance quite a bit. Because once you miss in your direct-mapped cache you can go out to your victim cache, and the victim cache may have the data you're looking for. This helps when a small number of lines are highly contended and would, if you will, blow past the amount of associativity you have in your cache. We talked about hardware and software prefetching, and that is where we stopped last time.

Today we have a couple of extra topics. We are going to talk about how to bank your cache as a technique to get more bandwidth. The easy technique to get more bandwidth is just to put multiple ports on your cache, but that gets quite expensive, because you may have to duplicate your decoders, you probably have to duplicate the write logic, and in the worst case you may have to duplicate the storage space. In today's lecture we're also going to talk about some software optimizations that the compiler can do. We're going to start by talking about some software optimizations that are easy to do without transforming the code too horribly, and then we're going to talk about some that actually require some pretty heavy lifting with respect to code transformations, and what you have to do to reorganize the code, still in valid ways. But by doing so, you can actually have better memory performance and better cache performance. And then we're going to get to the meat of today's lecture: non-blocking caches, sometimes called lockup-free caches, or sometimes called out-of-order memory systems. Then a couple of topics at the end here. We're going to have critical word first, which basically means that when you go out to the memory system, you ask for the word that you need most urgently, and the memory system will give that back to you first, and this allows you to restart quicker. And then there's a very similar idea, but possibly easier to implement, called early restart.
With early restart, instead of asking the memory system to return the data in a specific order and taking advantage of that, the memory system gives you back the data in whatever order it normally returns it in. But when the word you need shows up, the word you stalled on, you just restart the processor. So if you're asking for one word out of a huge cache line, and it happens to be the second word in the cache line, you don't have to wait for the whole cache line to get loaded into the cache. But this has some extra complexity.

And finally, we are going to talk about way prediction. We touched on way prediction a little bit when we were talking about some of the instruction issue logic in superscalars, and we are going to talk about it in a little more detail here. Way prediction allows you to predict where to go find either the next piece of data you are going to access or the next instruction you are going to access. This is important if you have a multi-way cache. In a direct-mapped cache there is only one place to go find it; in a two-way cache there are two places to go find it. You can try to check both in parallel, but that will probably waste a bunch of power. Or what you can do is predict it: you can predict that it's in way one, or you can predict that it's in way zero. And using that prediction, if you have high prediction accuracy, you may only have to fire up half your cache. Or if it's on your critical path, you don't actually have to do two comparisons or three comparisons or four comparisons, however high the associativity of your cache is.

So let's go take a look at banking strategies and multi-porting our caches. This is a little bit of a recap; I think we talked about this briefly during last lecture, but we are going to continue on with it. Here we have our two-way issue processor pipeline. You have a program counter, it reads an instruction cache, and you have your instruction registers. What's interesting is that in this pipe you can only execute one memory operation per cycle. Why is that a limiter? Let's say you want to execute two memory operations, or three, or four, that are all independent of each other. How do you go about doing that? Well, in our two-issue pipeline here, logically what you want to do is somehow put the data cache on both pipelines and remove this restriction on which pipe you can execute memory accesses from. Okay, that doesn't sound too horrible, until we go look at the implementation of it.

So we can go and have a true multi-ported cache. In a true multi-ported cache, if you think about the decoding logic on a read, we're going to have to at least duplicate that decoding logic. Likewise, if we want to actually execute two writes at the same time, we're going to have to duplicate the row decoders for that logic. And a lot of times, how people actually go and implement this is they basically duplicate the entire cache to get an extra port, for a variety of reasons: the area increase could be large, it could even be double for a two-read-port, two-write-port sort of cache, so they don't even share the storage data. But that's only a little bit of the problem. The other problem here is that it also increases your hit time in your cache, and no one likes increasing hit times in caches.
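Before we dig into the hit-time question, here is a toy C model that pins down the way prediction idea from a moment ago: a two-way cache that probes a predicted way first and falls back to the other way on a mispredict. The sizes and field names (like `predicted_way`) are my own inventions for illustration only; a real way predictor is a small hardware table, not software.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Toy software model of way prediction for a 2-way set-associative cache.
 * The sizes and field names are assumptions, chosen only for illustration. */
#define NUM_SETS    256
#define OFFSET_BITS   6                    /* assume 64-byte lines */

struct set {
    uint32_t tag[2];
    bool     valid[2];
    unsigned predicted_way;                /* the way we expect to hit in next */
};

static struct set cache[NUM_SETS];

/* Returns true on a hit; *ways_probed reports how many tag compares (and
 * data-array reads) had to fire. A correct prediction means just one. */
static bool cache_access(uint32_t addr, unsigned *ways_probed) {
    unsigned    index = (addr >> OFFSET_BITS) & (NUM_SETS - 1);
    uint32_t    tag   = addr >> OFFSET_BITS; /* simplification: whole line address as tag */
    struct set *s     = &cache[index];

    for (unsigned probe = 0; probe < 2; probe++) {
        unsigned way = (s->predicted_way + probe) & 1;  /* predicted way first */
        *ways_probed = probe + 1;
        if (s->valid[way] && s->tag[way] == tag) {
            s->predicted_way = way;        /* train the predictor on each hit */
            return true;
        }
    }
    return false;                          /* miss; fill logic not modeled here */
}

int main(void) {
    uint32_t addr = 0x12345680u;
    /* Pretend the line is already resident in way 1 of its set. */
    cache[(addr >> OFFSET_BITS) & (NUM_SETS - 1)] = (struct set){
        .tag = {0, addr >> OFFSET_BITS}, .valid = {false, true}, .predicted_way = 0
    };

    unsigned probes;
    cache_access(addr, &probes);
    printf("first access probed %u way(s)\n", probes);   /* mispredict: 2 */
    cache_access(addr, &probes);
    printf("second access probed %u way(s)\n", probes);  /* predicted:  1 */
    return 0;
}
```

When the prediction is right, only one tag compare and one data-array read fire, which is where the power and critical-path savings come from.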
Now, back to the multi-ported cache: why does it increase the hit time? Well, unless you, for instance, duplicate everything, if you try to share some of the logic, you're basically going to increase the loading on the readout of the cache. And by increasing the load, with the same drive strengths, you'll probably increase the access time. You can go look at something like CACTI, which is a tool from HP that models different cache configurations, and you can see that adding a second port is usually about the last thing in the world you want to do when you're doing your processor design. It ends up being really quite expensive, mostly in terms of area and somewhat in terms of timing.

So can we do better? Well, maybe yes, maybe no. One thing you can think about doing is, instead of actually having a true dual-ported cache, can we have a pseudo dual-ported cache, a cache which has the same external interface? So if we sort of cut it open here in the middle, we have two addresses coming in and two data values coming out; it looks the same, but instead we actually put something else in the middle. What do we put in the middle? Well, we actually put two banks, as they're called, of our cache. The first bank is one half the size of our cache, the other bank is the other half, and they're both single-ported. What this is relying on is that if you have, statistically, memory accesses with some good pattern to them, the probability that you'll actually end up routing two memory addresses to the same bank, and get what's called a bank conflict, will be quite low. Instead, the probability that you actually go to two different banks will be high. Now, you might say, well, bank conflicts sound pretty likely to happen. This is sort of the naive drawing here; when people go to build these things for a large cache, you actually have many banks. And if you have more banks than the number of ports, you can make a decent argument that the probability you're going to get a bank conflict ends up being quite low, or that the probability that any two random addresses end up going to the same bank will be low.

Now, let's talk a little bit about addressing, and how we do this assignment of addresses to banks. The most basic thing is you can just low-order interleave, or you can try to high-order interleave. So one of the questions that comes up here is, which is better, low-order or high-order interleaving of our banks? Everyone take a guess on this one. When I say low order, I mean use the low-order bits to hash into which bank you go to. High-order interleaving uses some higher bits; maybe middle-order interleaving uses some middle-order bits. What would you choose if you were a designer?

That's a good insight. So, if we think we are accessing an array, and we're striding through the array, the high-order bits of those addresses are all going to be the same. So if we use high-order interleaving, all the high bits are the same and we get conflicts: they're all going to go to, let's say, bank zero all the time. And therefore our cache, which we were trying to give higher throughput, didn't get any higher throughput.
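To make the low-order versus high-order question concrete, here is a minimal C sketch. The parameters are my own assumptions, not numbers from the lecture: 64-byte lines, eight banks, 32-bit addresses. It just computes the bank index both ways for a sequential stride through an array.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical parameters, not from the lecture: a cache with 64-byte
 * lines split into 8 single-ported banks, and 32-bit addresses. */
#define LINE_OFFSET_BITS 6                 /* 64-byte cache lines         */
#define BANK_BITS        3                 /* 8 banks, 3 bank-select bits */
#define ADDR_BITS        32

/* Low-order interleaving: use the bits just above the line offset. */
static unsigned bank_low_order(uint32_t addr) {
    return (addr >> LINE_OFFSET_BITS) & ((1u << BANK_BITS) - 1);
}

/* High-order interleaving: use the top bits of the address. */
static unsigned bank_high_order(uint32_t addr) {
    return (addr >> (ADDR_BITS - BANK_BITS)) & ((1u << BANK_BITS) - 1);
}

int main(void) {
    /* Stride sequentially through an array, one cache line at a time. */
    uint32_t base = 0x10000000u;
    for (uint32_t addr = base; addr < base + 8 * 64; addr += 64) {
        printf("addr=0x%08x  low-order bank=%u  high-order bank=%u\n",
               (unsigned)addr, bank_low_order(addr), bank_high_order(addr));
    }
    return 0;
}
```

With low-order interleaving, the eight consecutive lines land in banks 0 through 7; with high-order interleaving, they all land in the same bank, which is exactly the striding problem just described.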
Okay, now let me give you a counter-example. Let's say you are striding through an array, an array of structures, which is a pretty common thing to do, and the structure is, let's say, 128 bytes long. And we go and access the first byte in that 128-byte structure in a loop. Are the low-order bits of the address the same between all of those accesses? Yeah. So I can probably bet our higher bits are the same and our lower bits are the same. What's changing in that example? Some medium bits, some bits a little bit higher than the low-order bits. So what we're trying to get across here is that you have to be very careful about how you do bank interleaving if you have naive hash functions, or naive functions which assign an address to a particular bank. And you may even want to think about having more sophisticated hashing functions, or assignments from address to bank. But if you go to do that, remember this is on the critical path of your memory load, so you may not want to put a complex hash function there. So you might want to just take some middle-order bits, or you might want to take some low bits and some high bits for your bank assignment. It gets a little challenging here: you probably don't want the highest-order bits, because in most systems those are always zero or something like that. It's very unlikely that your highest bits carry anything important, unless you're accessing a stack, where those are usually all ones.

So, I think we talked about most of these on the slide, but if we're trying to get high throughput here, the first issue is bank conflicts. We talked about how you can get a bank conflict, and we talked somewhat about how you can avoid bank conflicts by having more banks than you have ports into the banking structure. And that's just from a statistics perspective: if, let's say, you have a 32-kilobyte cache and each bank is one kilobyte or something like that, so you have 32 banks, then the probability that two ports are trying to access one particular bank is kind of like a birthday problem, trying to find whether two people have the same birthday. It's not quite that simple a probability, and you can go figure it out, but it's quite low: it's roughly 32 times one over 32 times one over 32, which is one over 32. So it's not the lowest thing in the world, but it's possible to reduce the probability of a conflict with random accesses, and if you have a better hash function, of course, you can do better.

Extra wiring. So what do I mean by extra wiring? Well, in our previous design we didn't have these crossover paths. So what this presents you with is a mux here and a multiplexer there, a mux here and a mux there, for the address paths. That's extra wiring and extra delay from banking. And one of the bigger challenges with the extra wiring in multi-banked memories is that it shrinks the bank size, or shrinks the memory size, and if you traditionally look at memories, they like to be larger. You can amortize the overheads a little bit better, for example the sense amps at the bottom of the memories and the row decoders and the column decoders, if the memory is bigger, because the row decoders and the column decoders roughly grow as the log of the number of bits or something like that. But if you make each of your banks really small and you have many of them, that overhead gets worse: you have to have more sense amps, more row decoders, more column decoders.
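Going back to the array-of-structures example for a moment, here is a small C sketch of how that 128-byte stride interacts with a naive low-order bank hash. The bank parameters (64-byte lines, eight banks selected by the bits just above the offset) and the struct layout are my own assumptions for illustration.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical 128-byte structure, matching the stride in the example;
 * the field names are made up. Bank parameters are also assumptions:
 * 64-byte lines and 8 banks selected by address bits [8:6]. */
struct record {
    uint32_t key;
    char     payload[124];                 /* pad the struct out to 128 bytes */
};

#define LINE_OFFSET_BITS 6
#define BANK_BITS        3

static unsigned bank_of(uintptr_t addr) {
    return (addr >> LINE_OFFSET_BITS) & ((1u << BANK_BITS) - 1);
}

int main(void) {
    static struct record table[16];

    /* Touch the first word of each 128-byte struct, as in the example. */
    for (int i = 0; i < 16; i++) {
        uintptr_t addr = (uintptr_t)&table[i].key;
        printf("element %2d  addr=0x%" PRIxPTR "  bank=%u\n",
               i, addr, bank_of(addr));
    }
    /* The stride is 128 bytes, so address bit 6 never changes from one
     * element to the next, and only four of the eight banks ever get used. */
    return 0;
}
```

Because the stride never toggles address bit 6, only half of the banks are ever touched; this is the kind of thing a slightly smarter hash, for example one that mixes in some higher bits, is meant to avoid.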
So those extra sense amps and decoders are something to think about, but usually that overhead is relatively small and you're willing to pay it, unless you get some really funny aspect ratios. What do I mean by really funny aspect ratios? If you shrink the banks too small, say you have, I don't know, 256-byte banks, and the readout width is, let's say, a byte, you get a long and narrow RAM. You can try to use column addressing or something to squish that into a more square form factor, but sometimes if you squeeze your bank sizes too small you get very funny aspect ratios, very long, skinny RAMs that don't fit very well into the design.

Okay, uneven utilization. What do I mean by that? Well, just like you can have bank conflicts due to the addressing or hashing scheme, you can also have a pathological case where all of the interesting data is stored in one of the banks. Let's say, going back to the example, you use low-order interleaving on the lowest-order bit of the address above the offset, we'll say, to choose between bank zero and bank one. But it just so happens that the program pads out all its data structures such that that bit is a constant zero for all the useful arrays you're trying to access. What that will do is locate all the data lines in bank zero. All right, well, we just took our big cache and cut it in half, because we effectively placed all of our data over in this one bank. So we're getting uneven data utilization. So we can get uneven bandwidth utilization and bank conflicts, but we can also get uneven data utilization. These often go together, though, the bank conflicts and the uneven utilization, because if you're accessing something a lot and it's in bank zero, that probably means it's stored there, so you have uneven storage patterns as well. There's a small sketch of this pathological case below.

Questions so far about banking, and why we want to bank? This is a pretty common thing to do. A good example is the Niagara processor, which is a Sun microprocessor: the first one had eight cores, and if you go look at their cache structure, they had eight banks in the cache. So that is a relatively heavily banked last-level cache. If you go look at something like the multi-core Intel chips out there, they have even more banks. Actually, I think I saw one once where the 32-megabyte last-level cache for the Intel Itanium 2 had some incredibly large number of banks, I think 32 or 64 banks. And they were wide, also, so they had lots of bandwidth going through that cache.

Okay, so if you look at cache banking against our metrics, it really helps here because we are increasing the bandwidth. We might be hurting some things: we might be hurting the hit time a little bit. The miss rate and miss penalty should not change, and the cache size is the same; we haven't effectively increased or shrunk the cache. Even if you get lots and lots of bank conflicts, or you have poor utilization, that's still not actually affecting your miss rate, because you would have been conflicting in the same-size cache. Logically, you could draw a box around it and it would be the exact same thing. It just happens that, if it were not a banked cache, you would have every other line be in the contested ones, or something like that. So just having a better memory layout is good all around.
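Here is that sketch of the uneven-utilization pathology: two banks selected by the single address bit just above the line offset, and a program whose objects are all padded out to 128 bytes so that bit is always zero. The bank-select bit, the padding size, and the addresses are my own assumptions, chosen just to show the effect.

```c
#include <stdint.h>
#include <stdio.h>

/* The pathological padding case: two banks, selected by the single address
 * bit just above a 64-byte line offset. All numbers here are assumptions. */
#define OFFSET_BITS 6
#define BANK_MASK   (1u << OFFSET_BITS)     /* bit 6 picks bank 0 or bank 1 */

static unsigned bank_of(uint32_t addr) {
    return (addr & BANK_MASK) ? 1 : 0;
}

int main(void) {
    unsigned hits[2] = {0, 0};

    /* Imagine the program pads every object out to 128 bytes (so objects
     * start on 128-byte boundaries) and only ever touches the header word
     * at the start of each object. */
    uint32_t heap_base = 0x20000000u;       /* hypothetical, 128-byte aligned */
    for (uint32_t i = 0; i < 1000; i++) {
        uint32_t header_addr = heap_base + i * 128;
        hits[bank_of(header_addr)]++;
    }

    /* Every access lands in bank 0: half the bandwidth of the banked cache,
     * and bank 1's storage goes unused by this data, which is the uneven
     * utilization problem from the lecture. */
    printf("bank 0: %u accesses, bank 1: %u accesses\n", hits[0], hits[1]);
    return 0;
}
```

Every access counts against bank zero, so you get half the bandwidth you paid for, and the data only occupies bank zero's storage, which is exactly the "cut our big cache in half" effect described above.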