So the last thing I wanted to talk about in today's lecture is prefetching, or the technique of prefetching. So what is prefetching? Well, you speculate that you're going to be using some instruction or data in the near future. You say, I think I'm going to be accessing some cache line soon. You haven't accessed it yet, the processor hasn't initiated that transaction yet, but because you have some reason to believe it's a good idea, you speculatively go fetch that data from memory, or from a farther-out level of cache, and bring it into a nearer level of cache. That's the basic idea of prefetching.

What does this look like? Well, there are a couple of different varieties of prefetching we're going to talk about today. We're going to talk about hardware prefetching, which is a completely hardware-based solution where the hardware tries to detect what would be a useful thing to bring in. We're going to talk about software prefetching, where the compiler or the programmer decides what piece of data would be useful to get early. And we can talk about mixed schemes, where software hints to the hardware about when it's a good idea to bring in data.

Okay, so an important question comes up on the slide here: what types of cache misses does prefetching affect? Well, we're bringing in data potentially before it's actually used. That sounds good, but we're also potentially polluting our cache with data we don't yet know we're going to need. So there can be downsides. Let's go through the three Cs. Capacity: does this affect capacity misses? Yes, it does, and in what is effectively a negative way. You've polluted your cache by prefetching something into it, so you can end up with a smaller effective capacity. And it's a similar story with conflict misses. If you have a really bad prefetcher, one that brings in data that is never accessed, or not accessed for a long time afterwards, you may be both polluting capacity and creating more conflicts. You bring the prefetched data into a line, and some other piece of data that gets brought in later has to evict it. That takes time, or at least the cache has to make a replacement decision, and it might even mess up the recency information in the replacement policy.

Okay, so why do we do this? We just said this increases our conflict misses. Where does prefetching help? Well, it can really help with compulsory misses: misses on cold data, data that is just not in your cache yet, so you have to go get it and bring it in. For that first-touch, compulsory miss, it would be really great if the prefetcher, either a hardware or a software prefetcher, could bring that data into your cache preemptively, and by doing so either reduce the latency or remove it completely. So how do we reduce that miss cost? Well, let's say you go to take a cache miss on a line and the prefetcher has already started to go get that line from main memory.
The miss status handling register could detect that there is already an outstanding request for that line, not send out a subsequent request, and just wait for that data to come back (there's a small code sketch of this idea a little further down). That can decrease the miss cost of a compulsory miss. Alternatively, the compulsory miss can just go away entirely. If the prefetcher is really on its game, it brings the data into your cache early enough that by the time you go to access it, it's already there; the prefetcher is clairvoyant, if you will, it already knew what data you were going to access in the future and pulled it in. So prefetching can have some pretty big benefits there. But like all things, including caches in general, you need to make sure that what you bring in is used with high probability. Otherwise, you're just going to pollute your cache.

One last thing I wanted to say here is that instruction caches are usually easier to predict for prefetching purposes than data caches. Why is this? Well, instructions are very well behaved. Typically you execute one instruction, then the next instruction, then the next instruction after that. Yes, there are branches, and programs do have a fair number of branches, but there's a decent chance you're just executing straight-line code, especially in something like a loop, where you execute straight-line code for a while, hit one branch at the bottom, and then go back up to the top. So a prefetcher can often be very effective for that. Data accesses, on the other hand, are usually a little more randomized. But not always: sometimes you're just streaming through data, something like a memcpy, or reading an entire array, and a prefetcher can be very effective there too.

So let's take a look at some of the issues with prefetching. We've already touched on some of these. In order for a prefetcher to be good, it has to be useful: it should produce hits in the cache. If the prefetcher is mostly producing misses, or bringing in stuff that, on the instruction side, never gets executed, or on the data side, never gets loaded or stored, it wasn't useful. Now, things get a little more complex when you start to think about usefulness, because you also have to think about the timeliness of the data. What do we mean by timeliness? Well, if you prefetch a piece of data into your cache but you do it too early, it's very possible that by the time you go to read the data, it has already been evicted. So if you bring it in too early, the prefetcher has just wasted bandwidth, and wasted energy going to the next level of cache or out to main memory, and that data doesn't end up being useful because it wasn't timely. You might have gotten the address right but gotten the data there too early, or you might have figured it out too late. How does that happen? Let's say the prefetcher determines that one line is really good, there's a high probability you're going to need it, so it decides to pull it in, but it makes that decision the cycle after the load to that line happens. Well, that didn't help. Even the cycle before wouldn't help, because you're effectively too late.
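Coming back to that miss status handling register idea for a second, here is a minimal C sketch of how a demand miss might get merged with an in-flight prefetch. The table size, the mshr_t structure, and the demand_miss function are illustrative assumptions, not the design of any particular processor.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NUM_MSHRS 8

/* One miss status handling register: tracks one outstanding request. */
typedef struct {
    bool     valid;        /* an access to this line is already in flight */
    bool     is_prefetch;  /* the in-flight request was a prefetch        */
    uint64_t line_addr;    /* cache-line-aligned address being fetched    */
} mshr_t;

static mshr_t mshrs[NUM_MSHRS];

/* Called on a demand miss. If a prefetch for the same line is already
 * outstanding, merge with it instead of issuing a second memory request;
 * the demand miss then only pays the *remaining* latency of the prefetch. */
bool demand_miss(uint64_t line_addr)
{
    for (int i = 0; i < NUM_MSHRS; i++) {
        if (mshrs[i].valid && mshrs[i].line_addr == line_addr) {
            /* Promote the in-flight prefetch to a demand request and
             * just wait for its data to come back.                     */
            mshrs[i].is_prefetch = false;
            return true;               /* merged: no new request sent */
        }
    }
    /* No matching entry: a real request would be sent and an MSHR
     * allocated here (omitted in this sketch).                        */
    return false;
}

int main(void)
{
    /* Pretend a prefetch for line 0x40 is already in flight. */
    mshrs[0] = (mshr_t){ .valid = true, .is_prefetch = true, .line_addr = 0x40 };

    printf("miss on 0x40 merged with in-flight prefetch? %d\n", demand_miss(0x40));
    printf("miss on 0x80 merged with in-flight prefetch? %d\n", demand_miss(0x80));
    return 0;
}
```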
And as we've already discussed, you can have significant cache and bandwidth pollution here. You can pollute the capacity of the cache and you can waste a lot of bandwidth. That's a problem if, for instance, you're on a multicore or many-core chip where off-chip memory bandwidth is really important and really limited; you may not actually want any prefetching turned on, because you might just be wasting bandwidth. Prefetching is a speculative-execution sort of thing: you're trying to pull in data, but it's speculative, you don't know 100% that you're going to use that data, and any time you bring something in that you don't use, you've effectively wasted bandwidth. So this is a big challenge.

From the bandwidth perspective, a good place to put a prefetcher is usually somewhere between, let's say, the level two cache and the level one cache, especially on the instruction side, because between those two levels you're basically pulling data over an on-chip bus. That bandwidth from your level two to your level one is relatively inexpensive: it's directly connected, and you can make it relatively wide. Coming out of the level two cache, though, or going from the last-level cache out to main memory, that's usually expensive bandwidth. So sometimes people build prefetchers that only prefetch from the level two cache into the level one cache, and if the line isn't in the level two cache, the prefetch just gets dropped. That's a pretty common strategy.

Let's start off by looking at the simple case of instruction prefetching in hardware, with a basic example: something like what they had in the Alpha 21064. It's a relatively basic processor, and we want to look at what they did for performance. They added an extra little buffer that they call a stream buffer. The stream buffer stores the next line that is likely to be used. So if you take a cache miss on the instruction side, for a particular instruction cache block, let's say block i in this example, the hardware decides that with high probability block i+1 is going to be used. But instead of polluting the instruction cache, in their design the subsequent cache block is stored into the stream buffer. So you don't actually pollute the cache; this is kind of like adding a little bit of extra associativity, but only for the next predicted line, if you will.

Okay, so in this example, you'd be executing out of the level one cache, and let's say the code just falls through, there's no branch at the end of the cache line, and you fall through. Well, the predictor did really well here: it effectively overlapped the cost of going out to the level two cache or main memory for that subsequent cache line. So the access hits in the stream buffer, the line in the stream buffer gets moved into the cache, and the hardware prefetch system says, aha, I'm now executing block i+1, maybe it's a good idea to go prefetch block i+2. Effectively what you're doing is overlapping the execution of block i with the fetch of block i+1. You can be doing two things at one time: the CPU can be executing from block i, while the cache hierarchy is moving data into the stream buffer for block i+1.
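To make that stream-buffer behavior a little more concrete, here is a toy C model of the next-line scheme. The tiny direct-mapped cache, the single-entry buffer, and the decision to start a prefetch on every fetch are simplifying assumptions for illustration; this is a sketch of the idea, not the actual Alpha 21064 logic.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define ICACHE_LINES 64   /* toy direct-mapped instruction cache */

static uint64_t icache[ICACHE_LINES];
static bool     icache_valid[ICACHE_LINES];

/* Single-entry stream buffer holding the predicted next block. */
static struct { bool valid; uint64_t block; } sb;

static bool icache_hit(uint64_t block)
{
    uint64_t idx = block % ICACHE_LINES;
    return icache_valid[idx] && icache[idx] == block;
}

static void icache_fill(uint64_t block)
{
    uint64_t idx = block % ICACHE_LINES;
    icache_valid[idx] = true;
    icache[idx] = block;
}

/* Model of fetching one instruction block; prints what the hierarchy does. */
static void ifetch(uint64_t block)
{
    if (icache_hit(block)) {
        printf("block %llu: I-cache hit\n", (unsigned long long)block);
    } else if (sb.valid && sb.block == block) {
        /* Stream-buffer hit: the earlier prefetch overlapped the L2 trip. */
        printf("block %llu: stream-buffer hit, moved into I-cache\n",
               (unsigned long long)block);
        icache_fill(block);
    } else {
        printf("block %llu: miss, fetched from L2/memory\n",
               (unsigned long long)block);
        icache_fill(block);
    }
    /* Speculatively start bringing block+1 into the stream buffer,
     * overlapping that fetch with the execution of the current block. */
    sb.valid = true;
    sb.block = block + 1;
}

int main(void)
{
    for (uint64_t b = 0; b < 4; b++)   /* straight-line code: i, i+1, ... */
        ifetch(b);
    return 0;
}
```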
And likewise, when you're executing block i+1, you might be fetching block i+2. Okay, so that's the instruction prefetch side of the world.

Life gets a little more complex when we look at data prefetching. Why is this? Well, data isn't as well structured as instructions. Instructions typically just keep falling through from one block of code to the next. In the data world, just because you've gone and accessed, let's say, address A, you don't necessarily know which address you're going to access next because of that. But you can try, and people have implemented a bunch of different prefetchers that help to some extent.

The most basic thing to do is something very similar to the instruction example: if you take a cache miss on, let's say, block b, you also fetch block b+1 into your cache. This often helps, though you have to be a little careful because you might be wasting capacity and bandwidth. If you're streaming through an array, this helps. If you're doing a memcpy, this helps. This is prefetch-on-miss. Note that we say prefetch-on-miss here, not prefetch-on-hit, which is a little different from our example on the instruction side: with prefetch-on-miss, it's only when you take a cache miss on b that you go fetch b+1. There are different strategies here, prefetch-on-hit versus prefetch-on-miss, and you might do either depending on the heuristics your hardware uses.

So that's prefetch-on-miss. Maybe instead you want to do one-block lookahead. This is basically the same thing we were doing on the instruction side: if you go access block b, and this does not necessarily mean a cache miss, you go fetch b+1 from your next level of cache. Okay, so the question that comes up is: is this the same thing as doubling your block size? That's interesting, because whenever you touch b, you prefetch b+1, which kind of looks like just making your block size bigger; you're always bringing in two blocks. But it's only a prefetch operation, so you could, for instance, have it only pull from your L2 into your L1 and not actually go out to your last-level cache. So there are different trade-offs available. Also, unlike just making the block size bigger, you now have two separate sets of tags; that's the downside of something like this one-block lookahead scheme.

But you can take this idea and extend it to an N-block lookahead. If you go access block b, or cache line b, you go fetch several blocks after that. The name of the game here is hiding memory latency, main memory latency. What would be really great is if, when you access block b, the hardware somehow determines, via heuristics, typically little watchers that observe what accesses you're doing and decide whether it's fruitful to do a prefetch, that it should start reading blocks b+1, b+2, b+3, b+4, all the way up to, let's say, b+N from main memory. And by doing this, you can overlap that memory latency with the work you're doing now,
so that when you go to access that next block, it's already there and you don't have to wait for the trip out to main memory. This can increase performance significantly.

Okay, so those are the basic schemes, but they don't always work well. Sometimes you have something like an array access where you're not touching every byte in the array; maybe you're accessing every Nth byte, or with some offset. How does that happen? Well, it's actually really common if you're accessing an array of structures, or an array of, let's say, class objects. Why? A structure has different fields at different offsets inside it, so if you have an array of structures, they're packed densely, and if you're reading some subfield of the structure for every element of the array, you get what's called a strided access. You're not accessing byte one, then byte two, then byte three, which would be unit stride; instead, you have some fixed offset as you move through the array. And if the structure is large enough, you may not actually want to fetch the next line and the next line and the next line; instead, you want to fetch just the lines that contain the portion of the structure you're going to access.

So what does this look like? Well, you need something like a strided hardware prefetcher: if the hardware sees a strided pattern happening, that's when it fires up the stride prefetcher. If you look at a modern-day Intel processor, they actually have stride detectors, little pieces of hardware that watch the cache access pattern, or the processor's access pattern, and say, aha, I see the processor accessing cache block b, then b+N, then b+2N, then b+3N; maybe we should go prefetch cache block b+4N, where N is the stride size. So you're not pulling in every block in a series, but rather blocks spaced out through memory with some fixed spacing. But as with everything, you have to be careful that this doesn't pollute your cache from a bandwidth or capacity perspective.

Let's take a look at an example: the POWER5. The POWER5 has a pretty sophisticated hardware-based prefetching and prediction scheme. It has eight independent stride prefetchers per processor, which can each prefetch up to 12 lines ahead of the current access. Wow, that's pretty impressive. They basically have little pieces of hardware figuring out what is likely to be accessed, based on what the program is doing, and trying to bring it in early to improve performance and reduce the cache miss latency, or the miss rate, if you will.

Okay, that's hardware prefetching. We can also think about software-based prefetching. The compiler can figure out, aha, or maybe the programmer writing the code can say, I know this loop is going to go around many times. So instead of waiting to take a cache miss when I really need the data, which is when I have to do the add here on, let's say, a[i] and b[i], we can say, okay, I know the next time around the loop I'm going to be accessing a[i+1] and b[i+1], so let's tell the memory system now that I'm going to be accessing a[i+1] and b[i+1], and get that operation started early.
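In code, that idea might look roughly like the sketch below, using the GCC/Clang __builtin_prefetch hint. The function and array names are made up for illustration, and prefetching only one element ahead is the naive placement whose timing we question next.

```c
#include <stddef.h>

/* c[i] = a[i] + b[i], with software prefetch hints for the data the next
 * iteration will need, so the memory access can overlap the add.
 * __builtin_prefetch is a GCC/Clang extension; other compilers provide
 * their own prefetch intrinsics.                                         */
void add_arrays(const double *a, const double *b, double *c, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (i + 1 < n) {
            __builtin_prefetch(&a[i + 1]);   /* hint only: no fault, no stall */
            __builtin_prefetch(&b[i + 1]);
        }
        c[i] = a[i] + b[i];
    }
}
```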
So we can effectively have the software help the hardware here by doing software prefetching. And there are two ways you can think about doing this. One way is that these prefetches are basically just ordinary loads that bring the line in. Alternatively, some architectures take a more hybrid hardware-software approach, where you have an actual prefetch instruction, and that instruction tells the hardware, or hints to the hardware, that you're likely to be using this data, but maybe you don't want to waste off-chip memory bandwidth on it. So only pull it in from, say, the L2 to the level one cache, but don't take a full cache miss for it, because this has some probability of not being correct and shouldn't take priority over true cache misses. If your prefetch were just a normal load operation, the hardware wouldn't be able to distinguish that case. So in some hybrid solutions, software prefetches turn into special instructions.

Okay, so what are the issues with software prefetching? Well, timing ends up being the biggest challenge here, not predictability. This loop here is very predictable: we know we're just going to stride through array a and array b. What we don't know is whether we'll end up prefetching too late or too early. You have to get the timing right. So what we're going to do is, instead of prefetching i+1, we're going to prefetch maybe i+P, where P is a settable parameter, some number of cache lines into the future. We want to set P so that the data is pulled in at the right time, early enough to cover the cache miss latency out to the next level of cache or maybe out to main memory, but not so early that we pollute the cache and cause conflicts somewhere else. If we set P huge, we'd be pulling data in way too early; if we set it to something bigger than the size of arrays a and b, it's very possible we'd just be polluting the cache with wrong data, if you will.

A few other things come up here. How do we estimate P? This is tricky. Why is it hard? Well, we don't quite know what the miss latency is, we don't quite know whether a and b are going to hit or miss, and we don't know whether the hardware prefetcher is going to fight against our software prefetcher. Things get complex because you just don't know the memory load, the actual congestion at the memory controller, or whether particular lines will miss or not. So you effectively have a dynamic number that's being determined at static compile time, we'll say. You also don't quite know what architecture you're going to be running on. Different architectures, or different microarchitectures for the same instruction set architecture, might have different cache miss latencies, so if you want to write one piece of software that works on all of them, there may not be a single correct value of P. So you have to think about that. Finally, you must consider the cost of the prefetch instructions themselves: these are not free, you have to add these two extra instructions to every iteration of the loop.
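Putting a prefetch distance P on that earlier loop, a hedged sketch might look like this. The value 16 and the rule of thumb in the comment (roughly the miss latency divided by the time for one loop iteration) are assumptions a programmer or compiler would have to make, for exactly the reasons just described; they are not values from this lecture's slides.

```c
#include <stddef.h>

/* A common rule of thumb (an assumption here) is to choose P so the
 * prefetch covers the miss latency:
 *     P ~= miss_latency_in_cycles / cycles_per_loop_iteration
 * Too small and the data arrives late; too large and we risk evicting
 * useful lines before we get to them.                                   */
#define PREFETCH_DISTANCE 16   /* "P": a static guess about dynamic behavior */

void add_arrays_pf(const double *a, const double *b, double *c, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n) {
            /* The two extra instructions per iteration the lecture mentions. */
            __builtin_prefetch(&a[i + PREFETCH_DISTANCE]);
            __builtin_prefetch(&b[i + PREFETCH_DISTANCE]);
        }
        c[i] = a[i] + b[i];
    }
}
```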
And if the performance benefit of having the prefetch instructions does not outweigh the cost of adding those two extra instructions, it's probably not a good trade-off; you probably shouldn't be doing something like this. Next lecture we're going to talk a little more about different software prefetching issues and other ways to rearrange code in the compiler to bring data in in a more structured manner.

But let's take a look at the efficacy of prefetching. Which of these metrics does it help or hurt? Well, first of all, prefetching can potentially help the miss penalty. When is the miss penalty made better? If you have a prefetch in flight to a piece of data that you then take a demand miss on, you've effectively cut the latency of that demand miss, the miss penalty, down to just finishing the prefetch that's already in flight. So you take that cost and make it a little shorter. Miss rate: the miss rate also gets better. Why? Well, if a prefetch occurs and brings data into your cache, and then you hit on that data, you have fewer misses out of your cache, so your miss rate is better. A couple of other things: hit time and bandwidth. Prefetching, if done poorly, might actually hurt your bandwidth, so there can be a negative there. And hit time is typically not affected by prefetching. Okay, we'll stop here for today.