So the last thing I wanted to talk about in today's lecture is prefetching, or the technique of prefetching. So what is prefetching? Well, you speculate that you're going to be using some instruction or data in the near future. You say, I think I'm going to be accessing some cache line soon. You haven't accessed it yet, the processor hasn't initiated that transaction yet, but because you have some reason to believe it's a good idea, you speculatively go fetch that data from memory, or from a farther-out level of cache, and bring it into a nearer level of cache. That's the basic idea of prefetching.

What does this look like? Well, there are a couple of different varieties of prefetching we're going to talk about today. We're going to talk about hardware prefetching, which is a completely hardware-based solution where the hardware tries to detect what would be a useful thing to bring in. We're going to talk about software prefetching, where the compiler or the programmer decides what piece of data would be useful to get early. And we can talk about mixed schemes, where software hints to the hardware about when it's a good idea to bring in data.

Okay, so an important question comes up on the slide here: what types of cache misses does prefetching affect? Well, we're bringing in data potentially before it's actually used. That sounds good, but we're also potentially polluting our cache with data we don't yet know we're going to need. So there can be downsides. Let's go through the three Cs. Capacity: does this affect capacity misses? Yes, it does, and in what is effectively a negative way. You've polluted your cache by prefetching something into it, so you can end up with a smaller effective capacity. And it's a similar story with conflict misses. If you have a really bad prefetcher, one that brings in data that is never accessed, or not accessed for a long time afterwards, you may be both polluting capacity and creating more conflicts. You bring the prefetched data into a line, and some other piece of data that gets brought in later has to evict it. That takes time, or at least the cache has to make a replacement decision, and it might even mess up the recency information in the replacement policy.

Okay, so why do we do this? We just said this increases our conflict misses. Where does prefetching help? Well, it can really help with compulsory misses: misses on cold data, data that is just not in your cache yet, so you have to go get it and bring it in. For that first-touch, compulsory miss, it would be really great if the prefetcher, either a hardware or a software prefetcher, could bring that data into your cache preemptively, and by doing so either reduce the latency or remove it completely. So how do we reduce that miss cost? Well, let's say you go to take a cache miss on a line and the prefetcher has already started to go get that line from main memory.
The miss status handling register could detect that there is already an outstanding request for that line, not send out a subsequent request, and just wait for that data to come back (there's a small code sketch of this idea a little further down). That can decrease the miss cost of a compulsory miss. Alternatively, the compulsory miss can just go away entirely. If the prefetcher is really on its game, it brings the data into your cache early enough that by the time you go to access it, it's already there; the prefetcher is clairvoyant, if you will, it already knew what data you were going to access in the future and pulled it in. So prefetching can have some pretty big benefits there. But like all things, including caches in general, you need to make sure that what you bring in is used with high probability. Otherwise, you're just going to pollute your cache.

One last thing I wanted to say here is that instruction caches are usually easier to predict for prefetching purposes than data caches. Why is this? Well, instructions are very well behaved. Typically you execute one instruction, then the next instruction, then the next instruction after that. Yes, there are branches, and programs do have a fair number of branches, but there's a decent chance you're just executing straight-line code, especially in something like a loop, where you execute straight-line code for a while, hit one branch at the bottom, and then go back up to the top. So a prefetcher can often be very effective for that. Data accesses, on the other hand, are usually a little more randomized. But not always: sometimes you're just streaming through data, something like a memcpy, or reading an entire array, and a prefetcher can be very effective there too.

So let's take a look at some of the issues with prefetching. We've already touched on some of these. In order for a prefetcher to be good, it has to be useful: it should produce hits in the cache. If the prefetcher is mostly producing misses, or bringing in stuff that, on the instruction side, never gets executed, or on the data side, never gets loaded or stored, it wasn't useful. Now, things get a little more complex when you start to think about usefulness, because you also have to think about the timeliness of the data. What do we mean by timeliness? Well, if you prefetch a piece of data into your cache but you do it too early, it's very possible that by the time you go to read the data, it has already been evicted. So if you bring it in too early, the prefetcher has just wasted bandwidth, and wasted energy going to the next level of cache or out to main memory, and that data doesn't end up being useful because it wasn't timely. You might have gotten the address right but gotten the data there too early, or you might have figured it out too late. How does that happen? Let's say the prefetcher determines that one line is really good, there's a high probability you're going to need it, so it decides to pull it in, but it makes that decision the cycle after the load to that line happens. Well, that didn't help. Even the cycle before wouldn't help, because you're effectively too late.
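Coming back to that miss status handling register idea for a second, here is a minimal C sketch of how a demand miss might get merged with an in-flight prefetch. The table size, the mshr_t structure, and the demand_miss function are illustrative assumptions, not the design of any particular processor.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NUM_MSHRS 8

/* One miss status handling register: tracks one outstanding request. */
typedef struct {
    bool     valid;        /* an access to this line is already in flight */
    bool     is_prefetch;  /* the in-flight request was a prefetch        */
    uint64_t line_addr;    /* cache-line-aligned address being fetched    */
} mshr_t;

static mshr_t mshrs[NUM_MSHRS];

/* Called on a demand miss. If a prefetch for the same line is already
 * outstanding, merge with it instead of issuing a second memory request;
 * the demand miss then only pays the *remaining* latency of the prefetch. */
bool demand_miss(uint64_t line_addr)
{
    for (int i = 0; i < NUM_MSHRS; i++) {
        if (mshrs[i].valid && mshrs[i].line_addr == line_addr) {
            /* Promote the in-flight prefetch to a demand request and
             * just wait for its data to come back.                     */
            mshrs[i].is_prefetch = false;
            return true;               /* merged: no new request sent */
        }
    }
    /* No matching entry: a real request would be sent and an MSHR
     * allocated here (omitted in this sketch).                        */
    return false;
}

int main(void)
{
    /* Pretend a prefetch for line 0x40 is already in flight. */
    mshrs[0] = (mshr_t){ .valid = true, .is_prefetch = true, .line_addr = 0x40 };

    printf("miss on 0x40 merged with in-flight prefetch? %d\n", demand_miss(0x40));
    printf("miss on 0x80 merged with in-flight prefetch? %d\n", demand_miss(0x80));
    return 0;
}
```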
And as we've already discussed, you can have significant cache and bandwidth pollution here. You can pollute the capacity of the cache and you can waste a lot of bandwidth. That's a problem if, for instance, you're on a multicore or many-core chip where off-chip memory bandwidth is really important and really limited; you may not actually want any prefetching turned on, because you might just be wasting bandwidth. Prefetching is a speculative-execution sort of thing: you're trying to pull in data, but it's speculative, you don't know 100% that you're going to use that data, and any time you bring something in that you don't use, you've effectively wasted bandwidth. So this is a big challenge.

From the bandwidth perspective, a good place to put a prefetcher is usually somewhere between, let's say, the level two cache and the level one cache, especially on the instruction side, because between those two levels you're basically pulling data over an on-chip bus. That bandwidth from your level two to your level one is relatively inexpensive: it's directly connected, and you can make it relatively wide. Coming out of the level two cache, though, or going from the last-level cache out to main memory, that's usually expensive bandwidth. So sometimes people build prefetchers that only prefetch from the level two cache into the level one cache, and if the line isn't in the level two cache, the prefetch just gets dropped. That's a pretty common strategy.

Let's start off by looking at the simple case of instruction prefetching in hardware, with a basic example: something like what they had in the Alpha 21064. It's a relatively basic processor, and we want to look at what they did for performance. They added an extra little buffer that they call a stream buffer. The stream buffer stores the next line that is likely to be used. So if you take a cache miss on the instruction side, for a particular instruction cache block, let's say block i in this example, the hardware decides that with high probability block i+1 is going to be used. But instead of polluting the instruction cache, in their design the subsequent cache block is stored into the stream buffer. So you don't actually pollute the cache; this is kind of like adding a little bit of extra associativity, but only for the next predicted line, if you will.

Okay, so in this example, you'd be executing out of the level one cache, and let's say the code just falls through, there's no branch at the end of the cache line, and you fall through. Well, the predictor did really well here: it effectively overlapped the cost of going out to the level two cache or main memory for that subsequent cache line. So the access hits in the stream buffer, the line in the stream buffer gets moved into the cache, and the hardware prefetch system says, aha, I'm now executing block i+1, maybe it's a good idea to go prefetch block i+2. Effectively what you're doing is overlapping the execution of block i with the fetch of block i+1. You can be doing two things at one time: the CPU can be executing from block i, while the cache hierarchy is moving data into the stream buffer for block i+1.
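To make that stream-buffer behavior a little more concrete, here is a toy C model of the next-line scheme. The tiny direct-mapped cache, the single-entry buffer, and the decision to start a prefetch on every fetch are simplifying assumptions for illustration; this is a sketch of the idea, not the actual Alpha 21064 logic.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define ICACHE_LINES 64   /* toy direct-mapped instruction cache */

static uint64_t icache[ICACHE_LINES];
static bool     icache_valid[ICACHE_LINES];

/* Single-entry stream buffer holding the predicted next block. */
static struct { bool valid; uint64_t block; } sb;

static bool icache_hit(uint64_t block)
{
    uint64_t idx = block % ICACHE_LINES;
    return icache_valid[idx] && icache[idx] == block;
}

static void icache_fill(uint64_t block)
{
    uint64_t idx = block % ICACHE_LINES;
    icache_valid[idx] = true;
    icache[idx] = block;
}

/* Model of fetching one instruction block; prints what the hierarchy does. */
static void ifetch(uint64_t block)
{
    if (icache_hit(block)) {
        printf("block %llu: I-cache hit\n", (unsigned long long)block);
    } else if (sb.valid && sb.block == block) {
        /* Stream-buffer hit: the earlier prefetch overlapped the L2 trip. */
        printf("block %llu: stream-buffer hit, moved into I-cache\n",
               (unsigned long long)block);
        icache_fill(block);
    } else {
        printf("block %llu: miss, fetched from L2/memory\n",
               (unsigned long long)block);
        icache_fill(block);
    }
    /* Speculatively start bringing block+1 into the stream buffer,
     * overlapping that fetch with the execution of the current block. */
    sb.valid = true;
    sb.block = block + 1;
}

int main(void)
{
    for (uint64_t b = 0; b < 4; b++)   /* straight-line code: i, i+1, ... */
        ifetch(b);
    return 0;
}
```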
And likewise, when you're executing block i+1, you might be fetching block i+2. Okay, so that's the instruction prefetch side of the world.

Life gets a little more complex when we look at data prefetching. Why is this? Well, data isn't as well structured as instructions. Instructions typically just keep falling through from one block of code to the next. In the data world, just because you've gone and accessed, let's say, address A, you don't necessarily know which address you're going to access next because of that. But you can try, and people have implemented a bunch of different prefetchers that help to some extent.

The most basic thing to do is something very similar to the instruction example: if you take a cache miss on, let's say, block b, you also fetch block b+1 into your cache. This often helps, though you have to be a little careful because you might be wasting capacity and bandwidth. If you're streaming through an array, this helps. If you're doing a memcpy, this helps. This is prefetch-on-miss. Note that we say prefetch-on-miss here, not prefetch-on-hit, which is a little different from our example on the instruction side: with prefetch-on-miss, it's only when you take a cache miss on b that you go fetch b+1. There are different strategies here, prefetch-on-hit versus prefetch-on-miss, and you might do either depending on the heuristics your hardware uses.

So that's prefetch-on-miss. Maybe instead you want to do one-block lookahead. This is basically the same thing we were doing on the instruction side: if you go access block b, and this does not necessarily mean a cache miss, you go fetch b+1 from your next level of cache. Okay, so the question that comes up is: is this the same thing as doubling your block size? That's interesting, because whenever you touch b, you prefetch b+1, which kind of looks like just making your block size bigger; you're always bringing in two blocks. But it's only a prefetch operation, so you could, for instance, have it only pull from your L2 into your L1 and not actually go out to your last-level cache. So there are different trade-offs available. Also, unlike just making the block size bigger, you now have two separate sets of tags; that's the downside of something like this one-block lookahead scheme.

But you can take this idea and extend it to an N-block lookahead. If you go access block b, or cache line b, you go fetch several blocks after that. The name of the game here is hiding memory latency, main memory latency. What would be really great is if, when you access block b, the hardware somehow determines, via heuristics, typically little watchers that observe what accesses you're doing and decide whether it's fruitful to do a prefetch, that it should start reading blocks b+1, b+2, b+3, b+4, all the way up to, let's say, b+N from main memory. And by doing this, you can overlap that memory latency with the work you're doing now,
so that when you go to access that next block, it's already there and you don't have to wait for the trip out to main memory. This can increase performance significantly.

Okay, so those are the basic schemes, but they don't always work well. Sometimes you have something like an array access where you're not touching every byte in the array; maybe you're accessing every Nth byte, or with some offset. How does that happen? Well, it's actually really common if you're accessing an array of structures, or an array of, let's say, class objects. Why? A structure has different fields at different offsets inside it, so if you have an array of structures, they're packed densely, and if you're reading some subfield of the structure for every element of the array, you get what's called a strided access. You're not accessing byte one, then byte two, then byte three, which would be unit stride; instead, you have some fixed offset as you move through the array. And if the structure is large enough, you may not actually want to fetch the next line and the next line and the next line; instead, you want to fetch just the lines that contain the portion of the structure you're going to access.

So what does this look like? Well, you need something like a strided hardware prefetcher: if the hardware sees a strided pattern happening, that's when it fires up the stride prefetcher. If you look at a modern-day Intel processor, they actually have stride detectors, little pieces of hardware that watch the cache access pattern, or the processor's access pattern, and say, aha, I see the processor accessing cache block b, then b+N, then b+2N, then b+3N; maybe we should go prefetch cache block b+4N, where N is the stride size. So you're not pulling in every block in a series, but rather blocks spaced out through memory with some fixed spacing. But as with everything, you have to be careful that this doesn't pollute your cache from a bandwidth or capacity perspective.

Let's take a look at an example: the POWER5. The POWER5 has a pretty sophisticated hardware-based prefetching and prediction scheme. It has eight independent stride prefetchers per processor, which can each prefetch up to 12 lines ahead of the current access. Wow, that's pretty impressive. They basically have little pieces of hardware figuring out what is likely to be accessed, based on what the program is doing, and trying to bring it in early to improve performance and reduce the cache miss latency, or the miss rate, if you will.

Okay, that's hardware prefetching. We can also think about software-based prefetching. The compiler can figure out, aha, or maybe the programmer writing the code can say, I know this loop is going to go around many times. So instead of waiting to take a cache miss when I really need the data, which is when I have to do the add here on, let's say, a[i] and b[i], we can say, okay, I know the next time around the loop I'm going to be accessing a[i+1] and b[i+1], so let's tell the memory system now that I'm going to be accessing a[i+1] and b[i+1], and get that operation started early.
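In code, that idea might look roughly like the sketch below, using the GCC/Clang __builtin_prefetch hint. The function and array names are made up for illustration, and prefetching only one element ahead is the naive placement whose timing we question next.

```c
#include <stddef.h>

/* c[i] = a[i] + b[i], with software prefetch hints for the data the next
 * iteration will need, so the memory access can overlap the add.
 * __builtin_prefetch is a GCC/Clang extension; other compilers provide
 * their own prefetch intrinsics.                                         */
void add_arrays(const double *a, const double *b, double *c, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (i + 1 < n) {
            __builtin_prefetch(&a[i + 1]);   /* hint only: no fault, no stall */
            __builtin_prefetch(&b[i + 1]);
        }
        c[i] = a[i] + b[i];
    }
}
```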
So we can effectively have the software help the hardware here by doing software prefetching. And there are two ways you can think about doing this. One way is that these prefetches are basically just ordinary loads that bring the line in. Alternatively, some architectures take a more hybrid hardware-software approach, where you have an actual prefetch instruction, and that instruction tells the hardware, or hints to the hardware, that you're likely to be using this data, but maybe you don't want to waste off-chip memory bandwidth on it. So only pull it in from, say, the L2 to the level one cache, but don't take a full cache miss for it, because this has some probability of not being correct and shouldn't take priority over true cache misses. If your prefetch were just a normal load operation, the hardware wouldn't be able to distinguish that case. So in some hybrid solutions, software prefetches turn into special instructions.

Okay, so what are the issues with software prefetching? Well, timing ends up being the biggest challenge here, not predictability. This loop here is very predictable: we know we're just going to stride through array a and array b. What we don't know is whether we'll end up prefetching too late or too early. You have to get the timing right. So what we're going to do is, instead of prefetching i+1, we're going to prefetch maybe i+P, where P is a settable parameter, some number of cache lines into the future. We want to set P so that the data is pulled in at the right time, early enough to cover the cache miss latency out to the next level of cache or maybe out to main memory, but not so early that we pollute the cache and cause conflicts somewhere else. If we set P huge, we'd be pulling data in way too early; if we set it to something bigger than the size of arrays a and b, it's very possible we'd just be polluting the cache with wrong data, if you will.

A few other things come up here. How do we estimate P? This is tricky. Why is it hard? Well, we don't quite know what the miss latency is, we don't quite know whether a and b are going to hit or miss, and we don't know whether the hardware prefetcher is going to fight against our software prefetcher. Things get complex because you just don't know the memory load, the actual congestion at the memory controller, or whether particular lines will miss or not. So you effectively have a dynamic number that's being determined at static compile time, we'll say. You also don't quite know what architecture you're going to be running on. Different architectures, or different microarchitectures for the same instruction set architecture, might have different cache miss latencies, so if you want to write one piece of software that works on all of them, there may not be a single correct value of P. So you have to think about that. Finally, you must consider the cost of the prefetch instructions themselves: these are not free, you have to add these two extra instructions to every iteration of the loop.
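Putting a prefetch distance P on that earlier loop, a hedged sketch might look like this. The value 16 and the rule of thumb in the comment (roughly the miss latency divided by the time for one loop iteration) are assumptions a programmer or compiler would have to make, for exactly the reasons just described; they are not values from this lecture's slides.

```c
#include <stddef.h>

/* A common rule of thumb (an assumption here) is to choose P so the
 * prefetch covers the miss latency:
 *     P ~= miss_latency_in_cycles / cycles_per_loop_iteration
 * Too small and the data arrives late; too large and we risk evicting
 * useful lines before we get to them.                                   */
#define PREFETCH_DISTANCE 16   /* "P": a static guess about dynamic behavior */

void add_arrays_pf(const double *a, const double *b, double *c, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n) {
            /* The two extra instructions per iteration the lecture mentions. */
            __builtin_prefetch(&a[i + PREFETCH_DISTANCE]);
            __builtin_prefetch(&b[i + PREFETCH_DISTANCE]);
        }
        c[i] = a[i] + b[i];
    }
}
```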
And if the performance benefit of having the prefetch instructions does not outweigh the cost of adding those two extra instructions, it's probably not a good trade-off; you probably shouldn't be doing something like this. Next lecture we're going to talk a little more about different software prefetching issues and other ways to rearrange code in the compiler to bring data in in a more structured manner.

But let's take a look at the efficacy of prefetching. Which of these metrics does it help or hurt? Well, first of all, prefetching can potentially help the miss penalty. When is the miss penalty made better? If you have a prefetch in flight to a piece of data that you then take a demand miss on, you've effectively cut the latency of that demand miss, the miss penalty, down to just finishing the prefetch that's already in flight. So you take that cost and make it a little shorter. Miss rate: the miss rate also gets better. Why? Well, if a prefetch occurs and brings data into your cache, and then you hit on that data, you have fewer misses out of your cache, so your miss rate is better. A couple of other things: hit time and bandwidth. Prefetching, if done poorly, might actually hurt your bandwidth, so there can be a negative there. And hit time is typically not affected by prefetching. Okay, we'll stop here for today.