So the last thing I wanted to talk about in today's lecture is prefetching, or the technique of prefetching.

So what is prefetching? Well, you speculate that you're going to be using some, let's say, instruction or data in the near future. So you can say: I think I'm going to be accessing some cache line in the near future. Now, you haven't accessed it yet; the processor hasn't initiated that transaction yet. But because you have some reason to believe it's a good idea, you speculatively go and fetch that data from either memory or a farther-out level of cache, and bring it into a nearer level of cache. So this is the basic idea of prefetching.

Well, what does this look like? There are a couple of different varieties of prefetching we're going to be talking about today. We're going to talk about hardware prefetching, which is a completely hardware-based solution where the hardware tries to detect what is a useful thing to go bring in. We're going to talk about software prefetching, where the compiler or the programmer decides what is a useful piece of data to go get early. And we can talk about mixed schemes, where software hints to hardware, potentially, about when it's a good idea to go bring in data.

Okay, so an important question comes up on the slide here: what types of cache misses does prefetching affect?

Hm? What does it affect? Well, we're bringing in data potentially before it's actually used. So that sounds good; this all sounds good. But we're also potentially polluting our cache with data that we don't yet know we're going to need. So there can be downsides to this. So let's go through the three Cs.

Capacity: does this affect capacity? Well, yes, it actually affects capacity, in what is effectively a negative way. You've effectively polluted your cache, because you've prefetched something into your cache, and you could potentially make the cache have a smaller effective capacity because of this.

And this actually trades off; it's a similar sort of thing with conflict misses.
You're effectively putting data into your cache. If you have a really bad prefetcher, a prefetcher which goes and brings in data that is never accessed, we'll say, or not accessed in a short period of time afterwards, you may actually be both polluting capacity and creating more conflicts. You bring it into a line, and because of this, some other piece of data that gets brought in has to go evict it now. And it takes time to go evict it, or the cache at least has to make some decision, and it might even mess up the least recently used policy, the replacement policy.

Okay, so why do we do this? We said this is bad for conflict misses, or increases our conflict misses. Where does prefetching help? Well, it can actually really help with compulsory misses. So, misses where you just miss on effectively cold data. The data's just not in your cache; you have to go get it and bring it in. So let's say first-touch data, a compulsory miss. It would be really great if the prefetch system, either a hardware or a software prefetcher, could bring that data into your cache preemptively, and by bringing that data in preemptively, either reduce the latency or remove the latency of compulsory misses completely.

So how do we reduce that miss cost? Well, let's say you go to take a cache miss on a line, and the prefetcher has already sort of started to go get that line from main memory. The miss status handling registers can detect that there's already a request in flight for that line, not actually send out a subsequent request, and just wait for that data to come back. And this can decrease the latency of a compulsory miss, or decrease the miss cost of a compulsory miss.

Alternatively, the compulsory miss can just go away. If the prefetcher is really on its game, it can bring the data into your cache, and by the time you go to access that data, it's clairvoyant, if you will: it already knew the data you were going to access in the future, and it's already pulled it in. When you go to access it, it's already there.

So prefetching can have some pretty big benefits there.
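To make that miss-merging idea concrete, here is a minimal sketch in C of a demand miss merging with an in-flight prefetch in the miss status handling registers. The structure layout, the table size, and the helper function are assumptions for illustration, not any particular machine's design.

    #include <stdbool.h>
    #include <stdint.h>

    #define NUM_MSHRS 8    /* made-up number of outstanding misses */

    /* Hypothetical miss status handling register: one per request
       outstanding to the next level of the memory hierarchy. */
    typedef struct {
        bool     valid;
        uint64_t block_addr;
        bool     is_prefetch;   /* launched speculatively? */
    } mshr_t;

    static mshr_t mshrs[NUM_MSHRS];

    extern void send_request_to_memory(uint64_t block);  /* assumed hook */

    /* A demand miss arrives for 'block'. If a prefetch of the same line
       is already in flight, merge with it rather than issuing a second
       request; the miss then only pays the prefetch's remaining latency. */
    void handle_demand_miss(uint64_t block)
    {
        for (int i = 0; i < NUM_MSHRS; i++) {
            if (mshrs[i].valid && mshrs[i].block_addr == block) {
                mshrs[i].is_prefetch = false;  /* upgrade to demand */
                return;                        /* merged: no new request */
            }
        }
        for (int i = 0; i < NUM_MSHRS; i++) {
            if (!mshrs[i].valid) {
                mshrs[i] = (mshr_t){ true, block, false };
                send_request_to_memory(block);
                return;
            }
        }
        /* All MSHRs busy: a real design would stall the pipeline here. */
    }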
But like all things, including caches in general, you need to make sure that what you bring in is used with high probability. Otherwise, you're just going to pollute your cache.

One last thing I wanted to say, and we're going to see this: instruction caches are usually easier to predict for prefetching purposes than data caches. Why is this? Well, instructions are very well behaved. Typically you execute one instruction, then you execute the next instruction, then you execute the next instruction after it. And yes, there are branches, and programs do have a fair number of branches. But there's a decent chance you're just going to be executing straight-line code, especially in something like a loop, where you execute straight-line code for a while, hit one branch at the bottom, and then go back up to the top. So your prefetcher can many times be very effective for that.

Data, on the other hand: usually data misses and data accesses are a little bit more randomized. But not always. Sometimes you're just going to be accessing, let's say, straight-line data, something like a memcpy, or something like taking an array and reading the whole array, and a prefetcher can be very effective there also.

So let's take a look at some of the issues with prefetching. We've already touched on some of these things. In order for a prefetcher to be good, it has to be useful. So it should produce hits in the cache. If the prefetcher is actually producing mostly misses in the cache, or bringing in stuff that, on the instruction side, never gets executed, or, on the data side, never gets touched by a load or a store, it wasn't useful.

Now, things get a little bit more complex here when you start to think about usefulness. You have to start thinking about timeliness of the data. So what do we mean by timeliness? Well, if you prefetch a piece of data into your cache, but you do it too early, it's very possible that by the time you go to read the data, it might have gotten evicted from the cache.
So if you bring it in too early, the prefetcher might have just wasted bandwidth and wasted energy going to the next level of cache or out to main memory, and that data may not end up being useful because it wasn't timely. You might have gotten the address right but gotten it there too early. Or the prefetcher might figure it out too late. So how does this happen? Let's say you determine that this one line is really good and there's a high probability you're going to need it, so you decide you want to go pull it in. But the prefetcher makes that decision the cycle after the load to that line happens. Well, that didn't help. Maybe even the cycle before, it just wouldn't help, because you're too late, effectively.

And as we've already discussed, you can have significant cache and bandwidth pollution here. So you can pollute the capacity of the cache and you can waste a lot of bandwidth. And this is a problem if, for instance, you're on a multicore chip or many-core chip, and off-chip memory bandwidth is really important and a really limited resource; you may not actually want to have any prefetch turned on, because you might just be wasting bandwidth. So it's a speculative-execution sort of thing: you're trying to pull in data, but it's speculative. You don't know 100% that you're going to be using that data, and any time you bring something in that you don't use, you've effectively wasted bandwidth. So this is a big challenge here.

From the bandwidth perspective, usually a good place to put a prefetcher is somewhere between, let's say, a level-two cache and a level-one cache, especially maybe on the instruction side, with the prefetcher basically pulling data into the level one. And that bandwidth from your level one to your level two is relatively inexpensive, because it's an on-chip bus, directly connected; you can make it relatively large. Coming out of the level-two cache, though, or going out of, let's say, the last-level cache to main memory, that's usually expensive bandwidth. So sometimes people come up with prefetchers that only prefetch from a level-two cache to a level-one cache.
But if it's not in the level-two cache, the prefetch just gets dropped. So that's a pretty common strategy here.

Let's start off by looking at the simple case here of instruction prefetching in hardware, with a basic example: something like what they had in the Alpha 21064. So it's a relatively basic processor, and we want to look at what they did for performance.

They actually added an extra little buffer here that they're going to call a stream buffer. This stream buffer is going to store the next line that is likely to be used. So if you take a cache miss, this is on the instruction side, for a particular instruction cache block, let's say block i in this example, the hardware decides that with high probability block i + 1 is probably going to be used. But instead of polluting the instruction cache, in their design the subsequent cache block is actually stored into the stream buffer. So you don't actually pollute the cache. This is kind of like adding a little bit of extra associativity, only for the next predicted line, if you will.

Okay, so in this example here, what would happen is, you'd be executing out of the level-one cache. And let's say the code just falls through: there's no branch at the end of the cache line, and you fall through. Well, the predictor actually did really well here, and it effectively overlapped the cost of going out to the level-two cache or main memory for that subsequent cache line. So it hits in the stream buffer, the stream buffer entry gets moved into the cache, and the hardware prefetch system says: aha, I'm now executing block i + 1; maybe it's a good idea to go and prefetch block i + 2.

And effectively what you're doing here is overlapping the execution of block i with the fetch of block i + 1. So you can be doing two things at one time. You can have the CPU executing, excuse me, fetching from block i, and you can be using the cache, moving data into the stream buffer for block i + 1. And likewise, when you're executing, let's say, block i + 1, you might be fetching block i + 2. Okay, so that's the instruction prefetch side of the world.
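To pin down the mechanics, here's a minimal sketch in C of a one-entry stream buffer along the lines just described. The helper functions and the 64-byte line size are assumptions for illustration; this is not the 21064's actual logic.

    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    /* Hypothetical one-entry stream buffer sitting beside the I-cache. */
    typedef struct {
        bool     valid;
        uint32_t block_addr;     /* which instruction block it holds */
        uint8_t  data[64];       /* one cache line of instructions */
    } stream_buffer_t;

    static stream_buffer_t sb;

    /* Assumed stand-ins for the real cache and memory machinery. */
    extern bool icache_lookup(uint32_t block, uint8_t *line);
    extern void icache_fill(uint32_t block, const uint8_t *line);
    extern void memory_fetch(uint32_t block, uint8_t *line);

    /* Fetch instruction block i, consulting the stream buffer on a miss. */
    void ifetch_block(uint32_t i, uint8_t *line)
    {
        if (icache_lookup(i, line))
            return;                          /* ordinary I-cache hit */

        if (sb.valid && sb.block_addr == i) {
            /* Stream-buffer hit: move the line into the cache. */
            icache_fill(i, sb.data);
            memcpy(line, sb.data, sizeof sb.data);
        } else {
            /* True miss: go get block i itself. */
            memory_fetch(i, line);
            icache_fill(i, line);
        }

        /* Either way, request block i + 1 into the stream buffer, not
           the cache; in hardware this proceeds in the background,
           overlapped with the execution of block i. */
        sb.block_addr = i + 1;
        memory_fetch(sb.block_addr, sb.data);
        sb.valid = true;
    }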
Life gets a little more complex when we start to look at data prefetching. Now, why is this? Well, data is not as well structured as instructions. With instructions, typically the blocks of code just keep falling through. In the data world, you don't necessarily know, just because you've gone in and accessed, let's say, address a, that you will go access some other particular address because of that.

But you can try. So people have implemented a bunch of different prefetchers that help to some extent. The most basic thing to do is actually something very similar to what we had in the instruction example. If you go and take a cache miss on, let's say, block b, you go also fetch block b + 1 into your cache. And this tends to help. You have to be a little bit careful here, because you might be wasting capacity and bandwidth, but in many cases this is helpful. So if you're streaming through an array, this helps. If you're doing a memcpy, this helps. This is prefetch-on-miss.

Note that we say prefetch-on-miss here, not prefetch-on-hit, which is a little bit different from our example in the instruction space there. With prefetch-on-miss, it's only on a cache miss that you go, let's say, fetch b + 1. Now, there are different strategies here, prefetch on hit and prefetch on miss, and you might think about doing either of these depending on the heuristics your hardware comes up with.

So that's prefetch-on-miss. Maybe you actually want to do one-block lookahead instead. This is basically the same thing we were doing on the instruction side: if you go access, let's say, block b, and this means any access, not necessarily a cache miss, you go and actually fetch b + 1 from your next level of cache. A small sketch of these two policies follows.
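Here's a minimal sketch in C contrasting the two policies; the cache-interface functions are assumed stand-ins, not a real simulator's API.

    #include <stdbool.h>
    #include <stdint.h>

    /* Assumed hooks into the data cache and the next level down. */
    extern bool dcache_lookup(uint32_t block);            /* true on hit */
    extern void dcache_fill_from_next_level(uint32_t block);

    /* Prefetch-on-miss: only a demand miss on block b triggers
       the speculative fetch of b + 1. */
    void access_prefetch_on_miss(uint32_t b)
    {
        if (!dcache_lookup(b)) {
            dcache_fill_from_next_level(b);        /* demand fill */
            dcache_fill_from_next_level(b + 1);    /* speculative fill */
        }
    }

    /* One-block lookahead: every access to block b, hit or miss,
       prefetches b + 1 if it isn't already resident. */
    void access_one_block_lookahead(uint32_t b)
    {
        if (!dcache_lookup(b))
            dcache_fill_from_next_level(b);
        if (!dcache_lookup(b + 1))
            dcache_fill_from_next_level(b + 1);
    }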
Hm, okay, so the question that comes up is: is this one-block lookahead the same thing as doubling your block size? Well, that's interesting, because whenever you touch b, you effectively go prefetch b + 1. This kind of looks like just making your block size bigger, because you're always bringing in two blocks. But it's only a prefetch operation, so you could think about having it, for instance, just pull from your L2 into your L1 in this example, and not actually have to go out to your last level of cache. So you could think about having different trade-offs there. Also, the downside of this is that it's not just making your block size bigger; in fact, we have two sets of tags here. That's the downside of having something like this one-block lookahead scheme.

But you can think about taking this example and actually extending it to an N-block lookahead. So if you go to access block, let's say, b, or cache line b, you go fetch a couple after that. And the name of the game here is that you're trying to hide memory latency, main memory latency. So what would be really great is if, when you go to access, let's say, block b, the hardware somehow determines via heuristics (typically how you implement these things is with little watchers that watch what accesses you're doing and then decide it's fruitful to actually go do a prefetch) that you should say: I'm accessing block b, I should start trying to read from main memory blocks b+1, b+2, b+3, b+4, all the way up to, let's say, big N here. And by doing this, you can overlap those fetches with execution, so that when you try to access the next block, it will already have arrived, and you don't have to wait for the trip out to main memory. This can actually increase our performance significantly.

Okay, so those are the basic examples, but they don't always work well. Sometimes you might have something like an array access where you're not touching every byte in the array. Maybe you're accessing every Nth byte in the array, or some offset. Well, how does this happen? Well, this is actually really common if you're trying to access an array of structures, or an array of, let's say, class objects. Why is this? Well, if you have a structure, you have different fields in the structure at different offsets inside the structure. So if you have an array of them, they're packed densely, and if you're reading, let's say, some subfield of the structure while reading every element of that array, you're actually going to have what's called a strided access, as in the small example that follows.
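Here's a small C illustration of why an array of structures produces strided accesses; the struct and its field names are made up for the example.

    /* A made-up 64-byte record, so consecutive elements start one
       typical cache line apart. */
    struct particle {
        double pos[3];    /* bytes  0..23 */
        double vel[3];    /* bytes 24..47 */
        double mass;      /* bytes 48..55 */
        double charge;    /* bytes 56..63 */
    };

    double total_mass(const struct particle *p, int n)
    {
        double sum = 0.0;
        /* Reading only the mass field of every element touches byte
           offsets 48, 112, 176, ...: a strided pattern with a 64-byte
           stride, not unit-stride access through every byte. */
        for (int i = 0; i < n; i++)
            sum += p[i].mass;
        return sum;
    }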
So you're not going to be accessing byte one and then byte two and then byte three, what's called unit stride. Instead, you're going to have some offset as you move through the array. And if the structure is large enough, what might end up happening is, you may not actually want to go and fetch the next line and the next line and the next line. Instead, you might want to fetch just the lines that contain the portion of the structure that you want to go access.

So, what does this look like? Well, you need something like a strided hardware prefetcher. So if you see a pattern happening, typically that's when you fire up your strided prefetcher. So if you go and look at a modern-day Intel processor, they actually have stride detectors. These are little pieces of hardware which watch the cache access pattern, or the processor access pattern, and will say: aha, I see that the processor is accessing cache block b, cache block b + N, cache block b + 2N, cache block b + 3N; maybe we should go prefetch cache block b + 4N, where N is the stride size. So you may not be pulling in every block in a series, but instead pulling in blocks that are spaced out through memory with some spacing. But, as with everything, you have to be careful that this doesn't pollute your cache from a bandwidth or a capacity perspective.

So let's take a look at an example here: the POWER5. The POWER5 has a pretty sophisticated hardware-based prefetching and prediction scheme. They actually have eight independent stride prefetchers per processor, which can each prefetch up to 12 lines ahead of the current access. Wow, so that's pretty impressive. They basically have little pieces of hardware that see what is a likely thing to go access, based on what the program is doing, and try to bring it in early to improve performance, and reduce the cache miss latency, and reduce the cache miss time or rate, if you will.

Okay. That's hardware prefetching; a tiny sketch of such a stride detector follows.
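Here's a minimal sketch in C of the kind of stride detector just described; the confidence threshold and entry layout are made-up parameters, not Intel's or IBM's actual design.

    #include <stdint.h>

    /* One hypothetical stride-detector entry: remembers the last block
       accessed and the stride between the last two accesses. */
    typedef struct {
        uint64_t last_block;
        int64_t  stride;
        int      confidence;   /* consecutive repeats of the same stride */
    } stride_entry_t;

    #define CONF_THRESHOLD 2   /* made-up: two repeats arm the prefetcher */

    extern void prefetch_block(uint64_t block);   /* assumed hook */

    /* Called on each demand access to cache block 'b'. After seeing
       b, b + N, b + 2N, b + 3N, it starts prefetching one stride ahead. */
    void stride_detector_observe(stride_entry_t *e, uint64_t b)
    {
        int64_t s = (int64_t)(b - e->last_block);

        if (s != 0 && s == e->stride) {
            if (++e->confidence >= CONF_THRESHOLD)
                prefetch_block(b + (uint64_t)s);   /* guess b + N next */
        } else {
            e->stride = s;          /* retrain on the new stride */
            e->confidence = 0;
        }
        e->last_block = b;
    }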
We can also think about having software-based prefetching. The compiler can figure out, aha, or maybe the programmer who's writing the code can say: I know this access is going to miss many times. So instead of waiting to take a cache miss when I really, really need the data, which is when I have to go do the add here with, let's say, a[i] and b[i], we can say: okay, I know the next time around the loop I'm going to be accessing a[i+1] and b[i+1], so let's tell the memory system that I'm going to be accessing a[i+1] and b[i+1], and get that operation out there early.

So we can effectively have the software help the hardware here by doing a complete software prefetch. And there are sort of two ways you can think about doing this. One way is that these prefetches are basically just loads that bring the line in early. Alternatively, some architectures have a more hybrid hardware-software approach, where you actually have a prefetch instruction, and the prefetch instruction tells the hardware, or hints to the hardware, that it's likely you're going to be using this data, but maybe you don't want to waste off-chip memory bandwidth for it. So only pull it in, let's say, from the L2 to the level-one cache, but don't go take a full cache miss for it, because it has some probability of not being correct and shouldn't take priority over true cache misses. And if you had just a normal load operation for your prefetch, you wouldn't be able to express that case. So in some hybrid solutions, software prefetches will actually turn into special instructions.

Okay, so what are the issues with software prefetching? Well, timing actually ends up being the biggest challenge here, not predictability. This loop here is very predictable: we know we're just going to stride through array a and array b. What we don't know is whether we'll end up being too late or too early. So you have to get the timing right when you actually go to access these different things. So what we're going to do here is, instead of accessing i+1, we're going to access maybe i+P. And P is a settable parameter, where you can say: this is a few cache lines into the future; we can pull it in. A sketch of the loop is below.
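As a concrete version of the loop being described, here's a minimal sketch in C using GCC/Clang's __builtin_prefetch; the arrays, the loop body, and the value of P are illustrative stand-ins for whatever is on the slide.

    #include <stddef.h>

    #define P 16   /* prefetch distance in elements: a tunable guess */

    /* c[i] = a[i] + b[i], with software prefetches issued P iterations
       ahead so the lines arrive before the demand accesses need them. */
    void vadd(double *c, const double *a, const double *b, size_t n)
    {
        for (size_t i = 0; i < n; i++) {
            if (i + P < n) {
                /* Hints only; hardware may drop them. 0 = read access. */
                __builtin_prefetch(&a[i + P], 0);
                __builtin_prefetch(&b[i + P], 0);
            }
            c[i] = a[i] + b[i];
        }
    }

A rough rule of thumb, with made-up numbers: if a miss to main memory costs about 200 cycles and each iteration takes about 4 cycles, you'd want P around 200 / 4 = 50 elements. As discussed next, picking this value well is the hard part.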
And we want to set that correctly, so that we pull the data in at the right time, so that we can cover the cache miss latency out to either the next level of cache or maybe out to main memory, but not so early that we actually pollute the cache and cause conflicts somewhere else. If we set P here huge, we'd potentially be pulling data in way too early. So if we set this to, let's say, bigger than the size of arrays a and b, it's very possible that we would actually be polluting the cache with the wrong data, if you will.

A few other things come up here. How do we go about trying to estimate P? Well, this is a tricky thing. Why is this hard? Well, we don't quite know what the actual miss latency is, and we don't quite know whether a and b are going to hit or miss, or whether the hardware prefetcher is going to fight against our software prefetcher. And things get a little complex here, and it's really just because you don't know the memory load, the actual congestion at the memory controller, or whether particular lines miss or not. So you effectively have a dynamic number which is being determined at static compile time, we'll say. And also, you don't quite know what architecture you're going to be running on. Different architectures, or different microarchitectures for the same instruction set architecture, might have different cache miss latencies, so if you want to write one piece of software that works on all of them, there may not be a single correct value of P.

So you have to think about that. Finally, I wanted to say that you must consider the cost of the prefetch instructions. These are not free; you have to add these two extra instructions here, and if the performance benefit of having the prefetch instructions does not outweigh the cost of adding two extra instructions, it's probably not a good trade-off. You probably shouldn't be doing something like this.

So, next lecture we're going to talk a little bit more about different software prefetching issues, and other ways to rearrange code in the compiler to effectively bring data in in a more structured manner. But let's take a look at the efficacy of prefetching. So which of these things does it help or hurt?
Well, what we're going to see is, first of all, we're going to potentially help the miss penalty. So when is the miss penalty helped? When is the miss penalty made better? Well, if you have a cache, or excuse me, if you have a prefetch in flight to a piece of data that you then have a demand miss for, you effectively cut the latency for that demand miss, the miss penalty, down to just finishing the currently in-flight prefetch. So you can take that cost and make it a little bit shorter.

Miss rate? Miss rate is also better. Why is the miss rate better? Well, if you have a prefetch occur that brought data into your cache, and you then hit on that data, you just have fewer misses out of your cache, so your miss rate is better.

A couple of other things here: hit time and bandwidth. Prefetching, if done poorly, might actually hurt your bandwidth, so there can be a negative there if done poorly. And hit time is typically not affected by prefetching.

Okay, we'll stop here for today.