Welcome to our next installment of ELE475. Today we are going to be continuing our discussion of advanced cache techniques. And one of the big things we are going to get into is talking about how to build out-of-order memory systems. What I mean by out-of-order memory systems is, basically, that you can have memory operations outstanding in the memory system, but they execute out of order. And you can mix this with an in-order processor pipeline, or an out-of-order processor pipeline; it's the designer's choice. Both sorts of designs have been built, have been successfully built. But, and this is important, this allows you to effectively get more memory parallelism, because you can have transactions going out to the memory system in parallel, and the processor can continue and overlap computation with the memory.

So, just to recap from last lecture: we have a revised agenda here, now with all of our different cache optimization techniques, or advanced optimization techniques. We talked about pipelining caches, and we talked about adding a write buffer, last lecture. We talked about the reason and the motivation for multi-level caches: two-level caches, three-level caches, maybe even four-level caches, backed by main memory. We talked about having victim caches, which are extra little highly associative caches that you can put next to a low-associativity cache, or even a direct-mapped cache. And effectively this can increase your performance quite a bit, because once you miss in your direct-mapped cache, you can go out to your victim cache, and the victim cache may have the data you're looking for. This helps with small numbers of lines which are highly contended and might blow out, if you will, the amount of associativity you have in your cache. We talked about hardware and software prefetching. And this is where we stopped last time; we stopped at hardware and software prefetching.
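To make the victim cache recap concrete, here is a minimal sketch of the lookup path on a miss. Everything here, the four-entry size, the line_t layout, and the function names, is an illustrative assumption, not a particular machine's design:

    /* Minimal victim cache model: a tiny, fully associative side cache
       probed after a miss in the main direct-mapped cache. */
    typedef struct { int valid; unsigned tag; char data[64]; } line_t;

    #define VC_ENTRIES 4              /* small and highly associative   */
    static line_t victim[VC_ENTRIES];

    line_t *victim_lookup(unsigned tag)
    {
        for (int i = 0; i < VC_ENTRIES; i++)    /* hardware compares all */
            if (victim[i].valid && victim[i].tag == tag) /* in parallel  */
                return &victim[i];              /* hit in victim cache   */
        return 0;                               /* miss: next level down */
    }

In a real design the hit line is typically swapped back into the main cache, and the line it displaces becomes the new victim; that exchange is the part this sketch leaves out.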
And today we have a couple of extra topics. We are going to talk about how to bank your cache as a technique to get more bandwidth. The easy technique to get more bandwidth is just to put multiple ports on your cache, but that gets quite expensive, because you have to duplicate, possibly, your decode logic; you probably have to duplicate the write logic; and in the worst case, you may have to duplicate the storage space. In today's lecture we're also gonna talk about some software optimizations that the compiler can do. We're gonna start by talking about some software optimizations that are easy to do without transforming the code too horribly. And then we're gonna talk about some that actually require some pretty heavy lifting with respect to code transformations, and what you have to do to reorganize the code, still in valid ways. But by doing so, you can actually have better memory performance and better cache performance. And then we're gonna get to the meat of today's lecture: we're gonna be talking about non-blocking caches, sometimes called lockup-free caches, or sometimes called out-of-order memory systems. Then a couple of topics at the end here. We're gonna have critical word first, which basically means that when you go out to the memory system, you ask for the word that you need most urgently, and the memory system will give that back to you first, and this allows you to restart quicker. And then there's a very similar idea, but possibly easier to implement, called early restart. Instead of asking the memory system to return the data in a specific order and taking advantage of that, what you do is let the memory system give you back the data in order, or whatever the canonical order is that it returns data in. But when the word shows up that you need, the word you stalled for, you just restart the processor. So if you're asking for one word out of a huge cache line, and it happens to be the second word in the cache line, you don't have to wait for the whole cache line to get loaded into the cache. But this comes with some extra complexity.
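As a rough model of early restart, suppose the memory returns the line in its canonical order, one word per beat. The helpers mem_next_word() and restart_processor() are hypothetical placeholders, not a real interface:

    #define WORDS_PER_LINE 8

    extern unsigned mem_next_word(void);        /* next beat from memory */
    extern void restart_processor(unsigned w);  /* unstall the pipeline  */

    void fill_line_early_restart(unsigned *line, unsigned needed_word)
    {
        for (unsigned w = 0; w < WORDS_PER_LINE; w++) {
            line[w] = mem_next_word();          /* fixed arrival order   */
            if (w == needed_word)               /* the word we stalled   */
                restart_processor(line[w]);     /* on: restart now, the  */
        }                                       /* rest fills in behind  */
    }

Critical word first would differ only in that the memory is asked to send needed_word on the very first beat, so the restart happens even sooner.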
And finally, we are going to talk about way prediction. Way prediction, we touched on a little bit when we were talking about some of the instruction issues in superscalars; we are going to talk about it in a little more detail here. Way prediction allows you to predict where to go find either the next piece of data you are going to access, or the next instruction you are going to access. This is important if you have a multi-way cache. In a direct-mapped cache there is only one place to go find it. In a two-way cache there are two places to go find it, and if you check both, you can try to do that in parallel, but it'll probably waste a bunch of power. Or, what you can do is predict it. You can predict that it's in way one, or you can predict that it's in way zero, and use that prediction. If you have high prediction accuracy, you may only have to fire up half your cache. Or, if it's on your critical path, you don't actually have to do two comparisons, or three comparisons, or four comparisons, however high the associativity of your cache is.
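A minimal software model of that idea for a two-way cache might look like the following. The set count, the one-entry-per-set predictor, and the omission of valid bits are all simplifying assumptions:

    #define SETS 256

    static unsigned tags[SETS][2];       /* tag array for the two ways    */
    static unsigned char predict[SETS];  /* predicted way for each set    */

    /* Returns 1 on hit and sets *way_out; returns 0 on miss. */
    int lookup(unsigned set, unsigned tag, int *way_out)
    {
        int p = predict[set];
        if (tags[set][p] == tag) {       /* fast path: one comparison,    */
            *way_out = p;                /* only the predicted way fires  */
            return 1;
        }
        if (tags[set][p ^ 1] == tag) {   /* mispredict: probe the other   */
            predict[set] = p ^ 1;        /* way and retrain the predictor */
            *way_out = p ^ 1;
            return 1;
        }
        return 0;                        /* miss in both ways             */
    }

The fast path is where the savings come from: on a correct prediction, the other way's tag and data arrays never light up, and a mispredict just costs an extra probe.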
So let's go take a look at banking strategies and multi-porting our caches. This is a little bit of a recap; I think we talked about this briefly during last lecture, but we are going to continue on with that. So here we have our two-way issue processor pipeline. You have a program counter, you read an instruction cache, you have your instruction registers. What's interesting is that in this pipe, you can only execute one memory operation per cycle, and that's a limiter. Let's say you want to execute two memory operations, or three memory operations, or four memory operations that are all independent of each other. What do you go about doing? Well, in our two-issue pipeline here, logically, what you wanna do is somehow put the data path on both pipelines, and remove this restriction of where you can execute memory accesses, of which pipe you can issue memory accesses from. Okay, that doesn't sound too horrible, until we go look at the implementation of this. We can go and have a true multi-ported cache. With a true multi-ported cache, if you think about the decoding logic on a read, we're gonna have to at least duplicate that decoding logic. Likewise, if we want to actually execute two writes at the same time, we're going to have to duplicate the row decoders for that logic. And a lot of times, how people actually go to implement this is they basically duplicate the entire cache to get an extra port, for a variety of reasons. You know, the area increase could be large; it could even be double for a two-read-port, two-write-port sort of cache, so they don't even share the storage data. But that's only a little bit of the problem. The other problem here is that it also increases the hit time of your cache, and no one likes increasing hit times in caches. Now, why does it increase the hit time? Well, unless you, for instance, duplicate everything, if you try to share some of the logic, you're basically going to increase the loading on the readout of the cache here. And by increasing the load, this is gonna have slower drive strengths, and you'll probably increase the time. And you can go look in something like CACTI, which is a tool from HP that models different cache configurations, and you can see that adding a second port will usually be the last thing in the world you wanna do when you're doing your processor design. It ends up being really quite expensive, mostly in terms of area and somewhat in terms of timing. So can we do better? Well, maybe yes, maybe no. One thing you can think about doing is, instead of actually having a true dual-ported cache, can we have a pseudo-dual-ported cache, or a cache which has the same external interface? So if we sort of cut out here in the middle, we have two addresses coming in, two data paths coming out; it looks the same, but instead we actually put something else in the middle here. What do we put in the middle? Well, we actually put two banks, as they're called, of our cache. The first bank is one-half the size of our cache, and the other bank is one-half the size of our cache, and they're both single-ported. What this is relying on is that if you have statistical memory accesses that have some good pattern to them, the probability that you'll actually end up routing two memory addresses to the same bank, and get what's called a bank conflict, will be quite low. Instead, the probability that you actually go to two different banks will be high. Now, you might say, well, bank conflicts sound pretty possible to happen. This is sort of the naive drawing here; when people go to build these things for large caches, you actually have many banks. And if you have more banks than the number of ports you have, you can make a decent argument that the probability you're gonna get a bank conflict ends up being quite low, or the probability that any two random addresses will end up going to the same bank will be low.
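Here's a tiny sketch of the steering and conflict-detection logic for the two-bank case. The 64-byte line size and the single bank-select bit are assumptions for illustration:

    #define LINE_BITS 6                  /* 64-byte lines (assumed)      */

    static int bank_of(unsigned addr)
    {
        return (addr >> LINE_BITS) & 1;  /* lowest line-index bit picks  */
    }                                    /* one of the two banks         */

    /* Returns 1 if both accesses proceed this cycle, 0 on a conflict. */
    int issue_two_accesses(unsigned a0, unsigned a1)
    {
        if (bank_of(a0) == bank_of(a1))
            return 0;                    /* bank conflict: one access    */
                                         /* must replay next cycle      */
        return 1;                        /* different banks: both go in  */
    }                                    /* parallel, one port per bank  */

The muxes in the hardware picture correspond to routing a0 and a1 to whichever bank bank_of() selects.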
Now, let's talk a little bit about addressing, and how we do this assignment of addresses to banks. The most basic thing is you can just low-order interleave it, or you can try to high-order interleave it. So one of the questions that comes up here is: which is better, low-order or high-order interleaving of our banks? Everyone take a guess on this one. When I say low order, I mean use the low-order bits to hash into which bank you go to. High-order interleaving uses some higher bits; maybe middle-order interleaving uses some middle-order bits. What would you choose if you were a designer?

That's a good insight. So, if we think we are accessing an array, and we're trying to stride through the array, then as we stride through the array, the high-order bits of those addresses are all going to be the same. So if we have high-order interleave bits, all the high bits are going to be the same, and we'll get conflicts: they're all going to go to, let's say, bank zero all the time. And therefore our cache, which we were trying to give more or higher throughput, didn't get any higher throughput. Okay, now let me give you a counterexample. The counterexample is, let's say you are striding through an array, and you have an array of structures, a pretty common thing to do. And the structure is, let's say, 128 bytes long. And we go and access the first byte of that 128-byte structure in a loop. Are the low-order bits of the address the same between all of those? Yeah. So I can probably bet our higher bits are the same and our lower bits are the same. What's changing in that example? Some medium bits; some bits a little bit higher than the low bits. So what we're trying to get across here is that you have to be very careful about how you do bank interleaving if you have naive hash functions, the functions which assign an address to a particular bank. And you may want to think about having even more sophisticated hashing functions, or assignments from address to bank. But if you go to do that, you know, this is on the critical path of your memory load, so you may not want to have a complex hash function there. So you might want to just take some middle-order bits; you might want to take some low bits and some high bits for your bank assignment. It gets a little challenging here. You probably don't want the highest-order bits, because in most systems those are almost always zero or something like that. It's very unlikely that your highest bits are something important, unless you're accessing a stack, where those are usually all ones.
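Here is that structure-stride example in code, on a hypothetical machine with 64-byte lines and a single bank-select bit. Striding by 128 bytes holds the lowest line-index bit and the high bits constant; only a middle bit varies:

    #include <stdio.h>

    unsigned low_bank (unsigned a) { return (a >> 6)  & 1; } /* lowest line bit */
    unsigned mid_bank (unsigned a) { return (a >> 7)  & 1; } /* one bit higher  */
    unsigned high_bank(unsigned a) { return (a >> 31) & 1; } /* highest bit     */

    int main(void)
    {
        for (unsigned i = 0; i < 4; i++) {
            unsigned a = 0x10000000 + i * 128;  /* first byte of each struct */
            printf("addr %#010x -> low %u  mid %u  high %u\n",
                   a, low_bank(a), mid_bank(a), high_bank(a));
        }
        return 0;
    }

Running this, low_bank and high_bank print 0 for every access, so either of those interleavings sends the whole loop to bank zero, while mid_bank alternates 0, 1, 0, 1, which is the behavior you actually want from the hash.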
So I have another slide here; I think we talked about most of these points already, but these are the issues if we're trying to get high throughput. Bank conflicts: we talked about how you can get a bank conflict, and we talked somewhat about how you can avoid bank conflicts, by having more banks than you have ports into the banking structure. And that's just from a statistics perspective. If, let's say, you have a 32-kilobyte cache, and each bank is one kilobyte or something like that, then all of a sudden, what's the probability that you have two ports trying to access one particular bank? It's kind of like a birthday problem, like trying to find whether two people have the same birthday, 'cause it's not quite the easy probability; you can go figure that out, but it's quite low. Roughly, what is this? It's probably 32 times one over 32 times one over 32: the chance that both ports pick one particular bank, summed over all 32 banks, so about one in 32. So it's not the lowest thing in the world, but it's possible to actually reduce the probability of a conflict with random accesses. And if you have a better hash function, of course, you can do better.
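A quick Monte Carlo sanity check of that estimate, under the assumption that the two ports pick banks independently and uniformly at random:

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        const int banks = 32, trials = 1000000;
        int conflicts = 0;
        srand(1);
        for (int t = 0; t < trials; t++)
            if (rand() % banks == rand() % banks)  /* two random picks  */
                conflicts++;                       /* landed together   */
        printf("conflict rate %.4f, expected %.4f\n",
               (double)conflicts / trials, 1.0 / banks);
        return 0;
    }

With 32 banks and two ports, the conflict rate comes out around 1/32, about three percent per cycle, which is why having more banks than ports works so well for random-looking access streams.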
Extra wiring: so what do I mean by extra wiring? Well, in our previous design, we didn't have these sort of crossover paths. What this presents you with is that there's a mux here, a multiplexer there, a mux here and a mux there for the address paths. So that's extra wiring and extra delay from banking. And one of the bigger challenges with the extra wiring in multi-banked memories is that it shrinks the bank size, or shrinks the memory size, and if you traditionally look at memories, they like to be larger. You can amortize the overheads a little bit better, for example the sense amps at the bottom of the memories and the row decoders and the column decoders, if the memory is bigger, because, for instance, the row decoders and the column decoders roughly grow as the log of the number of bits, or something like that. But if you make each of your caches really small and you have multiple of them, that overhead gets worse: you have to have more sense amps, more row decoders, more column decoders. So it's something to think about, but usually that's relatively small and you're willing to pay it, unless you get some really funny aspect ratios. What do I mean by really funny aspect ratios? If you shrink the cache too small, so if you have, I don't know, 256-byte caches, and the readout width is, let's say, a byte, then you get a long and narrow cache. And you can try to use column addressing or something to squish that into a more square form factor. But sometimes if you squish your bank sizes too small, you get very funny aspect ratios, or very long, skinny RAMs, that don't fit very well into the designs. Okay.

Uneven utilization: so what do I mean by that? Well, what I mean is, just like you can have bank conflicts due to the addressing or the hashing scheme, you can also have a pathological case where all of the interesting data is stored in one of the banks. So let's say, going back to the example, you use low-order interleaving on the lowest-order bit of the address, above the offset we'll say, to choose between bank zero and bank one. But it just so happens that the program pads out all its data structures such that that bit is a constant zero, basically, for all the useful arrays you're trying to access. What that will do is locate all the data lines in bank zero. Alright, well, we just took our big cache and cut it in half, because we effectively placed all of our data over in this one bank. So we're getting uneven data utilization. So we can get uneven bandwidth utilization and bank conflicts, but we can also get uneven data utilization. These often go together, though, the bank conflicts and the uneven utilization: 'cause if you're accessing something a lot, and it's in bank zero, that probably means it's stored there, so you have uneven storage patterns also. Questions so far about banking, why we wanna bank?
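One possible shape for the "more sophisticated hashing function" idea from a few slides back, sketched here purely as an illustration, is to XOR-fold several index fields into the bank number, so that a single padded-out constant bit can no longer pin every access to one bank:

    #define LINE_BITS 6
    #define BANK_BITS 3                      /* 8 banks: an assumption   */

    unsigned bank_hash(unsigned addr)
    {
        unsigned idx = addr >> LINE_BITS;    /* drop the line offset     */
        unsigned b = 0;
        while (idx) {                        /* fold all the index bits  */
            b ^= idx & ((1u << BANK_BITS) - 1);  /* together, BANK_BITS  */
            idx >>= BANK_BITS;               /* at a time, with XOR      */
        }
        return b;                            /* bank number, 0..7        */
    }

The catch, as noted above, is that this sits on the critical path of every load, so a real design would keep it to a level or two of XOR gates rather than a full fold like this.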
Banking is a pretty common thing to do. A good example of this is if you go look at the Niagara processor, which is a Sun microprocessor: the first Niagara had eight cores, and if you go look at their cache structure, they had eight banks in the cache. So it's a relatively heavily banked last-level cache. If you go look at something like the multi-core Intel chips out there, they have even more banks. Actually, I think I saw one once where the 32-megabyte last-level cache for the Intel Itanium 2 had some incredibly large number of banks; I think it had 32 or 64 banks in their cache. And they were wide, also, so they had lots of bandwidth going through that cache. Okay, so if you look at our cache banking, it really, really helps here, as we are increasing the bandwidth. We might actually be hurting some things: we might be hurting the hit time a little bit. Miss rate and miss penalty should not change, and the cache size is the same; we haven't effectively increased or shrunk the cache size. Even if you get lots and lots of bank conflicts, or you have poor utilization, that's still not actually affecting your miss rate, 'cause you would have been conflicting in the same-size cache. Logically, you could draw a box around it, and it would be the exact same thing. It's just that, if you were to have it not be a banked cache, you would have every other line be in the contested ones, or something like that. So just having better memory layout is good all around.