Okay, so let's start talking about performance, because the whole reason you build a cache is to have lower power and higher performance. Let's go back to the iron law. What is the cache trying to do? Well, if you look at the iron law of processor performance, when you do a load or a store, the cache is trying to decrease the clocks per instruction it takes to process that load or store. If you have to go all the way out to main memory, on a cache miss we'll say, or if you don't have a cache, it's going to take a long time. But if you shrink the cost per instruction for a load or a store, everything gets faster, so you can actually go faster here.

So as shown here, we have some loop that does some loads and some adds, and here it takes a cache miss. If we can somehow do things to the cache which reduce the probability of a cache miss, we can shrink the amount of time it takes to do that, and the whole program will run faster. So reducing our number of cache misses is good, using our cache to actually keep the data local. And as you can see in this diagram, this first load here is in the cache, so it doesn't have to go out to main memory. It just hits, and if it's a properly pipelined cache, it returns the data on that load. Two things to point out in this diagram: processor to cache and coming back, that's a hit in the cache; a miss takes longer, it just takes more cycles.
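To make that iron-law point concrete, here is a minimal sketch of how memory stalls show up in the cycles-per-instruction term. The numbers (30% of instructions are memory references, a 100-cycle miss penalty, a base CPI of 1) are assumed for illustration only, not from the lecture.

```python
# Minimal sketch (illustrative numbers): how cache misses inflate the CPI term
# of the iron law:
#   time/program = (instructions/program) x (cycles/instruction) x (seconds/cycle)

def effective_cpi(base_cpi, mem_refs_per_instr, miss_rate, miss_penalty_cycles):
    """CPI once the average memory-stall cycles per instruction are added in."""
    stall_cycles_per_instr = mem_refs_per_instr * miss_rate * miss_penalty_cycles
    return base_cpi + stall_cycles_per_instr

# Assumed example: 30% of instructions are loads/stores, 100-cycle miss penalty.
print(effective_cpi(1.0, 0.3, 0.10, 100))  # 10% miss rate -> CPI = 4.0
print(effective_cpi(1.0, 0.3, 0.02, 100))  # 2% miss rate  -> CPI = 1.6
```

Cutting the miss rate from 10% to 2% drops the effective CPI from 4.0 to 1.6 in this made-up example, which is why the rest of this discussion is about reducing misses and miss cost.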
Okay, so this is an important slide because it introduces some important ways to think about caches. We want to categorize the types of misses we take in the cache. This is about figuring out why we would have these different heuristics inside of caches, based on the different policies and the different ways that you'd actually take a cache miss. A lot of times people call this the three Cs of caches.

Okay, the first C: a compulsory cache miss. What that means is the first reference to a block, and you're going to take that cache miss even if you have an infinite-size cache. You just can't get things into the cache unless you go to try to access them the first time ever. So that's what a compulsory cache miss is: that first reference. You can try real hard, though. You can have a prefetcher, actually, to try to reduce the number of compulsory cache misses. You could think really hard and say, oh, I think I'm going to access this data sometime in the future, I should go get it. And then when you go to actually access it, you won't take a cache miss. So that's a possibility.

Okay, the second C that contributes to the performance of caches is the capacity of the cache. Traditionally, a larger cache will be able to fit more data, well, that's always true. But traditionally, with a larger cache you will have a lower miss rate. So let's think about that; this is an important question here: will a larger cache always have a lower cache miss rate than any smaller cache? Well, let's think about what kicks data out. If you have a cache that is, let's say, eight lines and a cache that is sixteen lines, by definition the addresses that alias in the bigger cache are going to alias in the smaller cache too, if you go from the sixteen-entry cache down to the eight-entry cache. So if you're going up by factors of two, at least, your cache miss rate is always going to be better with the larger cache; or rather, the cache miss rate is going to be lower with the larger cache, as in the sketch below. The one place this could come apart is if you change the hashing function. So if you have a cache that is, let's say, two thirds the size: say one cache is eight entries and the other is twelve entries. There, the [inaudible] locations are going to be different if you actually think about having different patterns there. But it's likely that the larger cache will still be better there. So large is good. That's assuming a direct-mapped cache, more or less. Depending on your LRU strategy, or your replacement policy, there are some other caveats there; you have to think a little bit harder about those.
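Here is a small sketch of that power-of-two argument, using a toy direct-mapped cache model. The trace, cache sizes, and one-word blocks are made-up assumptions; the point is just that any pair of addresses that collides in the sixteen-line cache also collides in the eight-line one, so going up by a factor of two can only remove misses for a given trace.

```python
# Toy direct-mapped cache: index = block address mod number of lines.
# Assumptions: one-word blocks and a made-up address trace.

def count_misses(trace, num_lines):
    """Count misses in a direct-mapped cache with num_lines lines."""
    lines = [None] * num_lines          # tag stored per line, None = empty
    misses = 0
    for addr in trace:
        index = addr % num_lines
        tag = addr // num_lines
        if lines[index] != tag:         # compulsory, capacity, or conflict miss
            misses += 1
            lines[index] = tag
    return misses

# 0/8 and 3/11 alias pairwise in an 8-line cache, but not in a 16-line cache.
trace = [0, 8, 3, 11] * 8
print(count_misses(trace, 8))    # 32: every access misses, the pairs fight
print(count_misses(trace, 16))   # 4: only the compulsory misses remain
```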
Okay, so finally, what's the last thing that can cause a cache miss to occur? A conflict in the cache. This means that we don't have high enough associativity, and two different pieces of data, two different blocks, alias to the same location and they're fighting for that location. So there's a conflict there, and one of the pieces of data gets kicked out before you have time to go read it back in. So you have a load and then another load, but in the meantime you have a load to a different address which happens to alias to that same location in the cache. The hash function points to the same location, and they're going to fight for that resource. So that's a conflict.

Okay, so let's put some data behind this and look at some ways to make caches go faster. Let's look at what we are plotting here on this graph. On the X axis we have different sized caches: sixteen kilobytes, 32 kilobytes, 64, going up by factors of two all the way up to one megabyte. On the Y axis here we have access time, how long it takes to access the cache. So if we go back to the average memory latency equation: we take the miss rate times the miss penalty, plus the hit time, which is how long it takes to actually go look stuff up and find something in our cache in the first place. Well, if we can reduce the hit time, that's good. So say we take a cache which takes, I don't know, two nanoseconds to access. On a one gigahertz machine, that'd be two clock cycles. If we can somehow replace it with a cache which takes half a nanosecond to access, that's good; it means we can actually fit that in one clock cycle on a one gigahertz machine. In fact, we can probably go all the way up to about one nanosecond on a one gigahertz machine before we spill over into the next clock cycle. So you can think about it that way. So the first thing is that small and simple caches can be good. Everyone thinks capacity, capacity, capacity, but in your processor core, if you can reduce the hit time, that's a good thing. Just to give you some idea here, as you scale back to smaller caches, the time it takes to access the cache goes down.
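As a quick sketch of that average memory latency math, here is the equation written out. The hit times, miss rates, and 100-cycle miss penalty are assumed numbers, not values from the slide.

```python
# Average memory access time: AMAT = hit time + miss rate x miss penalty.

def amat(hit_time_cycles, miss_rate, miss_penalty_cycles):
    return hit_time_cycles + miss_rate * miss_penalty_cycles

# A small, fast cache (1-cycle hit) with a slightly worse miss rate can still
# come out ahead of a bigger, slower cache (2-cycle hit).
print(amat(hit_time_cycles=1, miss_rate=0.045, miss_penalty_cycles=100))  # 5.5
print(amat(hit_time_cycles=2, miss_rate=0.040, miss_penalty_cycles=100))  # 6.0
```

That is the argument for small and simple first-level caches: a lower hit time can pay for a somewhat higher miss rate.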
So the next thing we do is try to reduce the miss rate, and there are a couple of different ways to do that. These are just some basic optimizations in the cache lectures; later in class we're going to go through a lot more optimizations for caches. One thing you can do is think about the block size. In the examples I've been giving, we've been talking about a 64-byte block size, but you could think about having either a smaller block size or a larger block size, and in fact this happens. In this graph here, 64 looks like sort of the lowest miss rate, a good spot, but this is really dependent on your application space. If you have applications which just stream through memory, you probably want a bigger block size. If you have more random patterns and you're not getting any reuse, it might make sense to have an even smaller block size. This data is from your textbook, actually. This is plotting miss rates against block size for different cache sizes, so each of these lines here is a different size cache: you can see a four kilobyte cache, a sixteen kilobyte cache, a 64 kilobyte cache, and a 256 kilobyte cache. You can plot those two things against each other and see where the sweet spot in this curve is. We're trying to minimize the miss rate, so how often we actually take that cache miss.

Okay, so there are some positive things here about having larger block sizes. Let's talk about that. If we have a larger block size, we need less tag overhead; we talked about that already in the tag lookup slide. If you have longer block sizes, you can also think about having more burst transfers from your DRAM. DRAM typically likes to give you large chunks of data at a time and not little chunks, because there's an overhead cost in firing up the DRAM: there's what's called the column address strobe and the row address strobe, the RAS and CAS times, and you pay that on each memory access. So if you can pull in larger chunks of memory at a time, you only have to do that once for the large amount of data you bring in. That pushes you to want larger block sizes. And you could even think about similar ideas for on-chip buses: if you have larger block sizes, you'll probably be using that on-chip bus more effectively, because there are usually some overheads and some turnarounds for arbitration on buses.
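To put a rough number on the tag-overhead point, here is a sketch that counts total tag bits for a direct-mapped cache as the block size grows. The 32-bit address and 16 KB capacity are assumptions chosen for illustration.

```python
import math

def tag_overhead_bits(cache_bytes, block_bytes, addr_bits=32):
    """Total tag storage (bits) for a direct-mapped cache."""
    num_blocks = cache_bytes // block_bytes
    offset_bits = int(math.log2(block_bytes))
    index_bits = int(math.log2(num_blocks))
    tag_bits = addr_bits - index_bits - offset_bits
    return num_blocks * tag_bits

for block in (16, 32, 64, 128, 256):
    print(f"{block:3d}-byte blocks: {tag_overhead_bits(16 * 1024, block)} tag bits")
```

For a fixed-capacity direct-mapped cache the tag width stays the same in this sketch, but quadrupling the block size cuts the number of blocks, and therefore the total tag storage, by four.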
Okay, so on the right side we have the downsides of larger block sizes. If you have a larger block size, you might be pulling in data you're not using. So if I have a 256-byte cache block versus a 64-byte block, we're going to be pulling in four times as much data, and if we're only trying to access, let's say, one byte of that data, we just wasted a lot of main memory bandwidth to pull that byte in. So we need to be cognizant of that, and that's why this curve is not just increasing in one direction or the other. When I first took a computer architecture class, I thought, oh, as you increase the block size, shouldn't performance go up, or shouldn't the cache miss rate go down in this graph? And it's not true, because you start to waste bandwidth at some point. Also, if you have a larger block size, by definition you have fewer blocks. So if we have, let's say, 256-byte blocks versus 64-byte blocks, that's by definition four times fewer blocks in the cache for the same amount of data. If we have a four kilobyte cache, we're going to have the same amount of data, it's still a four kilobyte cache, but there are fewer blocks in that cache, so you're not going to be able to have as much random data in your cache at one time.

So that's one technique to reduce the miss rate. Another way, and in a perfect world this fights against small and simple caches, is that you just build big caches. If you build big caches, it means that when you go to access the data, there's a very high probability that the data is close to you. That sounds good. So here we actually have miss rate plotted against cache size, and of course the different associativities, the different types of caches here. And there's one thing I wanted to point out here: this empirical rule of thumb. If you double your cache size, your miss rate usually drops by root two. Sometimes people call this the square root rule. How do we derive this? Well, sorry [inaudible] you guys, it says here it is an empirical rule of thumb. It is just a rule of thumb.
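Here is that square root rule written out as arithmetic. The 4 KB baseline and its 10% miss rate are made-up numbers used only to show the shape of the curve.

```python
import math

def estimated_miss_rate(base_size_kb, base_miss_rate, size_kb):
    """Square-root rule of thumb: each doubling divides the miss rate by sqrt(2)."""
    doublings = math.log2(size_kb / base_size_kb)
    return base_miss_rate / (math.sqrt(2) ** doublings)

for size in (4, 8, 16, 32, 64, 128, 256):
    print(f"{size:4d} KB  ~{estimated_miss_rate(4, 0.10, size):.3f}")
```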
Obviously, if this were perfectly true, this line would be nicely curved instead of having some bumps in it, but this actually, surprisingly, works out pretty well as a rule of thumb. It doesn't really work very well for very small caches, so typically right in here it doesn't work well; possibly for very large caches it doesn't work well either; and for high associativity this rule of thumb starts to break down also. But it's a good rule of thumb to think about.

Okay, so how else do we reduce the miss rate? Well, same graph, and this is also from Hennessy and Patterson, your book. You can increase the associativity. So you can take, I don't know, a four-way cache and turn it into an eight-way cache. There's some power cost associated with that, and there's also typically a clock cycle cost. So let's look at the rule of thumb here. The rule of thumb basically says a direct-mapped cache of size N has about the same miss rate as a two-way set associative cache of size N over two. And that sounds crazy good. How is that possible? We should always build at least two-way set associative caches. But let's look at this graph and see if this actually works.

So we're going to look at a point. Let's look at a sixteen kilobyte cache that is direct mapped and compare it against, excuse me, a sixteen kilobyte cache that is two-way set associative, versus a 32 kilobyte cache with the lower associativity, direct mapped. And what you're trying to see here is whether this point is equal to that point. Well, it's not, but it's actually not a horrible approximation. We're saying that if we have a sixteen kilobyte cache with higher associativity, our miss rate goes down, right? So it's going to be somewhere in here, versus a 32 kilobyte cache which is direct mapped, which is this point here. Okay, well, that doesn't quite hold there. Let's go look for another point. Let's say a 32 kilobyte two-way set associative cache, which is that point there, versus a 64 kilobyte cache that is direct mapped. Well, those are almost on a straight line with each other, so it's almost exactly equal. This is just an empirical rule of thumb that people have figured out: as you double your associativity, you can almost halve your cache size and still have about the same miss rate.
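If you want to poke at that rule of thumb yourself, here is a toy set-associative cache model with LRU replacement. The trace is synthetic and the sizes are arbitrary assumptions, so treat the output as a way to run the comparison, not as evidence for the rule.

```python
from collections import OrderedDict
import random

def misses(trace, num_sets, ways):
    """Count misses for a set-associative cache with LRU replacement."""
    sets = [OrderedDict() for _ in range(num_sets)]   # per set: tag -> None, oldest = LRU
    count = 0
    for block_addr in trace:
        s = sets[block_addr % num_sets]
        tag = block_addr // num_sets
        if tag in s:
            s.move_to_end(tag)            # hit: mark most recently used
        else:
            count += 1                    # miss: allocate, evict LRU way if full
            s[tag] = None
            if len(s) > ways:
                s.popitem(last=False)
    return count

random.seed(0)
trace = [random.randrange(4096) for _ in range(100_000)]   # synthetic block addresses
print("direct-mapped, 512 blocks:", misses(trace, 512, ways=1))
print("2-way LRU,     256 blocks:", misses(trace, 256, ways=2))
```

On a trace with real locality the two configurations can separate noticeably, which is exactly the "it depends on your access patterns" caveat that comes next.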
And likewise, like I said, this is just empirical. There's no reason why this has to hold, and it's really dependent on your data access patterns. But I found that pretty interesting.

Okay, so what's the problem with always building a two-way set associative cache? Why do we not just do that? Well, the area for the data store shouldn't actually change very much. What does change is your tag store: the tag data you're going to need for the higher-associativity cache. And you're going to need more tag-check logic, at least, because you'll have to check, let's say, two tags in parallel.
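As a last sketch, here is where those extra tag bits and comparators come from when you keep the capacity fixed and double the associativity. The 32-bit address, 64-byte blocks, and 16 KB capacity are assumptions for illustration.

```python
import math

def lookup_breakdown(cache_bytes, block_bytes, ways, addr_bits=32):
    """Split an address into offset/index/tag for a given cache organization."""
    num_sets = cache_bytes // (block_bytes * ways)
    offset_bits = int(math.log2(block_bytes))
    index_bits = int(math.log2(num_sets))
    tag_bits = addr_bits - index_bits - offset_bits
    return {"sets": num_sets, "index_bits": index_bits,
            "tag_bits": tag_bits, "tag_comparators": ways}

print(lookup_breakdown(16 * 1024, 64, ways=1))   # direct mapped
print(lookup_breakdown(16 * 1024, 64, ways=2))   # 2-way set associative
```

Halving the number of sets moves one bit from the index into the tag, so every tag entry gets one bit wider and every lookup now drives two tag comparators instead of one; the data store itself stays the same size.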