Okay. Welcome to today's ELE 475. So today, we are going to be moving to a little bit different topic. We're going back to one of the topics that we talked about at the beginning of class in Lecture 3. We're going to talk about caches, and we're spending two lectures on caches because they are really important to performance. In, we'll say, the '80s and '90s, a lot of computer architecture papers and a lot of effort were spent trying to take the more and more transistors that Moore's Law was giving the computer architect and apply them to build bigger caches, fancier caches, more set-associative caches, harder, we'll say, ways to think about caches, all for higher performance. And we'll be continuing to videotape these lectures and put them online for everybody.

So let's start off looking at what we're doing today. We're going to start off with a little bit of review from Lecture 3. We're going to talk about the three Cs of caches: capacity, conflict, and compulsory misses. We'll be talking about the basic cache optimizations we had talked about in Lecture 3, which are mostly, kind of, make your cache bigger or make your cache more highly associative somehow to reduce these three Cs, and then we're going to start talking about much more advanced cache organizations and optimizations we can make.

So, let's briefly talk about it. It's sort of a hodgepodge, and this is kind of just the nature of the beast. There are lots of little things that sort of add up to make either the cache more effective or the integration of the cache into your pipeline more efficient. And this is only the beginning; we will have about five or six more topics next time.

But let's start off by looking at what we're going to be talking about today. We're going to look at how to build, how to pipeline a cache write. One of the things we haven't really thought about when we were working on our labs is that we do a write to the cache and it just sort of magically gets stored there. But if you're actually trying to build a real cache, you have to check the tag and then you have to write the data, because you don't want to be writing something into the cache if, let's say, a different piece of data is in that same location. So, you need to check the tag first. And we'll talk about some ways to optimize that.
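To make the check-the-tag-before-you-write idea concrete, here is a minimal sketch in C of the write path of a direct-mapped, write-back cache. The sizes, structure names, and the writeback_to_memory() helper are assumptions made up for illustration, not the lab's actual cache; a real pipelined cache would do the tag check and the data write in separate pipeline stages.

    #include <stdint.h>
    #include <stdbool.h>

    #define LINE_BYTES 64    /* assumed block size            */
    #define NUM_LINES  256   /* assumed number of cache lines */

    typedef struct {
        bool     valid;
        bool     dirty;
        uint64_t tag;
        uint8_t  data[LINE_BYTES];
    } cache_line_t;

    static cache_line_t cache[NUM_LINES];

    /* Hypothetical stand-in for writing a dirty victim back to the next level. */
    static void writeback_to_memory(uint64_t tag, unsigned index, const uint8_t *data)
    {
        (void)tag; (void)index; (void)data;   /* next level not modeled in this sketch */
    }

    void cache_write_byte(uint64_t addr, uint8_t value)
    {
        unsigned offset = addr % LINE_BYTES;
        unsigned index  = (addr / LINE_BYTES) % NUM_LINES;
        uint64_t tag    = addr / (LINE_BYTES * (uint64_t)NUM_LINES);

        cache_line_t *line = &cache[index];

        /* Step 1: read and compare the tag first. */
        if (!line->valid || line->tag != tag) {
            /* Miss: a different block lives here, so we must not clobber it. */
            if (line->valid && line->dirty)
                writeback_to_memory(line->tag, index, line->data);  /* the victim */
            /* (A real cache would now fetch the rest of the new block.) */
            line->valid = true;
            line->dirty = false;
            line->tag   = tag;
        }

        /* Step 2: only after the tag check passes do we write the data. */
        line->data[offset] = value;
        line->dirty = true;
    }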
We'll talk about a write buffer. A write buffer is an extra data structure we can put after our first level of cache, which can hold victims. A victim is a line, or a cache block, that is being evicted from the cache. If you're going to, let's say, replace something in the cache, you need some place to put the old data, or the data that was, let's say, dirty in the cache from before. And the slow way to do this, the way we've started with up to this point, is you take that line and you wait for it to get into either memory or the next level of cache. But if you want to go faster, you can think about storing it in some side structure and, we'll say, writing it to the next level of cache or to main memory in the background.

We'll talk about multi-level caches. Multi-level caches means you have a level-one, a level-two, a level-three, maybe a level-four cache, then main memory, and we'll see why this can improve performance.

We'll talk about victim caches. Victim caches: this is an idea that can basically increase the effective associativity of a cache by putting a small cache with very high associativity, maybe a fully associative cache, next to a low-associativity cache, maybe a direct-mapped cache, and using them together to basically implement a pseudo higher-associativity cache.

We're going to talk today about prefetching, both hardware and software. Prefetching is the act of either a piece of hardware or software trying to get data that you need into your cache early.

We'll talk about how to increase the bandwidth to your cache. Up to this point, we've looked at doing one memory operation per cycle. Well, we've built machines that can execute two operations per cycle, and on your exam, you had a machine which could execute three operations per cycle. Wouldn't it be great to be able to do three memory operations per cycle, versus just three ALU operations? So, how do we go about doing that? Well, we can put multiple ports on this structure, but that's not particularly great for performance from a clock perspective, because it makes the structure very large. So, you can start to think about banking, which we'll talk about today as a mechanism to increase the bandwidth.
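As a rough illustration of the banking idea, here is a small C sketch that interleaves cache blocks across independently accessible banks selected by low-order block-address bits, so two accesses that land in different banks could be serviced in the same cycle. The bank count and the conflict rule are assumptions for illustration, not a description of any particular machine.

    #include <stdint.h>
    #include <stdbool.h>

    #define LINE_BYTES 64   /* assumed block size      */
    #define NUM_BANKS  4    /* assumed number of banks */

    /* Interleave blocks across banks using the low bits of the block address. */
    static unsigned bank_of(uint64_t addr)
    {
        return (unsigned)((addr / LINE_BYTES) % NUM_BANKS);
    }

    /* Two accesses can proceed in the same cycle only if they target
     * different banks; otherwise there is a bank conflict and one of
     * them has to stall for a cycle. */
    static bool can_issue_together(uint64_t addr_a, uint64_t addr_b)
    {
        return bank_of(addr_a) != bank_of(addr_b);
    }

The design choice this captures is that each bank is a small single-ported array, so you get extra bandwidth without the area and clock cost of adding a true second port to one big structure.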
And then we'll talk about a bunch of different compiler optimizations, or software optimizations, that a compiler can do to restructure code to have better cache performance. So, this is just what we're talking about today.

Next lecture, we're going to continue on this. And the main topic of next lecture, as far as advanced cache optimizations are concerned, is we're going to be talking about how to build caches which can deal with out-of-order memory, or can deal with the ability to have loads and stores which hit in the cache, or loads and stores which miss in the cache, and then continue on past that point. In our sequential machine up to this point, and even in the out-of-order machines we were talking about, when you took a cache miss, you basically had to stop the machine, because you have to wait for the data to come back, because you didn't know if the next instruction was going to need this data, or maybe the next load would need that data, or something like that. So, if you want to try to go out of order, you can start thinking about that, and that's what the bulk of next lecture will be about: how you go out of order in your memory system, or what is largely called a non-blocking cache. And I should point out that a non-blocking cache is actually not just for out-of-order processors; it is sometimes good for in-order processors too. If you have an in-order processor, the instruction sequence is going in order, but you don't want to, let's say, wait when you take a cache miss. You could actually have the memory system be out of order with the pipeline in order. That's actually a pretty common thing for people to do.

Okay, so let's go look at our review. We have a processor. And in our basic machine, we go out to main memory when we want to get data. Main memory is big; it holds everything, but it can be slow. So, we put a cache in front of that, which keeps a subset of all the data in the main memory. And it has a faster access time, so it can take something like this example here, where you go to access memory and you have to wait for the memory to come back, and it can shrink this time. If we look at that as, sort of, the average memory access time, it's your hit time plus your miss rate times your miss penalty.
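As a quick worked example of that average memory access time formula, here is a tiny C sketch; the hit time, miss rate, and miss penalty numbers are made up for illustration, not taken from the lecture.

    #include <stdio.h>

    int main(void)
    {
        /* AMAT = hit_time + miss_rate * miss_penalty, with assumed numbers. */
        double hit_time     = 1.0;    /* cycles to hit in the cache  */
        double miss_rate    = 0.05;   /* 5% of accesses miss         */
        double miss_penalty = 100.0;  /* cycles to go to main memory */

        double amat = hit_time + miss_rate * miss_penalty;

        /* Prints 6.0: the cache turns a 100-cycle memory into an
         * average of 6 cycles per access at this miss rate.         */
        printf("AMAT = %.1f cycles\n", amat);
        return 0;
    }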
This is just review. Okay. So, going back to the three Cs, here we have a four-block cache, say, a two-way set-associative cache, and let's look at one of the major types of misses you can have: compulsory misses. This just means it's the first time any portion of your program has gone to access the data. There's no real way around this. Maybe you can prefetch, but, you know, you're not really going to magically be able to get the data in there early unless you have some sort of aggressive prefetching mechanism.

Capacity. In this cache here, we have four blocks. If you're looping over, let's say, five cache blocks' worth of data, you're basically never going to be able to fit all the data in this cache. Or, in a more common case, let's say you have a cache that's eight kilobytes, that's your L1 cache, and you're trying to add two arrays where each array is sixteen kilobytes. Well, that means 32 kilobytes of data, maybe even another sixteen kilobytes of data if you're not doing it in place for the result array, and it's just not going to fit in your cache. You have capacity problems, and things are just basically going to be continually kicked out of your cache.

Conflict means that you actually have enough space in your cache, but too many of the different loads or stores you're trying to do alias to the same location in your cache. So, let's say you're trying to loop over, or trying to access, three different cache blocks which all alias to one set in the cache. Effectively you can only fit two things, and you're trying to fit three things in two spots. This is where the pigeonhole principle comes up and says, no, you just can't fit three things in two boxes. So, we get conflicts.

Okay, so still review here. Let's take a look at one way to reduce our hit time. Hit time is just the fast case; that's when you actually find the data in your cache. This is the good case. But for hit time, one way to make it lower, and this is actually in nanoseconds, is just to have a smaller cache or a simpler cache. Small, simple caches are shown in this drawing here. We have time on the vertical axis; on the x-axis we have different cache sizes. And the 16-kilobyte cache is faster than, let's say, the 256-kilobyte cache.
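To see what "simple" buys you on the hit path, here is a rough C sketch with assumed sizes: a direct-mapped hit check is one index, one tag read, and one comparison, while an N-way set-associative lookup has to compare against every way in the set, and in hardware those extra comparators plus the way-select mux sit on the hit path.

    #include <stdint.h>
    #include <stdbool.h>

    #define LINE_BYTES 64
    #define NUM_SETS   128      /* assumed geometry */
    #define NUM_WAYS   4

    typedef struct { bool valid; uint64_t tag; } tag_entry_t;

    static tag_entry_t dm_tags[NUM_SETS];              /* direct-mapped   */
    static tag_entry_t sa_tags[NUM_SETS][NUM_WAYS];    /* set-associative */

    /* Direct-mapped: the hit path is one index, one tag read, one compare. */
    static bool dm_hit(uint64_t addr)
    {
        unsigned set = (addr / LINE_BYTES) % NUM_SETS;
        uint64_t tag = addr / (LINE_BYTES * (uint64_t)NUM_SETS);
        return dm_tags[set].valid && dm_tags[set].tag == tag;
    }

    /* N-way: every way's tag gets compared; in hardware all of these
     * comparisons happen in parallel and feed a mux that selects the
     * matching way, which lengthens the hit path.                     */
    static bool sa_hit(uint64_t addr)
    {
        unsigned set = (addr / LINE_BYTES) % NUM_SETS;
        uint64_t tag = addr / (LINE_BYTES * (uint64_t)NUM_SETS);
        for (unsigned way = 0; way < NUM_WAYS; way++)
            if (sa_tags[set][way].valid && sa_tags[set][way].tag == tag)
                return true;
        return false;
    }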
So, if you make a smaller cache, you can go faster. And we've seen examples of this, actually, in the Pentium 4, when we were talking about superpipelining. In the Pentium 4, they had a very small cache, and the real reason they did that is because they had such a high clock frequency that they couldn't put in anything bigger and still fit it in one clock cycle. You know, in a perfect world, if your clock cycle were slower, you could try to fit a bigger cache in there.

Next, you can try to reduce the miss rate. There are a couple of different techniques for reducing the miss rate, and there are advanced techniques which we will talk about later. But one of the basic ones is to change the block size. As you make the block size larger, you'll be pulling in more data, so you'll get spatial locality, and hopefully you'll be pulling in data that you actually need. So, you know, the positives of this are that you're really going to have less tag overhead if you have a larger block size, or cache line size. Because for the same amount of data, let's say you go from a 64-byte line to a 128-byte line, you're going to have half as much tag cost, or area dedicated to the tags (the sketch at the end of this section makes that arithmetic concrete). And usually, as you go to larger block sizes, this is good for the farther-out levels of your cache, because you can basically burst more data in. So if you look at something like DDR2 or DDR3 memories, which are the traditional DRAM technologies that we are using today, their block size, their line size, is very, very long, sort of thousands of bits. And if you have a longer block size, you can use all those bits and pull them all into your cache in one big chunk, and you can use wider buses.

Downside? You can waste bandwidth. If you have a very large block and you don't actually use all the data in the block, let's say you use one byte out of your 256-byte cache line, you're just pulling in all that data and you're wasting lots of off-chip bandwidth for no real good reason. And if you have fewer blocks, you're going to have fewer locations to put data for a fixed-size cache, so you might actually increase your conflict rate.

So, this chart here, which is from your book, is plotting the block size versus the miss rate.
And you want the lowest miss rate possible. So, if you start here with a 16-byte block, or 16-byte cache line, as you increase it, the miss rate actually goes down. But at some point, you end up wasting a lot of bandwidth, because the program does not have enough spatial locality, so you actually just end up pulling in more data than you need. You also could potentially have more conflicts, so the miss rate starts to go back up. So, there's sort of a minimum here. Traditionally, people have thought about this as: increasing block size usually increases your performance for most applications, but only up to a limit. And the reason why people traditionally thought this is that cache block sizes, or line sizes, used to be smaller, and then they started getting bigger, and people really weren't looking way out here, where the miss rate starts to climb back up for many applications as you get farther out.

Okay. So, another one of the basic ways we can reduce miss rate: you could just have a bigger cache. This one is great. Bigger cache, lower miss rate. You've got to pay area for this, and potentially it takes longer to access this larger cache, as we saw in the first graph that we showed today. I'm not going to really talk about this, but there is an empirical rule of thumb: if the cache size is doubled, your miss rate usually goes down by about the square root of two. It's just empirical, something people have found, but a good rule of thumb to know.

The same graph tells a different story. If you go to a higher associativity, your miss rate goes down. So, if you have a direct-mapped cache and you go to a two-way set-associative cache, the miss rate goes down. And the traditional sort of rule of thumb about this is that a direct-mapped cache of size N has roughly the same miss rate as a two-way set-associative cache that is half the size. You might say, wow, that's pretty amazing, so I'll just continually apply that over and over again; it doesn't quite work that way. It also gets very hard to actually build very highly associative caches, because your tag check becomes pretty hard: you have to check against all the different places in the cache, it has to be done in parallel, and that becomes hard.
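To tie the block-size, conflict, and associativity points together, here is a minimal C sketch; all of the sizes and addresses are assumptions for illustration. It computes the set index and tag of an address, shows three blocks that alias to the same set of a two-way cache (the conflict-miss case described earlier), and shows the earlier tag-overhead point: for a fixed capacity, doubling the block size halves the number of lines and therefore the number of tags you have to store and check.

    #include <stdio.h>
    #include <stdint.h>

    /* Assumed example geometry: an 8 KB, 2-way set-associative cache. */
    #define CACHE_BYTES 8192
    #define LINE_BYTES  64
    #define NUM_WAYS    2
    #define NUM_SETS    (CACHE_BYTES / (LINE_BYTES * NUM_WAYS))   /* 64 sets */

    static unsigned set_index(uint64_t addr)
    {
        return (unsigned)((addr / LINE_BYTES) % NUM_SETS);
    }

    static uint64_t tag_of(uint64_t addr)
    {
        return addr / (LINE_BYTES * (uint64_t)NUM_SETS);
    }

    int main(void)
    {
        /* Three hypothetical addresses exactly NUM_SETS * LINE_BYTES apart:
         * different tags, same set index, so with only two ways the third
         * one must evict one of the first two, a conflict miss.           */
        uint64_t a = 0x10000;
        uint64_t b = a + NUM_SETS * LINE_BYTES;
        uint64_t c = b + NUM_SETS * LINE_BYTES;

        printf("sets=%d  a->set %u  b->set %u  c->set %u\n",
               NUM_SETS, set_index(a), set_index(b), set_index(c));
        printf("tags: a=%llu b=%llu c=%llu (all different, so they conflict)\n",
               (unsigned long long)tag_of(a),
               (unsigned long long)tag_of(b),
               (unsigned long long)tag_of(c));

        /* Tag overhead: lines (and tags) = capacity / block size,
         * so doubling the block size halves the tag count.         */
        printf("lines with 64B blocks: %d, with 128B blocks: %d\n",
               CACHE_BYTES / 64, CACHE_BYTES / 128);
        return 0;
    }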