So the next memory optimization, or cache optimization, we are going to look at is adding multi-level caches. And you've probably seen this in pretty much any modern-day processor: there are level-one caches, level-two caches, maybe level-three caches, maybe even level-four caches. We'll look at why this might be a good idea and what the effect of this is on the different parameters we've been looking at throughout this lecture.

So what's the basic idea here? Well, the basic idea is you have a CPU, and instead of having just one cache, we say: let's add two caches. And why do we want to do this? Well, it comes from the insight that it is difficult to have a cache that is both very large and very fast. So how do you solve this problem?

Well, you can think about trying to fill this entire room with RAM. That would be a very large cache for something. And if you think about it, people actually do have notional caches like this. If you go look at something like an internet-scale data center, there will be basically a room full of mostly just RAM, and they'll use this to cache things like lookups. You might see something like memcached, which caches queries to databases.
Or rather, it is a key-value store that typically caches queries to databases, and people store it in large amounts of RAM. The problem with having a very large cache is that, by definition, if it's large and you don't want to violate the laws of physics, getting to the farthest extent of the cache means going very, very far. And if you run a wire very far, it is by definition going to be slow. You can't violate physics here: the fastest a signal can travel is the speed of light. So the farther you go, the more distance becomes a factor.

So our problem is that you can't have memory that is both large and fast. So what can we do? Well, we can add multiple levels of cache. If you have a certain-sized working set, you can try to store it, let's say, in a local cache which is both small and fast. And then you can have different levels of cache: say, a level-two cache with a larger capacity, but which is a little bit slower and a little bit farther away from your CPU. And this can mitigate the cost of having to go all the way out to DRAM.
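That trade-off can be put in numbers with the usual average-memory-access-time (AMAT) arithmetic. Here is a little sketch, not from the lecture itself; the latencies and miss rates are made-up illustrative figures, not measurements from any real chip.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time: every access pays the hit time, and
    the missing fraction additionally pays the next level's cost."""
    return hit_time + miss_rate * miss_penalty

DRAM = 100.0  # cycles out to main memory (made-up number)

# One level only: every L1 miss pays the full trip to DRAM.
one_level = amat(hit_time=1, miss_rate=0.125, miss_penalty=DRAM)

# Two levels: an L1 miss first tries the L2; only L2 misses reach DRAM.
l2_cost   = amat(hit_time=10, miss_rate=0.25, miss_penalty=DRAM)
two_level = amat(hit_time=1, miss_rate=0.125, miss_penalty=l2_cost)

print(one_level)  # 13.5 cycles
print(two_level)  # 5.375 cycles
```

Even though the L2 is slower than the L1 and misses a quarter of the time in this sketch, it intercepts most of the trips that would otherwise have gone all the way out to DRAM.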
Okay, so I wanted to introduce some nomenclature here, because this is an important thing you're going to see over and over again throughout caches, and it's important that we introduce it now, when we start to talk about multi-level caches. Just because you see, let's say, a low cache miss rate does not mean a cache is performing well, especially when you get into the multi-level cache domain. And why is that? Well, let's say we have this level-two cache here, and I tell you it has a very low miss rate. That sounds good, but the level-one cache is filtering accesses to it. So just because level two has a low miss rate doesn't necessarily mean that the level-two cache is performing well. Or vice versa: let's say the level-two cache has a very high miss rate. You might say that level-two cache is doing a really bad job. Well, to some extent, all of the easy accesses are being filtered out by the level-one cache, so all of a sudden the miss rate out of your level-two cache might look very different. So we need to come up with some way to discuss these different misses with correct nomenclature. I'm going to introduce three notions here. The first is something we're going to call the local cache miss rate.
This is just the number of misses you have in a cache divided by the number of accesses to that cache. It is local, per cache: if you look at, let's say, the level-two cache here, you have a certain number of accesses coming into the level-two cache and a certain number of misses coming out of it, all local to the level-two cache. So this gives you the actual miss rate of a particular cache.

Now, that's not the same thing as something like a global miss rate. The global cache miss rate takes the number of misses in the cache relative to the number of CPU memory accesses. And this might be a better metric for a multi-level cache, because you might want to ask, for our level-two cache here: what is the miss rate out of the last level of cache, or out of our L2 cache, relative to the number of actual accesses the CPU is doing? This helps you encapsulate the level one and level two together. So that might be a better metric, and you'll see your book sometimes use these two different metrics in different ways.

And then finally, something that's useful to think about is the number of misses per instruction. Now, why do we bring this up? It sounds pretty similar to something like a global miss rate.
Well, the difference is that the denominator changes. Instead of dividing by accesses to a particular cache, or by CPU memory accesses, we divide by the number of instructions. And why is this useful? Well, misses per instruction takes out of the equation how many loads and stores you have as a percentage of the instructions in your program. So if you have, let's say, a program which has very few loads and stores but a relatively high miss rate per access, you might say this is bad for performance if you only look at the cache miss rate, like the local cache miss rate. But in reality it may not be so bad for performance, because if you rarely do a load or a store in that particular program, maybe it doesn't actually affect your performance very much. So this last metric, misses per instruction, encapsulates that: it combines the percentage of instructions that are loads and stores with the number of cache misses in the program. This can be either local or global; usually it's considered to be sort of global-ish, but it can be either, and you could also use this metric for a particular cache.
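Here's a small sketch, not from the lecture, that pins down all three notions with arithmetic. The counts are made-up illustrative numbers, chosen with roughly the one-in-five load/store mix mentioned in a moment.

```python
# Made-up counts for a two-level hierarchy over one program run.
instructions = 10_000
cpu_accesses = 2_000   # loads + stores (~1 in 5 instructions)
l1_misses    = 100     # these become the L2's accesses
l2_misses    = 40      # these go all the way out to DRAM

# Local miss rate: misses over accesses *to that particular cache*.
l1_local  = l1_misses / cpu_accesses   # 0.05
l2_local  = l2_misses / l1_misses      # 0.40 -- looks terrible in isolation...

# Global miss rate: misses over *CPU* memory accesses.
l2_global = l2_misses / cpu_accesses   # 0.02 -- ...but is actually fine,
                                       # because the L1 filtered the easy hits

# Misses per 1,000 instructions: independent of the load/store mix.
l2_mpki = l2_misses * 1000 / instructions   # 4.0
```

The same L2 looks bad by its local miss rate (40%) and good by its global miss rate (2%), which is exactly the filtering effect described above.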
So we can say this cache misses once every 1,000 instructions. That would be pretty good, depending on the number of loads or stores per set of instructions. So let's say you have a load or a store once every, I don't know, five instructions; so maybe 20%, or a little bit less than 20%, of the instructions are loads or stores, which might be typical in a typical program for a typical processor. Then you can make some statement about how the cache is performing in aggregate. And you'll see that a lot of the numbers in the Patterson and Hennessy book actually use misses per 1,000 instructions, which is typically a more useful metric.

Okay, so let's take a look at how adding a level-two cache influences the design of a level-one cache. It's pretty common that when you add multiple levels to your cache hierarchy, it actually influences the lower levels of your cache hierarchy. So how does adding a level-two cache potentially influence the level-one design? Well, one interesting thing you can do here is that just having a level-two cache might allow you to think about having a smaller level-one cache.
So if you have a relatively close level-two cache, then for the same performance you can actually have a much smaller level-one cache and have the level-two cache take care of the rest. And this can actually help performance even more, because you can potentially move the level-one cache closer, now that it's smaller, or increase the speed of the level-one cache. But the miss rate from the level-one cache is going to go up, because you made it small.

So what about energy? This can actually be a really good energy win. How is this an energy win? You are able to make your level-one cache smaller because you now have a level-two cache. So the common case is now an access to this smaller level-one cache; you can be accessing it relatively frequently, and each access you do to it is itself going to take less energy. So something to think about here is that our level-two design is going to influence our lower-level designs; it's not in a vacuum. It's not like we can just slap on a level-two cache or a level-three cache and say the lower levels of the cache hierarchy don't change.
Another way that the level-two design, or the presence of a level two, can really influence level one is that you might be able to have a much simpler level-one cache design, because you have a level-two cache. So what does this look like? Well, let's say in your old design the level-one cache was a write-back cache, so it stored dirty data. And when a cache conflict occurred, it had to choose a victim, evict something out of the level-one cache, and wait for that eviction to occur, or at least find the bandwidth for that eviction to occur.

With something like a level-two cache that backs a level-one cache, you can actually move to a write-through design. Now, how is this possible? Why would you be able to do a write-through design when you weren't able to before? Well, if you only had one level of cache and you did write-through, every write you did would have to go out to main memory, and you largely don't have enough bandwidth to do that out to your main memory store, out to your DRAM. But because you now have a level-two cache, you can use that as a buffer, if you will, to absorb the write-through traffic, and that traffic doesn't have to go off-chip.
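The arrangement just described can be sketched as a toy model: a write-through L1 that never holds dirty data, forwarding every store on-chip to a write-back L2 that does the dirty-line bookkeeping. This is a deliberately minimal sketch, not how real hardware is structured; the class and method names are made up for illustration.

```python
class WriteBackL2:
    """Toy write-back L2: tracks dirty lines. Only this level must
    eventually write modified data back out toward DRAM."""
    def __init__(self):
        self.data = {}       # address -> value
        self.dirty = set()   # addresses modified since last write-back

    def store(self, addr, value):
        self.data[addr] = value
        self.dirty.add(addr)

class WriteThroughL1:
    """Toy write-through L1: holds no dirty data, because every store
    is forwarded to the L2. Dropping a line needs no write-back."""
    def __init__(self, l2):
        self.data = {}
        self.l2 = l2

    def store(self, addr, value):
        self.data[addr] = value
        self.l2.store(addr, value)   # absorbed on-chip by the L2, not DRAM

    def invalidate(self, addr):
        # No eviction traffic: the L2 already has the latest value.
        self.data.pop(addr, None)

l2 = WriteBackL2()
l1 = WriteThroughL1(l2)
l1.store(0x40, 7)
l1.invalidate(0x40)          # just blast the line away...
assert l2.data[0x40] == 7    # ...the up-to-date copy survives in the L2
```

Notice that `invalidate` is a one-line drop; in a write-back L1 it would instead have to check a dirty bit and push the line out first.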
So you could have a write-back L2 which tracks all the dirty data and makes sure the evictions actually happen on invalidations, for instance writing dirty data back out, while the level-one cache can just write through all data. Now, this requires you to have enough bandwidth between your level one and your level two, but that's typically a lot easier to come by, because your level-one cache and level-two cache are usually located near each other, both on-chip, in a modern-day processor.

Let's see, other reasons that this is good. Well, it really simplifies your level-one design. If you have a write-through cache in your level-one design, you don't have to worry about dirty victim evictions, the control becomes easier, and a lot of times it becomes easier to integrate into your pipeline, which is a good thing.

Something we haven't talked about yet, but will be talking about at the end of this course, is how something like this multi-level cache can actually simplify your cache coherence issues; that is, how caches deal with coherence.
And by having a smaller L1 which is write-through, backed by an L2, you can really simplify the cache coherence problem. So what is cache coherence? Well, we've talked about compulsory misses, capacity misses, and conflict misses. When we get to the end, not of this class, but of this course, we'll be talking about cache coherence, which is about multiple processors each having caches, and keeping the caches coherent with one another so that the data does not become stale between those different caches. One of the questions that comes up is whether having write-through and simplifying the L1 makes things easier from that perspective, and it does. Now, why is that? Well, something we're going to learn about is coherence misses: on a coherence miss, a different cache is basically going to reach across the bus or interconnect and tell this cache to invalidate something.
And if you only have one cache per processor, if you only have a level-one cache, we'll say, or something like just one cache level, then when another processor reaches across the bus, and this cache is tightly integrated into your pipeline, you now have to deal with these external requests coming in to do invalidations, or to do something like snoops, which we'll also be talking about in a few classes, when we get to the end of this course. But if you have a level one backed by a level two, you can have the level two absorb a lot of that complexity. So you can take complexity and push it from the level one into the level two. Now, you're still going to have to figure out how to do invalidations in the level one; but before, when you actually had to do an invalidation, you could potentially have had dirty data. You would have had to choose a victim and then evict that line, and the invalidation would naturally generate the eviction. But now, because the level one is, let's say, a simpler write-through level one, you don't have to generate the eviction; you just have to invalidate the line. It's a lot easier to simply blast away lines than it is to blast away lines and also figure out how to take that data and evict it.
It's also a lot less disruptive to your main processor pipeline, because you don't have to stall to do the eviction, and there's no blocking of further loads and stores coming from the main processor pipeline.

Last, if you have a write-through level-one cache, this can really simplify error recovery, and when I say error recovery I mean from a soft-error perspective. So you have your chip and it gets struck by radiation, and this is actually a relatively common occurrence: you have a chip, and an alpha particle, something coming out of the sky, some highly energetic piece of radiation, hits your chip. Well, that flips a bit. And how do we protect against this? Well, there are a couple of different solutions. We can use error-correcting codes (ECC), or we can use some simpler ideas like parity. And if you have something like a write-through cache, because you never have dirty data in the cache, you can get away with maybe just having parity and not full ECC or something like that. And if you were to detect a parity error now in this write-through cache, you just have to basically invalidate the line.
You don't have to declare the data corrupt in memory, because you know that the L2 has an up-to-date copy, the most up-to-date version of it. So the L2 might have more protection on it, but the L1 can basically just invalidate if it gets struck by an alpha particle. So, something to think about there: the presence of a level two can really influence our level-one designs.

Okay, how else does this occur? How else do the level-one and level-two designs commingle? Well, there's the question of what the inclusion policy is. Now that we have multiple levels of cache, we can think about the lower-down level, let's say the level one, having a certain piece of data, and the level two either having that same piece of data or not. And caches where anything in the level-one cache is also in the level-two cache, or anything in a lower-level cache is also in a higher-level cache, we're going to call inclusive caches.
So the inner cache holds copies of the data in the farther-out cache, and external snoop accesses only need to check the outer cache, as we were talking about before. That's what we call an inclusive cache. Contrast that with an exclusive cache. In an exclusive cache, the different layers of your cache, for instance your level-two cache, are not going to have, or may not have, the same data that is in your level-one cache. What does that mean? Well, it means that if you evict something out of your level-one cache, you probably want to put it back into the level-two cache in an exclusive design, because that's the idea behind caches: you want to keep a certain amount of working set in your cache. So if you evict something out of level one, that's probably still a relatively useful piece of data; it would probably still fit in your larger working set at level two. So you don't just want to throw it away; you want to keep it around. So in exclusive caches, there's typically a swap operation that occurs: you're actually going to swap lines between the lower-level and the higher-level cache when you move data.
So the higher level of cache, the cache farther away from the processor, is going to go access main memory and bring the line into itself. Then, when the level-one cache goes to bring it in, you're actually going to swap two lines, and this adds complexity to your hardware design: you have to add something which actually does the swap. It could potentially be bad for power, but people have actually built these. If you go look at the original AMD Athlon, it had a 64-kilobyte primary cache over a 256-kilobyte secondary cache, and they were exclusive. Now, you might say this sounds like a lot of headache; why would you ever build an exclusive cache? Well, it has one really big benefit: you get more storage area. Because you're not keeping two copies of the same piece of data, you can effectively store more data in your cache. So with this AMD Athlon, with a 64-kilobyte primary cache backed by a 256-kilobyte secondary cache, you have the sum of those two capacities to store data, and it's all unique data. In contrast, if we had the same cache hierarchy but with inclusive caches, you would effectively have only the 256 kilobytes of the larger cache.
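The capacity comparison is just addition, but it's worth making it concrete. A quick sketch using the Athlon-style sizes from the lecture:

```python
KB = 1024
l1_capacity = 64 * KB    # primary cache (lecture's Athlon example)
l2_capacity = 256 * KB   # secondary cache

# Exclusive: no line lives in both levels, so the capacities add.
exclusive_unique = l1_capacity + l2_capacity   # 320 KB of unique data

# Inclusive: every L1 line is duplicated in the L2, so the hierarchy
# holds no more unique data than the L2 alone.
inclusive_unique = l2_capacity                 # 256 KB of unique data

print(exclusive_unique // KB, inclusive_unique // KB)  # 320 256
```

With a small L2-to-L1 size ratio like this 4:1, exclusion buys you a full 25% more effective capacity; with a much larger L2, the duplicated L1 contents matter less and inclusion's simplicity tends to win.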
That is, you get only the larger, farther-away cache's amount of storage space, because the lower-level, primary cache only keeps copies of what is already in the farther-out cache in the inclusive design.

Let's take a look at a few examples of caches in modern-day systems and see what trade-offs people have made. So let's start off by taking a look at something like the Itanium 2 processor. For the Itanium 2, we have our chip, and the first thing you're going to notice in this die photograph is that the level-three cache is very large. So this has a very large level-three cache, but this is a big-iron processor. The Itanium 2 was an Intel chip, or is an Intel chip, or rather Intel and HP together, I guess, because they were collaborating on this project at the time. It has a big level-three cache, and it's kind of a funny shape. The reason it's funny shaped is that they just took all the extra space they had on their die and filled it basically with cache. And this was fine, because the cache was so far away from the processor that it didn't have to be a regular shape. But let's take a look at the different levels here. So, we have a level-one cache.
It's small: 16 kilobytes, 4-way set associative, with a 64-byte line size. It's heavily ported; it can do a lot of loads and stores at the same time, two loads and two stores concurrently, and it's fast, with single-cycle access. Now, 16 kilobytes is relatively small by today's standards. In 2002 that was probably a little on the large side, and this was a large processor. But by 2013, that's not that big for a level-one cache.

If we step up here, we can see that, typically, as you step up, you increase your size; otherwise, what would be the point? But the latency also gets worse. So we've seen single-cycle latency for level one; we have five cycles of latency for level two, and we're a lot bigger, at 256 kilobytes of cache. The associativity stays the same: it's still 4-way set associative. And as we get farther away, we see we have a 3-megabyte cache with twelve cycles of latency. So it's this big cache around the outside of the chip, but we actually increase our associativity: it's a 12-way set-associative cache. And the line size in both the level-two and level-three caches is larger.
Now it's 128 bytes instead of 64 bytes, and that's pretty common. You'll see that as you get farther away, people put in larger line sizes and move larger chunks of memory at a time. And to some extent this is because you have more capacity in these caches, so it doesn't hurt you as much. Recall our earlier lecture where we plotted the line size, the block size, versus performance: if you recall, a large block hurts you more in a smaller cache, because there just aren't that many lines you can fit into something like, let's say, a 4-kilobyte cache. But once you get to something like a 3-megabyte cache, you can be a little bit wasteful with storage if it helps you with your potential spatial locality.

One point I wanted to make here is a rule of thumb that people typically use for cache design, and this is just an empirical rule of thumb. The empirical rule of thumb is that, usually, when you go from one level of cache to the next level of cache, you want to step up in size by a minimum of a factor of eight. Now, where does that come from? Well, it's totally empirical, but it's based on how much extra latency you have to add when you add another level of cache hierarchy. So if you add another level of cache hierarchy, it has some extra latency.
366 00:23:28,156 --> 00:23:34,219 And while it's going to decrease your cache miss rate, 367 00:23:34,219 --> 00:23:39,900 it also is going to make your time out to main memory, 368 00:23:39,900 --> 00:23:41,656 let's say, a little bit worse. 369 00:23:41,656 --> 00:23:45,436 So there's a trade-off there of: does it make sense to add the extra cache layer and 370 00:23:45,436 --> 00:23:47,056 go check the extra cache layer, or 371 00:23:47,056 --> 00:23:49,650 is it better to go out to main memory at that point? 372 00:23:49,650 --> 00:23:51,698 And the question is how much benefit, 373 00:23:51,698 --> 00:23:55,421 how much larger does that cache have to be to have any useful benefit? 374 00:23:55,421 --> 00:23:59,397 And empirically, when people have built these processors, 375 00:23:59,397 --> 00:24:03,042 usually you have to step up by something like a factor of eight. 376 00:24:03,042 --> 00:24:07,937 If you step up by only a factor of four, the benefit that you get from the cache due to 377 00:24:07,937 --> 00:24:11,710 the extra size does not outweigh the extra complexity, cost, and 378 00:24:11,710 --> 00:24:16,663 time that is added by the extra cache, but that's just an empirical rule of thumb. 379 00:24:16,663 --> 00:24:18,885 But if you go analyze enough processors, 380 00:24:18,885 --> 00:24:23,084 you'll see that basically every cache level steps up by a minimum of about a factor of eight. 381 00:24:23,084 --> 00:24:29,121 So we'll call this the 8X, the 8 times rule, if you will. 382 00:24:29,121 --> 00:24:32,983 This cache here steps up by a factor of 16. 383 00:24:32,983 --> 00:24:35,139 So it's a little bit more than a factor of eight, but 384 00:24:35,139 --> 00:24:37,454 let's look at another, a bit more modern processor here. 385 00:24:37,454 --> 00:24:39,924 So, this is the IBM Power 7. 386 00:24:39,924 --> 00:24:43,220 So the IBM Power 7, this is an eight-core machine.
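The trade-off being described is the standard average memory access time (AMAT) recurrence: each level contributes its hit time, plus its local miss rate times the cost of the next level. A minimal sketch, with illustrative latencies and miss rates that are assumptions rather than numbers from the lecture:

```python
def amat(levels, mem_latency):
    """Average memory access time for a list of
    (hit_latency, local_miss_rate) cache levels backed by memory."""
    penalty = mem_latency
    # fold from the last level back toward the L1
    for hit, local_miss in reversed(levels):
        penalty = hit + local_miss * penalty
    return penalty

# Assumed numbers: L1 is 1 cycle with a 10% miss rate; memory is 100 cycles.
no_l2 = amat([(1, 0.10)], 100)
# Add an L2: 5-cycle hit, 50% local miss rate.
with_l2 = amat([(1, 0.10), (5, 0.50)], 100)
print(no_l2, with_l2)   # 11.0 vs 6.5 cycles: the extra level pays off
```

With these assumed numbers the extra level wins; if the L2 were barely larger than the L1, its local miss rate would sit near 100% and its hit latency would be pure overhead, which is the intuition behind the 8X rule.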
387 00:24:45,220 --> 00:24:49,649 There are private level one caches per core, and 388 00:24:49,649 --> 00:24:56,355 then there is a big level three cache sitting in the middle of this processor. 389 00:24:56,355 --> 00:24:57,293 Well, what does this look like? 390 00:24:57,293 --> 00:25:00,671 Well, we have relatively small level one caches. 391 00:25:00,671 --> 00:25:03,064 We have a 32-kilobyte L1 I cache and 392 00:25:03,064 --> 00:25:08,289 a 32-kilobyte L1 D cache, and the latency is higher than in 393 00:25:08,289 --> 00:25:14,243 our previous example, it's a three cycle cache latency. 394 00:25:14,243 --> 00:25:18,860 And as we go farther out, we can see we actually have an 8X step up in 395 00:25:18,860 --> 00:25:23,075 cache size here, at least from a data cache size perspective. 396 00:25:23,075 --> 00:25:28,205 So, we go from 32 to 256 kilobytes, and the latency is worse. 397 00:25:28,205 --> 00:25:30,925 So, there's eight cycles' worth of latency to check on level two. 398 00:25:30,925 --> 00:25:35,039 And then finally here, we have eight cores sharing this cache. 399 00:25:35,039 --> 00:25:38,755 So, it's a lot of cores sharing this cache. 400 00:25:38,755 --> 00:25:41,998 We have a 32 megabyte unified level 3 cache that's 401 00:25:41,998 --> 00:25:45,251 actually built out of embedded DRAM in this processor. 402 00:25:45,251 --> 00:25:49,445 So, IBM has some pretty impressive technology here where they can actually 403 00:25:49,445 --> 00:25:51,160 embed DRAM on a logic process. 404 00:25:51,160 --> 00:25:53,955 This is not common in most other processes. 405 00:25:53,955 --> 00:26:00,077 This is typically only an IBM trait, an IBM process technology sort of trait. 406 00:26:00,077 --> 00:26:03,713 But some other people do also have embedded DRAM now, but IBM 407 00:26:03,713 --> 00:26:08,504 is really the forefront leader there, but the latency here is quite a bit higher. 408 00:26:08,504 --> 00:26:16,035 So it's a 25-cycle latency to the Power 7's level three cache.
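We can sanity-check the 8X rule of thumb against the Power 7 data-side sizes quoted above; a quick sketch:

```python
# Power 7 data-side cache sizes from the lecture, in kilobytes.
sizes_kb = [32, 256, 32 * 1024]   # L1 D, L2, shared L3

# Step-up factor between each adjacent pair of levels.
for lower, upper in zip(sizes_kb, sizes_kb[1:]):
    print(upper // lower)   # 8 (L1D -> L2), then 128 (L2 -> L3)
```

Both steps meet the minimum-8X rule; note the L3 is shared by eight cores, so its effective per-core share is smaller than the raw 128X step suggests.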
409 00:26:16,035 --> 00:26:17,283 Let's pull up the scoreboard. 410 00:26:17,283 --> 00:26:21,485 So we've been building a scoreboard here throughout this lecture and seeing how adding 411 00:26:21,485 --> 00:26:26,329 these different techniques, these advanced cache techniques, helped with performance. 412 00:26:26,329 --> 00:26:29,123 So, what is our scoreboard going to look like here? 413 00:26:29,123 --> 00:26:34,401 Well, we have multilevel cache, and our first question 414 00:26:34,401 --> 00:26:39,001 is what happens from the level one perspective? 415 00:26:39,001 --> 00:26:42,243 Well, the miss penalty. 416 00:26:42,243 --> 00:26:46,169 What does adding a multilevel cache do to the miss penalty? 417 00:26:46,169 --> 00:26:53,049 Well, when we take a miss, you have to go out to the next level. 418 00:26:53,049 --> 00:26:56,878 And before, we had to go out to the main memory, which could be far, far away. 419 00:26:56,878 --> 00:26:59,875 So, the miss penalty was quite high. 420 00:26:59,875 --> 00:27:04,152 But if we have a multilevel cache and, let's say, we have a level two cache 421 00:27:04,152 --> 00:27:05,593 just a few cycles away, 422 00:27:05,593 --> 00:27:07,747 maybe five cycles away, 423 00:27:07,747 --> 00:27:12,539 your miss penalty now goes down as seen from the level one cache when we add 424 00:27:12,539 --> 00:27:14,201 this multilevel cache. 425 00:27:16,665 --> 00:27:18,032 So that's the level one, 426 00:27:18,032 --> 00:27:21,337 what about if we draw a box around all of the levels of our cache? 427 00:27:21,337 --> 00:27:25,198 So let's say, level one, level two, level three, what happens? 428 00:27:25,198 --> 00:27:29,833 Well, in aggregate, our miss penalty for 429 00:27:29,833 --> 00:27:36,291 each particular level is going to go down, so that's a plus.
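The L1 miss-penalty improvement can be made concrete: without an L2, every L1 miss pays the full trip to memory; with an L2, the expected penalty is the L2 hit time plus the L2's local miss rate times the memory latency. A sketch with assumed numbers (the 5-cycle L2 is from the lecture, the rest are illustrative):

```python
def l1_miss_penalty(l2_hit, l2_local_miss, mem_latency):
    # Expected penalty seen from L1: check the L2 first,
    # and go all the way to memory only on an L2 miss.
    return l2_hit + l2_local_miss * mem_latency

mem = 100   # assumed cycles to main memory
print(mem)                               # no L2: every L1 miss costs 100 cycles
print(l1_miss_penalty(5, 0.5, mem))      # 5-cycle L2, 50% local miss: 55.0 cycles
```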
430 00:27:36,291 --> 00:27:41,929 As you look farther away, going out to main memory, what you're 431 00:27:41,929 --> 00:27:48,072 actually going to see is the miss rate is also going to go down in aggregate, 432 00:27:48,072 --> 00:27:54,339 because you're not going to have to miss out of your last level cache as often. 433 00:27:54,339 --> 00:27:56,244 So, the miss rate here goes down. 434 00:27:56,244 --> 00:27:59,393 So you're not going to have to go out to main memory as much, 435 00:27:59,393 --> 00:28:03,078 because now you have a larger cache, though not as big as your main memory or 436 00:28:03,078 --> 00:28:05,972 your DRAM, sitting in front of there. 437 00:28:05,972 --> 00:28:08,131 So, the miss rate goes down. 438 00:28:08,131 --> 00:28:10,659 So, you have a lower overall miss rate.
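This aggregate miss rate is usually called the global miss rate: the fraction of all accesses that miss in every level and actually reach DRAM, which is the product of the per-level local miss rates. A minimal sketch, with assumed local miss rates:

```python
def global_miss_rate(local_rates):
    # Fraction of accesses that miss at every level and go to memory.
    out = 1.0
    for rate in local_rates:
        out *= rate
    return out

# Assumed local miss rates: 10% in L1, 50% in L2.
print(global_miss_rate([0.10, 0.50]))   # 0.05 -> only 5% of accesses reach DRAM
```

This is why drawing a box around the whole hierarchy shows a lower miss rate than any single level alone.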