So, a few things on our agenda for today: we're going to finish up our review of caches, and then we're going to start talking about in-order superscalars, which is something new. And I want to say this now, though I'll reiterate when I come to it: this other book, the [unknown] book, is much more useful for the superscalar portion of our class than the Hennessy and Patterson book. I recommend either acquiring it or borrowing a friend's copy, or the library has a copy of that book. At least reading the two to three relevant chapters in there would be useful. But that's not until the second half of today's lecture. For the first half, we have to finish up caches. So, going back to review, let's talk about what caches can look like on the inside, and a couple of different ways to classify caches. We've already talked about why caches are good, so this is a little bit of a review. You have big, slow things over here, big slow memories, and you have the processor. And you can think about how to build small, fast memories. If we can put a small, fast memory close by, and have a high probability that things are actually located in this small, fast memory, then we don't have to go out to the big, slow memory as often.
We can save power and have better performance, and so on. Okay. So, how do we classify caches, and, reviewing, what does a cache look like on the inside? So: processor, cache, main memory, and addresses flow that way. Data can go in either direction, because you can either do a store, which will put data out from left to right, or you can do a load, which will have data coming back the other way. So, let's look at our sort of mythical view here of what a cache looks like on the inside. Caches contain copies of what is being stored in main memory. And if you look inside of it here, let's say this first line stores address 100. So here, we have address 100, and it has some data in it. Typically, you introduce this notion of a line in a cache because we want to reduce the tracking information for a particular cache line, sometimes called a block. It is possible to build something which tracks data on, let's say, a bit or byte basis, but then the tracking information is probably larger than the amount of data you're actually trying to store. So, that's not really a great idea.
In the cache here, if you look inside of it, what we store is an address tag. So, it's not the full address. It could be the full address, but then you're probably storing extra bits. You only need to store enough of the address to be able to differentiate the different things that could be stored at the same location. So, in a fully associative cache, where you can put any data in any location, you will have to store, for instance, the whole address in the tag. But if you, let's say, have a direct-mapped cache, where each address can only go to one location, you get to store only a subset of it. And I wanted to point out here: sometimes people call this whole thing a line. A line typically includes the tag information and possibly valid bits, and the block is the actual data that we're storing. Okay. So, this is really a review of what the cache lookup algorithm looks like. Let's review it briefly. The processor issues a load, so we're going to look at the load case here. We take the address, and we compare the subset of the bits that could be different among the different things that can be stored in this location in the cache, and we check it against the tag.
One thing not shown here is that we also need to check that the line is valid, because it is possible you could have an invalid cache line. There is typically one bit which says whether the cache line is valid or not valid. You have to make sure it is valid. If so, you get a cache hit and just return the data. If not, you go out to main memory. Ooh, this is a tricky one here. You have to go find someplace to put that data, because this new data comes in, and if you want to store it in your cache now, there's probably something in the cache at that same location that just gets bumped out. We're going to talk briefly about replacement policies, and later, in the advanced caches lecture, we'll talk more about that and other cache policies, and much more advanced replacement policies. But you kick something out, you fill the new data in, and then finally you return the data. You could also think about trying to do this step and this step in parallel, effectively, and a lot of more advanced cache systems will actually return the data before they finish the eviction, to reduce the latency of the load.
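The load path just described can be sketched as follows. This is a minimal toy model, assuming a single-level, direct-mapped cache; the dictionary-and-list layout and the function name are illustrative, not from the lecture:

```python
def load(cache, memory, addr, block_size=4):
    """Direct-mapped load: check tag AND valid bit; on a miss, evict, fill, then return."""
    block_addr = addr // block_size
    index = block_addr % len(cache)        # which line the address maps to
    tag = block_addr // len(cache)         # remaining bits distinguish aliases
    line = cache[index]
    if line["valid"] and line["tag"] == tag:
        return line["data"]                             # cache hit
    data = memory[block_addr]                           # miss: go to slow main memory
    cache[index] = {"valid": True, "tag": tag, "data": data}  # kick out old line, fill
    return data

memory = {i: f"block{i}" for i in range(32)}
cache = [{"valid": False, "tag": 0, "data": None} for _ in range(8)]
assert load(cache, memory, 48) == "block12"   # miss: block 12 fills line 12 % 8 = 4
assert cache[4]["valid"]
assert load(cache, memory, 48) == "block12"   # same address again: now a hit
```

A real cache would track dirty bits and write-backs on eviction; this sketch only models the hit/miss/fill sequence.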
Okay, so let's do some classification, a sort of taxonomy of what a cache looks like. So, what could we talk about? We could talk about block placement: the locations where a particular block, or a particular address, can be found in the cache. Block identification, which is the question of how to actually find something in the cache if I give you an address; this is kind of the inverse of the placement question. Block replacement, which is figuring out what should be replaced on a miss. And then, finally, what happens on a write. The first three are all sort of read questions; well, block replacement is not really a read question, but the write strategy is a good question here: what happens when a write occurs? Do we update all the caches, or only the main memory? Do we evict it just out of the cache? All of these are possible strategies to choose. Okay, so let's look at a block placement figure here, and look at a couple of different types of caches. Up here, we have a 32-word memory. Let's say this is all of the memory in the machine. And, depending on how you want to look at this, each block, let's say, is one byte, or it could be some number of bytes. To give an example, something typical like your Core 2 Duo, I believe, has a 64-byte cache line, or cache block size. Some people have tried to push that up, and there are reasons to make it larger or smaller, but we'll talk about that in a second.
So, let's take this highlighted location here, block number twelve, and let's see where it can fit into different types of caches. We have three different types of caches here. Let's start on the right side. On the right side, we have what's called a direct-mapped cache. The direct-mapped cache is, literally, an array, and each block can only fit in one place in the cache. And the indexing function that you use is: you take the block number mod the size of the cache. So, in this case, we have block number twelve, or byte number twelve here; we take twelve mod eight, and that says block four. So, this is the only place to go look for the data. Okay, that's nice and simple. It may not give the best performance, but it's easy to build. It's all index-based. Okay, then, one step up from there, we can start to look at things like set associativity. What set associative means is that instead of having only one unique location where a block can be placed, you allow it to be in some number of locations. In this drawing here, there's a two-way set associative cache. So, each block can fit in one of two places.
Now, you can actually have machines which have much more associativity, or higher associativity. An example of this is the modern Core i7 processors. Their L1 caches are 8-way set associative, and farther out, they're even more highly associative; I think they have a 16-way set associative cache in their L3, or farther-out, caches. So, let's look through a basic example here. We want twelve mod four, so how do we do this? Well, there are only four buckets, if you will. And inside of a bucket, a block can fit into either of the two sub-slots in the bucket, which are called ways. So, if we take twelve mod four, that is zero. So, it has to fall into this bucket here. It can be in either one of those two ways, but it can't be in both. That's an important point there. Finally, there are types of caches which we'll call fully associative caches. What this means is that the associativity, or the number of ways, is equal to the number of elements in the cache. So here, there are eight elements, eight block locations, in this cache. And it's fully associative, which means there is a single set with eight ways, if you will. So, it can fit in any of these locations.
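All three placement schemes fit one formula: the candidate locations form a set, and the set index is the block number mod the number of sets. A small sketch (the function name is illustrative, not from the lecture) reproduces the block-twelve examples:

```python
def placement_set(block_num, num_blocks, ways):
    """Which set a block maps to, for a cache with num_blocks lines and a given associativity."""
    num_sets = num_blocks // ways      # direct-mapped: ways=1; fully associative: ways=num_blocks
    return block_num % num_sets

# Block 12 in an 8-line cache:
assert placement_set(12, 8, 1) == 4    # direct-mapped: 12 mod 8 = 4, the only legal slot
assert placement_set(12, 8, 2) == 0    # 2-way: 12 mod 4 = 0, either way of set 0
assert placement_set(12, 8, 8) == 0    # fully associative: one set, any of the 8 ways
```

Within the chosen set, the block may occupy any of the `ways` slots; the mod only picks the set.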
So, this same address, block number twelve, can fit anywhere in this cache. Okay. So now, the next question is: how do we actually find the block in the cache? You're going to, let's say, take an address, and you want to figure out where it goes in the cache. Let's take a look at the example of a direct-mapped cache first. And here's our information: we have the data block, and then we have some tag information, and we have a valid bit. So, if we look at the address that we're sticking in here, let's say this is a 32-bit address. Some portion of that address is just going to figure out where in the data block the load or store is going. So, if you have, let's say, a 64-byte cache line, or 64-byte data blocks, you're going to basically index into some sub-word in there based off the offset bits. Then, we start to look at the block address. The traditional way to do this is that you use these middle-order bits as the index. There are more complicated cache systems which move this around or do much more complicated hashing functions, but the basic hash function here is: we're going to take the index. So, in this cache here, let's say there are four entries in the cache.
How many bits are in this index? Two, yes, okay. Two to the two is four. So, this index is in here and chooses one of these four locations. And then, finally, this tag information here is used to compare against the tag that we store. And that will say whether this address is actually in the cache right now, or if we have to go out to main memory. So, let's work a brief example here. We have 32-bit addresses. We have a four-set, direct-mapped cache. And we have 64 bytes in our block size. How many bits are in the offset? Let's start with that. The offset is going to have to index a byte within the block here. So, if we have byte-addressable memory and there are 64 different byte positions, log base two of 64 is six. Okay, so we have six bits of offset. Okay, what did we say about the index here? We said this was already two bits. Okay, so that's eight bits total. So, how many bits of tag do we have? The tag information is the remaining 24 bits. And as you can see, as you make the data block size larger, or the cache line size larger, you need fewer tag bits, because the offset sort of eats away at the address this way, and the index gets shifted up. So, that could be a good thing.
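The bit-counting in this worked example can be checked mechanically. A small sketch, using the lecture's numbers (32-bit addresses, 64-byte blocks, four sets; the function name is illustrative):

```python
def address_fields(addr, block_bytes=64, num_sets=4, addr_bits=32):
    """Split an address into (tag, index, offset).
    64-byte blocks -> 6 offset bits; 4 sets -> 2 index bits; tag = 32 - 2 - 6 = 24 bits."""
    offset_bits = (block_bytes - 1).bit_length()   # log2(64) = 6
    index_bits = (num_sets - 1).bit_length()       # log2(4)  = 2
    offset = addr & (block_bytes - 1)              # low bits: byte within the block
    index = (addr >> offset_bits) & (num_sets - 1) # middle bits: which set
    tag = addr >> (offset_bits + index_bits)       # high bits: stored and compared
    return tag, index, offset

tag, index, offset = address_fields(0x12345678)
assert offset == 0x38            # low 6 bits of the address
assert index == 1                # next 2 bits
assert tag == 0x123456           # remaining 24 bits
```

Doubling the block size to 128 bytes would add one offset bit and shave one bit off the tag, which is the trade-off mentioned above.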
Finally, there's this V column here. This is what I was alluding to before; it's really important. You need a valid bit to see whether the line is valid. It's very possible you could have old data, or the cache is not initialized, at which point these V bits would all be zero, or not valid. Okay, so let's do a slightly more advanced example here. We have a two-way set associative cache, still with four lines. And we want to go find our data. Hm. So, how do we go about doing that? What's interesting here is that when we did this before, we only had to check one tag location. Well, for a cache like this, the index is going to tell us which of the two sets it's in. But it can be in either of the two ways of that set. So, we actually need to do a tag check against this tag and that tag, and check the two valid bits. Many caches can do this in parallel, but that's how you figure it out. And it's possible that it's in neither of those two, and we have to just say that it's a cache miss. Okay. So, let's talk about when you need to take things out of the cache; let's talk about block replacement. We need to figure out what to kick out of the cache when you take a cache miss and bring some new data in.
An important point here: in a direct-mapped cache, this question makes no sense, because in a direct-mapped cache, a block can only go to one location. In set associative or fully associative caches, we need to make a decision. And hopefully, we only need to make this decision when a set becomes full, because otherwise, we'll just choose the empty location in that set. So, if you have a two-way set associative cache, and one of the ways is full of data, and the other one is just empty, its valid bit is clear, we probably shouldn't even be looking at different choices for how to replace things, because there's an open spot to put the data; we should just put the data there. Okay. So, what are some good block replacement policies, or good cache replacement policies? The first one is random. You just choose randomly out of a hat. Actually, it doesn't work so badly. It's easy to implement: you just have some linear feedback shift register, or you choose some random number somehow, and you replace that way. Not so bad. Well, now, if we put our thinking caps on, we can start thinking about some better ways to do this. We can think about least recently used. Hm.
So, the idea behind this strategy is that you are trying to preserve temporal locality. If you've used some data recently, there's a heuristic here that it's likely to be used again in the not too distant future. So, if you look back to that plot of time versus address that we had, along the time axis we think that temporal locality is something that happens relatively often. So, one really good strategy is to pick the least recently used block as the block to get rid of. If something has not been used recently, we kick it out of the cache. Now, one problem with this is that it gets really hard to do. Well, there are two problems. One: to implement proper least recently used, every single load or store that happens to the cache needs to update the tracking information. That's not great for power. But it's also not great because you basically need very fine-grained tracking information on each cache line, updated on every single cycle that does a load or a store. So, you have some cache state that needs to be updated pretty often. Plenty of people do that, though.
If you have a least-recently-used piece of information, where you have one extra bit on each cache line, you can think about just flipping that bit back and forth, having one bit for that information, with every single load and store setting that bit; lots of processors do that. What starts to get hard is when you try to go above two-way caches. With two-way, you have one bit that can decide which way is the least recently used. If you start to have, let's say, an 8-way set associative cache, and you want to do true least recently used, you've got to somehow keep something like timestamps for each different way. You have to say, oh, well, this one was just recently accessed, and you can timestamp it, or you can try to reorganize a list of way numbers in order. These are the things that start to get harder to do in hardware, especially on the critical path of your processor. So, a lot of times, what people do is they have pseudo least recently used for highly associative caches, where you'll try to keep accurate information about maybe the last two most recently used lines, and for the other ones you do random, or something like that. Or there are some more complex things, which we'll talk about later in the class. Another one: first in, first out.
So, whatever gets brought into the cache first sits in the cache for some period of time and then gets kicked out in the future. You're sort of guaranteed that some piece of data that you bring in will sit in your cache for some period of time, and this is a lot easier to implement in highly associative caches. But it's certainly not the same as implementing least recently used, because if you bring in a cache line in, let's say, a four-way set associative cache, four fills later it's at the top of the list to get kicked out. But what's nice about this is that you're guaranteed it'll sit in your cache for some period of time and won't just get randomly kicked out, which is a problem that random has. Finally, another acronym here: not most recently used. Let's think about that one for a second. We're going to kick out something that is not the most recently used block. So, if we can keep track, say, in a four-way cache, we can think of having two bits per set that say which line was the most recently used, and we pick among the rest of them at random. And we know we are not going to be kicking out the most recently used line. So, write strategies: you're going to do a store, and you've got to know where you put the data.
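The contrast between LRU and FIFO above can be made concrete for a single set. A toy sketch (the class and method names are illustrative; the "ways" are just slots in a list, leftmost is the next victim):

```python
class CacheSet:
    def __init__(self, ways):
        self.ways = ways
        self.blocks = []   # eviction order: index 0 is the next victim

    def access_lru(self, block):
        """LRU: EVERY access, hit or miss, updates recency (the costly bookkeeping)."""
        if block in self.blocks:
            self.blocks.remove(block)       # hit: move to most-recently-used end
        elif len(self.blocks) == self.ways:
            self.blocks.pop(0)              # set full: evict least recently used
        self.blocks.append(block)

    def access_fifo(self, block):
        """FIFO: eviction order is arrival order; hits need no bookkeeping at all."""
        if block in self.blocks:
            return                           # hit: nothing to update
        if len(self.blocks) == self.ways:
            self.blocks.pop(0)               # evict the oldest arrival
        self.blocks.append(block)

lru, fifo = CacheSet(2), CacheSet(2)
for b in ["A", "B", "A", "C"]:   # 2-way set: A, B fill it; A is re-used; C forces an eviction
    lru.access_lru(b)
    fifo.access_fifo(b)
assert lru.blocks == ["A", "C"]    # LRU evicted B, since A was touched again
assert fifo.blocks == ["B", "C"]   # FIFO evicted A, since it arrived first
```

The divergence on the re-touched block "A" is exactly the lecture's point: FIFO ignores reuse, so a hot line can still be the next victim.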
Important, important, important thing here. Caches have lots of different allocation policies, and policies about whether you keep the data or throw out the data at the different levels of the cache hierarchy, when a store, a write, happens. So, let's take a look at a couple of different choices here. Let's say you are doing a store and you have a cache hit, so it hits in your cache. Should you put that data in main memory also? If you do a store, and you put every single store into your cache and into main memory, even when it's already in your cache, then, as was pointed out, it uses bandwidth out to main memory. So, that can be a downside. There are some positives to this, though. When you go to evict the line, you don't have to write anything back to main memory; you have effectively pre-written the data to main memory. So, that's a positive. You could also think about this in a multilevel cache hierarchy: at some levels of the hierarchy, it may make sense to actually put it in the next level out, if you have enough bandwidth. So, this is actually a trade-off. As I said, computer architecture is all about trade-offs; neither choice is simply correct. The first one is called write-through.
So, write-through means: you put it in your cache, if you get a cache hit, and you also put it in memory. A write-back cache means you just put it in your local cache, and when that block gets evicted and has dirty data, you have to put it back in main memory. So, those are kind of the downsides and upsides there. One thing you do need to do if you have write-back, as we say here, is keep track of whether the data is dirty or not. If you go look at more complicated cache coherence protocols, sometimes you'll actually favor things like write-through, because write-back gets really complicated when you have data that has to be invalidated and sent out to other places. But if you knew the data was clean, i.e., it has no dirty data, you never have to do a write-back when an invalidation comes in, in a multiprocessor scenario. So, it's a tough trade-off there, but generally, write-back is considered to be more advanced and saves bandwidth. Still, there are trade-offs for why you might want to do one or the other, because write-through is definitely simpler to design. Okay. And, oh, I do want to point out that you may not actually go all the way out to main memory.
Plenty of architectures have small L1 caches, for instance, where you don't necessarily want to deal with invalidations coming in and complicated things happening. There, you might do write-through from your L1 to your L2, but then you want to save, let's say, off-chip bandwidth, so you write back out of your L2 to the DRAM. That is a pretty common sort of thing to do. And as I was saying, the real motivation for this is that if you have a multiprocessor system, you have, let's say, invalidation requests. So, let's say another processor wants that same cache line, and the data is dirty here, but not in the DRAM. You'd actually have to reach down into the L1 cache, if you have write-back out of the L1 cache, and that gets pretty complicated. It especially gets complicated if your L1 cache is very tightly integrated into your CPU, because it's in your main pipeline, and you have to somehow bubble the pipe, or take a structural hazard, for any invalidation that comes into that L1 cache. So, a lot of times, people will consider write-through caches, at least from the L1 to the L2.
327 00:24:44,092 --> 00:24:51,000 But you definitely want to save bandwidth from your L2, sort of out to DRAM, so 328 00:24:51,000 --> 00:24:58,415 you don't write through every time. Okay, let's take a look at what happens on 329 00:24:58,415 --> 00:25:02,802 a cache miss. So, on a cache miss, we're doing a store 330 00:25:02,802 --> 00:25:07,070 or write to memory. Do we need to put it in our cache at all? 331 00:25:11,031 --> 00:25:15,254 Choices: maybe yes, maybe no. There's no guarantee; you don't have to put 332 00:25:15,254 --> 00:25:18,458 it in your cache. A lot of times, and this comes down 333 00:25:18,458 --> 00:25:23,179 to heuristics here, you'll actually do a store, and you may just not 334 00:25:23,179 --> 00:25:26,617 read that data again. So some architectures actually have store 335 00:25:26,617 --> 00:25:31,129 instructions with locality hints, or temporal locality hints, saying, I'm doing 336 00:25:31,129 --> 00:25:35,540 this store, I'm doing this write right now, and I'm not going to read it again 337 00:25:35,540 --> 00:25:39,634 for a very long period of time. Don't put it in my cache. 338 00:25:39,634 --> 00:25:44,078 Also, some designs just decide not to do write allocate. Well, 339 00:25:44,078 --> 00:25:48,621 I'll define these two things here. No write allocate means you are doing the 340 00:25:48,621 --> 00:25:54,003 store, or write, and you're not going to pull the data into your cache 341 00:25:54,020 --> 00:25:56,064 to do the merging there, locally if you will. 342 00:25:56,064 --> 00:26:00,001 And you're not going to actually store any data into your cache. 343 00:26:00,001 --> 00:26:03,056 Cuz you don't have to. You're just keeping a local copy of it; 344 00:26:03,056 --> 00:26:07,021 you're not forced to keep a local copy. 345 00:26:07,021 --> 00:26:10,037 You can always send it out to your canonical main memory.
346 00:26:10,037 --> 00:26:14,063 Write allocate says, go fetch the data from memory, merge it in the cache, because 347 00:26:14,063 --> 00:26:17,681 usually the block size is longer than the amount of data you're 348 00:26:17,681 --> 00:26:20,934 writing. So, merge it, and then either 349 00:26:20,934 --> 00:26:25,092 put it out to main memory if you write through, or just keep it in the cache 350 00:26:25,092 --> 00:26:30,027 if you write back. So, you can think about having these two 351 00:26:30,027 --> 00:26:32,682 policies, and they both have their places. 352 00:26:32,682 --> 00:26:37,018 Typically, people try to write allocate, 353 00:26:37,018 --> 00:26:41,095 unless you have some locality information. Cuz data, let's say, at the top of 354 00:26:41,095 --> 00:26:46,002 your stack in your processor, or in your sort of software stack, 355 00:26:46,002 --> 00:26:50,043 you are going to basically be writing and reading from that relatively often. 356 00:26:50,043 --> 00:26:52,509 So, it makes sense to try to allocate that. 357 00:26:52,509 --> 00:26:57,328 The other thing I did want to point out is, there is a heuristic sort of going on 358 00:26:57,328 --> 00:27:02,271 with this no-write allocate that, if you read that block, then it will get into 359 00:27:02,271 --> 00:27:05,419 your cache. So, maybe the first store that you do to 360 00:27:05,419 --> 00:27:09,789 the address won't allocate, but the next load you do will allocate. 361 00:27:09,789 --> 00:27:13,031 So, you're not going to miss that much by doing no-write allocate. 362 00:27:13,031 --> 00:27:15,076 It's not like you've lost everything forever. 363 00:27:15,076 --> 00:27:19,062 It's just that that first write, oh well, had to go out to the memory 364 00:27:19,062 --> 00:27:22,017 system.
But the next load that I do from, let's 365 00:27:22,017 --> 00:27:25,048 say, my stack, or some local variable, will pull that line in anyway. 366 00:27:26,025 --> 00:27:29,538 Okay. So, some common combinations here. 367 00:27:29,538 --> 00:27:35,047 Write back with write allocate. So, this is sort of the full-blown 368 00:27:35,047 --> 00:27:40,029 solution that a lot of people do. This is kind of like the base high 369 00:27:40,029 --> 00:27:44,053 performance one. Write through and no-write allocate is 370 00:27:44,075 --> 00:27:48,070 also possible; this is, like, the cheap one to build. 371 00:27:48,092 --> 00:27:53,038 You don't need any logic here to deal with write allocate. 372 00:27:53,038 --> 00:27:56,038 But all possible mixes are valid.
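The two allocation policies above can be sketched as a store-miss handler (a toy model with my own names; it assumes a write-through cache so the store itself goes to memory either way):

```python
LINE_SIZE = 4  # words per cache line, for illustration

def store_miss(cache, memory, addr, value, write_allocate):
    """Handle a store that missed in the cache."""
    base = addr - addr % LINE_SIZE       # start of the containing line
    if write_allocate:
        # Fetch the whole line from memory and merge the stored word into
        # it, since the line is longer than the data being written.
        line = [memory[base + i] for i in range(LINE_SIZE)]
        line[addr - base] = value
        cache[base] = line
    # No-write allocate: the cache is left untouched.
    # Either way, the store goes out to memory (write through here).
    memory[addr] = value
```

Under no-write allocate the line still shows up in the cache later, when the next load to that line misses and fetches it, which is the heuristic mentioned above: you only lose that first write.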