Now we get into the meat of today's lecture. We're going to talk about all these different advanced cache optimizations, starting with pipelining the cache.

Okay. So here is our sketch of what happens in the cache. In the cache, we have a tag array and we have the data blocks here. And we want to do a write into the cache. So forget about the read for a second; instead, let's focus on the write. In the write, you're going to do a tag check. You index into your tag array, it spits out the tag, you check the valid bit also, and you do a comparison here. If it matches, that means the data is already in the cache, and since you're doing a write, you can just write the data into the cache. So that sounds good.

One challenge here is that this really is sequential. You absolutely need to do the tag check first, before you do the write. Because if a different address is stored in this location and you blindly do the write, you are going to overwrite the wrong data. And you're going to be very unhappy, because you overwrote data that didn't belong to this particular write; it was at the wrong address.
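To make that ordering concrete, here is a minimal C sketch of the tag-check-then-write sequence for a direct-mapped cache. All the names and sizes (cache_line_t, NUM_SETS, BLOCK_SIZE) are made up for illustration; the real design is combinational hardware, not software.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_SETS   256
#define BLOCK_SIZE 64

typedef struct {
    bool     valid;
    uint32_t tag;
    uint8_t  data[BLOCK_SIZE];
} cache_line_t;

static cache_line_t cache[NUM_SETS];

/* Returns true on a write hit. Note the strictly sequential order:
 * the tag must be checked BEFORE the data array is written, or a
 * miss would silently overwrite some other address's block. */
bool cache_write(uint32_t addr, uint8_t byte)
{
    uint32_t index  = (addr / BLOCK_SIZE) % NUM_SETS;
    uint32_t tag    = addr / (BLOCK_SIZE * NUM_SETS);
    uint32_t offset = addr % BLOCK_SIZE;

    cache_line_t *line = &cache[index];
    if (!line->valid || line->tag != tag)
        return false;               /* write miss: do NOT touch the data */

    line->data[offset] = byte;      /* safe only after the tag check */
    return true;
}
```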
So that's definitely a bummer. Do we think we want to do this in one cycle? Anyone want to wager a guess here? Do people build machines where they do this in one cycle?

Well, sure. Original machines did do this in one cycle. It's possible to build: you have one long combinational path that comes from the index, through the tag array, through all this comparison logic, then through the write enable, and then into the data array. It's buildable, but it's not great for your clock period.

So, the first optimization: how do we reduce the hit time for a write? There are a couple of different strategies for this. We can think of it as doing the write over two different cycles: the first cycle checks the tag, we'll say, and the second cycle does something to the data. That's our problem statement. What are our solutions?

One solution, which is kind of innovative, is to build a multi-ported cache: a cache which can simultaneously do a read and a write to the same address. So you have a cache and you're doing a write. You blindly do the write without checking the tag, but at the same time, you read out the data that was there and you save it off. And if you took a tag miss on the write, you go back and actually fix it up by filling the old data back in. That is an option; people have built such things. You just restore the old value on a tag miss. It works perfectly fine, but you need a side buffer. This kind of looks like a victim cache, which we'll talk about later in today's lecture. You can do a write and a read at the same time, and it's like you're putting the new data in speculatively while pulling out the old data, the victim. If the write hits in the tag check, then you can just let it go and everything is great. If not, you have to basically pull that old data back out and undo the write. It's doable. Not a lot of people do this, but people have built machines like that.
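As a rough illustration of that write-blindly-and-restore idea, here is a sketch continuing the C model above. The victim_t side buffer and the two function names are hypothetical; a real design would do the write, the victim read, and the tag check in overlapping hardware cycles.

```c
/* Side buffer holding the victim of a speculative write, so it can
 * be restored if the tag check turns out to be a miss. */
typedef struct {
    bool     pending;   /* a speculative write is outstanding */
    uint32_t index, offset;
    uint8_t  old_byte;  /* old data saved for a possible undo */
} victim_t;

static victim_t side_buffer;

/* Cycle 1: write the new data and read out the old value at once,
 * as a dual-ported data array would. No tag check yet. */
void speculative_write(uint32_t addr, uint8_t byte)
{
    uint32_t index  = (addr / BLOCK_SIZE) % NUM_SETS;
    uint32_t offset = addr % BLOCK_SIZE;

    side_buffer = (victim_t){ true, index, offset,
                              cache[index].data[offset] };
    cache[index].data[offset] = byte;
}

/* Later: the tag check resolves. On a miss, restore the victim. */
void resolve_tag_check(bool hit)
{
    if (!hit && side_buffer.pending)
        cache[side_buffer.index].data[side_buffer.offset] =
            side_buffer.old_byte;
    side_buffer.pending = false;
}
```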
Another solution is to build a fully associative cache. Now, what do I mean by fully associative cache? Well, I mean a cache which has lines such that you can store to any location in the cache. In those structures, the tag check and the data access are typically done in parallel. Because you don't know where to find the data, there is no index operation; the index operation is the tag check, or the CAM operation, the content addressable memory operation. You're basically doing the tag check in order to do the access anyway, so it doesn't really hurt you to do the tag check first and then do the write. It's not necessarily a great design if you have a very large cache, but for a small cache you can definitely have a content addressable memory for the tags, and then just have it so you can write anywhere, since the cache is fully associative. That is one way to solve this write problem.

And then there's what we're going to focus on in today's lecture: pipelining the write. To pipeline the write, instead of actually doing the write in the M stage, the memory stage, of your pipeline, we're going to do the tag check in the M stage and then hold the store data for some time in the future. Now, you're going to say, did we actually do the store then? Well, yes. We're going to call that committed state. So this is what we're going to focus on today: how to do a pipelined memory access to reduce our write hit time.

So here we have our five-stage pipeline. What we're going to do is, in the M stage for a write, we're going to check the tag, and we're not going to fire up the data array at all. Now, you might say, where do you put the data? Well, we put a nice little buffer here which basically holds the data to be stored. It stores the data that's going to be stored in the future; we'll call it a delayed write data buffer. So in parallel, you check the tag and you store the data here. And then, sometime in the future, you want to move this into the cache. But you need a convenient time to do that. One option is to wait for a dead cycle on the cache.

Okay, that sounds good. But how do you know you're going to get a dead cycle? I can't guarantee that. I can write a piece of code which does store after store after store in a really tight loop, so you'll never have a dead cycle. But a cool trick here is that a subsequent store only has to use the tag array. So if you have a store after a store, the first store will check the tag and not use the data array; we'll stash the data here. And then, when the second store comes down the pipe, it'll check the tag, but the data array is open, so you can do the first store's data write at that time. So there's a cool trick there. If you have a store after a store after a store, you can basically decouple the tag from the data for stores and use the port on the data array of the cache to do the store later.

We'll go through this with a little more detailed drawing here; this is just to reiterate what was going on. You do the store, it checks the tags, and it stores the address and the data. And at some point in the future, when there's an idle cycle or another store going on, it actually transitions this data into the data array.
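Here is a sketch of those delayed write buffer mechanics, continuing the same hypothetical C model. The point to notice is how a subsequent store's tag-check cycle is used to drain the previous store's data into the data array; the function name and the single-entry structure are illustrative, and hit/miss handling is elided.

```c
/* Single-entry delayed write buffer: a store checks the tag in the
 * M stage and parks its address and data here; the data array is
 * written later, when it would otherwise be idle. */
typedef struct {
    bool     valid;
    uint32_t index, offset;
    uint8_t  data;
} delayed_write_t;

static delayed_write_t dwb;

void store_m_stage(uint32_t addr, uint8_t byte)
{
    /* 1. This store only needs the tag array this cycle, so the
     *    data array is free: drain any previously buffered store. */
    if (dwb.valid)
        cache[dwb.index].data[dwb.offset] = dwb.data;

    /* 2. Tag check for the current store happens here (elided;
     *    assume it hits). */

    /* 3. Park the current store's data for a future cycle. */
    dwb = (delayed_write_t){ true,
                             (addr / BLOCK_SIZE) % NUM_SETS,
                             addr % BLOCK_SIZE, byte };
}
```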
That sounds good. Okay, so, pop quiz question here. What happens when you do a load and there's data in the delayed write data buffer?

You have to bypass it, yes; you need to go check it. So you need to have something which checks the delayed write address against a future load, and that's actually drawn here. And if you get a hit there, you need to have this data come around and come out. So we basically need to go check this buffer.

One thing I do want to point out: this is kind of the naive way to do this. More advanced processors will typically have a multi-entry version of these delayed write data buffers, because caches are usually pipelined inside, too, and it can take multiple cycles to actually access the data array, so you might not do the data write until the end of the pipe. On some of the processors I've worked on, these delayed write buffers are on the order of maybe two to four entries big. So you can generalize this into a bigger data structure, and when you do loads, you obviously have to do a content addressable match against all of those entries to see if the data is in the delayed write data buffer.
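And here is a sketch of that load-side bypass against a multi-entry delayed write buffer, continuing the model. The four-entry size echoes the "two to four entries" mentioned above; newest-entry priority and full tag matching are simplified away, so treat this as the shape of the check, not a complete design.

```c
#define DWB_ENTRIES 4
static delayed_write_t dwb_multi[DWB_ENTRIES];

/* Load path: every buffer entry is compared against the load
 * address in parallel, like a small CAM. A buffered store to this
 * address must win over the (stale) data array contents. */
bool load_byte(uint32_t addr, uint8_t *out)
{
    uint32_t index  = (addr / BLOCK_SIZE) % NUM_SETS;
    uint32_t offset = addr % BLOCK_SIZE;

    for (int i = 0; i < DWB_ENTRIES; i++)
        if (dwb_multi[i].valid &&
            dwb_multi[i].index == index &&
            dwb_multi[i].offset == offset) {
            *out = dwb_multi[i].data;   /* bypass: forward buffered data */
            return true;
        }

    *out = cache[index].data[offset];   /* normal path (tag check elided) */
    return true;
}
```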
Okay. So let's go see how well this does, pipelining our cache. We're going to use this framework throughout today's lecture: the ways an optimization makes life better. You can reduce the miss rate, so do something good for the miss rate. You can do something good for reducing the miss penalty. You can reduce the hit time; this would be, for instance, having a smaller cache, which reduces the hit time. Or you can increase the bandwidth to your cache. All of these things factor into performance, but not all of the techniques and optimizations we're going to talk about today touch all of them. In fact, most of them only touch one or two at a time.

Okay. So, pipelined caches: what do they do for us? Are they going to reduce the miss rate? I see people shaking their heads no. Yeah, we didn't make anything bigger or smaller here, so it's probably not going to touch the miss rate. Is it going to affect our miss penalty? This is, when we miss in the cache, how long it takes us to go to, let's say, main memory or the next level of cache. Hmm, no. It's not actually going anywhere different; effectively, this is implementing the exact same thing.

Hit time: does this affect the hit time? It's kind of hard to say whether the effect here is in a positive or a negative direction. If you compare it to having the tag access and the data access in the same cycle, pipelining is actually going to make the hit time a little bit better, because you're doing less in a cycle. So that would have us put a plus here. But if you compare it to, let's say, not having a cache, or just having a big RAM there, it actually makes the hit time worse, because we need to mux in extra stuff here: we need to do the associative check against our delayed write buffer data. So that hurts us a little bit there. Okay, so that might be a plus or a minus.

The bandwidth, though, is actually getting better with this cache. Before, we had to wait for this cache for, let's say, the whole time we go through the tag check, the data check, and the data write. But now we can actually have two stores happening at the same time: the tag check for one store and the data write for a different store. So this is really going to make the bandwidth better. And that's where we are with pipelined caches.
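To recap the scorecard for pipelined cache writes, as discussed above:

- Miss rate: unchanged (nothing got bigger or smaller)
- Miss penalty: unchanged (misses take the exact same path to the next level)
- Hit time: plus or minus (better than single-cycle tag-plus-data access; worse than a raw RAM, because of the extra bypass check)
- Bandwidth: better (the tag array and data array can service different stores in the same cycle)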