So that is our first optimization technique. If there are no questions, we will move on to our second optimization technique. Your book lists ten optimization techniques. We are only going to cover, I think, seven of them between today and the next lecture, because I think some of them are not actually that important, but we are also going to cover some other ones which I think are important.

Okay, so the next optimization we are going to look at is how to deal with read misses in the cache when the line being replaced holds data that needs to get kicked out of the cache. So here we have our CPU, our L1 data cache, and our next level out here, which we will say is the L2 cache, or maybe main memory. There is something in this cache at a particular line, and it is dirty in the cache, so the dirty bit is set. It has state we cannot throw out. We do a read that aliases to that same location, and we need to evict that line. It will create a victim.

In a naive implementation, we would actually have to sit there and wait for all that data to go out to main memory before we go do the read, get the data, and fill it in. Okay, that is not very good. We have to wait for this evicted dirty line.

We will talk about that in a second. So the processor could be just stalled, waiting on writes. One thing you can do is have read misses go around the writes, sort of pass the writes going out to the unified L2. But one of the problems here is you do not have that many ports onto this unified L2; we will say you only have one port out here. So if you have the load pass the write, you do this little dance where, when the read data, or load data, comes back, you need to have a place to put it. So at that point you might need to wait for the victim to go out to main memory, or out to the next level of cache. So you cannot really get around this. You can say, oh, I will try to get my load out early and just sort of worry about it later, but that does not work if that load hits in your next level of cache: you need to access the next level of cache, and you are still waiting for this data to go out, because you need someplace to put the incoming line.
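To make that concrete, here is a minimal sketch of the naive handling just described, with made-up latencies; the cycle counts and function name are illustrative assumptions, not numbers from the lecture. The fetch is serialized behind the victim writeback because there is only one L2 port.

```python
# Naive miss handling: with one port to the L2, the fetch of the
# missing line must wait until the dirty victim has drained.
# Both latencies below are hypothetical, for illustration only.

WRITEBACK_CYCLES = 20   # assumed cost to write the victim out
FETCH_CYCLES = 20       # assumed cost to fetch the missing line

def naive_miss_penalty(victim_dirty: bool) -> int:
    """Writeback and fetch serialize over the single L2 port."""
    penalty = FETCH_CYCLES
    if victim_dirty:
        penalty += WRITEBACK_CYCLES   # stall until the victim is out
    return penalty

print(naive_miss_penalty(victim_dirty=True))   # 40 cycles
print(naive_miss_penalty(victim_dirty=False))  # 20 cycles
```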
So the solution to this is we put a little buffer between our L1 and our L2, or our L1 and main memory, and this will hold writes, or victims, that go out from the L1 to the L2. Now we have someplace to put the data. So, if we wanted to do this fast, we would do the load, we would miss in our L1 cache, and we would send that request out to the L2. And then, immediately, we would start evicting the line into this buffer. The reason we cannot evict it straight out is that the load is actually using that path right now, or it is using the L2 cache, we will say. But we have someplace to put the victim data, and when the load comes back, we can put it into the L1 data array.

So this brings up a whole bunch of problems, the biggest one being that at some point you actually have to transition the data from the write buffer into the L2 cache. You can do that whenever you have extra time; you could just have some circuits here that check for an idle L2 port. But then comes the question: what happens if you need to do this a second time? Let us say a second load has to create a victim and this buffer is full. What do you do? Well, you could just stall and wait. That is a pretty good option. What you are saying there is that the probability of two victims being generated in a very short period of time is low, and this is actually a scheme that people do use: they do not do anything special, the first victim that gets generated goes into the write buffer, and the second victim that gets generated just stalls the pipe. That is okay.

You get higher performance if you can have the subsequent read basically go past the write buffer here and start actually doing something in the memory system. And if you want to do this, then just like what we did in the previous example, you are going to have to check this write buffer to see if the data is there. And that introduces complexity, because now your data can be in the L1, in the write buffer, or further out. So there are just more places to check.
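Here is a behavioral sketch of that victim write buffer. Everything about it is an assumption made for the example: the single-entry capacity, the class and method names, and the dictionary standing in for the L2 are illustrative, not a real design from the lecture.

```python
# Single-entry victim (write) buffer between the L1 and the L2.
# Evictions park here so the read miss can use the L2 port at once;
# the buffer drains to the L2 in the background when the port is idle.

class VictimBuffer:
    def __init__(self):
        self.entry = None            # (line_addr, data) or None

    def evict(self, line_addr, data):
        """Called when an L1 miss displaces a dirty line."""
        if self.entry is not None:
            return False             # buffer full: the pipe must stall
        self.entry = (line_addr, data)
        return True

    def drain(self, l2):
        """Push the buffered victim into the L2 when its port is idle."""
        if self.entry is not None:
            line_addr, data = self.entry
            l2[line_addr] = data
            self.entry = None

    def lookup(self, line_addr):
        """Later reads must check here: the freshest copy of the victim
        may still be sitting in this buffer rather than in the L2."""
        if self.entry is not None and self.entry[0] == line_addr:
            return self.entry[1]
        return None
```

The `lookup` path is exactly the complexity mentioned above: once the buffer exists, a miss has one more place to search before going out to the memory system.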
Okay, so that is the first half of write buffers. The second half of the write buffer, the other reason we want to put in a write buffer, is if we have a write-through cache. So far we have been talking about write-back caches, which introduce victims. But if you recall, with a write-through cache, every single store that happens deposits its data into the L1 data cache, and it also gets written into the next level of cache, because it is writing through. So let us say you have a write-through from the L1 to the L2. One of the challenges with this is that you might not have enough bandwidth into the L2 cache to basically take in every single store that occurs.

So the solution to this is that you can put a write buffer here which will sort of buffer off some of this extra store bandwidth, and we will introduce the notion of a coalescing write buffer. This is an extra addition to a write buffer that will actually merge multiple stores to the same line. So let us say you have a store to address five and a store to address six with a write-through cache. You do not want to have to write two full cache lines out to the L2. Instead, what people do is have coalescing write buffers. So the write buffer here might have multiple entries, each of which holds a whole cache line. The first store will push its cache line out into here. The second store will try to push its cache line out, but the buffer will notice that it is for the same address it already has, so it will merge the two stores into one entry. What this does is decrease the bandwidth you need at the L2, because it is very common in code to write sequential addresses. So it is common, let us say you are adding two arrays into a destination array, that you will just be writing address after address after address. And you do not want to have to fire up the L2 for every single store you do in that array operation, if you have a write-through cache. So you can put a coalescing write buffer here to save bandwidth into your L2.
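A sketch of that coalescing behavior, under stated assumptions: the 64-byte line size, the four-entry capacity, and all of the names here are invented for illustration.

```python
# Coalescing write buffer for a write-through L1: stores to the same
# cache line merge into one entry, so a run of sequential stores costs
# one L2 write instead of one write per store.

LINE_BYTES = 64   # assumed line size

class CoalescingWriteBuffer:
    def __init__(self, num_entries=4):
        self.num_entries = num_entries
        self.entries = {}             # line_addr -> {offset: byte}

    def store(self, addr, value):
        line_addr = addr - (addr % LINE_BYTES)
        offset = addr % LINE_BYTES
        if line_addr in self.entries:             # same line: coalesce
            self.entries[line_addr][offset] = value
            return True
        if len(self.entries) < self.num_entries:  # allocate a new entry
            self.entries[line_addr] = {offset: value}
            return True
        return False                              # full: stall or drain

wb = CoalescingWriteBuffer()
wb.store(5, 0xAA)              # store to address five allocates an entry
wb.store(6, 0xBB)              # store to address six merges into it
assert len(wb.entries) == 1    # one L2 write instead of two
```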
This, then, is our second technique: having this write buffer. Okay, so what does the write buffer do? Does it decrease our miss rate? The cache is the same size, the associativity is the same, so it is probably not going to change our miss rate. Miss penalty? Okay, raise your hand: who thinks this affects the miss penalty? Some people are raising their hands. I think we should all be raising our hands, because that was really what we were trying to do with this whole write buffer: reduce the miss penalty. And the reason this reduces the miss penalty is that a read miss does not need to wait for the write of the victim data to occur; instead, that can just happen in the background.

This also does not affect our hit time, and it may actually help our bandwidth. The reason it does not affect our hit time is that our L1 cache will still work the same as it worked before: you can still do loads and stores against it, and if you hit, it is fine. So it only affects the miss side of things. And bandwidth, like I said: if you have a write-through cache, this might actually effectively give you more bandwidth if you have a coalescing write buffer.
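As a back-of-the-envelope check on that claim, here is the average memory access time with and without the buffer. All of the numbers are made up for illustration; only the structure of the calculation reflects the point above: hit time and miss rate are untouched, and only the miss penalty shrinks.

```python
# AMAT = hit_time + miss_rate * miss_penalty, with assumed values.

hit_time  = 1      # cycles; unchanged by the write buffer
miss_rate = 0.05   # unchanged: same size, same associativity

penalty_no_buffer   = 40   # fetch waits behind the victim writeback
penalty_with_buffer = 20   # fetch proceeds; writeback drains later

amat = lambda penalty: hit_time + miss_rate * penalty
print(amat(penalty_no_buffer))    # 3.0 cycles
print(amat(penalty_with_buffer))  # 2.0 cycles
```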