So let's go on to our next cache optimization technique. We talked about multilevel caches; almost all processors today have multilevel caches, and this is good. It reduces our miss rate, and on a miss in the upper-level cache it effectively reduces the miss penalty, because we can go to a closer cache instead of having to go all the way out to main memory.

So what's our new technique here? The next technique I want to introduce is what we call a victim cache. And why do we call this thing a victim cache? Well, a victim is a cache line that has been evicted from your cache. Typically, if this victim, this evicted cache block, is dirty, it has to be written back out to main memory; if it's clean, you can just invalidate it and not have to write it back. But the insight of the victim cache is that we're going to keep around the most recent victims, the most recent cache lines that we've cast out of our cache, in a special little side structure that we're going to call a victim cache. And this side structure is going to be very, very small.

Now, why is this useful?
Well, one thing that's pretty common is that you have a piece of code which, for instance, repeatedly writes the same set in the cache. Let's say we have a two-way set-associative cache, so there are only two ways per index. And let's say we have a loop where we read two values from two arrays and write the result to a third array. So let's look at a piece of code something like the following: for i is 0, i less than some big number, i plus plus. We have three arrays which are all large; let's say we have an integer array a of one million elements, an integer array b, also one million, and an integer array c, also one million. One million is basically a thousand by a thousand, which is close to a power of two.

Inside the for loop we have a simple statement: c[i] = a[i] + b[i]. And that's our loop. Now, we have a two-way set-associative cache. Well, that sounds like a problem here.
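Written out as actual C, the loop described above might look like this (the array sizes are as given in the lecture; the function name `sum_arrays` is just for illustration):

```c
#include <stddef.h>

#define N 1000000  /* one million: 1000 * 1000, close to 2^20 */

/* Three large arrays; laid out back to back, a[i], b[i], and c[i]
   sit at addresses separated by a near-power-of-two stride, so they
   tend to map to the same cache set. */
int a[N], b[N], c[N];

void sum_arrays(void) {
    for (size_t i = 0; i < N; i++)
        c[i] = a[i] + b[i];
}
```

Each iteration touches all three arrays at the same index, which is what sets up the conflict discussed next.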
So if you go look at this, what we're actually going to do is a load of a[i]. This is going to land at some index in the cache, and if these three arrays a, b, and c are all naturally aligned, a[i], b[i], and c[i] are basically going to hit the same place. The load of a[i] is going to have a conflict with the load of b[i], which is going to have a conflict with the write of c[i]. So c[i], a[i], and b[i] all want to fight for those two ways in our cache at the same index, at the same time, every loop iteration.

Now, why is this bad? Well, we're iterating here based on integers, but our cache line is bigger than an integer. So we're going to pull in the cache line that has a[i] on it, pull in the cache line that has b[i] on it, and then we're going to evict one of those two, because we need to do a write to c[i], which maps to the same index in our cache. So we're going to fight over that: we have three things trying to fit into two locations.
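To see why all three arrays land at the same index, here is a sketch of the set-index calculation. The cache parameters (32 KiB capacity, 2-way, 64-byte lines, hence 256 sets) are illustrative assumptions, not numbers from the lecture:

```c
#include <stdint.h>

/* Hypothetical L1: 32 KiB, 2-way set associative, 64-byte lines
   -> 32768 / (2 * 64) = 256 sets. */
enum { LINE_BYTES = 64, NUM_SETS = 256 };

/* The set an address maps to: drop the offset bits, keep the index bits. */
unsigned cache_set(uintptr_t addr) {
    return (addr / LINE_BYTES) % NUM_SETS;
}
```

Any two addresses that differ by a multiple of NUM_SETS * LINE_BYTES (16 KiB here) map to the same set, which is exactly what happens when large aligned arrays are accessed at the same index.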
And to make matters worse, the next time we go around this loop we could actually have hit on those same a, b, and c cache lines: with, let's say, 4-byte integers and a 64-byte cache line, many consecutive iterations fall on the same line, so we could have exploited spatial locality and temporal locality here. But instead, we got nothing. We just lost, because we have three things trying to fit into two ways.

And even if you use something like least recently used, it's actually going to do the wrong thing here. Least recently used is going to continually kick out the least recently used line, but on the next access in the loop, you want exactly that least recently used line. So you could actually have done better here if you didn't use least recently used, but instead, let's say, kept a and b resident and always missed on c, or pinned one of these three into the cache, for a two-way set-associative cache.

Okay, so how do we go about solving this? Let's go back and take a look at the slide here, and we're going to see that what we can do is have a victim cache which is hooked up to our level-one data cache.
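The LRU pathology above can be made concrete with a tiny simulation. This is a sketch under simplifying assumptions: three lines (numbered 0, 1, 2 for a, b, c) fight over one 2-way set with true LRU, and the pattern a, b, c repeats each iteration:

```c
/* Count misses for the access pattern a,b,c repeated `iters` times in
   one 2-way set with true LRU replacement. Lines are named 0, 1, 2. */
unsigned lru_misses(unsigned iters) {
    int way[2] = {-1, -1};   /* tags held by the two ways; -1 = empty */
    int lru = 0;             /* index of the least recently used way  */
    unsigned misses = 0;
    for (unsigned it = 0; it < iters; it++) {
        for (int line = 0; line < 3; line++) {
            if (way[0] == line)      lru = 1;   /* hit: way 0 becomes MRU */
            else if (way[1] == line) lru = 0;   /* hit: way 1 becomes MRU */
            else {                   /* miss: replace the LRU way */
                misses++;
                way[lru] = line;
                lru = 1 - lru;
            }
        }
    }
    return misses;
}
```

After the first iteration warms up, every single access misses: LRU always evicts exactly the line the loop needs next, so the steady-state miss rate is 100%, versus one miss in three if a and b were simply pinned.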
This victim cache is effectively going to extend our associativity on a very limited subset of the lines, or rather a very limited subset of the indexes, in our cache. So if we take a look at this, let's say we have a small fully associative cache here for recently evicted lines, maybe four entries. In fact, the original paper showed that even one cache line's worth of victim cache helped performance. But this relatively small number of entries can extend the associativity of any index in the cache. So it's a fully associative cache added onto the miss path of our design.

And how does this go about working? Well, now when you access the level-one cache and miss, before you pull the line in from your level-two cache, you first go and check the victim cache. Now, how do things get into the victim cache in the first place? Well, when you go to evict something from your level-one cache, instead of just throwing that data away, you transfer it over into the victim cache. So it's a little cache of the most recently evicted lines, or most recently invalidated lines, from our level-one data cache. So it gives us some extra associativity.
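The fill-and-evict flow just described can be sketched as a toy model. Everything here is an illustrative assumption: a direct-mapped L1 of 8 sets, a 4-entry fully associative victim cache with FIFO replacement, tags only and no data, just enough to show the lookup path:

```c
#include <stdbool.h>
#include <stdint.h>

enum { L1_SETS = 8, VC_ENTRIES = 4 };

typedef struct {
    uintptr_t l1_tag[L1_SETS];
    bool      l1_valid[L1_SETS];
    uintptr_t vc_tag[VC_ENTRIES];
    bool      vc_valid[VC_ENTRIES];
    unsigned  vc_next;            /* FIFO replacement pointer */
} Cache;

/* Returns true on a hit in either the L1 or the victim cache. */
bool access_line(Cache *c, uintptr_t line_addr) {
    unsigned set = line_addr % L1_SETS;
    if (c->l1_valid[set] && c->l1_tag[set] == line_addr)
        return true;                      /* L1 hit */
    for (unsigned i = 0; i < VC_ENTRIES; i++) {
        if (c->vc_valid[i] && c->vc_tag[i] == line_addr) {
            /* Victim-cache hit: swap the line back into the L1,
               and the displaced L1 line takes its place. */
            uintptr_t displaced = c->l1_tag[set];
            bool had_line = c->l1_valid[set];
            c->l1_tag[set] = line_addr;
            c->l1_valid[set] = true;
            c->vc_tag[i] = displaced;
            c->vc_valid[i] = had_line;
            return true;
        }
    }
    /* Miss everywhere: fill from L2, and the displaced L1 line
       becomes the newest victim instead of being thrown away. */
    if (c->l1_valid[set]) {
        c->vc_tag[c->vc_next] = c->l1_tag[set];
        c->vc_valid[c->vc_next] = true;
        c->vc_next = (c->vc_next + 1) % VC_ENTRIES;
    }
    c->l1_tag[set] = line_addr;
    c->l1_valid[set] = true;
    return false;
}
```

Two lines that conflict in the same L1 set can now ping-pong between the L1 and the victim cache and keep hitting, which is exactly the behavior the loop example needs.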
Now, you can check this either in parallel or in series with your level-one cache. And there are a couple of design questions that come up here on a level-one miss. Let's say it misses in the victim cache too. What happens? You're going to bring in a new line from the level-two cache, take what was in the level-one cache, and move it into the victim cache, so you can check it on future accesses that miss in the level-one cache. But the question is, what happens when a line leaves the victim cache? Well, just like a normal cache, if it's dirty data, you probably want to go write that back to the L2 cache.

Alternatively, you could make the victim cache effectively write-through. And in fact, victim cache designs typically try to make the victim cache write-through, because then you don't have to worry about where something has to go when it's removed from the victim cache. You don't have to go do a write-back to the farther levels of the cache; instead you can just basically throw it away. But that's a design decision: you can decide whether you want the victim cache to always be clean or to hold dirty lines.
Thankfully, if you're doing the evict anyway, you might have enough extra bandwidth to your L2 here to just do that write into the L2 and leave the victim cache completely clean. So to recap, the basic idea here is that you add a couple of extra ways, if you will, a little bit of extra associativity for a very limited set of the indexes in your cache, by adding this victim cache. The downside is that you have to go check this victim cache: if you do this serially, it might add to your latency out to the L2, or you could keep checking it in parallel, which would potentially increase power, or the energy used, but not your performance.

Okay, let's pull up the scoreboard and see how adding in a victim cache helps. First, we're going to see that the miss penalty of our level-one cache is going to decrease, because we've effectively added an extra structure that's closer than, let's say, the level two. So the miss penalty is going to be lower. But the other advantage of a victim cache is that the miss rate of the aggregate, the level-one cache plus the victim cache together, is now going to be lower. So it's going to be better from a miss-rate perspective.