So let's go on to our next cache optimization technique. We talked about multilevel caches; almost all processors today have multilevel caches, and this is good. It reduces our miss rate, and on a miss in the upper-level cache it effectively reduces the miss penalty, because we can go to a closer cache instead of having to go all the way out to main memory.

So what's our new technique here? The next technique I want to introduce is what we call a victim cache. And why do we call this thing a victim cache? Well, a victim is a cache line that has been evicted from your cache. Typically, if this victim, this evicted cache block, is dirty, it has to be written back out to main memory; if it's clean, you can just invalidate it and not have to write it back. But the insight of the victim cache is that we're going to keep around the most recent victims, the most recent cache lines that we've cast out of our cache, in a special little side structure that we're going to call a victim cache. And this side structure is going to be very, very small.

Now, why is this useful?
Well, one thing that's pretty common is that you have a piece of code which, for instance, repeatedly writes the same set in the cache. Let's say we have a two-way set-associative cache, so there are only two ways per index. And let's say we have a loop where we read two values from two arrays and write the result to a third array. So let's look at a piece of code something like the following: for i is 0, i less than some big number, i plus plus. We have three arrays which are all large; let's say we have an integer array a of one million elements, an integer array b, also one million, and an integer array c, also one million. One million is basically a thousand by a thousand, which is close to a power of two.

Inside the for loop we have a simple statement: c[i] = a[i] + b[i]. And that's our loop. Now, we have a two-way set-associative cache. Well, that sounds like a problem here.
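Written out as actual C, the loop described above might look like this (the array sizes are as given in the lecture; the function name `sum_arrays` is just for illustration):

```c
#include <stddef.h>

#define N 1000000  /* one million: 1000 * 1000, close to 2^20 */

/* Three large arrays; laid out back to back, a[i], b[i], and c[i]
   sit at addresses separated by a near-power-of-two stride, so they
   tend to map to the same cache set. */
int a[N], b[N], c[N];

void sum_arrays(void) {
    for (size_t i = 0; i < N; i++)
        c[i] = a[i] + b[i];
}
```

Each iteration touches all three arrays at the same index, which is what sets up the conflict discussed next.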
So if you go look at this, what we're actually going to do is a load of a[i]. This is going to land at some index in the cache, and if these three arrays a, b, and c are all naturally aligned, a[i], b[i], and c[i] are basically going to hit the same place. The load of a[i] is going to have a conflict with the load of b[i], which is going to have a conflict with the write of c[i]. So c[i], a[i], and b[i] all want to fight for those two ways in our cache at the same index, at the same time, every loop iteration.

Now, why is this bad? Well, we're iterating here based on integers, but our cache line is bigger than an integer. So we're going to pull in the cache line that has a[i] on it, pull in the cache line that has b[i] on it, and then we're going to evict one of those two, because we need to do a write to c[i], which maps to the same index in our cache. So we're going to fight over that: we have three things trying to fit into two locations.
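To see why all three arrays land at the same index, here is a sketch of the set-index calculation. The cache parameters (32 KiB capacity, 2-way, 64-byte lines, hence 256 sets) are illustrative assumptions, not numbers from the lecture:

```c
#include <stdint.h>

/* Hypothetical L1: 32 KiB, 2-way set associative, 64-byte lines
   -> 32768 / (2 * 64) = 256 sets. */
enum { LINE_BYTES = 64, NUM_SETS = 256 };

/* The set an address maps to: drop the offset bits, keep the index bits. */
unsigned cache_set(uintptr_t addr) {
    return (addr / LINE_BYTES) % NUM_SETS;
}
```

Any two addresses that differ by a multiple of NUM_SETS * LINE_BYTES (16 KiB here) map to the same set, which is exactly what happens when large aligned arrays are accessed at the same index.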
And to make matters worse, the next time we go around this loop we could actually have hit on those same a, b, and c cache lines: with, let's say, 4-byte integers and a 64-byte cache line, many consecutive iterations fall on the same line, so we could have exploited spatial locality and temporal locality here. But instead, we got nothing. We just lost, because we have three things trying to fit into two ways.

And even if you use something like least recently used, it's actually going to do the wrong thing here. Least recently used is going to continually kick out the least recently used line, but on the next access in the loop, you want exactly that least recently used line. So you could actually have done better here if you didn't use least recently used, but instead, let's say, kept a and b resident and always missed on c, or pinned one of these three into the cache, for a two-way set-associative cache.

Okay, so how do we go about solving this? Let's go back and take a look at the slide here, and we're going to see that what we can do is have a victim cache which is hooked up to our level-one data cache.
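The LRU pathology above can be made concrete with a tiny simulation. This is a sketch under simplifying assumptions: three lines (numbered 0, 1, 2 for a, b, c) fight over one 2-way set with true LRU, and the pattern a, b, c repeats each iteration:

```c
/* Count misses for the access pattern a,b,c repeated `iters` times in
   one 2-way set with true LRU replacement. Lines are named 0, 1, 2. */
unsigned lru_misses(unsigned iters) {
    int way[2] = {-1, -1};   /* tags held by the two ways; -1 = empty */
    int lru = 0;             /* index of the least recently used way  */
    unsigned misses = 0;
    for (unsigned it = 0; it < iters; it++) {
        for (int line = 0; line < 3; line++) {
            if (way[0] == line)      lru = 1;   /* hit: way 0 becomes MRU */
            else if (way[1] == line) lru = 0;   /* hit: way 1 becomes MRU */
            else {                   /* miss: replace the LRU way */
                misses++;
                way[lru] = line;
                lru = 1 - lru;
            }
        }
    }
    return misses;
}
```

After the first iteration warms up, every single access misses: LRU always evicts exactly the line the loop needs next, so the steady-state miss rate is 100%, versus one miss in three if a and b were simply pinned.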
This victim cache is effectively going to extend our associativity on a very limited subset of the lines, or rather a very limited subset of the indexes, in our cache. So if we take a look at this, let's say we have a small fully associative cache here for recently evicted lines, maybe four entries. In fact, the original paper showed that even one cache line's worth of victim cache helped performance. But this relatively small number of entries can extend the associativity of any index in the cache. So it's a fully associative cache added onto the miss path of our design.

And how does this go about working? Well, now when you access the level-one cache and miss, before you pull the line in from your level-two cache, you first go and check the victim cache. Now, how do things get into the victim cache in the first place? Well, when you go to evict something from your level-one cache, instead of just throwing that data away, you transfer it over into the victim cache. So it's a little cache of the most recently evicted lines, or most recently invalidated lines, from our level-one data cache. So it gives us some extra associativity.
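The fill-and-evict flow just described can be sketched as a toy model. Everything here is an illustrative assumption: a direct-mapped L1 of 8 sets, a 4-entry fully associative victim cache with FIFO replacement, tags only and no data, just enough to show the lookup path:

```c
#include <stdbool.h>
#include <stdint.h>

enum { L1_SETS = 8, VC_ENTRIES = 4 };

typedef struct {
    uintptr_t l1_tag[L1_SETS];
    bool      l1_valid[L1_SETS];
    uintptr_t vc_tag[VC_ENTRIES];
    bool      vc_valid[VC_ENTRIES];
    unsigned  vc_next;            /* FIFO replacement pointer */
} Cache;

/* Returns true on a hit in either the L1 or the victim cache. */
bool access_line(Cache *c, uintptr_t line_addr) {
    unsigned set = line_addr % L1_SETS;
    if (c->l1_valid[set] && c->l1_tag[set] == line_addr)
        return true;                      /* L1 hit */
    for (unsigned i = 0; i < VC_ENTRIES; i++) {
        if (c->vc_valid[i] && c->vc_tag[i] == line_addr) {
            /* Victim-cache hit: swap the line back into the L1,
               and the displaced L1 line takes its place. */
            uintptr_t displaced = c->l1_tag[set];
            bool had_line = c->l1_valid[set];
            c->l1_tag[set] = line_addr;
            c->l1_valid[set] = true;
            c->vc_tag[i] = displaced;
            c->vc_valid[i] = had_line;
            return true;
        }
    }
    /* Miss everywhere: fill from L2, and the displaced L1 line
       becomes the newest victim instead of being thrown away. */
    if (c->l1_valid[set]) {
        c->vc_tag[c->vc_next] = c->l1_tag[set];
        c->vc_valid[c->vc_next] = true;
        c->vc_next = (c->vc_next + 1) % VC_ENTRIES;
    }
    c->l1_tag[set] = line_addr;
    c->l1_valid[set] = true;
    return false;
}
```

Two lines that conflict in the same L1 set can now ping-pong between the L1 and the victim cache and keep hitting, which is exactly the behavior the loop example needs.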
Now, you can check this either in parallel or in series with your level-one cache. And there are a couple of design questions that come up here on a level-one miss. Let's say it misses in the victim cache too. What happens? You're going to bring in a new line from the level-two cache, take what was in the level-one cache, and move it into the victim cache, so you can check it on future accesses that miss in the level-one cache. But the question is, what happens when a line leaves the victim cache? Well, just like a normal cache, if it's dirty data, you probably want to go write that back to the L2 cache.

Alternatively, you could make the victim cache effectively write-through. And in fact, victim cache designs typically try to make the victim cache write-through, because then you don't have to worry about where something has to go when it's removed from the victim cache. You don't have to go do a write-back to the farther levels of the cache; instead you can just basically throw it away. But that's a design decision: you can decide whether you want the victim cache to always be clean or to hold dirty lines.
Thankfully, if you're doing the evict anyway, you might have enough extra bandwidth to your L2 here to just do that write into the L2 and leave the victim cache completely clean. So to recap, the basic idea here is that you add a couple of extra ways, if you will, a little bit of extra associativity for a very limited set of the indexes in your cache, by adding this victim cache. The downside is that you have to go check this victim cache: if you do this serially, it might add to your latency out to the L2, or you could keep checking it in parallel, which would potentially increase power, or the energy used, but not your performance.

Okay, let's pull up the scoreboard and see how adding in a victim cache helps. First, we're going to see that the miss penalty of our level-one cache is going to decrease, because we've effectively added an extra structure that's closer than, let's say, the level two. So the miss penalty is going to be lower. But the other advantage of a victim cache is that the miss rate of the aggregate, the level-one cache plus the victim cache together, is now going to be lower. So it's going to be better from a miss-rate perspective.