So the next memory optimization, or cache optimization, we are going to look at is adding multi-level caches. And you've probably seen this in pretty much any modern-day processor: there are level-one caches, level-two caches, maybe level-three caches, maybe even level-four caches. We'll look at why this might be a good idea and what the effect of this is on the different parameters we've been looking at throughout this lecture.

So what's the basic idea here? Well, the basic idea is you have a CPU, and instead of having just one cache, we say: let's add two caches. And why do we want to do this? Well, it comes from the insight that it is difficult to have a cache that is both very large and very fast. So how do you solve this problem?

Well, you can think about trying to fill this entire room with RAM. That would be a very large cache for something. And if you think about it, people actually do have notional caches like this. If you go look at something like an internet-scale data center, there will be basically a room full of mostly just RAM, and they'll use this to cache things like lookups. You might see something like memcached, which caches queries to databases.
Or rather, it is a key-value store that typically caches queries to databases, and people store it in large amounts of RAM. The problem with having a very large cache is that, by definition, if it's large and you don't want to violate the laws of physics, getting to the farthest extent of the cache means going very, very far. And if you run a wire very far, it is by definition going to be slow. You can't violate physics here: the fastest a signal can travel is the speed of light. So the farther you go, the more distance becomes a factor.

So our problem is that you can't have memory that is both large and fast. So what can we do? Well, we can add multiple levels of cache. If you have a certain-sized working set, you can try to store it, let's say, in a local cache which is both small and fast. And then you can have different levels of cache: say, a level-two cache with a larger capacity, but which is a little bit slower and a little bit farther away from your CPU. And this can mitigate the cost of having to go all the way out to DRAM.
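That trade-off can be put in numbers with the usual average-memory-access-time (AMAT) arithmetic. Here is a little sketch, not from the lecture itself; the latencies and miss rates are made-up illustrative figures, not measurements from any real chip.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time: every access pays the hit time, and
    the missing fraction additionally pays the next level's cost."""
    return hit_time + miss_rate * miss_penalty

DRAM = 100.0  # cycles out to main memory (made-up number)

# One level only: every L1 miss pays the full trip to DRAM.
one_level = amat(hit_time=1, miss_rate=0.125, miss_penalty=DRAM)

# Two levels: an L1 miss first tries the L2; only L2 misses reach DRAM.
l2_cost   = amat(hit_time=10, miss_rate=0.25, miss_penalty=DRAM)
two_level = amat(hit_time=1, miss_rate=0.125, miss_penalty=l2_cost)

print(one_level)  # 13.5 cycles
print(two_level)  # 5.375 cycles
```

Even though the L2 is slower than the L1 and misses a quarter of the time in this sketch, it intercepts most of the trips that would otherwise have gone all the way out to DRAM.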
Okay, so I wanted to introduce some nomenclature here, because this is an important thing you're going to see over and over again throughout caches, and it's important that we introduce it now, when we start to talk about multi-level caches. Just because you see, let's say, a low cache miss rate does not mean a cache is performing well, especially when you get into the multi-level cache domain. And why is that? Well, let's say we have this level-two cache here, and I tell you it has a very low miss rate. That sounds good, but the level-one cache is filtering accesses to it. So just because level two has a low miss rate doesn't necessarily mean that the level-two cache is performing well. Or vice versa: let's say the level-two cache has a very high miss rate. You might say that level-two cache is doing a really bad job. Well, to some extent, all of the easy accesses are being filtered out by the level-one cache, so all of a sudden the miss rate out of your level-two cache might look very different. So we need to come up with some way to discuss these different misses with correct nomenclature. I'm going to introduce three notions here. The first is something we're going to call the local cache miss rate.
This is just the number of misses you have in a cache divided by the number of accesses to that cache. It is local, per cache: if you look at, let's say, the level-two cache here, you have a certain number of accesses coming into the level-two cache and a certain number of misses coming out of it, all local to the level-two cache. So this gives you the actual miss rate of a particular cache.

Now, that's not the same thing as something like a global miss rate. The global cache miss rate takes the number of misses in the cache relative to the number of CPU memory accesses. And this might be a better metric for a multi-level cache, because you might want to ask, for our level-two cache here: what is the miss rate out of the last level of cache, or out of our L2 cache, relative to the number of actual accesses the CPU is doing? This helps you encapsulate the level one and level two together. So that might be a better metric, and you'll see your book sometimes use these two different metrics in different ways.

And then finally, something that's useful to think about is the number of misses per instruction. Now, why do we bring this up? It sounds pretty similar to something like a global miss rate.
Well, the difference is that the denominator changes. Instead of dividing by accesses to a particular cache, or by CPU memory accesses, we divide by the number of instructions. And why is this useful? Well, misses per instruction takes out of the equation how many loads and stores you have as a percentage of the instructions in your program. So if you have, let's say, a program which has very few loads and stores but a relatively high miss rate per access, you might say this is bad for performance if you only look at the cache miss rate, like the local cache miss rate. But in reality it may not be so bad for performance, because if you rarely do a load or a store in that particular program, maybe it doesn't actually affect your performance very much. So this last metric, misses per instruction, encapsulates that: it combines the percentage of instructions that are loads and stores with the number of cache misses in the program. This can be either local or global; usually it's considered to be sort of global-ish, but it can be either, and you could also use this metric for a particular cache.
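Here's a small sketch, not from the lecture, that pins down all three notions with arithmetic. The counts are made-up illustrative numbers, chosen with roughly the one-in-five load/store mix mentioned in a moment.

```python
# Made-up counts for a two-level hierarchy over one program run.
instructions = 10_000
cpu_accesses = 2_000   # loads + stores (~1 in 5 instructions)
l1_misses    = 100     # these become the L2's accesses
l2_misses    = 40      # these go all the way out to DRAM

# Local miss rate: misses over accesses *to that particular cache*.
l1_local  = l1_misses / cpu_accesses   # 0.05
l2_local  = l2_misses / l1_misses      # 0.40 -- looks terrible in isolation...

# Global miss rate: misses over *CPU* memory accesses.
l2_global = l2_misses / cpu_accesses   # 0.02 -- ...but is actually fine,
                                       # because the L1 filtered the easy hits

# Misses per 1,000 instructions: independent of the load/store mix.
l2_mpki = l2_misses * 1000 / instructions   # 4.0
```

The same L2 looks bad by its local miss rate (40%) and good by its global miss rate (2%), which is exactly the filtering effect described above.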
So we can say this cache misses once every 1,000 instructions. That would be pretty good, depending on the number of loads or stores per set of instructions. So let's say you have a load or a store once every, I don't know, five instructions; so maybe 20%, or a little bit less than 20%, of the instructions are loads or stores, which might be typical in a typical program for a typical processor. Then you can make some statement about how the cache is performing in aggregate. And you'll see that a lot of the numbers in the Patterson and Hennessy book actually use misses per 1,000 instructions, which is typically a more useful metric.

Okay, so let's take a look at how adding a level-two cache influences the design of a level-one cache. It's pretty common that when you add multiple levels to your cache hierarchy, it actually influences the lower levels of your cache hierarchy. So how does adding a level-two cache potentially influence the level-one design? Well, one interesting thing you can do here is that just having a level-two cache might allow you to think about having a smaller level-one cache.
So if you have a relatively close level-two cache, then for the same performance you can actually have a much smaller level-one cache and have the level-two cache take care of the rest. And this can actually help performance even more, because you can potentially move the level-one cache closer, now that it's smaller, or increase the speed of the level-one cache. But the miss rate from the level-one cache is going to go up, because you made it small.

So what about energy? This can actually be a really good energy win. How is this an energy win? You are able to make your level-one cache smaller because you now have a level-two cache. So the common case is now an access to this smaller level-one cache; you can be accessing it relatively frequently, and each access you do to it is itself going to take less energy. So something to think about here is that our level-two design is going to influence our lower-level designs; it's not in a vacuum. It's not like we can just slap on a level-two cache or a level-three cache and say the lower levels of the cache hierarchy don't change.
Another way that the level-two design, or the presence of a level two, can really influence level one is that you might be able to have a much simpler level-one cache design, because you have a level-two cache. So what does this look like? Well, let's say in your old design the level-one cache was a write-back cache, so it stored dirty data. And when a cache conflict occurred, it had to choose a victim, evict something out of the level-one cache, and wait for that eviction to occur, or at least find the bandwidth for that eviction to occur.

With something like a level-two cache that backs a level-one cache, you can actually move to a write-through design. Now, how is this possible? Why would you be able to do a write-through design when you weren't able to before? Well, if you only had one level of cache and you did write-through, every write you did would have to go out to main memory, and you largely don't have enough bandwidth to do that out to your main memory store, out to your DRAM. But because you now have a level-two cache, you can use that as a buffer, if you will, to absorb the write-through traffic, and that traffic doesn't have to go off-chip.
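The arrangement just described can be sketched as a toy model: a write-through L1 that never holds dirty data, forwarding every store on-chip to a write-back L2 that does the dirty-line bookkeeping. This is a deliberately minimal sketch, not how real hardware is structured; the class and method names are made up for illustration.

```python
class WriteBackL2:
    """Toy write-back L2: tracks dirty lines. Only this level must
    eventually write modified data back out toward DRAM."""
    def __init__(self):
        self.data = {}       # address -> value
        self.dirty = set()   # addresses modified since last write-back

    def store(self, addr, value):
        self.data[addr] = value
        self.dirty.add(addr)

class WriteThroughL1:
    """Toy write-through L1: holds no dirty data, because every store
    is forwarded to the L2. Dropping a line needs no write-back."""
    def __init__(self, l2):
        self.data = {}
        self.l2 = l2

    def store(self, addr, value):
        self.data[addr] = value
        self.l2.store(addr, value)   # absorbed on-chip by the L2, not DRAM

    def invalidate(self, addr):
        # No eviction traffic: the L2 already has the latest value.
        self.data.pop(addr, None)

l2 = WriteBackL2()
l1 = WriteThroughL1(l2)
l1.store(0x40, 7)
l1.invalidate(0x40)          # just blast the line away...
assert l2.data[0x40] == 7    # ...the up-to-date copy survives in the L2
```

Notice that `invalidate` is a one-line drop; in a write-back L1 it would instead have to check a dirty bit and push the line out first.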
So you could have a write-back L2 which tracks all the dirty data and makes sure the evictions actually happen on invalidations, for instance writing dirty data back out, while the level-one cache can just write through all data. Now, this requires you to have enough bandwidth between your level one and your level two, but that's typically a lot easier to come by, because your level-one cache and level-two cache are usually located near each other, both on-chip, in a modern-day processor.

Let's see, other reasons that this is good. Well, it really simplifies your level-one design. If you have a write-through cache in your level-one design, you don't have to worry about dirty victim evictions, the control becomes easier, and a lot of times it becomes easier to integrate into your pipeline, which is a good thing.

Something we haven't talked about yet, but will be talking about at the end of this course, is how something like this multi-level cache can actually simplify your cache coherence issues; that is, how caches deal with coherence.
And by having a smaller L1 which is write-through, backed by an L2, you can really simplify the cache coherence problem. So what is cache coherence? Well, we've talked about compulsory misses, capacity misses, and conflict misses. When we get to the end, not of this class, but of this course, we'll be talking about cache coherence, which is about multiple processors each having caches, and keeping the caches coherent with one another so that the data does not become stale between those different caches. One of the questions that comes up is whether having write-through and simplifying the L1 makes things easier from that perspective, and it does. Now, why is that? Well, something we're going to learn about is coherence misses: on a coherence miss, a different cache is basically going to reach across the bus or interconnect and tell this cache to invalidate something.
And if you only have one cache per processor, if you only have a level-one cache, we'll say, or something like just one cache level, then when another processor reaches across the bus, and this cache is tightly integrated into your pipeline, you now have to deal with these external requests coming in to do invalidations, or to do something like snoops, which we'll also be talking about in a few classes, when we get to the end of this course. But if you have a level one backed by a level two, you can have the level two absorb a lot of that complexity. So you can take complexity and push it from the level one into the level two. Now, you're still going to have to figure out how to do invalidations in the level one; but before, when you actually had to do an invalidation, you could potentially have had dirty data. You would have had to choose a victim and then evict that line, and the invalidation would naturally generate the eviction. But now, because the level one is, let's say, a simpler write-through level one, you don't have to generate the eviction; you just have to invalidate the line. It's a lot easier to simply blast away lines than it is to blast away lines and also figure out how to take that data and evict it.
It's also a lot less disruptive to your main processor pipeline, because you don't have to stall to do the eviction, and there's no blocking of further loads and stores coming from the main processor pipeline.

Last, if you have a write-through level-one cache, this can really simplify error recovery, and when I say error recovery I mean from a soft-error perspective. So you have your chip and it gets struck by radiation, and this is actually a relatively common occurrence: you have a chip, and an alpha particle, something coming out of the sky, some highly energetic piece of radiation, hits your chip. Well, that flips a bit. And how do we protect against this? Well, there are a couple of different solutions. We can use error-correcting codes (ECC), or we can use some simpler ideas like parity. And if you have something like a write-through cache, because you never have dirty data in the cache, you can get away with maybe just having parity and not full ECC or something like that. And if you were to detect a parity error now in this write-through cache, you just have to basically invalidate the line.
You don't have to declare the data corrupt in memory, because you know that the L2 has an up-to-date copy, the most up-to-date version of it. So the L2 might have more protection on it, but the L1 can basically just invalidate if it gets struck by an alpha particle. So, something to think about there: the presence of a level two can really influence our level-one designs.

Okay, how else does this occur? How else do the level-one and level-two designs commingle? Well, there's the question of what the inclusion policy is. Now that we have multiple levels of cache, we can think about the lower-down level, let's say the level one, having a certain piece of data, and the level two either having that same piece of data or not. And caches where anything in the level-one cache is also in the level-two cache, or anything in a lower-level cache is also in a higher-level cache, we're going to call inclusive caches.
So the inner cache holds copies of the data in the farther-out cache, and external snoop accesses only need to check the outer cache, as we were talking about before. That's what we call an inclusive cache. Contrast that with an exclusive cache. In an exclusive cache, the different layers of your cache, for instance your level-two cache, are not going to have, or may not have, the same data that is in your level-one cache. What does that mean? Well, it means that if you evict something out of your level-one cache, you probably want to put it back into the level-two cache in an exclusive design, because that's the idea behind caches: you want to keep a certain amount of working set in your cache. So if you evict something out of level one, that's probably still a relatively useful piece of data; it would probably still fit in your larger working set at level two. So you don't just want to throw it away; you want to keep it around. So in exclusive caches, there's typically a swap operation that occurs: you're actually going to swap lines between the lower-level and the higher-level cache when you move data.
So the higher level of cache, the cache farther away from the processor, is going to go access main memory and bring the line into itself. Then, when the level-one cache goes to bring it in, you're actually going to swap two lines, and this adds complexity to your hardware design: you have to add something which actually does the swap. It could potentially be bad for power, but people have actually built these. If you go look at the original AMD Athlon, it had a 64-kilobyte primary cache over a 256-kilobyte secondary cache, and they were exclusive. Now, you might say this sounds like a lot of headache; why would you ever build an exclusive cache? Well, it has one really big benefit: you get more storage area. Because you're not keeping two copies of the same piece of data, you can effectively store more data in your cache. So with this AMD Athlon, with a 64-kilobyte primary cache backed by a 256-kilobyte secondary cache, you have the sum of those two capacities to store data, and it's all unique data. In contrast, if we had the same cache hierarchy but with inclusive caches, you would effectively have only the 256 kilobytes of the larger cache.
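The capacity comparison is just addition, but it's worth making it concrete. A quick sketch using the Athlon-style sizes from the lecture:

```python
KB = 1024
l1_capacity = 64 * KB    # primary cache (lecture's Athlon example)
l2_capacity = 256 * KB   # secondary cache

# Exclusive: no line lives in both levels, so the capacities add.
exclusive_unique = l1_capacity + l2_capacity   # 320 KB of unique data

# Inclusive: every L1 line is duplicated in the L2, so the hierarchy
# holds no more unique data than the L2 alone.
inclusive_unique = l2_capacity                 # 256 KB of unique data

print(exclusive_unique // KB, inclusive_unique // KB)  # 320 256
```

With a small L2-to-L1 size ratio like this 4:1, exclusion buys you a full 25% more effective capacity; with a much larger L2, the duplicated L1 contents matter less and inclusion's simplicity tends to win.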
That is, you get only the larger, farther-away cache's amount of storage space, because the lower-level, primary cache only keeps copies of what is already in the farther-out cache in the inclusive design.

Let's take a look at a few examples of caches in modern-day systems and see what trade-offs people have made. So let's start off by taking a look at something like the Itanium 2 processor. For the Itanium 2, we have our chip, and the first thing you're going to notice in this die photograph is that the level-three cache is very large. So this has a very large level-three cache, but this is a big-iron processor. The Itanium 2 was an Intel chip, or is an Intel chip, or rather Intel and HP together, I guess, because they were collaborating on this project at the time. It has a big level-three cache, and it's kind of a funny shape. The reason it's funny shaped is that they just took all the extra space they had on their die and filled it basically with cache. And this was fine, because the cache was so far away from the processor that it didn't have to be a regular shape. But let's take a look at the different levels here. So, we have a level-one cache.
It's small: 16 kilobytes, 4-way set associative, with a 64-byte line size. It's heavily ported; it can do a lot of loads and stores at the same time, two loads and two stores concurrently, and it's fast, with single-cycle access. Now, 16 kilobytes is relatively small by today's standards. In 2002 that was probably a little on the large side, and this was a large processor. But by 2013, that's not that big for a level-one cache.

If we step up here, we can see that, typically, as you step up, you increase your size; otherwise, what would be the point? But the latency also gets worse. So we've seen single-cycle latency for level one; we have five cycles of latency for level two, and we're a lot bigger, at 256 kilobytes of cache. The associativity stays the same: it's still 4-way set associative. And as we get farther away, we see we have a 3-megabyte cache with twelve cycles of latency. So it's this big cache around the outside of the chip, but we actually increase our associativity: it's a 12-way set-associative cache. And the line size in both the level-two and level-three caches is larger.
Now it's 128 bytes instead of 64 bytes, and that's pretty common. You'll see that as you get farther away, people put in larger line sizes and move larger chunks of memory at a time. And to some extent this is because you have more capacity in these caches, so it doesn't hurt you as much. Recall our earlier lecture where we plotted the line size, the block size, versus performance: if you recall, a large block hurts you more in a smaller cache, because there just aren't that many lines you can fit into something like, let's say, a 4-kilobyte cache. But once you get to something like a 3-megabyte cache, you can be a little bit wasteful with storage if it helps you with your potential spatial locality.

One point I wanted to make here is a rule of thumb that people typically use for cache design, and this is just an empirical rule of thumb. The empirical rule of thumb is that, usually, when you go from one level of cache to the next level of cache, you want to step up in size by a minimum of a factor of eight. Now, where does that come from? Well, it's totally empirical, but it's based on how much extra latency you have to add when you add another level of cache hierarchy. So if you add another level of cache hierarchy, it has some extra latency.
366 00:23:28,156 --> 00:23:34,219 And while it's going to decrease your cache miss rate, 367 00:23:34,219 --> 00:23:39,900 it also is going to make your time out to main memory, 368 00:23:39,900 --> 00:23:41,656 let's say, a little bit worse. 369 00:23:41,656 --> 00:23:45,436 So there's a trade-off there of: does it make sense to add the extra cache layer and 370 00:23:45,436 --> 00:23:47,056 go check the extra cache layer, or 371 00:23:47,056 --> 00:23:49,650 is it better to go out to main memory at that point? 372 00:23:49,650 --> 00:23:51,698 And the question is how much benefit, 373 00:23:51,698 --> 00:23:55,421 how much larger does that cache have to be to have any useful benefit? 374 00:23:55,421 --> 00:23:59,397 And empirically, when people have built these processors, 375 00:23:59,397 --> 00:24:03,042 usually you have to step up by something like a factor of eight. 376 00:24:03,042 --> 00:24:07,937 If you step up by only a factor of four, the benefit that you get from the cache due to 377 00:24:07,937 --> 00:24:11,710 the extra size does not outweigh the extra complexity, cost, and 378 00:24:11,710 --> 00:24:16,663 time that is added by the extra cache, but that's just an empirical rule of thumb. 379 00:24:16,663 --> 00:24:18,885 But if you go analyze enough processors, 380 00:24:18,885 --> 00:24:23,084 you'll see that basically every cache level steps up by a minimum of about a factor of eight. 381 00:24:23,084 --> 00:24:29,121 So we'll call this the 8X, the 8 times rule, if you will. 382 00:24:29,121 --> 00:24:32,983 This cache here steps up by a factor of 16. 383 00:24:32,983 --> 00:24:35,139 So it's a little bit more than a factor of eight, but 384 00:24:35,139 --> 00:24:37,454 let's look at another, a bit more modern processor here. 385 00:24:37,454 --> 00:24:39,924 So, this is the IBM Power 7. 386 00:24:39,924 --> 00:24:43,220 So the IBM Power 7, this is an eight-core machine.
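The trade-off being described is the standard average memory access time (AMAT) recurrence: each level contributes its hit time, plus its local miss rate times the cost of the next level. A minimal sketch, with illustrative latencies and miss rates that are assumptions rather than numbers from the lecture:

```python
def amat(levels, mem_latency):
    """Average memory access time for a list of
    (hit_latency, local_miss_rate) cache levels backed by memory."""
    penalty = mem_latency
    # fold from the last level back toward the L1
    for hit, local_miss in reversed(levels):
        penalty = hit + local_miss * penalty
    return penalty

# Assumed numbers: L1 is 1 cycle with a 10% miss rate; memory is 100 cycles.
no_l2 = amat([(1, 0.10)], 100)
# Add an L2: 5-cycle hit, 50% local miss rate.
with_l2 = amat([(1, 0.10), (5, 0.50)], 100)
print(no_l2, with_l2)   # 11.0 vs 6.5 cycles: the extra level pays off
```

With these assumed numbers the extra level wins; if the L2 were barely larger than the L1, its local miss rate would sit near 100% and its hit latency would be pure overhead, which is the intuition behind the 8X rule.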
387 00:24:45,220 --> 00:24:49,649 There are private level one caches per core, and 388 00:24:49,649 --> 00:24:56,355 then there is a big level three cache sitting in the middle of this processor. 389 00:24:56,355 --> 00:24:57,293 Well, what does this look like? 390 00:24:57,293 --> 00:25:00,671 Well, we have relatively small level one caches. 391 00:25:00,671 --> 00:25:03,064 We have a 32-kilobyte L1 I cache and 392 00:25:03,064 --> 00:25:08,289 a 32-kilobyte L1 D cache, and the latency is higher than in 393 00:25:08,289 --> 00:25:14,243 our previous example, it's a three cycle cache latency. 394 00:25:14,243 --> 00:25:18,860 And as we go farther out, we can see we actually have an 8X step up in 395 00:25:18,860 --> 00:25:23,075 cache size here, at least from a data cache size perspective. 396 00:25:23,075 --> 00:25:28,205 So, we go from 32 to 256 kilobytes, and the latency is worse. 397 00:25:28,205 --> 00:25:30,925 So, there's eight cycles' worth of latency to check on level two. 398 00:25:30,925 --> 00:25:35,039 And then finally here, we have eight cores sharing this cache. 399 00:25:35,039 --> 00:25:38,755 So, it's a lot of cores sharing this cache. 400 00:25:38,755 --> 00:25:41,998 We have a 32 megabyte unified level 3 cache that's 401 00:25:41,998 --> 00:25:45,251 actually built out of embedded DRAM in this processor. 402 00:25:45,251 --> 00:25:49,445 So, IBM has some pretty impressive technology here where they can actually 403 00:25:49,445 --> 00:25:51,160 embed DRAM on a logic process. 404 00:25:51,160 --> 00:25:53,955 This is not common in most other processes. 405 00:25:53,955 --> 00:26:00,077 This is typically only an IBM trait, an IBM process technology sort of trait. 406 00:26:00,077 --> 00:26:03,713 But some other people do also have embedded DRAM now, but IBM 407 00:26:03,713 --> 00:26:08,504 is really the forefront leader there, but the latency here is quite a bit higher. 408 00:26:08,504 --> 00:26:16,035 So it's a 25-cycle latency to the Power 7's level three cache.
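We can sanity-check the 8X rule of thumb against the Power 7 data-side sizes quoted above; a quick sketch:

```python
# Power 7 data-side cache sizes from the lecture, in kilobytes.
sizes_kb = [32, 256, 32 * 1024]   # L1 D, L2, shared L3

# Step-up factor between each adjacent pair of levels.
for lower, upper in zip(sizes_kb, sizes_kb[1:]):
    print(upper // lower)   # 8 (L1D -> L2), then 128 (L2 -> L3)
```

Both steps meet the minimum-8X rule; note the L3 is shared by eight cores, so its effective per-core share is smaller than the raw 128X step suggests.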
409 00:26:16,035 --> 00:26:17,283 Let's pull up the scoreboard. 410 00:26:17,283 --> 00:26:21,485 So we've been building a scoreboard here throughout this lecture and seeing how adding 411 00:26:21,485 --> 00:26:26,329 these different techniques, these advanced cache techniques, helped with performance. 412 00:26:26,329 --> 00:26:29,123 So, what is our scoreboard going to look like here? 413 00:26:29,123 --> 00:26:34,401 Well, we have multilevel cache, and our first question 414 00:26:34,401 --> 00:26:39,001 is what happens from the level one perspective? 415 00:26:39,001 --> 00:26:42,243 Well, the miss penalty. 416 00:26:42,243 --> 00:26:46,169 What does adding a multilevel cache do to the miss penalty? 417 00:26:46,169 --> 00:26:53,049 Well, when we take a miss, you have to go out to the next level. 418 00:26:53,049 --> 00:26:56,878 And before, we had to go out to the main memory, which could be far, far away. 419 00:26:56,878 --> 00:26:59,875 So, the miss penalty was quite high. 420 00:26:59,875 --> 00:27:04,152 But if we have a multilevel cache and, let's say, we have a level two cache 421 00:27:04,152 --> 00:27:05,593 just a few cycles away, 422 00:27:05,593 --> 00:27:07,747 maybe five cycles away, 423 00:27:07,747 --> 00:27:12,539 your miss penalty now goes down as seen from the level one cache when we add 424 00:27:12,539 --> 00:27:14,201 this multilevel cache. 425 00:27:16,665 --> 00:27:18,032 So that's the level one, 426 00:27:18,032 --> 00:27:21,337 what about if we draw a box around all of the levels of our cache? 427 00:27:21,337 --> 00:27:25,198 So let's say, level one, level two, level three, what happens? 428 00:27:25,198 --> 00:27:29,833 Well, in aggregate, our miss penalty for 429 00:27:29,833 --> 00:27:36,291 each particular level is going to go down, so that's a plus.
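The L1 miss-penalty improvement can be made concrete: without an L2, every L1 miss pays the full trip to memory; with an L2, the expected penalty is the L2 hit time plus the L2's local miss rate times the memory latency. A sketch with assumed numbers (the 5-cycle L2 is from the lecture, the rest are illustrative):

```python
def l1_miss_penalty(l2_hit, l2_local_miss, mem_latency):
    # Expected penalty seen from L1: check the L2 first,
    # and go all the way to memory only on an L2 miss.
    return l2_hit + l2_local_miss * mem_latency

mem = 100   # assumed cycles to main memory
print(mem)                               # no L2: every L1 miss costs 100 cycles
print(l1_miss_penalty(5, 0.5, mem))      # 5-cycle L2, 50% local miss: 55.0 cycles
```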
430 00:27:36,291 --> 00:27:41,929 As you look farther away, going out to main memory, what you're 431 00:27:41,929 --> 00:27:48,072 actually going to see is the miss rate is also going to go down in aggregate, 432 00:27:48,072 --> 00:27:54,339 because you're not going to have to miss out of your last level cache as often. 433 00:27:54,339 --> 00:27:56,244 So, the miss rate here goes down. 434 00:27:56,244 --> 00:27:59,393 So you're not going to have to go out to main memory as much, 435 00:27:59,393 --> 00:28:03,078 because now you have a larger cache, though not as big as your main memory or 436 00:28:03,078 --> 00:28:05,972 your DRAM, sitting in front of there. 437 00:28:05,972 --> 00:28:08,131 So, the miss rate goes down. 438 00:28:08,131 --> 00:28:10,659 So, you have a lower overall miss rate.
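This aggregate miss rate is usually called the global miss rate: the fraction of all accesses that miss in every level and actually reach DRAM, which is the product of the per-level local miss rates. A minimal sketch, with assumed local miss rates:

```python
def global_miss_rate(local_rates):
    # Fraction of accesses that miss at every level and go to memory.
    out = 1.0
    for rate in local_rates:
        out *= rate
    return out

# Assumed local miss rates: 10% in L1, 50% in L2.
print(global_miss_rate([0.10, 0.50]))   # 0.05 -> only 5% of accesses reach DRAM
```

This is why drawing a box around the whole hierarchy shows a lower miss rate than any single level alone.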