Okay, so let's start talking about performance, because the whole reason you build a cache is to have lower power and higher performance. Let's go back to the iron law. What is the cache trying to do? Well, if you look at the iron law of processor performance, when you do a load or a store, the cache is trying to decrease the clocks per instruction it takes to process that load or store. If you have to go all the way out to main memory, on a cache miss we'll say, or if you don't have a cache, it's going to take a long time. But if you shrink the cost per instruction for a load or a store, everything gets faster, so you can actually go faster here.

So as shown here, we have some loop that does some loads and some adds, and here it takes a cache miss. If we can somehow do things to the cache which reduce the probability of a cache miss, we can shrink the amount of time it takes to do that, and the whole program will run faster. So reducing our number of cache misses is good, using our cache to actually keep the data local. And as you can see in this diagram, this first load here is in the cache, so it doesn't have to go out to main memory. It just hits, and if it's a properly pipelined cache, it returns the data on that load. Two things to point out in this diagram: processor to cache and coming back, that's a hit in the cache; a miss takes longer, it just takes more cycles.
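To make that iron-law point concrete, here is a minimal sketch of how memory stalls show up in the cycles-per-instruction term. The numbers (30% of instructions are memory references, a 100-cycle miss penalty, a base CPI of 1) are assumed for illustration only, not from the lecture.

```python
# Minimal sketch (illustrative numbers): how cache misses inflate the CPI term
# of the iron law:
#   time/program = (instructions/program) x (cycles/instruction) x (seconds/cycle)

def effective_cpi(base_cpi, mem_refs_per_instr, miss_rate, miss_penalty_cycles):
    """CPI once the average memory-stall cycles per instruction are added in."""
    stall_cycles_per_instr = mem_refs_per_instr * miss_rate * miss_penalty_cycles
    return base_cpi + stall_cycles_per_instr

# Assumed example: 30% of instructions are loads/stores, 100-cycle miss penalty.
print(effective_cpi(1.0, 0.3, 0.10, 100))  # 10% miss rate -> CPI = 4.0
print(effective_cpi(1.0, 0.3, 0.02, 100))  # 2% miss rate  -> CPI = 1.6
```

Cutting the miss rate from 10% to 2% drops the effective CPI from 4.0 to 1.6 in this made-up example, which is why the rest of this discussion is about reducing misses and miss cost.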
Okay, so this is an important slide because it introduces some important ways to think about caches. We want to categorize the types of misses we take in the cache. This is about figuring out why we would have these different heuristics inside of caches, based on the different policies and the different ways that you'd actually take a cache miss. A lot of times people call this the three Cs of caches.

Okay, the first C: a compulsory cache miss. What that means is the first reference to a block, and you're going to take that cache miss even if you have an infinite-size cache. You just can't get things into the cache unless you go to try to access them the first time ever. So that's what a compulsory cache miss is: that first reference. You can try real hard, though. You can have a prefetcher, actually, to try to reduce the number of compulsory cache misses. You could think really hard and say, oh, I think I'm going to access this data sometime in the future, I should go get it. And then when you go to actually access it, you won't take a cache miss. So that's a possibility.

Okay, the second C that contributes to the performance of caches is the capacity of the cache. Traditionally, a larger cache will be able to fit more data, well, that's always true. But traditionally, with a larger cache you will have a lower miss rate. So let's think about that; this is an important question here: will a larger cache always have a lower cache miss rate than any smaller cache? Well, let's think about what kicks data out. If you have a cache that is, let's say, eight lines and a cache that is sixteen lines, by definition the addresses that alias in the bigger cache are going to alias in the smaller cache too, if you go from the sixteen-entry cache down to the eight-entry cache. So if you're going up by factors of two, at least, your cache miss rate is always going to be better with the larger cache; or rather, the cache miss rate is going to be lower with the larger cache, as in the sketch below. The one place this could come apart is if you change the hashing function. So if you have a cache that is, let's say, two thirds the size: say one cache is eight entries and the other is twelve entries. There, the [inaudible] locations are going to be different if you actually think about having different patterns there. But it's likely that the larger cache will still be better there. So large is good. That's assuming a direct-mapped cache, more or less. Depending on your LRU strategy, or your replacement policy, there are some other caveats there; you have to think a little bit harder about those.
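Here is a small sketch of that power-of-two argument, using a toy direct-mapped cache model. The trace, cache sizes, and one-word blocks are made-up assumptions; the point is just that any pair of addresses that collides in the sixteen-line cache also collides in the eight-line one, so going up by a factor of two can only remove misses for a given trace.

```python
# Toy direct-mapped cache: index = block address mod number of lines.
# Assumptions: one-word blocks and a made-up address trace.

def count_misses(trace, num_lines):
    """Count misses in a direct-mapped cache with num_lines lines."""
    lines = [None] * num_lines          # tag stored per line, None = empty
    misses = 0
    for addr in trace:
        index = addr % num_lines
        tag = addr // num_lines
        if lines[index] != tag:         # compulsory, capacity, or conflict miss
            misses += 1
            lines[index] = tag
    return misses

# 0/8 and 3/11 alias pairwise in an 8-line cache, but not in a 16-line cache.
trace = [0, 8, 3, 11] * 8
print(count_misses(trace, 8))    # 32: every access misses, the pairs fight
print(count_misses(trace, 16))   # 4: only the compulsory misses remain
```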
Okay, so finally, what's the last thing that can cause a cache miss to occur? A conflict in the cache. This means that we don't have high enough associativity, and two different pieces of data, two different blocks, alias to the same location and they're fighting for that location. So there's a conflict there, and one of the pieces of data gets kicked out before you have time to go read it back in. So you have a load and then another load, but in the meantime you have a load to a different address which happens to alias to that same location in the cache. The hash function points to the same location, and they're going to fight for that resource. So that's a conflict.

Okay, so let's put some data behind this and look at some ways to make caches go faster. Let's look at what we are plotting here on this graph. On the X axis we have different sized caches: sixteen kilobytes, 32 kilobytes, 64, going up by factors of two all the way up to one megabyte. On the Y axis here we have access time, how long it takes to access the cache. So if we go back to the average memory latency equation: we take the miss rate times the miss penalty, plus the hit time, which is how long it takes to actually go look stuff up and find something in our cache in the first place. Well, if we can reduce the hit time, that's good. So say we take a cache which takes, I don't know, two nanoseconds to access. On a one gigahertz machine, that'd be two clock cycles. If we can somehow replace it with a cache which takes half a nanosecond to access, that's good; it means we can actually fit that in one clock cycle on a one gigahertz machine. In fact, we can probably go all the way up to about one nanosecond on a one gigahertz machine before we spill over into the next clock cycle. So you can think about it that way. So the first thing is that small and simple caches can be good. Everyone thinks capacity, capacity, capacity, but in your processor core, if you can reduce the hit time, that's a good thing. Just to give you some idea here, as you scale back to smaller caches, the time it takes to access the cache goes down.
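As a quick sketch of that average memory latency math, here is the equation written out. The hit times, miss rates, and 100-cycle miss penalty are assumed numbers, not values from the slide.

```python
# Average memory access time: AMAT = hit time + miss rate x miss penalty.

def amat(hit_time_cycles, miss_rate, miss_penalty_cycles):
    return hit_time_cycles + miss_rate * miss_penalty_cycles

# A small, fast cache (1-cycle hit) with a slightly worse miss rate can still
# come out ahead of a bigger, slower cache (2-cycle hit).
print(amat(hit_time_cycles=1, miss_rate=0.045, miss_penalty_cycles=100))  # 5.5
print(amat(hit_time_cycles=2, miss_rate=0.040, miss_penalty_cycles=100))  # 6.0
```

That is the argument for small and simple first-level caches: a lower hit time can pay for a somewhat higher miss rate.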
So the next thing we do is try to reduce the miss rate, and there are a couple of different ways to do that. These are just some basic optimizations in the cache lectures; later in class we're going to go through a lot more optimizations for caches. One thing you can do is think about the block size. In the examples I've been giving, we've been talking about a 64-byte block size, but you could think about having either a smaller block size or a larger block size, and in fact this happens. In this graph here, 64 looks like sort of the lowest miss rate, a good spot, but this is really dependent on your application space. If you have applications which just stream through memory, you probably want a bigger block size. If you have more random patterns and you're not getting any reuse, it might make sense to have an even smaller block size. This data is from your textbook, actually. This is plotting miss rates against block size for different cache sizes, so each of these lines here is a different size cache: you can see a four kilobyte cache, a sixteen kilobyte cache, a 64 kilobyte cache, and a 256 kilobyte cache. You can plot those two things against each other and see where the sweet spot in this curve is. We're trying to minimize the miss rate, so how often we actually take that cache miss.

Okay, so there are some positive things here about having larger block sizes. Let's talk about that. If we have a larger block size, we need less tag overhead; we talked about that already in the tag lookup slide. If you have longer block sizes, you can also think about having more burst transfers from your DRAM. DRAM typically likes to give you large chunks of data at a time and not little chunks, because there's an overhead cost in firing up the DRAM: there's what's called the column address strobe and the row address strobe, the RAS and CAS times, and you pay that on each memory access. So if you can pull in larger chunks of memory at a time, you only have to do that once for the large amount of data you bring in. That pushes you to want larger block sizes. And you could even think about similar ideas for on-chip buses: if you have larger block sizes, you'll probably be using that on-chip bus more effectively, because there are usually some overheads and some turnarounds for arbitration on buses.
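To put a rough number on the tag-overhead point, here is a sketch that counts total tag bits for a direct-mapped cache as the block size grows. The 32-bit address and 16 KB capacity are assumptions chosen for illustration.

```python
import math

def tag_overhead_bits(cache_bytes, block_bytes, addr_bits=32):
    """Total tag storage (bits) for a direct-mapped cache."""
    num_blocks = cache_bytes // block_bytes
    offset_bits = int(math.log2(block_bytes))
    index_bits = int(math.log2(num_blocks))
    tag_bits = addr_bits - index_bits - offset_bits
    return num_blocks * tag_bits

for block in (16, 32, 64, 128, 256):
    print(f"{block:3d}-byte blocks: {tag_overhead_bits(16 * 1024, block)} tag bits")
```

For a fixed-capacity direct-mapped cache the tag width stays the same in this sketch, but quadrupling the block size cuts the number of blocks, and therefore the total tag storage, by four.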
Okay, so on the right side we have the downsides of larger block sizes. If you have a larger block size, you might be pulling in data you're not using. So if I have a 256-byte cache block versus a 64-byte block, we're going to be pulling in four times as much data, and if we're only trying to access, let's say, one byte of that data, we just wasted a lot of main memory bandwidth to pull that byte in. So we need to be cognizant of that, and that's why this curve is not just increasing in one direction or the other. When I first took a computer architecture class, I thought, oh, as you increase the block size, shouldn't performance go up, or shouldn't the cache miss rate go down in this graph? And it's not true, because you start to waste bandwidth at some point. Also, if you have a larger block size, by definition you have fewer blocks. So if we have, let's say, 256-byte blocks versus 64-byte blocks, that's by definition four times fewer blocks in the cache for the same amount of data. If we have a four kilobyte cache, we're going to have the same amount of data, it's still a four kilobyte cache, but there are fewer blocks in that cache, so you're not going to be able to have as much random data in your cache at one time.

So that's one technique to reduce the miss rate. Another way, and in a perfect world this fights against small and simple caches, is that you just build big caches. If you build big caches, it means that when you go to access the data, there's a very high probability that the data is close to you. That sounds good. So here we actually have miss rate plotted against cache size, and of course the different associativities, the different types of caches here. And there's one thing I wanted to point out here: this empirical rule of thumb. If you double your cache size, your miss rate usually drops by root two. Sometimes people call this the square root rule. How do we derive this? Well, sorry [inaudible] you guys, it says here it is an empirical rule of thumb. It is just a rule of thumb.
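Here is that square root rule written out as arithmetic. The 4 KB baseline and its 10% miss rate are made-up numbers used only to show the shape of the curve.

```python
import math

def estimated_miss_rate(base_size_kb, base_miss_rate, size_kb):
    """Square-root rule of thumb: each doubling divides the miss rate by sqrt(2)."""
    doublings = math.log2(size_kb / base_size_kb)
    return base_miss_rate / (math.sqrt(2) ** doublings)

for size in (4, 8, 16, 32, 64, 128, 256):
    print(f"{size:4d} KB  ~{estimated_miss_rate(4, 0.10, size):.3f}")
```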
Obviously, if this were perfectly true, this line would be nicely curved instead of having some bumps in it, but this actually, surprisingly, works out pretty well as a rule of thumb. It doesn't really work very well for very small caches, so typically right in here it doesn't work well; possibly for very large caches it doesn't work well either; and for high associativity this rule of thumb starts to break down also. But it's a good rule of thumb to think about.

Okay, so how else do we reduce the miss rate? Well, same graph, and this is also from Hennessy and Patterson, your book. You can increase the associativity. So you can take, I don't know, a four-way cache and turn it into an eight-way cache. There's some power cost associated with that, and there's also typically a clock cycle cost. So let's look at the rule of thumb here. The rule of thumb basically says a direct-mapped cache of size N has about the same miss rate as a two-way set associative cache of size N over two. And that sounds crazy good. How is that possible? We should always build at least two-way set associative caches. But let's look at this graph and see if this actually works.

So we're going to look at a point. Let's look at a sixteen kilobyte cache that is direct mapped and compare it against, excuse me, a sixteen kilobyte cache that is two-way set associative, versus a 32 kilobyte cache with the lower associativity, direct mapped. And what you're trying to see here is whether this point is equal to that point. Well, it's not, but it's actually not a horrible approximation. We're saying that if we have a sixteen kilobyte cache with higher associativity, our miss rate goes down, right? So it's going to be somewhere in here, versus a 32 kilobyte cache which is direct mapped, which is this point here. Okay, well, that doesn't quite hold there. Let's go look for another point. Let's say a 32 kilobyte two-way set associative cache, which is that point there, versus a 64 kilobyte cache that is direct mapped. Well, those are almost on a straight line with each other, so it's almost exactly equal. This is just an empirical rule of thumb that people have figured out: as you double your associativity, you can almost halve your cache size and still have about the same miss rate.
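If you want to poke at that rule of thumb yourself, here is a toy set-associative cache model with LRU replacement. The trace is synthetic and the sizes are arbitrary assumptions, so treat the output as a way to run the comparison, not as evidence for the rule.

```python
from collections import OrderedDict
import random

def misses(trace, num_sets, ways):
    """Count misses for a set-associative cache with LRU replacement."""
    sets = [OrderedDict() for _ in range(num_sets)]   # per set: tag -> None, oldest = LRU
    count = 0
    for block_addr in trace:
        s = sets[block_addr % num_sets]
        tag = block_addr // num_sets
        if tag in s:
            s.move_to_end(tag)            # hit: mark most recently used
        else:
            count += 1                    # miss: allocate, evict LRU way if full
            s[tag] = None
            if len(s) > ways:
                s.popitem(last=False)
    return count

random.seed(0)
trace = [random.randrange(4096) for _ in range(100_000)]   # synthetic block addresses
print("direct-mapped, 512 blocks:", misses(trace, 512, ways=1))
print("2-way LRU,     256 blocks:", misses(trace, 256, ways=2))
```

On a trace with real locality the two configurations can separate noticeably, which is exactly the "it depends on your access patterns" caveat that comes next.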
And likewise, like I said, this is just empirical. There's no reason why this has to hold, and it's really dependent on your data access patterns. But I found that pretty interesting.

Okay, so what's the problem with always building a two-way set associative cache? Why do we not just do that? Well, the area for the data store shouldn't actually change very much. What does change is your tag store: the tag data you're going to need for the higher-associativity cache. And you're going to need more tag-check logic, at least, because you'll have to check, let's say, two tags in parallel.
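As a last sketch, here is where those extra tag bits and comparators come from when you keep the capacity fixed and double the associativity. The 32-bit address, 64-byte blocks, and 16 KB capacity are assumptions for illustration.

```python
import math

def lookup_breakdown(cache_bytes, block_bytes, ways, addr_bits=32):
    """Split an address into offset/index/tag for a given cache organization."""
    num_sets = cache_bytes // (block_bytes * ways)
    offset_bits = int(math.log2(block_bytes))
    index_bits = int(math.log2(num_sets))
    tag_bits = addr_bits - index_bits - offset_bits
    return {"sets": num_sets, "index_bits": index_bits,
            "tag_bits": tag_bits, "tag_comparators": ways}

print(lookup_breakdown(16 * 1024, 64, ways=1))   # direct mapped
print(lookup_breakdown(16 * 1024, 64, ways=2))   # 2-way set associative
```

Halving the number of sets moves one bit from the index into the tag, so every tag entry gets one bit wider and every lookup now drives two tag comparators instead of one; the data store itself stays the same size.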