Okay. Welcome to today's ELE 475. So today, we are going to be moving to a little bit different topic. We're going back to one of the topics that we talked about at the beginning of class in Lecture 3. We're going to talk about caches, and we're spending two lectures on caches because they are really important to performance. In, we'll say, the '80s and '90s, a lot of computer architecture papers and a lot of effort were spent trying to take the more and more transistors that Moore's Law was giving the computer architect and apply them to build bigger caches, fancier caches, more set-associative caches, harder, we'll say, ways to think about caches, all for higher performance. And we'll be continuing to videotape these lectures and put them online for everybody.

So let's start off looking at what we're doing today. We're going to start off with a little bit of review from Lecture 3. We're going to talk about the three Cs of caches: capacity, conflict, and compulsory misses. We'll be talking about the basic cache optimizations we had talked about in Lecture 3, which are mostly, kind of, make your cache bigger or make your cache more highly associative somehow to reduce these three Cs, and then we're going to start talking about much more advanced cache organizations and optimizations we can make.

So, let's briefly talk about it. It's sort of a hodgepodge, and this is kind of just the nature of the beast. There are lots of little things that sort of add up to make either the cache more effective or the integration of the cache into your pipeline more efficient. And this is only the beginning; we will have about five or six more topics next time.

But let's start off by looking at what we're going to be talking about today. We're going to look at how to build, how to pipeline a cache write. One of the things we haven't really thought about when we were working on our labs is that we do a write to the cache and it just sort of magically gets stored there. But if you're actually trying to build a real cache, you have to check the tag and then you have to write the data, because you don't want to be writing something into the cache if, let's say, a different piece of data is in that same location. So, you need to check the tag first. And we'll talk about some ways to optimize that.
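To make the check-the-tag-before-you-write idea concrete, here is a minimal sketch in C of the write path of a direct-mapped, write-back cache. The sizes, structure names, and the writeback_to_memory() helper are assumptions made up for illustration, not the lab's actual cache; a real pipelined cache would do the tag check and the data write in separate pipeline stages.

    #include <stdint.h>
    #include <stdbool.h>

    #define LINE_BYTES 64    /* assumed block size            */
    #define NUM_LINES  256   /* assumed number of cache lines */

    typedef struct {
        bool     valid;
        bool     dirty;
        uint64_t tag;
        uint8_t  data[LINE_BYTES];
    } cache_line_t;

    static cache_line_t cache[NUM_LINES];

    /* Hypothetical stand-in for writing a dirty victim back to the next level. */
    static void writeback_to_memory(uint64_t tag, unsigned index, const uint8_t *data)
    {
        (void)tag; (void)index; (void)data;   /* next level not modeled in this sketch */
    }

    void cache_write_byte(uint64_t addr, uint8_t value)
    {
        unsigned offset = addr % LINE_BYTES;
        unsigned index  = (addr / LINE_BYTES) % NUM_LINES;
        uint64_t tag    = addr / (LINE_BYTES * (uint64_t)NUM_LINES);

        cache_line_t *line = &cache[index];

        /* Step 1: read and compare the tag first. */
        if (!line->valid || line->tag != tag) {
            /* Miss: a different block lives here, so we must not clobber it. */
            if (line->valid && line->dirty)
                writeback_to_memory(line->tag, index, line->data);  /* the victim */
            /* (A real cache would now fetch the rest of the new block.) */
            line->valid = true;
            line->dirty = false;
            line->tag   = tag;
        }

        /* Step 2: only after the tag check passes do we write the data. */
        line->data[offset] = value;
        line->dirty = true;
    }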
We'll talk about a write buffer. A write buffer is an extra data structure we can put after our first level of cache, which can hold victims. A victim is a line, or a cache block, that is being evicted from the cache. If you're going to, let's say, replace something in the cache, you need some place to put the old data, or the data that was, let's say, dirty in the cache from before. And the slow way to do this, the way we've started with up to this point, is you take that line and you wait for it to get into either memory or the next level of cache. But if you want to go faster, you can think about storing it in some side structure and, we'll say, writing it to the next level of cache or to main memory in the background.

We'll talk about multi-level caches. Multi-level caches means you have a level-one, a level-two, a level-three, maybe a level-four cache, then main memory, and we'll see why this can improve performance.

We'll talk about victim caches. Victim caches: this is an idea that can basically increase the effective associativity of a cache by putting a small cache with very high associativity, maybe a fully associative cache, next to a low-associativity cache, maybe a direct-mapped cache, and using them together to basically implement a pseudo higher-associativity cache.

We're going to talk today about prefetching, both hardware and software. Prefetching is the act of either a piece of hardware or software trying to get data that you need into your cache early.

We'll talk about how to increase the bandwidth to your cache. Up to this point, we've looked at doing one memory operation per cycle. Well, we've built machines that can execute two operations per cycle, and on your exam, you had a machine which could execute three operations per cycle. Wouldn't it be great to be able to do three memory operations per cycle, versus just three ALU operations? So, how do we go about doing that? Well, we can put multiple ports on this structure, but that's not particularly great for performance from a clock perspective, because it makes the structure very large. So, you can start to think about banking, which we'll talk about today as a mechanism to increase the bandwidth.
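As a rough illustration of the banking idea, here is a small C sketch that interleaves cache blocks across independently accessible banks selected by low-order block-address bits, so two accesses that land in different banks could be serviced in the same cycle. The bank count and the conflict rule are assumptions for illustration, not a description of any particular machine.

    #include <stdint.h>
    #include <stdbool.h>

    #define LINE_BYTES 64   /* assumed block size      */
    #define NUM_BANKS  4    /* assumed number of banks */

    /* Interleave blocks across banks using the low bits of the block address. */
    static unsigned bank_of(uint64_t addr)
    {
        return (unsigned)((addr / LINE_BYTES) % NUM_BANKS);
    }

    /* Two accesses can proceed in the same cycle only if they target
     * different banks; otherwise there is a bank conflict and one of
     * them has to stall for a cycle. */
    static bool can_issue_together(uint64_t addr_a, uint64_t addr_b)
    {
        return bank_of(addr_a) != bank_of(addr_b);
    }

The design choice this captures is that each bank is a small single-ported array, so you get extra bandwidth without the area and clock cost of adding a true second port to one big structure.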
And then we'll talk about a bunch of different compiler optimizations, or software optimizations, that a compiler can do to restructure code to have better cache performance. So, this is just what we're talking about today.

Next lecture, we're going to continue on this. And the main topic of next lecture, as far as advanced cache optimizations are concerned, is we're going to be talking about how to build caches which can deal with out-of-order memory, or can deal with the ability to have loads and stores which hit in the cache, or loads and stores which miss in the cache, and then continue on past that point. In our sequential machine up to this point, and even in the out-of-order machines we were talking about, when you took a cache miss, you basically had to stop the machine, because you have to wait for the data to come back, because you didn't know if the next instruction was going to need this data, or maybe the next load would need that data, or something like that. So, if you want to try to go out of order, you can start thinking about that, and that's what the bulk of next lecture will be about: how you go out of order in your memory system, or what is largely called a non-blocking cache. And I should point out that a non-blocking cache is actually not just for out-of-order processors; it is sometimes good for in-order processors too. If you have an in-order processor, the instruction sequence is going in order, but you don't want to, let's say, wait when you take a cache miss. You could actually have the memory system be out of order with the pipeline in order. That's actually a pretty common thing for people to do.

Okay, so let's go look at our review. We have a processor. And in our basic machine, we go out to main memory when we want to get data. Main memory is big; it holds everything, but it can be slow. So, we put a cache in front of that, which keeps a subset of all the data in the main memory. And it has a faster access time, so it can take something like this example here, where you go to access memory and you have to wait for the memory to come back, and it can shrink this time. If we look at that as, sort of, the average memory access time, it's your hit time plus your miss rate times your miss penalty.
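As a quick worked example of that average memory access time formula, here is a tiny C sketch; the hit time, miss rate, and miss penalty numbers are made up for illustration, not taken from the lecture.

    #include <stdio.h>

    int main(void)
    {
        /* AMAT = hit_time + miss_rate * miss_penalty, with assumed numbers. */
        double hit_time     = 1.0;    /* cycles to hit in the cache  */
        double miss_rate    = 0.05;   /* 5% of accesses miss         */
        double miss_penalty = 100.0;  /* cycles to go to main memory */

        double amat = hit_time + miss_rate * miss_penalty;

        /* Prints 6.0: the cache turns a 100-cycle memory into an
         * average of 6 cycles per access at this miss rate.         */
        printf("AMAT = %.1f cycles\n", amat);
        return 0;
    }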
This is just review. Okay. So, going back to the three Cs, here we have a four-block cache, say, a two-way set-associative cache, and let's look at one of the major types of misses you can have: compulsory misses. This just means it's the first time any portion of your program has gone to access the data. There's no real way around this. Maybe you can prefetch, but, you know, you're not really going to magically be able to get the data in there early unless you have some sort of aggressive prefetching mechanism.

Capacity. In this cache here, we have four blocks. If you're looping over, let's say, five cache blocks' worth of data, you're basically never going to be able to fit all the data in this cache. Or, in a more common case, let's say you have a cache that's eight kilobytes, that's your L1 cache, and you're trying to add two arrays where each array is sixteen kilobytes. Well, that means 32 kilobytes of data, maybe even another sixteen kilobytes of data if you're not doing it in place for the result array, and it's just not going to fit in your cache. You have capacity problems, and things are just basically going to be continually kicked out of your cache.

Conflict means that you actually have enough space in your cache, but too many of the different loads or stores you're trying to do alias to the same location in your cache. So, let's say you're trying to loop over, or trying to access, three different cache blocks which all alias to one set in the cache. Effectively you can only fit two things, and you're trying to fit three things in two spots. This is where the pigeonhole principle comes up and says, no, you just can't fit three things in two boxes. So, we get conflicts.

Okay, so still review here. Let's take a look at one way to reduce our hit time. Hit time is just the fast case; that's when you actually find the data in your cache. This is the good case. But for hit time, one way to make it lower, and this is actually in nanoseconds, is just to have a smaller cache or a simpler cache. Small, simple caches are shown in this drawing here. We have time on the vertical axis; on the x-axis we have different cache sizes. And the 16-kilobyte cache is faster than, let's say, the 256-kilobyte cache.
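To see what "simple" buys you on the hit path, here is a rough C sketch with assumed sizes: a direct-mapped hit check is one index, one tag read, and one comparison, while an N-way set-associative lookup has to compare against every way in the set, and in hardware those extra comparators plus the way-select mux sit on the hit path.

    #include <stdint.h>
    #include <stdbool.h>

    #define LINE_BYTES 64
    #define NUM_SETS   128      /* assumed geometry */
    #define NUM_WAYS   4

    typedef struct { bool valid; uint64_t tag; } tag_entry_t;

    static tag_entry_t dm_tags[NUM_SETS];              /* direct-mapped   */
    static tag_entry_t sa_tags[NUM_SETS][NUM_WAYS];    /* set-associative */

    /* Direct-mapped: the hit path is one index, one tag read, one compare. */
    static bool dm_hit(uint64_t addr)
    {
        unsigned set = (addr / LINE_BYTES) % NUM_SETS;
        uint64_t tag = addr / (LINE_BYTES * (uint64_t)NUM_SETS);
        return dm_tags[set].valid && dm_tags[set].tag == tag;
    }

    /* N-way: every way's tag gets compared; in hardware all of these
     * comparisons happen in parallel and feed a mux that selects the
     * matching way, which lengthens the hit path.                     */
    static bool sa_hit(uint64_t addr)
    {
        unsigned set = (addr / LINE_BYTES) % NUM_SETS;
        uint64_t tag = addr / (LINE_BYTES * (uint64_t)NUM_SETS);
        for (unsigned way = 0; way < NUM_WAYS; way++)
            if (sa_tags[set][way].valid && sa_tags[set][way].tag == tag)
                return true;
        return false;
    }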
So, if you make a smaller cache, you can go faster. And we've seen examples of this, actually, in the Pentium 4, when we were talking about superpipelining. In the Pentium 4, they had a very small cache, and the real reason they did that is because they had such a high clock frequency that they couldn't put in anything bigger and still fit it in one clock cycle. You know, in a perfect world, if your clock cycle were slower, you could try to fit a bigger cache in there.

Next, you can try to reduce the miss rate. There are a couple of different techniques for reducing the miss rate, and there are advanced techniques which we will talk about later. But one of the basic ones is to change the block size. As you make the block size larger, you'll be pulling in more data, so you'll get spatial locality, and hopefully you'll be pulling in data that you actually need. So, you know, the positives of this are that you're really going to have less tag overhead if you have a larger block size, or cache line size. Because for the same amount of data, let's say you go from a 64-byte line to a 128-byte line, you're going to have half as much tag cost, or area dedicated to the tags (the sketch at the end of this section makes that arithmetic concrete). And usually, as you go to larger block sizes, this is good for the farther-out levels of your cache, because you can basically burst more data in. So if you look at something like DDR2 or DDR3 memories, which are the traditional DRAM technologies that we are using today, their block size, their line size, is very, very long, sort of thousands of bits. And if you have a longer block size, you can use all those bits and pull them all into your cache in one big chunk, and you can use wider buses.

Downside? You can waste bandwidth. If you have a very large block and you don't actually use all the data in the block, let's say you use one byte out of your 256-byte cache line, you're just pulling in all that data and you're wasting lots of off-chip bandwidth for no real good reason. And if you have fewer blocks, you're going to have fewer locations to put data for a fixed-size cache, so you might actually increase your conflict rate.

So, this chart here, which is from your book, is plotting the block size versus the miss rate.
And you want the lowest miss rate possible. So, if you start here with a 16-byte block, or 16-byte cache line, as you increase it, the miss rate actually goes down. But at some point, you end up wasting a lot of bandwidth, because the program does not have enough spatial locality, so you actually just end up pulling in more data than you need. You also could potentially have more conflicts, so the miss rate starts to go back up. So, there's sort of a minimum here. Traditionally, people have thought about this as: increasing block size usually increases your performance for most applications, but only up to a limit. And the reason why people traditionally thought this is that cache block sizes, or line sizes, used to be smaller, and then they started getting bigger, and people really weren't looking way out here, where the miss rate starts to climb back up for many applications as you get farther out.

Okay. So, another one of the basic ways we can reduce miss rate: you could just have a bigger cache. This one is great. Bigger cache, lower miss rate. You've got to pay area for this, and potentially it takes longer to access this larger cache, as we saw in the first graph that we showed today. I'm not going to really talk about this, but there is an empirical rule of thumb: if the cache size is doubled, your miss rate usually goes down by about the square root of two. It's just empirical, something people have found, but a good rule of thumb to know.

The same graph tells a different story. If you go to a higher associativity, your miss rate goes down. So, if you have a direct-mapped cache and you go to a two-way set-associative cache, the miss rate goes down. And the traditional sort of rule of thumb about this is that a direct-mapped cache of size N has roughly the same miss rate as a two-way set-associative cache that is half the size. You might say, wow, that's pretty amazing, so I'll just continually apply that over and over again; it doesn't quite work that way. It also gets very hard to actually build very highly associative caches, because your tag check becomes pretty hard: you have to check against all the different places in the cache, it has to be done in parallel, and that becomes hard.
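To tie the block-size, conflict, and associativity points together, here is a minimal C sketch; all of the sizes and addresses are assumptions for illustration. It computes the set index and tag of an address, shows three blocks that alias to the same set of a two-way cache (the conflict-miss case described earlier), and shows the earlier tag-overhead point: for a fixed capacity, doubling the block size halves the number of lines and therefore the number of tags you have to store and check.

    #include <stdio.h>
    #include <stdint.h>

    /* Assumed example geometry: an 8 KB, 2-way set-associative cache. */
    #define CACHE_BYTES 8192
    #define LINE_BYTES  64
    #define NUM_WAYS    2
    #define NUM_SETS    (CACHE_BYTES / (LINE_BYTES * NUM_WAYS))   /* 64 sets */

    static unsigned set_index(uint64_t addr)
    {
        return (unsigned)((addr / LINE_BYTES) % NUM_SETS);
    }

    static uint64_t tag_of(uint64_t addr)
    {
        return addr / (LINE_BYTES * (uint64_t)NUM_SETS);
    }

    int main(void)
    {
        /* Three hypothetical addresses exactly NUM_SETS * LINE_BYTES apart:
         * different tags, same set index, so with only two ways the third
         * one must evict one of the first two, a conflict miss.           */
        uint64_t a = 0x10000;
        uint64_t b = a + NUM_SETS * LINE_BYTES;
        uint64_t c = b + NUM_SETS * LINE_BYTES;

        printf("sets=%d  a->set %u  b->set %u  c->set %u\n",
               NUM_SETS, set_index(a), set_index(b), set_index(c));
        printf("tags: a=%llu b=%llu c=%llu (all different, so they conflict)\n",
               (unsigned long long)tag_of(a),
               (unsigned long long)tag_of(b),
               (unsigned long long)tag_of(c));

        /* Tag overhead: lines (and tags) = capacity / block size,
         * so doubling the block size halves the tag count.         */
        printf("lines with 64B blocks: %d, with 128B blocks: %d\n",
               CACHE_BYTES / 64, CACHE_BYTES / 128);
        return 0;
    }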