Welcome to our next installment of ELE475. Today we are going to be continuing our discussion of advanced cache techniques. And one of the big things we are going to get into is talking about how to build out-of-order memory systems. What I mean by out-of-order memory systems is, basically, that you can have memory operations outstanding in the memory system, but they execute out of order. And you can mix this with an in-order processor pipeline, or an out-of-order processor pipeline; it's the designer's choice. Both sorts of designs have been built, have been successfully built. But, and this is important, this allows you to effectively get more memory parallelism, because you can have transactions going out to the memory system in parallel, and the processor can continue and overlap computation with the memory.

So, just to recap from last lecture: we have a revised agenda here, now with all of our different cache optimization techniques, or advanced optimization techniques. We talked about pipelining caches, and we talked about adding a write buffer, last lecture. We talked about the reason and the motivation for multi-level caches: two-level caches, three-level caches, maybe even four-level caches, backed by main memory. We talked about having victim caches, which are extra little highly associative caches that you can put next to a low-associativity cache, or even a direct-mapped cache. And effectively this can increase your performance quite a bit, because once you miss in your direct-mapped cache, you can go out to your victim cache, and the victim cache may have the data you're looking for. This helps with small numbers of lines which are highly contended and might blow out, if you will, the amount of associativity you have in your cache. We talked about hardware and software prefetching. And this is where we stopped last time; we stopped at hardware and software prefetching.
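To make the victim cache recap concrete, here is a minimal sketch of the lookup path on a miss. Everything here, the four-entry size, the line_t layout, and the function names, is an illustrative assumption, not a particular machine's design:

    /* Minimal victim cache model: a tiny, fully associative side cache
       probed after a miss in the main direct-mapped cache. */
    typedef struct { int valid; unsigned tag; char data[64]; } line_t;

    #define VC_ENTRIES 4              /* small and highly associative   */
    static line_t victim[VC_ENTRIES];

    line_t *victim_lookup(unsigned tag)
    {
        for (int i = 0; i < VC_ENTRIES; i++)    /* hardware compares all */
            if (victim[i].valid && victim[i].tag == tag) /* in parallel  */
                return &victim[i];              /* hit in victim cache   */
        return 0;                               /* miss: next level down */
    }

In a real design the hit line is typically swapped back into the main cache, and the line it displaces becomes the new victim; that exchange is the part this sketch leaves out.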
And today we have a couple of extra topics. We are going to talk about how to bank your cache as a technique to get more bandwidth. The easy technique to get more bandwidth is just to put multiple ports on your cache, but that gets quite expensive, because you have to duplicate, possibly, your decode logic; you probably have to duplicate the write logic; and in the worst case, you may have to duplicate the storage space. In today's lecture we're also gonna talk about some software optimizations that the compiler can do. We're gonna start by talking about some software optimizations that are easy to do without transforming the code too horribly. And then we're gonna talk about some that actually require some pretty heavy lifting with respect to code transformations, and what you have to do to reorganize the code, still in valid ways. But by doing so, you can actually have better memory performance and better cache performance. And then we're gonna get to the meat of today's lecture: we're gonna be talking about non-blocking caches, sometimes called lockup-free caches, or sometimes called out-of-order memory systems. Then a couple of topics at the end here. We're gonna have critical word first, which basically means that when you go out to the memory system, you ask for the word that you need most urgently, and the memory system will give that back to you first, and this allows you to restart quicker. And then there's a very similar idea, but possibly easier to implement, called early restart. Instead of asking the memory system to return the data in a specific order and taking advantage of that, what you do is let the memory system give you back the data in order, or whatever the canonical order is that it returns data in. But when the word shows up that you need, the word you stalled for, you just restart the processor. So if you're asking for one word out of a huge cache line, and it happens to be the second word in the cache line, you don't have to wait for the whole cache line to get loaded into the cache. But this comes with some extra complexity.
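As a rough model of early restart, suppose the memory returns the line in its canonical order, one word per beat. The helpers mem_next_word() and restart_processor() are hypothetical placeholders, not a real interface:

    #define WORDS_PER_LINE 8

    extern unsigned mem_next_word(void);        /* next beat from memory */
    extern void restart_processor(unsigned w);  /* unstall the pipeline  */

    void fill_line_early_restart(unsigned *line, unsigned needed_word)
    {
        for (unsigned w = 0; w < WORDS_PER_LINE; w++) {
            line[w] = mem_next_word();          /* fixed arrival order   */
            if (w == needed_word)               /* the word we stalled   */
                restart_processor(line[w]);     /* on: restart now, the  */
        }                                       /* rest fills in behind  */
    }

Critical word first would differ only in that the memory is asked to send needed_word on the very first beat, so the restart happens even sooner.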
And finally, we are going to talk about way prediction. Way prediction, we touched on a little bit when we were talking about some of the instruction issues in superscalars; we are going to talk about it in a little more detail here. Way prediction allows you to predict where to go find either the next piece of data you are going to access, or the next instruction you are going to access. This is important if you have a multi-way cache. In a direct-mapped cache there is only one place to go find it. In a two-way cache there are two places to go find it, and if you check both, you can try to do that in parallel, but it'll probably waste a bunch of power. Or, what you can do is predict it. You can predict that it's in way one, or you can predict that it's in way zero, and use that prediction. If you have high prediction accuracy, you may only have to fire up half your cache. Or, if it's on your critical path, you don't actually have to do two comparisons, or three comparisons, or four comparisons, however high the associativity of your cache is.
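A minimal software model of that idea for a two-way cache might look like the following. The set count, the one-entry-per-set predictor, and the omission of valid bits are all simplifying assumptions:

    #define SETS 256

    static unsigned tags[SETS][2];       /* tag array for the two ways    */
    static unsigned char predict[SETS];  /* predicted way for each set    */

    /* Returns 1 on hit and sets *way_out; returns 0 on miss. */
    int lookup(unsigned set, unsigned tag, int *way_out)
    {
        int p = predict[set];
        if (tags[set][p] == tag) {       /* fast path: one comparison,    */
            *way_out = p;                /* only the predicted way fires  */
            return 1;
        }
        if (tags[set][p ^ 1] == tag) {   /* mispredict: probe the other   */
            predict[set] = p ^ 1;        /* way and retrain the predictor */
            *way_out = p ^ 1;
            return 1;
        }
        return 0;                        /* miss in both ways             */
    }

The fast path is where the savings come from: on a correct prediction, the other way's tag and data arrays never light up, and a mispredict just costs an extra probe.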
So let's go take a look at banking strategies and multi-porting our caches. This is a little bit of a recap; I think we talked about this briefly during last lecture, but we are going to continue on with that. So here we have our two-way issue processor pipeline. You have a program counter, you read an instruction cache, you have your instruction registers. What's interesting is that in this pipe, you can only execute one memory operation per cycle, and that's a limiter. Let's say you want to execute two memory operations, or three memory operations, or four memory operations that are all independent of each other. What do you go about doing? Well, in our two-issue pipeline here, logically, what you wanna do is somehow put the data path on both pipelines, and remove this restriction of where you can execute memory accesses, of which pipe you can issue memory accesses from. Okay, that doesn't sound too horrible, until we go look at the implementation of this. We can go and have a true multi-ported cache. With a true multi-ported cache, if you think about the decoding logic on a read, we're gonna have to at least duplicate that decoding logic. Likewise, if we want to actually execute two writes at the same time, we're going to have to duplicate the row decoders for that logic. And a lot of times, how people actually go to implement this is they basically duplicate the entire cache to get an extra port, for a variety of reasons. You know, the area increase could be large; it could even be double for a two-read-port, two-write-port sort of cache, so they don't even share the storage data. But that's only a little bit of the problem. The other problem here is that it also increases the hit time of your cache, and no one likes increasing hit times in caches. Now, why does it increase the hit time? Well, unless you, for instance, duplicate everything, if you try to share some of the logic, you're basically going to increase the loading on the readout of the cache here. And by increasing the load, this is gonna have slower drive strengths, and you'll probably increase the time. And you can go look in something like CACTI, which is a tool from HP that models different cache configurations, and you can see that adding a second port will usually be the last thing in the world you wanna do when you're doing your processor design. It ends up being really quite expensive, mostly in terms of area and somewhat in terms of timing. So can we do better? Well, maybe yes, maybe no. One thing you can think about doing is, instead of actually having a true dual-ported cache, can we have a pseudo-dual-ported cache, or a cache which has the same external interface? So if we sort of cut out here in the middle, we have two addresses coming in, two data paths coming out; it looks the same, but instead we actually put something else in the middle here. What do we put in the middle? Well, we actually put two banks, as they're called, of our cache. The first bank is one-half the size of our cache, and the other bank is one-half the size of our cache, and they're both single-ported. What this is relying on is that if you have statistical memory accesses that have some good pattern to them, the probability that you'll actually end up routing two memory addresses to the same bank, and get what's called a bank conflict, will be quite low. Instead, the probability that you actually go to two different banks will be high. Now, you might say, well, bank conflicts sound pretty possible to happen. This is sort of the naive drawing here; when people go to build these things for large caches, you actually have many banks. And if you have more banks than the number of ports you have, you can make a decent argument that the probability you're gonna get a bank conflict ends up being quite low, or the probability that any two random addresses will end up going to the same bank will be low.
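Here's a tiny sketch of the steering and conflict-detection logic for the two-bank case. The 64-byte line size and the single bank-select bit are assumptions for illustration:

    #define LINE_BITS 6                  /* 64-byte lines (assumed)      */

    static int bank_of(unsigned addr)
    {
        return (addr >> LINE_BITS) & 1;  /* lowest line-index bit picks  */
    }                                    /* one of the two banks         */

    /* Returns 1 if both accesses proceed this cycle, 0 on a conflict. */
    int issue_two_accesses(unsigned a0, unsigned a1)
    {
        if (bank_of(a0) == bank_of(a1))
            return 0;                    /* bank conflict: one access    */
                                         /* must replay next cycle      */
        return 1;                        /* different banks: both go in  */
    }                                    /* parallel, one port per bank  */

The muxes in the hardware picture correspond to routing a0 and a1 to whichever bank bank_of() selects.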
Now, let's talk a little bit about addressing, and how we do this assignment of addresses to banks. The most basic thing is you can just low-order interleave it, or you can try to high-order interleave it. So one of the questions that comes up here is: which is better, low-order or high-order interleaving of our banks? Everyone take a guess on this one. When I say low order, I mean use the low-order bits to hash into which bank you go to. High-order interleaving uses some higher bits; maybe middle-order interleaving uses some middle-order bits. What would you choose if you were a designer?

That's a good insight. So, if we think we are accessing an array, and we're trying to stride through the array, then as we stride through the array, the high-order bits of those addresses are all going to be the same. So if we have high-order interleave bits, all the high bits are going to be the same, and we'll get conflicts: they're all going to go to, let's say, bank zero all the time. And therefore our cache, which we were trying to give more or higher throughput, didn't get any higher throughput. Okay, now let me give you a counterexample. The counterexample is, let's say you are striding through an array, and you have an array of structures, a pretty common thing to do. And the structure is, let's say, 128 bytes long. And we go and access the first byte of that 128-byte structure in a loop. Are the low-order bits of the address the same between all of those? Yeah. So I can probably bet our higher bits are the same and our lower bits are the same. What's changing in that example? Some medium bits; some bits a little bit higher than the low bits. So what we're trying to get across here is that you have to be very careful about how you do bank interleaving if you have naive hash functions, the functions which assign an address to a particular bank. And you may want to think about having even more sophisticated hashing functions, or assignments from address to bank. But if you go to do that, you know, this is on the critical path of your memory load, so you may not want to have a complex hash function there. So you might want to just take some middle-order bits; you might want to take some low bits and some high bits for your bank assignment. It gets a little challenging here. You probably don't want the highest-order bits, because in most systems those are almost always zero or something like that. It's very unlikely that your highest bits are something important, unless you're accessing a stack, where those are usually all ones.
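Here is that structure-stride example in code, on a hypothetical machine with 64-byte lines and a single bank-select bit. Striding by 128 bytes holds the lowest line-index bit and the high bits constant; only a middle bit varies:

    #include <stdio.h>

    unsigned low_bank (unsigned a) { return (a >> 6)  & 1; } /* lowest line bit */
    unsigned mid_bank (unsigned a) { return (a >> 7)  & 1; } /* one bit higher  */
    unsigned high_bank(unsigned a) { return (a >> 31) & 1; } /* highest bit     */

    int main(void)
    {
        for (unsigned i = 0; i < 4; i++) {
            unsigned a = 0x10000000 + i * 128;  /* first byte of each struct */
            printf("addr %#010x -> low %u  mid %u  high %u\n",
                   a, low_bank(a), mid_bank(a), high_bank(a));
        }
        return 0;
    }

Running this, low_bank and high_bank print 0 for every access, so either of those interleavings sends the whole loop to bank zero, while mid_bank alternates 0, 1, 0, 1, which is the behavior you actually want from the hash.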
So I have another slide here; I think we talked about most of these points already, but these are the issues if we're trying to get high throughput. Bank conflicts: we talked about how you can get a bank conflict, and we talked somewhat about how you can avoid bank conflicts, by having more banks than you have ports into the banking structure. And that's just from a statistics perspective. If, let's say, you have a 32-kilobyte cache, and each bank is one kilobyte or something like that, then all of a sudden, what's the probability that you have two ports trying to access one particular bank? It's kind of like a birthday problem, like trying to find whether two people have the same birthday, 'cause it's not quite the easy probability; you can go figure that out, but it's quite low. Roughly, what is this? It's probably 32 times one over 32 times one over 32: the chance that both ports pick one particular bank, summed over all 32 banks, so about one in 32. So it's not the lowest thing in the world, but it's possible to actually reduce the probability of a conflict with random accesses. And if you have a better hash function, of course, you can do better.
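A quick Monte Carlo sanity check of that estimate, under the assumption that the two ports pick banks independently and uniformly at random:

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        const int banks = 32, trials = 1000000;
        int conflicts = 0;
        srand(1);
        for (int t = 0; t < trials; t++)
            if (rand() % banks == rand() % banks)  /* two random picks  */
                conflicts++;                       /* landed together   */
        printf("conflict rate %.4f, expected %.4f\n",
               (double)conflicts / trials, 1.0 / banks);
        return 0;
    }

With 32 banks and two ports, the conflict rate comes out around 1/32, about three percent per cycle, which is why having more banks than ports works so well for random-looking access streams.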
Extra wiring: so what do I mean by extra wiring? Well, in our previous design, we didn't have these sort of crossover paths. What this presents you with is that there's a mux here, a multiplexer there, a mux here and a mux there for the address paths. So that's extra wiring and extra delay from banking. And one of the bigger challenges with the extra wiring in multi-banked memories is that it shrinks the bank size, or shrinks the memory size, and if you traditionally look at memories, they like to be larger. You can amortize the overheads a little bit better, for example the sense amps at the bottom of the memories and the row decoders and the column decoders, if the memory is bigger, because, for instance, the row decoders and the column decoders roughly grow as the log of the number of bits, or something like that. But if you make each of your caches really small and you have multiple of them, that overhead gets worse: you have to have more sense amps, more row decoders, more column decoders. So it's something to think about, but usually that's relatively small and you're willing to pay it, unless you get some really funny aspect ratios. What do I mean by really funny aspect ratios? If you shrink the cache too small, so if you have, I don't know, 256-byte caches, and the readout width is, let's say, a byte, then you get a long and narrow cache. And you can try to use column addressing or something to squish that into a more square form factor. But sometimes if you squish your bank sizes too small, you get very funny aspect ratios, or very long, skinny RAMs, that don't fit very well into the designs. Okay.

Uneven utilization: so what do I mean by that? Well, what I mean is, just like you can have bank conflicts due to the addressing or the hashing scheme, you can also have a pathological case where all of the interesting data is stored in one of the banks. So let's say, going back to the example, you use low-order interleaving on the lowest-order bit of the address, above the offset we'll say, to choose between bank zero and bank one. But it just so happens that the program pads out all its data structures such that that bit is a constant zero, basically, for all the useful arrays you're trying to access. What that will do is locate all the data lines in bank zero. Alright, well, we just took our big cache and cut it in half, because we effectively placed all of our data over in this one bank. So we're getting uneven data utilization. So we can get uneven bandwidth utilization and bank conflicts, but we can also get uneven data utilization. These often go together, though, the bank conflicts and the uneven utilization: 'cause if you're accessing something a lot, and it's in bank zero, that probably means it's stored there, so you have uneven storage patterns also. Questions so far about banking, why we wanna bank?
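One possible shape for the "more sophisticated hashing function" idea from a few slides back, sketched here purely as an illustration, is to XOR-fold several index fields into the bank number, so that a single padded-out constant bit can no longer pin every access to one bank:

    #define LINE_BITS 6
    #define BANK_BITS 3                      /* 8 banks: an assumption   */

    unsigned bank_hash(unsigned addr)
    {
        unsigned idx = addr >> LINE_BITS;    /* drop the line offset     */
        unsigned b = 0;
        while (idx) {                        /* fold all the index bits  */
            b ^= idx & ((1u << BANK_BITS) - 1);  /* together, BANK_BITS  */
            idx >>= BANK_BITS;               /* at a time, with XOR      */
        }
        return b;                            /* bank number, 0..7        */
    }

The catch, as noted above, is that this sits on the critical path of every load, so a real design would keep it to a level or two of XOR gates rather than a full fold like this.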
Banking is a pretty common thing to do. A good example of this is if you go look at the Niagara processor, which is a Sun microprocessor: the first Niagara had eight cores, and if you go look at their cache structure, they had eight banks in the cache. So it's a relatively heavily banked last-level cache. If you go look at something like the multi-core Intel chips out there, they have even more banks. Actually, I think I saw one once where the 32-megabyte last-level cache for the Intel Itanium 2 had some incredibly large number of banks; I think it had 32 or 64 banks in their cache. And they were wide, also, so they had lots of bandwidth going through that cache. Okay, so if you look at our cache banking, it really, really helps here, as we are increasing the bandwidth. We might actually be hurting some things: we might be hurting the hit time a little bit. Miss rate and miss penalty should not change, and the cache size is the same; we haven't effectively increased or shrunk the cache size. Even if you get lots and lots of bank conflicts, or you have poor utilization, that's still not actually affecting your miss rate, 'cause you would have been conflicting in the same-size cache. Logically, you could draw a box around it, and it would be the exact same thing. It's just that, if you were to have it not be a banked cache, you would have every other line be in the contested ones, or something like that. So just having better memory layout is good all around.