So, a few things on our agenda for today: we're going to finish up our review of caches, and then we're going to start talking about in-order superscalars, which is something new. And I want to say this now, though I'll reiterate when I come to it: this other book, the [unknown] book, is much more useful for the superscalar portion of our class than the Hennessy and Patterson book. I recommend either acquiring it or borrowing a friend's copy, or the library has a copy of that book. At least reading the two to three relevant chapters in there would be useful. But that's not until the second half of today's lecture. For the first half, we have to finish up caches. So, going back to review, let's talk about what caches can look like on the inside, and a couple of different ways to classify caches. We've already talked about why caches are good, so this is a little bit of a review. You have big, slow things over here, big slow memories, and you have the processor. And you can think about how to build small, fast memories. If we can put a small, fast memory close by, and have a high probability that things are actually located in this small, fast memory, then we don't have to go out to the big, slow memory as often.
We can save power and have better performance, and so on. Okay. So, how do we classify caches, and, reviewing, what does a cache look like on the inside? So: processor, cache, main memory, and addresses flow that way. Data can go in either direction, because you can either do a store, which will put data out from left to right, or you can do a load, which will have data coming back the other way. So, let's look at our sort of mythical view here of what a cache looks like on the inside. Caches contain copies of what is being stored in main memory. And if you look inside of it here, let's say this first line stores address 100. So here, we have address 100, and it has some data in it. Typically, you introduce this notion of a line in a cache because we want to reduce the tracking information for a particular cache line, sometimes called a block. It is possible to build something which tracks data on, let's say, a bit or byte basis, but then the tracking information is probably larger than the amount of data you're actually trying to store. So, that's not really a great idea.
In the cache here, if you look inside of it, what we store is an address tag. So, it's not the full address. It could be the full address, but then you're probably storing extra bits. You only need to store enough of the address to be able to differentiate the different things that could be stored at the same location. So, in a fully associative cache, where you can put any data in any location, you will have to store, for instance, the whole address in the tag. But if you, let's say, have a direct-mapped cache, where each address can only go to one location, you get to store only a subset of it. And I wanted to point out here: sometimes people call this whole thing a line. A line typically includes the tag information and possibly valid bits, and the block is the actual data that we're storing. Okay. So, this is really a review of what the cache lookup algorithm looks like. Let's review it briefly. The processor issues a load, so we're going to look at the load case here. We take the address, and we compare the subset of the bits that could be different among the different things that can be stored in this location in the cache, and we check it against the tag.
One thing not shown here is that we also need to check that the line is valid, because it is possible you could have an invalid cache line. There is typically one bit which says whether the cache line is valid or not valid. You have to make sure it is valid. If so, you get a cache hit and just return the data. If not, you go out to main memory. Ooh, this is a tricky one here. You have to go find someplace to put that data, because this new data comes in, and if you want to store it in your cache now, there's probably something in the cache at that same location that just gets bumped out. We're going to talk briefly about replacement policies, and later, in the advanced caches lecture, we'll talk more about that and other cache policies, and much more advanced replacement policies. But you kick something out, you fill the new data in, and then finally you return the data. You could also think about trying to do this step and this step in parallel, effectively, and a lot of more advanced cache systems will actually return the data before they finish the eviction, to reduce the latency of the load.
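The load path just described can be sketched as follows. This is a minimal toy model, assuming a single-level, direct-mapped cache; the dictionary-and-list layout and the function name are illustrative, not from the lecture:

```python
def load(cache, memory, addr, block_size=4):
    """Direct-mapped load: check tag AND valid bit; on a miss, evict, fill, then return."""
    block_addr = addr // block_size
    index = block_addr % len(cache)        # which line the address maps to
    tag = block_addr // len(cache)         # remaining bits distinguish aliases
    line = cache[index]
    if line["valid"] and line["tag"] == tag:
        return line["data"]                             # cache hit
    data = memory[block_addr]                           # miss: go to slow main memory
    cache[index] = {"valid": True, "tag": tag, "data": data}  # kick out old line, fill
    return data

memory = {i: f"block{i}" for i in range(32)}
cache = [{"valid": False, "tag": 0, "data": None} for _ in range(8)]
assert load(cache, memory, 48) == "block12"   # miss: block 12 fills line 12 % 8 = 4
assert cache[4]["valid"]
assert load(cache, memory, 48) == "block12"   # same address again: now a hit
```

A real cache would track dirty bits and write-backs on eviction; this sketch only models the hit/miss/fill sequence.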
Okay, so let's do some classification, a sort of taxonomy of what a cache looks like. So, what could we talk about? We could talk about block placement: the locations where a particular block, or a particular address, can be found in the cache. Block identification, which is the question of how to actually find something in the cache if I give you an address; this is kind of the inverse of the placement question. Block replacement, which is figuring out what should be replaced on a miss. And then, finally, what happens on a write. The first three are all sort of read questions; well, block replacement is not really a read question, but the write strategy is a good question here: what happens when a write occurs? Do we update all the caches, or only the main memory? Do we evict it just out of the cache? All of these are possible strategies to choose. Okay, so let's look at a block placement figure here, and look at a couple of different types of caches. Up here, we have a 32-word memory. Let's say this is all of the memory in the machine. And, depending on how you want to look at this, each block, let's say, is one byte, or it could be some number of bytes. To give an example, something typical like your Core 2 Duo, I believe, has a 64-byte cache line, or cache block size. Some people have tried to push that up, and there are reasons to make it larger or smaller, but we'll talk about that in a second.
So, let's take this highlighted location here, block number twelve, and let's see where it can fit into different types of caches. We have three different types of caches here. Let's start on the right side. On the right side, we have what's called a direct-mapped cache. The direct-mapped cache is, literally, an array, and each block can only fit in one place in the cache. And the indexing function that you use is: you take the block number mod the size of the cache. So, in this case, we have block number twelve, or byte number twelve here; we take twelve mod eight, and that says block four. So, this is the only place to go look for the data. Okay, that's nice and simple. It may not give the best performance, but it's easy to build. It's all index-based. Okay, then, one step up from there, we can start to look at things like set associativity. What set associative means is that instead of having only one unique location where a block can be placed, you allow it to be in some number of locations. In this drawing here, there's a two-way set associative cache. So, each block can fit in one of two places.
Now, you can actually have machines which have much more associativity, or higher associativity. An example of this is the modern Core i7 processors. Their L1 caches are 8-way set associative, and farther out, they're even more highly associative; I think they have a 16-way set associative cache in their L3, or farther-out, caches. So, let's look through a basic example here. We want twelve mod four, so how do we do this? Well, there are only four buckets, if you will. And inside of a bucket, a block can fit into either of the two sub-slots in the bucket, which are called ways. So, if we take twelve mod four, that is zero. So, it has to fall into this bucket here. It can be in either one of those two ways, but it can't be in both. That's an important point there. Finally, there are types of caches which we'll call fully associative caches. What this means is that the associativity, or the number of ways, is equal to the number of elements in the cache. So here, there are eight elements, eight block locations, in this cache. And it's fully associative, which means there is a single set with eight ways, if you will. So, it can fit in any of these locations.
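All three placement schemes fit one formula: the candidate locations form a set, and the set index is the block number mod the number of sets. A small sketch (the function name is illustrative, not from the lecture) reproduces the block-twelve examples:

```python
def placement_set(block_num, num_blocks, ways):
    """Which set a block maps to, for a cache with num_blocks lines and a given associativity."""
    num_sets = num_blocks // ways      # direct-mapped: ways=1; fully associative: ways=num_blocks
    return block_num % num_sets

# Block 12 in an 8-line cache:
assert placement_set(12, 8, 1) == 4    # direct-mapped: 12 mod 8 = 4, the only legal slot
assert placement_set(12, 8, 2) == 0    # 2-way: 12 mod 4 = 0, either way of set 0
assert placement_set(12, 8, 8) == 0    # fully associative: one set, any of the 8 ways
```

Within the chosen set, the block may occupy any of the `ways` slots; the mod only picks the set.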
So, this same address, block number twelve, can fit anywhere in this cache. Okay. So now, the next question is: how do we actually find the block in the cache? You're going to, let's say, take an address, and you want to figure out where it goes in the cache. Let's take a look at the example of a direct-mapped cache first. And here's our information: we have the data block, and then we have some tag information, and we have a valid bit. So, if we look at the address that we're sticking in here, let's say this is a 32-bit address. Some portion of that address is just going to figure out where in the data block the load or store is going. So, if you have, let's say, a 64-byte cache line, or 64-byte data blocks, you're going to basically index into some sub-word in there based off the offset bits. Then, we start to look at the block address. The traditional way to do this is that you use these middle-order bits as the index. There are more complicated cache systems which move this around or do much more complicated hashing functions, but the basic hash function here is: we're going to take the index. So, in this cache here, let's say there are four entries in the cache.
How many bits are in this index? Two, yes, okay. Two to the two is four. So, this index is in here and chooses one of these four locations. And then, finally, this tag information here is used to compare against the tag that we store. And that will say whether this address is actually in the cache right now, or if we have to go out to main memory. So, let's work a brief example here. We have 32-bit addresses. We have a four-set, direct-mapped cache. And we have 64 bytes in our block size. How many bits are in the offset? Let's start with that. The offset is going to have to index a byte within the block here. So, if we have byte-addressable memory and there are 64 different byte positions, log base two of 64 is six. Okay, so we have six bits of offset. Okay, what did we say about the index here? We said this was already two bits. Okay, so that's eight bits total. So, how many bits of tag do we have? The tag information is the remaining 24 bits. And as you can see, as you make the data block size larger, or the cache line size larger, you need fewer tag bits, because the offset sort of eats away at the address this way, and the index gets shifted up. So, that could be a good thing.
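The bit-counting in this worked example can be checked mechanically. A small sketch, using the lecture's numbers (32-bit addresses, 64-byte blocks, four sets; the function name is illustrative):

```python
def address_fields(addr, block_bytes=64, num_sets=4, addr_bits=32):
    """Split an address into (tag, index, offset).
    64-byte blocks -> 6 offset bits; 4 sets -> 2 index bits; tag = 32 - 2 - 6 = 24 bits."""
    offset_bits = (block_bytes - 1).bit_length()   # log2(64) = 6
    index_bits = (num_sets - 1).bit_length()       # log2(4)  = 2
    offset = addr & (block_bytes - 1)              # low bits: byte within the block
    index = (addr >> offset_bits) & (num_sets - 1) # middle bits: which set
    tag = addr >> (offset_bits + index_bits)       # high bits: stored and compared
    return tag, index, offset

tag, index, offset = address_fields(0x12345678)
assert offset == 0x38            # low 6 bits of the address
assert index == 1                # next 2 bits
assert tag == 0x123456           # remaining 24 bits
```

Doubling the block size to 128 bytes would add one offset bit and shave one bit off the tag, which is the trade-off mentioned above.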
Finally, there's this V column here. This is what I was alluding to before; it's really important. You need a valid bit to see whether the line is valid. It's very possible you could have old data, or the cache is not initialized, at which point these V bits would all be zero, or not valid. Okay, so let's do a slightly more advanced example here. We have a two-way set associative cache, still with four lines. And we want to go find our data. Hm. So, how do we go about doing that? What's interesting here is that when we did this before, we only had to check one tag location. Well, for a cache like this, the index is going to tell us which of the two sets it's in. But it can be in either of the two ways of that set. So, we actually need to do a tag check against this tag and that tag, and check the two valid bits. Many caches can do this in parallel, but that's how you figure it out. And it's possible that it's in neither of those two, and we have to just say that it's a cache miss. Okay. So, let's talk about when you need to take things out of the cache; let's talk about block replacement. We need to figure out what to kick out of the cache when you take a cache miss and bring some new data in.
An important point here: in a direct-mapped cache, this question makes no sense, because in a direct-mapped cache, a block can only go to one location. In set associative or fully associative caches, we need to make a decision. And hopefully, we only need to make this decision when a set becomes full, because otherwise, we'll just choose the empty location in that set. So, if you have a two-way set associative cache, and one of the ways is full of data, and the other one is just empty, its valid bit is clear, we probably shouldn't even be looking at different choices for how to replace things, because there's an open spot to put the data; we should just put the data there. Okay. So, what are some good block replacement policies, or good cache replacement policies? The first one is random. You just choose randomly out of a hat. Actually, it doesn't work so badly. It's easy to implement: you just have some linear feedback shift register, or you choose some random number somehow, and you replace that way. Not so bad. Well, now, if we put our thinking caps on, we can start thinking about some better ways to do this. We can think about least recently used. Hm.
So, the idea behind this strategy is that you are trying to preserve temporal locality. If you've used some data recently, there's a heuristic here that it's likely to be used again in the not too distant future. So, if you look back to that plot of time versus address that we had, along the time axis we think that temporal locality is something that happens relatively often. So, one really good strategy is to pick the least recently used block as the block to get rid of. If something has not been used recently, we kick it out of the cache. Now, one problem with this is that it gets really hard to do. Well, there are two problems. One: to implement proper least recently used, every single load or store that happens to the cache needs to update the tracking information. That's not great for power. But it's also not great because you basically need very fine-grained tracking information on each cache line, updated on every single cycle that does a load or a store. So, you have some cache state that needs to be updated pretty often. Plenty of people do that, though.
If you have a least-recently-used piece of information, where you have one extra bit on each cache line, you can think about just flipping that bit back and forth, having one bit for that information, with every single load and store setting that bit; lots of processors do that. What starts to get hard is when you try to go above two-way caches. With two-way, you have one bit that can decide which way is the least recently used. If you start to have, let's say, an 8-way set associative cache, and you want to do true least recently used, you've got to somehow keep something like timestamps for each different way. You have to say, oh, well, this one was just recently accessed, and you can timestamp it, or you can try to reorganize a list of way numbers in order. These are the things that start to get harder to do in hardware, especially on the critical path of your processor. So, a lot of times, what people do is they have pseudo least recently used for highly associative caches, where you'll try to keep accurate information about maybe the last two most recently used lines, and for the other ones you do random, or something like that. Or there are some more complex things, which we'll talk about later in the class. Another one: first in, first out.
So, whatever gets brought into the cache first sits in the cache for some period of time and then gets kicked out in the future. You're sort of guaranteed that some piece of data that you bring in will sit in your cache for some period of time, and this is a lot easier to implement in highly associative caches. But it's certainly not the same as implementing least recently used, because if you bring in a cache line in, let's say, a four-way set associative cache, four fills later it's at the top of the list to get kicked out. But what's nice about this is that you're guaranteed it'll sit in your cache for some period of time and won't just get randomly kicked out, which is a problem that random has. Finally, another acronym here: not most recently used. Let's think about that one for a second. We're going to kick out something that is not the most recently used block. So, if we can keep track, say, in a four-way cache, we can think of having two bits per set that say which line was the most recently used, and we pick among the rest of them at random. And we know we are not going to be kicking out the most recently used line. So, write strategies: you're going to do a store, and you've got to know where you put the data.
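The contrast between LRU and FIFO above can be made concrete for a single set. A toy sketch (the class and method names are illustrative; the "ways" are just slots in a list, leftmost is the next victim):

```python
class CacheSet:
    def __init__(self, ways):
        self.ways = ways
        self.blocks = []   # eviction order: index 0 is the next victim

    def access_lru(self, block):
        """LRU: EVERY access, hit or miss, updates recency (the costly bookkeeping)."""
        if block in self.blocks:
            self.blocks.remove(block)       # hit: move to most-recently-used end
        elif len(self.blocks) == self.ways:
            self.blocks.pop(0)              # set full: evict least recently used
        self.blocks.append(block)

    def access_fifo(self, block):
        """FIFO: eviction order is arrival order; hits need no bookkeeping at all."""
        if block in self.blocks:
            return                           # hit: nothing to update
        if len(self.blocks) == self.ways:
            self.blocks.pop(0)               # evict the oldest arrival
        self.blocks.append(block)

lru, fifo = CacheSet(2), CacheSet(2)
for b in ["A", "B", "A", "C"]:   # 2-way set: A, B fill it; A is re-used; C forces an eviction
    lru.access_lru(b)
    fifo.access_fifo(b)
assert lru.blocks == ["A", "C"]    # LRU evicted B, since A was touched again
assert fifo.blocks == ["B", "C"]   # FIFO evicted A, since it arrived first
```

The divergence on the re-touched block "A" is exactly the lecture's point: FIFO ignores reuse, so a hot line can still be the next victim.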
Important, important, important thing here. Caches have lots of different allocation policies, and policies about whether you keep the data or throw out the data at the different levels of the cache hierarchy, when a store, a write, happens. So, let's take a look at a couple of different choices here. Let's say you are doing a store and you have a cache hit, so it hits in your cache. Should you put that data in main memory also? If you do a store, and you put every single store into your cache and into main memory, even when it's already in your cache, then, as was pointed out, it uses bandwidth out to main memory. So, that can be a downside. There are some positives to this, though. When you go to evict the line, you don't have to write anything back to main memory; you have effectively pre-written the data to main memory. So, that's a positive. You could also think about this in a multilevel cache hierarchy: at some levels of the hierarchy, it may make sense to actually put it in the next level out, if you have enough bandwidth. So, this is actually a trade-off. As I said, computer architecture is all about trade-offs; neither choice is simply correct. The first one is called write-through.
So, write-through means: you put it in your cache, if you get a cache hit, and you also put it in memory. A write-back cache means you just put it in your local cache, and when that block gets evicted and has dirty data, you have to put it back in main memory. So, those are kind of the downsides and upsides there. One thing you do need to do if you have write-back, as we say here, is keep track of whether the data is dirty or not. If you go look at more complicated cache coherence protocols, sometimes you'll actually favor things like write-through, because write-back gets really complicated when you have data that has to be invalidated and sent out to other places. But if you knew the data was clean, i.e., it has no dirty data, you never have to do a write-back when an invalidation comes in, in a multiprocessor scenario. So, it's a tough trade-off there, but generally, write-back is considered to be more advanced and saves bandwidth. Still, there are trade-offs for why you might want to do one or the other, because write-through is definitely simpler to design. Okay. And, oh, I do want to point out that you may not actually go all the way out to main memory.
Plenty of architectures have small L1 caches, for instance, where you don't necessarily want to deal with invalidations coming in and complicated things happening. There, you might do write-through from your L1 to your L2, but then you want to save, let's say, off-chip bandwidth, so you write back out of your L2 to the DRAM. That is a pretty common sort of thing to do. And as I was saying, the real motivation for this is that if you have a multiprocessor system, you have, let's say, invalidation requests. So, let's say another processor wants that same cache line, and the data is dirty here, but not in the DRAM. You'd actually have to reach down into the L1 cache, if you have write-back out of the L1 cache, and that gets pretty complicated. It especially gets complicated if your L1 cache is very tightly integrated into your CPU, because it's in your main pipeline, and you have to somehow bubble the pipe, or take a structural hazard, for any invalidation that comes into that L1 cache. So, a lot of times, people will consider write-through caches, at least from the L1 to the L2.
327 00:24:44,092 --> 00:24:51,000 But you definitely want to save bandwidth from your L2, sort of out to DRAM, so 328 00:24:51,000 --> 00:24:58,415 you don't write through every time. Okay, let's take a look at what happens on 329 00:24:58,415 --> 00:25:02,802 a cache miss. So, on a cache miss, we're doing a store 330 00:25:02,802 --> 00:25:07,070 or write to memory. Do we need to put it in our cache at all? 331 00:25:11,031 --> 00:25:15,254 Choices: maybe yes, maybe no. There's no guarantee; you don't have to put 332 00:25:15,254 --> 00:25:18,458 it in your cache. A lot of times, and this comes down 333 00:25:18,458 --> 00:25:23,179 to heuristics here, you'll actually do a store, and you may just not 334 00:25:23,179 --> 00:25:26,617 read that data again. So some architectures actually have store 335 00:25:26,617 --> 00:25:31,129 instructions with locality hints, or temporal locality hints, saying, I'm doing 336 00:25:31,129 --> 00:25:35,540 this store, I'm doing this write right now, and I'm not going to read it again 337 00:25:35,540 --> 00:25:39,634 for a very long period of time. Don't put it in my cache. 338 00:25:39,634 --> 00:25:44,078 Also, some designs just decide not to do write allocate. Well, 339 00:25:44,078 --> 00:25:48,621 I'll define these two things here. No write allocate means you are doing the 340 00:25:48,621 --> 00:25:54,003 store, or write, and you're not going to pull the data into your cache 341 00:25:54,020 --> 00:25:56,064 to do the merging there, locally if you will. 342 00:25:56,064 --> 00:26:00,001 And you're not going to actually store any data into your cache. 343 00:26:00,001 --> 00:26:03,056 Cuz you don't have to. You're just keeping a local copy of it; 344 00:26:03,056 --> 00:26:07,021 you're not forced to keep a local copy. 345 00:26:07,021 --> 00:26:10,037 You can always send it out to your canonical main memory.
346 00:26:10,037 --> 00:26:14,063 Write allocate says, go fetch the data from memory, merge it in the cache, because 347 00:26:14,063 --> 00:26:17,681 usually the block size is longer than the amount of data you're 348 00:26:17,681 --> 00:26:20,934 writing. So, merge it, and then either 349 00:26:20,934 --> 00:26:25,092 put it out to main memory if you write through, or just keep it in the cache 350 00:26:25,092 --> 00:26:30,027 if you write back. So, you can think about having these two 351 00:26:30,027 --> 00:26:32,682 policies, and they both have their places. 352 00:26:32,682 --> 00:26:37,018 Typically, people try to write allocate, 353 00:26:37,018 --> 00:26:41,095 unless you have some locality information. Cuz data, let's say, at the top of 354 00:26:41,095 --> 00:26:46,002 your stack in your processor, or in your sort of software stack, 355 00:26:46,002 --> 00:26:50,043 you are going to basically be writing and reading from that relatively often. 356 00:26:50,043 --> 00:26:52,509 So, it makes sense to try to allocate that. 357 00:26:52,509 --> 00:26:57,328 The other thing I did want to point out is, there is a heuristic sort of going on 358 00:26:57,328 --> 00:27:02,271 with this no-write allocate that, if you read that block, then it will get into 359 00:27:02,271 --> 00:27:05,419 your cache. So, maybe the first store that you do to 360 00:27:05,419 --> 00:27:09,789 the address won't allocate, but the next load you do will allocate. 361 00:27:09,789 --> 00:27:13,031 So, you're not going to miss that much by doing no-write allocate. 362 00:27:13,031 --> 00:27:15,076 It's not like you've lost everything forever. 363 00:27:15,076 --> 00:27:19,062 It's just that that first write, oh well, had to go out to the memory 364 00:27:19,062 --> 00:27:22,017 system.
But the next load that I do from, let's 365 00:27:22,017 --> 00:27:25,048 say, my stack, or some local variable, will pull that line in anyway. 366 00:27:26,025 --> 00:27:29,538 Okay. So, some common combinations here. 367 00:27:29,538 --> 00:27:35,047 Write back with write allocate. So, this is sort of the full-blown 368 00:27:35,047 --> 00:27:40,029 solution that a lot of people do. This is kind of like the base high 369 00:27:40,029 --> 00:27:44,053 performance one. Write through and no-write allocate is 370 00:27:44,075 --> 00:27:48,070 also possible; this is, like, the cheap one to build. 371 00:27:48,092 --> 00:27:53,038 You don't need any logic here to deal with write allocate. 372 00:27:53,038 --> 00:27:56,038 But all possible mixes are valid.
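The two allocation policies above can be sketched as a store-miss handler (a toy model with my own names; it assumes a write-through cache so the store itself goes to memory either way):

```python
LINE_SIZE = 4  # words per cache line, for illustration

def store_miss(cache, memory, addr, value, write_allocate):
    """Handle a store that missed in the cache."""
    base = addr - addr % LINE_SIZE       # start of the containing line
    if write_allocate:
        # Fetch the whole line from memory and merge the stored word into
        # it, since the line is longer than the data being written.
        line = [memory[base + i] for i in range(LINE_SIZE)]
        line[addr - base] = value
        cache[base] = line
    # No-write allocate: the cache is left untouched.
    # Either way, the store goes out to memory (write through here).
    memory[addr] = value
```

Under no-write allocate the line still shows up in the cache later, when the next load to that line misses and fetches it, which is the heuristic mentioned above: you only lose that first write.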