Now we get into the meat of today's lecture. We're going to talk about a number of advanced cache optimizations, starting with pipelining the cache.

So here is our sketch of what happens in the cache. We have a tag array and we have the data blocks, and we want to do a write into the cache. Forget about the read for a second and focus on the write. For a write, you first do a tag check: you index into the tag array, it spits out a tag, you also check the valid bit, and you do a comparison. If it matches, the data is already in the cache, and since you're doing a write you can just write the data into the data array. That sounds good. The challenge is that this is inherently sequential: you absolutely need to do the tag check before you do the write, because if a different address is stored in that location and you blindly do the write, you're going to overwrite the wrong data. And you're going to be very unhappy, because you clobbered data that didn't belong to this particular write at all.

So, do we think we want to do all of this in one cycle? Anyone want to wager a guess? Do people build machines that do this in one cycle? Sure, the original machines did do it in one cycle. It's possible to build: you have one combinational path from the index, through the tag array, through the comparison logic, through the write enable, and into the data array. It's buildable, but it's not great for your clock period.

So how do we reduce the hit time for a write? There are a couple of different strategies. We can think of it as doing this over two different cycles: the first cycle checks the tag, and the second cycle does something with the data. That's our problem statement; what are our solutions?

One solution, which is kind of innovative, is to build a multi-ported cache, a cache that can simultaneously do a read and a write to the same address. You have a cache, you're doing a write, and you blindly do the write without checking the tag. At the same time, you read out the data that was there and save it off. If the write took a tag miss, you go back and fix things up by filling the old data back in; you just restore the old value on a tag miss. People have built such things, and it works perfectly fine, but you need a side buffer. This looks a lot like a victim cache, which we'll talk about later in today's lecture: you're speculatively putting the new data in and pulling out the old data as the victim. If the write hits in the tag check, you can just let it go and everything is great; if not, you have to pull the victim back out and undo the write. It's doable. Not a lot of people do this, but people have built machines like that.
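To make the ordering problem concrete, here is a rough Python sketch of a small direct-mapped cache, showing both the serial "tag check, then write" path and the "write blindly, restore on a tag miss" variant described above. This is only an illustration of the logic, not how the hardware is actually built; the class and function names are made up for this sketch.

```python
# Illustrative model of the write-hit ordering problem (names are invented).

NUM_SETS = 8  # small direct-mapped cache, one word per block

def split(addr):
    """Split a word address into (tag, index)."""
    return addr // NUM_SETS, addr % NUM_SETS

class Cache:
    def __init__(self):
        self.valid = [False] * NUM_SETS
        self.tags  = [0] * NUM_SETS
        self.data  = [0] * NUM_SETS

    def write_sequential(self, addr, value):
        """Tag check first, data write second -- the long serial path."""
        tag, idx = split(addr)
        if self.valid[idx] and self.tags[idx] == tag:   # step 1: tag check
            self.data[idx] = value                      # step 2: data write
            return "hit"
        return "miss"                                   # hand off to the write-miss path

    def write_blind_with_restore(self, addr, value):
        """Read and write the line at the same time; undo the write on a tag miss."""
        tag, idx = split(addr)
        victim = self.data[idx]                         # save the old data off to the side
        self.data[idx] = value                          # blind write, no tag check yet
        if self.valid[idx] and self.tags[idx] == tag:   # tag check done in parallel
            return "hit"
        self.data[idx] = victim                         # tag miss: restore the old value
        return "miss"
```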
Another solution is to build a fully associative cache. What do I mean by a fully associative cache? I mean a cache organized so that a line can be stored in any location in the cache. In those structures, the tag check and the data access are typically done in parallel anyway: because you don't know where to find the data, there's no separate index operation. The index operation is the tag check, the CAM operation, the content-addressable-memory lookup. Since you have to do the tag check just to find the data at all, it doesn't really hurt you to do the tag check first and then do the write. It's not necessarily a great design if you have a very large cache, but for a small cache you can definitely build a content addressable memory for the tags, and then, because it's fully associative, write the line anywhere. That's one way to solve this write problem.

And then there's what we're going to focus on in today's lecture: pipelining the write. To pipeline the write, instead of actually doing the write in the M stage of your pipeline, we check the tag in the M stage and hold the store data for some time in the future. Now, you're going to say: did we actually do the store, then? Well, yes. We're going to call that committed state. So that's what we're going to focus on today: how to do a pipelined memory access to reduce our write hit time.

So here we have our five-stage pipeline. In the M stage for a write, we're going to check the tag, and we're not going to fire up the data array at all. You might ask: where do you put the data? We put a little buffer here that stores the data that's going to be written in the future, a delayed write buffer, as we're going to call it. So, in parallel, you check the tag and you put the data in this buffer. Then, sometime in the future, you want to move it into the cache, but you need a convenient time to do that. One option is to wait for a dead cycle on the cache. That sounds good, but how do you know you're going to get a dead cycle? You can't guarantee that: I can write a piece of code that does store after store after store in a really tight loop, so you never get a dead cycle. The cool trick here is that a subsequent store only has to use the tag array. So if you have a store after a store, the first store checks the tag, doesn't use the data array, and just puts its data in the buffer. Then, when the second store comes down the pipe, it checks the tag, but the data array port is free, so you can do the first store's data write at that time. So if you have a store after a store after a store, you can decouple the tag access from the data access for stores and use the data array port to do the buffered write later.

Let's go through this with a slightly more detailed drawing, just to reiterate what's going on. You do the store, it checks the tags, and it saves off the address and the data. At some point in the future, when there's an idle cycle or another store is going on, that data gets moved into the data array. That sounds good. Okay, pop quiz question: what happens when you do a load and there's data sitting in the delayed write buffer? You have to bypass it, yes; you need to go check it. So you need logic that compares the delayed write address against a later load's address, and if you get a hit there, the buffered data has to come around and be returned to the load. So we basically need to check this buffer on every load.
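Here is a rough single-entry sketch of that pipelined write scheme, again just as a behavioral illustration with made-up names: the store only touches the tag array and parks its data in a delayed write buffer, the buffered write drains into the data array when the port is free (a dead cycle or a later store's data slot), and loads bypass from the buffer on an address match.

```python
# Behavioral sketch of a pipelined write with a one-entry delayed write buffer.

class PipelinedWriteCache:
    def __init__(self, num_sets=8):
        self.num_sets = num_sets
        self.valid = [False] * num_sets
        self.tags  = [0] * num_sets
        self.data  = [0] * num_sets
        self.delayed = None              # pending (addr, value) waiting for the data array

    def _split(self, addr):
        return addr // self.num_sets, addr % self.num_sets

    def _drain(self):
        """Retire the buffered store when the data array port is free."""
        if self.delayed is not None:
            addr, value = self.delayed
            _, idx = self._split(addr)
            self.data[idx] = value
            self.delayed = None

    def store(self, addr, value):
        """M stage: use only the tag array; the data write happens later."""
        self._drain()                    # this store frees up the data array port
        tag, idx = self._split(addr)
        if self.valid[idx] and self.tags[idx] == tag:
            self.delayed = (addr, value)   # committed state, held in the buffer
            return "hit"
        return "miss"

    def load(self, addr):
        tag, idx = self._split(addr)
        if not (self.valid[idx] and self.tags[idx] == tag):
            return "miss"
        if self.delayed is not None and self.delayed[0] == addr:
            return self.delayed[1]       # bypass the buffered store data to the load
        return self.data[idx]

    def idle_cycle(self):
        self._drain()                    # a dead cycle on the cache also drains the buffer
```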
One thing I do want to point out: this is the naive way to do it. More advanced processors will typically have a multi-entry version of this delayed write buffer, because the cache is usually pipelined internally too, and it takes multiple cycles to actually access the data array, so you might not do the write until the end of that pipeline. On some of the processors I've worked on, these delayed write buffers are on the order of maybe two to four entries. So you can generalize this into a bigger structure, and when you do loads you obviously have to do a content-addressable match against all of those entries to see if the data is in the delayed write buffer.

Okay, so let's see how well pipelining our cache does. We're going to use this framework throughout today's lecture: the ways an optimization can make life better. It can reduce the miss rate; it can reduce the miss penalty; it can reduce the hit time, which is what, for instance, a smaller cache buys you; or it can increase the bandwidth to your cache. All of these factor into performance, but not all of the optimizations we're going to talk about today touch all of them. In fact, most of them only touch one or two at a time.

So, pipelined caches: what do they do for us? Are they going to reduce the miss rate? I see people shaking their heads no. Right, we didn't make anything bigger or smaller here, so it's probably not going to touch the miss rate. Is it going to affect our miss penalty? That is, when we miss in the cache, how long does it take to go to, let's say, main memory or the next level of cache? No, that's not going anywhere different either; effectively this implements the exact same thing. Does it affect the hit time? It's kind of hard to say whether it moves in a positive or negative direction. If you compare it to having the tag access and data access in the same cycle, pipelining actually makes the hit time a little better, because you're doing less in a cycle; that would have us put a plus here. But if you compare it to, say, not having a cache at all, just a big RAM, it actually makes the hit time worse, because we need to mux in extra stuff: we need to do that associative check against our delayed write buffer, and that hurts us a little. So that one might be a plus or a minus. The bandwidth, though, definitely gets better, because before we had to tie up the cache for the whole time it took to do the tag check and the data write, and now we can have two stores in flight at the same time, the tag check for one store and the data write for a different store. So this really improves the bandwidth. That's where we are with pipelined caches.
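To make the multi-entry variant mentioned above a bit more concrete, here is a rough extension of the earlier sketch: a small buffer of pending stores where a load does an associative, CAM-style match against every buffered address and forwards from the youngest match. The entry count and all names are illustrative, not taken from any particular machine.

```python
# Sketch of a multi-entry delayed write buffer with associative lookup for loads.

class DelayedWriteBuffer:
    def __init__(self, entries=4):
        self.entries = entries
        self.pending = []                 # list of (addr, value), oldest first

    def push(self, addr, value):
        """Buffer a store that has already passed its tag check."""
        assert len(self.pending) < self.entries, "buffer full: stall the store"
        self.pending.append((addr, value))

    def lookup(self, addr):
        """CAM-style match for a load; the youngest matching store wins."""
        for buffered_addr, value in reversed(self.pending):
            if buffered_addr == addr:
                return value              # bypass buffered data to the load
        return None                       # no match: fall through to the data array

    def drain_oldest(self):
        """Hand back the oldest entry when the data array port has a free cycle."""
        return self.pending.pop(0) if self.pending else None
```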