Okay, so let's take a look at what we have
to add to our pipeline. We have the in-order fetch, out-of-order
issue, out-of-order write back, and in-order commit pipeline that we had before.
Note, it had variable-length pipes. It had a reorder buffer.
It had a finished store buffer. It had a scoreboard, and it had an instruction
queue. So it had all the structures we
talked about last time. And now we're going to add two more
structures to it, and we're gonna modify the structures that
are there slightly. Now let's talk about what this is
gonna do. The first structure we're gonna add
is a free list. The free list is gonna keep track of
physical registers that we could go use. We will probably
have more physical registers than we have architectural registers,
but you need to keep track of which ones are free to be used, because we are gonna
be allocating and deallocating from the pool of physical registers quickly
as we execute. The other structure here we call the
rename table. Sometimes this is called the RAT, the register alias table.
That is sort of the Intel nomenclature for this; or actually, the RAT
is either this table or the table we were discussing in the Tomasulo-algorithm
variants of this, but they're very similar. What this table does is
map from an architectural register to the most up-to-date version in our physical
register file. So for an instruction
that's sitting here at our decode stage, it says where to go find the value, because this
gets complicated. We just renamed
everything. We have different names for everything.
The value is in some physical register, and we need to go figure out where
it is. And that's what this table does.
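To make the interaction between the two new structures concrete, here is a minimal sketch in Python of a rename step at decode. The class name, the method name, and the identity initial mapping are all assumptions made for illustration; the lecture does not prescribe an implementation.

```python
class Renamer:
    """Toy rename stage: a rename table plus a free list (names are hypothetical)."""

    def __init__(self, num_arch, num_phys):
        assert num_phys >= num_arch
        # Assumed starting point: architectural register i lives in physical register i.
        self.rename_table = {a: a for a in range(num_arch)}
        # Every physical register beyond the initial mapping starts out free.
        self.free_list = list(range(num_arch, num_phys))

    def rename(self, dest, srcs):
        """Rename one instruction; return (dest_preg, src_pregs), or None to stall."""
        # Sources read the most up-to-date mapping before the destination changes.
        src_pregs = [self.rename_table[s] for s in srcs]
        if not self.free_list:
            return None  # no physical register available, so decode must stall
        dest_preg = self.free_list.pop(0)
        self.rename_table[dest] = dest_preg
        return dest_preg, src_pregs


r = Renamer(num_arch=4, num_phys=6)
first = r.rename(dest=1, srcs=[2, 3])   # writes r1
second = r.rename(dest=2, srcs=[1, 1])  # reads the renamed r1
third = r.rename(dest=3, srcs=[0])      # free list is empty by now
```

Here the second instruction picks up the new physical register for r1 rather than the old one, and the third finds nothing left to allocate.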
We're also going to add two fields to the reorder buffer.
I'll talk about that in a second. And we're going to want to increase the size
of the physical register file so that we can get more performance.
We need at least one physical register for each architectural
register, but if we have only the same number of physical registers as
architectural registers, we're not going to get any more performance from adding a
register renaming step to our pipe. Okay, so this is kind of for completeness:
it shows where everything gets written in the pipe over time.
Two things I wanted to point out here. The free list gets updated at the front
and also gets updated here at the end, and the condition under which you can deallocate
a physical register gets a little complicated, and
we'll talk about that. And the rename table gets read up here,
because that tells you where you actually get the value.
It also gets updated here when we actually send an instruction down the pipe.
And we also want to update some pending bits when we get to the end of the pipe,
so that a reader knows whether to pick the value up from the physical register file or
the architectural register file, which matters for rollback.
Okay, so let's jump into these data structures and see what we add.
As I said, we're gonna add stuff to the reorder buffer.
Our previous reorder buffer looked very similar to this.
We had some state saying whether an entry was pending, free, or finished, where
"--" represents free and "F" means finished:
the instruction got to the end of the pipe and is waiting to commit.
We've got a bit that says the instruction came after a branch.
You might have multiple of these bits if you allow multiple branches in flight.
A bit that says whether it's a store or not. A bit that says whether it writes a
register. This says the destination is valid, and
that's important for us to know because when we get to the end of the pipe we need to
know whether to actually commit some state into the architectural register file.
And we have a field here, which we had before, which is the physical register
file specifier. This tells us where to go read from.
That's all good. But now we add some extra bits
here. The first one is an architectural
register file specifier. Okay, so this gets a little complicated.
What are we thinking about here? Why do we need this?
When we get to the end of the pipe and we are going to commit, if we go back and
look at this picture here, in the commit stage we take something from the physical
register file, and the reorder buffer drives this and says, okay, copy
that into the architectural register file when the commit occurs.
Well, now we've renamed everything, so it's not an identity map from physical
register number to architectural register number.
So we need to know where to actually write in the architectural
register file, and that's what this does.
It just tells us where to go write. So this is where we read from, and this is
where we write to. Let's say this instruction here is the
one that has just turned to finished.
It's going to go commit. We read the value from here,
and we write it into the location pointed to by here.
And then we have one other field here, which is a little bit odd.
We have the previous physical register. Why would we need that?
That doesn't seem to make any sense. What is this doing?
This is something we actually read out of the rename table at the front of the
pipe. Let's say this was register four.
This is where the in-flight physical register is, and this is the previous
physical register that held the value of register four before we did the update.
The reason we need to know this is that when we hit the end of the pipe, we need
some way to deallocate physical registers, and we're going to use this to
track that. We'll give an example in a minute.
But what this is really going to say is: we wrote a new value
of register four. Let's say the old value of register
four was in physical register 27 and the new one is in physical register 30 or
something like that. Then we need to deallocate physical register 27, and we can do that
when we reach the end of the pipe, by committing this instruction out of the
reorder buffer and cleaning up all the state.
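As a rough illustration of how those reorder buffer fields get used at commit, here is a small Python sketch. The field names (`preg`, `areg`, `prev_preg`) and the dictionary-based register files are assumptions for the example, and it uses the register 4 / P27 / P30 numbers from above.

```python
from dataclasses import dataclass


@dataclass
class RobEntry:
    state: str        # "P" pending or "F" finished
    writes_reg: bool  # destination-valid bit
    preg: int         # where the result lives in the physical register file
    areg: int         # where to write in the architectural register file
    prev_preg: int    # physical register that held areg before this rename


def commit(entry, phys_rf, arch_rf, free_list):
    """Commit one finished entry: copy the value, then recycle the old mapping."""
    assert entry.state == "F", "only finished instructions can commit"
    if entry.writes_reg:
        arch_rf[entry.areg] = phys_rf[entry.preg]  # read from preg, write to areg
        free_list.append(entry.prev_preg)          # the older copy can be reused


phys_rf = {27: 111, 30: 222}   # P27 holds the old r4, P30 the new r4
arch_rf = {4: 111}
free_list = []
commit(RobEntry("F", True, preg=30, areg=4, prev_preg=27), phys_rf, arch_rf, free_list)
```

After the commit, the architectural copy of register four holds the new value and P27 is back on the free list.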
Okay, a quick picture here of the rename
table. This is indexed by architectural register.
P tells us whether we have a write in flight,
so we know that the value is not in the architectural register file.
And PReg tells us where in the physical register file to go find the
value. This is really important when a
subsequent instruction that is looking for that value shows up and wants to get that
value before it reaches the architectural register file.
It looks here, and this tells us: oh, it's pending, it's gonna be here in a little
bit. Together with this and the scoreboard, you
might even be able to bypass it early, on a good day.
On a bad day we have to wait for it to get to the physical register file, but that's a
lot better than having to wait to pick it out of the architectural register file.
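Here is a minimal sketch of that lookup in Python, assuming a rename table that maps each architectural register to a (P bit, PReg) pair and a simple per-physical-register ready flag standing in for the scoreboard. The function name and return convention are made up for illustration.

```python
def read_operand(areg, rename_table, arch_rf, phys_rf, ready):
    """Return ("value", v) if the operand is available, else ("wait", preg)."""
    pending, preg = rename_table[areg]
    if not pending:
        return ("value", arch_rf[areg])   # no write in flight: use the committed copy
    if ready[preg]:
        return ("value", phys_rf[preg])   # written back but not yet committed
    return ("wait", preg)                 # producer still executing: wait or bypass


rename_table = {1: (True, 7), 5: (False, 4)}   # areg -> (P bit, PReg)
arch_rf = {1: 10, 5: 50}
phys_rf = {7: None, 4: 50}
ready = {7: False, 4: True}                    # stand-in for the scoreboard

before = read_operand(1, rename_table, arch_rf, phys_rf, ready)
ready[7], phys_rf[7] = True, 99                # the producer writes back
after = read_operand(1, rename_table, arch_rf, phys_rf, ready)
other = read_operand(5, rename_table, arch_rf, phys_rf, ready)
```

The reader of register one first has to wait on P7, then picks the value up from the physical register file without waiting for commit.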
And finally we have a free list. This is literally just a bit per
physical register, which is very different from a bit per architectural register.
Let's say we have big N physical registers,
or say we have 256 physical registers. We have a bit saying whether that
register has been deallocated and is ready to be used for future register renaming,
or whether there's an instruction using that physical
register or its value is waiting to commit to the architectural register file.
It's just a bit that says whether
the register is free or not. Pretty simple. Actually, before I go on here, I wanted to
make an interesting observation. Where does this register renaming
become really important? Well, if we go look at something like the
original Intel architecture, they had eight registers.
If you want to run high-performance code, you have to reuse those registers pretty
quickly. So they got register limited very quickly
when they tried to build faster and faster processors,
and they had to introduce register renaming quite early in the Intel
microarchitecture implementations. They had many, many more physical
registers than architectural registers very quickly, because eight isn't gonna get
you very far. They can have about 100
in-flight instructions. And you can't have that
many in-flight instructions if you stall on write-after-write hazards,
because you're going to have to rewrite some register.
It's kind of like a pigeonhole problem.
If you have more than eight in-flight instructions that write registers, at least two of them
write the same register, so you get a write-after-write dependency and you're gonna stall the
pipe. So they were not going to have
more than about eight in-flight instructions if they did not do
register renaming. So they did register renaming pretty
early in their pipelines. Okay, so this gets us to the eye chart
here. Let's walk through a basic case of what's
in all these different tables as we execute our simple code here.
On the top we have the four instructions: two muls and some adds.
This was our original test case. Note there are all these dependencies
through register four that we need to worry about.
There are both write-after-read and write-after-write
dependencies. We're gonna execute it quickly here:
as you can see, the final add issues early.
And this is really driven by the register renaming.
So let's take a look at what happens here.
We'll try to interpret this. Here we have cycles.
Cycles are also across the top.
We show what's in the decode, issue, write back, and commit stages of the pipe.
We leave out the execute stages because it's too much to draw here, and it's drawn at
the top in a different form. Let's first look at the rename table.
Let's say, for clarity here, we only
have seven architectural registers, but we're going to have, let's say,
eleven physical registers. So we're gonna have more physical
registers than architectural registers in this example.
We start off and we say, okay, well, register one:
if you want to go find architectural register one, the value is in physical
register zero. We
just come up with some allocation. The circles mean that it's not
pending; it's not in flight in the pipe.
That's just the base case. The pipe has
drained, everything is allocated, and we just
drew a basic allocation here at the beginning.
Now as we go to execute, some interesting stuff starts to happen.
The first thing that is going to happen is we issue
this instruction, which writes to register one.
We need to rename this at this point; register one has to be renamed to
something else. So in this table here, we
rename register one to physical register seven.
Okay, that sounds good. What happens next?
Well, next we try to
execute this instruction here, the second mul, and it goes to read register
one. When it goes to read register one, though,
we can look at the rename table and
say, oh, that's actually in flight, and it's in physical register seven.
So if we go look over here, we can say that value is in
physical register seven, and it's currently not ready. But P4,
the other input, which is register five renamed to
P4, is ready to go.
Okay, one of the other interesting things that happens here is that
as we go to allocate this, we have to remove it from the free list.
This list here is the list of all the free registers.
We start off with four free registers and we narrow it down as we start to
do writes. At some point we run out.
I want to make an important note about
this: when we run out of physical
registers we're going to have to stall the pipe, because we can't do any more
renaming. We can't issue more instructions at that
point. So it's really
important to realize that when you build your machines you have to have enough
physical registers that you don't run out very often.
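As a small sketch of the free list as a bit per physical register, including the stall when it runs dry, here is some Python. The class and method names are hypothetical, and picking the lowest-numbered free register is just one possible allocation order. The sizes match the example (eleven physical registers, with P7 through P10 free at the start).

```python
class FreeList:
    """One bit per physical register: True means free (hypothetical sketch)."""

    def __init__(self, num_phys, initially_free):
        self.free = [False] * num_phys
        for p in initially_free:
            self.free[p] = True

    def allocate(self):
        """Return the lowest-numbered free register, or None if rename must stall."""
        for p, is_free in enumerate(self.free):
            if is_free:
                self.free[p] = False
                return p
        return None

    def release(self, preg):
        """Called at commit to hand a dead physical register back."""
        self.free[preg] = True


fl = FreeList(num_phys=11, initially_free=[7, 8, 9, 10])
allocated = [fl.allocate() for _ in range(4)]
stalled = fl.allocate() is None   # a fifth writer finds nothing free
fl.release(0)                     # a commit returns P0
resumed = fl.allocate()
```

Once all four free registers are handed out, the next rename has to wait until a commit releases something.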
Now, it's possible that you could still run out.
Let's say you have hundreds of in-flight instructions
and you only have, let's say, 64 physical registers.
You might still run out, but the probability of that happening might
be relatively low. And you
sort of bake this into your CPI. Your CPI may not be less than one, or may
not be low, anyway. So given the probability of that
actually happening, you may not worry about it too much.
Another cute little story here: there have actually been some interesting bugs in
processors around the free list. There were some Alpha processors that
actually leaked free list entries.
What happened was, if you ran a certain piece of code for a long enough period of
time, all of a sudden the processor just ground to a halt, because it was not able to
allocate more physical registers; it ran out.
It ended up with fewer usable physical registers than architectural registers, and the machine
just stopped. This was a sort of well-known bug in
some of the early Alphas. I think this was actually in the first out-of-order one;
I think the 21264 had this problem.
They fixed it, and they pulled those chips off the
shelf. That's a really
bad thing to have happen in your processor.
How embarrassing. But as I said, if you run out, you're
really not going to be able to issue more. In this case we made sure we had
enough, so we're not actually going to see any
stalls. Let's look at how things get onto the
free list here, because that's a little bit interesting.
In our reorder buffer, I said we had extra fields.
If you recall, we had the previous physical register that this architectural register was mapped
to. So if we go look at this instruction,
which is the first instruction, the first mul, R1 was previously in P0.
So when that instruction commits, we put P0 onto the free list.
We're going to look in a second at why you can't do it earlier,
because it seems like you should be able to deallocate physical
registers earlier. No one's probably gonna be
reading that value, so why can't you just get rid of
it early? We'll look in a second at a test
case where that's a problem.
Let's see, any other fun insights here?
That's about it for what I wanted to get across from this diagram.
As the code continues on we end up with more and more free physical
registers. One thing I did want to do is
walk through this to understand it a little bit.
Let's say we have this instruction here, which is our one, two, three, fourth.
It's the last instruction that we execute. Let's go see what it's doing.
It writes architectural register four, so we store that in the reorder
buffer, because otherwise we wouldn't know where to go write at commit.
We had allocated P10 to it, and we did that right here, when we actually issued
it. So we pulled it off the free list.
And the previous physical register for R4 was P8, so when that ultimately commits,
P8 is gonna end up back in our free list. That's a nice little thing.
The circles just show when the values are no longer pending, so they are
not in the pipe anymore. You can see that continuing here: this
instruction here, which is the second multiply,
is going to free up P3 when it commits. So P3 ends up on the list.
P5 ends up on the list. P8 ends up on the list.
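The bookkeeping in this walkthrough can be replayed with a short Python sketch. The initial mapping (R1 through R7 in P0 through P6, with P7 through P10 free) and the destination of the third instruction (R6) are inferred from the physical register numbers mentioned above, and any source operand not named in the walkthrough is simply made up, so treat the program below as an assumption rather than the exact code on the slide.

```python
# Initial mapping assumed from the walkthrough: R1..R7 live in P0..P6.
rename_table = {f"R{i}": f"P{i - 1}" for i in range(1, 8)}
free_list = ["P7", "P8", "P9", "P10"]

program = [                       # (opcode, dest, sources)
    ("mul", "R1", ["R2", "R3"]),  # sources assumed
    ("mul", "R4", ["R1", "R5"]),
    ("add", "R6", ["R4", "R2"]),  # dest and second source assumed
    ("add", "R4", ["R7", "R2"]),  # sources assumed
]

rob = []
for op, dest, srcs in program:
    src_pregs = [rename_table[s] for s in srcs]  # read mappings first
    prev = rename_table[dest]                    # saved for deallocation
    new = free_list.pop(0)                       # allocate at rename
    rename_table[dest] = new
    rob.append({"op": op, "areg": dest, "preg": new, "prev": prev, "srcs": src_pregs})

for entry in rob:                                # in-order commit
    free_list.append(entry["prev"])              # previous mapping becomes free

for entry in rob:
    print(entry)
print("free list after all commits:", free_list)
```

Under these assumptions the four writes take P7, P8, P9, and P10, and the commits return P0, P3, P5, and P8 in that order.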
And then when we see true read-after-writes, for example right here, we need to
make sure to pick up the correct value. We do that by looking it up in the rename
table. Let's go find that in this chart here.
Let's see what instruction two is doing.
That's gonna be right here. It's waiting on P8 to become
ready in order to issue. So it's sitting in the instruction queue,
and it's gonna stall all the way out to here,
right there. And that's when it comes out of the
instruction queue. Okay.
So let's look at freeing up physical registers, and what is a good policy for
freeing up physical registers. We're gonna look at a different piece of
code here. It's gonna be just a bunch of adds.
This code has some
read-after-write dependencies in it,
namely through R1 there.
And let's say we try to go execute this. Here's some
execution order we're gonna look at. Oh, sorry, that one
there; I meant to point that out for the read-after-write.
There's also a write-after-read dependency here. So let's look at some execution order and
see what happens. Let's say we allocate physical register zero at the beginning
somewhere for register one, and then when that instruction commits, we free it
up in our free list.
Well, lo and behold, another instruction
comes along later in time and allocates physical register zero,
and it goes and writes to it, because we freed it up there.
This instruction here, for which we had earlier renamed R1 to physical register zero,
goes to do the read, looks in physical
register zero, and gets the wrong value.
Ooh. Yeah, we don't want that.
So what's a good policy here? Let's say instead we don't free up a
physical register until a
subsequent instruction goes to write the architectural register it is mapped to.
Until then, that physical
register is in use, or could be in use, by
other readers of that value. So if we look at this case here, let's say
we write physical register zero,
and then we allocate a different physical
register, right?
We allocate physical register two for this write here to
register eight. And then we deallocate physical register zero when we go to
overwrite register one. By doing that, we know that once this R1
gets written, no one after this instruction in program order can possibly use that physical register,
because we overwrote the architectural register, so the old value is
no longer visible. So that's pretty nice.
That's a very good way to get this
correct: you just keep the
physical register live until you rewrite
the architectural register that the physical
register maps to. At that point you can remove it from
the set of allocated physical registers and put it on the free list.
If you do it early, with an out-of-order execution pipeline, bad
things can happen. You can go read the wrong values.
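Here is a tiny Python sketch of that failure, comparing the broken policy (free a physical register as soon as its own writer commits) with the safe one (keep it until the architectural register is overwritten). The event ordering, the values 100 and 999, and the function name are all made up for illustration; only the use of P0, P2, R1, and R8 follows the example above.

```python
def run(free_at_own_commit):
    """Return the value the delayed reader of R1 sees under the given policy."""
    phys = {}
    free = [0, 2]                     # P0 and P2 are free to start with
    rename = {}

    # I0: writes R1. Rename allocates P0 and the add produces 100.
    rename["R1"] = free.pop(0)
    phys[rename["R1"]] = 100

    # I1: reads R1. It is renamed now (source is P0) but executes late.
    reader_src = rename["R1"]

    # I0 commits.
    if free_at_own_commit:
        free.insert(0, rename["R1"])  # BROKEN policy: P0 looks reusable

    # I2: writes R8 and runs ahead of the delayed reader.
    rename["R8"] = free.pop(0)
    phys[rename["R8"]] = 999

    # I1 finally reads its source operand.
    return phys[reader_src]


early = run(free_at_own_commit=True)    # reader sees the R8 value in P0
late = run(free_at_own_commit=False)    # R8 went to P2, so P0 is intact
```

With the early free, the write to R8 lands in P0 and the delayed reader of R1 picks up 999 instead of 100. With the safe policy, R8 gets P2 and the reader sees the right value.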
Okay, so this brings us to a couple of optimizations on register renaming.
The biggest one here is that you can combine the architectural register file
and the physical register file to save space. The insight here is that
you can store the architectural register value and
the physical register value in the same physical storage location
if that physical register is no longer
pending.
If there's nothing in flight to it, and on a rollback you're
just going to roll back to the same value anyway, why keep extra space for
it? One change you need to make here:
you're going to remove the architectural register file,
but you still need to know,
when you go to roll back some
speculative state, say you take an interrupt or a branch mispredict, where to roll back
to. We're gonna do that by, let's say,
having a second rename table here, which allows us to keep track of just the
architectural state. So we have a speculative rename table,
and then we have an architectural rename table at the end of the pipe.
It just has pointers in it instead of actual values.
What's also nice here is that we don't actually have to
copy values out of the physical
register file into the architectural
register file at commit. Instead we just have to update a pointer
in a table. And dropping the copy can potentially also
make rollback easier, because we update pointers now instead of
copying an entire register file, which can take a while or require lots of ports or
something else. So you can have a little table there
to do this remapping for you. And as I said, you can typically get away
with less space for the same performance than if you were to have two
separate structures. On the downside, depending on how you implement
this,
your architectural register file and your physical register file are now
together, so the combined structure may be bigger
and your register file access might be a little slower.
Something like that could be a downside versus having two
separate, partitioned structures.
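To sketch how the combined register file with two rename tables might behave, here is a minimal Python model. The class, its method names, and the way squashed registers are handed back on rollback are assumptions for illustration, not a description of any particular machine.

```python
class UnifiedRegFile:
    """One physical register file plus speculative and architectural maps."""

    def __init__(self, num_arch, num_phys):
        self.values = [0] * num_phys
        self.spec_map = list(range(num_arch))   # updated at rename
        self.arch_map = list(range(num_arch))   # updated at commit
        self.free = list(range(num_arch, num_phys))

    def rename_dest(self, areg):
        """Allocate a new physical register for areg; return (new, previous)."""
        prev, new = self.spec_map[areg], self.free.pop(0)
        self.spec_map[areg] = new
        return new, prev

    def commit(self, areg, preg, prev):
        self.arch_map[areg] = preg   # pointer update, no value copy
        self.free.append(prev)       # the previous mapping is now dead

    def rollback(self, squashed_pregs):
        self.spec_map = list(self.arch_map)   # restore the committed mappings
        self.free.extend(squashed_pregs)      # reclaim speculative allocations

    def read_committed(self, areg):
        return self.values[self.arch_map[areg]]


rf = UnifiedRegFile(num_arch=4, num_phys=6)
p_a, prev_a = rf.rename_dest(1)   # older instruction writes r1
rf.values[p_a] = 42
rf.commit(1, p_a, prev_a)
p_b, _ = rf.rename_dest(1)        # younger, speculative write to r1
rf.values[p_b] = 7
rf.rollback([p_b])                # e.g. a branch mispredict squashes it
```

After the rollback, the speculative table matches the architectural one again and the committed value of r1 is still 42, all without copying any register values.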