Okay, so now we get to move on to even more complicated processors: in-order front end, in-order issue, out-of-order write-back, and in-order commit. This is going to solve some problems that we have. The biggest problem it solves is the problem of precise exceptions. We can now have exceptions all the way at the end, because we're committing data in order. You should probably draw a line there, the commit point, on your own drawings. So let's take a look at some other structures we've added to this diagram to make life a little more interesting. The front end looks pretty much the same. We split the load and the store apart into two separate pipes here: a load pipe and a store pipe. The store pipe is shorter because it basically just has to do the store. Maybe it's two stages; it doesn't matter that much, it's not material in this drawing. But something interesting to look at is that we added a bunch of extra boxes over on the right side of this foil. So let's define these things. We had our architectural register file, which is the committed state of the processor. And we added a second register file, typically called a physical register file, or PRF. Sometimes people call this a future file, and you'll see in the literature there are papers published about future files. The reason it's called a future file is that it's basically executing speculatively in the future: the values in here have not been committed to the processor. They can be thrown out if you take an exception, if a branch mispredicts, for a variety of reasons. These values are speculative; you're not guaranteed to actually keep them. The architectural register file, though, is committed state. Okay, we added two other structures here: something we call the ROB, or Re-Order Buffer,
and a finished store buffer. So let's talk about the reorder buffer first. In this pipeline, we actually want instructions to execute and write the physical register file out of order; this is an out-of-order processor, and we want that to happen. We're making the execution and the write-back out of order, but we want the commit to be in order. So we need some structure that guarantees that the write-back to the architectural register file happens in order, and that's what the ROB is going to do. It keeps completed instructions, which can come in out of order but are going to leave in order. Things come into this structure out of order, and they go out of it in order: it's a reordering structure. It's typically a table that gets written in different places in the pipe, for a couple of different reasons we'll talk about in a second, but you typically want to keep track of the instructions in order somehow. Then when you go to pull out of the reorder buffer, you pull in order, but the writes and the tracking of the information can happen out of order. The other thing here is the finished store buffer. The reason we have the finished store buffer is that if we have a store operation, we don't want to have the commit point so early in the pipe. Because once you store to main memory, it's really hard to get that back, possibly even impossible, probably impossible. If you had the old value and you overwrite it with the new value, the old value is forever gone from your main memory; you can't get it back. So the solution is, instead of doing the store here, you have the store happen later in the pipe, and you remember what you're supposed to do: the address and the data of the store that's supposed to happen.
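The reorder buffer behavior just described, allocated in program order, completed out of order, retired strictly in order from the oldest entry, can be sketched as a circular queue. This is a minimal behavioral model under assumed simplifications (one instruction per call, no speculation handling yet); all names here are illustrative, not from any real design.

```python
class ReorderBuffer:
    """Minimal ROB sketch: a circular buffer with head and tail pointers
    chasing each other. States: 'F' free, 'P' pending, 'D' done/finished."""

    def __init__(self, size):
        self.state = ['F'] * size
        self.head = 0      # oldest in-flight instruction (next to commit)
        self.tail = 0      # next entry to allocate at issue
        self.count = 0

    def allocate(self):
        """Called at issue: reserve the next entry in program order."""
        assert self.count < len(self.state), "ROB full -> stall issue"
        idx = self.tail
        self.state[idx] = 'P'
        self.tail = (self.tail + 1) % len(self.state)
        self.count += 1
        return idx         # this tag travels down the pipe with the instruction

    def finish(self, idx):
        """Called at write-back: may happen out of order."""
        self.state[idx] = 'D'

    def commit(self):
        """Called each cycle: retire only if the oldest entry is finished."""
        if self.count and self.state[self.head] == 'D':
            idx = self.head
            self.state[idx] = 'F'
            self.head = (self.head + 1) % len(self.state)
            self.count -= 1
            return idx
        return None        # oldest still pending -> nothing commits this cycle
```

Note how a younger instruction finishing first does not let it commit first; the head entry gates everything, which is exactly the in-order-commit guarantee.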
And, for anybody who cares whether that store has happened, it hits this finished store buffer. So you probably need to have your loads check that finished store buffer with higher priority than your cache, because there could be a store living at that location. Okay, so those are the structures here. Let's talk about where things get read and written. This one is really interesting: our architectural register file isn't read anywhere. What's up with that? Well, we're going to use the physical register file for all the intermediate values in our pipeline, and the architectural register file is only there if we take, let's say, a branch mispredict or an interrupt. That's the only time we actually need this information, and we're probably going to dump it into the physical register file, the future file, when an interrupt happens or when a branch mispredict happens. But otherwise, it doesn't have to be read. The scoreboard is the same as usual: read and written in your register-fetch stage, written at the write-back stage. But it's no longer tracking architectural registers; it's now tracking physical register file registers. The reorder buffer gets read and written in a whole bunch of different places. Primarily, when the instruction is issued, so when it goes from the decode stage to the issue stage, that allocates a location in the reorder buffer for that instruction's entry. Then at the end of the pipe, once the value completes, we have to change some state information in the reorder buffer, saying, oh, the output register for that particular instruction is now ready. And then, once we actually go to do the commit, we have to clean that instruction out of the reorder buffer. The finished store buffer is written just at the end of the pipe here, and cleaned when actually posting to memory.
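The load-priority rule above, check the finished store buffer before the cache, can be sketched as a small behavioral model. This is a simplification, assuming whole-word matches (no partial-width merging) and a dictionary standing in for the memory hierarchy; the names are illustrative.

```python
def load(addr, finished_store_buffer, cache):
    """finished_store_buffer: dict addr -> data for stores that have
    committed but not yet been posted to memory.
    cache: dict addr -> data, modeling the rest of the memory hierarchy.
    The store buffer wins because it holds the newest value for addr."""
    if addr in finished_store_buffer:   # bypass: pending store takes priority
        return finished_store_buffer[addr]
    return cache[addr]                  # otherwise the cache copy is current
```

If the check were done in the other order, a load could return a stale cache value while the newest value sat, committed but unposted, in the store buffer.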
It's a little hard to draw this, but you need that information. From the memory system, if you have a load that reads from that address, it will probably read it either at L0 or L1, in a sort of bypassing mode if you will: it'll go check that structure. We'll talk more about that next class. Okay, so here is a basic reorder buffer. If you go looking in some books, they have a lot more data stored in the reorder buffer; this is kind of the minimal reorder buffer you need for an out-of-order pipe. This reorder buffer is used to keep track of in-order committing of instructions, but things will be put into it out of order. Let's first talk about the information here. We keep track of state. What do we mean by state? This is the state of an instruction. Each one of these entries here is a different in-flight instruction in the pipeline. We're going to store into the reorder buffer in order, and we're going to keep it as a queue. In this picture, for the state field, '-' means free, P means pending, and F means finished. I probably should not have chosen two F words there, free and finished; that's a little confusing. The newest instruction, if we have a new instruction issue, is going to end up here in this entry. And when an instruction commits, or retires, that removes this entry, the bottom entry. So we basically have a circular buffer, with a head and a tail pointer chasing each other around this data structure. So: tail, head. What's interesting about this, and why this is cool, is the following. Let's take a look at this instruction right here. This instruction has an F, which means it's finished: it's not pending in the pipe, it's hit the reorder buffer, and the data is stored in the physical register file. But instructions that are older than it, these two instructions, are still pending in the pipe. Let's say these two are multiplies and this is an add.
So this add is basically already done, while these two long-latency instructions are still pending in the pipe. In this cycle, we cannot commit anything. We only commit instructions when the oldest instruction becomes finished; that's when we can commit and remove something from the reorder buffer. Some other things we need to keep track of here: we have a bit here, S, for speculative. What this means is, if you have something like a branch, you mark instructions that are newer than the branch with a speculative bit. And what that buys you is, if that branch mispredicts, it gives you a convenient place to go find all the instructions dependent on it, to flush and kill them. So if you have, let's say, one branch allowed in the pipe at a time and the branch mispredicts, what you can do is look for all the entries in here whose S bit is one, invalidate them all at once, and flush the entire pipe. You don't have to worry about there being some value you need to keep. It's just a convenient way to figure out which instructions are speculative and, if the branch mispredicted, what you have to kill. Stores: we'll be talking about these in a few more slides, but the store bit says that if this instruction is a store, the pipeline needs to do something else with it, namely something with the finished store buffer when it gets to the end of the pipe. And here is the actual business part of the reorder buffer: V, which says that the instruction actually writes a register. And finally, once the instruction goes to the end of the pipe, it fills in a location in here, which is the physical register file entry that is the destination of that value. This basically allows the pipeline to know where to go find the actual value. We don't store the actual values in here.
We just store a pointer into the physical register file, because it's fewer bits. And this can tell us: when this instruction here, which is already finished, is ready to retire, ready to commit, go look in physical register file entry number seven, or something like that, and pull the value out from there. A good discussion of this is in the Shen and Lipasti book that is the supplementary text for this class. Okay. Does anyone have questions before we move on? Because the reorder buffer is a key data structure here, and it's a complicated one. Okay, great. The next structure we added was the finished store buffer. This could actually be multiple entries, but for this pipe let's say there's only one. So we're only allowed to have one store pending in this pipeline, because it makes life a little easier. The things you need to have here are the address and the data, whether the entry is valid, and probably the opcode; the opcode will tell you if it's a store byte, a store word, those sorts of data-width things. And that's most of what I wanted to say here. This is what I was saying before: if you allow multiple loads and stores in the pipe at the same time, you're going to have to bypass from the finished store buffer to the loads, and possibly to stores, if they have to be write-combined. So, if the user stored to different parts of a word, you may have to bypass that, depending on how the pipe works. Or you can assume that there is only one memory instruction valid in the pipe at a time: you'll have one of these entries, and no loads can happen while a store is outstanding. That's not very good performance; people probably would not actually have built that, but it's something to think about. Okay. So now we get some more pipeline diagrams.
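Putting the fields just discussed together, one ROB entry can be sketched as a small record: a state field, the speculative bit S, the store bit, the V bit, and the PRF tag the pipeline follows at commit to find the value. This is a minimal sketch under the slide's assumptions; the field names and the sample PRF index are illustrative, not from any real machine.

```python
from dataclasses import dataclass

@dataclass
class ROBEntry:
    state: str = '-'    # '-' free, 'P' pending, 'F' finished
    S: bool = False     # speculative: younger than an unresolved branch
    ST: bool = False    # store: needs the finished store buffer at commit
    V: bool = False     # instruction actually writes a register
    prf_tag: int = -1   # which PRF entry holds the destination value

def commit_value(entry, prf):
    """At commit, the ROB holds only a pointer (fewer bits than the value);
    the value itself is fetched from the physical register file via the tag."""
    assert entry.state == 'F', "only finished instructions may commit"
    return prf[entry.prf_tag] if entry.V else None
```

So an entry that says "finished, writes a register, tag 7" makes commit go read PRF entry 7, exactly the indirection described above.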
And we're going to see how this is different, and what happens in the reorder buffer. The first thing I wanted to say is about this little r that you see show up in these diagrams. That means we've written the reorder buffer, but we're not ready to commit. So from here to there, this add has written the reorder buffer and is waiting to commit at the end of the pipe. But we can only commit in order, so you can see these Cs are all lined up in time: we're only able to commit from left to right, and we can't reorder those Cs relative to one another. The dependencies are the same; it's the same code we've looked at before. Which one is that? Yes, it's this one. Okay, so here we have this add that writes register twelve, right there. This add comes in and reads register twelve, so we have a read-after-write happening. What's interesting is where that read-after-write gets its data. The write happens there; the read happens, let's say, here. That data isn't in a bypass anywhere; it's not in the forwarding logic of the processor. That value is actually in the physical register file. So this is showing an example that data, when you're doing the bypass, can come from bypass-network locations or from the physical register file; those are the two places it can come from. Everything else in here, surprisingly, is basically coming from bypass, except for that one location. So bypasses end up being really important, but you can have data coming from the physical register file. Could this C be here, could this C move over one? No: commits come in order, and we can only commit one thing at a time in this basic pipe. In more complex pipes, we're going to allow multiple commits at the same time.
When we start to mix superscalar with out-of-order, at the end of today's talk, we're going to be able to think about committing multiple things at the same time. But we can't commit out of order, so the commit point has to move monotonically that way. A brief example here, which is kind of fun: this is trying to show different entries in the reorder buffer and when those entries get allocated. Largely what's going to happen is, for a destination, let's say instruction zero here allocates in the reorder buffer and R1 becomes active. And it's a long-latency multiply, so it doesn't finish for a while; the circles here mean that the instruction is finished, it's gone to the end of the pipe and it's ready to go. You could have other things, like this add that writes register eleven: it allocates, it finishes early, but it doesn't commit until late. So it has to stay in the reorder buffer, taking up space. And you can see other examples where these adds here finish relatively quickly, but they have to wait to commit in order; they're dependent on this instruction here committing before they can commit. So it's a nice little structure that can track all those things. Okay, let's look at commit points and what happens if exceptions occur. We have the same example we had before. The multiply here is going along, and it writes back to the physical register file. Now you'll say, whoa, it wrote the register file; how can it take an exception at this point? If it takes an exception, it wasn't supposed to write the register file. But we have two register files: it writes the speculative-state register file, the future file, the physical register file. And this slash here means we don't actually commit that instruction; the commit doesn't happen. Now we get to go look at the other in-flight instructions to see what's going on here.
Can these other in-flight instructions potentially write information out of order, past where the current commit point would be? Well, here's this add that, in the previous example, wrote to the register file. Now it writes to the physical register file, but does not write the architectural register file. Instead, it enters the reorder buffer here, denoted by the little r, and just sits there until it gets the chance to commit in order. But it never gets that chance, because a previous instruction kills it: that instruction takes an exception and kills everything. And then you can start some new instruction here; let's say that's the exception handler, and you fetch that out here. One interesting thing about this example that I want to point out: in this transition, a lot of state has to change in the machine. You've taken an exception. The architectural register file is correct; the physical register file potentially has many incorrect values in it. So on this transition, what's really going to happen is you're going to copy all of the state of the architectural register file, all the registers, over on top of the physical register file. You basically roll back all of your speculative state in the machine in one fell swoop. Obviously that can be a little expensive, but you don't take exceptions that often. You do take branches relatively often; we'll talk about that in a second. But that's logically what's happening. Sometimes people will actually co-mingle the architectural register file and the physical register file, and just keep pointers to different pieces of information. Then you don't actually have to roll back the data; you just change the pointers.
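The one-fell-swoop rollback just described can be sketched in a couple of lines, assuming the simple two-separate-files model (real designs often remap pointers instead, as noted above). The register files are modeled as plain lists; this is illustrative only.

```python
def rollback(arf, prf):
    """On an exception or branch mispredict: copy every committed
    architectural register over its speculative physical copy, discarding
    all speculative values in one step. arf and prf are same-length lists."""
    for i in range(len(arf)):
        prf[i] = arf[i]   # committed state overwrites speculative state
```

The cost is proportional to the number of registers, which is why this is fine for rare exceptions but motivates the cheaper pointer-remapping trick for frequent events like branches.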
But for right now, let's model it as two completely separate register files, where you copy all the state from the architectural register file to the physical register file on some form of rollback, on an exception or a branch. Branches. So how do we make the branch latency better? What do we do about a branch, first of all? Ignore these bottom examples here; this is a different code sequence than we have looked at. It's not the multiply, add, multiply, add code sequence; instead there's a branch. So we have a branch, and the branch commits; we know the branch is good. But these instructions here are the fall-through case for the branch, and this instruction here is the target of the branch. So we need to squash all those fall-through instructions in the reorder buffer. Conveniently, we have a bit in the reorder buffer marking all the things that were dependent on the branch: if the branch is misspeculated, we just remove them from the reorder buffer, throw those entries out, invalidate them. What gets a little interesting here is: when do we start to execute the target? Well, let's say we compute the branch outcome here in the execute stage, and we can redirect the fetch stage. That's okay, but the squash is a little bit odd, because what it really says, from a pipeline perspective, is that you have to invalidate multiple entries in the reorder buffer in one cycle. And this, to some extent, is a structural hazard on the reorder buffer. You might need many, many ports into that reorder buffer, or you need to at least keep the valid bits in some other extremely highly ported structure. You could think about doing something even more interesting, where you kill instructions early.
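The bulk squash just described, walk the ROB and invalidate everything marked speculative, can be sketched as below. This is a behavioral model assuming one branch in flight; it ignores the timing problem the text raises (doing all these invalidations in one cycle is the structural hazard). Entries are plain dictionaries with the S and state fields from earlier; the names are illustrative.

```python
def squash_speculative(rob_entries):
    """On a branch mispredict: free every ROB entry whose speculative
    bit S is set, i.e. every instruction younger than the branch.
    Returns how many entries were killed."""
    killed = 0
    for e in rob_entries:
        if e['S']:                 # marked as dependent on the branch
            e['state'] = '-'       # free the entry; value never commits
            e['S'] = False
            killed += 1
    return killed
```

Because the S bit already tags exactly the wrong-path instructions, no dependence analysis is needed at squash time; that is the convenience the speculative bit buys.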
So the difference between this picture and this picture is that once we compute the branch and figure out that it's taken, we just instantaneously squash all these instructions, and we write to the reorder buffer, killing all the speculative instructions. Now, if you note, this doesn't actually help performance in this case. Where it can help performance is if you have an out-of-order processor that's also a superscalar processor: you could try to put other instructions in these locations in the pipe, or restart earlier, or have other things go on in the pipe, and you're just using fewer resources in the pipe. So this is going to be the highest-performance case, and this one is going to be medium performance. For the low-performance case, there's a way you don't have to add extra ports to your reorder buffer at all: you let the in-flight instructions that are dead continue going down the pipe until they get to the commit stage, and only then do you clean them out of the pipe and out of the reorder buffer. So you're waiting for these dead instructions to reach the commit stage and squashing them there. In this example the performance of all three of these is the same, but this will be the lowest performance if you have a more complicated code sequence, because you're using up a lot of pipeline resources: you're using entries in the reorder buffer and locations in the pipes that you could try to reuse for something else. Okay. So as we said, we have these three different cases; each one in turn is more complex, but buys you more performance. One thing that definitely comes up, and is probably going to make this multi-ported issue come up, is if you have multiple branches in the pipe at the same time.
Then the simple case of just moving the top pointer is not really going to work, because you might mispredict one of the branches but not the other, and that's going to mess you up a little bit. Okay. So let's keep moving on here: avoiding stalls due to store misses. So you've got a store in the pipe. It takes a cache miss, and now it's clogging up the commit point of the processor. Because, depending on how you want to look at this, maybe you don't want to commit until that store has actually reached main memory, since that's where you'd call commit for that store: when you're able to pull it out of the finished store buffer and write it to main memory. Or you write it to your cache, it misses, and it takes a couple of extra cycles. So we'll see something like this: here's a store word, and let's say it takes a few extra cycles here, three extra cycles stalling, to actually go write the level-two cache, we'll say, or pull the data in from the level-two cache into the L1 cache and merge there. So there's a way to solve that. What's bad about this is that, because we're doing in-order commit, it pushes out the rest of these instructions later, and that's kind of bad. So what you can think about doing is adding an extra stage in the pipe and just allowing the store to miss: once it's past the commit stage, this store has committed. You mark it down and say, well, it's committed, I don't have to worry about it anymore. And you basically decouple the end of the pipe from the store actually happening to memory until later. All you do is commit in order; you can pull back these other instructions earlier. This looks like a typo, this should probably be back one. And then you can commit in order, and have that store still outstanding out to main memory.
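The decoupling just described can be sketched with a single-entry finished store buffer, as assumed earlier: the store retires from the ROB immediately, and the actual memory write drains later, whenever the miss resolves. This is a behavioral sketch; the class and method names are illustrative.

```python
class FinishedStoreBuffer:
    """One-entry finished store buffer: holds a store that has committed
    architecturally but has not yet been posted to memory."""

    def __init__(self):
        self.valid = False
        self.addr = None
        self.data = None

    def commit_store(self, addr, data):
        """Called at commit: the store retires even though memory isn't
        ready, so younger instructions behind it can commit without stalling."""
        assert not self.valid, "only one outstanding store in this model"
        self.valid, self.addr, self.data = True, addr, data

    def drain(self, memory):
        """Called later, when the cache miss resolves: post the store."""
        if self.valid:
            memory[self.addr] = self.data
            self.valid = False
```

Between commit_store and drain, the store is architecturally committed but physically pending, which is exactly why loads must check this structure, as discussed earlier.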
One important thing you need to do here, as I've said before: if you let another load into the pipe, or another store, you're going to have to bypass out of this data structure back to the load stage or the store stage of the pipe, and that adds extra wires to your processor. But we've basically decoupled store committal from the store reaching memory. It's technically committed once it gets past this point, even though it's not in main memory yet. To everyone else, and to the processor, it looks like it's been committed, because if you try to go read the value, it looks committed.