Okay. So now we get into the fun stuff: out-of-order processors. Up to this point we've been doing simple in-order stuff. We've been hinting at out of order, but now we're going to start looking at truly out-of-order machines. And we're gonna introduce a bunch of new structures that you should not have seen up to this point. You may have briefly talked about the scoreboard, but we still haven't talked about a bunch of other structures. Okay, so let's talk about the different portions of a pipeline that can be in order versus out of order. The front end. So what do I mean by the front end? The front end is instruction fetch, and probably decode. This is pretty hard to do out of order [laugh], because unless you can actually predict the future, it's pretty hard to know what to fetch next, so most machines are going to have the front end be in order. So IO here means in order, OOO means out of order. Issue. So this is the stage in the pipe that actually says: all the operands are ready, go start executing the instruction. Well, some machines do this in order, but you can probably get more performance if you try and do it out of order. So probably not today, but next lecture we're actually gonna talk about how to have the issue be out of order. Writeback. So this is writing to something that looks like a register file, but not necessarily having hit the commit point of the processor. So it means a value is no longer in flight in the bypass network; it's been written to a register file, though I won't call it the register file just yet. You could do that in order or out of order; there are different choices there. And then finally commit. So this is saying: yes, I actually want these instructions to commit, and after that point, you're not able to roll back the machine state to a previous state.
And we're gonna look at machines where that's both in order and out of order. This is sort of a gentle introduction to out-of-order execution. And we're gonna be building some different architectures. These names here are just a naming of five architectures we're gonna look at as an introduction to out of order. I4 is in order, in order, in order, in order. I2O2 is in order, in order, out of order, out of order. And so on for the others; the number just means how many stages in a row are the same as the letter before it. Okay, so I wanna briefly touch on some nomenclature over here. And you shouldn't know what these things are yet, unless you went and read Shen and Lipasti, which I highly recommend. Scoreboard: this is a structure where we keep information about which instructions are ready to execute. The reorder buffer sometimes gets merged into a lot of different meanings, but typically, when we execute instructions out of order, this is the place where we go and actually re-order them, to commit them in order. And we resolve all the dependencies so we don't actually commit out of order. Store buffer: this is something where we'll actually hold off storing to memory until the commit point, because we don't want to write to memory too early. Cuz that would be bad, cuz then that would effectively be committing. So this is one important thing to remember about commit points. What is the important notion of a commit here? Well, it's any state change that you can't roll back. So if you do a store to main memory, that's a commit. That's really hard to roll back, unless you can somehow keep all the state of what was in that memory location before the store. But people don't typically do that. Instead, you'll actually have a buffer which keeps track of all the stores you're trying to do.
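To make the store buffer concrete, here's a minimal sketch in Python. The class and method names are made up for illustration, and real hardware would use a small associative structure, but the behavior is the one described above: stores sit in the buffer, younger loads check it before memory, main memory is only written at the commit point, and a mis-speculation just drops the buffer since memory was never touched.

```python
class StoreBuffer:
    """Sketch of a store buffer: stores wait here until commit, so memory
    (which we can't roll back) is only written at the commit point."""

    def __init__(self):
        self.pending = []  # ordered (addr, value) pairs, oldest first

    def execute_store(self, addr, value):
        # A store "executes" by entering the buffer, not by touching memory.
        self.pending.append((addr, value))

    def forward_load(self, addr, memory):
        # Loads must see the newest in-flight store to the same address.
        for a, v in reversed(self.pending):
            if a == addr:
                return v
        return memory.get(addr, 0)

    def commit_oldest(self, memory):
        # Only at commit does the oldest store actually write memory.
        addr, value = self.pending.pop(0)
        memory[addr] = value

    def flush(self):
        # On a mis-speculation or exception, just drop the in-flight stores:
        # memory was never written, so there is nothing to undo.
        self.pending.clear()
```

Note how `flush` is trivially cheap; that asymmetry (cheap squash, deferred write) is exactly why the store waits for commit rather than writing memory early and trying to undo it.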
And every once in a while, when you actually do a commit of an instruction, at that point and only at that point do you store to main memory. And finally, we'll be talking about the issue queue. If you are issuing out of order, this is a structure which allows you to determine when it's safe to go and issue an instruction or not. And what's important about this is that it keeps all the dependencies, the read-after-write, write-after-write, and write-after-read dependencies, in check. That's what that structure is doing. So we're gonna look at some of these structures today. Okay, so here's a motivational code sequence. What is the motivation for out-of-order execution? Here's a simple code sequence, and let's take a look at some of the read-after-write dependencies for it. Here we have a write to register one, and a read of register one. So that is a dependency there. So instruction zero has instruction two as a dependent. We'll draw a nice arc there. Let's see what else happens here. Instruction two writes to register five, and that gets read in instruction three. So that's a dependency arc there. Okay, interestingly enough, instruction one isn't dependent on instruction zero, and instructions two and three are not dependent on it. So we can just put it in a separate little bucket here. Instruction four here, let's take a look: it writes register twelve and reads register eleven. So it's dependent on instruction one. Instruction five is dependent on the output of instruction four, register twelve, so it gets put after four, and likewise instruction six reads register twelve, so it is dependent too.
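Those arcs can be computed mechanically. Here's a short Python sketch that walks an instruction sequence in program order and pairs each register read with the most recent earlier write of that register. The example register numbers are made up to match the dependence shape just described (i0 writes r1 read by i2, i2 writes r5 read by i3, i1 writes r11 read by i4, and i4 writes r12 read by i5 and i6).

```python
def raw_dependences(instrs):
    """instrs: list of (dst_reg, [src_regs]) in program order.
    Returns read-after-write arcs (producer_index, consumer_index),
    pairing each read with the most recent earlier write of that register."""
    last_writer = {}   # register -> index of its most recent writer
    arcs = []
    for i, (dst, srcs) in enumerate(instrs):
        for s in srcs:
            if s in last_writer:
                arcs.append((last_writer[s], i))
        if dst is not None:
            last_writer[dst] = i   # this instruction is now the newest writer
    return arcs

# A made-up seven-instruction sequence with the dependence shape above.
program = [
    (1,  [2, 3]),    # i0: writes r1
    (11, [4, 6]),    # i1: writes r11, independent of i0
    (5,  [1, 7]),    # i2: reads r1 (from i0), writes r5
    (8,  [5, 9]),    # i3: reads r5 (from i2)
    (12, [11, 4]),   # i4: reads r11 (from i1), writes r12
    (13, [12, 2]),   # i5: reads r12 (from i4)
    (14, [12, 3]),   # i6: reads r12 (from i4)
]
```

Running `raw_dependences(program)` gives `[(0, 2), (2, 3), (1, 4), (4, 5), (4, 6)]`: two independent chains, i0 to i2 to i3 and i1 to i4 feeding both i5 and i6, which is exactly the parallelism an out-of-order machine is free to exploit.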
So you have to start to think: if we have an out-of-order processor, we can choose the order of this instruction sequence and that instruction sequence, and if we have a multi-issue or superscalar out-of-order processor, we can even try to think about how to execute these completely independently at the same time. And we can get performance from this, cuz we can find parallelism inside of a sequential execution stream. So one important thing to realize here is that today we're going to be talking about how to dynamically, in hardware, schedule this and this at the same time. And we're going to be using this as a motivating example in the VLIW, or very long instruction word, lectures, where we're going to be talking about some software techniques that can determine when to execute this and this at the same time, and architectures which can take advantage of the software having determined that these two things are completely independent. Okay, so an important thing to ask here is: why do these even happen? Why do we end up with non-dependent instruction sequences in a program? This is like a philosophical question. Why is this possible? So what do in-order instruction semantics, or sequential instruction semantics, force you to do to instructions? You need to come up with an order. I need to write the program on a sheet of paper in some linear order, or store it in memory on my computer system in some order. An important point here is that in the instruction set architecture of a sequential computer, by definition, the instructions need to be serialized. This is a limiter; this is a problem in sequential machines. You need to come up with some order. So there is no way to express that this and this can execute at the same time in a sequential machine. Unless you do something like this.
You need to have some order. The ordering is expressing the parallelism implicitly, but it's hard: the hardware has to go and figure out where the parallelism is now. You can think of an alternative instruction set, which is not sequential but instead is a graph of different dependencies, like this. And people have built those sorts of machines. They are not common; they are typically called dataflow machines. And if we have time for it, we'll have one lecture on them at the end of the term. So there are machines that allow you to express: this and this are not dependent on each other. And maybe sometime in the future they do become dependent, based on some branch or something like that. So you can think about that. But in a sequential processor, you need to come up with some ordering. This is a limiter for these machines. So effectively, let's say the compiler knows that this and this can execute at the same time; by definition, if it can generate this code, it knows those two things can execute at the same time. Unfortunately it has no way to express it. And that's a bummer. Very long instruction word machines, which we'll be talking about a little bit in the future, allow you some ability to say: I want to execute, let's say, this and this at the same time. But it's very limiting. Dataflow machines allow you to express much more complex graphs of instructions and say: this instruction is dependent on this instruction, and that's all I really need to say. So effectively, sequential machines have over-constrained our designs. Okay, so let's start evaluating our first machine on the road from in order to out of order. So I4: this is actually an in-order machine. But we're going to be analyzing it as a motivator, and then a slippery slope, if you will, into true out-of-order machines. So here we have fetch, decode, execute, memory, writeback. Same five-stage pipe.
I renamed things a little bit to make it easier here. This actually follows the nomenclature that's in your labs, where the execute stage has an X and the decode stage is not ID but instead D. But this should all look relatively similar. I took out all the extraneous bypassing and other stuff in this pipeline, so it's a sort of caricature of a real pipeline. Okay, so let's say we wanna do superscalars here. Let's say we want to start adding multiple pipelines to this processor. So we're going to take the same processor pipeline, but we're going to split off and have two pipelines here: a memory pipe and an execution pipe. This looks similar to our AMD superscalar, but for today let's focus on an in-order, in-order machine, where these two pipes cannot have different instructions in the same stage at the same time. So we're going to look at a single-issue processor here, if you will. But as a motivator, you'll see why we're doing this as we build up to more complex out-of-order things. Most of these things actually hold for the multi-issue variants, but your head will hurt a little bit less if you think about the in-order, single-issue variants first. Okay. As I said, things get a little more complicated. Let's take that same pipe that we had from before, and add in a four-stage multiplier. So now we have a single execute stage which actually has some work in it; this has an ALU inside of it. The memory, let's say, has two stages: this one does the address computation, and this one is the actual memory access. And here we have multiply, which takes a long time. That's pretty common; a multiply is a big, complicated beast. So there are four stages of multiply, and then out over here we have the writeback.
Hm, okay, so that gets us thinking, and we want to start to think about what these things look like on the inside. So the first question is: do we want to bypass this pipeline? And how much bypassing do we need? We have three pipes, or three functional units; we'll call them one, two, three. And we probably still need to bypass. Can we bypass out of right here, out of Y1? Does that make any sense? No: the multiply is not done till here, so it makes no sense to bypass out of there. Do we need a bypass out of here? Out here? Here? Okay, a lot of bypass locations. So you can see the bypassing exploding very quickly in this sort of pipeline. And that's pretty much to be expected. If you don't bypass, and you have to wait for everything to get back to here, the clocks per instruction of your machine goes up, and we sort of go back to that simpler machine we had seen in earlier classes, where you have to wait for instructions to go all the way to the end of the pipe. And that's a bummer. Okay. Yeah, that's what we just said. I do want to introduce this term: functional unit. A functional unit means one execution pipeline. So this is a multiply functional unit, this is a memory functional unit, this is the execute functional unit. Okay. So now we're going to start introducing some extra structures, which will make our lives easier when we start to have multiple pipelines and complicated things happening. The first thing I want to introduce here is the architectural register file, or ARF for short. The architectural register file is where we keep the canonical state of the machine that has been committed. Hence, it's called the architectural register file. If you see me say register file, it may or may not be an architectural register file.
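To put a rough number on that bypass explosion, here's a back-of-the-envelope count. The per-functional-unit stage counts are illustrative guesses for this three-pipe example, not from a particular design; the point is just that every pipe stage holding a completed result is another mux input on every operand read port.

```python
# Stages in each functional unit where a completed result could be picked off.
# These counts are illustrative, not taken from a specific machine.
bypassable_stages = {
    'alu': 3,  # ALU result exists from the end of X onward
    'mem': 2,  # load result exists after the memory-access stage
    'mul': 1,  # multiply result only exists after the last multiplier stage
}

sources = sum(bypassable_stages.values())  # candidate bypass points: 6
read_ports = 2                             # two source operands per instruction
mux_inputs_per_port = sources + 1          # plus reading the register file itself
total_mux_inputs = read_ports * mux_inputs_per_port
print(total_mux_inputs)                    # 14 mux inputs for one issue slot
```

And that is for a single issue slot; go multi-issue and both the source count and the number of read ports grow, which is why superscalar bypass networks get expensive fast.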
It may have in-flight values, but the architectural register file is the committed state of the machine. So we draw that at the end of the pipe here, cuz that's where writes happen to the architectural register file, as shown here by the W in the pipeline diagram. And we try to read it at the issue stage of the pipe, or the register fetch stage of the pipe. I did want to say that when we went to go do this, we added an extra stage here. So we went to a six-stage pipe, because we're assuming that we bypass everything and we wanna have a little bit of extra time. And we'll see for some of the more complex pipes, we'll start to put stuff in this decode stage. But for right now, decode does decode, and the issue stage does some instruction steering to pick a functional unit, and not a whole lot more. So that's our architectural register file. We read it here, we write it there. That makes sense. We may have to bypass some in-flight values around, but those are not in our architectural register file; those have not been committed, those are just in-flight speculative values. SB, so what's SB here? It's the scoreboard. So, what does the scoreboard do? We're gonna show some pictures of the scoreboard, but for right now, what our scoreboard is going to allow us to do is basically be a convenient side structure where we keep track of where different values are in flight in these different pipelines. All of this data, if you go back to our earlier pipelines, was there: we had the instruction registers that went down the pipe in the control, which kept track of what was in each stage. The scoreboard is just a convenient place to put all that information and centralize it. So when you go to build a pipeline like this, you don't want to have to go pick bits off of random locations in the pipe.
It's easier just to centralize that data in one location in the pipe: in the scoreboard. And when we start to go out of order, we're actually gonna store some information in the scoreboard which gets very hard to pick out of other locations. The scoreboard: when do we read and write it? Well, we read it to figure out where the bypass information is coming from. So in this design, we're not going to use the calculation that we used before; instead, we're going to read the scoreboard to figure out where the bypass information is coming from. And we're going to write it when we actually issue the instruction: when we issue the instruction, we have to update some information in there. And then, as the instruction goes down to the end of the pipe, if the instruction actually commits, we also need to do something in our scoreboard. If the instruction doesn't commit, or takes an exception, we will also have to do something at the end of the pipeline to clean up the scoreboard. So let's look at the scoreboard. Here's our basic scoreboard. This is for the long-multiply pipe, the pipe we just saw. And we're going to keep track of this per real register. R0 in MIPS is hardwired to zero, a register which has no dependencies going through it, so we're not even going to track that. P: we have one bit in here which says if there's a write pending to that register. F, for functional unit, keeps track of which of the three functional units we should go pick off the value from; that's important when we go do the bypass calculation. And then we are going to have a shift register for each register, if you will, where we're going to inject a bit, and every cycle we're going to clock it forward. And this is gonna tell us where to pick off the values.
So the functional unit field tells us which pipeline to look at, out of the three, and the data-available field tells us, in the other dimension, what stage in the pipe to go pick off from. So if you look at which functional unit it is and you cross it with the data-available information, you can figure out where to go do a bypass from. Okay, so every cycle, logically, you can think of these bits shifting to the right. So if there's a one here, that means the data for register one is going to be available in four cycles, somewhere. Probably in the architectural register file, if you implement this relatively simply. So, you use the functional unit and the data-available field to figure out when to bypass and where to bypass from. And if the pending bit is zero, that means: go look in the architectural register file; do not go pick it off the bypass network. Okay, now we get into a fun example. Back to our motivating example that we saw at the beginning of class. So here we have that same instruction sequence we saw at the beginning of class. As we know, you can try to execute some of those things either at the same time or out of order. But for right now we have an in-order, in-order, in-order machine. So let's look at what the scoreboard has to say about this. On the bottom of this chart, we actually show the scoreboard, and red denotes that the value is ready. So let's take a look at something here. In cycle three, we start executing instruction zero. Instruction zero is a multiply. But first let me explain the chart: here are the cycles; D means what instruction number is in the decode stage of the pipe; I is what's in the issue stage of the pipe.
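Here's a small Python sketch of that scoreboard: per register, a pending bit, a functional-unit field, and a shift register of data-available bits that marches one stage to the right each cycle. The stage count, functional-unit names, and the `ready_stage` threshold (which in a real design comes from combinational logic encoding each functional unit's latency) are all illustrative.

```python
NUM_STAGES = 5  # each in-flight bit slides one stage right per cycle (illustrative)

class Scoreboard:
    def __init__(self, num_regs=32):
        self.pending = [False] * num_regs               # write in flight?
        self.fu = [None] * num_regs                     # 'alu', 'mem', or 'mul'
        self.avail = [[0] * NUM_STAGES for _ in range(num_regs)]

    def issue(self, dst, fu):
        # Record that a write to dst is now in flight in functional unit fu.
        # Newest write wins: an older in-flight write is simply overwritten.
        self.pending[dst] = True
        self.fu[dst] = fu
        self.avail[dst] = [1] + [0] * (NUM_STAGES - 1)

    def tick(self):
        # Every cycle the bits shift right; a bit falling off the end means
        # the value has been written to the architectural register file.
        for r in range(len(self.avail)):
            bits = self.avail[r]
            self.avail[r] = [0] + bits[:-1]
            if bits[-1]:
                self.pending[r] = False
                self.fu[r] = None

    def where_to_read(self, src, ready_stage):
        # pending clear: read the architectural register file.
        # Otherwise bypass from (fu, stage) once the value is far enough
        # along; ready_stage comes from the producing FU's latency logic.
        if not self.pending[src]:
            return ('arf', None)
        stage = self.avail[src].index(1)
        if stage >= ready_stage:
            return ('bypass', (self.fu[src], stage))
        return ('stall', None)
```

So a consumer of R1 stalls in issue while the multiply's bit hasn't marched far enough, then bypasses from the multiplier once it has, and once the value has written back it just reads the ARF.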
So on cycle zero, instruction zero, the multiply, enters the decode stage, then goes to the issue stage, and then it moves to the execute units. So it actually gets put into our scoreboard, into the entry for register one, and it just marches down the pipe every cycle. So every cycle, it's going to move to the right. And now if you look at the functional unit field and you know it's a multiply, and you look at where it is in the pipe, you can say: oh, this is where the value actually becomes ready. So we now know exactly which functional unit, and when, that value is ready, by looking at the functional unit and the bits in the scoreboard. If we look at the next instruction here, it's an add, instruction one. It moves here and then it just goes down the pipe, and it's ready basically the whole time. We don't have to wait for this value to become ready, cuz you can basically bypass out of the execute unit, and it's almost always ready. That's why it's red the whole way down. And you can sort of see here that this is when the register one scoreboard entry is valid, and this is when the register eleven scoreboard entry is valid, in time. Okay, so let's look at our first real read-after-write dependence. So here we have a multiply which is dependent on R1. As we can see, that instruction is going to sit in the issue stage until Y3 happens, and then the value gets bypassed down here into the issue stage, and then that instruction can be issued. If we go look at the scoreboard, that corresponds to this becoming red, and then we can go and actually issue the instruction to here, and we can basically bypass that value. And register five, which is the destination of instruction two's multiply, is marked valid.
So this is a relatively simple processor, but what's nice here is you can use your scoreboard to keep track of when things become ready, and which pipeline to go look in, without having to go pick at intermediate bits in the pipeline. And it's going to become much more important as we go to out-of-order machines. Okay, before we move on: does everyone understand how to look at one of these diagrams? Because you are going to have to draw these later in the course. So this is a great question. Let's look at this picture here. We only have one entry per register in the scoreboard here; you're seeing we have multiple bits going down here, but only one of these entries per register. So what happens if you have, let's say, instructions issued to two different functional units, with different latencies, to the same destination register? That's a very important question. What's gonna happen is you're going to fill in the entry with the functional unit and location of the newest instruction that gets issued. Because from a bypassing perspective, that's all you need to know. You don't need to know, from a bypassing perspective, about something that was older in program order; we only need to bypass the newest value. One thing we do need to do is make sure that we don't have a write-after-write hazard. What that means is an older instruction's write landing in the register file after a younger instruction's write. That doesn't happen in this pipeline, because we pipe everything forward to the end; we don't actually have write-after-write hazards in the writeback stage. The next pipeline we're going to look at does have a write-after-write hazard, and that's going to require us to think a little bit harder about the scoreboard. Exactly, yep.
So that's sort of the next step: we'll have a much more interesting scoreboard. Actually, not the next pipe, but the pipe after that is gonna have the more interesting scoreboard. One interesting thing to note is that, strictly for what we've talked about today, you don't even need these bits. You don't need to track two different writes to the same register, because you only need to know the most recent write to the register. So one way that I've seen people build scoreboards is, instead of having bits marching down here, you just have an encoding, like a log-base-two or binary encoding, of where to go look in the pipe for the newest value, and then you just decrement it every cycle or something like that. So instead of having shift registers, you have that. And that's strictly enough for this. For the more complex pipes we are going to look at, as Pramod said, we're gonna have to track both the location, meaning the functional unit, and what stage in the pipe it's in. So, you're asking: how do I know that this is not ready until cycle four, or something like that? Okay, so that is a subtle point: if you know the functional unit latencies, you just look them up in a table. You don't need to actually track that information in the scoreboard; there's just some piece of logic which says, for a multiply, if the functional unit is multiply, then we can't bypass till here; if it's an add, we can bypass from here or there, or something like that. So we don't need to track that in a table per register; it can just be part of the combinational logic in your decode unit, or your issue logic. Yeah, I'll say it one more time. This is the example here: we have a multiply that writes R1, and something else which is supposed to read R1, but we end up stalling here.
In this table, the way we know that is we know the functional unit, and because it says multiply, we know we have to wait until it gets to here, or rather, probably here, instead of trying to pick it off earlier. If it said ALU, then we could pick it off earlier. So we don't necessarily have to encode that per register; that's why, in this picture here, things turn red on some rows earlier than others: it's the functional unit information which encodes that. And you can even think about having those functional unit bits encode some number, like you said; some people design scoreboards like that. Okay, so I just want to briefly introduce this, then we are going to break. Next class, we are going to be talking about pipelines with an in-order front end and issue, and out-of-order writeback and commit. It looks pretty similar, except some pipeline stages are taken out here. And this makes you think harder about when to go read the scoreboard. So we're actually going to use this decode stage to do some stuff with the scoreboard, and then we're going to start talking about truly out-of-order machines, where you have those other fancier structures like reorder buffers and store buffers. Okay, let's stop here for today and pick up next time.
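To recap the alternative scoreboard encoding mentioned above: since only the newest write to a register matters for bypassing, the per-register shift register can be replaced by just the producing functional unit plus a cycle counter, with a latency table in the issue logic deciding when the result becomes bypassable. Here's a sketch; the latencies and the writeback cycle are made-up numbers for illustration.

```python
# Illustrative latencies: cycles after issue until each FU's result can be
# bypassed, and the cycle at which results land in the architectural
# register file. Both are made-up numbers for this sketch.
FU_LATENCY = {'alu': 1, 'mem': 2, 'mul': 4}
WRITEBACK = 5

class CountdownScoreboard:
    def __init__(self, num_regs=32):
        self.age = [None] * num_regs   # cycles since issue; None = value in ARF
        self.fu = [None] * num_regs

    def issue(self, dst, fu):
        # Newest write wins: an older in-flight write is simply overwritten.
        self.age[dst] = 0
        self.fu[dst] = fu

    def tick(self):
        for r in range(len(self.fu)):
            if self.age[r] is not None:
                self.age[r] += 1
                if self.age[r] >= WRITEBACK:   # value has written the ARF
                    self.age[r] = None
                    self.fu[r] = None

    def source_for(self, src):
        # A combinational latency check replaces the shift-register bits.
        a = self.age[src]
        if a is None:
            return ('arf', None)
        need = FU_LATENCY[self.fu[src]]
        if a >= need:
            return ('bypass', self.fu[src])
        return ('stall', need - a)
```

The per-register state here is just a small counter and an FU tag, which is the compactness win over the shift-register version; the trade-off is that the latency knowledge moves into the issue logic.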