Let's look at a slightly more complicated case. We talked about control hazards due to jumps; now let's put the control hazard one cycle later. A good example of that is a conditional branch. So here we have a piece of code: an add, then a branch, and it's branching 200 bytes into the future. Branches on MIPS are relative to the subsequent instruction, so PC+4 plus the offset, which means we'll end up at address 304. Let's walk through this case. I1 is our first add, and we have the branch sitting at the decode stage, with some decode logic saying, is this a branch? The next question, though, is how we compute whether this branch is being taken or not. Can we do this in the decode stage? Unfortunately we need to do a comparison here, and the hardware we want to use to do a comparison is in the ALU. It's a subtract operation, or a comparison operation, so it fits well within our arithmetic logic unit, and we have the zero wire coming out. So what this is going to do is, instead of having one cycle of latency, or one cycle of kills being inserted, we might have to insert two cycles, because now we have to wait for the branch to get to the execute stage. And if we're predicting PC+4, speculating PC+4, we'll actually have fetched more data. So let's walk through this. The branch moves one cycle forward here, and what you'll see is we're actually fetching 108, the next-next instruction. But these two instructions are both potentially dead; we want to kill both of them. Yet we don't know whether to kill them until we get this zero wire. Okay, so at that point we can redirect the front of the pipe and insert a NOP, which changes the logic that connects here. We also have to insert a NOP here, because we fetched 104 at that time into this point of the pipe.
So, if a branch is taken, we need to kill the next two instructions, and the instruction at the decode stage is effectively invalid, so we need to swing this mux here. We're going to have to add an extra control line onto this multiplexer. And now stalling versus killing makes a much bigger difference: the stall signal has to effectively be killed, or nullified, at this point. Why is this important? Well, let's say we're stalling the front of the pipe and stall were to take priority. That would stop this register from moving and stop this register from moving as these red lines come in. So if the stall were to take priority, you could have an instruction here which is, let's say, dependent on some stall condition, some data hazard, and at the same time you have a branch which is branching to someplace else. So take this instruction here, 104, the add: it depends on something and is stalling the front of the pipe. At the same time, because you took the branch, you want to kill that instruction. If you were to give the stall the higher priority, you'd actually end up in a deadlock: you'd be stalling, but you would have no way of killing the previous instructions to clear the stall. This is why it's really important that the instruction at the decode stage is now invalid. So unlike the jump case, where you might be able to set the priority of the stall and kill wires either way, here the kill has to take precedence, or rather the redirect has to take precedence. That's going to redirect the front of the pipe and basically clear out everything here. If you were to stall it, you would not be clearing it out. So instead you need to kill, turn these into NOPs, and redirect the front of the pipe to go fetch the actual target of your conditional branch. Sorry, this is just redrawn here, this branch.
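The stall-versus-kill priority argument above can be sketched as a tiny control-logic model (hypothetical Python, not any particular RTL; the signal and action names are made up for illustration):

```python
def decode_stage_control(stall_condition, branch_taken_in_execute):
    """Decide what the decode-stage instruction does this cycle.

    The kill (branch redirect) must take precedence over the stall:
    otherwise a stalled instruction in decode could never be cleared,
    and the pipeline would deadlock.
    """
    if branch_taken_in_execute:
        # Kill wins: turn the decode-stage instruction into a NOP and
        # redirect the front of the pipe to the branch target.
        return "kill"
    if stall_condition:
        # No taken branch in execute: the ordinary stall applies.
        return "stall"
    return "proceed"
```

Note that when both conditions are true, the deadlock scenario from the lecture, the model returns "kill", which is exactly the priority being argued for.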
The branch information has to go into this multiplexer and into this multiplexer. The other thing that's important for branches is we need to figure out the address we're actually branching to. We'll talk about this more when we get to branch prediction later in the course, but what you're going to have to do is take the PC+4 and add the branch offset to it, because on something like MIPS there's PC-relative branching. And the other thing is we can't necessarily reuse this ALU to do that. You could potentially reuse this ALU if you were microcoding your design, but unfortunately we need this ALU to do the branch comparison. So we need to add another adder here to come up with the target of our branch. Now, some people get smart about the way they build this, and you can actually see designs where people try to reuse the main processor pipeline ALU for the branch offset calculation and do the comparison another way, because if you have simple enough branches you can do the comparison with less hardware than a full adder. That's another option; let's hold off talking about that until a little bit later. Okay, so now we get to talk about the stall signal in detail. We start off in blue here at the top with the stall signal we had from before. You need to check the data dependencies, check for data hazards between the source operand registers, RS and RT, in the decode stage, against the other stages in the pipe: the execute stage, the memory stage, and the write-back stage. Make sure both that you're reading the registers and, by checking the write enables, that the respective instructions in the different pipeline stages are actually writing to those locations. But now we add another term: a branch term.
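The dedicated branch-target adder mentioned above can be sketched like this (a hypothetical Python model; real MIPS sign-extends a 16-bit word offset and shifts it left by two, while the lecture's running example adds a byte offset directly, so the byte-offset form is used here):

```python
def branch_target(pc, byte_offset):
    # PC-relative branch on a MIPS-like machine: the offset is added
    # to the address of the *subsequent* instruction (PC + 4), using a
    # dedicated adder, because the main ALU is busy doing the branch
    # comparison at the same time.
    return pc + 4 + byte_offset

# The lecture's example: a branch at address 100 with offset 200
# targets 100 + 4 + 200 = 304.
```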
So, we need to know whether there's a branch going on here, because we need to un-stall the processor if the branch is taken. So we're going to add a term which is effectively: it's a branch-equal-zero and the value is zero; likewise, it's a branch-not-zero and the value is not zero. These two terms are basically saying a branch is happening, it's in the execute stage, and the branch is being taken. So if it's a branch-equal-zero and the result is zero, it's taken, or if it's a branch-not-equal-zero and the result is not zero, it's taken. And why don't we stall if the branch is taken? This is the same question we had on the previous slide. You have a branch happening, and if we stall the front of the pipe, the instruction at the decode stage has to be unstalled to get it out and clear it. The instruction at the decode stage is now invalid, so we don't want to pay attention to any of the stall information; it's a red herring, a piece of information you shouldn't be paying attention to. So you want to have the instruction at the decode stage be turned into an invalid instruction. Okay, so now we get to talk about the control equations for the program counter and the instruction register multiplexers. Let's start off with PC source, and let's remember what that is. That's way up here: the input to this multiplexer. And it's going to select, not necessarily where we're branching to, but the program counter of the next instruction we're going to fetch on the next cycle. And it's on the next cycle because this is a flip-flop. There are a few control equations here, and I wanted to point out two basic things in these equations: the older instruction, the thing that's farther down the pipe, is going to get precedence over younger instructions for the control signals going down the pipe.
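A minimal sketch of the resulting stall equation, with the taken-branch term nullifying the data-hazard stall (hypothetical Python; the opcode mnemonics and signal names are invented for illustration):

```python
def branch_taken(opcode, reg_value):
    # BEQZ is taken when the source register is zero;
    # BNEZ is taken when it is not zero.
    return (opcode == "beqz" and reg_value == 0) or \
           (opcode == "bnez" and reg_value != 0)

def stall_decode(data_hazard, execute_opcode, execute_reg_value):
    # The data-hazard term (the "blue" term from before) is nullified
    # when a taken branch sits in the execute stage: the decode-stage
    # instruction is then invalid, so it must not hold up the pipe.
    return data_hazard and not branch_taken(execute_opcode,
                                            execute_reg_value)
```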
So if we look at this PC mux select, it actually has things coming out of different stages of the pipe. We have something redirecting for a jump out of the decode stage, and then we have these signals here coming out of the execute stage, which is strictly farther down the pipe. And this needs to take precedence over that. If you have a branch which is directly followed by a jump, the jump can be sitting there trying to swing the control on the PC select mux, the next-program-counter mux. You can't let that happen; you have to let the branch that's actually executing take precedence. That's what I really want to get across here. And you'll see that, similarly, the branch takes precedence over the jump in the other control signals. Let's look at this from a branch pipeline perspective. We'll draw the pipeline diagram for a branch executing in our example. What you notice really fast is, if you actually take the branch, you're going to be adding two extra instructions here, in this case 104 and 108, and they get killed as they go down the pipe. This is going to impact our CPI in a negative way. What's happening is the branch in the execute stage is going to reach back and kill these instructions, converting them into NOPs. We can plot this in the other dimension also, but you'll see. So now we need to think hard: is this a good thing? Well, we speculated that the address of the next instruction was going to be PC+4. So we took something where the machine CPI was, let's say, two, and we cut it down a little bit in the common case. But for jumps the CPI still goes back up to two, and here, for taken branches, the CPI is three.
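To see the CPI impact, here's a back-of-the-envelope model (illustrative, made-up numbers; the two-cycle penalty corresponds to the two killed instructions, 104 and 108, in the example):

```python
def average_cpi(branch_frac, taken_frac, taken_penalty=2):
    # Ideal pipelined CPI is 1; each taken branch wastes
    # `taken_penalty` extra cycles on killed (NOP-ed) instructions.
    return 1.0 + branch_frac * taken_frac * taken_penalty

# E.g., if 20% of instructions are branches and half of those are
# taken, the average CPI rises from 1.0 to about 1.2.
```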
Now, in our no-speculation case, if we had a branch we would probably have to stall for three cycles anyway to know the destination of that branch, so this didn't really hurt us there; the CPI in that case was also three for a branch instruction. Okay, but this still isn't good. I don't want every single branch I take to have a CPI of three. So let's talk about some ways to mitigate this cost. How do we reduce the branch penalty? Well, one way is to add a comparator one stage earlier and have the branch resolve on the same timeline as something like the jump. How do we do that? Here we have the register file; this is in the decode stage of the pipe, and this is the fetch stage. What do we really need to do for a branch? We need to know that it's a branch, so we do some decode, and we need to know whether it's a branch-zero or a branch-not-zero, for instance, if those are the only two branches in our architecture. And we need to know that, let's say, register one is zero or not zero. So some people came up with the smart idea of adding a zero detector onto the register file output. By adding the zero detector onto the register file output, we'll know which way the branch goes. One of the interesting things is that a zero detect is actually a little bit easier than doing a full comparison: to do a zero detect, you can basically build an optimized OR tree and then take the negation of it. Sometimes people do this with wired-ORs or more analog-style circuits, but it's effectively a big OR gate where you OR together all 32 bits, if you have a 32-bit processor, and then invert the result. This is easy for a zero detect; it can get harder for different types of branches. If you have, say, a branch-equals, a BEQ instruction, you can't use a zero detector.
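The zero detector described above, an OR-reduction followed by an inverter, can be sketched bit by bit (a Python model of the logic, not real hardware):

```python
def is_zero(value, width=32):
    # OR together all `width` bits, then invert the result: this
    # mirrors the big OR tree plus a NOT from the lecture, which is
    # cheaper than a full subtract-and-compare in the ALU.
    any_bit_set = False
    for i in range(width):
        any_bit_set = any_bit_set or bool((value >> i) & 1)
    return not any_bit_set
```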
What you have to do instead is somehow do a full comparison between the two source operands, and that usually takes more time. The nice thing about this approach is that it takes all your branches from a CPI of three, when they mispredict, or when you misspeculate, down to the same timeline that you had with a jump, so you end up with the exact same pipeline we had for jumps. The downside is that this is going to elongate your cycle time. You have this wire coming out, and the wire needs to go into the logic which computes PC source, and that's going to become a critical path. It's a critical path because you have to go through the decode (that's usually in parallel), access the register file, go through the zero detect, go into the control logic which computes this, and then come around and latch, or flip-flop, the information. So you can really negatively impact your clock period, and as we talked about with the iron law of processor performance, you can either get performance by lowering your cycles per instruction, or by increasing your clock frequency, which is to say lowering your clock period. These two things trade off. You can elongate your cycle time, which will negatively impact your performance, your time per program in our iron law, but on the flip side you will have branches resolve faster, so your average cycles per instruction, especially for branches, will go down. So this is only one approach. Another approach is to expose it to software: we can change the instruction set architecture. What's really cool here is that the instruction set architect can actually have an impact on what's going on.
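The trade-off can be framed with the iron law of processor performance (the numbers below are illustrative and made up, purely to show the calculation):

```python
def time_per_program(instruction_count, cpi, cycle_time_ns):
    # Iron law: time/program = instructions/program
    #                        x cycles/instruction
    #                        x time/cycle
    return instruction_count * cpi * cycle_time_ns

# Whether the early comparator wins depends on both factors: a lower
# CPI only helps if the longer cycle time doesn't eat the gain.
baseline  = time_per_program(1_000_000, 1.30, 1.00)  # branch resolves in execute
early_cmp = time_per_program(1_000_000, 1.15, 1.10)  # resolves in decode, slower clock
```

With these particular (assumed) numbers the early comparator comes out ahead, but flipping the cycle-time penalty to, say, 1.20 ns would reverse the conclusion.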
This is not all microarchitecture issues, not all fancy branch predictors, which is what we'll be talking about later in this course; instead you can expose this problem, this challenge of the time it takes to resolve a branch, to the software. So I'll define a branch delay slot. A branch delay slot is an architectural change, and when I say architectural I mean big-A architectural, an instruction set architecture change, which allows, or rather forces, some number of instructions after a branch to always execute. It could be, let's say, one, two, three, or four, some architecturally defined number. The instruction in the delay slot is executed regardless of the direction of the branch. What people try to do with this is take work that was above the branch and move it below the branch: if you know the work is going to be done anyway, you can just put it below the branch and everything's okay. But the catch is that the branch must not be dependent on the work you're moving below it. So typically, for very short loops where there's not much work inside the loop, this becomes very problematic, because there isn't enough work that the branch is not dependent on to move below the branch and have the loop run effectively. And what we're doing here is giving the compiler a hand in what's going on, or really the programmer, because that's who's doing the reordering of instructions from above the branch to below it. So let's take a look at an example where we have a branch and one delay slot. Some architectures have two or more, but right now we'll assume one delay slot. Let's also assume we have the branch penalty reduction optimization from before, so branches only have one dead cycle when we execute them.
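The constraint on what the compiler may move into the delay slot can be sketched as a simple check (hypothetical Python; register names are made up):

```python
def can_move_into_delay_slot(candidate_writes, branch_reads):
    # An instruction from above the branch is legal in the delay slot
    # only if the branch's comparison doesn't read any register that
    # the instruction writes; otherwise moving it below the branch
    # would change the branch outcome.
    return not (set(candidate_writes) & set(branch_reads))

# E.g., an add writing r3 can fill the slot of a branch testing r1,
# but an add writing r1 cannot.
```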
So what we can do is say, okay, the branch is at address 100, and this add here at address 104, in the delay slot, executes regardless of whether the branch gets taken or falls through. Then we don't actually have to kill the next instruction after the branch; we don't need all that fancy extra hardware, we just always execute it. The tricky thing is that, as I've said, many times it's hard to fill this branch delay slot. It'd be great if you could have many, many branch delay slots. If you go look at something like the Pentium 4, I think it has a twenty-odd-cycle mispredict penalty. They could have added twenty delay slots to their architecture; that would have been really cool. But at the same time the compiler would have to go find twenty instructions for every branch to stick after the branch, and there's usually just not that much work to put after the branch which the branch is not dependent on. So we have a really hard time filling these branch delay slots. Roughly, if you look at something like MIPS across the old SPEC INT benchmarks, to give you some sense of how often the compiler can actually fill branch delay slots, I think the rough rule of thumb is that with one branch delay slot you can usually fill it maybe about 70 percent of the time on something like SPEC INT. (SPEC INT is a common benchmark suite used throughout the computer architecture industry, and the computing industry generally, to measure performance.) And if you have two branch delay slots, that second slot gets filled less than about half the time. So, since these slots fill relatively infrequently, you might be better served by coming up with other prediction mechanisms.
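A quick model of those fill-rate numbers (roughly 70% for the first slot, under 50% for a second; every unfilled slot becomes a NOP, i.e., one wasted issue slot per branch; the rates are the lecture's rule of thumb, not measured data):

```python
def expected_nop_slots(num_branches, fill_rates):
    # fill_rates: one fill probability per delay slot, e.g. [0.7] for
    # one slot or [0.7, 0.45] for two. Each unfilled slot costs a NOP.
    return sum(num_branches * (1.0 - rate) for rate in fill_rates)

# With 100 branches and one 70%-fillable slot, expect ~30 NOPs; a
# second, harder-to-fill slot only adds to the waste.
```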
And we're going to talk later, not in this class but in a different lecture, about branch prediction, which is a technique to cover branch latency by making the branch decision early: we predict whether the branch will be taken or not taken. This can be used to dramatically reduce the branch penalty. So let's draw the pipeline diagram here for a branch delay slot; the architecture has one branch delay slot. We have the add flowing down the pipe, we have a branch flowing down the pipe, and now we have this next add, and architecturally we've defined that it always executes whether the branch is taken or not taken. So this pipeline diagram sequence is the same whether the branch is taken or not. In this case we'll say the branch is taken, and we go execute this next add here. But you see that there are no bubbles; no NOPs have been inserted. So we've actually improved the performance here. This assumes, though, that this add is doing useful work. That's really important to say, because the compiler or the programmer has to put something here, and if they can't put anything useful they're going to have to put a NOP. And that's not useful work; if you're doing a no-operation, you might as well just have had the hardware insert the NOPs for you and save some instruction encoding space. So there's a tough trade-off here between how often you can fill the branch delay slot versus having a branch delay slot at all. But for right now we'll say branch delay slots are one good technique to put useful instructions after a branch and not have branches stall out.