So let's look at some techniques to deal with outcome prediction for branches, and we're going to look at this first in the static domain. So, things that are not trying to watch the dynamic instruction sequence, but instead statically analyze the instruction sequence and, with some high probability we'll say, try to predict the branch outcome.

The first technique is actually not a prediction at all. We talked about this in lecture two: adding delay slots. Instead of having that time just be dead cycles while we're trying to resolve the branch, we say: if we don't resolve the branch until here, we have one delay slot, two delay slots, three delay slots, and three instructions to fill them. Or maybe you can redirect at this point out of X and you have two branch delay slots, depending on how you wire that in and whether you can make the cycle time. Let's say we have two branch delay slots. What ends up happening is you have an instruction, BEQZ, at, let's say, address 100, and then two instructions that execute no matter whether the branch is taken or not taken. Those instructions are in the delay slots of the branch. Some architectures have things like load delay slots and other things like that, but for right now we'll talk about branches. And we force those instructions always to execute.

Unfortunately, if you go look at the probability that you can actually fill these delay slots, it's pretty low. It's not always easy to find work to shove down there, because you're basically taking work that was above the branch and having the compiler reorder the code to put instructions that were before the branch after the branch.

Okay. Well, if you can actually make this work out, this is great. One of the problems with this, though, is that the probability of filling one branch delay slot is, let's say, 70%, and the probability of filling two branch delay slots is even lower than that, maybe 40 to 50%. As we said before, if you have a monkey pulling answers out of a hat, or just some random process, you're going to get 50% correct. And, as we'll see today, if you use some fancier branch prediction, or outcome prediction techniques, you can get the prediction accuracy of a branch all the way up into the 95, 96, 97 percent range. If you look at the out-of-order superscalar on your desktop, your Core i7 from Intel these days, that's probably somewhere in the neighborhood of 98 to 99 percent correct. So being able to fill these delay slots with some probability well below 98% is not a good trade-off; we'll put rough numbers on this in a sketch below. You would have been better served by allowing some fancier prediction mechanism, if you care about performance. Now, if you care about complexity, or reducing complexity and area, static prediction might be a good way to go.

So let's look at some basic rules of thumb about static prediction. The overall probability that a branch is taken, in something like SPECint, is 60 to 70%. But it's not equally distributed: backwards branches have a much higher probability of being taken than forward branches. Okay. So, we've got a question coming up here. Why is this?
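(Quick aside before the answer: here is the delay-slot trade-off above with rough numbers attached. This is a back-of-the-envelope sketch in C; the fill rates, the accuracy, and the two-cycle penalty are the ballpark figures from the lecture, not measurements.)

```c
#include <stdio.h>

/* Back-of-the-envelope comparison: delay slots vs. prediction for a
 * branch with a 2-cycle resolution bubble. All constants are the rough
 * figures quoted in the lecture, not measurements. */
int main(void) {
    double fill_slot1 = 0.70;  /* chance the compiler fills slot 1 */
    double fill_slot2 = 0.45;  /* chance it also fills slot 2 */
    double accuracy   = 0.95;  /* a decent branch predictor */
    double penalty    = 2.0;   /* dead cycles on a mispredict */

    /* Delay slots: every unfilled slot is a wasted cycle. */
    double slots = (1.0 - fill_slot1) + (1.0 - fill_slot2);

    /* Prediction: pay the penalty only on the rare mispredict. */
    double pred = (1.0 - accuracy) * penalty;

    printf("delay slots: %.2f wasted cycles per branch\n", slots);
    printf("prediction:  %.2f wasted cycles per branch\n", pred);
    return 0;
}
```

With these numbers the delay-slot scheme wastes about 0.85 cycles per branch, versus about 0.10 for the predictor, which is the trade-off being described.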
Yes. So: loops. By definition, to be a loop, you have to jump backwards; if you only jump forwards, it's pretty hard to loop. So if you jump backwards, it's a loop. And in fact, people like to execute loops and stay in loops for a while, because that's where a lot of the work is done. So if you're in a loop, just spinning, that's increasing the probability that the backwards branch is taken, and that gives it a high probability.

Contrast that with forward branches, which are taken only about 50% of the time. Hm. What's going on there? Forward branches are usually some form of data-dependent branch, like an if-then-else clause. That's why the probability there is much, much lower: you sit in loops for long periods of time, while forward branches are typically if-then-elses, where you're checking some data value and then either executing a piece of code or not.

Now, this gets a little trickier with some more advanced compiler techniques. Sometimes the compiler will effectively rotate a loop, converting the condition check at the end into an unconditional backwards branch plus a little forward branch that goes around the piece of code which checks whether the loop is completing or not (there's a sketch of this rotation below). But even that's not a horrible thing, because at least the backwards branch will always be predicted taken, a hundred percent, since it's just turned into a jump. So that's not bad for performance; it just might change these percentages a little bit. But things do like to sit in loops; that's what you should take away from this slide.

Okay, so let's think about a technique to try to take advantage of this. One technique that takes advantage of this is to add extra bits to your instruction set and allow the compiler to hint to the architecture whether the branch is taken or not taken. Now, it's just a hint: if it gets it wrong, we still get correct execution. If the branch is taken, we still execute the right piece of code; the performance is just worse. So let's take a look at two branches here: branch.T and branch.NT. Branch.T is a branch that's predicted taken, and branch.NT is a branch that's predicted not taken.

So what do I want to say about this? Well, real architectures do have this. The Itanium architecture actually has static branch prediction hints. Some architectures have intermediate versions of this; for instance, the Motorola 68K kind of has something like it, but you can't do it with all branches. So this exists in real architectures, and one of the nice things is that it can actually be very accurate, especially if you profile your code. If you run the code once and let the compiler see which way each branch went, it can do quite well. And some of the insight here is that there are a lot of branches in a program which are there just to check error cases, and the probability that they go one way or the other can basically be determined statically at compile time. And if you have feedback information from one execution of the program, that's a very good indicator. So, for instance, if you had a loop, the compiler would denote the backwards branch as taken.
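Here is a sketch of the loop rotation mentioned above, shown at the C source level rather than in assembly. This is one common form of the transformation, assuming a simple top-tested loop; the lecture's variant goes one step further and makes the backwards branch a plain unconditional jump, with a forward branch doing the exit check. The function names are made up for illustration.

```c
/* One common form of the loop rotation described above, at the C level. */

int sum_before(const int *a, int n) {
    int s = 0;
    /* Top-tested loop: naively this compiles to an exit test (a forward
     * branch) each iteration plus a jump back to the top. */
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}

int sum_after(const int *a, int n) {
    int s = 0;
    int i = 0;
    if (i < n) {           /* small forward branch around the loop */
        do {
            s += a[i++];   /* loop body */
        } while (i < n);   /* backwards branch, taken almost always */
    }
    return s;
}
```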
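And as a concrete modern analog of these hint bits: GCC and Clang expose __builtin_expect, which lets the programmer, or profile feedback, tell the compiler which way a branch usually goes, exactly the error-case pattern just described. This is a minimal sketch; the likely/unlikely macros are the usual idiom around the builtin, and do_work is a made-up stand-in.

```c
#include <stdio.h>
#include <stdlib.h>

/* likely()/unlikely() are the usual idiom around GCC/Clang's
 * __builtin_expect, a software-visible cousin of br.T/br.NT. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

static int do_work(int x) { return x * 2; }  /* made-up stand-in */

int main(void) {
    int err = 0;  /* like the error checks above: almost always zero */
    if (unlikely(err != 0)) {   /* hint: this branch is rarely taken */
        fprintf(stderr, "error path\n");
        exit(1);
    }
    printf("%d\n", do_work(21));
    return 0;
}
```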
So we put a br.T on it. And if there's, let's say, a piece of code that checks an error condition and jumps over the error-handling case, and 99 percent of the time execution never falls into that case, we would predict that branch taken, so we jump over that little piece of code. So you can actually get very high prediction accuracy this way. Now, we're not up into the 99 percent range here yet, or anything like that. This isn't great; we're still doing static things. And at this point it may actually still make sense to have a delay slot, because we're in the neighborhood of a close trade-off here: you have to get the static prediction correct, and a delay slot might actually still be a better approach, depending on how the prediction accuracy compares to how often you can fill the delay slot.

But let's take a look at what happens in the pipeline here. So here we have a branch predicted taken, going to some target. We still end up with a dead cycle, because we do not know where that target is until we've effectively decoded the instruction; in the intermediate cycle we've fetched, let's say, the fall-through instruction. But because we predicted taken, we can do better here. We don't have to have two delay slots, or two dead cycles: once we decode, we can fetch the correct next instruction, the target. For a branch predicted not taken, let's say we speculatively execute the fall-through. We just fetch PC+4, PC+8, and keep executing, and we have no penalty at all for a correctly predicted branch.

Okay, so what if we get the hint wrong? What are we going to do? Well, we have effectively taken a mispredict penalty. If you look at the branch-taken case and you take a mispredict, what's going to happen is you end up having two dead cycles, because we don't actually determine whether we mispredicted until the execute stage, and that's the soonest we can redirect the front of the pipe. Now, if you look at this and squint really hard, you can see that we actually fetched op A twice here: it was the fall-through instruction. It is hypothetically possible that you might be able to hold off killing that instruction, and have the pipe do something special so you end up getting the instruction after op A, or op B here, if you will. But that's pretty hard to do. So, for the base case, we'll just say that you take the full mispredict penalty when you mispredict.

Yeah, so just to reiterate what I was trying to say here: for a branch predicted taken, we first fetch the subsequent instruction, the fall-through. Then we fetch the target of the branch and try to kill the fall-through, but, lo and behold, if the branch is actually not taken, we end up fetching the exact same instruction again. You could, if you really wanted to, try to optimize around that and not fetch it twice, but then you kind of have things out of order in the pipe: you'd have this instruction, and that instruction is dead, and you have to figure out how to kill sub-portions of your pipe, or handle things out of order in your pipe. That gets pretty hard to do.
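To put numbers on this pipeline: here's a minimal sketch of the expected dead cycles per branch under the hinted scheme, assuming the costs just described: zero for a correct not-taken prediction, one dead cycle for a correct taken prediction (the target isn't known until decode), and two for a mispredict. The 60% taken rate and the accuracies are illustrative assumptions, and accuracy is assumed independent of branch direction.

```c
#include <stdio.h>

/* Expected dead cycles per branch for the hinted pipeline above:
 * correct not-taken = 0, correct taken = 1 (target known only at
 * decode), any mispredict = 2 (redirect from execute). */
int main(void) {
    double p_taken = 0.60;                 /* fraction of taken branches */
    double accs[]  = {0.70, 0.85, 0.95};   /* illustrative accuracies */

    for (int i = 0; i < 3; i++) {
        double acc  = accs[i];
        double cost = acc * p_taken * 1.0            /* correct, taken */
                    + acc * (1.0 - p_taken) * 0.0    /* correct, fall-through */
                    + (1.0 - acc) * 2.0;             /* mispredicted */
        printf("accuracy %.0f%% -> %.2f dead cycles per branch\n",
               acc * 100.0, cost);
    }
    return 0;
}
```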
But static, software-based branch prediction really does quite well. Okay, so let's look at hardware branch prediction. Now, when we say hardware branch prediction, that does not necessarily mean dynamic hardware branch prediction. It could just mean that the hardware does the prediction without any hints from the software, and there are a couple of different heuristics, if you will, that we can try to implement in hardware.

The first one here: always predict not taken. This is what we've done so far in all of the pipelines we've designed. We effectively predict fall-through and just fetch PC+4. We didn't put a name on it, but this is actually what we were doing: speculative execution with a static hardware branch prediction of PC+4. It's pretty simple to implement, it's what you've done in your labs so far, and you know the fall-through address early. But the accuracy is not very good: you get all your loops wrong. All your backwards branches in your loops are always predicted wrong.

Okay, let's do the inverse: predict taken. Let's have that be our static strategy in hardware. Well, that's kind of hard to do here, because we don't really know the target of the branch until decode. So, going back to the pipeline diagram: if we predict everything taken, what do we fetch in that first cycle? Well, I don't know. It's a big question mark. It also doesn't do super well on if-then-elses, because a lot of the time those are forward branches over things, though it depends on how you structure the code. Some architectures that have these static prediction techniques will actually restructure their code: the compiler figures out the probability of a branch being taken or not taken and then arranges the code around it to make it work out. That was actually pretty common; the Motorola 68K had something similar, and the compilers there actually tried to work around it. So, hmm, the unknown target is definitely a bummer here. Maybe we can fix that. Like I said, the second part of today's lecture is about trying to fix this problem.

Okay, how about we use a heuristic: all backwards branches are predicted taken, and all forward branches are predicted not taken. Okay, so this does a lot better. This is our heuristic version of what we were saying before: loops and things like that get caught by it. A forward branch is not part of the loop, so we'll predict it not taken, the fall-through; a backwards branch we'll predict taken. This does pretty well. It's better accuracy, but it's still nowhere near the sort of 80 percent accuracy we had when the compiler got involved, because the compiler case we were talking about before could actually implement this entire algorithm in software, or something much more sophisticated, and that's usually what ends up happening.
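The backwards-taken/forwards-not-taken heuristic fits in one comparison; here's a minimal sketch. Note that in hardware you only know the target, and hence the branch direction, once the instruction is decoded, which is part of what the second half of the lecture addresses. The addresses in main are made up for illustration.

```c
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

/* Backwards-taken / forwards-not-taken (BTFN): a branch whose target
 * is below its own PC is a backwards branch, probably a loop, so
 * predict it taken. */
static bool btfn_predict_taken(uint32_t branch_pc, uint32_t target_pc) {
    return target_pc < branch_pc;
}

int main(void) {
    /* A loop-closing branch at 0x120 jumping back to 0x100... */
    printf("loop branch:    %s\n",
           btfn_predict_taken(0x120, 0x100) ? "taken" : "not taken");
    /* ...versus an if-then-else skipping forward over some code. */
    printf("forward branch: %s\n",
           btfn_predict_taken(0x200, 0x240) ? "taken" : "not taken");
    return 0;
}
```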