Okay. So that gets us through the outcome part of today's lecture. Let's start talking about how to figure out where we're going. That brings us back to this picture, and I've taken some stuff off this picture now. What we need to know now is the targets, the role of targets. And I said we don't know the target for a branch, a jump, or a jump and link until we actually do the PC-plus-offset calculation somewhere in the decode stage, and until we at least have the instruction. For jump register and jump and link register, we don't know the target until even later. And this is not looking at the outcome; this is just looking at when we know the value. But this is important, because even though we have the prediction of whether it's taken or not taken, and we can do that with, let's say, 99 percent accuracy, this is the harder part: we're trying to pick from two to the 32 possible targets. Can we actually do a good job here?

So something not on the slides, but something I wanted to talk about briefly here: we can use a static approach, similar to what we've seen with the other predictors, and do this calculation through the compiler, if you will. There have been a few architectures that do this. One example is the Hitachi, now Renesas, SH-5 architecture, which is sort of a media processor, we'll say. It actually has an instruction which loads the target of a future branch. So they decouple the branch into two instructions: the actual branch, and a prepare-target instruction, which the compiler can schedule earlier in the instruction sequence and which says, I'm planning to do a branch, and its address is such-and-such. It's actually called prepare target, or PTA, in their architecture. Because of this, they don't have this problem. All the hardware has to do is actually take the branch; it already knows the target of the branch in the fetch stage, because this instruction says, load up this special register with my branch target, and then we decouple the actual taking of the branch to sometime in the future. So this is an interesting idea. I'm pretty sure one of the CDC machines, the [inaudible] machine, did this a long time ago also, but I couldn't find the reference in time for today's lecture. But I know the SH-5 definitely has this instruction. All I'm trying to get at here is that there are static, software ways to go about doing this.

Now let's talk about the dynamic approaches. The most common way to do this is to implement something called a branch target buffer. With a branch target buffer, you do a lookup in parallel with both the branch outcome prediction and the PC plus four computation. You put in the current PC, and it tells you, somehow, one out of two to the 32 possible targets. How does it do this? Well, it just keeps a table of where that branch has gone in the past. So the first time the branch executes, the predictor simply gets it wrong, but that first execution loads the table with a predicted target. So we load the predicted target based on where this branch went in the past, we index into this table based on our current program counter, and then we check to see that we actually match.
So we take the whole PC and do a tag check, very similar to a cache, because this is effectively a cache. If we hit in this small cache of targets, we have the new program counter, assuming we predicted taken. If we predicted not taken, we just want to use PC plus four. So, in effect, we can predict the target of the branch based on where this branch has gone in the past. Wait, that was a lot easier than expected; I thought this was going to be really hard. Well, it's actually not that bad.

One thing I should say is that the size of this table matters. If you alias too many branches into the same entry, you may end up pulling out wrong data, or you might end up missing quite often. So you want to size this table appropriately. But you don't want to make it too big, because you're carrying the entire program counter in each entry, and that can get big; the data in here is quite large. One way people typically deal with this is to make these things set associative. So you'll have multiple PCs and predicted targets per index, because checking a few entries in parallel gives a little better performance than simply making the table larger. I mean, you could just make it larger, but typically a little bit of associativity helps. And you're really executing this lookup in parallel with the PC plus four logic. One other thing that helps here is that the branch target buffer does not hold every instruction: if it's not a branch or a jump, you don't put it in this table. You really only put branch instructions into this table, which saves a lot of space.

Okay, so we have a question here: how do we achieve this effect without decoding the instruction? We have a branch target buffer, and we don't want to decode the instruction. Can we just look in the branch target buffer? Is that good enough? Is all the information we need in that table? Because we're executing this in parallel with the PC plus four, and we haven't fetched the current instruction yet. Is that good enough, or are we missing something? Yeah, this is what I was trying to get at: because we check the program counter here with a full bit-width match, this is good enough. We don't need to look at anything else.

This question was also here to make a point about something called a line predictor. Sometimes you want to get rid of this valid bit and PC tag from before. Then, coming out of here, instead of predicting the program counter, you just predict where to get the next instruction from in your cache, that is, which entry in the cache to go fetch, and this is called a line predictor. A full branch target buffer gives you an actual PC, and at some point you do need to know the PC, but you can wait to know it until later down the pipe. All we really need to know right now is which line in the cache we want to fetch the next cycle. We can do that by removing this check and just, ad hoc, fetching whatever this table points to, which might be wrong, so we have to check it later. So anyway, this is a trick question. We can achieve this with a full BTB, a full branch target buffer. If we drop the check, it turns into something called a line predictor: we're predicting which line in the instruction memory, or which line in the instruction cache, to fetch from.
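To make this concrete, here is a minimal sketch of a direct-mapped branch target buffer in C++. The table size, field names, and interface are illustrative choices, not taken from any particular machine; as mentioned above, a real design would usually add some associativity, and a line predictor would drop the valid bit and tag check.

```cpp
#include <cstdint>

constexpr int kBtbEntries = 512;  // illustrative size; a power of two

struct BtbEntry {
    bool     valid  = false;
    uint32_t tagPc  = 0;    // full PC stored for the tag check
    uint32_t target = 0;    // where this branch went last time
};

BtbEntry btb[kBtbEntries];

// Predict the next fetch PC; runs in parallel with the PC + 4 logic.
uint32_t predictNextPc(uint32_t pc, bool predictedTaken) {
    const BtbEntry& e = btb[(pc >> 2) % kBtbEntries];  // index by low PC bits
    // Full-width tag check against the whole PC: this is what makes it a
    // cache. Dropping this check is what turns it into a line predictor.
    if (e.valid && e.tagPc == pc && predictedTaken)
        return e.target;
    return pc + 4;  // miss, or predicted not taken
}

// Train after a taken branch or jump resolves. Non-branches never get
// entries, which is what keeps the table small.
void updateBtb(uint32_t pc, uint32_t actualTarget) {
    BtbEntry& e = btb[(pc >> 2) % kBtbEntries];
    e.valid  = true;
    e.tagPc  = pc;
    e.target = actualTarget;
}
```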
[inaudible] Okay. So this brings us to jump register. Hm, this one gets harder to train. We can try to train it the same way; we can try to use the BTB. So let's look at the different cases. What is a jump register used for? It's used for switch statements. It's used for C++ virtual function pointers, or dynamic function calls, so a jump through a function pointer. And it's used for returns from subroutines. So how well does a BTB work in these different cases?

Switch statements: if you execute the same switch statement and the same case keeps getting lit up, this works pretty well; it trains like any other predictor. But if which case you choose out of the switch statement is data dependent, or highly data dependent, you're probably not going to predict it very well, and there's probably no great solution to this. Some people do data prediction and try to look at the register values to do a little better, but it gets very hard.

Virtual function calls: well, this works very well. In C++ there are function pointers all over the place; this is how polymorphism is implemented. When you do a method access on an object in C++, you're actually indexing through a jump table inside the data structure. And it just so happens that this is basically always correct, unless you reassign that object pointer to something else, or the object changes to a different type. As an example, say you have a class Animal, and you have Dog and Cat, which subclass from Animal. You have a pointer to an Animal, and you have a Cat and a Dog. Let's say you assign the Cat to this object pointer, which has type Animal, and you call some method on it, like run. That's going to train up very well. But all of a sudden, if you change that Cat to a Dog, the virtual function pointer changes what you actually execute: running the function run on the Cat versus running it on the Dog does different things. Cats like to sleep a lot; dogs like to run around and bark. The code in the middle of the Cat's run function and the Dog's run function is different, so we no longer jump to the same location, and our BTB is wrong. But it trains back up pretty fast, because you're probably not changing the object pointer very often. So that's decent. This example is written out in code below.

Okay, so subroutine returns. Jump registers are used there a lot. You call into a function with a jump and link; to come back out, you use a jump register. Well, BTBs are okay if the function is always called from the same location. But if you're calling a function from different locations quite often, effectively what happens is that one function gets called from lots of different places, and you can't predict the return location very accurately. So if one leaf function gets called from lots of different locations, that's not particularly good for a BTB.
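Here is that Cat and Dog example written out in actual C++, just to make the indirect jump visible. The class and method names follow the example above; the loop count is arbitrary. Every call to `a->run()` is a jump through the object's vtable, which is exactly the jump-register case the BTB is trying to predict.

```cpp
#include <cstdio>

struct Animal { virtual void run() = 0; virtual ~Animal() = default; };
struct Cat : Animal { void run() override { std::puts("cat: mostly sleeps"); } };
struct Dog : Animal { void run() override { std::puts("dog: runs and barks"); } };

int main() {
    Cat cat;
    Dog dog;
    Animal* a = &cat;
    for (int i = 0; i < 1000; ++i)
        a->run();   // indirect jump target is stable: Cat::run, so a BTB trains up
    a = &dog;       // reseat the object pointer
    a->run();       // one mispredict here, then the BTB retrains on Dog::run
}
```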
One interesting thing about the subroutine return case is that the program counter of the jump register at the end of the leaf function has no correlation with the function that called into it. So effectively, when you go to index the BTB, you're using the address of the jump register instruction, and it might point back to some other function that called this one in the past, which tells you nothing about whoever calls it in the future. And that's why this is really hard to do.

So can we do better for jump registers? Yes. The idea is a subroutine return call stack, or subroutine return stack: a small predictor specifically for jump registers used as subroutine returns. What you do is track function calls, push the return address onto a stack, and then, when you do a return, use the top of that stack as the prediction. There's a small sketch of this below. So let's say we have these functions here: function A calls function B, function B calls function C, and function C calls function D, as denoted here. We have a stack that we push entries onto when we execute jump and links, so there's something to load this predictor with. When the function call, the jump and link, gets executed, we push the address of what comes right after the call site, effectively. I'll denote it here by the address after the call to FB; that's for when FA calls into FB. In reality, it's the instruction right after the call that we're going to return to. Likewise, when FB calls FC, we push the return location after the call to FC, and when FC calls FD, we push that. And these come off in the inverse order, because it's a stack. So when we return, we pop return addresses: pop, pop, pop, like that. We pop off the addresses after the calls to FD, FC, and FB, and we predict the return for each leaf function correctly, every single time. Where this gets a little hard is if our call depth, or recursion depth, if you will, is deeper than the number of entries in our return stack predictor; then we start predicting wrong. But otherwise we can do pretty well with this.

In fact, some architectures make this a little easier on the hardware with a special instruction for the call. For instance, x86 has a call instruction, which tells the hardware to load the predictor. Otherwise, with a jump and link, there are too many other uses: it's not always a function call. It's almost always a function call, but not always. A call instruction on x86, though, is basically always a call, so you know to load this predictor. And on x86 there's also a return instruction, which is always a return from a function, so there the semantics are tied down nicely, in a way that a plain jump register's aren't.
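Here is a minimal sketch of that return stack predictor in C++. The depth and the interface are assumptions for illustration; real designs vary. Note how a call depth beyond the table size wraps around and starts overwriting entries, which is exactly the deep-recursion failure mode just mentioned.

```cpp
#include <cstdint>

constexpr int kRasDepth = 16;  // illustrative depth

struct ReturnAddressStack {
    uint32_t entries[kRasDepth] = {};
    int top = 0;  // number of pushes minus pops

    // On a call (a jump and link, or an x86-style call): push the address of
    // the instruction right after the call site (PC + 4 assumes fixed 4-byte
    // instructions, MIPS-style).
    void onCall(uint32_t callPc) {
        entries[top % kRasDepth] = callPc + 4;  // wraps if recursion is too deep
        ++top;
    }

    // On a return (a jump register used as a return, or an x86-style ret):
    // pop the predicted return address.
    uint32_t onReturn() {
        if (top > 0) --top;  // on underflow, just repredict the bottom entry
        return entries[top % kRasDepth];
    }
};

// Usage, following the FA -> FB -> FC -> FD example: three onCall() pushes on
// the way down, then three onReturn() pops that predict each return correctly.
```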
Okay. So there's actually one other topic I want to talk about today that's not in the notes, and it correlates with line prediction. We talked about line prediction; the other thing I wanted to talk about is way prediction. We haven't covered this yet. Say you have a two-way set-associative instruction cache. We talked about which line to go look at in the instruction cache, but something that's just as hard is this: if you have a two-way instruction cache and you're trying to fetch the next instruction, which way do you shove into the decode stage of your pipe? Way zero or way one? You just don't know. This is one of the reasons why people would build direct-mapped instruction caches: it's very hard to know which way is the correct way. But a predictor, just like our branch predictors, helps with this. We can build a way predictor. We can use the same prediction architectures we've talked about in today's lecture, dynamic, multi-level predictors, sometimes even baked into the branch predictor itself, and use them to predict whether we choose, let's say, way zero or way one; or, if we have a four-way set-associative cache, way zero, one, two, or three. There's a small sketch of that below. Anyway, that's what I wanted to talk about today, so let's wrap up here.
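Here is that way-predictor sketch: a minimal version for a two-way set-associative instruction cache, in C++. The table size, the line-size assumption, and the one-bit-per-set design are all illustrative; as noted above, real designs often fold this into the branch predictor.

```cpp
#include <cstdint>

constexpr int kSets = 256;  // illustrative number of cache sets

// One bit per set predicting which of the two ways the next fetch hits in.
uint8_t predictedWay[kSets] = {};

// Guess a way before the tag check completes, so fetch can feed decode early.
int predictWay(uint32_t fetchPc) {
    return predictedWay[(fetchPc >> 6) % kSets];  // assumes 64-byte cache lines
}

// Once the tag check resolves, train toward the way that actually hit. A
// mispredicted way costs a redone fetch, much like a mispredicted branch.
void trainWay(uint32_t fetchPc, int actualWay) {
    predictedWay[(fetchPc >> 6) % kSets] = static_cast<uint8_t>(actualWay);
}
```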