Okay, so we left off talking a little bit about how VLIWs encode instructions. And as you may recall, in a VLIW instruction encoding, you have a very long instruction word. It could be many, many bytes. If you look at something like the Multiflow Trace processors, they had very long instruction words; in their largest machine configuration they could execute, I think, something like twenty-something operations per instruction. And they had to encode all of this, but one of the challenges was that you didn't necessarily want all that encoding space used all the time. If you're executing a simple instruction, or something which doesn't actually need all the encoding space, you end up with a whole lot of no-ops. Or if you have a very long dependency chain in your execution trace, if you will, so you have an instruction which is dependent on the next instruction, which is dependent on the next instruction, and so on, you cannot fill any of the extra slots, and you're going to end up with a lot of no-ops in your program. So how do we go about solving this? A couple of different solutions here. The Cydra-5 went off and added a particular instruction, a single-operation instruction, which used a much smaller amount of memory. So this would effectively reduce your instruction cache pressure. There were compressed formats in memory, so you can hold your instruction sequences compressed in memory, and then when you pull them into your instruction cache, they expand to the fully expanded version. And this is kind of like a compression algorithm: you're basically compressing your VLIW instruction sequences, because otherwise these things are very, very large and have lots and lots of zeros, or no-ops, in them.
Because the probability that you're going to fill, say, twenty instructions per cycle is low, but you want that capability there in case you actually do want to execute twenty instructions per cycle. Another solution here is what's actually used in the Intel Itanium, or IA-64, processor line, and also in the TI DSPs, the C6000 series. They have the ability to effectively mark where a bundle starts and stops. So it's something in between a variable-length instruction and a fixed-format VLIW instruction. We'll talk more later in today's class about how IA-64, Intel's Itanium, handles this. But roughly what happens is they have a fixed instruction format, and you can fit three instructions per bundle, or what they call a bundle. And then, if you want to go over that, you can keep going over that; they have special bits in the instruction which say, well, this is going to express some more parallelism later. So you need to look at the next bundle, fetch that bundle, and all of that can be executed in parallel. So it's still a parallel execution semantics, but the format is not necessarily fixed. And this is sort of outside the bounds of classical VLIW now; it's getting into slightly more modified, or more contemporary, VLIWs. So today, as I said, we're going to talk about how to go about getting all the performance that out-of-order superscalars can get, but in the context of a VLIW. We're going to slowly build up all the different techniques that out-of-order superscalars have, all the ways that they can get instruction-level parallelism, and then apply them inside of a VLIW, or see how we can make small changes to classical VLIWs to get a lot of that performance back. Okay, so let's talk about the first trick here. One of the problems that comes up is that if you have branches and you mispredict, this can limit your instruction-level parallelism in a VLIW.
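As a rough sketch of the marking idea, here is how a decoder might find execute-group boundaries when each operation carries a "parallel" bit, in the style of the TI C6000's p-bit: if the bit is set, the next operation belongs to the same cycle; if it's clear, the group ends there. The struct layout is invented for illustration, not the real C6000 or Itanium encoding.

```c
/* One decoded operation: an opcode plus a "p" (parallel) bit.
   p == 1 means the *next* op executes in the same cycle. */
struct op {
    unsigned opcode;
    int p;
};

/* Walk a stream of ops and record the size of each execute group.
   A group ends at the first op whose p bit is 0. Returns the
   number of groups found; sizes[] must have room for n entries. */
int find_groups(const struct op *ops, int n, int sizes[]) {
    int groups = 0, len = 0;
    for (int i = 0; i < n; i++) {
        len++;
        if (!ops[i].p) {
            sizes[groups++] = len;
            len = 0;
        }
    }
    return groups;
}
```

With this scheme a "bundle" of parallel work can be any length: three ops with their p-bits set chained onto a fourth form one four-wide group, while four ops with clear p-bits are four sequential one-op groups, all out of the same fixed-format stream.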
Okay, so why is this worse than in an out-of-order superscalar? Well, an out-of-order superscalar can dynamically schedule around branch mispredicts. So if your branch mispredicts, you can basically schedule other code, or speculatively schedule code depending on your prediction. But on something like a VLIW processor, that's a lot harder to do, because you don't have a dynamic scheduling engine back there. The compiler had to do all that scheduling statically, at compile time. So how do we go about fixing this? Well, one solution is just to eliminate all the hard-to-predict branches. So you take out the branches that might be hard to predict. Your branch predictor is trying really hard, and you could still have a branch predictor in your VLIW processor, but some of these things you just don't know. Take a data-dependent branch: how do you know which way it's going to go? And, you know, this is a problem even for superscalars, but at least in a superscalar you can try to backfill instructions, or try to move things around, say move loads or long-latency instructions around branches. So in an out-of-order superscalar you can potentially move a load beyond a branch. But in a VLIW processor, that's not easy to do, because the compiler has to make that decision, whether that branch is going to be predicted taken or not taken. And that may change over the lifetime of the program, and it may also change depending on the input sets to the program. So it may not even be knowable at compile time. Okay, so the first technique we're going to use is called predication. Let's define what predication is. Well, first of all, predication is a technique that allows you to basically change the result of a register based on some other register. And it's going to let us transform hard-to-predict branches into data flow, or data dependencies.
So we're going to add something to our instruction set, and it's going to take what looks like a small branch, or a short-distance branch, and we're going to basically execute both sides of that branch, the taken and the not-taken code paths. And then at the end, we decide which one was the right one, with something like a select operator from C. So this is like the question-mark-colon operator, which almost no one ever uses in C. This is the equivalent of that, but we're going to do it in one instruction, or maybe two instructions. So let's take a look at how this helps with small branch regions, or branches which are hard to predict. We're going to introduce two instructions, and I also want to point out that predication sometimes shows up in superscalar processors, or in sequential instruction sets, also. A good example of this is that two instructions very similar to these are actually in x86; they're called cmov, or conditional move. And we're going to introduce conditional move as the most basic form of predication. So let's look at the semantics of the two instructions we add for a conditional move. First instruction here: move zero. What does move zero do? Well, move zero tests whether one of the input operands, rt, is equal to zero. And if it's equal to zero, it's going to move rs into the result register, the destination register rd. So it's going to move rs into rd if rt is equal to zero. And if rt is not equal to zero, it's just going to leave rd alone. Okay, at first appearance that seems like a very simple instruction. It's basically just doing a copy operation; it's a gated copy. Looks a lot simpler than something like add or subtract. We don't have to do any math, at least. And the compare is easy; it's an equals-zero test, not a less-than-or-equals. So it's a pretty simple instruction. And we're also going to introduce move not zero, which is kind of the complement of this one.
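The semantics of these two instructions can be written down as a couple of one-line C helpers; this is just a behavioral sketch of the MIPS-style movz/movn pair described above, treating each register as an int.

```c
/* movz rd, rs, rt : rd gets rs if rt == 0, else rd is unchanged.
   Returning the (possibly unchanged) rd models the non-destructive
   behavior: when the condition fails, the old value survives. */
static inline int movz(int rd, int rs, int rt) {
    return (rt == 0) ? rs : rd;
}

/* movn rd, rs, rt : the complementary sense, rd gets rs if rt != 0. */
static inline int movn(int rd, int rs, int rt) {
    return (rt != 0) ? rs : rd;
}
```

Note that in both cases the only "work" is an equals-zero test and a gated copy, which is why these are cheap to add even to a sequential instruction set.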
And we're going to want this because when we go to execute code, we're going to want to, let's say, have an if-then-else, and we're going to want to execute the then and the else at the same time, or roughly at the same time, indiscriminately, not depending on the branch. And at the end, we're going to decide which one was the correct one. And we need two instructions here so we can choose whether the case where the condition value is true was the right one, or the case where it's false, or vice versa. So, move not zero. Move not zero is going to do the same thing, except the sense of the condition is going to be different. It's going to test if rt is not equal to zero, and if rt is not equal to zero then rs is going to go into rd; else we're just going to leave rd alone with its original value. Okay. So let's walk through a quick code example here and see how this helps. So here we have a code example. We have some C code, we have non-predicated MIPS-like assembly code, and then we have MIPS-like assembly code with these fancy two predication instructions. And all three of these are doing the same thing. So let's look at this piece of code; the C code is probably the easiest to understand. It's going to start off and say if a is less than b, so that's the condition, and there's a then and an else. So what does this actually implement? Well, this is a minimization function, or a min function. x gets the minimum of a and b. If a is less than b, it gets a. If a is greater than b, it gets b. So x is going to be assigned the smallest of a and b. And these if and else clauses, you can see, are pretty short; we don't have a whole lot of code in either the then case or the else case. And that's going to be important in a second; you'll see why.
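Here is the min example written both ways in C: once with the ternary select operator, and once as a line-by-line model of the predicated slt/movz/movn sequence. The register mapping in the comments is one consistent choice for illustration; the lecture's slide may assign registers differently.

```c
/* Behavioral helpers for the MIPS-style conditional moves. */
static inline int movz(int rd, int rs, int rt) { return (rt == 0) ? rs : rd; }
static inline int movn(int rd, int rs, int rt) { return (rt != 0) ? rs : rd; }

/* The C version: x = min(a, b) with the select operator. */
int min_ternary(int a, int b) {
    return (a < b) ? a : b;
}

/* The predicated version: execute both assignments, let the
   condition register decide which one sticks. No branch anywhere. */
int min_cmov(int a, int b) {
    int p = (b < a);     /* slt  p, b, a   : p = 1 iff b < a      */
    int x = 0;           /* result register                       */
    x = movz(x, a, p);   /* movz x, a, p   : p == 0 -> x = a      */
    x = movn(x, b, p);   /* movn x, b, p   : p != 0 -> x = b      */
    return x;
}
```

Both moves execute every time; exactly one of them actually writes x, which is the control-flow-into-data-flow transformation in its smallest form.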
Mainly it revolves around the fact that if you have lots and lots of code, or one of the sides of the branch is long, it may not make sense to actually predicate this, because there's some cost to predication. Okay, so here's some assembly code that does the same sort of thing. Set less than is a comparison operation in MIPS. So we do a set less than, and R2 and R3 are going to hold our two values, a and b. If it's less than, we put a one into R1; if not, R1 gets set to zero. And then we have a branch here, effectively a branch-if-zero. Okay, so a branch-equal against zero; it's a little odd, we probably should have just put a branch-zero there instead. But it's going to check whether this value is zero or one. If it's the one direction, it's going to jump to L1; if it's the other direction, it's going to fall through and then jump over the second move. And these two moves here are the two assignment operations, the x = a and the x = b. Okay, that's not so bad. Let's analyze how long this takes, how many cycles this will take to execute on a sequential machine. Okay, there's the set less than, that's one. Then the branch, that's two. Let's say the branch is not taken and falls through: the move is three, and then the jump over is four. So it's going to be one, two, three, four instructions in that case. And in the other case, it's going to take the branch and jump over all that stuff, so it's one, two, three instructions. Okay, that's not so bad. So the one case is four and the other case is three. They're different. Now, something else that makes this interesting: let's say this branch is mispredicted. It's a data-dependent branch on a and b. So what's a good thing to predict here? Can our branch predictor do a good job on this? Hm, I don't know. It depends on what you know about a and b.
If the compiler knows nothing, or the hardware knows nothing, you're not going to be able to do a particularly great job. So let's say it's a data-dependent branch and it goes 50 percent one direction and 50 percent the other direction. Is it going to slow down our execution? Well, sure. It's going to slow down our execution, because all of a sudden, whichever way we predict, the mispredicted path is going to add some mispredict penalty. So instead of one path being three and the other path being four, let's say the branch is predicted, oh, I don't know, let's say the branch is predicted not taken. So it predicts the fall-through case. The fall-through case is still going to be one, two, three, four instructions. And let's say we have a two-cycle branch mispredict penalty. So let's look at the other case. The other case is going to be one, two, and then two cycles of mispredict penalty, which gets us up to four; then the instruction at the branch target, five. So it can be five cycles. So all of a sudden we have our branch mispredict penalty coming into the equation here. And, you know, two-cycle branch mispredicts are relatively benign. If that starts to get longer, then you really start to have problems with this. Okay. So let's take this code sequence and look at how to predicate it. If we're going to predicate this, we're actually going to execute both sides, regardless of the value of the if clause, the conditional clause here. So if we look at what we do, we've actually restructured the code a little bit. This is our compare, this is our if, still up here: we have a set less than. And then we say, if R1 is zero, move R2 to R4, else just leave it alone. And then here, if R1 is not zero, move R3 to R4.
And what's important to note here is that the non-destructive behavior, especially of this last instruction, is really important. Because if the move zero executed and copied a new value into R4, and then this move not zero were to somehow change R4, you'd lose the result from the previous one. This really depends on these operations being non-destructive when their condition is not true. Okay. So let's analyze this from a number-of-instructions perspective. First question: does it matter which way the condition goes? Is it going to take a different number of instructions, or a different number of cycles, depending on the direction of the condition? Well, no; there's no branch, so nothing's going to change. In all cases, this is going to take three cycles. So all of a sudden we've transformed something which could take, as I said, four cycles one way and five cycles the other; if the branch goes each way 50 percent of the time, we'll split the difference and call it four and a half cycles on average. And we turned it into something which takes three. Well, that's a great win. So where else is this useful? Because you don't always have both if and else clauses to make this work out. Well, this also works well for small code hammocks. A small code hammock is just an if which jumps around a small piece of code. If you can't predict that if very well, then it makes sense to just always execute what's inside that code hammock, that small piece of code. If, for instance, you have two instructions inside of a code hammock and your mispredict penalty is something like ten cycles, and the branch is taken 50 percent of the time either way, all of a sudden your average mispredict penalty is going to be something like five cycles.
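The four-and-a-half-versus-three arithmetic above can be written out as a tiny cost function; this is just the lecture's instruction counts (used as a rough cycle proxy), parameterized by the taken probability and the mispredict penalty.

```c
/* Expected cycles for the branchy min sequence, assuming the
   branch is predicted not-taken, as in the lecture's example:
   fall-through path = 4 instructions (slt, beqz, mov, j),
   taken path = 3 instructions plus the mispredict penalty. */
double branchy_avg(double taken_prob, int mispredict_penalty) {
    int fallthrough_cycles = 4;
    int taken_cycles = 3 + mispredict_penalty;
    return (1.0 - taken_prob) * fallthrough_cycles
         + taken_prob * taken_cycles;
}
```

With a 50/50 branch and a two-cycle penalty this gives 0.5 * 4 + 0.5 * 5 = 4.5 cycles on average, against a flat 3 cycles for the predicated version, which is the win being claimed.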
You could have just executed the code in the middle; that would have been faster. You could have executed the two instructions and then just done a conditional move at the end, and that would have been basically always better. Okay, so we have some questions here. What if the if-then-else has lots of instructions? Hm, so let's say we have an if-then-else here, but there are a thousand instructions in the then and a thousand instructions in the else. Should we predicate this? Let's say the branch mispredict penalty is ten cycles and the branch goes 50 percent either way, but there are a thousand instructions in the then clause and the else clause. Well, no, that would never make sense to do. Because you're basically doubling your code, and if there's a lot of code there, you'd be better served by just doing the branch. You'd be taking something that could be 1,000 instructions and turning it into always 2,000 instructions, doubling the number of instructions executed, versus just adding, let's say, ten cycles of branch mispredict penalty, or an average of five cycles of branch penalty, on top of the 1,000 that you have to execute. So you're only adding a little overhead. So where this predication really helps is for small pieces of code where you don't know the branch outcome, or can't predict the branch outcome very well. Okay, what if the code is unbalanced? So you have an if-then and an else, but the then clause and the else clause are very different in size. One is three instructions, and the other is a thousand instructions. Does this make sense? Well, it goes back to the same argument we had before with the very large if-then-else clauses. If the code is really unbalanced, you have a lot in the then clause and a little bit in the else clause, or something like that.
And let's say the branch is taken 50 percent of the time. It might make sense just to use an actual branch there and deal with the mispredict penalty cost, versus trying to somehow predicate it. Because in fact, what's going to happen is, let's say you have three instructions and a thousand instructions on the two different sides. If you predicate, you're going to be executing a thousand and three instructions every single time, plus maybe some conditional moves at the end, some overhead. Versus, if it's 50-50, and you have to add a little bit of branch penalty, your 50-50 chance is going to say, well, I'm taking a thousand plus three, adding those two together, and taking the average, basically. So it's going to be the average of that, something like 501 or 502 cycles. That's going to be better than always executing a thousand-plus cycles. Let's see, anything else we wanted to say here? One last thing I wanted to say before we move off this slide. Move zero and move not zero, sometimes called cmovs, conditional move instructions, are what's called partial predication. And we're now going to move on to full predication, where we can actually put conditions on every single instruction in our ISA. But partial predication is a little bit easier to do than full predication; it's the first step that people typically implement. And just to hammer this home one more time: predication is really turning what was control flow into a data-flow operation, or a data operation. Okay, so let's take a look at full predication.
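The whole when-to-predicate discussion boils down to comparing two expected costs, which we can sketch as a pair of functions. This is a deliberately crude model, assuming one instruction is roughly one cycle and ignoring issue width, just to reproduce the lecture's numbers.

```c
/* Expected cost of keeping the branch: average of the two path
   lengths (50/50 assumed) plus the expected mispredict penalty. */
double branch_cost(int then_len, int else_len,
                   double mispredict_rate, int penalty) {
    return 0.5 * (then_len + else_len) + mispredict_rate * penalty;
}

/* Cost of predicating: always execute both sides, plus a few
   select/conditional-move ops at the end. */
double predicated_cost(int then_len, int else_len, int select_ops) {
    return (double)(then_len + else_len + select_ops);
}
```

Plugging in the lecture's cases: for a two-instruction hammock with a 50 percent mispredict rate and a ten-cycle penalty, the branch costs about 2 + 5 = 7 while predicating costs about 5, so predicate; for two thousand-instruction sides, the branch costs about 1,005 while predicating costs over 2,000, so keep the branch.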
With full predication, we're actually going to change the instruction set such that all of the instructions, or almost all the instructions, can be executed conditionally based on some predicate. If the predicate's false, the instruction is not going to do anything; it's not going to have any side effect. So let's look at a code sequence here. In this code sequence we have an if, a then, an else, and then some code after we're done. So here we have the if. It's a little bit different; this is not actually MIPS code, this is just some sequential code here, but this is our if: if a == b, then branch to B2, else fall through. Here we're going to execute our else clause. And when we're done, we're going to branch over the then clause. So, nothing too interesting here: four basic blocks. I don't know if we've introduced this term yet in this class, but I wanted to introduce the term basic block. A basic block is a piece of code which has strictly one entry point and one exit point. You could enter it from other places, in the sense that you can jump to it from other pieces of code, but once you enter the piece of code, you're going to execute it all the way to the end, and there's only one place you can leave. At the end you might branch, and you can fall out to two different places, but the exit point is in only one place. The big differentiation here is that a piece of code which has, for instance, two branches in the middle is not called a basic block; that's some bigger type of block. This is a compiler term, and it's just good to know. Okay, so let's apply full predication here. These are relatively short code hammocks: two instructions in the else clause and two instructions in the then clause. This is a really great candidate for predication.
Okay, so let's take a look at the predication here. Before we do that, I wanted to say that one of the big reasons we're trying to predicate on a VLIW in particular is that in a VLIW we have lots of extra, or could potentially have extra, parallel execution slots. So in our instruction sequence, or instruction encoding, we have extra places to put code, so it may not actually hurt us to predicate. Because, as we said last time, when we go to predicate we're basically going to execute both sides of the if-then-else. And it's bad if we have to execute both sides of the if-then-else sequentially when they're big. But if you can slide them up next to each other, if you have extra no-op slots or extra slots in your VLIW, you can actually decrease the critical path through your program. And we're actually going to do that in this example. So in this example here, we took this piece of code and added full predication. When we say full predication, we added some extra things to these instructions. They look a little strange at first. So here are instruction three, instruction four, instruction five, instruction six; these are the same ones that were back over here. But instead, we added some extra notation in front of them, these parentheses with p's in them. Those are predicate registers. So we're going to add a second register file to our processor where we can store effectively one-bit values, true or false values, that are going to determine whether an instruction executes or doesn't execute. So what this means is: if p1 is one, execute this; if p1 is zero, nullify this instruction three here. And we have a double pipe here. What am I trying to show with this? What I'm trying to show is that this is parallel.
This and this are executing at the same time, because we have a very long instruction word processor that can execute multiple things at the same time. And what we've done here is actually slid the then up next to the else, and we're going to execute them at the same time, dependent on these predicates. So this is the else block, and this is the then block, and we're going to execute them concurrently. Okay. The stuff before the if, and the stuff after the if-then-else, are all still there; those come from that four-basic-block example. Now, what's this instruction? This instruction has very interesting semantics, and this is something that you probably have to add if you're going to have full predication. Okay, so let's look at the compare. This is our comparison, if a == b, and it writes into two outputs. Ooh, that looks broken, that looks wrong somehow. How can we have one instruction write to two things? And what are the semantics of it? What does it write here and here? Well, what this is actually doing is writing the positive outcome of the compare to one predicate, and the inverse, the negation of that, to the other. And this is useful, because then we can use these predicates down here. We can use the positive one and the negative one, and they can drive whether we execute the else clause or the then clause. So instruction sets that have full predication typically do something like this. This is one option: you can have two outputs, and you load two predicate registers. The other option is, when you actually go to encode the instructions, you have a sort of "not" bit. You denote the predicate register, but you also denote whether it's the predicate register being true, or the predicate register being false, that drives whether the instruction executes in our full predication scheme.
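Here is a small C model of that scheme, purely for illustration: the compare writes both a predicate and its complement into a little predicate register file, and a predicated add is nullified when its guard is false. The register-file size and the particular guarded operation are made up; a real ISA would guard every instruction this way.

```c
#include <stdbool.h>

/* A small predicate register file, like the second register
   file of one-bit values described above. */
static bool pred[4];

/* cmp.eq p_true, p_false = (a == b) : one compare, two outputs.
   Writes the outcome to one predicate and its negation to the other. */
void cmp_eq(int p_true, int p_false, int a, int b) {
    pred[p_true]  = (a == b);
    pred[p_false] = !(a == b);
}

/* (p) add rd, rs, rt : executes only when predicate p is true;
   when p is false the instruction is nullified, no side effect. */
void padd(int p, int *rd, int rs, int rt) {
    if (pred[p])
        *rd = rs + rt;
}
```

So "if (a == b) x = a + 1; else x = b + 2;" becomes straight-line code with no branch: cmp_eq(1, 2, a, b); padd(1, &x, a, 1); padd(2, &x, b, 2); and on a VLIW the two guarded adds could sit in the same long instruction.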
So in effect, this is computing p and p-bar at the same time, writing them to two different registers, and then we can use those separately down here. An interesting result here: back in 1995, Scott Mahlke showed that you can, on average, remove something like 50 percent of your branches by using full predication. Now, full predication is not easy to build. Very few instruction sets have actually had full predication. Itanium was probably the closest to a real processor that was built that has almost full predication; it doesn't have quite full predication. The HP PlayDoh instruction set, I think, actually had full predication. I think that's what Scott Mahlke was working with, or a derivative of that. So some people who've thought about this have had full predication, but none of these things are easy. First, you need to add an extra register file for the predicates; you need to bypass those predicates; you need to compute them. There are some overheads involved, and we're going to talk a little bit more about them.