Okay. So now let's start talking about how to leverage different techniques to get a lot of the performance that we got out of our out-of-order superscalar, but in a VLIW processor setting. I want to break this into two different things. First, dynamic events: things that happen in the processor that you can't really do anything about statically. You just don't know what's going to happen with something like an interrupt or an exception. How can a compiler know where the exceptions are going to happen? It's just not really possible. And then we'll talk about speculation, which an out-of-order superscalar does, and a lot of its performance falls into this case. We can actually add things to classical VLIWs to get us closer to the performance improvements, the instruction-level parallelism improvements, that superscalars get from speculation.

We're talking about two different types of speculation today. First, we're going to talk about how to speculatively move instructions over branches. In particular, this gets really interesting with loads moving across branches, because you have a load and you want to move it as early as possible. You want to pull the load up early in your program so that if it misses in the cache, or has to go out to main memory, you can hide the latency while you're doing other things. Second, we're going to talk about speculating that two memory addresses, one of a load and one of a store, say (you could also think about this with loads against loads and stores against stores), are different, because we want to reorder those. Similarly, we want to get loads started as early as possible and stores started as late as possible, so we have enough time to do the computation when we do the scheduling. And remember, in VLIWs we're doing all this scheduling statically, in the compiler. As you may recall, when an out-of-order superscalar does load-store speculation and reorders loads and stores dynamically, it needs an extra structure to figure out when it made a mistake, when a reordering was not valid. We have to add that same structure, or a very similar one, to our VLIW processor.

So let's start off by talking about what the compiler has to do, how the compiler even starts to think about this speculation. I want to introduce a topic here from compilers, as an introduction to how we schedule for VLIW: the nitty-gritty of what the instruction scheduler, the back end of the compiler, is going to be doing. Here are two pieces of code: before code motion and after code motion. So what is code motion? Code motion is reordering instructions in order to tolerate the latencies of the instructions, or to do something else useful. It might not just be tolerating latencies; it might also be reordering for alignment issues. Sometimes branches can only happen on certain alignments, or the performance is better there. But roughly, we're trying to reorder the code, or rather, do valid reorderings of the code. We can't just take a read-after-write dependence and move it someplace else: if instruction A reads the result of instruction B, we can't move A before B, because it just wouldn't make any sense.
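A made-up stand-in for the example, with the same ingredients we're about to walk through, looks roughly like this in Itanium-flavored pseudo-assembly (every register and label here is a placeholder):

    // before code motion:
            st8     [r7] = r16         // save r16 away (r16 is never computed here)
            add     r3  = r4, r5       // independent work
            ld8     r1  = [r8]         // the load: trouble if it cache-misses
            add     r2  = r1, r3       // directly dependent on the load (read-
                                       // after-write), so it can never move above it
            cmp.ne  p1, p0 = r16, r0   // the branch at the end depends on r16
      (p1)  br.cond top

    // after code motion: load pushed up, store pushed down, branch stays last
    // (safe only if the compiler can prove [r7] and [r8] never alias):
            ld8     r1  = [r8]
            add     r3  = r4, r5
            add     r2  = r1, r3
            st8     [r7] = r16
            cmp.ne  p1, p0 = r16, r0
      (p1)  br.cond top

    // after bundling, which we'll get to in a moment; braces mark operations
    // that execute in parallel as one VLIW instruction:
    {       ld8     r1  = [r8]
            add     r3  = r4, r5
            cmp.ne  p1, p0 = r16, r0 }
    {       add     r2  = r1, r3       // in a real schedule you'd want more
            st8     [r7] = r16 }       // distance between the load and this use
      (p1)  br.cond top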
A lot of times, though, the compiler has some flexibility in what it can move around. Okay, so let's look at this example case. We've got a little bit of code, and there are two things to note. One, this branch at the end is dependent on register sixteen. Where does register sixteen get computed? Does it actually get computed anywhere in here? No. But we are saving register sixteen away. So does register sixteen get computed? Maybe not, but we probably want to give it as much time to be computed as possible. Another thing we have in here is a load. What if the load cache-misses? Well, this next instruction is directly dependent on that load, so we probably want to give the load as much time as possible to execute. Traditionally in code motion, you want to push the loads up and push the stores down. But notice that if we push the loads up as high as possible and the stores down as low as possible, we're going to move a load and a store past each other. And the question comes up: what if those two addresses are the same? Hm. Well, let's assume they're not, and let's assume the compiler can prove they're not. Then you can do code motion here, where we reshuffle all the instructions: we push the load up, push the store down, and leave the branch at the end.

So we've done some code motion. The next thing we want to do is bundle, or schedule, the instructions. When we say bundle, we mean taking several operations and packing them together into one VLIW instruction, one VLIW bundle. We denote that here with these braces, which say these execute in parallel, at the same time. So first you want to do the scheduling, and then you want to somehow bundle non-dependent instructions together. There have been a couple of algorithms that try to do the code motion, the scheduling, and the bundling all together, and those don't always work out as well. So most VLIW compilers now will actually do the code motion and the scheduling first, and then do the bundling at the end.

Okay. So let's look at our first technique to get at instruction-level parallelism, our first type of speculation in VLIW processors. What's our challenge here? What's our problem? Well, our problem is that the presence of a branch restricts the forms of code motion the compiler can do. Say we have a branch and some code after the branch. If there's, let's say, an add instruction after the branch, it's pretty easy to move it before the branch. But how about an instruction which can take a fault or an interrupt? Loads are a good example of one of those sorts of instructions. How is that possible? Well, a load can take a page fault: it can try to access some address which is not mapped in by the operating system. So let's say we take this load and we move it above the branch. Okay, that's not too bad. Now, what happens if that load takes a trap or an interrupt? That was never supposed to happen. In correct program sequence order, the branch was going to branch around the load, so the load was never going to execute.
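As a concrete sketch of that hazard, in the same pseudo-assembly (the null-pointer test is made up; the point is just that the branch guards the load):

    // original: the branch protects the load
            cmp.eq  p1, p0 = r6, r0    // say r6 is a pointer that may be null
      (p1)  br.cond skip               // if it's null, branch around the load
            ld8     r1 = [r6]          // only runs when r6 is a valid address
    skip:

    // after naively hoisting the load above the branch:
            ld8     r1 = [r6]          // faults when r6 is null, even though the
            cmp.eq  p1, p0 = r6, r0    // original program never executed this
      (p1)  br.cond skip               // load on that path
    skip: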
But all of a sudden, the load got moved up, and it executes and takes an interrupt. Well, your program crashes. It takes either a SEGV or a bus error, if you're on something like a SPARC system; some flavor of segmentation fault. Even though nothing actually went wrong. This branch was supposed to keep this load from ever executing. R1 here, let's say, would just get loaded with some bogus value if the load executed when the branch was taken, and the compiler knew that, and the programmer knew that. But all of a sudden, the compiler decides to reorder things and moves the load up, because it wanted to try to get more performance. This is something that out-of-order superscalars do too: they pull non-dependent loads up in the program and move them across branches. But in an out-of-order superscalar, if you pull a load up and it does something wrong, it's just speculative state; you squash it and don't care about it. So we need some way to have loads that don't hurt the system. It's a similar sort of idea, but we can't simply move the load above the branch, because it might cause some sort of exception.

So a solution to this is to add special instructions, special versions of the instructions, which don't take faults. If the fault were to happen, the instruction remembers that it happened, but it doesn't actually interrupt the process. This allows us to pull some of these instructions above branches, to perform this code motion statically, at compile time.

Okay, so let's take a look at this example. How would we go about doing this? We're going to move the load up, let's say as high as possible, above these two instructions, instruction one and instruction two. And what do we need to do? Well, note that we don't have the word "load" here anymore; we have Load.s. It's a speculative load, and it has different semantics. The Load.s never causes an exception, even if it hits a page fault, a null-pointer dereference, or any bad underlying memory reference like that. It's not going to actually take that fault. Instead, it's just going to remember it. If you look at something like Itanium (this is the sort of code sequence out of Itanium processors), the way this is done is that the destination of the load, in this case R1, is set with a poison bit. In Itanium this is actually called a NaT, and what it means is that the destination register of the load is marked as Not a Thing: it's not there, it's not a real value. And what's important is that when you go to use that value, that's when you take the exception. So let's take a look at this example. Down here, where the load originally was, we put in a speculative check instruction, and it uses R1. What this is going to do is check the value of R1, and if R1 took a trap or exception, if there was some problem with it, we jump to fix-up code. There will be some other piece of code somewhere else we can branch to, and the fix-up code re-executes the load, makes sure everything is okay, and then jumps back to the use. So it's speculation: we're betting that in the common case, when we pull this load up, nothing bad is going to happen.
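Here is roughly what that transformation looks like. The ld8.s and chk.s mnemonics are real IA-64 instructions, but the registers, labels, and filler instructions are made up for illustration:

            ld8.s   r1 = [r4]          // speculative load, hoisted to the top: a
                                       // fault just sets r1's NaT bit, no trap
            add     r7 = r8, r9        // instruction one (independent work)
            add     r10 = r11, r12     // instruction two
      (p1)  br.cond elsewhere          // the branch that used to guard the load
            chk.s   r1, fixup          // where the load used to be: if r1's NaT
    use:    add     r2 = r1, r3        // bit is set, branch to the fix-up code
            ...
    fixup:  ld8     r1 = [r4]          // fix-up code: re-execute the load non-
            br      use                // speculatively, take the real fault if
                                       // there is one, then jump back to the use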
But we put a check in just to make sure that's the case. One other thing I want to point out is that these poison bits, these NaTs, propagate. Why is this important? Let me be a little more precise here. If you try to use R1 somewhere else before you do a check, the NaT propagates; it doesn't take a trap. It only takes a trap, or jumps to fix-up code, when you do the actual check: when you have a special check instruction, or another flavor of load you can put there that will re-execute the load, a non-speculative load to the same address, if the original load didn't work.

Okay, so why is this propagation of NaTs important? Well, it allows you to pull the code down here up above this branch as well. We can pull the load up, and we can pull dependent instructions up, just as long as we do the check at some point and make sure the load didn't take an exception. We can basically speculate that the load is fine, and speculate that the instructions that depend on the load are fine. And even if the instructions that depend on the load are not fine, we've speculated on them too, so at some point in our code we introduce this check instruction, which sees whether the value is a NaT. The NaT propagates all the way through your code: say you have a load, then an add, then a sub, all of which somehow read from register one, a dependence tree rooted at register one running through that code sequence. At some point, after the branch has actually happened, you want to check whether everything's okay. And because it all propagates, you can basically do the work early and then check later. When you do the check later, either everything's okay and the register has a real value in it, or it has a NaT in it, this poison bit set and propagated all the way through. At that point, you jump to fix-up code, which re-executes the load and all of the code you pulled up, all that speculative code. That's why you need fix-up code and not just re-execution of the load: the fix-up code might have to re-execute more than just the load. It might have to re-execute the load and all the other code you pulled up above the branch.

Okay. So let's move on to another kind of speculation, and this is data speculation. In data speculation, our problem is that memory hazards are going to limit our scheduling, especially since we have to do all the scheduling statically. If you're conservative in your compiler and you can't prove anything about the different addresses, then all of a sudden every single load is dependent on every other load, and every single store is dependent on every other store. Now, why do I say loads are dependent on other loads? Most loads you can reorder pretty easily; you have to worry a little bit about one taking a trap before the other, but I guess we already talked about how to solve that problem. The real issue is that once you reorder a load past a store, or a store past a store, or anything around a store, you start to have problems, because all of a sudden some state changed in your memory system, and you have to make sure the later instructions actually pick up the result of that. Later loads and stores somehow have to honor the result of that store. So this is really focused on moving loads past stores.
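To make the hazard concrete, here's a minimal made-up sketch:

    // original program order:
            st8     [r5] = r2          // store through pointer r5
            ld8     r1 = [r6]          // load through pointer r6

    // after hoisting the load to start it earlier:
            ld8     r1 = [r6]          // wrong if r5 == r6 at run time: the load
            st8     [r5] = r2          // reads the stale value instead of the
                                       // one the store was about to write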
So what's our solution? Well, our solution is to add a hardware check to guarantee that the two memory addresses, when you reorder a load past a store, or a store past a store, are not the same. And this is actually the same thing that we did in our load-store queue, the memory disambiguation structure we talked about for out-of-order processors. It's the same sort of thing going on here, but we need to actually change the instruction set to handle this in a VLIW setting, whereas in the superscalar it was all microarchitectural, below the instruction set, where no one ever had to look at it.

Okay, so let's look at how to do this. We have a code sequence here: some instructions, a store, and then a load. The compiler can't prove anything about this store and this load. An out-of-order superscalar could guess, or speculate, that the load and the store don't share the same address, and if they did, it could roll back and re-execute the load. We're going to emulate something like that, but doing it all statically, with the compiler.

So how do we do this? Well, this is something else that Itanium added, and it was also discussed in earlier research on VLIW, or explicitly parallel instruction, processors. What you're going to do is move the load up above the store, basically speculating that the load and the store don't hit the same address. And much the same way that, when we reordered loads past stores in an out-of-order superscalar, we added the load address to a load-store queue or memory disambiguation queue, we're going to do the same sort of thing here. When we move this load up, we use a different instruction: Load.a, an advanced load. What this does is perform the load and also add the load's address to a special table. What does ALAT stand for? Advanced Load Address Table; that's what Itanium calls this. It's an associative structure where you can put addresses in, and then check addresses from other instructions against it. If you get a hit, you know that something bad happened, and you can re-execute. So we issue the load, and we get the load out into the memory system early; we can overlap the time in the memory system with executing other instructions. Then we have a store here. Now, the store is going to take its own address, and we've changed the instruction set here: we've added these two new instructions and changed the semantics of a store. The store checks in the ALAT, the Advanced Load Address Table, to see if the address the store is writing to is already there. And if it is, it's going to change the table somehow.
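In the same Itanium-flavored pseudo-assembly, the transformed sequence looks something like this; ld8.a and ld8.c are real IA-64 instructions, and the registers and filler work are made up:

            ld8.a   r1 = [r6]          // advanced load: issues early and enters
                                       // its address into the ALAT
            add     r7 = r8, r9        // other work overlaps the load's latency
            st8     [r5] = r2          // every store checks the ALAT; on an
                                       // address match it kicks the entry out
            ld8.c.clr r1 = [r6]        // check load, where the load used to be:
                                       // if r1's ALAT entry is gone, re-execute
                                       // the load right here

If dependent instructions were hoisted along with the load, IA-64 instead uses a chk.a instruction that branches to fix-up code, much like chk.s does for control speculation.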
We'll talk in a second about what exactly happens, but roughly, the store is going to take that address out of the table. Then, when this load check happens down here, it checks to see if the address is still in the table. If it's been bumped out of the table, it jumps to fix-up code, which basically re-executes the load, or re-executes the load and all the dependent instructions of the load. So we can effectively do data speculation, but all in software. This does require us to have an associative hardware block, some sort of content-addressable memory, or CAM, that all load addresses and all store addresses are passed across. This is similar to the memory disambiguation queue we had in our out-of-order superscalar.

Okay, so let's look at the ALAT, the Advanced Load Address Table. What do we have in here? Well, three things. We store the address. We store the size; this is important so that when we go to do matching, we know whether it's a word, double word, half word, or byte. And when a Load.a happens, it adds an entry into this table, and we also record the register number that is the target of the load. That's useful when we go to do the check: it's matched to make sure this is the same load from before, so we can effectively remove the address from the table when we do the check. Okay, so we've added an address. Now we're running along, and we execute stores, but say the stores don't hit on that address. When a store executes, what does it do? It looks in this table and CAMs, does a comparison, against all the addresses in the table, with size matching and everything. And if its address is in the table, it removes it from the table. In that case, by the time the load check happens, the address is not in the table anymore: the load check looks in the table again and says, oh, this address is not in the table anymore. At that point, it knows it has to re-execute that load. But if the address is still in the table, that means no store hit it and bumped it out in the intervening time, so the speculative load was fine. So we've effectively done what out-of-order superscalars can do: we can now move loads above stores through a changed instruction set architecture, by adding these special load check and check instructions, by changing the semantics of stores, and by adding a Load.a, or advanced load, instruction. We can have data speculation in our program.

Okay. A couple of other things that are fun to add to classical VLIWs to improve performance; I want to breeze through these next two a little bit faster. The first one is multiway branches. The problem: you have a VLIW, a very long instruction word architecture, with, let's say, ten operations in one instruction, one bundle, but you have short runs between branches, or multiple branch decisions happening, maybe a branch every three instructions. Well, then you have lots of wasted slots, unless you can somehow have one instruction branch multiple directions. You might say, branching in multiple directions? That's a really weird idea. How do you have an instruction go multiple ways?
Branches can only go one of two places, right? They either fall through or go someplace else. Well, we can actually change the instruction set to have a branch that can go multiple directions. Let's look at this with a full-predication example. We have two bundles of code here. The first one is three instructions, all executed in parallel, and what it does is three different compares. This is like our full-predication example: it writes these predicate registers with the logical-positive and logical-negative versions of each predicate comparison. Okay, so that stages us to do three different branch decisions. Then, down here, we have three branches that all branch based on the different predicates we set up above, and they can branch to three different places: label one, label two, and label three. There could also be fall-through code, if none of the branches are taken. Now, there's one little wrinkle here where you have to figure out the semantics, and I should say Itanium supports this; you can buy real chips with this in them. If you have three branches together, you need to define an order over those branches, because let's say they all resolve to actually being taken. Which one do you take? You could branch to three different places, and we want deterministic execution. So typically, when you have an instruction set like this with multiway branching, you define some order: maybe the first one has priority over the second one, which has priority over the third one, et cetera.
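Here's a sketch of those two bundles; the compares and labels are made up, but the bundle shapes (IA-64's .mii and .bbb templates) and the multiway-branch semantics are real:

    { .mii
            cmp.eq  p1, p2 = r10, r11  // three independent compares, each
            cmp.lt  p3, p4 = r12, r13  // writing a predicate pair: the logical
            cmp.gt  p5, p6 = r14, r15  // positive and negative of the test
    } ;;                               // stop: the branches read these predicates
    { .bbb
      (p1)  br.cond label1             // three branches in one bundle; the
      (p3)  br.cond label2             // earliest taken branch in slot order
      (p5)  br.cond label3             // wins, and if none is taken we fall
    }                                  // through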