Okay. So let's talk about some of the other things that were on our list of limiters to instruction-level parallelism, and ask ourselves: are these things we can solve easily? Well, things start to get harder pretty fast here. Like I said, dynamic events are basically things the compiler has no way of reacting to. Now, the instruction set might be able to react to them; you can add things into the instruction set, and you might take a branch dependent on a dynamic event. But let's look at some dynamic events.

First one here: the cache miss. There's really no way for the compiler, all things being equal, to know whether a load is going to take a cache miss or not. You might have some guesses that could influence the code, but actually doing something about it is hard. Think about what an out-of-order superscalar does: if you take a cache miss, it reschedules the code around the cache miss. It takes the instructions that are not dependent on the load, pulls those up, and tries to execute those. But if the load hits in the cache, you want a totally different schedule; you want to start executing the instructions that are dependent on that load as soon as possible. So an out-of-order superscalar has dynamic means to do this, it has dynamic instruction execution, but our VLIW processor, which is statically scheduled, can't really do this.

So what are some techniques to go after this? One is something called informing loads. As far as I know, this has not actually been built in any hardware, but it has been proposed, at least in computer architecture academic circles, in the computer architecture literature. The original paper on informing loads was co-authored by one of our faculty members here at Princeton, Margaret Martonosi. The basic idea is that if you have a load that misses in the cache, you don't execute the subsequent instructions; you basically nullify those instructions. And this allows you to change the code sequence depending on whether the load hits in the cache or misses in the cache. This work was done with Todd Mowry, I think when some of the authors were still graduate students, and Mark Horowitz from Stanford was also on the paper; I'm trying to remember who the last author, the professor, was on this work.

Another option is something that the Elbrus processor people did. You've probably never heard of this processor. I actually don't know if it was ever finished (I think they got a prototype), and it was built in Soviet Russia right as the Soviet Union was breaking down. This was the design house in Soviet Russia that made all of the military processors, and that same design team later went off to build commercial processors after the fall of the Soviet Union. So, [laugh], they went to build this processor, and it's a VLIW, a very long instruction word processor, and they had an instruction in there that tried to solve this dynamic event: branch around probable cache misses. What it said is, if the load misses in the cache, go execute a different piece of code, with a different schedule, than if the load had hit in the cache. So you could effectively have two different codes here, because the compiler can generate two different code sequences, and you can get back almost exactly the performance an out-of-order machine could get.
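To make that concrete, here is a minimal C sketch of the two-schedule idea. This is only a model: probe_and_load() is a made-up stand-in for an ISA feature that performs a load and also reports whether it hit in the cache, and on a real VLIW the compiler would emit two statically scheduled instruction sequences rather than a C if/else.

    #include <stdbool.h>

    /* Hypothetical stand-in for an informing/probing load: performs the
       load and reports whether it hit in the cache. Real hardware would
       expose this in the ISA; here we just pretend it always hits. */
    static bool probe_and_load(const int *p, int *out) {
        *out = *p;
        return true;   /* placeholder: a real machine reports hit/miss */
    }

    int fused_step(const int *p, int a, int b) {
        int v;
        if (probe_and_load(p, &v)) {
            /* Hit schedule: the loaded value is ready, so start the
               dependent multiply right away. */
            int dep = v * a;
            int ind = a + b;      /* independent work comes second */
            return dep + ind;
        } else {
            /* Miss schedule: issue the independent work first so it
               overlaps the miss latency; touch v as late as possible. */
            int ind = a + b;
            int dep = v * a;
            return dep + ind;
        }
    }

Both arms compute the same result; only the ordering differs, which is exactly what the two compiler-generated schedules would differ in.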
This processor never really made it commercially, and later the company went under. I actually don't know if the assets were bought by Intel, but at least all the people who worked at this company now work at Intel, effectively; they still live in Russia. So that's a funny story there. It didn't work out from a commercial perspective, but they had some cool ideas in there, like the idea that you could actually schedule around cache misses.

Okay, so some other things: branch mispredicts. Well, we already talked about one technique here; you can add predication. But that doesn't help if you have big pieces of code, big code sequences and big code hammocks; you can't necessarily predicate your entire program. So what do you do instead? This is the hard one to deal with here. One solution people have come up with is to add branch delay slots. Say you have a VLIW processor that's three wide. You add branch delay slots to your instruction set, and you use predication in the delay slots. What this lets you do is mask some of the branch mispredict penalty and change the code schedule a little bit depending on the prediction, or really depending on which way the branch goes. The way this works is that in the delay slots you use the same predicate that the branch is branching on. So if the branch is branching on, let's say, A equals B, we use that same condition in the predication, and effectively you can schedule the code differently depending on whether the branch is taken or not taken. And because you're putting this in the delay slots, you can get around some of the problems of whether it goes one direction or the other, no matter which way the mispredict happens or whether the branch can be predicted correctly at all. What we're doing is putting code in the delay slots that will always execute, but it can be predicated, and you can actually pull code up from the two destinations of the branch. So it's a way to get around some of the branch mispredict penalty.

This was actually done in a research processor that was built at MIT, and I think it's probably also done in some of the HP processors. The MIT processor, at least, was called the M-Machine, out of Bill Dally's group. He's now at Stanford, but I think the M-Machine was built right when he was moving from MIT to Stanford. They had, I think, three delay slots, they were three wide, and they could predicate the instructions in the delay slots.
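As a rough illustration of the idea, here is a loose C model; C has no notion of delay slots or predicated instructions, so ordinary conditionals stand in for them. The compiler hoists one instruction from each branch destination into the slots, guards each with the branch's own predicate, and the slots therefore do useful work whichever way the branch resolves. The helper functions are made up for this sketch.

    /* Loose C model of predicated code in branch delay slots. Each
       "slot" always executes, but its effect is selected by the same
       predicate the branch tests; in hardware these would be
       predicated VLIW instructions, not C conditionals. */
    static int taken_path(int t)    { return t + 100; }  /* hypothetical */
    static int fallthru_path(int t) { return t - 100; }  /* hypothetical */

    int branch_with_delay_slots(int a, int b, int x) {
        int p = (a == b);             /* the predicate the branch tests */

        /* Delay slot 1: work hoisted from the taken target if p,
           from the fall-through target if !p. */
        int t = p ? x + 1 : x - 1;

        /* Delay slot 2: same idea, chained on slot 1. */
        t = p ? t * 2 : t * 3;

        /* The branch itself; by now the slots have already executed the
           first couple of instructions of whichever path we take. */
        return p ? taken_path(t) : fallthru_path(t);
    }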
The last thing is exceptions. You take an exception, and you'd want to schedule different code; but these are basically impossible to predict, and the compiler has no way to try to predict them. This is hard on a superscalar, too. In a traditional superscalar, when an exception happens, they usually end up flushing the pipe anyway. And it doesn't happen that often, so it probably doesn't hurt your performance that much; no one's going to lose too much sleep over this one.

So briefly, I wanted to say something about how to build really wide VLIWs. As we start to go to wider and wider VLIWs, with lots of instructions executing at the same time, you have to start thinking about what the register file and the bypass network look like. In this drawing here we actually have a figure of the TI C6000-series processors; these are TI DSPs, sort of TI's flagship DSP processors, and this is a block-level diagram of what they have. What they've done is divide the machine into local register files, and they bypass inside of what are called clusters. So this is a clustered VLIW, similar to how we have clustered superscalars, which also divide the register file, but here it's a big architectural split, an ISA-level split, that's going on. In something like the C6400, they have four instructions per cluster, so they are executing eight instructions at the same time. You can bypass values among the four ALUs within a cluster, but if you want to take a value from one cluster and move it to the other, you have a very low-bandwidth bypass path, and it takes an instruction: you effectively have to have a move instruction to move between the two clusters. So there's lower bandwidth between the clusters and higher latency between the clusters, but inside of a cluster it's very fast. And what's important to know here is that these are not two processors. They're all executing one instruction at the same time, so it's an eight-wide instruction executing on these eight different ALUs.

This is used in the high-end TI DSPs, and it's also used in HP and STMicro's LX processor. That's probably a processor you've never heard of, but it's actually what Josh Fisher went on to build at HP Labs after Multiflow. So after some of the original VLIW work, the same person went and built this LX processor, a joint collaboration between STMicro and HP. And it shows up in printers today. The LX processor is probably not something you're going to have on your desktop machine.
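To make the cluster cost concrete, here is a toy C model of that picture: two clusters, each with its own local register file, where operations within a cluster are cheap and crossing clusters takes an explicit move. The names and the cycle counts (1 cycle local, 2 cycles for the cross path) are made up for this sketch; the point is just that the inter-cluster move burns an instruction slot and adds latency.

    #include <stdio.h>

    enum { REGS_PER_CLUSTER = 16 };

    struct cluster { int r[REGS_PER_CLUSTER]; };   /* local register file */
    struct vliw    { struct cluster a, b; long cycles; };

    /* Within-cluster add: fully bypassed, 1 cycle in this toy model. */
    static void add_local(struct vliw *m, struct cluster *c,
                          int dst, int s1, int s2) {
        c->r[dst] = c->r[s1] + c->r[s2];
        m->cycles += 1;
    }

    /* Cross-cluster move: a real instruction on the low-bandwidth path,
       with extra latency -- the cost the lecture is pointing at. */
    static void xmove(struct vliw *m, struct cluster *dc, int dst,
                      const struct cluster *sc, int src) {
        dc->r[dst] = sc->r[src];
        m->cycles += 2;   /* assumed cross-path latency */
    }

    int main(void) {
        struct vliw m = {0};
        m.a.r[1] = 3; m.a.r[2] = 4;
        add_local(&m, &m.a, 0, 1, 2);   /* cluster A: r0 = r1 + r2 */
        xmove(&m, &m.b, 5, &m.a, 0);    /* ship the result over to B */
        m.b.r[6] = 10;
        add_local(&m, &m.b, 7, 5, 6);   /* cluster B consumes it */
        printf("result=%d cycles=%ld\n", m.b.r[7], m.cycles);
        return 0;
    }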