So we're going to continue today talking about VLIW, or Very Long Instruction Word, processors. This is an alternative way to exploit parallelism in a processor: instead of doing the scheduling dynamically, as we've been discussing with out-of-order and superscalar processors, we do it statically, in the compiler, and build architectures that can exploit that.

But first, let's look at why. We've been talking about superscalar processors, and superscalar processors have a fair amount of complexity in them, especially out-of-order processors. Where does this complexity come from, and can we remove it? That's what we're going to be talking about today.

So here I drew a picture where I have the issue width of a superscalar processor, which we represent as W, and the lifetime of an instruction. That gives you a matrix of things in flight in the processor. These may be sitting in your instruction queue. We're not going to count the ones that get all the way to the reorder buffer, because by that point, depending on how you build your processor, you probably don't need to keep checking them against incoming instructions: you've already broadcast their results back to your reservation stations or your instruction queue to issue further instructions.

But think about the complexity of what you're really doing when you try to issue an instruction. You have to check all of the instructions in your instruction window, as results complete, against all of the in-flight instructions, across all the registers. Now, we built some structures so you don't have to do a complete all-to-all check. The first example is the scoreboard. In our original design, we had each instruction's destination register flowing down the pipe, and we did an all-to-all compare against it. With a scoreboard, we instead just keep track of the location and latency of the most recent write to each register, which turns an all-to-all compare into an index lookup based on the register number. (There's a small sketch of this right below.)

That's a little bit nicer, but think about what's still happening in the issue queue: as an instruction retires, it broadcasts its completed destination registers against the instruction queue, trying to wake up instructions and see which ones are ready to go. And that scales badly, because you're completing multiple instructions per cycle, and each one has to be checked against everything sitting in the instruction queue, so the comparisons grow with the width times the queue size. This can get painful.

Let's first look at the complexity of this for an in-order machine. An in-order machine is not so bad: L is roughly the pipeline length, you have your width W, but you basically just need a scoreboard to figure this out, and as we said, that's an index-based structure.
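To make the scoreboard idea concrete, here's a minimal sketch in Python (my own illustration, not something from the slides) where a per-register table turns the dependence check into an index lookup:

```python
NUM_REGS = 32

class Scoreboard:
    """Tracks only the most recent write to each register."""
    def __init__(self):
        # ready_cycle[r] = cycle when register r's latest write completes
        self.ready_cycle = [0] * NUM_REGS

    def can_issue(self, srcs, now):
        # One index lookup per source operand; no all-to-all compare
        # against every in-flight instruction.
        return all(self.ready_cycle[r] <= now for r in srcs)

    def record_write(self, dst, now, latency):
        # Remember only the location/latency of the most recent write.
        self.ready_cycle[dst] = now + latency
```

A check like `sb.can_issue([2, 3], now=5)` costs one table read per source operand, no matter how many instructions are in flight; that's the whole win over the all-to-all compare.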
Now, when you start to go out of order, you get the complexity that all the results coming back have to be broadcast, either to all the reservation stations, if you're distributing instructions that way, or against a single instruction queue, checking all the possible instructions that could issue. And what's interesting is that one of the big challenges of building these structures is not quite what you would think: too many things can wake up in one cycle. Let's say all of the instructions in your instruction queue are waiting for register five to become valid. All of a sudden some add writes register five and reaches the end of the pipe. Well, now you have to wake up all of those instructions and choose some subset, at most the width of your machine, that can go issue. You also still need to make sure you can satisfy all the different functional-unit mixes in the machine. So if you have, say, two adders, two multipliers, and one load/store unit, you can't just pick randomly; you need to pick exactly that mix.

This becomes a big mess of combinational logic, and it's why the front ends of out-of-order processors can get longer and longer: you add more and more stages to do this, and as we talked about last time with the Pentium 4, there are actually multiple stages out in front there for it. Unfortunately, you have to pipeline that logic, which makes this even more challenging. So roughly, if you look at the out-of-order control logic here, it's growing somewhere between the width squared and the width cubed, and that's not very good.

There are a couple of circuit-level tricks that can help. If you look at people who do full-custom logic design for processors, they'll build something they call a picker. Instead of using straight combinational logic to compute which instruction is ready to execute, a picker is a much more analog circuit, almost a CAM, a content-addressable memory. It's not quite a CAM; it's a circuit that chooses which instruction is ready to go, and typically the heuristic you want is to pick the oldest instruction. This is to prevent deadlocks: you don't want to issue arbitrary instructions, you want to issue the oldest ones, because you want to make forward progress in the machine.

So what we're trying to get at here is that this is a lot of hardware complexity, growing with the width of the machine, and some things make it even worse. I have a little note here: as you increase the width of your machine, say you have a three-wide machine and you want to go to a four- or five-wide machine, you typically also need a larger instruction queue, or instruction window to look across, in order to find enough parallelism to keep the machine busy. (A sketch of this wakeup-and-select step follows below.)
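Here's a hedged sketch of that wakeup-and-select step in Python; the entry format, the tag names, and the functional-unit mix are all invented for illustration. It shows where the cost is: the broadcast touches every queue entry, and the pick is oldest-first subject to the unit mix:

```python
from dataclasses import dataclass, field

@dataclass
class Entry:
    age: int                                      # smaller = older
    fu: str                                       # 'add', 'mul', 'mem', ...
    waiting_on: set = field(default_factory=set)  # unresolved source tags

def wakeup(queue, completed_tags):
    # The broadcast: every completing destination tag is checked against
    # every entry's sources. This is the width-by-queue-size compare network.
    for e in queue:
        e.waiting_on -= completed_tags

def select(queue, fu_slots):
    # Oldest-first pick, constrained by the available functional units,
    # e.g. fu_slots = {'add': 2, 'mul': 2, 'mem': 1}.
    issued = []
    for e in sorted(queue, key=lambda e: e.age):
        if not e.waiting_on and fu_slots.get(e.fu, 0) > 0:
            fu_slots[e.fu] -= 1
            issued.append(e)
    return issued

# Both entries wait on register tag 'r5'; one completing add wakes them all.
q = [Entry(0, 'add', {'r5'}), Entry(1, 'mul', {'r5'})]
wakeup(q, {'r5'})
print([e.age for e in select(q, {'add': 2, 'mul': 2, 'mem': 1})])  # [0, 1]
```

In hardware, this whole loop has to settle as combinational logic within a cycle, which is exactly where the width-squared-to-width-cubed growth shows up.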
So as you go wider, you also typically have to make the machine deeper, or at least make your instruction queue deeper, to find enough parallelism to feed your functional units. Hm, that's not good. That's what causes these blow-up factors. You can build these things, but they're hard. So are there alternatives? Yes. But before we move on to the alternatives, let's look at the complexity of this in a real processor.

Here we have a die photo, or die micrograph, of a MIPS R10000 processor. This was used in workstations by SGI for many years. It's a real out-of-order superscalar, and one of the interesting things is how much area is dedicated just to control logic. It's pretty substantial. In this processor, the actual datapath is relatively modest. The instruction cache and data cache are pretty big structures; you can see those. We have some other pieces, say the data tags, the instruction tags, the TLB, which are big-ish, and then all this stuff here is quite large. Let's look at what's inside. We have next-instruction logic. We have a free list; we talked about this, right? The free list tracks which registers are free to reuse in your out-of-order processor. We have a big block that just does register renaming. And we have different queues here, so this is a distributed-instruction-queue, or distributed-reservation-station, processor: there's one queue just for address-calculation instructions, one for integer instructions, and one for floating-point instructions. But the interesting thing to see here is that the percentage of the die actually doing compute, say the integer datapath, the floating-point datapath, the floating-point multiplier, sort of pales in comparison to all the overhead. So that's just to get this idea across.

Another problem with out-of-order superscalars, or superscalars in general, is how we express parallelism. I alluded to this in a previous lecture, but it's worth repeating. So let's take a look at what happens to sequential code on an out-of-order superscalar. We start off here with some piece of C code. We put it into our fancy superscalar compiler. That generates some sort of dataflow graph, and probably also a control-flow graph, inside the compiler. So there are instructions, there are dependencies between the different instructions, and the compiler is trying to find independent operations, because it's going to come up with a schedule. Unfortunately, it has to come up with a sequential schedule, because the instruction sets of sequential processors require you to come up with some order, even though it's very possible that there is no one perfect order. Instead of over-constraining the problem, why are we coming up with some ordering when we don't need to? But with a sequential instruction set, we have to. We write it out to disk, or save a file with this code, and then we go and actually try to execute it. Now what's funny, and I always get a chuckle out of this, is that the compiler knew what the dependencies were. (The little sketch below shows this flattening step.)
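As a small illustration of that flattening step (with a made-up four-instruction dependence graph, not the lecture's example), here's the compiler being forced to pick just one of several equally valid orders:

```python
def topo_schedule(deps):
    # deps[i] = set of instructions that must precede instruction i.
    remaining = dict(deps)
    order = []
    while remaining:
        # Every instruction whose dependencies are satisfied could run in
        # parallel, but a sequential ISA makes us emit a single order.
        ready = [i for i, d in remaining.items() if d <= set(order)]
        order.append(min(ready))  # arbitrary tie-break: lowest name wins
        del remaining[order[-1]]
    return order

# i2 and i3 are independent, but the linear encoding hides that fact.
print(topo_schedule({'i1': set(), 'i2': {'i1'}, 'i3': {'i1'}, 'i4': {'i2', 'i3'}}))
```

Both i1, i2, i3, i4 and i1, i3, i2, i4 are correct schedules, but the binary carries only one of them, and the fact that i2 and i3 could have run in parallel is erased at this interface.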
And then we have our superscalar processor here, which goes and takes apart the sequential code, tries to reorder the instructions, and tries to find parallelism in the instruction sequence. But the compiler knew that. We have a breakdown here in the middle: we were simply not able to express that parallelism between the compiler and the processor. The compiler knew about the parallelism; the compiler knew about all the ordering requirements. It had to come up with some sequential code sequence and provide it to the processor. The processor takes that sequential code sequence back apart and does out-of-order scheduling, like we did on the midterm. And we've built a bunch of hardware here with high complexity, as we were talking about with the combinational logic checking the dependencies, and it comes up with some schedule. This is done via our reservation stations, or our instruction queues: something is going to fire, a schedule comes out, and the instructions execute while preserving read-after-write dependencies. And hopefully, if our superscalar processor is fancy, we have a register renamer, so we can break some of the other dependencies too, the write-after-write and write-after-read dependencies (there's a small sketch of that after this paragraph), and come up with some schedule that's high performance. Now, there's no guarantee that this schedule is what the compiler would have done if it were smarter, or if it could somehow express parallelism across this interface. But it does something. And what's funny about this is that there's a whole lot of power being wasted, a whole lot of effort being wasted in the processor. There's a lot of effort by the designers of the processor, who are effectively designing all these circuits to take apart the sequential code sequence and do something useful with it. So it's a lot of work being done there. Is there a better way?
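Before we get to that, one last hedged sketch: the register renaming mentioned above, with an invented two-instruction example rather than anything from the lecture. Giving each write a fresh physical register makes the write-after-write and write-after-read hazards disappear, leaving only the true read-after-write dependencies:

```python
def rename(instrs, num_arch_regs=32):
    # instrs = list of (dest_reg, [source_regs]); registers are numbers.
    mapping = {r: r for r in range(num_arch_regs)}  # architectural -> physical
    next_phys = num_arch_regs
    renamed = []
    for dst, srcs in instrs:
        phys_srcs = [mapping[s] for s in srcs]  # read current mappings (RAW kept)
        mapping[dst] = next_phys                # fresh name kills WAW/WAR
        renamed.append((next_phys, phys_srcs))
        next_phys += 1
    return renamed

# r1 = r2 + r3; r1 = r4 + r5: the WAW hazard on r1 vanishes after renaming,
# since the two writes now target different physical registers.
print(rename([(1, [2, 3]), (1, [4, 5])]))  # [(32, [2, 3]), (33, [4, 5])]
```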