Okay, so I want to briefly give a case study here of one of the more interesting modern-day VLIW architectures, probably the most famous and possibly also the most infamous VLIW processor out there. This is the Intel Itanium, also known as Intel IA-64, or what's known as an EPIC processor: Explicitly Parallel Instruction Computing. A lot of this work was actually done in collaboration between Intel and HP. HP uses these a lot in their big servers, their big, heavy, big-iron computers, not quite mainframes. And Intel was trying to use this to effectively kill all of the other workstation vendors; this was going to be their 64-bit solution to computing.

So it's a modern, non-classical VLIW, and this was going to be Intel's chosen ISA. They were going to deprecate x86 and choose IA-64 as the 64-bit ISA. And as we now know, going a few years forward after the creation of all this stuff, that didn't really happen. Intel went and built a bunch of processors with this instruction set, and you can still buy processors with it, but it never got as good an acceptance as its competitor. The competitor at the time was called AMD64, which is a 64-bit extension to what people already had. And that's what people ended up wanting: just a 64-bit extension to what we already had, versus something totally different.

Okay, so a couple of features here. It's an object-code-compatible VLIW, so it's not quite a VLIW in the classical sense. Object-code compatible means different generations, different microarchitectures of this VLIW, can run the same binaries with no need to recompile. And how they did this, as I alluded to before, is that they had the ability to have parallelism straddle across instruction bundles, and they had this notion of groups, which we'll talk about in a second.
So, the first implementation of this was Merced, the first Intel Itanium implementation. It was kind of like what the 8086 was for x86. Merced, as you'll realize if you look at Intel code names, is named after a river. Intel likes to name things after either rivers or places; I think this has something to do with the fact that you can't trademark a place name, so they get around that and make sure they don't have any trademark issues by choosing place names for all their code names.

One of the big problems here: it was supposed to ship in 1997, but first customer shipment wasn't until 2001. That's a four-year miss, and superscalar was another thing that had caught up on it in that time. It was supposed to be faster and better than everything else, and the first one was not very good: it had low clock rates and was not as high-performance as it was supposed to be. The x86 side of Intel's business line actually had almost the same performance as the first Itanium, and then very quickly surpassed it. So their high-end processor wasn't actually high-end.

A couple of other things here. McKinley was the second implementation, and it shipped pretty quickly after that. It was a much better implementation, but these things are still hard to build, and they're still building them. In 2011 at ISSCC, Intel introduced the Poulson processor. Big machine: eight cores in 32 nanometer, with lots and lots of on-die memory, 32 megabytes of shared L3 cache. A big processor: 544 square millimeters in 32 nanometer. At the time this came out, it was the biggest processor ever built, with the most transistors, over three billion, or at least the biggest commercial one; Intel might have had a research prototype with more transistors than this.
I think their many-core research part, what they call the SCC, their Single-chip Cloud Computer, might have had more, but I should know the transistor count. From a commercial processor perspective, though, it's a huge chip. But they are selling it into extremely expensive sockets: these sell at a premium and go into big mainframe-class machines. That's not what this was originally destined for; it was destined for both big mainframes and workstations. But standing here now in 2012, it's not used in many places except for bigger hardware, mainframe sorts of things.

A few of the interesting points here: the cores are multi-threaded, and you can fetch six instructions per cycle and execute up to twelve instructions per cycle, per core, and there are eight cores. So this is a beast of a machine, a very high-performance computer.

Okay, so let's dive into some of the details of Itanium. Itanium has a 128-bit instruction bundle, and inside of there you can fit three operations, plus some bits called template bits, which say what is in the instruction bundle. So it's not actually a fixed-format bundle; the instruction boundaries can move around a little bit. They did that so you can mix in, say, an instruction with an immediate alongside instructions that don't have immediates, and get more space in the bundle for the immediate bits or a branch offset or something like that. These template bits also describe how a particular bundle relates to the bundles around it. Sometimes these are called begin and end bits, or start and stop bits: they delimit the set of instructions which can explicitly execute in parallel. And the machine doesn't necessarily have to execute them in parallel.
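As an aside for the notes, the bundle layout just described can be sketched in a few lines of Python. This is my own simplified model, not Intel's encoding tables: the field widths match the IA-64 format (a 5-bit template in the low bits plus three 41-bit instruction slots), but the real template value also encodes which execution units each slot targets and where the stop boundaries fall, which this sketch does not interpret.

```python
# Toy decoder for an IA-64-style 128-bit bundle: a 5-bit template field
# plus three 41-bit instruction slots. In the real ISA the template also
# names the unit type (M/I/F/B) for each slot and the stop positions;
# here we only split the raw fields apart.

def decode_bundle(bundle: int):
    """Split a 128-bit bundle into (template, [slot0, slot1, slot2])."""
    assert 0 <= bundle < (1 << 128)
    template = bundle & 0x1F                              # low 5 bits
    slots = [(bundle >> (5 + 41 * i)) & ((1 << 41) - 1)   # 41 bits each
             for i in range(3)]
    return template, slots

# Pack three made-up 41-bit operations with an arbitrary template value,
# then decode the bundle back into its fields.
ops = [0x1AAAAAAAAAA, 0x0BBBBBBBBBB, 0x0CCCCCCCCCC]
bundle = 0x10
for i, op in enumerate(ops):
    bundle |= op << (5 + 41 * i)

template, slots = decode_bundle(bundle)
print(hex(template), [hex(s) for s in slots])
```

The point of the exercise is just that the bundle is a fixed 128-bit container, while the template bits tell the decoder how to interpret the three variable-role slots inside it.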
So for instance, if you say twenty operations can execute in parallel, but your machine, the particular implementation of Itanium or IA-64 they built, is only two wide, you're just going to execute two wide for ten cycles, or something like that. But what's really cool here is that the compiler is able, just like in all the other VLIWs, to express the parallelism to the machine explicitly.

Some interesting things about the registers. Because this is a VLIW processor, and because you're going to have to do code scheduling like what we saw last class, that increases the general-purpose register pressure. You don't have a register renamer, so you can't go and use different names for things, and the hardware's not going to rename things for you; instead, the compiler and the software have to do the renaming. So they had 128 general-purpose registers and another 128 floating-point registers.

They also have these predicate registers. It's not quite full predication, but it's pretty close: you can have bits that say whether later instructions are going to execute or not, and you have to compute those into a little register file. So they had a predicate register file that you have to bypass. That's sort of interesting to see.

And then they had a really interesting feature here called the rotating register file. Let's talk about what a rotating register file is. The problem this is trying to solve: in a code sequence like we saw last lecture, if you have a very-long-instruction-word scheduled piece of code and you want to get good performance, you're going to have to unroll the loop, and then you're going to have to software-pipeline the loop. But when you do this, it's going to increase your register pressure, that is, increase how many register names you need to use.
And, as we saw, you're going to have to add extra special code in the prologue and the epilogue, which is different from the main loop body. So how do you solve this in one fell swoop? Well, you add a subset of your register space which will, sort of statically, rename itself every loop iteration: every iteration slightly changes the naming of the registers. What this looks like is, if you go to access, let's say, register R1, there's an architecturally visible register called the rotating register base, or RRB, whose value gets added to the register number. It's modular arithmetic, so it wraps around at the end, and that points to different locations in the physical register file.

This is pretty cool. Every single time we come to a new loop iteration, we're going to change the RRB, and it's going to point to a different set of registers. And we can effectively software-pipeline just by using this one feature.

So here we have the same code sequence we had from last lecture, the previous code example. If we recall, when we unrolled all of this, what we ended up with was a load, an add, and a store; we'll talk about this in a second. This was the key thing we were trying to execute, and we just had to unroll the code and then look at the dependencies. So let's look at the dependencies here. This load writes F1, the floating-point register F1 here, and we know this actually gets read with a latency of, say, one, two, three cycles; it doesn't get read until here. Likewise, this add here computes F5; the add, let's say, is a floating-point add with some long latency, and down here is when it's read into the store. So on a machine with a rotating register file, we don't actually generate all this code.
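The RRB renaming just described can be modeled in a few lines of Python. This is a simplified sketch of my own, not the real hardware: real IA-64 rotates only a subset of each register file and bumps the RRB as a side effect of its loop branch, while here the whole toy file rotates and we call `rotate()` by hand. The guard on the first three iterations stands in for the rotating predicates the real machine uses to suppress the pipeline fill.

```python
# Toy model of a rotating register file. A logical register name plus
# the rotating register base (RRB), modulo the file size, picks the
# physical register. Decrementing RRB each iteration means a value
# written to logical rN this iteration is visible as rN+1 next
# iteration, so rN+k after k iterations.

class RotatingRegs:
    def __init__(self, size=16):
        self.phys = [0] * size   # physical registers
        self.rrb = 0             # rotating register base

    def _index(self, logical):
        return (logical + self.rrb) % len(self.phys)

    def read(self, r):
        return self.phys[self._index(r)]

    def write(self, r, v):
        self.phys[self._index(r)] = v

    def rotate(self):
        # One trip around the loop: what was logical rN is now rN+1.
        self.rrb -= 1

# Mimic the lecture's example: a load writes F1, and the consumer three
# iterations later reads the same value under the name F4.
f = RotatingRegs()
data = [10, 20, 30, 40, 50]
seen = []
for i, x in enumerate(data + [0, 0, 0]):   # extra iterations drain the pipe
    f.write(1, x)                          # "load" result into F1
    if i >= 3:                             # skip the 3-iteration fill
        seen.append(f.read(4))             # read it back as F4
    f.rotate()                             # RRB bump at the loop branch
print(seen)                                # the loads, three iterations late
```

Because the renaming happens in the register indexing, the same single loop body works for every iteration; no separately scheduled prologue and epilogue copies of the code are needed.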
Instead, we generate one copy of the loop body, which is going to take care of our prologue, our epilogue, and the main loop. What we do is encode the distance, in register numbers, between these two values here. So what this means is, if this writes F1, and one, two, three loop iterations in the future something wants to read that value, we encode that read with a register number that is offset by that amount. So this would be F1 to F4, because it's off by three. And here, this writes F5, and we know it's going to be read one, two, three, four iterations later, so we encode it with a register number that far forward into the future.

Now let's talk about this instruction here. What this is going to do is change the rotating register base, the RRB, bumping it by one. So we can basically just keep branching back to this same code, and each time we do, all the registers change names. By the time the load's value is ready here, these other instructions will have sort of caught up with it: the physical register they're actually going to look at will now point to the correct location. So we can effectively encode all of this, including the prologue and the epilogue, into this one loop body using the rotating register file.

Okay, so the last slide of today. Why do I think Itanium, and I think we can pretty confidently say this, failed? I actually don't think it was the ideas; a lot of it had to do with the implementation. First off, if you tie the hands of the microarchitect, they're going to scream. IA-64 added a lot of architectural, big-A architecture, ISA-level features in order to get speculative parallelism. And a lot of this stuff was specified and talked about but never actually built into real processors.
So people didn't go through the effort, until basically the first Itanium, to try to implement some of these things, and they didn't all mix well together. They added a lot of state, and they added a lot of complexity to the processor. We have the ALAT, full predication or almost-full predication, and rotating register files, to name a few. There's this really complex bundling sequence; it's probably one of the hardest-to-decode instruction sets in the world. Very, very challenging, and it tied the hands of the microarchitect; the microarchitect couldn't make a decision.

A good example of this, a funny story here, is what happened after the DEC Alpha employees, Digital Equipment Corporation employees, left DEC and were absorbed into a part of Intel. That same team that used to build out-of-order Alpha processors went on to build the next generation of Itanium processor. They went to look at the Itanium design, and: wow, this is really complicated. They thought it was much more complicated than Alpha. And then they said, well, we could probably do better if we just built it out-of-order superscalar: took apart all of the instructions, took apart all of the dependencies, poured that into what was effectively an Alpha-style out-of-order superscalar core, and then executed it.

And what was funny, if you look at this, is that you can sit there and just bang your head, because you did all of this work and added all of this architectural state to allow the compiler to do all this scheduling, and then they just wanted to undo it all. They wanted to do this for performance: throw away all of the state and all of the hard work the compiler did, and just redo it all dynamically, because they thought they could get better performance. They probably could have.
It probably was a good idea, but what was kind of funny there is that you built an instruction set with one microarchitecture in mind, basically an in-order architecture, and then all of a sudden people are thinking about building out-of-order variants of it, which throws away everything you had before, or at least all those notions sort of go away. So it's just a funny story that people tried to build out-of-order versions. They ultimately did not end up doing it: that same team decided it was basically too hard, mostly due to the predicate registers, and how you bypass predicate registers in an out-of-order machine. What they did build is what's now known as the Tukwila processor from Intel.

Now, there were a couple of other problems here. The first implementation had a very low clock rate, so your first one out of the gate was just not very good, and that hurt. And it's hard to build these things: they're wide. There's the speed-demons-versus-brainiacs question: do you want to go wide, or do you want to go long and narrow? Long and narrow was doing okay at the time. Big code-size bloat. It fundamentally did not solve all the dynamic scheduling problems that out-of-order superscalars could get at; for instance, changing your instruction schedule based on whether a load hit or missed in the cache is something it couldn't do. Big compiler complexity: you need profiling, and not everyone wanted to profile. There's also just not that much static instruction-level parallelism in all programs, so the compiler couldn't necessarily find all the parallelism, or it wasn't there statically, and if you're going for a compiler-only approach, you need to be able to do that.
And then, this is what really killed it: people did go build those more complex out-of-order superscalars. At the time, there was this big discussion: can we build more complex out-of-order superscalars? And people said no, those are too hard to build, they take too much effort, they cost too much, we don't know how to solve all these problems; so instead, we'll build something simpler and push a lot of the complexity into the compiler. Well, there was money behind this question. So people went and did build these complex out-of-order superscalars, and that's basically what we're still using today in our desktop processors: out-of-order superscalars.

And then finally, the last big one: AMD64 happened. What is AMD64? Well, it's a 64-bit extension to x86; AMD originally did this. Intel, after dragging their feet for a couple of years on this, finally decided, okay, we're going to use that, because people wanted it. People wanted code compatibility along with 64 bits: both wider arithmetic operations and wider addressing, so more memory, and 64 bits is a lot of memory. So AMD originally came up with this; Intel's version is now known as EM64T, or Intel 64, not to be confused with IA-64, and now Intel is building those processors too. Everyone has jumped on that, and Intel has kind of de-emphasized the Itanium instruction set; instead, we're basically sticking with IA-32, the 32-bit x86, with the 64-bit extensions, and that's what's taken over the workstation market.

And what's kind of funny here is that this processor was really designed to kill, or unify, all the workstation vendors together under one processor that was going to beat them all.
And it did achieve its goal to some extent. Because this processor was coming around, companies either went out of business or jumped on the IA-64 bandwagon and decided they were going to take it on. But what replaced all the different little variants of processors that were in workstations? SPARC, PA-RISC from HP, SGI's MIPS processors, Power from IBM. Power is still around, but a lot of the other ones died through attrition or moved, or were supposed to move, onto IA-64. But IA-64 did not end up winning this; instead, we replaced them with 64-bit x86 processors. So it sort of did its job: it killed the workstation processors, but they were replaced not with Itanium itself but with something else. Anyway, we're going to stop here for today, and we'll talk more next time.