Okay, so I want to briefly give a case study here of one of the more interesting modern-day VLIW architectures, probably the most famous and possibly also the most infamous VLIW processor out there. This is the Intel Itanium, also known as Intel IA-64, or what's known as an EPIC processor: Explicitly Parallel Instruction Computing. A lot of this work was actually done in collaboration between Intel and HP. HP uses these a lot in their big servers, their big, heavy, big-iron computers, not quite mainframes. And Intel was trying to use this to effectively kill all of the other workstation vendors; this was going to be their 64-bit solution to computing.

So it's a modern, non-classical VLIW, and this was going to be Intel's chosen ISA. They were going to deprecate x86 and choose IA-64 as the 64-bit ISA. And as we now know, going a few years forward after the creation of all this stuff, that didn't really happen. Intel went and built a bunch of processors with this instruction set, and you can still buy processors with it, but it never got as good an acceptance as its competitor. The competitor at the time was called AMD64, which is a 64-bit extension to what people already had. And that's what people ended up wanting: just a 64-bit extension to what we already had, versus something totally different.

Okay, so a couple of features here. It's an object-code-compatible VLIW, so it's not quite a VLIW in the classical sense. Object-code compatible means different generations, different microarchitectures of this VLIW, can run the same binaries with no need to recompile. And how they did this, as I alluded to before, is that they had the ability to have parallelism straddle across instruction bundles, and they had this notion of groups, which we'll talk about in a second.
So, the first implementation of this was Merced, the first Intel Itanium implementation. It was kind of like what the 8086 was for x86. Merced, as you'll realize if you look at Intel code names, is named after a river. Intel likes to name things after either rivers or places; I think this has something to do with the fact that you can't trademark a place name, so they get around that and make sure they don't have any trademark issues by choosing place names for all their code names.

One of the big problems here: it was supposed to ship in 1997, but first customer shipment wasn't until 2001. That's a four-year miss, and superscalar was another thing that had caught up on it in that time. It was supposed to be faster and better than everything else, and the first one was not very good: it had low clock rates and was not as high-performance as it was supposed to be. The x86 side of Intel's business line actually had almost the same performance as the first Itanium, and then very quickly surpassed it. So their high-end processor wasn't actually high-end.

A couple of other things here. McKinley was the second implementation, and it shipped pretty quickly after that. It was a much better implementation, but these things are still hard to build, and they're still building them. In 2011 at ISSCC, Intel introduced the Poulson processor. Big machine: eight cores in 32 nanometer, with lots and lots of on-die memory, 32 megabytes of shared L3 cache. A big processor: 544 square millimeters in 32 nanometer. At the time this came out, it was the biggest processor ever built, with the most transistors, over three billion, or at least the biggest commercial one; Intel might have had a research prototype with more transistors than this.
I think their many-core research part, what they call the SCC, their Single-chip Cloud Computer, might have had more, but I should know the transistor count. From a commercial processor perspective, though, it's a huge chip. But they are selling it into extremely expensive sockets: these sell at a premium and go into big mainframe-class machines. That's not what this was originally destined for; it was destined for both big mainframes and workstations. But standing here now in 2012, it's not used in many places except for bigger hardware, mainframe sorts of things.

A few of the interesting points here: the cores are multi-threaded, and you can fetch six instructions per cycle and execute up to twelve instructions per cycle, per core, and there are eight cores. So this is a beast of a machine, a very high-performance computer.

Okay, so let's dive into some of the details of Itanium. Itanium has a 128-bit instruction bundle, and inside of there you can fit three operations, plus some bits called template bits, which say what is in the instruction bundle. So it's not actually a fixed-format bundle; the instruction boundaries can move around a little bit. They did that so you can mix in, say, an instruction with an immediate alongside instructions that don't have immediates, and get more space in the bundle for the immediate bits or a branch offset or something like that. These template bits also describe how a particular bundle relates to the bundles around it. Sometimes these are called begin and end bits, or start and stop bits: they delimit the set of instructions which can explicitly execute in parallel. And the machine doesn't necessarily have to execute them in parallel.
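As an aside for the notes, the bundle layout just described can be sketched in a few lines of Python. This is my own simplified model, not Intel's encoding tables: the field widths match the IA-64 format (a 5-bit template in the low bits plus three 41-bit instruction slots), but the real template value also encodes which execution units each slot targets and where the stop boundaries fall, which this sketch does not interpret.

```python
# Toy decoder for an IA-64-style 128-bit bundle: a 5-bit template field
# plus three 41-bit instruction slots. In the real ISA the template also
# names the unit type (M/I/F/B) for each slot and the stop positions;
# here we only split the raw fields apart.

def decode_bundle(bundle: int):
    """Split a 128-bit bundle into (template, [slot0, slot1, slot2])."""
    assert 0 <= bundle < (1 << 128)
    template = bundle & 0x1F                              # low 5 bits
    slots = [(bundle >> (5 + 41 * i)) & ((1 << 41) - 1)   # 41 bits each
             for i in range(3)]
    return template, slots

# Pack three made-up 41-bit operations with an arbitrary template value,
# then decode the bundle back into its fields.
ops = [0x1AAAAAAAAAA, 0x0BBBBBBBBBB, 0x0CCCCCCCCCC]
bundle = 0x10
for i, op in enumerate(ops):
    bundle |= op << (5 + 41 * i)

template, slots = decode_bundle(bundle)
print(hex(template), [hex(s) for s in slots])
```

The point of the exercise is just that the bundle is a fixed 128-bit container, while the template bits tell the decoder how to interpret the three variable-role slots inside it.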
So for instance, if you say twenty operations can execute in parallel, but your machine, the particular implementation of Itanium or IA-64 they built, is only two wide, you're just going to execute two wide for ten cycles, or something like that. But what's really cool here is that the compiler is able, just like in all the other VLIWs, to express the parallelism to the machine explicitly.

Some interesting things about the registers. Because this is a VLIW processor, and because you're going to have to do code scheduling like what we saw last class, that increases the general-purpose register pressure. You don't have a register renamer, so you can't go and use different names for things, and the hardware's not going to rename things for you; instead, the compiler and the software have to do the renaming. So they had 128 general-purpose registers and another 128 floating-point registers.

They also have these predicate registers. It's not quite full predication, but it's pretty close: you can have bits that say whether later instructions are going to execute or not, and you have to compute those into a little register file. So they had a predicate register file that you have to bypass. That's sort of interesting to see.

And then they had a really interesting feature here called the rotating register file. Let's talk about what a rotating register file is. The problem this is trying to solve: in a code sequence like we saw last lecture, if you have a very-long-instruction-word scheduled piece of code and you want to get good performance, you're going to have to unroll the loop, and then you're going to have to software-pipeline the loop. But when you do this, it's going to increase your register pressure, that is, increase how many register names you need to use.
And, as we saw, you're going to have to add extra special code in the prologue and the epilogue, which is different from the main loop body. So how do you solve this in one fell swoop? Well, you add a subset of your register space which will, sort of statically, rename itself every loop iteration: every iteration slightly changes the naming of the registers. What this looks like is, if you go to access, let's say, register R1, there's an architecturally visible register called the rotating register base, or RRB, whose value gets added to the register number. It's modular arithmetic, so it wraps around at the end, and that points to different locations in the physical register file.

This is pretty cool. Every single time we come to a new loop iteration, we're going to change the RRB, and it's going to point to a different set of registers. And we can effectively software-pipeline just by using this one feature.

So here we have the same code sequence we had from last lecture, the previous code example. If we recall, when we unrolled all of this, what we ended up with was a load, an add, and a store; we'll talk about this in a second. This was the key thing we were trying to execute, and we just had to unroll the code and then look at the dependencies. So let's look at the dependencies here. This load writes F1, the floating-point register F1 here, and we know this actually gets read with a latency of, say, one, two, three cycles; it doesn't get read until here. Likewise, this add here computes F5; the add, let's say, is a floating-point add with some long latency, and down here is when it's read into the store. So on a machine with a rotating register file, we don't actually generate all this code.
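The RRB renaming just described can be modeled in a few lines of Python. This is a simplified sketch of my own, not the real hardware: real IA-64 rotates only a subset of each register file and bumps the RRB as a side effect of its loop branch, while here the whole toy file rotates and we call `rotate()` by hand. The guard on the first three iterations stands in for the rotating predicates the real machine uses to suppress the pipeline fill.

```python
# Toy model of a rotating register file. A logical register name plus
# the rotating register base (RRB), modulo the file size, picks the
# physical register. Decrementing RRB each iteration means a value
# written to logical rN this iteration is visible as rN+1 next
# iteration, so rN+k after k iterations.

class RotatingRegs:
    def __init__(self, size=16):
        self.phys = [0] * size   # physical registers
        self.rrb = 0             # rotating register base

    def _index(self, logical):
        return (logical + self.rrb) % len(self.phys)

    def read(self, r):
        return self.phys[self._index(r)]

    def write(self, r, v):
        self.phys[self._index(r)] = v

    def rotate(self):
        # One trip around the loop: what was logical rN is now rN+1.
        self.rrb -= 1

# Mimic the lecture's example: a load writes F1, and the consumer three
# iterations later reads the same value under the name F4.
f = RotatingRegs()
data = [10, 20, 30, 40, 50]
seen = []
for i, x in enumerate(data + [0, 0, 0]):   # extra iterations drain the pipe
    f.write(1, x)                          # "load" result into F1
    if i >= 3:                             # skip the 3-iteration fill
        seen.append(f.read(4))             # read it back as F4
    f.rotate()                             # RRB bump at the loop branch
print(seen)                                # the loads, three iterations late
```

Because the renaming happens in the register indexing, the same single loop body works for every iteration; no separately scheduled prologue and epilogue copies of the code are needed.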
Instead, we generate one copy of the loop body, which is going to take care of our prologue, our epilogue, and the main loop. What we do is encode the distance, in register numbers, between these two values here. So what this means is, if this writes F1, and one, two, three loop iterations in the future something wants to read that value, we encode that read with a register number that is offset by that amount. So this would be F1 to F4, because it's off by three. And here, this writes F5, and we know it's going to be read one, two, three, four iterations later, so we encode it with a register number that far forward into the future.

Now let's talk about this instruction here. What this is going to do is change the rotating register base, the RRB, bumping it by one. So we can basically just keep branching back to this same code, and each time we do, all the registers change names. By the time the load's value is ready here, these other instructions will have sort of caught up with it: the physical register they're actually going to look at will now point to the correct location. So we can effectively encode all of this, including the prologue and the epilogue, into this one loop body using the rotating register file.

Okay, so the last slide of today. Why do I think Itanium, and I think we can pretty confidently say this, failed? I actually don't think it was the ideas; a lot of it had to do with the implementation. First off, if you tie the hands of the microarchitect, they're going to scream. IA-64 added a lot of architectural, big-A architecture, ISA-level features in order to get speculative parallelism. And a lot of this stuff was specified and talked about but never actually built into real processors.
So people didn't go through the effort, until basically the first Itanium, to try to implement some of these things, and they didn't all mix well together. They added a lot of state, and they added a lot of complexity to the processor. We have the ALAT, full predication or almost-full predication, and rotating register files, to name a few. There's this really complex bundling sequence; it's probably one of the hardest-to-decode instruction sets in the world. Very, very challenging, and it tied the hands of the microarchitect; the microarchitect couldn't make a decision.

A good example of this, a funny story here, is what happened after the DEC Alpha employees, Digital Equipment Corporation employees, left DEC and were absorbed into a part of Intel. That same team that used to build out-of-order Alpha processors went on to build the next generation of Itanium processor. They went to look at the Itanium design, and: wow, this is really complicated. They thought it was much more complicated than Alpha. And then they said, well, we could probably do better if we just built it out-of-order superscalar: took apart all of the instructions, took apart all of the dependencies, poured that into what was effectively an Alpha-style out-of-order superscalar core, and then executed it.

And what was funny, if you look at this, is that you can sit there and just bang your head, because you did all of this work and added all of this architectural state to allow the compiler to do all this scheduling, and then they just wanted to undo it all. They wanted to do this for performance: throw away all of the state and all of the hard work the compiler did, and just redo it all dynamically, because they thought they could get better performance. They probably could have.
It probably was a good idea, but what was kind of funny there is that you built an instruction set with one microarchitecture in mind, basically an in-order architecture, and then all of a sudden people are thinking about building out-of-order variants of it, which throws away everything you had before, or at least all those notions sort of go away. So it's just a funny story that people tried to build out-of-order versions. They ultimately did not end up doing it: that same team decided it was basically too hard, mostly due to the predicate registers, and how you bypass predicate registers in an out-of-order machine. What they did build is what's now known as the Tukwila processor from Intel.

Now, there were a couple of other problems here. The first implementation had a very low clock rate, so your first one out of the gate was just not very good, and that hurt. And it's hard to build these things: they're wide. There's the speed-demons-versus-brainiacs question: do you want to go wide, or do you want to go long and narrow? Long and narrow was doing okay at the time. Big code-size bloat. It fundamentally did not solve all the dynamic scheduling problems that out-of-order superscalars could get at; for instance, changing your instruction schedule based on whether a load hit or missed in the cache is something it couldn't do. Big compiler complexity: you need profiling, and not everyone wanted to profile. There's also just not that much static instruction-level parallelism in all programs, so the compiler couldn't necessarily find all the parallelism, or it wasn't there statically, and if you're going for a compiler-only approach, you need to be able to do that.
And then, this is what really killed it: people did go build those more complex out-of-order superscalars. At the time, there was this big discussion: can we build more complex out-of-order superscalars? And people said no, those are too hard to build, they take too much effort, they cost too much, we don't know how to solve all these problems; so instead, we'll build something simpler and push a lot of the complexity into the compiler. Well, there was money behind this question. So people went and did build these complex out-of-order superscalars, and that's basically what we're still using today in our desktop processors: out-of-order superscalars.

And then finally, the last big one: AMD64 happened. What is AMD64? Well, it's a 64-bit extension to x86; AMD originally did this. Intel, after dragging their feet for a couple of years on this, finally decided, okay, we're going to use that, because people wanted it. People wanted code compatibility along with 64 bits: both wider arithmetic operations and wider addressing, so more memory, and 64 bits is a lot of memory. So AMD originally came up with this; Intel's version is now known as EM64T, or Intel 64, not to be confused with IA-64, and now Intel is building those processors too. Everyone has jumped on that, and Intel has kind of de-emphasized the Itanium instruction set; instead, we're basically sticking with IA-32, the 32-bit x86, with the 64-bit extensions, and that's what's taken over the workstation market.

And what's kind of funny here is that this processor was really designed to kill, or unify, all the workstation vendors together under one processor that was going to beat them all.
And it did achieve its goal to some extent. Because this processor was coming around, companies either went out of business or jumped on the IA-64 bandwagon and decided they were going to take it on. But what replaced all the different little variants of processors that were in workstations? SPARC, PA-RISC from HP, SGI's MIPS processors, Power from IBM. Power is still around, but a lot of the other ones died through attrition or moved, or were supposed to move, onto IA-64. But IA-64 did not end up winning this; instead, we replaced them with 64-bit x86 processors. So it sort of did its job: it killed the workstation processors, but they were replaced not with Itanium itself but with something else. Anyway, we're going to stop here for today, and we'll talk more next time.