So this is where some people got the idea that maybe there is a better way. And I want to point out that this is only the tip of the iceberg of what we'll call better instruction sequences, or better encoding contracts between compilers and the hardware. This is actually an open research topic. Very long instruction word processors, or VLIW for short, is one take on it. There's been a fair amount of work done after this, which we're not going to talk about in this class, in the last five to ten years, that has looked at this in more detail, especially given multi-cores: can you schedule across multiple cores? There was a project out of the University of Texas at Austin which tried to schedule across something that looked like cores but wasn't quite cores. It was what we'll call a super-VLIW, with some dynamic aspects and some static aspects. But for right now, let's talk about very long instruction word processors.

Okay, where does the name come from? Let's start there. Well, these were originally called long instruction word processors, and that name fell out of favor. At one point people made a distinction between long instruction word processors and very long instruction word processors, based on how many operations were packed together. That distinction has largely fallen out of favor now, and people mostly call all of these things VLIWs, very long instruction word processors, because it's hard to say what is "long" versus what is "very long"; it's just an extra word. It's like people talking about Large Scale Integration versus Very Large Scale Integration versus Ultra Large Scale Integration: people just keep tacking extra letters onto the front.

But let's talk about VLIW instruction sequences. What does one of these things look like? Well, a VLIW instruction will actually have multiple operations within one bundle. Typically this is called a bundle, or an instruction, with multiple operations inside of it. So in this example here, we have six operations that can be executed in this one instruction, in this one bundle. And typically they're in a fixed format. Let's say you can execute two integer operations, two memory operations, and two floating point operations per cycle, and that's what you're allowed to encode. So instead of having a sequential sequence of scalar instructions, we still have a sequence of instructions, a sequence of bundles, but each one of those instructions, each one of those bundles, encodes multiple independent operations.

So let's look at an example code sequence. [Writes on board: a first bundle containing a multiply and an add, followed by a second bundle containing a single operation.] What's interesting about this is that there are actually two operations in this first instruction, this first bundle, while the second one only has one operation. And this multiply and this add will, semantically at least, execute in parallel.
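To make the bundle format concrete, here is a minimal sketch in Python. Everything in it is my own illustration, not from the lecture slide: the Op and Bundle names, the slot layout, and the exact registers in the board example are all assumptions. One instruction carries up to six operation slots, two integer, two memory, two floating point, and a program is just a sequence of these bundles.

    from dataclasses import dataclass
    from typing import Optional, Tuple

    @dataclass
    class Op:
        """One operation: opcode, destination register, source registers."""
        opcode: str
        dst: Optional[str]
        srcs: Tuple[str, ...]

    @dataclass
    class Bundle:
        """One VLIW instruction: a fixed set of slots that issue together.
        Hypothetical six-slot format: 2 integer, 2 memory, 2 floating point."""
        int0: Optional[Op] = None
        int1: Optional[Op] = None
        mem0: Optional[Op] = None
        mem1: Optional[Op] = None
        fp0: Optional[Op] = None
        fp1: Optional[Op] = None

    # The board example as I read it: the first bundle packs an add and a
    # multiply (one writes R3 while the other reads R3 -- deliberately NOT a
    # dependence, as discussed next), and the second bundle has a single
    # operation that really does depend on R3.
    program = [
        Bundle(int0=Op("add", "R3", ("R4", "R5")),
               fp0=Op("mul", "R1", ("R3", "R2"))),
        Bundle(int0=Op("sub", "R6", ("R3", "R7"))),
    ]

Note that in a real encoding the empty slots would still occupy bits as no-ops, which is one classic criticism of VLIW code density.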
Now, what's interesting, because you look at this: I purposely wrote it to show what looks to be a read-after-write or write-after-read or some sort of dependence between these two registers in these two operations here, this multiply and this add. But that's not what's actually going on. In a very long instruction word processor, within one instruction, within one bundle, these sorts of dependences don't exist; they're ignored. So just because this reads R3 and this writes R3, they're not dependent on each other. The subsequent instruction is dependent on R3, let's say; if its source here is R3, it picks up the result from the bundle before it. But within one bundle, it doesn't actually matter. So the semantics of the instruction set are that everything within one instruction, everything within one bundle, is parallel with everything else, and there's no dependency checking; we'll make these semantics concrete in a sketch in a moment. What's nice about this is we just took all that hardware we built, all that instruction checking, all the dependency checking, all the scoreboarding, and threw it out the window. We don't need that hardware anymore in this instruction set, in this architecture. So that's pretty cool. We took out a bunch of hardware we didn't need and basically let the compiler do that checking for us.

Now there's a question: this multiply takes multiple cycles, so does this instruction here pick up the result in R3? Sorry, I should be drawing that the other way. Or, let's say the add had a longer latency, does the next bundle pick that value up or not? We'll talk about that in a minute; there are two different choices there in VLIW designs.

Well, let's look at our slide here. Typically, in traditional VLIWs, each operation has a fixed, architecturally specified latency. That's a guarantee. Unfortunately, because of this, the architecture of the machine is very tied to the compiler: the compiler needs to know how long each operation takes. So that's the downside. And in a typical VLIW there are no data interlocks; we don't even have a scoreboard. Now, there are some variants that do enforce interlocking that are still VLIWs, but in the traditional, most basic VLIW, you don't have a scoreboard and you have no interlocking. So if you had, let's say, a subtract operation that read register R1, which the multiply wrote, and the multiply, since it's a floating point multiply, took four cycles: in the most basic design, that subtract would actually get the old value of R1, not the new value; it gets the original value of R1. But we'll talk about that in more detail in a second. There's a choice there in VLIW designs.

But yes, so we reduced our hardware. We don't have a register renamer, we don't have an issue window, we don't have a reorder buffer, we don't have a scoreboard, and we let the compiler do a lot of the work. Downsides to this: we're not able to react to dynamic events very well, so cache misses, branch mispredicts, things like that. Because we're not going out-of-order, because we don't have all that extra hardware in there, we can't schedule around those problems. So that's a downside to these architectures. Now, people have thought really hard about how to make VLIWs have some of the benefits of superscalars and out-of-orderness, of out-of-order superscalars. So at the end of lecture today, and probably in the next lecture, we'll talk about some of the techniques that people have added back into VLIWs that bring us somewhere in between an out-of-order processor and a VLIW processor, and get some of the benefits of both.
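Before moving on, here is the promised sketch pinning down the intra-bundle semantics. Again this is my own illustration, with operations written as (opcode, dst, srcs) tuples for brevity and made-up register values: all reads in a bundle see the register file as it stood before the bundle, then all writes commit together, so no dependence checking is needed.

    def execute_bundle(bundle, regs, alu):
        """Parallel intra-bundle semantics: every operation reads the
        register file as it stood BEFORE this bundle, then all writes
        commit at once. An op that reads R3 and an op that writes R3 in
        the same bundle are therefore independent -- no scoreboard, no
        dependence check. (Single-cycle model; latency comes next.)"""
        results = [(dst, alu[opcode](*(regs[s] for s in srcs)))
                   for opcode, dst, srcs in bundle]   # phase 1: all reads
        regs.update(dict(results))                    # phase 2: all writes

    alu = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b}
    regs = {"R2": 2, "R3": 3, "R4": 4, "R5": 5}
    execute_bundle([("add", "R3", ("R4", "R5")),   # writes R3 = 9 ...
                    ("mul", "R1", ("R3", "R2"))],  # ... while this reads the OLD R3
                   regs, alu)
    print(regs["R1"], regs["R3"])  # 6 9: the multiply saw R3 == 3, not 9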
Okay, two models. This goes back to the question of: when you have an instruction which writes to a register, and the latency of that instruction is specified to be longer than one, which value do you pick up? Do you pick up the old value, or do you pick up the new value, if you have an instruction that is effectively in the shadow of the other instruction?

So the first VLIW model, and this is a classical naming scheme, I did not come up with this, is called the equals, or EQ, scheduling model. In the EQ scheduling model you have an instruction, and the latency of the instruction is specified; the compiler knows it. And if you have an instruction which tries to read a value before the first instruction actually does the write, it gets the old value.

So let's go through an example over here. We're going to have our multiply again. Okay, so here we have a multiply and an add which are bundled together, so they're going to execute concurrently. And we have an and operation in the second instruction, the second bundle here. I should say I'm using brackets and semicolons here: the brackets delineate an entire instruction, an entire bundle, and the semicolon delineates between two operations within one instruction. And here, with this and, we have what looks to be a read-after-write dependence, something like that.

And let's say our pipeline looks like this. We have X0, which does ALU ops. We have Y0, Y1, Y2, Y3. And then we have, let's say, a two-stage memory pipeline. And somewhere over here we have writeback. So this looks similar to pipes we've looked at before. Multiplies go down this four-stage Y pipe, very similar to things we've looked at before; loads and stores go into the memory pipe; and ALU operations go into the X pipe. But now comes the question: should this and get the result of the multiply if it's scheduled one cycle after it, or should it get the old value of R1, the previous value of R1?

We're going to define, in the equals scheduling model, that the multiply's value for R1 is not ready until the end of Y3. So in the equals model, with this and, the compiler is not trying to express a read-after-write dependence; that dependence does not actually exist. The and gets the old value of R1. The compiler knew about this, and everyone's okay with it. Now, if this and were, let's say, three more cycles later, it would actually get the multiply's value, and it would be a read-after-write dependence. So in the equals model we're just saying that an operation takes effect exactly at its specified latency, and never earlier.

Some positives to this: you get some pretty cool register usage. If you think about it, here, the old value of register R1 is still live after this multiply issues. So effectively this gives us a little less register pressure; we can have a few more values in flight without having more physical registers, or without having more architectural registers at all. We can basically have more registers, because a value doesn't go dead when you overwrite it; it goes dead when the multiply takes effect. And with this, we don't need any register renaming. But the compiler really depends on the new register value not becoming visible early. Unfortunately, this causes some problems. And these sorts of architectures, the first formulation of very long instruction word processors, actually looked like this; they were these equals architectures.
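Here is a minimal sketch of the EQ model under the same made-up tuple format. The four-cycle multiply latency matches the lecture's Y0-Y3 pipe; everything else is my illustration. Each write commits exactly at its stated latency, so an operation scheduled in the shadow deterministically reads the old value.

    LATENCY = {"add": 1, "and": 1, "sub": 1, "mul": 4}  # compiler-visible, fixed

    def run_eq(program, regs, alu):
        """EQ scheduling model: a write commits EXACTLY at issue + latency,
        never earlier. A read issued in the shadow of a multi-cycle op is
        guaranteed to see the OLD register value."""
        pending = []  # (commit_cycle, dst, value)
        for cycle, bundle in enumerate(program):
            regs.update({d: v for c, d, v in pending if c == cycle})
            pending = [p for p in pending if p[0] > cycle]
            for opcode, dst, srcs in bundle:
                val = alu[opcode](*(regs[s] for s in srcs))
                pending.append((cycle + LATENCY[opcode], dst, val))
        for _, dst, val in sorted(pending, key=lambda p: p[0]):  # drain
            regs[dst] = val

    alu = {"mul": lambda a, b: a * b, "and": lambda a, b: a & b}
    regs = {"R1": 7, "R2": 2, "R3": 3, "R4": 1}
    run_eq([[("mul", "R1", ("R2", "R3"))],    # R1 := 6, visible after 4 cycles
            [("and", "R6", ("R1", "R4"))]],   # in the shadow: sees R1 == 7
           regs, alu)
    print(regs["R6"])  # 1, i.e. 7 & 1 -- the old R1, not the multiply's 6

Note how the old R1 stays readable for three extra cycles after the multiply issues; that is exactly the register-pressure win described above.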
The major problem with these actually comes up when you have things that are unpredictable mixed in with this very predictable code sequence. So let's say you take an interrupt. Let's say we just put some unimportant instruction here, some subtract operation, and the subtract takes an interrupt. Now, semantically, this multiply has already completed, because the interrupt doesn't happen until the instruction after it, until this subtract operation. Hm. Okay. So the machine drains the multiply, writes R1, and you fall into the interrupt handler for the subtract. What happens when the and goes to execute after you return? Does it actually pick up the correct value of R1 here? No. It picks up the new value of R1, and it was supposed to pick up the old value of R1. So that's traditionally a problem with EQ, or equals, scheduling model architectures. People have solved some of these problems; or sometimes, when people build these processors, they just don't have interrupts. So some of these VLIW equals processors were not able to handle interrupts at all. It's something to think about. That's the first case.

There's also a little more forgiving case. We'll call it the less-than-or-equals, or LEQ, scheduling model. Here, in this model, a register value is allowed to take on its new value any time between when the operation issues and its specified latency. What this means is the compiler still can't schedule a read early, so you can't go and try to read the value early, but you're guaranteed not to have a problem when there's an interrupt, because, let's say, when you come back the right value has been filled in. So the compiler schedules around this and knows not to schedule anything too early. You still don't have to implement interlocks, you still don't need a scoreboard, but you can now have precise interrupts. Some other positive things pop out of this: you end up with binary compatibility preserved when latencies are reduced. So let's say you make a faster processor, where the multiply, instead of taking four cycles, only takes three. That's a positive here: you may not get more performance out of the old binary, but at least you won't get incorrect execution.
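And here is a minimal sketch of the LEQ variant, same made-up format as before; the random commit time is just my way of modeling implementations with different actual latencies. A write may land any time up to its stated latency, so correct code never reads inside the shadow, and any implementation that is as fast or faster gives the same answer.

    import random

    LATENCY = {"add": 1, "and": 1, "sub": 1, "mul": 4}  # compiler-visible latencies

    def run_leq(program, regs, alu, rng=random):
        """LEQ scheduling model: a write may commit ANY time between issue
        and issue + stated latency (chosen at random here to stand in for
        faster or slower hardware). Code that never reads a result before
        the full stated latency gets the same answer either way -- which is
        why interrupts can be made precise and why reducing a latency
        preserves binary compatibility."""
        pending = []  # (commit_cycle, dst, value)
        for cycle, bundle in enumerate(program):
            regs.update({d: v for c, d, v in pending if c <= cycle})
            pending = [p for p in pending if p[0] > cycle]
            for opcode, dst, srcs in bundle:
                val = alu[opcode](*(regs[s] for s in srcs))
                pending.append((cycle + rng.randint(1, LATENCY[opcode]), dst, val))
        for _, dst, val in sorted(pending, key=lambda p: p[0]):  # drain
            regs[dst] = val

Running the EQ example above through run_leq would be a compiler bug: that and sits inside the multiply's shadow, so it could legally see either value of R1.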
Okay, so a little bit of history. Usually I try not to harp on history too much in this course, even though I really enjoy it. But with VLIW processors I wanted to make one point: a lot of this research is relatively recent. If you go look at the dates on this, the first real VLIW processors were done in, like, the late 80s. So this is not going back to the 60s; this is a portion of computer architecture work which is actually very, very recent. The first long instruction word processor was actually a Floating Point Systems processor, FPS, that's what it stands for. And this was actually a co-processor for VAX machines, something that could speed up your floating point on a VAX machine. This was very much the most basic VLIW processor: there was no interlocking, it didn't take interrupts, and it was really for hand-coded vector arithmetic and floating point math. Probably when people talk about VLIW, the thing that pops into their head first is the Multiflow Trace processor, which was made by a small startup company called Multiflow.

This was an outgrowth of a bunch of research that was done at Yale by Josh Fisher and a bunch of his students. I won't go into too much detail here, but one of the interesting things is that they really did have a very long instruction word: 1,024-bit instructions. So this is a beefy instruction. It could have anywhere from seven to fourteen to 28 operations per instruction, and this was not dynamic; this is actually how they made different configurations of their machine. They had wider machines that were more expensive and narrower machines that were cheaper. So it's a family of processors, and they customized the compiler to each one. Josh Fisher is actually much more of a compiler guy than an architect by training, and you can see that in his group's work and in the PhDs that came out of it. He now works for HP, HP Labs, and is semi-retired.

At the same time, there was also another company that was commercializing a very similar idea. This was Cydrome, with the Cydra 5. This was Bob Rau, another very famous computer architect. He was a professor at the University of Illinois, and he developed a lot of these things and then left and started Cydrome. One of the interesting things in that processor is that, instead of having a register renamer, they had a register file where the naming of the registers changed as you did function calls. We'll talk more about that later today, or maybe next lecture. But moreover, what we really want to get out of here is that this is all very recent.