Okay so, very long instruction word processors. We still want high performance, but we took all of the hardware that was in our out-of-order superscalar and threw it out. Well, something has to make up for it, and what makes up for it is a very, very smart compiler. So we put a lot of emphasis on the compiler in these sorts of architectures. The compiler really has to do the scheduling. It has to do all of the dependency checking. It has to avoid all the different data hazards, and we're just getting started; we're going to talk mostly next lecture about all of the different optimizations a compiler can do to try to approximate, statically, what an out-of-order superscalar does dynamically. It's a pretty cool trick: you take all that hardware, you put it in the compiler, and you run it once. Then, every time you go to execute the code, you don't have to recalculate all the dependencies. Sounds good.

Okay, so let's see how we execute some code here, and what the performance aspects are of executing loop code on a very long instruction word processor. Here we have a very basic array increment: we take every element of this array and increment it by the value C. We run it through our compiler, and here's the sequential code sequence. This has not been scheduled yet for our VLIW architecture over here. We load the value, we do the floating-point add, we store the value back, we increment the array index and the loop counter, and then we loop. Seems simple enough.

Let's see how this gets scheduled. Because the compiler knows about latencies, and this load has a few cycles of latency, it's going to schedule the add later. And the floating-point add has a couple of cycles of latency also, so it schedules the consumer of that result, the store of F2, later still. The integer additions for the array index and the loop counter can be sprinkled in wherever there's room; there are lots of open slots, so let's say they're scheduled at the same time as the load and the store. And we can fit the branch in somewhere. So this is pretty cool: we're actually executing two-wide in parallel here, and we didn't need all the extra overhead of an out-of-order superscalar.

Okay, first question: how many floating-point operations per cycle are we doing? Are we doing well here? It looks pretty poor. We have one floating-point operation, just this add, and we have one, two, three, four, five, six, seven, eight cycles. So we're getting 0.125 floating-point operations per cycle. That's not great. We do have a few other instructions in there, so it's better than nothing, but we're not really using the machine very well. An out-of-order superscalar would probably try to take instructions from the iterations below, intermix them, reorder a bunch of things, and run faster.

So what's a technique to go faster? As I said, we put a lot of emphasis on compilers in this unit, and one of the things the compiler can do is unroll the loop. So here we have our loop, we unroll it four times, and now we take the loop overhead and factor it out so that it only happens once every four iterations. That sounds good.
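To make the running example concrete, here is a minimal sketch of the source loop being discussed. The names x, N, and C, and the four-byte float element size, are assumptions for illustration; the comments note roughly how each iteration maps onto the load / add / store / index-update / branch sequence on the slide.

```c
#include <stddef.h>

/* The basic array increment: add the constant C to every element.     */
/* Each iteration becomes: load x[i], floating-point add, store x[i],  */
/* bump the index, and branch back -- the sequential sequence the      */
/* VLIW compiler has to schedule around the load and add latencies.    */
void array_increment(float *x, size_t N, float C)
{
    for (size_t i = 0; i < N; i++) {
        x[i] = x[i] + C;   /* load -> fadd -> store, plus index update and branch */
    }
}
```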
We're going to do more work per iteration of the loop. But things are a little more complicated: what happens if N, our terminating value, is not a multiple of four? We need to do something about that. Before we execute the unrolled body, we need to check that N is big enough, and most of the time we can keep running the unrolled body; but for the last few elements, when N isn't a multiple of four, we have to clean up. The compiler is responsible for generating that cleanup code, so these compiler optimizations do take some effort.

Okay, so let's look at the scheduling of our loop-unrolled code. We've intermingled the four iterations: the loads get pulled up to the top, the stores get pushed down to the bottom, all the adds go in the middle, and the index updates get sprinkled in somewhere. When we go to schedule it, we do something similar to before: we issue the loads first, then the floating-point adds, then the floating-point stores. But what you'll notice is that we're starting to get some overlap. Because we've unrolled, we can overlap this load with the first floating-point add; we've effectively covered the latency of our functional units by putting work from other loop iterations into that time. So if you compare this schedule with the schedule back here, we've taken those dead cycles and filled them with the other loop iterations.

In this loop-unrolled case, we're not incrementing the index by four anymore. We're incrementing it by however many times we've unrolled times the element size, so we're incrementing it by sixteen now. Does that make sense? In the original code we were incrementing R2 by four, because a single value is four bytes, so we had to move our array index over by four each iteration. But now, because we're batching up all this work together, we have to move the index by a bigger value: four, because we've unrolled four times, times the size of the data value, which was four, so sixteen. And one of the nice things here is that in both the loads and the stores we're using the register-indirect-with-offset addressing mode, so we're offsetting, say, twelve plus the base register R1 to figure out where we're actually loading from. It's a convenient way to avoid computing a bunch of addresses explicitly.

Okay, so going back here, we can see we're starting to overlap operations from different loop iterations. That's really cool; we're starting to get some performance. So let's look at the performance. Same question: how many floating-point operations per cycle? Hopefully it's higher. One, two, three, four floating-point adds, divided by one, two, three, four, five, six, seven, eight, nine, ten, eleven cycles. That's about 0.36, which is a lot better than 0.125. This is good; loop unrolling is helping us. But is this everything, or could we do more in our compiler? Well, the compiler people came up with an even fancier idea, which is called software pipelining.
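Here is a minimal sketch of what that four-way unrolled loop plus its cleanup code might look like at the source level, under the same assumptions as before (four-byte floats, the names x, N, and C chosen for illustration). Note the single index update per unrolled body, which corresponds to bumping the pointer by sixteen bytes, and the cleanup loop that handles a remainder when N isn't a multiple of four.

```c
#include <stddef.h>

/* Four-way unrolled version of the array increment.  Each trip through  */
/* the main body does four load/add/store groups at constant element     */
/* offsets 0..3 (byte offsets 0, 4, 8, 12 off the base), and the index    */
/* is bumped once by 4 elements (16 bytes) per trip.                      */
void array_increment_unrolled(float *x, size_t N, float C)
{
    size_t i = 0;

    /* Main unrolled body: only runs while at least 4 elements remain. */
    for (; i + 4 <= N; i += 4) {
        x[i + 0] += C;   /* offset  0 off the base address */
        x[i + 1] += C;   /* offset  4 bytes                */
        x[i + 2] += C;   /* offset  8 bytes                */
        x[i + 3] += C;   /* offset 12 bytes                */
    }

    /* Cleanup code the compiler must generate for N not a multiple of 4. */
    for (; i < N; i++) {
        x[i] += C;
    }
}
```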
So it uses a term we've seen before, pipelining, but does it in software. The idea here is that instead of just having one loop, unrolling it, and overlapping the iterations within that unrolled body, we're going to take multiple copies of that schedule and interleave them, to try to fill some of the empty slots.

So let's look at this. We have the code unrolled four times, the same piece of code we had on the previous slide, and we draw the schedule we had before, here in purple. Okay, that doesn't look so bad. But now we're going to schedule it together with another iteration of the exact same four-times-unrolled loop, shown here in green. So we've just overlapped this with another iteration of the loop. Are we done? Not quite; we still have some open spots. So let's overlap yet another iteration, shown here in red. Now, the fix-up code you need to do this correctly gets more complex, because all of a sudden you're overlapping multiple iterations. But as long as you don't modify state speculatively, as long as you don't do a store early, you're probably okay, because you're only doing extra loads, extra work. You're doing extra work and filling slots on the assumption that nothing goes wrong, that the trip count N, if you will, is a multiple of four and large, and that you're not at the end.

Let's put some names on these things. We call the run-up at the beginning the prologue. Here in the middle we have the actual steady-state iterations; you can see in green, sorry, that doesn't show up very well, that there are instructions here, there are adds, it's pretty full, so we're actually doing a lot of work on the machine. And the epilogue here is when we're done, when we're falling out; we're at the last iteration of the outer loop, if you will.

So let's do some math and look at the performance of this. Same question: how many floating-point operations per cycle? We look over here: we have one, two, three, four floating-point adds, and four cycles in our tight steady-state loop, so one floating-point operation per cycle. That looks pretty good. That's cool; we just got a bunch of performance. But we had to do a lot of compiler optimization to make this work: we overlapped three different iterations of this loop, which we had also unrolled with a software transform. So this is called software pipelining.

And we have a nice picture here to show what's going on visually. We have time on the horizontal axis, and some measure of activity, how many instructions are executing, shown here as performance, on the vertical axis. When you run multiple unrolled loop iterations, each pass has a start-up ramp, a portion where you're fully in the loop, and then a ramp down from that. That's already better than having lots of start-up and wind-down around only a small in-the-loop portion in the middle. But when we look at the software-pipelined version, we can overlap one pass's wind-down with the next pass's start-up, so we execute the prologue once, run many iterations very tightly packed, and then have a single epilogue at the end. Yeah, so.
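Here is a rough source-level sketch of that prologue / steady-state kernel / epilogue structure, again under the assumed names from before. A real compiler does this on the scheduled machine code, overlapping loads, adds, and stores from different iterations in the same VLIW bundles; this C version just makes the staging visible by starting iteration i+1's load while iteration i is still finishing.

```c
#include <stddef.h>

/* Software-pipelined flavor of the array increment (simplified to a     */
/* depth of two stages).  The prologue issues the first load, the kernel */
/* overlaps the store of iteration i with the load of iteration i+1,     */
/* and the epilogue drains the last in-flight element.  N is assumed to  */
/* be at least 1.                                                        */
void array_increment_swp(float *x, size_t N, float C)
{
    /* Prologue: get the pipeline started with the first load. */
    float in_flight = x[0];

    /* Kernel: each trip finishes element i and starts element i + 1.    */
    /* On a VLIW these two halves would sit in the same instruction      */
    /* bundles, hiding the load and add latencies.                       */
    for (size_t i = 0; i + 1 < N; i++) {
        float next = x[i + 1];     /* load for the next iteration    */
        x[i] = in_flight + C;      /* add + store for this iteration */
        in_flight = next;
    }

    /* Epilogue: drain the last element that is still in flight. */
    x[N - 1] = in_flight + C;
}
```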
Our software pipelining pays the start-up and wind-down costs once for the execution of the whole loop, and not on every iteration of the loop. So that's fun, that's cool, we're getting performance. If only the world were all dense loops, life would be easy. Alas, the world is not all loops. If we just had a processor that did array calculations, and all of our problems in the world were dense array computations, life would be really easy. But they're not. A lot of the time, code has lots of branches; it has if-then-else clauses. Here we show something like an if-then-else graphically: we have some piece of code that makes a decision and executes the code on the left or the code on the right, based on an if statement. So this is the if clause and this is the else clause. And data-dependent branches are typically a problem for very long instruction word processors. Why is that? Well, in an out-of-order processor, you can try to execute code around the branches, moving instructions above the branch and below the branch. But if you're doing static scheduling and you hit this branch, and let's say it's a hard-to-predict branch, you can't really do much, because you've packed a bunch of instructions next to each other and they need to execute together. So it's hard to move code up and down across that branch. A superscalar can do that, because it has its instruction window and a bunch of hardware techniques to be able to do it. But in our VLIW processor, that's a problem.

So I want to introduce an important piece of compiler nomenclature here, which is important for this class: the basic block. What is a basic block? A basic block is a piece of code which has a single entry and a single exit. So this is a basic block: it has one entry and one exit. Why is single entry important? Well, if you can jump into the middle of a piece of code, the compiler cannot necessarily reorder the instructions inside that block. And if you have multiple exits, let's say you can exit here, the compiler can't push instructions below that exit point. But if you have a basic block, the compiler knows that this instruction sequence is going to execute effectively atomically; not actually atomically, since other things can be going on in the machine, but from the compiler's perspective it can reorder the instructions inside it to get better performance.

Hm, okay. So loops are easy: we can software pipeline, we can unroll. Squirrelly code, if-then-else, spaghetti code is hard. So the compiler folks came up with some fancy tricks to make VLIWs work better and take advantage of some of the code motion that an out-of-order superscalar does, but in the compiler and not dynamically in hardware. One of the more famous ways to do this is something called trace scheduling, which was John Ellis's thesis work; he was one of Josh Fisher's students. This was the Bulldog compiler out of Yale. So what you do is you profile the code and compare the probabilities that these branches go one way or the other. Profiling is not something the hardware does at run time; it's something you do with the program while you're still back at the compile stage. You take the program, the compiler runs it on some given input set, and it comes up with the probabilities of which way things go.
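To tie the basic-block idea to the profiling step, here is a tiny hedged example. The function and the 90/10 split are made up for illustration; the comments mark where the basic-block boundaries fall and which path a profile run might identify as the hot trace.

```c
/* A small kernel with data-dependent control flow.  The percentages in  */
/* the comments are hypothetical profile results, just to illustrate     */
/* what trace scheduling would consume.                                  */
int saturate_sum(const int *a, int n, int limit)
{
    int sum = 0;                    /* ---- block 0: single entry point      */
    for (int i = 0; i < n; i++) {   /* loop-test branch ends the block       */
        if (a[i] < limit) {         /* ---- block 1: load, compare, branch   */
            sum += a[i];            /* ---- block 2: common case, say taken  */
                                    /*      ~90% of the time per the profile */
        } else {
            sum += limit;           /* ---- block 3: rare case (~10%)        */
        }
    }                               /* back-edge branch closes the loop      */
    return sum;                     /* ---- block 4: single exit             */
}
```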
And then what you do is take this profile information and come up with a guess at the most probable path, and we're going to circle that here. These darkened edges are the most probable path through this squirrelly piece of code, given that this is the entry point. Now, this doesn't mean you can't have branches that branch out of this path, but if you do, you need some fix-up code, because what we're about to do is take this entire big trace of code, remove all of the branches along it, and schedule it for our VLIW processor as one big monolithic chunk of code. By doing this we can take instructions that are down here and move them up to execute in the early portion of this code sequence, and likewise we can take things that use the results of long-latency instructions up here and push them down across branches. Our out-of-order superscalar does this with branch speculation, but our compiler can do it on our VLIW processor using trace scheduling.

When we do this, we have to be careful, because there's always the possibility, however unlikely, that a branch still goes the other way. So typically the way this is handled is with some form of fix-up code: if you branch away off the trace, you have to fix up anything that was moved across that branch and that made a committed change, if you will, to the processor state, and roll it back somehow. So we're basically doing, in software, the rollback case from our out-of-order superscalar. Instead of taking the architectural register file and copying it into the physical register file on a branch mispredict, our compiler generates a code sequence that does that same operation if you do branch away there, and it rolls back only the particular registers that need to be rolled back and only the memory state that needs to be rolled back. So that's pretty cool: we can basically take all the functionality that was done in our out-of-order superscalar and put it in software using trace scheduling.
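Purely as a hedged illustration (the function name, the values, and the "hot path" assumption are all made up, and a real trace scheduler works on scheduled machine code, not C source), here is the flavor of hoisting work above a branch along the expected trace and paying for it with compensation code on the off-trace path.

```c
/* Conceptual before/after.  Suppose the profile says the "hot" arm of   */
/* the if is taken almost always, so the work it needs (scale * x) is    */
/* hoisted above the branch where it can be bundled with earlier         */
/* instructions.  If the rare path is taken instead, the compensation    */
/* code just ignores the hoisted value.                                  */
int traced_kernel(int x, int scale, int rare_flag)
{
    /* Hoisted from the hot arm: safe to do early because it has no     */
    /* side effects (no store, no trap), so nothing has to be rolled    */
    /* back if the branch goes the other way.                           */
    int hot_result = scale * x;

    if (rare_flag) {
        /* Off-trace path: fix-up / compensation code.  The hoisted     */
        /* multiply was wasted work here, but no architectural state    */
        /* was committed early, so there is nothing to undo.            */
        return x - scale;
    }

    /* Hot trace: straight-line code using the hoisted result. */
    return hot_result + 1;
}
```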