Okay, so we left off talking a little bit about how VLIWs encode instructions. And as you may recall, in a VLIW instruction encoding, you have a very long instruction word. It could be many, many bytes. If you look at something like the Multiflow Trace processors, they had very long instruction words; in their largest machine configuration they could execute, I think, something like twenty-something operations per instruction. And they had to encode all of this, but one of the challenges was that you didn't necessarily want all that encoding space used all the time. If you're executing a simple instruction, or something which doesn't actually need all the encoding space, you end up with a whole lot of no-ops. Or if you have a very long dependency chain in your execution trace, if you will, so you have an instruction which is dependent on the next instruction, which is dependent on the next instruction, and so on, you cannot fill any of the extra slots, and you're going to end up with a lot of no-ops in your program. So how do we go about solving this? A couple of different solutions here. The Cydra-5 went off and added a particular instruction, a single-operation instruction, which used a much smaller amount of memory. So this would effectively reduce your instruction cache pressure. There were compressed formats in memory, so you can hold your instruction sequences compressed in memory, and then when you pull them into your instruction cache, they expand to the fully expanded version. And this is kind of like a compression algorithm: you're basically compressing your VLIW instruction sequences, because otherwise these things are very, very large and have lots and lots of zeros, or no-ops, in them.
Because the probability that you're going to fill, say, twenty instructions per cycle is low, but you want that capability there in case you actually do want to execute twenty instructions per cycle. Another solution here is what's actually used in the Intel Itanium, or IA-64, processor line, and also in the TI DSPs, the C6000 series. They have the ability to effectively mark where a bundle starts and stops. So it's something in between a variable-length instruction and a fixed-format VLIW instruction. We'll talk more later in today's class about how IA-64, Intel's Itanium, handles this. But roughly what happens is they have a fixed instruction format, and you can fit three instructions per bundle, or what they call a bundle. And then, if you want to go over that, you can keep going over that; they have special bits in the instruction which say, well, this is going to express some more parallelism later. So you need to look at the next bundle, fetch that bundle, and all of that can be executed in parallel. So it's still a parallel execution semantics, but the format is not necessarily fixed. And this is sort of outside the bounds of classical VLIW now; it's getting into slightly more modified, or more contemporary, VLIWs. So today, as I said, we're going to talk about how to go about getting all the performance that out-of-order superscalars can get, but in the context of a VLIW. We're going to slowly build up all the different techniques that out-of-order superscalars have, all the ways that they can get instruction-level parallelism, and then apply them inside of a VLIW, or see how we can make small changes to classical VLIWs to get a lot of that performance back. Okay, so let's talk about the first trick here. One of the problems that comes up is that if you have branches and you mispredict, this can limit your instruction-level parallelism in a VLIW.
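As a rough sketch of the marking idea, here is how a decoder might find execute-group boundaries when each operation carries a "parallel" bit, in the style of the TI C6000's p-bit: if the bit is set, the next operation belongs to the same cycle; if it's clear, the group ends there. The struct layout is invented for illustration, not the real C6000 or Itanium encoding.

```c
/* One decoded operation: an opcode plus a "p" (parallel) bit.
   p == 1 means the *next* op executes in the same cycle. */
struct op {
    unsigned opcode;
    int p;
};

/* Walk a stream of ops and record the size of each execute group.
   A group ends at the first op whose p bit is 0. Returns the
   number of groups found; sizes[] must have room for n entries. */
int find_groups(const struct op *ops, int n, int sizes[]) {
    int groups = 0, len = 0;
    for (int i = 0; i < n; i++) {
        len++;
        if (!ops[i].p) {
            sizes[groups++] = len;
            len = 0;
        }
    }
    return groups;
}
```

With this scheme a "bundle" of parallel work can be any length: three ops with their p-bits set chained onto a fourth form one four-wide group, while four ops with clear p-bits are four sequential one-op groups, all out of the same fixed-format stream.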
Okay, so why is this worse than in an out-of-order superscalar? Well, an out-of-order superscalar can dynamically schedule around branch mispredicts. So if your branch mispredicts, you can basically schedule other code, or speculatively schedule code depending on your prediction. But on something like a VLIW processor, that's a lot harder to do, because you don't have a dynamic scheduling engine back there. The compiler had to do all that scheduling statically, at compile time. So how do we go about fixing this? Well, one solution is just to eliminate all the hard-to-predict branches. So you take out the branches that might be hard to predict. Your branch predictor is trying really hard, and you could still have a branch predictor in your VLIW processor, but some of these things you just don't know. Take a data-dependent branch: how do you know which way it's going to go? And, you know, this is a problem even for superscalars, but at least in a superscalar you can try to backfill instructions, or try to move things around, say move loads or long-latency instructions around branches. So in an out-of-order superscalar you can potentially move a load beyond a branch. But in a VLIW processor, that's not easy to do, because the compiler has to make that decision, whether that branch is going to be predicted taken or not taken. And that may change over the lifetime of the program, and it may also change depending on the input sets to the program. So it may not even be knowable at compile time. Okay, so the first technique we're going to use is called predication. Let's define what predication is. Well, first of all, predication is a technique that allows you to basically change the result of a register based on some other register. And it's going to let us transform hard-to-predict branches into data flow, or data dependencies.
So we're going to add something to our instruction set, and it's going to take what looks like a small branch, or a short-distance branch, and we're going to basically execute both sides of that branch, the taken and the not-taken code paths. And then at the end, we decide which one was the right one, with something like a select operator from C. So this is like the question-mark-colon operator, which almost no one ever uses in C. This is the equivalent of that, but we're going to do it in one instruction, or maybe two instructions. So let's take a look at how this helps with small branch regions, or branches which are hard to predict. We're going to introduce two instructions, and I also want to point out that predication sometimes shows up in superscalar processors, or in sequential instruction sets, also. A good example of this is that two instructions very similar to these are actually in x86; they're called cmov, or conditional move. And we're going to introduce conditional move as the most basic form of predication. So let's look at the semantics of the two instructions we add for a conditional move. First instruction here: move zero. What does move zero do? Well, move zero tests whether one of the input operands, rt, is equal to zero. And if it's equal to zero, it's going to move rs into the result register, the destination register rd. So it's going to move rs into rd if rt is equal to zero. And if rt is not equal to zero, it's just going to leave rd alone. Okay, at first appearance that seems like a very simple instruction. It's basically just doing a copy operation; it's a gated copy. Looks a lot simpler than something like add or subtract. We don't have to do any math, at least. And the compare is easy; it's an equals-zero test, not a less-than-or-equals. So it's a pretty simple instruction. And we're also going to introduce move not zero, which is kind of the complement of this one.
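The semantics of these two instructions can be written down as a couple of one-line C helpers; this is just a behavioral sketch of the MIPS-style movz/movn pair described above, treating each register as an int.

```c
/* movz rd, rs, rt : rd gets rs if rt == 0, else rd is unchanged.
   Returning the (possibly unchanged) rd models the non-destructive
   behavior: when the condition fails, the old value survives. */
static inline int movz(int rd, int rs, int rt) {
    return (rt == 0) ? rs : rd;
}

/* movn rd, rs, rt : the complementary sense, rd gets rs if rt != 0. */
static inline int movn(int rd, int rs, int rt) {
    return (rt != 0) ? rs : rd;
}
```

Note that in both cases the only "work" is an equals-zero test and a gated copy, which is why these are cheap to add even to a sequential instruction set.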
And we're going to want this because when we go to execute code, we're going to want to, let's say, have an if-then-else, and we're going to want to execute the then and the else at the same time, or roughly at the same time, indiscriminately, not depending on the branch. And at the end, we're going to decide which one was the correct one. And we need two instructions here so we can choose whether the case where the condition value is true was the right one, or the case where it's false, or vice versa. So, move not zero. Move not zero is going to do the same thing, except the sense of the condition is going to be different. It's going to test if rt is not equal to zero, and if rt is not equal to zero then rs is going to go into rd; else we're just going to leave rd alone with its original value. Okay. So let's walk through a quick code example here and see how this helps. So here we have a code example. We have some C code, we have non-predicated MIPS-like assembly code, and then we have MIPS-like assembly code with these fancy two predication instructions. And all three of these are doing the same thing. So let's look at this piece of code; the C code is probably the easiest to understand. It's going to start off and say if a is less than b, so that's the condition, and there's a then and an else. So what does this actually implement? Well, this is a minimization function, or a min function. x gets the minimum of a and b. If a is less than b, it gets a. If a is greater than b, it gets b. So x is going to be assigned the smallest of a and b. And these if and else clauses, you can see, are pretty short; we don't have a whole lot of code in either the then case or the else case. And that's going to be important in a second; you'll see why.
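Here is the min example written both ways in C: once with the ternary select operator, and once as a line-by-line model of the predicated slt/movz/movn sequence. The register mapping in the comments is one consistent choice for illustration; the lecture's slide may assign registers differently.

```c
/* Behavioral helpers for the MIPS-style conditional moves. */
static inline int movz(int rd, int rs, int rt) { return (rt == 0) ? rs : rd; }
static inline int movn(int rd, int rs, int rt) { return (rt != 0) ? rs : rd; }

/* The C version: x = min(a, b) with the select operator. */
int min_ternary(int a, int b) {
    return (a < b) ? a : b;
}

/* The predicated version: execute both assignments, let the
   condition register decide which one sticks. No branch anywhere. */
int min_cmov(int a, int b) {
    int p = (b < a);     /* slt  p, b, a   : p = 1 iff b < a      */
    int x = 0;           /* result register                       */
    x = movz(x, a, p);   /* movz x, a, p   : p == 0 -> x = a      */
    x = movn(x, b, p);   /* movn x, b, p   : p != 0 -> x = b      */
    return x;
}
```

Both moves execute every time; exactly one of them actually writes x, which is the control-flow-into-data-flow transformation in its smallest form.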
Mainly it revolves around the fact that if you have lots and lots of code, or one of the sides of the branch is long, it may not make sense to actually predicate this, because there's some cost to predication. Okay, so here's some assembly code that does the same sort of thing. Set less than is a comparison operation in MIPS. So we do a set less than, and R2 and R3 are going to hold our two values, a and b. If it's less than, we put a one into R1; if not, R1 gets set to zero. And then we have a branch here, effectively a branch-if-zero. Okay, so a branch-equal against zero; it's a little odd, we probably should have just put a branch-zero there instead. But it's going to check whether this value is zero or one. If it's the one direction, it's going to jump to L1; if it's the other direction, it's going to fall through and then jump over the second move. And these two moves here are the two assignment operations, the x = a and the x = b. Okay, that's not so bad. Let's analyze how long this takes, how many cycles this will take to execute on a sequential machine. Okay, there's the set less than, that's one. Then the branch, that's two. Let's say the branch is not taken and falls through: the move is three, and then the jump over is four. So it's going to be one, two, three, four instructions in that case. And in the other case, it's going to take the branch and jump over all that stuff, so it's one, two, three instructions. Okay, that's not so bad. So the one case is four and the other case is three. They're different. Now, something else that makes this interesting: let's say this branch is mispredicted. It's a data-dependent branch on a and b. So what's a good thing to predict here? Can our branch predictor do a good job on this? Hm, I don't know. It depends on what you know about a and b.
If the compiler knows nothing, or the hardware knows nothing, you're not going to be able to do a particularly great job. So let's say it's a data-dependent branch and it goes 50 percent one direction and 50 percent the other direction. Is it going to slow down our execution? Well, sure. It's going to slow down our execution, because all of a sudden, whichever way we predict, the mispredicted path is going to add some mispredict penalty. So instead of one path being three and the other path being four, let's say the branch is predicted, oh, I don't know, let's say the branch is predicted not taken. So it predicts the fall-through case. The fall-through case is still going to be one, two, three, four instructions. And let's say we have a two-cycle branch mispredict penalty. So let's look at the other case. The other case is going to be one, two, and then two cycles of mispredict penalty, which gets us up to four; then the instruction at the branch target, five. So it can be five cycles. So all of a sudden we have our branch mispredict penalty coming into the equation here. And, you know, two-cycle branch mispredicts are relatively benign. If that starts to get longer, then you really start to have problems with this. Okay. So let's take this code sequence and look at how to predicate it. If we're going to predicate this, we're actually going to execute both sides, regardless of the value of the if clause, the conditional clause here. So if we look at what we do, we've actually restructured the code a little bit. This is our compare, this is our if, still up here: we have a set less than. And then we say, if R1 is zero, move R2 to R4, else just leave it alone. And then here, if R1 is not zero, move R3 to R4.
And what's important to note here is that the non-destructive behavior, especially of this last instruction, is really important. Because if the move zero executed and copied a new value into R4, and then this move not zero were to somehow change R4, you'd lose the result from the previous one. This really depends on these operations being non-destructive when their condition is not true. Okay. So let's analyze this from a number-of-instructions perspective. First question: does it matter which way the condition goes? Is it going to take a different number of instructions, or a different number of cycles, depending on the direction of the condition? Well, no; there's no branch, so nothing's going to change. In all cases, this is going to take three cycles. So all of a sudden we've transformed something which could take, as I said, four cycles one way and five cycles the other; if the branch goes each way 50 percent of the time, we'll split the difference and call it four and a half cycles on average. And we turned it into something which takes three. Well, that's a great win. So where else is this useful? Because you don't always have both if and else clauses to make this work out. Well, this also works well for small code hammocks. A small code hammock is just an if which jumps around a small piece of code. If you can't predict that if very well, then it makes sense to just always execute what's inside that code hammock, that small piece of code. If, for instance, you have two instructions inside of a code hammock and your mispredict penalty is something like ten cycles, and the branch is taken 50 percent of the time either way, all of a sudden your average mispredict penalty is going to be something like five cycles.
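The four-and-a-half-versus-three arithmetic above can be written out as a tiny cost function; this is just the lecture's instruction counts (used as a rough cycle proxy), parameterized by the taken probability and the mispredict penalty.

```c
/* Expected cycles for the branchy min sequence, assuming the
   branch is predicted not-taken, as in the lecture's example:
   fall-through path = 4 instructions (slt, beqz, mov, j),
   taken path = 3 instructions plus the mispredict penalty. */
double branchy_avg(double taken_prob, int mispredict_penalty) {
    int fallthrough_cycles = 4;
    int taken_cycles = 3 + mispredict_penalty;
    return (1.0 - taken_prob) * fallthrough_cycles
         + taken_prob * taken_cycles;
}
```

With a 50/50 branch and a two-cycle penalty this gives 0.5 * 4 + 0.5 * 5 = 4.5 cycles on average, against a flat 3 cycles for the predicated version, which is the win being claimed.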
You could have just executed the code in the middle; that would have been faster. You could have executed the two instructions and then just done a conditional move at the end, and that would have been basically always better. Okay, so we have some questions here. What if the if-then-else has lots of instructions? Hm, so let's say we have an if-then-else here, but there are a thousand instructions in the then and a thousand instructions in the else. Should we predicate this? Let's say the branch mispredict penalty is ten cycles and the branch goes 50 percent either way, but there are a thousand instructions in the then clause and the else clause. Well, no, that would never make sense to do. Because you're basically doubling your code, and if there's a lot of code there, you'd be better served by just doing the branch. You'd be taking something that could be 1,000 instructions and turning it into always 2,000 instructions, doubling the number of instructions executed, versus just adding, let's say, ten cycles of branch mispredict penalty, or an average of five cycles of branch penalty, on top of the 1,000 that you have to execute. So you're only adding a little overhead. So where this predication really helps is for small pieces of code where you don't know the branch outcome, or can't predict the branch outcome very well. Okay, what if the code is unbalanced? So you have an if-then and an else, but the then clause and the else clause are very different in size. One is three instructions, and the other is a thousand instructions. Does this make sense? Well, it goes back to the same argument we had before with the very large if-then-else clauses. If the code is really unbalanced, you have a lot in the then clause and a little bit in the else clause, or something like that.
And let's say the branch is taken 50 percent of the time. It might make sense just to use an actual branch there and deal with the mispredict penalty cost, versus trying to somehow predicate it. Because in fact, what's going to happen is, let's say you have three instructions and a thousand instructions on the two different sides. If you predicate, you're going to be executing a thousand and three instructions every single time, plus maybe some conditional moves at the end, some overhead. Versus, if it's 50-50, and you have to add a little bit of branch penalty, your 50-50 chance is going to say, well, I'm taking a thousand plus three, adding those two together, and taking the average, basically. So it's going to be the average of that, something like 501 or 502 cycles. That's going to be better than always executing a thousand-plus cycles. Let's see, anything else we wanted to say here? One last thing I wanted to say before we move off this slide. Move zero and move not zero, sometimes called cmovs, conditional move instructions, are what's called partial predication. And we're now going to move on to full predication, where we can actually put conditions on every single instruction in our ISA. But partial predication is a little bit easier to do than full predication; it's the first step that people typically implement. And just to hammer this home one more time: predication is really turning what was control flow into a data-flow operation, or a data operation. Okay, so let's take a look at full predication.
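The whole when-to-predicate discussion boils down to comparing two expected costs, which we can sketch as a pair of functions. This is a deliberately crude model, assuming one instruction is roughly one cycle and ignoring issue width, just to reproduce the lecture's numbers.

```c
/* Expected cost of keeping the branch: average of the two path
   lengths (50/50 assumed) plus the expected mispredict penalty. */
double branch_cost(int then_len, int else_len,
                   double mispredict_rate, int penalty) {
    return 0.5 * (then_len + else_len) + mispredict_rate * penalty;
}

/* Cost of predicating: always execute both sides, plus a few
   select/conditional-move ops at the end. */
double predicated_cost(int then_len, int else_len, int select_ops) {
    return (double)(then_len + else_len + select_ops);
}
```

Plugging in the lecture's cases: for a two-instruction hammock with a 50 percent mispredict rate and a ten-cycle penalty, the branch costs about 2 + 5 = 7 while predicating costs about 5, so predicate; for two thousand-instruction sides, the branch costs about 1,005 while predicating costs over 2,000, so keep the branch.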
With full predication, we're actually going to change the instruction set such that all of the instructions, or almost all the instructions, can be executed conditionally based on some predicate. If the predicate's false, the instruction is not going to do anything; it's not going to have any side effect. So let's look at a code sequence here. In this code sequence we have an if, a then, an else, and then some code after we're done. So here we have the if. It's a little bit different; this is not actually MIPS code, this is just some sequential code here, but this is our if: if a == b, then branch to B2, else fall through. Here we're going to execute our else clause. And when we're done, we're going to branch over the then clause. So, nothing too interesting here: four basic blocks. I don't know if we've introduced this term yet in this class, but I wanted to introduce the term basic block. A basic block is a piece of code which has strictly one entry point and one exit point. You could enter it from other places, in the sense that you can jump to it from other pieces of code, but once you enter the piece of code, you're going to execute it all the way to the end, and there's only one place you can leave. At the end you might branch, and you can fall out to two different places, but the exit point is in only one place. The big differentiation here is that a piece of code which has, for instance, two branches in the middle is not called a basic block; that's some bigger type of block. This is a compiler term, and it's just good to know. Okay, so let's apply full predication here. These are relatively short code hammocks: two instructions in the else clause and two instructions in the then clause. This is a really great candidate for predication.
Okay, so let's take a look at the predication here. Before we do that, I wanted to say that one of the big reasons we're trying to predicate on a VLIW in particular is that in a VLIW we have lots of extra, or could potentially have extra, parallel execution slots. So in our instruction sequence, or instruction encoding, we have extra places to put code, so it may not actually hurt us to predicate. Because, as we said last time, when we go to predicate we're basically going to execute both sides of the if-then-else. And it's bad if we have to execute both sides of the if-then-else sequentially when they're big. But if you can slide them up next to each other, if you have extra no-op slots or extra slots in your VLIW, you can actually decrease the critical path through your program. And we're actually going to do that in this example. So in this example here, we took this piece of code and added full predication. When we say full predication, we added some extra things to these instructions. They look a little strange at first. So here are instruction three, instruction four, instruction five, instruction six; these are the same ones that were back over here. But instead, we added some extra notation in front of them, these parentheses with p's in them. Those are predicate registers. So we're going to add a second register file to our processor where we can store effectively one-bit values, true or false values, that are going to determine whether an instruction executes or doesn't execute. So what this means is: if p1 is one, execute this; if p1 is zero, nullify this instruction three here. And we have a double pipe here. What am I trying to show with this? What I'm trying to show is that this is parallel.
This and this are executing at the same time, because we have a very long instruction word processor that can execute multiple things at the same time. And what we've done here is actually slid the then up next to the else, and we're going to execute them at the same time, dependent on these predicates. So this is the else block, and this is the then block, and we're going to execute them concurrently. Okay. The stuff before the if, and the stuff after the if-then-else, are all still there; those come from that four-basic-block example. Now, what's this instruction? This instruction has very interesting semantics, and this is something that you probably have to add if you're going to have full predication. Okay, so let's look at the compare. This is our comparison, if a == b, and it writes into two outputs. Ooh, that looks broken, that looks wrong somehow. How can we have one instruction write to two things? And what are the semantics of it? What does it write here and here? Well, what this is actually doing is writing the positive outcome of the compare to one predicate, and the inverse, the negation of that, to the other. And this is useful, because then we can use these predicates down here. We can use the positive one and the negative one, and they can drive whether we execute the else clause or the then clause. So instruction sets that have full predication typically do something like this. This is one option: you can have two outputs, and you load two predicate registers. The other option is, when you actually go to encode the instructions, you have a sort of "not" bit. You denote the predicate register, but you also denote whether it's the predicate register being true, or the predicate register being false, that drives whether the instruction executes in our full predication scheme.
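Here is a small C model of that scheme, purely for illustration: the compare writes both a predicate and its complement into a little predicate register file, and a predicated add is nullified when its guard is false. The register-file size and the particular guarded operation are made up; a real ISA would guard every instruction this way.

```c
#include <stdbool.h>

/* A small predicate register file, like the second register
   file of one-bit values described above. */
static bool pred[4];

/* cmp.eq p_true, p_false = (a == b) : one compare, two outputs.
   Writes the outcome to one predicate and its negation to the other. */
void cmp_eq(int p_true, int p_false, int a, int b) {
    pred[p_true]  = (a == b);
    pred[p_false] = !(a == b);
}

/* (p) add rd, rs, rt : executes only when predicate p is true;
   when p is false the instruction is nullified, no side effect. */
void padd(int p, int *rd, int rs, int rt) {
    if (pred[p])
        *rd = rs + rt;
}
```

So "if (a == b) x = a + 1; else x = b + 2;" becomes straight-line code with no branch: cmp_eq(1, 2, a, b); padd(1, &x, a, 1); padd(2, &x, b, 2); and on a VLIW the two guarded adds could sit in the same long instruction.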
So in effect, this is computing p and p-bar at the same time, writing them to two different registers, and then we can use those separately down here. An interesting result here: back in 1995, Scott Mahlke showed that you can, on average, remove something like 50 percent of your branches by using full predication. Now, full predication is not easy to build. Very few instruction sets have actually had full predication. Itanium was probably the closest to a real processor that was built that has almost full predication; it doesn't have quite full predication. The HP PlayDoh instruction set, I think, actually had full predication. I think that's what Scott Mahlke was working with, or a derivative of that. So some people who've thought about this have had full predication, but none of these things are easy. First, you need to add an extra register file for the predicates; you need to bypass those predicates; you need to compute them. There are some overheads involved, and we're going to talk a little bit more about them.