1 00:00:03,025 --> 00:00:09,071 Okay, so we left off talking a little bit about how to encode instructions in VLIW 2 00:00:09,071 --> 00:00:15,902 or how VLIW's encode instructions. And as you may recall, in a VILW 3 00:00:15,902 --> 00:00:20,483 instruction encoding, you've a very long instruction work. 4 00:00:20,483 --> 00:00:23,286 So it could be I don't know many, many bytes. 5 00:00:23,286 --> 00:00:26,748 So, if you go and get something like the multi flow processors, multi flow trace 6 00:00:26,748 --> 00:00:30,036 processors. They had very long instruction words. 7 00:00:30,036 --> 00:00:34,084 They could, I think, execute something like and their largest machine 8 00:00:34,084 --> 00:00:37,115 configuration twenty something instructions. 9 00:00:37,115 --> 00:00:42,674 And they had to encode this but one of the challenges was that you didn't necessarily 10 00:00:42,674 --> 00:00:46,066 want to have all that encoding space used all the time. 11 00:00:46,066 --> 00:00:51,007 So if you are executing a simple instruction, or something which doesn't 12 00:00:51,007 --> 00:00:55,985 actually need all the encoding space. You end up of a whole lot of no ops. 13 00:00:55,985 --> 00:00:59,808 Or if you have a very long dependency string in your execution trace if you 14 00:00:59,808 --> 00:01:01,392 will. So you have a instruction which is 15 00:01:01,392 --> 00:01:04,815 dependent on the next instruction, which is dependent on the next instruction, 16 00:01:04,815 --> 00:01:07,025 which is dependent on the next instruction. 17 00:01:07,025 --> 00:01:09,023 And you cannot fill any of the extra slots. 18 00:01:09,023 --> 00:01:11,096 You're gonna end up having a lot of no ops in your program. 19 00:01:11,096 --> 00:01:16,095 So how do we go about solving this. Couple, couple different solutions here. 20 00:01:17,239 --> 00:01:23,107 Cydra-5 went off and added a particular instruction, which is a single operation 21 00:01:23,107 --> 00:01:27,098 instruction. And used much smaller amount of memory. 22 00:01:27,098 --> 00:01:33,021 So this would effectively reduce your instruction cache pressure. 23 00:01:34,004 --> 00:01:38,186 There was compressed formats in memory, so you can hold your instruction sequences in 24 00:01:38,186 --> 00:01:42,053 memory, and then when you go and pull it into your instruction cache, it would 25 00:01:42,053 --> 00:01:46,070 expand to the fully expanded version. And this is kind of like a compression 26 00:01:46,070 --> 00:01:49,028 algorithm. You're basically, compressing your VLIW 27 00:01:49,028 --> 00:01:52,055 instruction sequences. Because otherwise these things are very, 28 00:01:52,055 --> 00:01:55,066 very large and have lots and lots of zero's or no ops in them. 29 00:01:55,066 --> 00:01:59,083 Because, the probability you're gonna fill, sort of twenty instructions per 30 00:01:59,083 --> 00:02:02,901 cycle, is low, but you want that capability there, in case you'd actually 31 00:02:02,901 --> 00:02:06,006 do want to execute twenty instructions per cycle. 32 00:02:06,870 --> 00:02:15,029 Another solution here is what's actually used in the Intel Itanium or IA64 33 00:02:15,029 --> 00:02:22,009 processor line and also the TI DSP's, the C6000 series. 34 00:02:22,009 --> 00:02:29,041 They, they have the ability to effectively mark where a bundle starts and stops. 35 00:02:29,041 --> 00:02:34,097 So it's something sort of between a, variable length instruction and a fixed 36 00:02:34,097 --> 00:02:40,041 format, VLIW instruction. So we'll talk more later in today's class 37 00:02:40,041 --> 00:02:43,511 about how IA64 or Intel's Itanium handles this. 38 00:02:43,511 --> 00:02:49,332 But roughly what happens is, they have a fixed instruction format and you can fit 39 00:02:49,332 --> 00:02:52,951 sort of, three instructions per bundle, or what they call a bundle. 40 00:02:52,951 --> 00:02:57,438 And then, if you wanna go over that, you can keep going over that, and they have 41 00:02:57,438 --> 00:03:01,936 special bits in their instruction where it says, oh, well this is gonna express some 42 00:03:01,936 --> 00:03:06,000 more parallelism later. So you need to look at the next bundle, 43 00:03:06,000 --> 00:03:09,792 and try to fetch that bundle, and that's all can be executed parallel. 44 00:03:09,792 --> 00:03:14,053 So it's still a, a parallel execution semantics, but the formats are not 45 00:03:14,053 --> 00:03:17,278 necessarily fixed. And this is, sort of, outside the bound of 46 00:03:17,278 --> 00:03:23,006 classical VLIW now it's going to a little bit more modified VLIWs, or, or more 47 00:03:23,006 --> 00:03:27,006 contemporary VLIW's. So today, we're gonna as I said, talk 48 00:03:27,006 --> 00:03:32,062 about how to go about getting all the performance that out of order superscalars 49 00:03:32,062 --> 00:03:37,534 can get, but in the context of a VLIW. So we're gonna slowly build up all the 50 00:03:37,534 --> 00:03:42,884 different techniques that out of order superscalars have, and all the ways that 51 00:03:42,884 --> 00:03:46,191 they can get instruction level parallelism. 52 00:03:46,191 --> 00:03:51,857 And then apply that inside of VLIW or how we make small changes to classical VLIW's 53 00:03:51,857 --> 00:03:57,032 to get a lot of the performance back. Okay, so let's talk about the first, the 54 00:03:57,032 --> 00:04:02,037 first trick here. One of the questions that comes up or the 55 00:04:02,037 --> 00:04:08,662 problems that comes up here is if you have branches, and you mis-predict. 56 00:04:08,662 --> 00:04:13,482 This can limit your instruction level parallelism in a VLIW. 57 00:04:13,482 --> 00:04:17,532 Okay, so why is this worse than in a out of order superscalar? 58 00:04:17,532 --> 00:04:23,303 Well an out of order superscalar can dynamically schedule around branch miss 59 00:04:23,303 --> 00:04:26,834 predicts. So if your branch miss predict you can 60 00:04:26,834 --> 00:04:31,946 basically schedule other code, or speculatively schedule code depending on 61 00:04:31,946 --> 00:04:35,823 your prediction. But on something like a VLIW processor, 62 00:04:35,823 --> 00:04:40,301 that's a lot harder to do because you don't have a dynamic scheduling engine 63 00:04:40,301 --> 00:04:42,990 behind there. The compiler had to go do all that 64 00:04:42,990 --> 00:04:48,700 scheduling statically at compile time. So how do we, how do we go about fixing 65 00:04:48,700 --> 00:04:51,514 this? Well, one solution is just to eliminate 66 00:04:51,514 --> 00:04:56,011 all the hard to predict branches. So you take out the branches that might be 67 00:04:56,011 --> 00:04:59,043 hard to predict. So your branch predictor is trying really 68 00:04:59,043 --> 00:05:03,085 hard, you could still have a branch predictor in your VLIW processor, but some 69 00:05:03,085 --> 00:05:07,057 of these things, you don't know. It's like a data dependent branch. 70 00:05:07,057 --> 00:05:10,004 How do you know which way it's going to go. 71 00:05:10,004 --> 00:05:14,032 And, you know, this is a problem even for superscalars, but at least in a 72 00:05:14,032 --> 00:05:19,022 superscalar, you can try to sort of back filled instructions, or try to, try to do 73 00:05:19,022 --> 00:05:23,081 something, or move around, if you will, loads or long laid instructions around 74 00:05:23,081 --> 00:05:25,236 branches. So in a super, in a out-of-order 75 00:05:25,236 --> 00:05:29,043 superscalar you can potentially move a load beyond a branch. 76 00:05:29,043 --> 00:05:34,027 But in a VLIW processor, that's not easy to do, because compiler has to make that 77 00:05:34,027 --> 00:05:38,044 decision, whether that branch is gonna predict take in or not take in. 78 00:05:38,044 --> 00:05:42,006 And it may, that may change over the lifetime of the program. 79 00:05:42,006 --> 00:05:51,006 And it may also change as the depending on the sort of input sets to the program. 80 00:05:51,006 --> 00:05:55,079 So it may not be something that's compiler time even known knowable. 81 00:05:55,079 --> 00:06:01,931 Okay, so our first technique here that we're gonna use is called predication. 82 00:06:01,931 --> 00:06:06,000 So let's, let's define what predication is. 83 00:06:06,000 --> 00:06:14,112 Well, first of all, predication is a technique, that allows you to basically 84 00:06:14,112 --> 00:06:20,270 change the result of a register, based on some other register. 85 00:06:20,270 --> 00:06:27,782 And it's going to let us transform hard to predict branches into data flow or data 86 00:06:27,782 --> 00:06:32,315 dependencies. So we're gonna add something to our 87 00:06:32,315 --> 00:06:38,831 instruction set and it's gonna take what looks like a small branch or short 88 00:06:38,831 --> 00:06:44,775 distance branch and we're going to basically execute both sides of those 89 00:06:44,775 --> 00:06:49,035 branches or the taken and the non taken code path. 90 00:06:49,035 --> 00:06:53,020 And then at the end, decide which one was the right one, with a, something like a 91 00:06:53,020 --> 00:06:56,027 select operator from C. So this is like the question mark colon 92 00:06:56,027 --> 00:06:59,098 operator, which no one ever uses in C. This is sort of the equivalent of that, 93 00:06:59,098 --> 00:07:03,025 but we're gonna do it in one instruction, or maybe two instructions. 94 00:07:03,025 --> 00:07:09,040 So let's, let's, let's take a look at how this helps with small branch regions or 95 00:07:09,040 --> 00:07:14,015 branches which are hard to predict. So, we're gonna introduce two 96 00:07:14,015 --> 00:07:19,573 instructions, and I also want to point out, that predication sometimes shows up 97 00:07:19,573 --> 00:07:25,027 in superscalar's processors or a, a sequential instruction sets also. 98 00:07:25,027 --> 00:07:30,054 So a good example of this is, two instructions very similar to this, are 99 00:07:30,054 --> 00:07:34,085 actually in x86. They're called c-move or conditional move. 100 00:07:34,709 --> 00:07:41,965 And we're gonna introduce conditional move as the most basic form of predication. 101 00:07:41,965 --> 00:07:47,728 So let's look at the semantics of these two instructions we add for a conditional 102 00:07:47,728 --> 00:07:52,929 move. First instruction here, move zero. 103 00:07:52,929 --> 00:08:00,072 So what does move zero do? Well, move zero test whether one of these 104 00:08:00,072 --> 00:08:07,701 input operands here, rt, is equal to zero. And if it's equal to zero, it's going to 105 00:08:07,701 --> 00:08:13,883 move rs into the result register, the destination register here. 106 00:08:13,883 --> 00:08:21,164 So it's gonna move rs and rt, or rs and rd if rt is equal to zero. 107 00:08:21,164 --> 00:08:28,758 And, if rt is not equal to zero, it's just going to leave rd alone. 108 00:08:28,758 --> 00:08:35,091 Okay, well, on first appearance is that seems like a very simple instruction. 109 00:08:35,091 --> 00:08:37,920 It's basically just doing a copy operation. 110 00:08:37,920 --> 00:08:42,176 It's a, it's a gated copy. Looks a lot simpler than something like 111 00:08:42,176 --> 00:08:45,325 add or subtract. We don't have to do any math, at least. 112 00:08:45,325 --> 00:08:49,920 There's no compare, I mean the, the compare is easy as the equal, it's not a 113 00:08:49,920 --> 00:08:52,894 less than equals. So it's pretty simple instruction. 114 00:08:52,894 --> 00:08:55,423 And we're also gonna introduce move not zero. 115 00:08:55,423 --> 00:08:58,338 So it's kind of a complimentary one of this. 116 00:08:58,338 --> 00:09:03,913 And, we're going to want this because when we go to execute code, we're going to want 117 00:09:03,913 --> 00:09:09,915 to, let's say, have a if then else, and we're going to have, want to execute the, 118 00:09:09,915 --> 00:09:14,799 the, the then and else at the same time or roughly at the same time, or 119 00:09:14,799 --> 00:09:17,292 indiscriminately not depending on the branch. 120 00:09:17,292 --> 00:09:20,521 And at the end, we're going to decide which one was the correct one. 121 00:09:20,521 --> 00:09:25,654 And we need sort of two instructions here to choose what was, was the positive one 122 00:09:25,654 --> 00:09:29,981 the right one, or the one where the, the conditional value is true or false the 123 00:09:29,981 --> 00:09:32,987 right one, or the one where it's true, or vice versa. 124 00:09:32,987 --> 00:09:37,121 So, move not zero. Move not zero is going to do the same 125 00:09:37,121 --> 00:09:40,949 thing here. It's going to see except the sense of the, 126 00:09:40,949 --> 00:09:44,025 of the condition code is going to be different. 127 00:09:44,025 --> 00:09:49,425 It's going to sense if rt is not equal to zero, and if rt is not equal to zero then 128 00:09:49,425 --> 00:09:55,095 rs is going to go into rd, else we're just going to leave rd alone with the original 129 00:09:55,095 --> 00:09:57,027 value of rd. Okay. 130 00:09:57,027 --> 00:10:01,086 So let's, let's walk through a, a quick code example here. 131 00:10:01,086 --> 00:10:07,364 And see, see how this helps. So here we have a code example, we have 132 00:10:07,364 --> 00:10:11,598 some C code. We have non-predicated MIPS-like assembly 133 00:10:11,598 --> 00:10:14,710 code. And then we have a MIPS-like assembly code 134 00:10:14,710 --> 00:10:18,015 here with these fancy two predication instructions. 135 00:10:18,015 --> 00:10:22,371 And they were all three of these are doing the same thing. 136 00:10:22,371 --> 00:10:27,899 So let's, let's look at this piece of code here and the C code is probably the 137 00:10:27,899 --> 00:10:31,935 easiest to understand. It's gonna start off it's gonna say if a 138 00:10:31,935 --> 00:10:36,777 is less than b, so is a condition, and there's a then and a else. 139 00:10:36,777 --> 00:10:43,885 So what does this actually implement? Well this is a minimization function or a 140 00:10:43,885 --> 00:10:47,524 min function. X gets the minimum of a or b. 141 00:10:47,524 --> 00:10:53,292 If a is less than b, it gets a. If a is greater than b, it gets b. 142 00:10:53,292 --> 00:10:59,674 So it's going to be assigned to x, the smallest value of a and b. 143 00:10:59,674 --> 00:11:06,825 And, these sort of, if and else is, you can see here it's pretty short, we don't 144 00:11:06,825 --> 00:11:12,503 have a whole lot of code in either the, the then case or the else case. 145 00:11:12,503 --> 00:11:17,022 And that's gonna be important in a second, you'll see why. 146 00:11:17,219 --> 00:11:22,583 Mainly it revolves around if you have lots and lots of code or one of the sides of 147 00:11:22,583 --> 00:11:27,276 the branches is, is, is long, it may not make sense to actually predicate this, cuz 148 00:11:27,276 --> 00:11:33,290 there's some cost to predication. Okay, so here's some assembly code does a 149 00:11:33,290 --> 00:11:37,687 similar sort of thing. Set less than, this is a comparison 150 00:11:37,687 --> 00:11:42,051 operation in MIPS. So we'll do set less than and these are 151 00:11:42,051 --> 00:11:47,233 going to be, R2 and R3 are going to have our two a and b values here. 152 00:11:47,233 --> 00:11:52,733 If it's less than, we're going to put it into R1, if not this gets set to one or 153 00:11:52,733 --> 00:11:55,938 zero. And then we have a branch zero here 154 00:11:55,938 --> 00:12:01,311 effectively otherwise this is a branch. Okay, so branch equal if zero, it's a 155 00:12:01,311 --> 00:12:04,104 little odd. We probably should have just put branch 156 00:12:04,104 --> 00:12:08,652 zero there instead. But is going to check to see whether this 157 00:12:08,652 --> 00:12:12,415 value is zero or one. And if it's the one direction it's going 158 00:12:12,415 --> 00:12:16,050 to jump to L1. If it's the other direction it's going to 159 00:12:16,050 --> 00:12:20,464 fall through and then jump over this move. And these two moves here are the two 160 00:12:20,464 --> 00:12:27,860 assignment operators, the x = a or x = b. Okay, that's, that's not so bad. 161 00:12:27,860 --> 00:12:34,096 Let's analyze how long this takes. How many cycles this will take to execute 162 00:12:34,096 --> 00:12:41,074 on a, on a sequential machine. Okay, it was a set less than. 163 00:12:41,074 --> 00:12:44,044 It's one. This is branch. 164 00:12:44,091 --> 00:12:52,046 So that's two. Let's say the branch is not taken or falls 165 00:12:52,046 --> 00:12:54,098 through. Three, four. 166 00:12:56,022 --> 00:13:00,096 And then the jump over. So it's gonna be one, two, three 167 00:13:00,096 --> 00:13:04,247 instructions in the one, cuz there was like, one, two, three, four instructions 168 00:13:04,247 --> 00:13:07,002 in the one case. 1,2,3,4 instructions in the one case. 169 00:13:07,002 --> 00:13:12,031 And the other case it's going to take the branch to jump here. 170 00:13:12,031 --> 00:13:17,075 So it's going to be one, two, jump over all the stuff, three instructions. 171 00:13:18,048 --> 00:13:20,085 Okay. That's, that's not so bad. 172 00:13:20,085 --> 00:13:24,097 So, the one case is four and the other case is three. 173 00:13:24,097 --> 00:13:29,016 They're different. Now something else that makes this 174 00:13:29,016 --> 00:13:33,057 interesting is let's say, this branch is mispredicted. 175 00:13:33,057 --> 00:13:38,067 It's a data dependent branch, a and b. So what, what's a good thing to predict 176 00:13:38,067 --> 00:13:41,082 here? Can our branch predictor do a good job on 177 00:13:41,082 --> 00:13:45,008 this. Hm, I don't know. 178 00:13:45,008 --> 00:13:49,055 It depends on what you know about a and b. If the compiler knows nothing, or the 179 00:13:49,055 --> 00:13:53,714 hardware knows nothing, you're not going to be able to do a particularly great job. 180 00:13:53,884 --> 00:13:58,079 So let's say it's a data dependent branch and it's 50 percent the one direction, and 181 00:13:58,079 --> 00:14:02,050 50 percent the other direction. It's just going slow down our execution. 182 00:14:03,027 --> 00:14:06,038 Well sure. It's going to slow down our execution, 183 00:14:06,038 --> 00:14:12,024 because all of a sudden, when we take a branch one of the directions whichever way 184 00:14:12,024 --> 00:14:17,013 we predict the mispredict path is going to add some mispredict penalty. 185 00:14:17,041 --> 00:14:23,058 So instead of the one path being three and the other path being four, let's say the 186 00:14:23,058 --> 00:14:29,038 branch is predicted oh, I don't know. Let's say the branch has predicted not 187 00:14:29,038 --> 00:14:32,058 taken. So it predicts the fall through case. 188 00:14:33,053 --> 00:14:38,029 So, all of a sudden, the fall through case is going to be one, two, three, four 189 00:14:38,029 --> 00:14:41,085 instructions. And let's say we have a two cycle branch 190 00:14:41,085 --> 00:14:45,021 mispredict penalty. So let's look at the other case. 191 00:14:45,021 --> 00:14:49,097 So the other case is going to be one, two, and then two cycles of mispredict. 192 00:14:49,097 --> 00:14:53,026 So that gets us up to four. Branch over, five. 193 00:14:53,026 --> 00:14:57,033 So it can be five cycles. So all of a sudden, we have our branch 194 00:14:57,033 --> 00:15:00,210 mispredict penalty coming into the equation here of. 195 00:15:00,210 --> 00:15:04,372 And, and, you know, two cycle branch mispredicts are relatively benign. 196 00:15:04,372 --> 00:15:09,710 If you, if that starts to go longer, then you really start to have problems with 197 00:15:09,710 --> 00:15:10,368 this. Okay. 198 00:15:10,368 --> 00:15:15,330 So let's take this code sequence and look about how to, how to predicate it. 199 00:15:15,330 --> 00:15:20,605 So if we're going to go predicate this, we're actually going to execute, both 200 00:15:20,605 --> 00:15:29,084 sides regardless of the value of the if clause or the conditional clause here. 201 00:15:30,027 --> 00:15:35,031 So, if we look at what we do, we've actually restructured the code here a 202 00:15:35,031 --> 00:15:39,624 little bit. And this is our compare, this is our if 203 00:15:39,624 --> 00:15:50,024 still up here, we have a set less then. And then we say, if R1 is zero, move R2 to 204 00:15:50,024 --> 00:16:01,302 R4, else, just leave it alone. And then here if, R1 is not zero move R3 205 00:16:01,302 --> 00:16:06,525 to R4. And what's important to know here is, this 206 00:16:06,525 --> 00:16:12,435 non destructive operation, especially here in this last instruction is really 207 00:16:12,435 --> 00:16:18,824 important, cuz if the move zero was executed in that, copied the new value in 208 00:16:18,824 --> 00:16:22,279 R4. And then this move not zero, or to somehow 209 00:16:22,279 --> 00:16:26,084 change R4, you lose that result from this previous ones. 210 00:16:26,084 --> 00:16:32,034 This really depends on these operations being non-destructive if, if their, if 211 00:16:32,034 --> 00:16:35,013 their condition is, is not true. Okay. 212 00:16:35,013 --> 00:16:39,070 So let's, let's analyze this from a number of instructions perspective. 213 00:16:40,005 --> 00:16:47,175 First question is, does this matter what happens to the branch, or what happens to 214 00:16:47,175 --> 00:16:50,094 the condition? Is it going to have different numbers of 215 00:16:50,094 --> 00:16:55,743 instructions, or different number of cycles dependents on the, the direction of 216 00:16:55,743 --> 00:17:01,003 the, the condition. Well, no, there's no branch, nothing's 217 00:17:01,003 --> 00:17:03,732 going to change. So, in all cases, this is going to take 218 00:17:03,732 --> 00:17:07,038 three cycles. So all of a sudden we transform something 219 00:17:07,038 --> 00:17:12,018 which can take, as I said if, let's say you have a branch with probably four 220 00:17:12,018 --> 00:17:16,745 cycles or five cycles so, an average, let's say, the branch is taking you 50 221 00:17:16,745 --> 00:17:21,003 percent of the time either the direction we'll, we'll split the difference there, 222 00:17:21,003 --> 00:17:24,080 and we'll call it four, four and a half cycles, on average. 223 00:17:24,080 --> 00:17:27,064 We turned it into something which takes three. 224 00:17:27,064 --> 00:17:33,168 Well, that's a great win. So where, where else is this useful? 225 00:17:33,435 --> 00:17:39,184 Cuz you always need to have if and else clauses to make this work out. 226 00:17:39,184 --> 00:17:43,018 So, know, this also works good for small code hammocks. 227 00:17:43,018 --> 00:17:49,509 So I say a small code hammock is just a if which jumps around a small piece of code. 228 00:17:49,509 --> 00:17:55,049 If you can't predict that if very well then it makes sense just to go execute 229 00:17:55,049 --> 00:17:59,184 effectively what's inside that code hammock, that small piece of code, always. 230 00:17:59,184 --> 00:18:03,211 If, if, if for instance, you have two instructions inside of a code hammock and 231 00:18:03,211 --> 00:18:07,440 your mispredict penalty is like ten cycles and you don't know whether the, the branch 232 00:18:07,440 --> 00:18:11,328 is being taken 50 percent of the time either way, all of a sudden your average 233 00:18:11,328 --> 00:18:14,624 mispredict penalty is going to be something like five cycles. 234 00:18:14,624 --> 00:18:18,691 You could have just executed the code in the middle, that would have been faster. 235 00:18:18,691 --> 00:18:21,839 You could have executed the two instructions and then just done a 236 00:18:21,839 --> 00:18:26,988 conditional move at the end and that would have been basically always better. 237 00:18:27,618 --> 00:18:33,098 Okay, so we have some questions here. What if the if then else has lots of 238 00:18:33,098 --> 00:18:38,637 instructions? Hm, so let's say we have if then else here 239 00:18:38,637 --> 00:18:45,896 but there's a thousand instructions here and a thousand instructions there. 240 00:18:45,896 --> 00:18:51,679 Should we predicate this? Let's say their branch mispredict finally, 241 00:18:51,679 --> 00:18:56,529 let's say five, for know, there's ten cycles and it's 50 percent either way. 242 00:18:56,529 --> 00:19:04,791 But there's a thousand instructions in the if the, the then clause and the else 243 00:19:04,791 --> 00:19:09,876 clause. Well, no, that would, that would never 244 00:19:09,876 --> 00:19:13,648 make sense to do. Because, you're basically doubling your 245 00:19:13,648 --> 00:19:19,428 code, and if there's a lot of code there you'd be better served by just doing the 246 00:19:19,428 --> 00:19:22,639 branch. Because you'd be taking something that 247 00:19:22,639 --> 00:19:27,584 was, could be 1000 instructions, and turning it into always 2000 instructions. 248 00:19:27,584 --> 00:19:32,956 Or doubling the number of instructions executing, executing, versus just adding, 249 00:19:32,956 --> 00:19:35,961 let's say, ten. Cycles of branch [inaudible], or an 250 00:19:35,961 --> 00:19:39,701 average [inaudible] cycles of branch [inaudible] 1,000 that you have to 251 00:19:39,701 --> 00:19:42,914 execute. So all the sudden, you're, you're adding a 252 00:19:42,914 --> 00:19:46,072 little overhead. So where, where this predication really 253 00:19:46,072 --> 00:19:50,576 helps is for, for small pieces of code where you don't necessarily know the 254 00:19:50,576 --> 00:19:55,031 branch. Outcome or can't predict the branch 255 00:19:55,031 --> 00:20:00,254 outcome very well. Okay what if the, the code is unbalanced. 256 00:20:00,254 --> 00:20:07,008 So you have an if then and an else. But the then clause and the else clause 257 00:20:07,008 --> 00:20:11,147 are very different in size. So, one is three instructions, and the 258 00:20:11,147 --> 00:20:15,052 other is a thousand instructions. Does this make sense? 259 00:20:15,052 --> 00:20:22,846 Well, it goes back to the same argument we had before with the very large if then 260 00:20:22,846 --> 00:20:28,052 else clauses. If the code is really unbalanced you have 261 00:20:28,052 --> 00:20:35,051 a lot in the then clause and a little bit in the else clause or something like that. 262 00:20:36,092 --> 00:20:42,036 And let's say the branch is taking 50 percent of the time, it might make sense 263 00:20:42,036 --> 00:20:47,095 just to use the an actual branch there and deal from the mispredict penalty cost 264 00:20:47,095 --> 00:20:53,026 verses trying to somehow predicate it. Because in fact, what's going to happen is 265 00:20:53,026 --> 00:20:57,502 let's say you have three instructions and a thousand instructions on the two 266 00:20:57,502 --> 00:21:02,853 different sides, you're going to be adding, you're going to be executing if 267 00:21:02,853 --> 00:21:07,032 you predicate a thousand and three instructions every single execution time 268 00:21:07,032 --> 00:21:12,260 plus maybe some sort of conditional moves at the end, some overhead involved versus, 269 00:21:12,260 --> 00:21:17,261 if you have 50-50, and if you have to add a little bit of branch overhead, penalty, 270 00:21:17,261 --> 00:21:21,773 your 50-50 is gonna 50-50 chance is gonna say, well, I'm taking a thousand plus 271 00:21:21,773 --> 00:21:26,950 three, and we're going to add those two things together, and take the average of 272 00:21:26,950 --> 00:21:30,084 them, basically. So, you know, it's going to be what's the 273 00:21:30,084 --> 00:21:34,454 average of that, it's going to be like 501, or something like about 502 cycles. 274 00:21:34,454 --> 00:21:40,025 That's going to be better than always executing a thousand cycles. 275 00:21:40,025 --> 00:21:46,092 Let's see, anything else we wanted to say here. 276 00:21:51,239 --> 00:21:55,416 You know, one, one, one last thing I wanted to say here before we move off to 277 00:21:55,416 --> 00:21:58,956 this slide. Move zero and move not zero or sometimes 278 00:21:58,956 --> 00:22:03,363 called c moves, conditional move instructions are what are called partial 279 00:22:03,363 --> 00:22:06,614 predication. And we are now going to move on to full 280 00:22:06,614 --> 00:22:11,857 predication where we can actually put conditions on every single instruction in 281 00:22:11,857 --> 00:22:15,477 our register every single instruction in our ISA. 282 00:22:15,477 --> 00:22:19,150 But this is a little bit easier to do than full predication. 283 00:22:19,150 --> 00:22:23,961 This is sort of the first step when people typically implement and it's called 284 00:22:23,961 --> 00:22:27,941 partial predication. And just to, just to hammer this home one 285 00:22:27,941 --> 00:22:33,775 more time, predication is really turning what was control flow into a data flow 286 00:22:33,775 --> 00:22:40,058 operation or data operation. Okay, so let's take a look at full 287 00:22:40,058 --> 00:22:46,077 predication. Full predication, we're actually going to 288 00:22:46,077 --> 00:22:52,679 change the instruction set, such that all of the instructions can be or almost all 289 00:22:52,679 --> 00:22:58,297 the instructions are executed conditionally based on some predi, 290 00:22:58,297 --> 00:23:04,087 predicate. If the predicate's false, the instructions 291 00:23:04,087 --> 00:23:09,428 are not going to do anything. It's not going to have any, any side 292 00:23:09,428 --> 00:23:13,009 effect. So let's look at a, a code sequence here. 293 00:23:13,009 --> 00:23:20,838 So this code sequence we have an if then and else and then some code after, after 294 00:23:20,838 --> 00:23:24,706 we're done. So here we have if. 295 00:23:24,706 --> 00:23:31,614 A little bit different this is not actually in MIPS code, this is something 296 00:23:31,614 --> 00:23:39,359 just sort of serial code here, but it's going to say, this is our if here; if a = 297 00:23:39,359 --> 00:23:46,098 b, then branch to b2, else fall through. Here we're going to execute our else 298 00:23:46,098 --> 00:23:50,015 clause. And when we're done we're gonna branch 299 00:23:50,015 --> 00:23:55,556 over the then clause. So not, not, not anything interesting here 300 00:23:55,556 --> 00:24:00,412 for basic blocks. I don't know if we've introduced this term 301 00:24:00,412 --> 00:24:06,548 yet in, in this class. But I wanted to introduce the term basic 302 00:24:06,548 --> 00:24:10,620 block. Basic block are is a, is a piece of code 303 00:24:10,620 --> 00:24:14,904 which has strictly one exit point, and one entry point. 304 00:24:14,904 --> 00:24:20,091 So you could enter it from other places, you jump to it from other pieces of code. 305 00:24:20,091 --> 00:24:24,533 But once you enter the piece of code you're gonna execute the code, all the way 306 00:24:24,533 --> 00:24:27,387 to the end, and there's one place you can leave. 307 00:24:27,387 --> 00:24:31,375 So you're going to jump to someplace else or branch, and you can, you can fall out 308 00:24:31,375 --> 00:24:34,381 to, to two different places, but the exit point, is only one place. 309 00:24:34,381 --> 00:24:38,399 So, we're, what we're going to is, big differentiation here is, the piece of code 310 00:24:38,399 --> 00:24:42,344 can have for instance, two branches that, that's not called a basic block. 311 00:24:42,344 --> 00:24:46,579 That, that's called a bigger type of block but this is a compiler term. 312 00:24:46,579 --> 00:24:50,666 And it's just good to know. Okay, so let's apply full predication 313 00:24:50,666 --> 00:24:53,050 here. So, these are relatively short, code 314 00:24:53,050 --> 00:24:58,228 hammocks, two instructions here, two instructions there and two instructions in 315 00:24:58,228 --> 00:25:02,830 the else, two instructions in the then clause, this is like a, really great 316 00:25:02,830 --> 00:25:05,035 [inaudible] place something about predicating. 317 00:25:06,038 --> 00:25:10,061 Okay, so lets, let's take a look at the, the predication here. 318 00:25:10,061 --> 00:25:14,981 And I want, before, before we do that, I wanted to just say that the, one of the 319 00:25:14,981 --> 00:25:20,332 big reasons that we are trying to predicate on a VLIW, in particular, is in 320 00:25:20,332 --> 00:25:25,697 a VLIW, we have lots of extra, or we could potentially have extra parallel execution 321 00:25:25,697 --> 00:25:28,726 slots. So, in our instruction sequence or 322 00:25:28,726 --> 00:25:34,678 instruction encoding we have extra places to put code so it may not actually hurt us 323 00:25:34,678 --> 00:25:37,233 to, to predicate. Because the last time when we go to 324 00:25:37,233 --> 00:25:41,017 predicate we're basically going to execute both sides of the if, then and else. 325 00:25:41,017 --> 00:25:45,038 And it's bad if we have to execute both sides of the if, then and else, and they 326 00:25:45,038 --> 00:25:48,064 are big, sequentially. But, if you try to, sort of, slide them up 327 00:25:50,066 --> 00:25:52,069 next to each other, if you have extra no-op slots or extra slots in your VLIW, 328 00:25:52,069 --> 00:25:56,031 you can actually decrease the [inaudible] path through your program. 329 00:25:56,075 --> 00:26:01,112 So let's look at this, we're actually going to do this in this example. 330 00:26:01,112 --> 00:26:06,534 So in this example here, we took this piece of code and added full predication. 331 00:26:06,534 --> 00:26:12,174 So when we say full predication, we added some extra things to these instructions. 332 00:26:12,174 --> 00:26:16,669 They look a little strange at first. So here are instruction three, instruction 333 00:26:16,669 --> 00:26:20,834 four, instruction five, instruction six. These are same ones that were back over 334 00:26:20,834 --> 00:26:24,762 here. But instead, we added some extra words in 335 00:26:24,762 --> 00:26:30,812 front of them here or these, these parentheses with p's in front of it. 336 00:26:30,812 --> 00:26:33,995 What that is, is these are predicate registers. 337 00:26:33,995 --> 00:26:40,965 So we're going to add a second register file into our processor where we can store 338 00:26:40,965 --> 00:26:46,510 effectively one bid values, sort of true or false values, that are going to 339 00:26:46,510 --> 00:26:51,962 determine whether this instruction executes or doesn't execute. 340 00:26:51,962 --> 00:26:57,218 Excuse me. So, what this means is if p1 is one, 341 00:26:57,218 --> 00:27:04,099 execute this. If p1 is zero, nullify this instruction 342 00:27:04,099 --> 00:27:10,068 three here. And we have, we have a double pipe here. 343 00:27:10,068 --> 00:27:14,090 What am I trying to show with this? What I'm trying to show this is parallel. 344 00:27:14,090 --> 00:27:19,030 This and this are executing at the same time, because we have a very long 345 00:27:19,030 --> 00:27:23,063 instruction where processor can execute multiple things at the same time. 346 00:27:23,063 --> 00:27:28,057 And what we've done here is actually slid this up next to the, the then up next to 347 00:27:28,057 --> 00:27:31,065 the else. And we are gonna execute them at the same 348 00:27:31,065 --> 00:27:33,075 time. Dependent on these predicates. 349 00:27:33,075 --> 00:27:38,039 So this is the else block, and this is the then block, and we're gonna execute 350 00:27:38,039 --> 00:27:41,055 concurrently. Okay. 351 00:27:44,228 --> 00:27:50,023 Sort of, stuff before the if, stuff after the, if then else is all done, that's sort 352 00:27:50,023 --> 00:27:53,239 of from that, that four basic block example. 353 00:27:53,239 --> 00:27:57,451 What's this instruction? This instruction has very interesting 354 00:27:57,451 --> 00:28:01,343 semantics. And this is something that you, probably 355 00:28:01,343 --> 00:28:06,305 have to add if you're going to try have full predication. 356 00:28:06,305 --> 00:28:10,504 Hm, okay so let's compare. This is our comparison. 357 00:28:10,504 --> 00:28:13,509 If a = b and then it writes into two outputs. 358 00:28:13,509 --> 00:28:16,559 Ooh, that looks broken, that looks wrong somehow. 359 00:28:16,559 --> 00:28:19,834 How can we have one instruction write to two things? 360 00:28:19,834 --> 00:28:24,402 And what is the semantics of it? What does it write to here and here? 361 00:28:24,402 --> 00:28:29,647 Well, what this is actually doing is this is writing the positive outcome of this 362 00:28:29,647 --> 00:28:35,090 compare to the one and the inverse, or the, the negation of that to the other. 363 00:28:35,090 --> 00:28:40,381 And this is useful, cuz then we can go use these predicates down here. 364 00:28:40,381 --> 00:28:45,537 We can use the, the positive one and, let's say, the negative one, and that can 365 00:28:45,724 --> 00:28:50,594 drive whether we execute this, sort of, else clause or the, then clause. 366 00:28:50,594 --> 00:28:56,689 So lots of instruction sets that have full predication have either something like 367 00:28:56,689 --> 00:28:59,357 this. This is one option, you can actually have, 368 00:28:59,357 --> 00:29:03,006 sort of, two outputs, and you load to predicate registers. 369 00:29:03,006 --> 00:29:07,085 The other option is, when you go to en, actually encode the instructions, you have 370 00:29:07,085 --> 00:29:11,011 a sort of a not bit. So it denotes the predicate register. 371 00:29:11,011 --> 00:29:15,084 But when we denote the predicate register, we also denote whether we take the 372 00:29:15,084 --> 00:29:20,051 predicate register being true as, or, or the predicate register being false as what 373 00:29:20,051 --> 00:29:24,066 drives one of the instruction executes in our full predication scheme. 374 00:29:25,055 --> 00:29:31,221 So in effect this is computing p and p bar at the same time and running of two 375 00:29:31,221 --> 00:29:36,051 different registers, and then we can use those seperately down here. 376 00:29:37,057 --> 00:29:48,978 And interesting result here in is going to 95, Scot Malky showed that you can, on 377 00:29:48,978 --> 00:29:56,017 average let's say remove 50 percent of your branches by using full predication. 378 00:29:56,017 --> 00:29:58,090 Now, full predication is not easy to build. 379 00:29:59,009 --> 00:30:03,006 Very few instruction sets have actually had full predication. 380 00:30:03,200 --> 00:30:08,073 Itanium was probably the closest to a real processor that was built that has almost 381 00:30:08,073 --> 00:30:12,037 full predication. It doesn't have quite full predication. 382 00:30:12,225 --> 00:30:16,093 The HP PlayDoh instruction set, I think, actually had full predication. 383 00:30:16,093 --> 00:30:21,081 I think this is what Scott Malkey was working with, or a derivative of that. 384 00:30:22,001 --> 00:30:25,715 So you can actually have, sort of just some people who've thought about they've 385 00:30:25,715 --> 00:30:28,052 had full predication but not all of things are easy. 386 00:30:28,052 --> 00:30:31,088 So first thing you need to add an extra register file for the predicates, you need 387 00:30:31,088 --> 00:30:34,001 to bypass those predicates, you need to compute them. 388 00:30:34,001 --> 00:30:52,008 There's some overheads involved and we're going talk a little bit more about them.