1 00:00:03,056 --> 00:00:09,046 Okay, so now we're going to sort of go through different problems with VLIWs and 2 00:00:09,046 --> 00:00:13,037 different solutions to that problem, those problems. 3 00:00:13,037 --> 00:00:20,975 So the top one on this list is a problem of hard to predict branches, and how that 4 00:00:20,975 --> 00:00:30,000 can limit instruction level parallelism. So, you just remove the branch. 5 00:00:30,057 --> 00:00:36,011 And we're going to call that predication. So we're actually going to add 6 00:00:36,011 --> 00:00:41,012 instructions to the hardware. Which we're actually going to add two 7 00:00:41,012 --> 00:00:45,860 instructions here so this is, this is, this is limited predication, where we add 8 00:00:45,860 --> 00:00:50,029 two very simple instructions. And if you look at these instructions, 9 00:00:50,029 --> 00:00:54,057 they're very similar to the question mark, colon or select operator in C. 10 00:00:55,015 --> 00:01:07,007 So what does, what does that operator do? We have a equals I don't know. 11 00:01:08,034 --> 00:01:25,447 C?d: e; What does this do? Well, it loads a. 12 00:01:25,447 --> 00:01:34,894 If, if, if c is true, it loads a with d, if c is false, it loads a with e. 13 00:01:34,894 --> 00:01:42,844 Well, you can think about actually doing this with, some sort of If-then-else, 14 00:01:42,844 --> 00:01:46,909 piece a code. Which is pretty common. 15 00:01:46,909 --> 00:01:51,313 If a < b So you can sort of put, that here. 16 00:01:51,313 --> 00:01:57,255 X gets a versus x getting b. That's our select operator. 17 00:01:57,255 --> 00:02:05,506 Well we add two, two special instructions here for our limited predication. 18 00:02:05,506 --> 00:02:12,293 Move if zero and move if not zero. Well, what does this do? 19 00:02:12,293 --> 00:02:20,610 Well if this operand is equal to zero, then this rd gets rs, else that's all it 20 00:02:20,610 --> 00:02:24,299 does. That's all that instruction does. 21 00:02:24,299 --> 00:02:30,890 And the flip one here is, it checks if it's not equal to zero. 22 00:02:30,890 --> 00:02:35,389 Why is this cool? Well, this allows us to transform control 23 00:02:35,389 --> 00:02:39,407 flow into a data instruction. So, we've taken a branch out, so if we 24 00:02:39,407 --> 00:02:44,604 look at this piece of code, if we're doing it with branches set less than, we do a 25 00:02:44,604 --> 00:02:47,448 branch. So, this, this computes our condition 26 00:02:47,448 --> 00:02:52,271 co-flag here, branch equals and if it's, the one way a branch is here, if not, it 27 00:02:52,271 --> 00:02:55,578 jumps over it. So, we have a bunch of control flow here. 28 00:02:55,578 --> 00:03:00,034 We have two control flow operations, the branch and the jump. 29 00:03:00,034 --> 00:03:09,616 When we add these instructions, we can basically do that if then else in, in an 30 00:03:09,616 --> 00:03:13,415 instruction. And, basically every VILW processor you're 31 00:03:13,415 --> 00:03:17,178 going to look at is going to have predication, or at least limited 32 00:03:17,178 --> 00:03:20,021 predication. This is, this is not full predication, 33 00:03:20,021 --> 00:03:24,012 this is limited predication. We'll talk about full predication in a 34 00:03:24,012 --> 00:03:28,024 second. Okay so that, let's, let's think about 35 00:03:28,024 --> 00:03:31,003 that for a second. We just took control flow. 36 00:03:31,003 --> 00:03:35,072 We turned it into something which is never going to take a branch mispredict. 37 00:03:35,072 --> 00:03:40,047 That sounds pretty cool cuz branch mispredicts, you know, were pretty, pretty 38 00:03:40,047 --> 00:03:43,020 bad. If we had a branch which was harder to 39 00:03:43,020 --> 00:03:47,076 predict, we didn't know with high probability if a was greater than b or 40 00:03:47,076 --> 00:03:50,041 not. We can just sort of stick this code 41 00:03:50,041 --> 00:03:55,043 sequence in here and just be done with it. And why it's really important for very 42 00:03:55,043 --> 00:03:59,021 long instructional processors, is because whenever you take a branch and mis 43 00:03:59,021 --> 00:04:02,056 predict, you basically have a bunch of dead instructions. 44 00:04:02,056 --> 00:04:05,045 And you can't schedule something in, in that point. 45 00:04:05,045 --> 00:04:09,836 But the, a, an Out-of-Order Superscalar can attempt to sort of schedule things in 46 00:04:09,836 --> 00:04:13,020 there. We can try to schedule non-dependent 47 00:04:13,020 --> 00:04:16,027 operations. But our compiler has to come up with some 48 00:04:16,027 --> 00:04:19,092 code sequence, and has to make them parallel at compile time. 49 00:04:20,066 --> 00:04:23,014 Okay. So a few, few questions here. 50 00:04:23,014 --> 00:04:27,087 What happens, if then, if the if then else has many instructions? 51 00:04:28,080 --> 00:04:32,078 This was a very simple case here. We just sort of had one thing inside of 52 00:04:32,078 --> 00:04:37,006 each of these if-then-elses. It's not the end of the world. 53 00:04:37,039 --> 00:04:43,016 What you can, can do, and typically what people do with partial predication, which 54 00:04:43,016 --> 00:04:49,022 is what this gives us, is they'll actually execute both sides of the if statement. 55 00:04:50,098 --> 00:04:57,031 Inter leave them in your VILW somehow, and then choose the result at the end with a 56 00:04:57,031 --> 00:05:03,033 predication or a move z instruction. Or, these are typically called conditional 57 00:05:03,033 --> 00:05:06,088 moves. If you'll look in something like x86 I 58 00:05:06,088 --> 00:05:13,006 think these are actually called c moves. If you go look in mips, it's called move 59 00:05:13,006 --> 00:05:18,007 z, but people started naming these things slightly differently. 60 00:05:19,030 --> 00:05:23,040 So, it's not, not the end of the world, but when you go to do that, you're 61 00:05:23,040 --> 00:05:27,075 actually going to execute extra instructions that you may not have to have 62 00:05:27,075 --> 00:05:29,060 executed. That's, that's a bummer. 63 00:05:29,259 --> 00:05:34,075 Because you could very easily, if, if, there's a lot of code in here and a lot of 64 00:05:34,075 --> 00:05:38,156 code in here and you're executing both code sequences, you're basically doing 65 00:05:38,156 --> 00:05:41,047 twice as much work. And if, it grows large, you're doing lots 66 00:05:41,047 --> 00:05:45,055 of extra work and you may not have enough open slots to sort of fulfill that. 67 00:05:45,055 --> 00:05:48,075 At that point, you have the choice, you actually put a branch in. 68 00:05:50,022 --> 00:05:54,211 If it's unbalanced also not the end of the world, it's probably actually a little bit 69 00:05:54,211 --> 00:05:57,078 easier, you're probably going to have to execute twice as much code. 70 00:05:57,273 --> 00:06:01,002 At some point though you may want to actually super unbalance. 71 00:06:01,002 --> 00:06:04,061 Like a thousand instructions on the one side of the branch, and like two 72 00:06:04,061 --> 00:06:06,085 instructions on the other side of the branch. 73 00:06:06,085 --> 00:06:10,043 You may just want to put a branch, an actual branch there, and not try to 74 00:06:10,043 --> 00:06:13,007 predicate it. Because if you took the, side which only 75 00:06:13,007 --> 00:06:16,071 has two instruction, or the, the two instruction case, well all of a sudden, 76 00:06:16,071 --> 00:06:20,089 you've bloated that by an extra thousand instructions, kind of in the common case 77 00:06:20,089 --> 00:06:26,010 and that's not very good. So that's, that's partial predication. 78 00:06:26,280 --> 00:06:32,070 Let's talk about full predication, which is kinda the extension of that. 79 00:06:32,070 --> 00:06:39,039 Instead of just adding, I mean, a simple instruction which moves data values 80 00:06:39,039 --> 00:06:43,068 dependent on another value it being zero or not. 81 00:06:44,036 --> 00:06:50,024 Let's say every single instruction in our instruction sequence except for maybe 82 00:06:50,024 --> 00:06:56,027 let's say branches or something like that can be nullified based on a register. 83 00:06:57,010 --> 00:07:02,016 What does this look like? Well, here we have some, little bit more 84 00:07:02,016 --> 00:07:06,027 complicated piece of code. We have four basic blocks. 85 00:07:06,098 --> 00:07:11,072 It's roughly an if. Then, no see. 86 00:07:11,072 --> 00:07:16,081 If else then and then sort of some code at the end. 87 00:07:17,291 --> 00:07:21,001 And let's see how this works with predication. 88 00:07:23,007 --> 00:07:27,029 Well, what you can do is first of all, you have to somehow set the predicate 89 00:07:27,029 --> 00:07:29,099 registers. So typically, these architectures have 90 00:07:29,099 --> 00:07:32,081 extra registers, which we call predicate registers. 91 00:07:32,081 --> 00:07:37,036 The predicate registers get loaded with some values sort of early, and then, let's 92 00:07:37,036 --> 00:07:40,085 say, this instruction and this instruction execute in parallel. 93 00:07:41,149 --> 00:07:45,047 Different notation here, let's say there's a semicolon here and there's sort of 94 00:07:45,047 --> 00:07:49,018 brackets around that. And, this, in front of the instruction 95 00:07:49,018 --> 00:07:53,051 here, in parentheses, we have a predicate register, and which says whether this 96 00:07:53,051 --> 00:07:57,006 instruction was supposed to execute or not supposed to execute. 97 00:07:58,089 --> 00:08:03,050 Now we can do even more complex things, than our partial prodication. 98 00:08:03,050 --> 00:08:09,019 Instead, now you can basically execute everything and not have to do any moves at 99 00:08:09,019 --> 00:08:12,050 the end. You don't do any bookkeeping and you can 100 00:08:12,050 --> 00:08:17,051 only, you can execute just the side of the branch that you need to execute. 101 00:08:19,292 --> 00:08:25,175 Scott Melkey and Insco 95 showed that if you do this, and you sort of have a fan-, 102 00:08:25,175 --> 00:08:28,542 fancy enough compiler. He was working at UAUC on the impact 103 00:08:28,542 --> 00:08:31,136 compiler. You can remove, let's say, 50 percent of 104 00:08:31,136 --> 00:08:34,507 your branches. A lot of these branches are short little 105 00:08:34,507 --> 00:08:38,686 branches in your programs. In a full predication, you can do some 106 00:08:38,686 --> 00:08:42,197 pretty fancy stuff. This showed up in the Plato compiler, 107 00:08:42,197 --> 00:08:46,092 which is a HP or, Plato architecture by HP, and the, sort of, compiler for that. 108 00:08:46,092 --> 00:08:50,080 Which was the, [unknown] project at UAUC, the impact compiler. 109 00:08:50,080 --> 00:08:55,034 So you can sort of see that, you know, you can get a lot of benefit from this. 110 00:08:58,274 --> 00:09:03,551 So, we're going to, I'm going to stop here today but I just want to do a, briefly 111 00:09:03,551 --> 00:09:10,554 wrap up and say, we start talking about how to deal with dynamic events and how to 112 00:09:10,554 --> 00:09:15,312 get a lot of the advantages of speculative execution from out of order super-scalers 113 00:09:15,312 --> 00:09:19,741 but in a statically scheduled regime. And we're going to talk more about how to 114 00:09:19,741 --> 00:09:24,654 do some this code motion, how to move instructions across branches, how to move 115 00:09:24,654 --> 00:09:27,674 memory operations across other memory operations. 116 00:09:27,674 --> 00:09:32,740 And then we're going to talk about how to deal with some dynamic events, which are 117 00:09:32,740 --> 00:09:38,467 hard to deal with in a statically scheduled environment in the next lecture. 118 00:09:38,467 --> 00:09:41,017 Okay, we'll stop here for today.