So this is where some people got the idea that maybe there is a better way. And I wanted to point out that this is only the tip of the iceberg of what we'll call better ways of having better instruction sequences, or better encoding standards between compilers and the hardware. This is actually an open, sort of, research topic now. Very long instruction word processors, or VLIW for short, is one take on it. There's been a fair amount of work done after this, which we're not going to talk about in this class, sort of in the last five to ten years, that has looked at this in a little bit more detail. Especially given multi-cores: can you schedule across multiple cores? There was a project out of the University of Texas at Austin which tried to schedule across something that sort of looked like cores, but wasn't quite cores. That was what we'll call a super VLIW, but it had some dynamic aspects and some static aspects. But for right now, let's talk about very long instruction word processors.

Okay, where does the name come from? Let's start there. Well, these were things that were actually originally called long instruction word processors. The name kind of fell out of favor. At one point, people made a differentiation between long instruction word processors and very long instruction word processors, and it was based on how many instructions were packed together. The differentiation has largely fallen out of favor now, and people mostly call all of these things VLIWs, or very long instruction words, because it's harder to say what is long and what is very. It's kind of just an extra term. It's also kind of like people talking about Large Scale Integration versus Very Large Scale Integration versus Ultra Large Scale Integration, where people just sort of keep tacking on extra letters in the front.

But let's talk about VLIW instruction sequences, and what one of these things looks like. Well, a VLIW instruction will actually have multiple operations within one bundle. So typically this is what you call a bundle, or an instruction, which has multiple operations inside of it. So in this example here, we have six operations that can be executed in this one instruction, or in this one bundle. And typically they're in a sort of fixed format. So let's say you can only execute two integer operations, two memory operations, and two floating point operations per cycle, and that's what you're allowed to encode.
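As a concrete picture of that fixed format, here is a minimal sketch in Python. The slot layout and the three-operand tuple shape are made up for illustration; this is not the encoding of any real VLIW machine.

```python
# Hypothetical fixed-format bundle: slots are positional, so the hardware
# knows which functional unit each operation feeds without any dynamic
# dependency checking. A slot holding None is simply a nop.
from dataclasses import dataclass
from typing import Optional, Tuple

Op = Optional[Tuple[str, str, str, str]]  # (opcode, dest, src1, src2)

@dataclass
class Bundle:
    int_ops: Tuple[Op, Op]   # two integer ALU slots
    mem_ops: Tuple[Op, Op]   # two memory (load/store) slots
    fp_ops:  Tuple[Op, Op]   # two floating-point slots

# One instruction = one bundle = up to six independent operations:
b = Bundle(int_ops=(("add", "r4", "r2", "r3"), None),
           mem_ops=(("load", "r5", "r6", "0"), None),
           fp_ops=(("fmul", "f1", "f2", "f3"), None))
```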
So instead of having, on disk, a sequential sequence of instructions, we still have a sequence of instructions, or a sequential sequence of bundles, but each one of those instructions, each one of those bundles, will actually encode multiple independent operations. So let's look at an example code sequence of this. So for instance, you could have something that looks like this. [The instructor writes an example on the board: a first bundle containing a multiply and an add, followed by a second bundle with a single operation.] So what's interesting about this is we can see that there are actually two operations in this first instruction, in this first bundle. The second one only has one operation. And this multiply and this add will actually execute, semantically at least they can execute, in parallel.

Now what's interesting, because you look at this, is I had purposely written this to show what looks to be a read after write, or write after read, or some sort of dependence between these two registers and these two instructions here, this mul and this add. But that's not what's actually going on here. In a very long instruction word processor, within one instruction, within one bundle, these sorts of dependencies are ignored. So just because this reads R3 and this writes R3, they're not dependent on each other. The subsequent instruction is dependent on R3, let's say; if this operand was R3, then that would actually pick up the result there. But within one bundle, it doesn't actually matter. So the semantics of the instruction set are that everything within one instruction, everything within one bundle, is parallel with each other, and there's no dependency checking. What's nice about this is we just took all that hardware that we built, all that instruction checking, all the dependency checking, all the scoreboarding, and we threw it out the window. We don't need that hardware anymore in this instruction set, or in this architecture. So that's pretty cool. We actually took out a bunch of hardware that we didn't need, and we basically let the compiler do that checking for us.
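Here is a minimal sketch of those intra-bundle semantics, assuming a made-up three-operand format; the point is just that every operation in a bundle reads the register state as it was before the bundle.

```python
# Minimal model of VLIW intra-bundle semantics: every operation in a
# bundle reads the register state as it was *before* the bundle issued,
# so writing r3 and reading r3 in the same bundle is not a dependence.
def execute_bundle(regs, bundle):
    """regs: dict name -> value; bundle: list of (op, dest, src1, src2)."""
    old = dict(regs)                 # all reads see the pre-bundle state
    for op, dest, a, b in bundle:
        if op == "mul":
            regs[dest] = old[a] * old[b]
        elif op == "add":
            regs[dest] = old[a] + old[b]
    return regs

regs = {"r1": 2, "r2": 3, "r3": 10, "r4": 100}
# One bundle: { mul r3, r1, r2 ; add r4, r3, r4 } -- not a RAW hazard
execute_bundle(regs, [("mul", "r3", "r1", "r2"),
                      ("add", "r4", "r3", "r4")])
print(regs["r3"], regs["r4"])        # 6 110: the add saw the OLD r3 (10)
```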
Now there's a question, since this multiply takes multiple cycles, of whether this instruction here picks up the result of R3 — sorry, I should be drawing the other way — or, let's say the add had longer latency, whether it doesn't pick that up. We'll talk about that in a minute; there are sort of two different choices there in VLIW designs. Well, let's look at our slide here. Typically, in sort of traditional VLIWs, each operation has a certain amount of latency, and that's a guarantee. Unfortunately, because of this, the architecture of the machine is very tied to the compiler. The compiler needs to know how long each operation takes. So that's sort of the downside.

And in a typical VLIW, there are no data interlocks, so we don't even have a scoreboard. Now, there are some machines which do enforce interlocking that are VLIWs, but in the traditional, most basic VLIW, you don't have a scoreboard; you have no interlocking. So let's say you had this subtract operation, which read register one, which the mul wrote, and the mul — say it's a floating point multiply — took four cycles. In the most basic design, this subtract here would actually get the old value of R1: not the new value, but the original value of R1. But we'll talk about that in more detail in a second; there's a sort of choice there in VLIW designs.

But yes, so we reduced our hardware. We don't have a register renamer, we don't have an issue window, we don't have a reorder buffer, we don't have a scoreboard, and we let the compiler do a lot of the work. Downsides to this: we're not able to react to dynamicism very well, or dynamic problems — so cache misses, branch mispredicts, things like that. Because we're not going out of order, because we don't have all that extra hardware in there, we can't schedule around those problems. So that's a downside to these architectures. Now, people have thought really hard about how to make VLIWs have some of the benefits of superscalars, and out-of-orderness, and out-of-order superscalars.
So at the end of lecture today, and probably in the next lecture, we'll talk about some of the techniques that people have added back into VLIWs that bring us somewhere in between an out-of-order processor and a VLIW processor, and get some of the benefits of both.

Okay, two models. This goes back to: when you have an instruction which writes to a register, and the latency of that instruction is coded to be longer than one, which value do you pick up? Do you pick up the old value, or do you pick up the new value, if you have an instruction which is effectively in the shadow of the other instruction? So the first VLIW model — and this is a sort of classical naming scheme; I did not come up with this — is called the equals scheduling model. In the equals scheduling model you have an instruction, and the latency of the instruction is specified; the compiler knows it. And if you have an instruction which tries to read a value that gets written, before the first instruction actually does the write, it'll get the old value.

So let's go through an example over here. So we're going to have our multiply again. [The instructor writes an example on the board.] Okay, so here we have a mul and an add which are bundled together, so they're going to execute concurrently. And we have an instruction — or an operation, I should say — in the second instruction, the second bundle here. We're using brackets and semicolons to delineate operations: the brackets delineate an entire instruction, or an entire bundle, and the semicolon is just there to delineate between two operations within one instruction. And here we have an and. We have what looks to be a read after write dependence here — [draws the dependence] — something like that. And let's say our pipeline looks like this: we have X0, which does ALU ops; we have Y0, Y1, Y2, Y3; and then we have, let's say, a two-stage memory pipeline. And somewhere over here we have writeback. So this looks similar to pipes we've looked at before. Let's say multiplies go down this four-stage Y pipe, very similar to things we've looked at before, loads and stores go into the memory pipe, and ALU operations go into the X pipe.
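To make the timing concrete, here is a tiny sketch of that pipeline's latencies. The stage names and cycle counts are just the assumptions from the board, not any particular machine.

```python
# Assumed operation latencies for the pipeline on the board:
#   X pipe (ALU ops)      : 1 cycle  (X0)
#   Y pipe (multiplies)   : 4 cycles (Y0, Y1, Y2, Y3)
#   M pipe (loads/stores) : 2 cycles
LATENCY = {"add": 1, "and": 1, "sub": 1, "mul": 4, "load": 2, "store": 2}

def result_ready(issue_cycle, opcode):
    """Cycle at which the destination register's new value is visible."""
    return issue_cycle + LATENCY[opcode]

# The mul issues with bundle 0, the and with bundle 1:
print(result_ready(0, "mul"))  # 4 -- so at cycle 1 the and is in its shadow
```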
But now comes the question: should this and get the result of the multiply, if it's scheduled one cycle after it, or should it get the old value of R1, the previous value of R1? So we're going to define, in the equals scheduling model, that the multiply's value for R1 is not ready until the end of Y3. So in the equals model, with this and, the compiler is not trying to express a read after write dependence. That dependence does not actually exist; the and gets the old value of R1. And the compiler knew about this, and everyone's okay with this. Now, if this and was, let's say, three more cycles later, it would actually get that new value, and it would be a read after write dependence. So in the equals model, we're just saying that the operation takes effect exactly at the specified latency, and never earlier.

Some positives to this: you get some pretty cool register usage. If you think about it here, register one was still live after this multiply. So effectively this gives us a little bit less register pressure; we can have a little bit more values in flight without having more physical registers, or without having more architectural registers at all. We can basically have more registers, because a register doesn't go dead when you overwrite it — it goes dead when this multiply takes effect. So with this, we don't need any register renaming. But the compiler really depends on the new register values not becoming visible early.

Unfortunately, this causes some problems. These sorts of architectures — and this is actually what the first formulation of very long instruction word processors looked like; they were these equals architectures — have a major problem that comes around if you have things that are unpredictable mixed in with this very predictable code sequence. So let's say you take an interrupt. [Writes on the board.] Let's say we just put some unimportant instruction here, some subtract operation, and the subtract takes an interrupt. Now, semantically, this multiply and this add are complete, because the interrupt doesn't happen until the instruction after them — it doesn't happen until this subtract operation. Hm. Okay. Now you've fallen into the interrupt handler for the subtract. What happens when the and goes to execute? Does it actually pick up the correct value of R1 here? No. It picks up the new value of R1, when it was supposed to pick up the old value of R1.
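Here is a small sketch of that failure. The draining mechanism is an assumption about the hardware (the handler needs a clean architectural state), and all the names and values are made up.

```python
# Sketch of the EQ interrupt problem. To give the handler a precise
# architectural state, assume the hardware drains every in-flight write
# before vectoring -- which destroys the "shadow" the compiler relied on.
def take_interrupt(regs, pending):
    for reg, val, _ready_cycle in pending:   # drain in-flight writes
        regs[reg] = val
    pending.clear()

regs = {"r1": 10}                # old r1, which the and was compiled to read
pending = [("r1", 6, 4)]         # mul's write of r1 lands at cycle 4 (EQ)

take_interrupt(regs, pending)    # the sub faults; the handler runs

# Back from the handler, the and executes and reads r1:
print(regs["r1"])                # 6 -- the NEW value, violating EQ semantics
```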
So that's traditionally a problem with EQ, or equals scheduling model, architectures. People have solved some of these problems. Or sometimes, when people build these processors, they don't have interrupts — so some of these VLIW equals processors were not able to handle interrupts at all. So it's something to think about. That's the first case.

There's also a little more forgiving case. We'll call it the less than or equals model, or LEQ scheduling model. So here, in this model, a register is allowed to take on its new value any time between when the instruction issues and its specified latency. So what this means is the compiler still can't schedule an instruction early — we can't go and try to read the value early — but you're guaranteed not to have a problem when there's an interrupt if, let's say, you come back and you've filled in the right value. So the compiler still schedules around this and knows not to schedule something too early. You still don't have to implement interlocks, you still don't need a scoreboard, but you can now have precise interrupts. Some other positive things pop out of this. You end up with binary compatibility preserved when the latencies are reduced. So let's say you make a faster processor, where your multiply, instead of taking four cycles, only takes three. That's a positive here: you may not get more performance from it, but at least you won't get incorrect execution. So that's a positive outcome here.
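As a sketch of that compatibility argument — with made-up values, and modeling LEQ's "shadow" reads simply as illegal to schedule:

```python
# Why LEQ preserves binary compatibility when a latency shrinks.
# A write issued at cycle 0 lands at cycle `latency`; a read at
# `read_cycle` observes:
def read_r1(latency, model, read_cycle, old=10, new=6):
    if read_cycle >= latency:
        return new                 # write has landed under either model
    if model == "EQ":
        return old                 # EQ guarantees the old value
    raise ValueError("LEQ: compiler may not schedule a read in the shadow")

# Code compiled for a 4-cycle multiply reads the result at cycle 4:
print(read_r1(4, "LEQ", 4))   # 6
print(read_r1(3, "LEQ", 4))   # 6 -- still correct on a faster machine
# EQ code that legally relied on the OLD value at cycle 3 breaks:
print(read_r1(4, "EQ", 3))    # 10 on the old machine
print(read_r1(3, "EQ", 3))    # 6 on the new machine -- silently different
```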
Okay, so a little bit of history. Usually I try not to harp on history too much in this course, even though I really enjoy it. But with VLIW processors, I wanted to make one point: a lot of this research is relatively recent. So if you go look at the dates on this, the first real VLIW processors were done in, like, the late 80s. So this is not going back to the 60s. This is a portion of computer architecture work which is actually very, very recent.

The first long instruction word processor was actually a Floating Point Systems — FPS, that's what it stands for — processor. And this was actually a co-processor to VAX machines: something that could speed up your floating point on a VAX machine. This was very much the most basic VLIW processor. There was no interlocking, it didn't take interrupts, and it was really sort of for hand-coded vector arithmetic and floating point math.

Probably when people talk about VLIW, the thing that pops into their head first is actually the Multiflow Trace processor, which was made by a small startup company called Multiflow. This was an outgrowth, actually, of a bunch of research that was done at Yale by Josh Fisher and a bunch of his students. And I won't go into too much detail here, but one of the interesting things is they really, really did have a very long instruction word here: 1,024-bit-long instructions. So this is a beefy instruction. It can have anywhere from seven, to fourteen, to 28 operations per instruction. And this was not dynamic — what this actually was, is this is how they made different configurations of their machine. So they had wider machines that were more expensive, and narrower machines that were cheaper. So it's sort of a family of processors, and they customized the compiler to this. Josh Fisher actually is much more of a compiler guy, probably, than an architect, by training, and you can sort of see that in his group's work and in the PhDs that came out of it. He now works for HP, HP Labs, and is sort of semi-retired.

At the same time, actually, there was also another company that was commercializing a very, very similar idea. This was the Cydrome Cydra 5. This was Bob Rau, who's another very famous computer architect. He was a professor at the University of Illinois, and he developed a lot of these things and then sort of left and started Cydrome. And one of the interesting things in that processor is they had this cool thing where, instead of having a register renamer, they had a register file where the naming of the registers sort of changed as you did function calls. We'll talk more about that later today, or maybe tomorrow, or maybe next lecture. But moreover, what we really want to get out of here is that this is very recent.