So we're going to continue today talking about VLIW, or Very Long Instruction Word, processors. This is an alternative way to exploit parallelism in a processor: instead of doing the scheduling dynamically, as we've been discussing for out-of-order processors and superscalars, we're looking at ways to do it statically, in the compiler, and at architectures that can exploit that. But let's now start talking about Very Long Instruction Word processors. We've been talking about superscalar processors, and superscalar processors have a fair amount of complexity in them, especially out-of-order processors. Where does this complexity come from, and can we remove all of it? That's what we're going to be talking about today.

Here I've drawn a picture where I have the issue width of a superscalar processor, which we represent as W, and the lifetime of an instruction. Together those give you a matrix of things in flight in the processor. Now, these may be sitting in your instruction queue.
We're not going to count instructions that, let's say, get all the way to the reorder buffer, because by that time, depending on how you build your processor, you probably don't need to keep checking them against the other instructions coming in; you've already broadcast that result back to your reservation stations or your instruction queue to issue further instructions. But think about the complexity of what you're really doing when you try to issue an instruction: you take these in-flight instructions and, as each one completes, you have to look it up against basically all of the other instructions in your instruction window, across all the registers.

Now, we built some structures so that you don't have to do a complete all-to-all check. The first example is the scoreboard. Instead of keeping track of where everything is, as in our original design, where the destination registers flowed down the pipe and we did an all-to-all compare, we just keep track of the location and latency of the most recent write to each register.
That turns an all-to-all compare into an indexed lookup based on the register number. So that's a little bit nicer, but if you think about what's happening in the issue queue, as an instruction retires it still has to broadcast the registers it completes against the instruction queue, trying to wake up instructions and see which ones are ready to go. And, as I said, that roughly scales like this, because you're going to be retiring multiple instructions per cycle, and if you have a bunch of entries in the instruction queue, you have to check all of them. This can get painful.

Let's first take a look at the complexity of this for an in-order machine. An in-order machine is not so bad: L is the pipeline length, you have your width, but you basically just need a scoreboard to figure this out, and as we said, that's an index-based structure. Now, when you start to go out of order, you start to have the complexity that all the results coming back have to be broadcast, either to all the reservation stations, if you distribute instructions that way, or, if you have a single instruction queue, against all the instructions that could possibly issue.
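To make the scoreboard point concrete, here's a minimal Python sketch. Everything in it is illustrative (the register count, class and method names, and latencies are not from any real machine); the idea is just that recording the most recent write per register turns the readiness check into one indexed lookup instead of a compare against every in-flight instruction.

```python
# Scoreboard sketch: one entry per architectural register, holding the
# cycle at which the most recent write's result becomes available.
# All names and numbers here are illustrative, not from a real design.

NUM_REGS = 32

class Scoreboard:
    def __init__(self):
        # None means no outstanding write: the register file is current.
        self.ready_cycle = [None] * NUM_REGS

    def issue_write(self, dest_reg, current_cycle, latency):
        # Track only the most recent write to each register.
        self.ready_cycle[dest_reg] = current_cycle + latency

    def source_ready(self, src_reg, current_cycle):
        # One indexed lookup replaces comparing src_reg against every
        # in-flight destination (the all-to-all check).
        rc = self.ready_cycle[src_reg]
        return rc is None or rc <= current_cycle

sb = Scoreboard()
sb.issue_write(5, current_cycle=0, latency=3)  # result ready at cycle 3
print(sb.source_ready(5, 1))   # False: still in flight
print(sb.source_ready(5, 3))   # True: write has completed
print(sb.source_ready(7, 1))   # True: no outstanding write to r7
```

The check is O(1) per source register, which is what makes this structure cheap even as the machine gets wider.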
And what's interesting is that one of the big challenges of building these structures is not quite what you would think. There's an interesting challenge in that too many things can wake up in one cycle. Let's say all of the instructions in your instruction queue are waiting for register five to become valid. All of a sudden some add writes register five and gets to the end of the pipe, entering the reorder buffer. Well, now you have to wake up all of those instructions and choose some subset, up to the width of your machine, that can issue. Also, you still need to make sure you can satisfy all the machine's requirements on the functional-unit mix. If you have, say, two adders, two multipliers, and one load/store unit, you can't just pick randomly; you need to pick exactly that mix. So this becomes a big mess of combinational logic, and it's why the front ends of out-of-order processors can get longer and longer.
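The wakeup and select steps can be sketched the same way. This is a toy model under assumed parameters (a four-entry queue, a functional-unit mix of two ALU slots and one memory slot per cycle, oldest-first picking as a common heuristic); no real machine's numbers are implied.

```python
# Wakeup/select sketch: a completing write broadcasts its destination
# register against every waiting entry (wakeup), then select picks
# ready entries oldest-first subject to the functional-unit mix.
# The queue contents and the 2-ALU / 1-memory mix are assumptions.

FU_SLOTS = {"alu": 2, "mem": 1}

# Entries: [age, op name, functional unit, source regs still outstanding]
queue = [
    [0, "load", "mem", set()],       # already ready
    [1, "add",  "alu", {5}],
    [2, "mul",  "alu", {5, 6}],
    [3, "add2", "alu", {5}],
]

def wakeup(dest_reg):
    # One tag compare per entry: this broadcast is the cost that grows
    # with the size of the instruction queue.
    for entry in queue:
        entry[3].discard(dest_reg)

def select():
    # Oldest-first (lowest age) to help forward progress, while never
    # exceeding the per-cycle functional-unit slots.
    slots = dict(FU_SLOTS)
    picked = []
    for age, op, fu, waiting in sorted(queue):
        if not waiting and slots.get(fu, 0) > 0:
            slots[fu] -= 1
            picked.append(op)
    return picked

wakeup(5)          # some add writes register five: three entries wake
print(select())    # ['load', 'add', 'add2']: mul still waits on r6
```

Note that one broadcast woke three entries at once, which is exactly the "too many things wake up in one cycle" problem: select then has to whittle them down to what the functional units can accept.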
You get more and more stages in there to do this; as we talked about last time, the Pentium 4 actually has multiple stages out in front to do this, and unfortunately you have to pipeline that, which makes it even more challenging. So roughly, if you look at the out-of-order control logic here, it's growing somewhere between the width squared and the width cubed, and that is not very good.

Sometimes there are a couple of circuit-level tricks that can help you. If you look at people who do full-custom logic for processor designs, they'll make something they call a picker. Instead of using straight combinational logic to compute which instruction is ready to execute, a picker is a much more analog circuit: some custom circuit that is almost a CAM, a content-addressable memory. It's not quite a CAM; it's more of a circuit that chooses which instruction is ready to go, and typically you want a heuristic that picks the oldest instruction. This is to prevent deadlock: you don't really want to issue arbitrary instructions, you want to issue the oldest instructions, because you want to make forward progress in the machine.
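As a back-of-envelope illustration of that growth, here's an assumed cost model (not measurements from any design): W results are broadcast per cycle, each checked against two source tags in every window entry, and the window itself is scaled linearly with the width, since a wider machine tends to need a deeper window to stay busy. Under this model the comparator count grows as W squared; extra ports and checks push real designs toward W cubed.

```python
# Assumed cost model for the wakeup network (illustrative only):
# each cycle, W results broadcast their tags against every window
# entry's two source tags, and the window is scaled with the width
# because a wider machine needs a deeper window to find parallelism.

def wakeup_comparators(width, entries_per_way=16, srcs_per_entry=2):
    window = width * entries_per_way     # wider -> deeper window
    return width * window * srcs_per_entry

for w in (2, 4, 8):
    print(w, wakeup_comparators(w))      # 128, 512, 2048: ~W^2 growth
```

Doubling the width quadruples the comparator count in this model, which is the blow-up factor being described.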
So what we're trying to get at here is that this is a lot of hardware complexity. It grows with the width of the machine, and some things actually make it even worse. I have a little note here: as you increase the width of your machine, say you have a three-wide machine and you want to go to, let's say, a four- or five-wide machine, you typically also need a larger instruction queue, or instruction window, to look across in order to find enough parallelism to keep the machine busy. So as you go wider, you also typically have to make the machine deeper, or at least make your instruction queue deeper, to find enough parallelism to feed your functional units. Hm. That's not good. This causes these blow-up factors, and while you can build these things, they're hard. So are there alternatives? Yes.

But before we move on to the alternatives, let's look at the complexity of this in a real processor. Here we have a die photo, or die micrograph, of a MIPS R10000 processor. This was used in workstations by SGI for many years. It's a real out-of-order superscalar, and one of the interesting things is that if you look at how much area is dedicated just to control logic, it's pretty substantial.
So in this processor, here's our actual datapath; it's relatively modest. The instruction cache and data cache are pretty big structures; you can see those. We have, let's say, the data tags, the instruction tags, the TLB; those are biggish things, and all this other stuff here is quite large. Let's look at what's inside it. We have the next-instruction logic. We have the free list, which we talked about, right? We talked about the free list that tracks which registers are free, if you will, to reuse in your out-of-order processor. We have a big block that just does register renaming. We have different queues here; this is a distributed-instruction-queue, or distributed-reservation-station, processor, so there's one queue just for address-calculation instructions, one for integer instructions, and one for floating-point instructions. But the interesting thing to see here is the percentage of the die that's actually just doing compute, let's say the integer datapath, the floating-point datapath, the floating-point multiplier: it pales in comparison to all the overhead. So this is just to get that idea across.
Another problem with out-of-order superscalars, or superscalars in general, is thinking about how we express parallelism. I alluded to this in a previous lecture, but it's worth repeating. So let's take a look at what happens for sequential code on an out-of-order superscalar. We start off here with some piece of C code. We put it into our fancy superscalar compiler. That generates some sort of data-flow graph, and probably also a control-flow graph, inside the compiler. So there are instructions, there are dependencies between the different instructions, and the compiler is trying to find independent operations, because it's going to come up with a schedule. Unfortunately, it has to come up with a sequential schedule, because the instruction sets of sequential processors require you to come up with some order, even though it's very possible that there is no one perfect order. Instead of over-constraining the problem here, why are we coming up with some ordering when we don't need to? But with a sequential instruction set, we have to. We write it out to disk somehow, or save a file with this code, and then we have to go and actually try to execute it.
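The information loss being described can be sketched as follows. The three operations and their dependence edges are made up for illustration: a and b are independent, yet the sequential instruction set forces the compiler to commit to one total order, which the processor must later un-flatten.

```python
# The compiler's view: a dependence graph (edges are made up for
# illustration). a and b are independent, but a sequential ISA makes
# the compiler emit exactly one total order anyway.

deps = {
    "a": set(),        # a = x + y
    "b": set(),        # b = u * v   (independent of a)
    "c": {"a", "b"},   # c = a - b   (needs both)
}

def any_topological_order(deps):
    # Any order respecting the edges is legal; the binary encodes one.
    order, done = [], set()
    while len(order) < len(deps):
        for op in sorted(deps):
            if op not in done and deps[op] <= done:
                order.append(op)
                done.add(op)
    return order

print(any_topological_order(deps))   # ['a', 'b', 'c'] with this tie-break
# The processor then has to rediscover, in hardware, that 'a' and 'b'
# were independent all along.
```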
Now, what's funny, and I always get a chuckle out of this, is that the compiler knew what the dependencies were. And then we have our superscalar processor, which goes and takes the sequential code apart, tries to reorder the instructions, and tries to find parallelism in the instruction sequence; but the compiler knew that. We have a breakdown here in the middle: we just weren't able to express that between the compiler and the processor. The compiler knew about the parallelism; the compiler knew about all the ordering requirements. It had to come up with some sequential code sequence and provide it to the processor. The processor takes that sequential code sequence back apart and does out-of-order scheduling, like we did on the midterm. We built a bunch of hardware here with high complexity, as we were talking about, all that combinational logic checking the dependencies, and then it comes up with some schedule. This is done via our reservation stations, or our instruction queues: something is going to fire, it's going to come up with some schedule, and it's going to execute those instructions while preserving read-after-write dependencies.
And hopefully, if our superscalar processor is fancy, we have a register renamer so we can break some of the other dependencies too, the write-after-write and write-after-read dependencies, and come up with some schedule that's high performance. Now, there's no guarantee that this schedule is what the compiler would have done if it were smarter, or if it could somehow express parallelism across this interface. But, you know, it does something. And what's funny about this is that there's a whole lot of power being wasted, a whole lot of effort being wasted in the processor. There's a lot of effort by the designers of the processor, who are effectively designing all these circuits to take apart the sequential code sequence and do something useful with it. So there's a lot of work being done there. Is there a better way?