1 00:00:03,046 --> 00:00:07,027 In-order fetch. So, we have an in-order frontend and we 2 00:00:07,027 --> 00:00:10,067 have an out-of-order issue, writeback and commit. 3 00:00:12,001 --> 00:00:18,280 So, what this is going to try and do is we're going to try and fix some of the 4 00:00:18,280 --> 00:00:25,770 problems that we saw, oops, here, where we'll actually had, let's say this 5 00:00:25,770 --> 00:00:33,173 instruction here, this add is waiting to issue because our issue was in order. 6 00:00:33,173 --> 00:00:39,960 So, that instruction could issue. All of its inputs are ready. 7 00:00:40,168 --> 00:00:43,432 R11 was written right there, or something like that. 8 00:00:43,432 --> 00:00:45,789 So, R11 is ready. It's ready to issue. 9 00:00:45,789 --> 00:00:51,496 But because we have in-order issue, by definition, we can't issue out-of-order. 10 00:00:51,496 --> 00:00:55,934 So, now let's look at a machine where we can issue out-of-order. 11 00:00:55,934 --> 00:01:01,106 So, we actually move it to the execute units, do our register fetch, 12 00:01:01,106 --> 00:01:05,574 out-of-order. So, to do that, we actually have to add 13 00:01:05,574 --> 00:01:10,642 another data structure, and this data structure we're going to call the issue 14 00:01:10,642 --> 00:01:14,007 queue. And this is going to be something like an 15 00:01:14,007 --> 00:01:17,491 associative FIFO, but it's not going to be FIFO, we're going to basically going to 16 00:01:17,491 --> 00:01:21,078 store instructions into it, and then we're going to, we're going to put instructions 17 00:01:21,078 --> 00:01:24,757 into it in order, because our fetch is in order, or frontend is in order. 18 00:01:24,757 --> 00:01:27,455 But we're going to pull out of it out of order. 19 00:01:27,455 --> 00:01:31,965 And, but we have to be a little careful, because when we go to pull out of that 20 00:01:31,965 --> 00:01:34,283 data structure, we have to make sure that the. 21 00:01:34,283 --> 00:01:37,609 All of the appropriate registers are, are ready. 22 00:01:37,609 --> 00:01:43,855 So, all of our dependencies are ready or at least somewhere in the bypass or we can 23 00:01:43,855 --> 00:01:48,147 go pick it off the bypass by the time it actually needs the value. 24 00:01:48,147 --> 00:01:54,255 Or in our, in our bypass stage. Let's take a look here. 25 00:01:54,255 --> 00:02:00,003 We still have architectural register file. We got rid of all that complex reorder 26 00:02:00,003 --> 00:02:03,400 buffer stuff on this example. So, we don't have a reorder buffer. 27 00:02:03,400 --> 00:02:07,022 We don't have a store buffer. We have out-of-order commit. 28 00:02:07,022 --> 00:02:10,848 So, we're going to have the same problems we had before with precise exceptions in 29 00:02:10,848 --> 00:02:15,749 this pipe but we, it's going to give us sort of easy introduction to understanding 30 00:02:15,749 --> 00:02:21,658 what the issue queue looks like. So, the issue queue gets written when 31 00:02:21,658 --> 00:02:26,017 instructions enter it. Sort of at the decode stage. 32 00:02:26,017 --> 00:02:31,065 This is, itself is a register structure. So, it's like, flip flops in here and a 33 00:02:31,065 --> 00:02:36,028 bunch of logic and stuff. So, you can just go from this stage to 34 00:02:36,028 --> 00:02:40,019 this stage in the pipe without having any, registers. 35 00:02:40,019 --> 00:02:45,977 So it gets read in there and things are basically going to update into this. 36 00:02:45,977 --> 00:02:52,192 So, when things actually end up in the architecture register file, we're going to 37 00:02:52,192 --> 00:02:58,279 mark bits in the issue queue, saying, oh, that register you were waiting on, it's 38 00:02:58,279 --> 00:03:01,459 now ready. And if you get, let's say, two registers 39 00:03:01,459 --> 00:03:05,403 that are ready and you're dependent on two registers, you can issue. 40 00:03:05,403 --> 00:03:11,416 And we can issue out-of-order. Okay. 41 00:03:11,416 --> 00:03:14,036 So, I have read and write here in the, the issue stage. 42 00:03:14,036 --> 00:03:18,015 Why do I, why do I have that? Well, when you actually go to issue it you 43 00:03:18,015 --> 00:03:22,032 basically want to mark that instruction as, oh, yeah, I actually did issue this 44 00:03:22,032 --> 00:03:24,820 instruction. You want to, sometimes people build this 45 00:03:24,820 --> 00:03:27,736 as actually remove the instruction from the issue queue. 46 00:03:27,889 --> 00:03:32,264 In this basic case, we're gonna leave it in the issue queue until the end of the 47 00:03:32,264 --> 00:03:36,427 pipe because we need some way to track the, the liveness of the instruction. 48 00:03:36,427 --> 00:03:41,250 But the, the next processor we're basically gonna think of it as sort of a. 49 00:03:41,250 --> 00:03:47,029 I, I won't say FIFO, because it's not strictly first in first out, but it's, 50 00:03:47,029 --> 00:03:52,771 it's a data structure where you can put stuff in and then remove stuff out of. 51 00:03:52,771 --> 00:03:59,009 So, it's a buffer or a multi entry buffer. Okay, let's, let's take a look inside the 52 00:03:59,009 --> 00:04:02,046 issue queue. And this is assuming something like MIPS. 53 00:04:02,046 --> 00:04:05,075 So, in our issue queue here we're going to have. 54 00:04:07,105 --> 00:04:09,633 First of all, we're going to actually going to have the Opcode, because we're 55 00:04:09,633 --> 00:04:13,033 going to have multiple instructions sort of ganging up in here. 56 00:04:13,033 --> 00:04:16,760 So, it's possible it could be like, I don't know, in this case one, two, three, 57 00:04:16,760 --> 00:04:19,023 four, five. There's five instructions just sort of 58 00:04:19,023 --> 00:04:22,227 sitting in this, in this data structure. So, just sort of slacking. 59 00:04:22,227 --> 00:04:26,061 There's a buffer and a slack in the front of your pipe, effectively. 60 00:04:26,061 --> 00:04:30,577 You might need the immediate's values, because that's sort of part of the Opcode, 61 00:04:30,577 --> 00:04:35,431 or part of the instruction, and then, then we're going to basically have things for 62 00:04:35,431 --> 00:04:41,548 the three different registers. And what were going to put in here is we 63 00:04:41,548 --> 00:04:44,265 are going to put the register identifiers, so. 64 00:04:44,265 --> 00:04:49,465 Here let's look, let's look at the sources first, because that makes a little more 65 00:04:49,465 --> 00:04:52,031 sense. The V bit is going to say that this 66 00:04:52,031 --> 00:04:59,070 instruction needs the source. So, it's possible that, as we said, you 67 00:04:59,070 --> 00:05:03,493 know, there's some like an immediate instructions don't read both source 68 00:05:03,493 --> 00:05:07,004 operands. So, for an immediate, only one, probably 69 00:05:07,004 --> 00:05:13,606 this valid bit is going to be set and this one is going to be zero. 70 00:05:13,606 --> 00:05:22,247 The P bit is pending. So, what that means is somewhere, later in 71 00:05:22,247 --> 00:05:27,411 the pipe there is a in-flight instruction that writes to that register. 72 00:05:27,411 --> 00:05:31,978 So, we're going to basically track that with a bit here saying that it, that iis 73 00:05:31,978 --> 00:05:35,097 still pending. Now, the reason we actually have to keep 74 00:05:35,097 --> 00:05:41,062 that pending information and the validity informations are separate and why need to 75 00:05:41,062 --> 00:05:46,080 keep both of them is it's possible that multiple instructions could light up 76 00:05:46,080 --> 00:05:52,050 simultaneously as being ready to go. So, if you have, let's say, two 77 00:05:52,050 --> 00:05:58,091 instructions that are both dependent only on register five and register five is not 78 00:05:58,091 --> 00:06:02,042 ready, it's, it's the destination of a multiply. 79 00:06:03,057 --> 00:06:07,770 But then register five gets ridden. It's going to actually clear the pending 80 00:06:07,770 --> 00:06:11,172 bits. So, it's no longer pending and it might be 81 00:06:11,172 --> 00:06:14,958 in multiple places. So, it might be like here and here or 82 00:06:14,958 --> 00:06:18,568 maybe even there. It's going to clear the pending bits, and 83 00:06:18,568 --> 00:06:23,891 whoever is trying to read register five, and then, because we're only let's say 84 00:06:23,891 --> 00:06:27,011 going to have single issue on this processor. 85 00:06:27,011 --> 00:06:32,083 You can only pull one thing out at a time. So, you need to sort of pull one thing out 86 00:06:32,083 --> 00:06:38,082 and you need someway to leave the other instruction in the issue queue having all 87 00:06:38,082 --> 00:06:42,016 of it's operands ready. So, that's how you do that. 88 00:06:42,043 --> 00:06:45,068 Okay, so how do we figure out if an instruction is ready? 89 00:06:48,030 --> 00:06:55,852 Well, we see if the if the, if it actually uses it, that the operand, uses the, 90 00:06:55,852 --> 00:07:02,148 excuse me, the, particular upper and identifier and we see if it's pending. 91 00:07:02,148 --> 00:07:06,959 And then we have to check the other, other for, for two inputs. 92 00:07:06,959 --> 00:07:10,942 We need to check the two inputs, and then we also have to make sure there's no 93 00:07:10,942 --> 00:07:15,051 structural hazard somewhere else in the pipe, like if we're trying to schedule 94 00:07:15,051 --> 00:07:17,721 write, write ports. And we're going to use the scoreboard to 95 00:07:17,721 --> 00:07:20,061 do that. The other thing is if we want high 96 00:07:20,061 --> 00:07:24,372 performance, we probably don't want to have to wait for values to get to the end 97 00:07:24,372 --> 00:07:26,883 of the pipe. So, in reality, this, this, this is even 98 00:07:26,883 --> 00:07:30,885 going to get more complicated because it's going to be, these things plus information 99 00:07:30,885 --> 00:07:34,799 coming from the scoreboard which is when, when a particular register identifier is 100 00:07:34,799 --> 00:07:37,522 ready or when a particular register is ready. 101 00:07:37,708 --> 00:07:42,417 Let's talk about the destination here. This just tracks whether an instruction 102 00:07:42,417 --> 00:07:46,316 writes a destination or not. And like I said, in these, in these basic 103 00:07:46,316 --> 00:07:50,082 pipelines, we're going to leave instructions in the instruction queue 104 00:07:50,082 --> 00:07:54,469 until they get to the end of the pipe. And when it gets to the end of the pipe, 105 00:07:54,469 --> 00:07:59,772 this is where we go to check to see, what other locations we need to, sort of, clear 106 00:07:59,772 --> 00:08:02,080 out. So, if there's an instruction, let's say, 107 00:08:02,080 --> 00:08:06,413 sitting here, which has the valid bits set, and its range of register five is the 108 00:08:06,413 --> 00:08:10,776 destination. When that instruction commits, we're going 109 00:08:10,776 --> 00:08:16,003 to clear it out of, the issue queue. And, we're going to say, oh well it wrote 110 00:08:16,003 --> 00:08:19,687 register five. Let's scan through all these other places 111 00:08:19,687 --> 00:08:22,445 and look for places that say register five. 112 00:08:22,445 --> 00:08:28,101 And if they say register five, we are going to flip the pending bid from pending 113 00:08:28,101 --> 00:08:33,664 to not pending anymore and we're going to remove the instruction from the issue 114 00:08:33,664 --> 00:08:37,048 queue. Okay, so, centralized versus distributed 115 00:08:37,048 --> 00:08:41,048 issue queues. In a, in a, in a sort of logical sense, in 116 00:08:41,048 --> 00:08:46,138 a perfect world it's probably nice to have a big centralized issue queue. 117 00:08:46,348 --> 00:08:51,662 You can scan over all the instructions. You don't have to sort of look around, but 118 00:08:51,662 --> 00:08:56,267 this can sometimes be harder to implement, because you have to put all of your 119 00:08:56,267 --> 00:09:00,352 instructions in one location. And sometimes they don't necessarily sort 120 00:09:00,352 --> 00:09:03,624 of cross communicate very well. Like floating points units. 121 00:09:03,624 --> 00:09:07,939 They have floating point registers versus integer register files is one type. 122 00:09:07,939 --> 00:09:11,035 They don't necessarily need a whole lot of communication. 123 00:09:11,035 --> 00:09:16,450 So, you could have distributed instruction queues where you sort of steer let's say 124 00:09:16,450 --> 00:09:21,058 here the execute or ALU and memory ops to one of these for these functional units. 125 00:09:21,058 --> 00:09:24,950 They have a different instruction queue just for, multiplies or something like 126 00:09:24,950 --> 00:09:27,861 that. I'm going to, put a paper on the website 127 00:09:27,861 --> 00:09:32,034 for you guys all to read. It's Tomasulo algorithm. 128 00:09:32,034 --> 00:09:37,033 It's a very famous paper, and in that they talk about distributed instruction queues. 129 00:09:37,033 --> 00:09:42,014 Strangely enough, sort of, the first place in literature that these instruction 130 00:09:42,014 --> 00:09:46,083 queues showed up was around that time, or in that paper, and they actually go 131 00:09:46,083 --> 00:09:51,040 straight to the distributed version and kind of skip the centralized version, 132 00:09:51,040 --> 00:09:56,052 which I always found a little bit odd but lots of people build centralized ones 133 00:09:56,069 --> 00:09:59,051 today. One of the, one of the reasons that the I 134 00:09:59,051 --> 00:10:04,233 think the Tomasulo algorithm people went for this distributing one first is because 135 00:10:04,233 --> 00:10:08,318 they were actually implementing it in multiple discreet chips. 136 00:10:08,318 --> 00:10:12,222 So they had a issue cube per chip. So, they had a floating point chip and a 137 00:10:12,222 --> 00:10:14,826 integer chip and they steered the instructions. 138 00:10:14,826 --> 00:10:19,539 They didn't wanna have to have one data structure that cross, cross two chips. 139 00:10:19,539 --> 00:10:21,712 Today we have, you know, lots of integration. 140 00:10:21,712 --> 00:10:24,098 So, we don't have to worry about that as much. 141 00:10:26,039 --> 00:10:31,827 Okay, so, I just wanted to briefly say, this is a question that came up last time, 142 00:10:32,047 --> 00:10:34,170 and scoreboards. If you have to worry about 143 00:10:34,170 --> 00:10:38,083 write-after-write hazards in the pipe. The things we are talking about today, we 144 00:10:38,083 --> 00:10:41,820 didn't have to really worry about that. But this is something to talk about next 145 00:10:41,820 --> 00:10:45,001 time, we're gonna have to think about better scoreboards. 146 00:10:45,001 --> 00:10:48,068 It looks the same as the previous scoreboard, but you might need to keep the 147 00:10:48,068 --> 00:10:52,059 functional unit, or you need to keep the functional unit number, or some bits that 148 00:10:52,059 --> 00:10:56,022 represent the functional unit, and then those march down the pipe versus 1s 149 00:10:56,022 --> 00:10:59,011 marching down the pipe. You have, let's say, different numbers, 150 00:10:59,011 --> 00:11:02,035 like one, two, three, or you have other things marching down the pipe. 151 00:11:02,035 --> 00:11:05,039 But that's only if you have to track write-after-write hazards. 152 00:11:05,039 --> 00:11:06,017 Okay. So. 153 00:11:06,017 --> 00:11:10,918 That was a quick aside. Let's look at this in order, fetch, 154 00:11:10,918 --> 00:11:18,325 out-of-order issue, out-of-order, writeback and out-of-order commit 155 00:11:18,325 --> 00:11:23,877 processor in a pipeline diagram. And I'm going to show a new little thing 156 00:11:23,877 --> 00:11:28,032 in our pipeline diagram here, we have a lowercase i. 157 00:11:28,032 --> 00:11:32,010 Which means the instruction enters the issue queue. 158 00:11:32,039 --> 00:11:37,000 And then, when it exits the issue queue, it, it just starts going down into the 159 00:11:37,000 --> 00:11:42,005 issue stage of the pipe. And, trying to think where at, where am I? 160 00:11:42,005 --> 00:11:48,649 This is not that complicated of a drawing. Here I'm actually showing the, the issue 161 00:11:48,649 --> 00:11:51,071 queue. What is in the issue queue. 162 00:11:51,071 --> 00:11:54,648 So I have two, or excuse me, three issue queue slots. 163 00:11:54,648 --> 00:11:57,071 Because, this is a relatively simplistic pipe. 164 00:11:58,044 --> 00:12:07,098 The, this first instruction which goes to execute register two and register three 165 00:12:07,098 --> 00:12:11,047 are just already ready, so we don't have to worry about it. 166 00:12:11,047 --> 00:12:19,026 But an example of something that's more interesting, let's say, is, I don't know. 167 00:12:19,026 --> 00:12:21,041 Where is it going. Here we go. 168 00:12:22,025 --> 00:12:27,080 This will enter the issue queue. But register twelve does not become ready 169 00:12:27,080 --> 00:12:31,072 until late. So, once that becomes, comes ready, we can 170 00:12:31,072 --> 00:12:37,019 basically start pulling things out of the issue queue, and, and issue that 171 00:12:37,019 --> 00:12:39,080 instruction. So that, that shows up. 172 00:12:39,080 --> 00:12:44,073 Let's see if we can show that here, register, what is this, fourteen, twelve? 173 00:12:44,080 --> 00:12:50,065 Okay. So this one here, we're waiting for, that 174 00:12:50,065 --> 00:12:52,367 to happen. Let's see. 175 00:12:52,367 --> 00:13:02,591 Yeah, we're waiting for no, where is it twelve, just add. 176 00:13:02,591 --> 00:13:19,044 [inaudible] this long. Wrong. 177 00:13:20,546 --> 00:13:40,401 Yes. That is actually is what I want to show 178 00:13:40,401 --> 00:13:42,056 here. This is interesting. 179 00:13:43,512 --> 00:13:50,065 That value actually becomes ready, early. But. 180 00:13:51,003 --> 00:13:59,012 Is that what's actually happening here, the value becomes ready at the end of this 181 00:13:59,012 --> 00:14:02,024 execute stage. But it can't go to issue. 182 00:14:02,077 --> 00:14:05,058 At that cycle because there's something else issuing. 183 00:14:05,058 --> 00:14:09,243 So, it has to be pushed out. So, we're only doing a single issue per 184 00:14:09,243 --> 00:14:12,005 cycle. If we had multiple issues, you could think 185 00:14:12,005 --> 00:14:14,044 of something more interesting happening here. 186 00:14:24,086 --> 00:14:30,034 Yeah, so this is actually a really complicated case because if we pull this 187 00:14:30,034 --> 00:14:34,016 back let's say to here we get a writeback hazard. 188 00:14:34,016 --> 00:14:39,065 The W would conflict with the other W. If we try to issue here, sorry, that was a 189 00:14:39,065 --> 00:14:42,097 reissue here. If we try to issue, here we get a 190 00:14:42,097 --> 00:14:48,017 structural hazard on the issue. So, it actually gets pushed way out there. 191 00:14:48,038 --> 00:14:53,008 So, it's kind of a, kind of a bummer. So, sometimes you just lose. 192 00:14:58,053 --> 00:15:08,041 Okay, so, here's a interesting case. So, let's assume that all the instructions 193 00:15:08,041 --> 00:15:12,092 are preloaded into the issue queue. So, showing this, we have fetch, decode, 194 00:15:12,092 --> 00:15:17,086 issue, issue, issue, and basically this thing's just sort of sitting in the issue 195 00:15:17,086 --> 00:15:20,067 queue and then we start looking. What happens? 196 00:15:20,096 --> 00:15:24,030 Those the performance get better than the previous example? 197 00:15:24,030 --> 00:15:28,067 So, interesting enough, even you if you sort of issue everything into the pipe 198 00:15:28,067 --> 00:15:31,185 early and then just sort of fill up your issue queue. 199 00:15:31,185 --> 00:15:35,876 Pulling out of the issue queue can still be limiter and the number of ALUs can 200 00:15:35,876 --> 00:15:39,020 still be a limiter. So, there is other structural hazards. 201 00:15:39,020 --> 00:15:43,518 The performance of this code in this pipeline does not actually get any better 202 00:15:43,518 --> 00:15:47,412 which is sort of interesting. You think, Oh, well, I don't have to wait 203 00:15:47,412 --> 00:15:50,008 for things to dribble into my instruction queue. 204 00:15:50,008 --> 00:15:52,093 I just sort of preload my instruction queue. 205 00:15:52,093 --> 00:15:55,996 But the performance really, really doesn't end up being better. 206 00:15:55,996 --> 00:16:02,256 So this motivates us going to both out of order mixed with duplication of 207 00:16:02,256 --> 00:16:05,873 structures. Like, maybe we can issue two instructions 208 00:16:05,873 --> 00:16:09,030 at the same time. And maybe we can have two ALUs. 209 00:16:09,030 --> 00:16:12,090 So this is, motivates us having a superscalar.