1 00:00:03,050 --> 00:00:06,003 Okay. So now, now we get into the fun stuff. 2 00:00:06,003 --> 00:00:10,013 Out of order processors. Up to this point we've been doing simple 3 00:00:10,013 --> 00:00:13,039 in order stuff. We've been hinting at out of order, but 4 00:00:13,039 --> 00:00:17,026 now we're going to start looking at, truly out of order machines. 5 00:00:18,057 --> 00:00:22,029 And we're gonna introduce a bunch of new structures that you should not have 6 00:00:22,029 --> 00:00:25,043 learned about up to this point. You may have briefly talked about 7 00:00:25,043 --> 00:00:28,085 scoreboard, but we're not still talking about bunch of other structures. 8 00:00:29,035 --> 00:00:34,078 Okay, so, so let's talk about different portions of a pipeline that can be in 9 00:00:34,078 --> 00:00:38,059 order versus out of order. The front end. 10 00:00:38,059 --> 00:00:44,047 So what do I mean by the front end? The front end is instruction, fetch, and 11 00:00:44,047 --> 00:00:51,032 probably decode. This is pretty hard to do out of order 12 00:00:51,032 --> 00:00:55,549 [laugh] because you want to know, I mean, unless you've actually sort of, can 13 00:00:55,549 --> 00:01:00,373 predict the future, it's pretty hard to know so most of machines are going to be 14 00:01:00,373 --> 00:01:03,319 looking at have the front end being in order. 15 00:01:03,319 --> 00:01:06,660 So IO here means in order, OOO means out of order. 16 00:01:06,660 --> 00:01:10,705 Issue. So this is the stage in the pipe that 17 00:01:10,705 --> 00:01:17,048 actually says All the operands are ready, go start executing the instruction. 18 00:01:18,015 --> 00:01:23,009 Well, there are some, some things that can be in order. 19 00:01:23,050 --> 00:01:26,084 Well, you can probably even get more performance if you try and do that out of 20 00:01:26,084 --> 00:01:30,066 order. So probably not today but next lecture 21 00:01:30,066 --> 00:01:35,071 we're actually gonna talk about how to have the issue be out of order. 22 00:01:36,023 --> 00:01:41,026 Right back. So this is writing to something that looks 23 00:01:41,026 --> 00:01:47,075 like a register file. But not necessarily having hit the commit 24 00:01:47,075 --> 00:01:53,042 point of a processor, so this means things are no longer in flight in the bypass 25 00:01:53,042 --> 00:01:59,015 network, it's been committed to a register file, I won't call it the register file 26 00:01:59,015 --> 00:02:02,055 just yet. Well you could do that in order, out of 27 00:02:02,055 --> 00:02:07,577 order, or excuse me, you could do that in order, out of order, different choices 28 00:02:07,577 --> 00:02:09,988 there. And then finally commit. 29 00:02:09,988 --> 00:02:15,133 So this is saying, yes I actually want these instructions to actually commit and 30 00:02:15,133 --> 00:02:19,220 I'm basically, after that point, you're not able to roll back the machine state to 31 00:02:19,220 --> 00:02:22,114 a previous state. And we're gonna look at machines where 32 00:02:22,114 --> 00:02:28,046 that's both in order and out of order. This is sort of a, a gentle introduction 33 00:02:28,046 --> 00:02:33,049 to out of order execution. And we're gonna be building some different 34 00:02:33,049 --> 00:02:36,886 architectures. And this name here just is, sort of a 35 00:02:36,886 --> 00:02:42,640 naming of five architectures we're gonna look at as an introduction to out of 36 00:02:42,640 --> 00:02:44,519 order. The I4 is in, in, in. 37 00:02:44,519 --> 00:02:49,208 So, or in, in, in order, in order, in order, in order, in order. 38 00:02:49,208 --> 00:02:52,625 I2 O2 is. In order, in order, out of order, out of 39 00:02:52,625 --> 00:02:56,454 order. Into out of order in order is in order, in 40 00:02:56,454 --> 00:03:01,008 order, out of order. It's just, it's, the number just means how 41 00:03:01,008 --> 00:03:05,949 many stages in a row are the same as the, as the letter before it. 42 00:03:05,949 --> 00:03:10,471 Okay, so I wanna briefly touch on some nomenclature over here. 43 00:03:10,471 --> 00:03:15,738 And you shouldn't know what these things are yet, unless you went and read Shen 44 00:03:15,738 --> 00:03:21,177 Lipasti, which I highly recommend. Scoreboard, this a structure where we keep 45 00:03:21,177 --> 00:03:25,071 information about what instruction is ready execute. 46 00:03:25,071 --> 00:03:30,465 The order buffer sometimes gets merged into a lot of different meanings, but 47 00:03:30,465 --> 00:03:35,104 typically, when we execute instructions out of order, this is place where we can 48 00:03:35,104 --> 00:03:38,060 go and actually re-order'em, to commit them in order. 49 00:03:38,060 --> 00:03:43,258 And we resolved all the dependencies so we don't actually commit out of order. 50 00:03:43,258 --> 00:03:47,922 Store buffer. This is something where we'll actually 51 00:03:47,922 --> 00:03:53,758 hold off, storing the memory until the commit point, because we don't want to 52 00:03:53,758 --> 00:03:58,052 write to memory too early. Cuz that would be bad, cuz then that would 53 00:03:58,052 --> 00:04:01,047 effectively be committing. So you, this is one important thing to 54 00:04:01,047 --> 00:04:05,019 remember about commit points is if you commit too early, what is an important 55 00:04:05,019 --> 00:04:08,028 notion of a commit here? Well, it's any state that you can't roll 56 00:04:08,028 --> 00:04:10,016 back. So, if you do a store to main, main 57 00:04:10,016 --> 00:04:13,035 memory, that's a commit. That's really hard to roll back unless you 58 00:04:13,035 --> 00:04:17,021 can somehow keep all that state of what was in that memory location before that. 59 00:04:17,021 --> 00:04:20,079 But people don't typically do that. Instead, you'll actually sort of have a 60 00:04:20,079 --> 00:04:23,088 buffer which keeps track of all the storage you're trying to do. 61 00:04:23,088 --> 00:04:27,050 And every once in a while, when you actually do a commit of the instruction, 62 00:04:27,050 --> 00:04:30,050 at that point and only that point do you store to main memory. 63 00:04:31,056 --> 00:04:38,058 And finally, we'll be talking about the issue queue, which is a, if you are 64 00:04:38,058 --> 00:04:43,868 issuing out of order, this is a structure which allows you to determine when it's 65 00:04:43,868 --> 00:04:48,529 safe to go and issue instruction or not. And this is, what's important about this 66 00:04:48,529 --> 00:04:54,017 is that it keeps all the dependencies, the read after write, write after write. 67 00:04:54,017 --> 00:04:58,087 Write after read dependencies in check. That's what that structure is doing. 68 00:04:58,087 --> 00:05:02,025 So we're gonna look at some of these structures today. 69 00:05:02,069 --> 00:05:08,004 Okay, so motivational code sequence here. And what is a motivation for out of order 70 00:05:08,004 --> 00:05:13,049 execution? Here's a, here's a simple code sequence 71 00:05:13,049 --> 00:05:18,040 and let's, let's take a look at some of the read after write dependencies for this 72 00:05:18,040 --> 00:05:25,072 code sequence. Here we have a write to register one, and 73 00:05:25,072 --> 00:05:30,089 a read of register one. So that is a, a dependency there. 74 00:05:30,089 --> 00:05:36,043 So, instruction zero has instruction two as a dependence. 75 00:05:36,043 --> 00:05:42,004 So we'll draw a nice arc there. Let's see what else happens here. 76 00:05:42,004 --> 00:05:46,025 We're, register instruction two writes to register five. 77 00:05:46,026 --> 00:05:51,039 And that gets read in instruction three. So that's a dependency arc there. 78 00:05:52,042 --> 00:05:59,372 Okay, interesting enough here, construction one isn't dependent on 79 00:05:59,372 --> 00:06:04,764 instruction zero, and instruction two and instruction three are not dependent on 80 00:06:04,764 --> 00:06:08,763 that. So, that's, we can just put it in a 81 00:06:08,763 --> 00:06:17,085 separate little bucket here. Instruction four here, let's take a look. 82 00:06:17,085 --> 00:06:22,470 Register twelve, register eleven. It reads register eleven. 83 00:06:22,475 --> 00:06:31,685 So it's dependent on instruction one. Five is dependent on the output here 84 00:06:31,685 --> 00:06:37,760 twelve, So it gets put after four and likewise six reads register twelve, so it 85 00:06:37,760 --> 00:06:42,043 is dependent. So you have to start to think about if we 86 00:06:42,043 --> 00:06:46,066 have an out of order processor we can choose the order of this instruction 87 00:06:46,066 --> 00:06:50,067 sequence and that instruction sequence, and if we have a multi-issue or 88 00:06:50,067 --> 00:06:55,030 superscalar out of order processor we can even try to think about how to execute 89 00:06:55,030 --> 00:06:57,085 these completely independently at the same time and we can get performance from this, 90 00:06:57,085 --> 00:07:04,822 cuz we confined parallelism inside of a sequential execution stream. 91 00:07:04,822 --> 00:07:11,455 So one, one important thing here to realize is today we're going to be talking 92 00:07:11,455 --> 00:07:17,005 about how to dynamically, in hardware, schedule this and this at the same time, 93 00:07:17,005 --> 00:07:21,468 and we're going to be using, this as a motivating example in the VLIW, or very 94 00:07:21,468 --> 00:07:25,243 long instruction word lectures. We're going to be talking about some 95 00:07:25,243 --> 00:07:30,021 software techniques that can determine when to execute this and this at the same 96 00:07:30,021 --> 00:07:34,396 time and architectures which can take advantage of that or the software has 97 00:07:34,396 --> 00:07:38,136 determined that these two things are completely independent. 98 00:07:38,136 --> 00:07:45,421 Okay so an important thing to realize here is that why does even, why do these even 99 00:07:45,421 --> 00:07:48,147 happen? Why do we end up with sort of 100 00:07:48,147 --> 00:07:52,107 non-dependent instruction sequences in a program? 101 00:07:52,107 --> 00:08:00,156 This is like a philosophical question. Should this even, why, why is this 102 00:08:00,156 --> 00:08:04,065 possible? So what do, what do, in order instruction 103 00:08:04,065 --> 00:08:10,051 semantics or sequential instruction semantics force you to do to instructions. 104 00:08:14,063 --> 00:08:17,036 Of course, you need to come up with an order. 105 00:08:17,036 --> 00:08:21,099 You need to come up with some order. I need to write on a sheet of paper, in 106 00:08:21,099 --> 00:08:26,069 some linear order or store it my memory on my computer system in some order. 107 00:08:26,069 --> 00:08:31,058 An important point here is that in the instruction set architecture of a 108 00:08:31,058 --> 00:08:36,010 sequential computer, by definition, the instructions need to be serialized. 109 00:08:38,015 --> 00:08:43,053 This is a limiter, this is a problem in sequential machines. 110 00:08:44,012 --> 00:08:50,917 You need to come up with some order. So, there is no way to express that this 111 00:08:50,917 --> 00:08:57,993 and this can execute at the same time in a sequential, in order machine, or a 112 00:08:57,993 --> 00:09:03,475 sequential machine rather. Unless you do it something like this. 113 00:09:03,475 --> 00:09:07,837 You need to have some order, like, this, this is expressing the parallelism, but 114 00:09:07,837 --> 00:09:12,075 it's hard, the hardware has to go and figure out where the parallelism is now. 115 00:09:12,075 --> 00:09:17,099 You can think of an alternative instruction set, which is not sequential, 116 00:09:17,099 --> 00:09:21,249 but instead, is a graph, of different dependencies like this. 117 00:09:21,249 --> 00:09:25,506 And, people have built those sorts of machines, they are nothing common, they 118 00:09:25,506 --> 00:09:30,219 are typically called data flow machines. And we, if we have time for it, we'll have 119 00:09:30,219 --> 00:09:32,900 one lecture at the end of the term. So there are machines that allow you to 120 00:09:32,900 --> 00:09:33,954 express. This and this are not dependent on each 121 00:09:33,954 --> 00:09:36,546 other. And maybe sometime in the future they do 122 00:09:36,546 --> 00:09:39,750 become dependent based on some branch or something like that. 123 00:09:39,750 --> 00:09:43,187 So you can, you can think about that. But in a, in a sequential processor, you 124 00:09:43,187 --> 00:09:47,033 need to come up with some ordering. This is a limiter for these machines. 125 00:09:47,033 --> 00:09:51,136 So effectively, let's say the compiler, the compiler optimizations knows that this 126 00:09:51,136 --> 00:09:55,022 and this connects the queue at the same time, by definition if it can generate 127 00:09:55,022 --> 00:09:58,585 this code it knows those two things can execute at the same time. 128 00:09:58,585 --> 00:10:01,137 Unfortunately it has now way to express it. 129 00:10:01,137 --> 00:10:04,491 And that's a bummer. Very long instruction word machines, which 130 00:10:04,491 --> 00:10:08,958 we'll be talking about a little bit in the future, allow you some ability to say I 131 00:10:08,958 --> 00:10:13,939 want to execute, let's say this and this at the same time, but it's very limiting. 132 00:10:13,939 --> 00:10:18,800 Data flow machines allow you to do much more complex sort of graphs of 133 00:10:18,800 --> 00:10:23,488 instructions and say, this is dependent this instruction is dependent on this 134 00:10:23,488 --> 00:10:26,682 instruction, and that's all I really need to, to say. 135 00:10:26,682 --> 00:10:32,791 So effectively, sequential machines have over-constrained our designs. 136 00:10:32,791 --> 00:10:39,895 Okay, so let's start evaluating our first out of order, in order machine. 137 00:10:39,895 --> 00:10:42,736 So I4, this is actually an in order machine. 138 00:10:42,736 --> 00:10:48,441 But we're going to be analyzing it as a motivator and then sort of a slippery 139 00:10:48,441 --> 00:10:51,850 slope if you will, into true out of order machines. 140 00:10:51,850 --> 00:10:56,390 So here we have fetch, decode, execute, memory, writeback. 141 00:10:56,390 --> 00:11:00,644 Same five stage pipe. I renamed things a little bit ease, to 142 00:11:00,644 --> 00:11:04,568 make it easier here. This actually follows the nomenclature 143 00:11:04,568 --> 00:11:09,891 that's in your labs, where the execute stage has an X and the, decode stage, the 144 00:11:09,891 --> 00:11:14,550 duh, in, is not ID but instead D. But this should all look relatively 145 00:11:14,550 --> 00:11:18,287 similar. I, I took out all the sort of extraneous 146 00:11:18,287 --> 00:11:21,590 bypassing and other stuff in this pipeline. 147 00:11:21,590 --> 00:11:25,084 So it's a, sort of, caricature of, of a real pipeline. 148 00:11:25,084 --> 00:11:29,526 Okay so let's say, you know, we wanna do superscalars here. 149 00:11:29,526 --> 00:11:35,836 Let's say we want to start adding multiple pipelines to this processor, so we're 150 00:11:35,836 --> 00:11:41,992 going to take the same processor pipeline, but we're going to split off and have two 151 00:11:41,992 --> 00:11:45,997 pipelines here. We're going to have a memory pipe and an 152 00:11:45,997 --> 00:11:50,383 execution pipe. This looks similar to our AMD superscalar, 153 00:11:50,383 --> 00:11:56,445 but for today let's focus on a in order, in order machine, where these two pipes 154 00:11:56,445 --> 00:12:01,812 cannot have, Different instructions in the same stage at the same time. 155 00:12:01,812 --> 00:12:07,630 So, we're going to look at a single fetch, processor here or a non, a single issue 156 00:12:07,630 --> 00:12:12,011 processor, if you will. But as a motivator, you'll see why we're 157 00:12:12,011 --> 00:12:16,646 doing this as we build up to more complex out of order things. 158 00:12:16,646 --> 00:12:21,373 Most of these things hold actually for the multi issue variants. 159 00:12:21,373 --> 00:12:27,147 But your head will hurt a little bit less if you think about the, in order variant 160 00:12:27,147 --> 00:12:31,133 or the in order, single issue variants first. 161 00:12:31,133 --> 00:12:32,888 Okay. As I said, yeah. 162 00:12:32,888 --> 00:12:37,084 Things get a little more complicated. Allow, allow us to take that same pipe 163 00:12:37,084 --> 00:12:40,642 that we had from before, and add in, a 4-stage multiplier. 164 00:12:40,642 --> 00:12:45,838 So now we have an execute stage, a single execute stage which actually has some work 165 00:12:45,838 --> 00:12:47,865 in it. This has an ALU inside of it. 166 00:12:47,865 --> 00:12:52,274 The memory let's say has two stages. This does the address computation. 167 00:12:52,274 --> 00:12:56,822 This is the actual memory access and here we have multiply which takes us a long 168 00:12:56,822 --> 00:12:59,147 time. It's pretty common and multiplies a big 169 00:12:59,147 --> 00:13:02,556 complicated beast. So there are four stages of multiply and 170 00:13:02,556 --> 00:13:09,844 then out over here we have the write back. Hm, okay, so that's, that gets us, that 171 00:13:09,844 --> 00:13:20,370 gets us thinking and we want to start to think what these things look like on the 172 00:13:20,370 --> 00:13:28,909 inside. So the first question is, do we want to 173 00:13:28,909 --> 00:13:36,042 bypass this pipeline? And how much pipeline, or how much 174 00:13:36,042 --> 00:13:49,089 bypassing do we need? In order pipe, three pipes, or three 175 00:13:49,089 --> 00:13:54,008 functional units, we'll call it one, two, three. 176 00:13:55,082 --> 00:14:07,187 But we probably still need to bypass. Can we bypass out of right here? 177 00:14:07,187 --> 00:14:15,017 Out of, Y1, does that make any sense? So the multiply is not done till here, so 178 00:14:15,017 --> 00:14:19,089 it makes no sense to bypass out of there. Do we need a bypass out of here? 179 00:14:21,032 --> 00:14:25,024 Out here. Here. 180 00:14:25,050 --> 00:14:31,062 Okay, a lot of bypass locations, so, this starts to, you can see the bypass 181 00:14:31,062 --> 00:14:35,073 exploding very quicly in this sort of pipeline. 182 00:14:35,099 --> 00:14:42,082 And, you know, that's, that's, pretty, pretty, pretty much to be expected, if you 183 00:14:42,082 --> 00:14:45,087 don't bypass. And you have to wait for everything. 184 00:14:45,087 --> 00:14:49,010 Let's say to get back to here, your, your clocks per instruction of your machine 185 00:14:49,010 --> 00:14:51,002 goes up. And we sort of go back to that simpler 186 00:14:51,002 --> 00:14:54,004 machine we had seen in earlier classes. Where sort of you have to wait for 187 00:14:54,004 --> 00:14:56,025 instructions to go all the way to the end of the pipe. 188 00:14:56,025 --> 00:14:58,039 And that, that's a bummer. Okay. 189 00:14:58,039 --> 00:15:03,081 Yeah, that's, that's what we just said. I do want to introduce this term 190 00:15:03,081 --> 00:15:06,041 functional unit. So, functional unit. 191 00:15:07,010 --> 00:15:11,086 Means one execution pipeline. So this is a functional, this is a 192 00:15:11,086 --> 00:15:17,046 multiply function unit, this is a memory function unit, this is the execute 193 00:15:17,046 --> 00:15:20,017 function unit. Okay. 194 00:15:20,017 --> 00:15:23,355 So now we're going to start to, introducing some extra structures, and 195 00:15:23,355 --> 00:15:28,671 start talking about some extra structures, which will make our lives easier when we 196 00:15:28,671 --> 00:15:33,690 start to have multiple pipelines, and complicated things are happening. 197 00:15:33,690 --> 00:15:41,715 The first thing I want to introduce here is the architectural register file. 198 00:15:41,715 --> 00:15:47,692 Or ARF for short. So the architectural register file is 199 00:15:47,692 --> 00:15:53,052 where we keep the canonical state of the machine that has been committed. 200 00:15:53,052 --> 00:15:56,019 Hence, it's called architectural register file. 201 00:15:56,019 --> 00:16:01,006 If you see me say register file, it may or may not be an architectural register file. 202 00:16:01,006 --> 00:16:05,024 It may have in-flight values, but the architectural register file is the 203 00:16:05,024 --> 00:16:11,040 committed state of the machine. So we draw that sort of end of the pipe 204 00:16:11,040 --> 00:16:14,032 here. Cuz that's where rights happen to this 205 00:16:14,032 --> 00:16:20,002 register the architectural register file. As shown here by this W or this nice time 206 00:16:20,002 --> 00:16:25,019 di, or the nice pipeline diagram W. And we try to read it at the issue stage 207 00:16:25,019 --> 00:16:31,025 of the pipe or the register effect stage of the pipe I did want to say that when we 208 00:16:31,025 --> 00:16:34,017 went to go do this we added an extra stage here. 209 00:16:34,017 --> 00:16:38,074 So we went to a six stage pipe because we're assuming that we bypass everything 210 00:16:38,074 --> 00:16:41,032 and we wanna have a little bit of extra time. 211 00:16:41,032 --> 00:16:46,001 And we'll see for some more of the complex pipes we'll start to put stuff in this 212 00:16:46,001 --> 00:16:48,087 decode stage. But for right now, decode does decode 213 00:16:49,004 --> 00:16:54,002 issue stage does some instruction steering to which functional unit and not a whole 214 00:16:54,002 --> 00:16:57,036 lot more. So that's what in our architectural 215 00:16:57,036 --> 00:17:00,009 register file. We read it here, we write it there. 216 00:17:00,009 --> 00:17:03,051 That makes sense. We may have to bypass some inflight values 217 00:17:03,051 --> 00:17:06,013 around. But those are not in our architectural 218 00:17:06,013 --> 00:17:08,313 file. Those have not been committed, those are 219 00:17:08,313 --> 00:17:13,087 just in-flight speculative values. Sb, so what's SB here? 220 00:17:13,087 --> 00:17:18,076 It's the scoreboard. So, what does the scoreboard do? 221 00:17:18,076 --> 00:17:25,081 So we're gonna show some pictures of score board but for right now, what our store, 222 00:17:25,081 --> 00:17:31,048 scoreboard is going to allow us to do is it's basically to be convenient side 223 00:17:31,048 --> 00:17:37,283 structure where we're going to keep track of where different values are in flight in 224 00:17:37,283 --> 00:17:41,954 these different pipelines. All of this data, if you go back to our 225 00:17:41,954 --> 00:17:47,023 earlier pipelines was there. We had sort of the instruction registers 226 00:17:47,023 --> 00:17:53,188 that went down the pipe here, which in the control, which kept track of what was in 227 00:17:53,188 --> 00:17:56,025 each stage. The scoreboard is just a convenient place 228 00:17:56,025 --> 00:17:58,354 to put all that information and centralize it all. 229 00:17:58,354 --> 00:18:02,005 So when you go to build a pipeline like this, you know you don't want to have to 230 00:18:02,005 --> 00:18:04,587 go sort of pick off of random locations on the pipe. 231 00:18:04,587 --> 00:18:09,080 It's easier just to sort of mentalize that data, in one location in the pipe, in the, 232 00:18:09,080 --> 00:18:11,637 in the, scoreboard. And, when we start to go out of order 233 00:18:11,637 --> 00:18:15,047 we're actually gonna store some information, in the scoreboard which gets, 234 00:18:15,047 --> 00:18:19,852 very hard to pick out of other locations. The scoreboard. 235 00:18:19,852 --> 00:18:25,074 Read and write and write, what happens here. 236 00:18:25,074 --> 00:18:31,095 Well, we read it to figure out what. Where, where we go with the bypass 237 00:18:31,095 --> 00:18:35,015 information problem from. So, in this, we're not going to use that 238 00:18:35,015 --> 00:18:39,030 calculation that we used before, but we're going to read it to figure out where the 239 00:18:39,030 --> 00:18:42,080 bypass information is coming from. And we're going to write it when we 240 00:18:42,080 --> 00:18:46,040 actually issue the instruction. We have to issue the instruction, we have 241 00:18:46,040 --> 00:18:50,026 to update some information in there. And then, as the instruction goes down to 242 00:18:50,026 --> 00:18:54,001 the end of the pipe, if the instruction actually commits, we also need to do 243 00:18:54,001 --> 00:18:57,071 something in our scoreboard. If the instruction doesn't commit or takes 244 00:18:57,071 --> 00:19:01,056 an exception, we will also have to do something at the end of the pipeline to 245 00:19:01,056 --> 00:19:05,046 clean up the scoreboard. So let's look at, let's look at the 246 00:19:05,046 --> 00:19:08,047 scoreboard. Here's our basic scoreboard. 247 00:19:09,009 --> 00:19:14,055 This is, this is for a long multiply pipe, the pipe we just saw. 248 00:19:15,011 --> 00:19:18,946 And we're going to keep track per real register, R0 in MIPS is a dead register, 249 00:19:18,946 --> 00:19:24,248 so we're not actually going to, or a register which has no dependencies going 250 00:19:24,248 --> 00:19:28,604 through so we're not even going to talk about that. 251 00:19:28,604 --> 00:19:35,316 So p, so we have one bit in here which says, if there's a write pending to that 252 00:19:35,316 --> 00:19:41,022 register. F for functional unit keeps track of which 253 00:19:41,022 --> 00:19:46,198 of the three functional units we should go in to pick off the value from. 254 00:19:46,198 --> 00:19:51,457 So it is important when we go into the bypass calculation what's going on and 255 00:19:51,457 --> 00:19:59,011 then we are going to have a shift register for each, register if you will. 256 00:19:59,011 --> 00:20:04,013 Where we're going to inject a bit. In every cycle we're going to clock it 257 00:20:04,013 --> 00:20:07,067 forward. And this is gonna tell us where to pick 258 00:20:07,067 --> 00:20:11,084 off the values. So this tells us where, which pipeline to 259 00:20:11,084 --> 00:20:17,043 look at, out of three and the data valuable tells us in the other dimension 260 00:20:17,043 --> 00:20:20,063 what stage in the pipe to go pick off from. 261 00:20:20,063 --> 00:20:26,089 So if you go look at what functional unit you are and you cross it with information 262 00:20:26,089 --> 00:20:30,070 you should figure out where to go do a bypass from. 263 00:20:33,090 --> 00:20:40,013 Okay, so every cycle logically can think of these bits shifting to the right. 264 00:20:40,013 --> 00:20:45,875 So it means the data, if there's a one here that means the data for register one 265 00:20:45,875 --> 00:20:51,079 is going to be valuable in four cycles. In the, somewhere. 266 00:20:51,079 --> 00:20:58,060 Probably the architectural register file, if you implement this simp-, relatively 267 00:20:58,060 --> 00:21:03,788 simply. Let's see, the other things I'd like to 268 00:21:03,788 --> 00:21:09,004 say about this. We would cover this all, yeah. 269 00:21:09,004 --> 00:21:14,027 So, you need to use the functional unit and the date available field to figure out 270 00:21:14,027 --> 00:21:19,097 when to bypass and where to bypass from. And then if, if the, the pending bit is 271 00:21:19,097 --> 00:21:24,611 zero, that means go look in the architectural register file, do not go 272 00:21:24,611 --> 00:21:30,043 pick it off the bypass network. Okay, now we get to, into a fun example. 273 00:21:30,043 --> 00:21:34,066 Buyer motivating example that we saw at the beginning of class. 274 00:21:41,025 --> 00:21:46,034 So here we have that same instruction sequence we saw at the beginning of class. 275 00:21:46,034 --> 00:21:49,050 As we know. You can try to execute some of those 276 00:21:49,050 --> 00:21:52,005 things either at the same time or out of order. 277 00:21:52,005 --> 00:21:55,048 But for right now we have a in order, in order, in order machine. 278 00:21:56,007 --> 00:22:01,010 So let's look at what the scoreboard has to say about this. 279 00:22:01,010 --> 00:22:08,001 So on, on the bottom of this graph er, we, er, this chart, we actually show the score 280 00:22:08,001 --> 00:22:11,093 board. And red denotes that the value is ready. 281 00:22:11,093 --> 00:22:19,010 So let's take a look at something here. So in cycle three, we start executing 282 00:22:19,010 --> 00:22:24,883 instruction zero. Instruction zero is a multiply. 283 00:22:24,883 --> 00:22:32,068 So, what happens, is, it was in, sorry. So I won't, I won't actually explain this. 284 00:22:32,068 --> 00:22:37,211 Here is cycles, D means what is in the decode stage of the pipe, what instruction 285 00:22:37,211 --> 00:22:42,666 number is in the decode stage of the pipe. I is what's in the issue stage of the 286 00:22:42,666 --> 00:22:45,416 pipe. So zero on cycles instruction zero of the 287 00:22:45,416 --> 00:22:50,830 multiply enters the decode stage then goes to the issue stage and then it moves to 288 00:22:50,830 --> 00:22:54,306 the execute units. So it actually gets put into our 289 00:22:54,306 --> 00:22:59,883 scoreboard. And what happens it gets put in for 290 00:22:59,883 --> 00:23:05,627 register one into its scoreboard, just for register one, and that just marches down 291 00:23:05,627 --> 00:23:10,001 the pipe every cycle. So every cycle, it's going to move to the 292 00:23:10,001 --> 00:23:17,064 right. And now if you look at the pipeline. 293 00:23:18,007 --> 00:23:22,016 If you look at the functional unit, and you know it's a multiply. 294 00:23:22,055 --> 00:23:29,079 And you look at where it is at in the pipe you can say oh, this is where a value 295 00:23:29,079 --> 00:23:35,065 actually becomes ready. So we now know exactly where in the pipe 296 00:23:35,065 --> 00:23:41,083 and when, or excuse me, which, which functional unit and when that value is 297 00:23:41,083 --> 00:23:45,089 ready. By looking at the functional unit and the 298 00:23:46,015 --> 00:23:51,066 bits in the scoreboard. If we look at something, let's say next 299 00:23:51,066 --> 00:23:57,057 instruction here is add. Instruction one, it moves here and then it 300 00:23:57,057 --> 00:24:02,014 just goes down the pipe, and it's ready basically the whole time. 301 00:24:02,014 --> 00:24:08,031 We don't have to wait for this value to become ready, cuz the execute instruction, 302 00:24:08,031 --> 00:24:11,092 you can basically bypass out of the execute unit. 303 00:24:11,092 --> 00:24:16,092 And it's almost always ready. That's why it's red the whole way down. 304 00:24:16,092 --> 00:24:22,029 And, you can sort of see here, that register one, this is, this is when that 305 00:24:22,029 --> 00:24:26,071 scoreboard entry is valid. And this when the register eleven 306 00:24:26,071 --> 00:24:31,085 scoreboard entry's valid in time. Okay, so let's look at, our first real 307 00:24:32,024 --> 00:24:37,056 read after write dependence. So here we have a multiply, which is 308 00:24:37,056 --> 00:24:42,047 dependent on R1. As we can see that instruction is going to 309 00:24:42,047 --> 00:24:47,068 sit in the issue stage, until. Y three happens and then it gets bypassed 310 00:24:47,068 --> 00:24:53,001 down here, excuse it me it gets bypassed down here into the issue stage, and then 311 00:24:53,001 --> 00:24:57,053 that instruction can be issued. If we go look at the scoreboard, that 312 00:24:57,053 --> 00:25:02,046 corresponds to this becoming red and then we can go and actually issue an 313 00:25:02,046 --> 00:25:06,058 instruction to, to here and we can basically bypass that value. 314 00:25:06,058 --> 00:25:13,044 And register five, which is a destination of instruction 2's multiply is valid. 315 00:25:15,038 --> 00:25:19,027 So this is a, this is a roughly simply pipeline design here, or a relatively 316 00:25:19,027 --> 00:25:22,920 simple a, a processor, but what' s nice here is you can use your scoreboard to 317 00:25:22,920 --> 00:25:27,793 keep track of when things become ready and without having to go, and which pipeline 318 00:25:27,793 --> 00:25:31,783 to go look in without having to sort of look at intermediate bits in the pipeline. 319 00:25:31,783 --> 00:25:35,545 And it's going to become much more important as we sort of go to out of order 320 00:25:35,545 --> 00:25:38,004 machines. Okay, before we move on; does everyone 321 00:25:38,004 --> 00:25:40,620 understand how to look at one of these diagrams? 322 00:25:40,620 --> 00:25:44,456 Because you are going to have to draw these later in the course. 323 00:25:44,456 --> 00:25:48,863 So this is a great question. So let's look at this picture here. 324 00:25:48,863 --> 00:25:53,215 So, we only have one entry, per location in the pipe here. 325 00:25:53,215 --> 00:25:57,269 So, you're, you're seeing we only have one of these entries. 326 00:25:57,269 --> 00:26:01,758 We have multiple bits going down here. We only have one entry here. 327 00:26:01,758 --> 00:26:06,265 So, what happens if you have, lets say, things have issued two different 328 00:26:06,265 --> 00:26:11,389 functional units, with different latencies to the same destination register? 329 00:26:11,389 --> 00:26:14,842 That's, that's, that's a very important question. 330 00:26:14,842 --> 00:26:21,838 What's gonna happen, it's going to be, you're going to fill in with this 331 00:26:21,838 --> 00:26:28,042 functional unit the newest instruction that gets issued's location. 332 00:26:29,061 --> 00:26:34,079 Because from a bypassing perspective, that's all you need to know. 333 00:26:36,098 --> 00:26:39,050 So that's, that's, like, I'd say it's very important. 334 00:26:39,050 --> 00:26:44,050 You don't need to know. From a bypassing perspective something 335 00:26:44,050 --> 00:26:49,090 that was older in program order. We only need to bypass the newest value. 336 00:26:49,090 --> 00:26:55,083 One thing we do need to do is make sure that we don't have a right after right 337 00:26:55,083 --> 00:26:58,070 hazard. So what that means is we're writing 338 00:26:58,070 --> 00:27:01,019 something later in the pipe then early in the pipe. 339 00:27:01,019 --> 00:27:04,070 That doesn't happen in this pipeline, because we pipe everything forward. 340 00:27:04,070 --> 00:27:08,026 We don't actually have write after write hazards in the right back stage. 341 00:27:08,026 --> 00:27:12,011 The next pipeline we're going to look at does have a write after write hazard, and 342 00:27:12,011 --> 00:27:15,073 that's going to require us to think a little bit harder about the scoreboard. 343 00:27:15,073 --> 00:27:17,774 Exactly, yep. So we're going to have, that's, that's 344 00:27:17,774 --> 00:27:22,043 sort of the next step, pipeline diagram, is, we have a much more interesting 345 00:27:22,043 --> 00:27:25,611 scoreboard. Actually two, two, not the next pipe, but 346 00:27:25,611 --> 00:27:29,371 the pipe after that. The, the is gonna have a more interesting, 347 00:27:29,371 --> 00:27:32,124 scoreboard. One interesting thing to note is strictly 348 00:27:32,124 --> 00:27:35,122 for what we've talked about today, you don't even need bits. 349 00:27:35,122 --> 00:27:39,207 You don't need to track two, different rights to that same register. 350 00:27:39,207 --> 00:27:44,654 Because you only need to know the most recent writes to the register. 351 00:27:44,654 --> 00:27:50,644 So, one way that I've seen people build scoreboards is, instead of having bits 352 00:27:50,644 --> 00:27:56,380 merging down here, you just have an encoding like log base two encoding or 353 00:27:56,380 --> 00:28:01,710 binary encoding of where to go look in the pipe, for the newest value. 354 00:28:01,710 --> 00:28:06,694 And then you just increment back, or decrement it every cycle or something like 355 00:28:06,694 --> 00:28:09,268 that. So you, you just have instead of having 356 00:28:09,268 --> 00:28:12,830 shift registers, you have that. And that's strictly enough for this. 357 00:28:12,830 --> 00:28:17,296 The more complex pipes we are going to look at as Pramod said, we're gonna have 358 00:28:17,296 --> 00:28:21,475 to track both the location. The functional unit in the pipe and what 359 00:28:21,475 --> 00:28:25,262 stage in the pipe it's in. So, you, you're saying how do I know that 360 00:28:25,262 --> 00:28:28,290 this is not ready until four or something like that. 361 00:28:28,290 --> 00:28:32,697 Okay, so that is a, a subtle point is that if you know what functional unit, if you 362 00:28:32,697 --> 00:28:36,252 know the functional unit latencies, you just look it up in the table. 363 00:28:36,252 --> 00:28:41,137 You don't need to actually track that information, there's just some piece of 364 00:28:41,137 --> 00:28:45,957 logic which says for a multiply if the functional unit is multiply then we can't 365 00:28:45,957 --> 00:28:49,469 by-pass till here. If it's an add we can by-pass from here or 366 00:28:49,469 --> 00:28:53,981 there or something like that. So we don't need to track that in a table, 367 00:28:53,981 --> 00:28:58,691 it can just be a part of your combinational logic in your decode unit, 368 00:28:58,691 --> 00:29:02,452 or your issue logic. Does that you don't, yeah, you don't, you, 369 00:29:02,452 --> 00:29:06,777 I'll say it one more time. To see. 370 00:29:06,777 --> 00:29:09,876 This. Like this is the example here. 371 00:29:09,876 --> 00:29:15,759 We have a, a mull that writes R1 and something which else is supposed to read 372 00:29:15,759 --> 00:29:18,048 R1. But we end up stalling here. 373 00:29:18,048 --> 00:29:23,502 In this table the way we know that is we know the functional unit, and because it 374 00:29:23,502 --> 00:29:28,837 says multiply we know we have to wait until it gets to here, or rather, probably 375 00:29:28,837 --> 00:29:31,853 here, instead of trying to pick it off earlier. 376 00:29:31,853 --> 00:29:35,492 If it was, if it said ALU then we could have it pick it off earlier. 377 00:29:35,492 --> 00:29:39,595 So we don't necessarily have to encode that, that's why in this picture here 378 00:29:39,595 --> 00:29:44,278 things turn red on some of these earlier than others, it's that functional unit 379 00:29:44,278 --> 00:29:48,487 information which encodes that. If you can even think about having that 380 00:29:48,487 --> 00:29:53,020 functional unit bits encoding some number like you said, some people design 381 00:29:53,020 --> 00:30:01,885 scoreboards and things like that. Okay, so I just want to briefly introduce 382 00:30:01,885 --> 00:30:06,676 this, then we are going into break. Next class, we are going to be talking 383 00:30:06,676 --> 00:30:11,452 about in-order, front-end and issue, out of order, write back and commit on these 384 00:30:11,452 --> 00:30:14,089 pipelines. It looks pretty similar, except, some 385 00:30:14,089 --> 00:30:20,337 pipeline stages are taken out here. And this makes you think harder about when 386 00:30:20,337 --> 00:30:25,015 to go read the score board. So we're actually going to use this decode 387 00:30:25,015 --> 00:30:30,372 stage to do some stuff at the score board, and then we're going to start talking 388 00:30:30,372 --> 00:30:36,018 about truly out of order machines where you have these other fancier structures 389 00:30:36,018 --> 00:30:41,342 like reorder buffers and store buffers. Okay, let's stop here for today and pick 390 00:30:41,342 --> 00:30:42,094 up next time.