And now we're going to move on to talking about basic pipelines. We'll start off by talking about how to build a pipeline, how to name things in a pipeline, how instructions flow down a pipeline, and why you want to build a pipeline at all. We'll begin with a very gentle introduction by talking about an idealized pipeline, not necessarily in a processor. Pipelines show up in many places in the world: car factories, toy factories. Some people will say that when you use the washing machines in a laundromat you're pipelining, because you first put your laundry in the washing machine and take it out, and then you can put it in the dryer while you put more laundry in the first washing machine. So we see pipelines in lots of different places in the world, but in this class we're going to care about using pipelines in microprocessors. So let's think about an idealized pipeline. I have a picture here of an idealized pipeline. What are the requisite conditions for an idealized pipeline? Well, first of all, all objects should go through all the stages.
So here we have four stages in this pipeline, and there are no squiggly lines where things come out of the pipeline, go around the pipeline, or exit early. That doesn't happen in this diagram. In an idealized pipeline, you want all of the objects to go through all the stages. You also don't want any resources shared between different stages; so you don't want some resource here that is shared between stage two and stage three. An example of this in a car assembly line would be one tool that two different stages have to use; in processor design, that would cause what is known as a structural hazard. Next, the propagation delay of all the different stages should be the same: the time taken for stage one should be the same as stage two, the same as stage three, and the same as stage four in an idealized pipeline. And finally, the scheduling of an operation, or a transaction, or an object going down the pipeline should not be affected by what's currently in the pipeline. These conditions actually hold true for most assembly lines in plants where people build cars or something like that. There's no dependence between the different parts. A car comes into it.
All cars are separate from each other, so it's not really a problem. Unfortunately, if you look at something like a microprocessor executing instructions, instructions actually depend on earlier instructions, and they depend on them in multiple different ways. They can either depend on the data values, or they can depend on the fact that you took a branch or a control flow instruction. So we have to look at these different hazards, and we have to think about how to either solve them or deal with them; we have to think about how to deal with non-ideal pipelines. So let's go one step beyond our microcoded processor design and look at an unpipelined datapath for the MIPS instruction set. If you have already taken a computer organization class, you've probably seen similar sorts of unpipelined datapaths. In an unpipelined datapath, we start from the program counter here, and we have some instruction memory where we go and fetch our instruction. This flows through: we fetch our registers from the register file, we do the actual computation, we might have to access data from the data memory if we do a load or a store, and then finally it comes back around.
We write the result here, and we increment or change the program counter. And this is all done in one cycle. It's a very long cycle, because you launch here, you go through all this work, you do all these things, the data gets put back here, and because we have an unpipelined processor, it all has to happen in one cycle. So this is not great from a cycle time perspective. It is good from the perspective of the number of cycles per instruction: it's always one, because you launch an instruction, you wait until it's done, then you launch the next instruction, and each instruction launched takes one cycle. So let's simplify this design a little bit so we can talk about what's going on here in a little more depth. We're going to simplify our unpipelined processor design, take out some of the extra stuff, and focus on the core path: from the program counter you access the instruction memory combinationally, you go through the register file and the combinational logic, you go through the data memory, and you come back around. That would be something like a worst-case load instruction, where the value gets written back here into the general-purpose register file.
So let's now move on to thinking about how we build the same datapath, but try to pipeline it. Well, the first thing we want to do is cut our design up, and we're going to put in registers. It just so happens that this is a convenient place to put registers for something like a MIPS datapath. One important thing you might recall from your digital logic design course is that when you go to pipeline a circuit, you need to make sure that you cut all the lines that are crossing the register boundary in the forward direction. So if we were to cut here, we need to put a register here, here, and here, because there are three sets of wires running from left to right. This feedback path here doesn't need a pipeline register, because it's flowing back and it's effectively running back into the register file here combinationally. And the register file can either be clocked or unclocked, depending on how you actually do this design. So we've pipelined our datapath now, and let's put some names on these things that we're going to be reusing throughout the class. We're going to call this the fetch phase, or the instruction fetch time. In this class we'll either denote this with an uppercase F, or we'll denote it IF, for instruction fetch.
The next thing we're going to do is decode the instruction, and we're also going to fetch the registers out of the register file. The decode is how we take the tightly packed bits in the instruction and blow them up into control wires. In the register fetch phase we're actually going to fetch from the register file; we'll sometimes denote this with an uppercase D, or sometimes with RF, for register fetch. Then this is the execute phase. We're actually doing real work here: we're using the ALU, doing some add, some multiply, maybe a subtract or a shift operation. We'll denote this with an uppercase X, or the letters EX, for execution. In the memory phase we're accessing the data memory; we'll denote this with M, or MEM. And then finally the write-back stage here we'll typically denote with WB or W, to denote when we actually go to write the data into the register file; we have a whole pipeline stage dedicated to that. Now, one important thing: I just picked five different places here to pipeline this design, but this is not accidental.
What we're trying to do is take what was one long time period and cut it up into smaller chunks. We're trying to reduce our clock period, and we can do this by dividing the execution up into multiple cycles. If we look at the timing, the cycle time is greater than or equal to the maximum of: the time to access the instruction memory, the time to access the register file, the time to go through the ALU, the time to access the data memory, and the time to write back. Those are the operations in the five stages here, so the cycle time is bounded below by the maximum of those times. In most pipelines it's probably equal to the time it takes to access the data memory; the data memory access in most processors is probably going to be your limiting factor, but people will try to pipeline this, and we'll talk later about how to pipeline that even more, to try to cut the memory access time in half. Okay, so that's the easy part; we've pipelined the datapath. Let's now move on to a little more challenging question of how to pipeline the control.
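Before turning to control, the clock-period bound just described can be sketched numerically. This is a minimal sketch; the per-stage delays are made up for illustration and are not from the lecture.

```python
def min_clock_period(stage_delays):
    """A pipelined clock period can be no shorter than the slowest stage."""
    return max(stage_delays)

# Hypothetical per-stage delays in nanoseconds (illustrative numbers only).
delays = {"IF": 0.8, "RF": 0.6, "EX": 0.9, "MEM": 1.0, "WB": 0.5}

# Unpipelined: every stage is in series, so the delays add up.
unpipelined_period = sum(delays.values())             # 3.8 ns
# Pipelined: the clock is set by the slowest stage -- here the data memory,
# matching the point that memory access is usually the limiting factor.
pipelined_period = min_clock_period(delays.values())  # 1.0 ns
```

With these made-up numbers, pipelining shrinks the clock from 3.8 ns to 1.0 ns, and the data memory stage is what sets the limit.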
So we're going to look at our control unit, and we're going to put down our hardwired control here. The hardwired control is going to take the instruction register and decode the instruction that comes in. Hm. Well, that's okay, but there are some problems here when you start to look at this. Actually, before we get to that, let's first look at how to use this processor in a multi-cycle yet pipelined fashion. What do I mean by that? Well, let's say we're going to use the same datapath, and use the exact same control unit we had in the unpipelined MIPS processor. In this design, the cycle time is now, let's say, close to one-fifth of the cycle time we had before we added the five pipeline stages, so our processor clocks faster. But we're not going to pipeline our control unit. What this means is that we're going to fetch the first instruction, and we're going to have to wait for that instruction to go all the way to the end, write the register file, and be done. So what we just built here is a multi-cycle yet pipelined processor. This multi-cycle yet pipelined processor has a clock period which is roughly one-fifth that of the unpipelined one, but each instruction now takes five cycles. So it's a wash.
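The "wash" can be checked with one line of arithmetic. The numbers here are illustrative: assume the original long cycle is 5 ns and pipelining splits it evenly into five 1 ns stages.

```python
# Unpipelined single-cycle machine: 1 cycle per instruction at a 5 ns clock.
time_unpipelined = 1 * 5.0

# Multi-cycle yet pipelined datapath with unpipelined control: the clock
# shrinks to one fifth (1 ns), but each instruction must drain all five
# stages before the next fetch, so every instruction takes 5 cycles.
time_multicycle = 5 * (5.0 / 5)

# Same time per instruction either way -- a wash.
assert time_unpipelined == time_multicycle
```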
It's probably a little bit worse because of some overhead from the pipeline registers that you need to insert, but it's going to be roughly a wash. So let's say we want to start building a pipelined processor that can actually execute multiple instructions at the same time, in different locations in the pipeline. In order to do this, we need to go about adding pipeline registers to our control. By adding pipeline registers to our control wires, we can have multiple instructions flowing down the pipeline, with these multiple instructions in different stages of the pipe. So we can have one instruction here and a different instruction here, and they have different control wires asserted on them, because we've pipelined the control wires. So now we're going to start talking about how we go about analyzing the performance of microprocessors. All performance basically boils down to one formula; this is just for performance, and there are similar sorts of formulas for power and area and other metrics you might care about when optimizing a processor, but in today's lecture we're going to focus on optimizing for performance.
Here comes up this idea called the iron law of processor performance. We talked about this a bunch in the first chapter of Computer Architecture: A Quantitative Approach, and it boils down to a very basic formula: the time to execute the program, meaning the time it takes to run the program you're trying to execute, is equal to how many instructions your program has, times the cycles it takes to execute one instruction, multiplied by the time per cycle. As you see, it's multiplicative, so if you can make any of these numbers smaller, the time for the program will go down, which is a good thing. And what you'll also note here is that they're all equally weighted, so improving any of them helps. One thing we should probably do is think about the dimensional analysis, or: what is the unit on the time for the program? Hm. Well, let's derive it from the three things on the right side of this equation. Time per cycle: what's the unit on that? It's going to be seconds per cycle, or some time unit such as nanoseconds per cycle; let's stick to seconds per cycle.
Cycles per instruction: okay, what's the unit on that? Well, the unit actually is cycles per instruction. So if we do the dimensional analysis, the cycles up here and the cycles down there are going to cancel. And then finally we have instructions per program, and the unit on that actually is instructions per program: the number of instructions your program executes. If we multiply this out, the unit we get for time per program is seconds per program, which is exactly the right unit. Okay, so let's think about where these different portions come from and how we can affect these three different quantities. Instructions per program: can that be affected? You might say, well, my program takes 100 instructions, and that's what it takes. Well, the world's not that simple. If you had a different instruction set architecture, you might have different numbers of instructions. If you have something like the MIPS instruction set, which is a reduced instruction set, you might actually have to execute more instructions for a program than if you were executing on something like an x86 processor, where you have very complex instructions, or rather where you have something like one instruction that can do a whole string copy.
On something like MIPS, that would take thousands of instructions. So the ISA really matters. The other thing that can matter here is the compiler. If you had a smarter compiler, the compiler might be able to reduce or eliminate code which is redundant or not important at all, or it can try to merge two instructions together. So a smarter compiler can actually affect your instructions per program. The next thing is cycles per instruction. Well, this is dependent on your microarchitecture, as we'll talk about a little more in a second. For cycles per instruction: if you have a pipelined processor, you might take five cycles to complete a full instruction, while on something like the microcoded processor we already talked about, you might take ten cycles to execute a given instruction. So you can see there are different numbers of cycles per instruction depending on the microarchitecture. It also depends a little bit on the instruction set architecture: if you have a more complex instruction, you might need more cycles to execute it. So, something like the string move instruction, or the rep movsb instruction we talked about in the last lecture.
On x86, that will actually turn into multiple microcoded operations inside the processor. And then, finally, time per cycle. This is dependent on your implementation technology, the transistor and device technology: if you have faster transistors, that number goes down. It also depends on your microarchitecture: if you pipeline deeper, you can decrease the time per cycle. Before we move off this slide, let's compare the three different architectures we've talked about so far, on both the clocks per instruction and the cycle time. And I want to define clocks per instruction here: clocks per instruction, or CPI, is how many clock cycles it takes to execute an instruction. The inverse of this, one over CPI, gives you another unit of measure which is very common in computer architecture, called instructions per cycle, or IPC. It's just the flip of CPI. Sometimes you'll hear us talk about IPC and sometimes about CPI, and it's really important to know which is which: you want a low CPI and a high IPC, because that means better performance. Okay, let's look at the microarchitectures here.
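Before comparing the machines, the iron law from a moment ago can be written down as a quick sketch; all the numbers here are hypothetical, chosen only to show the multiplicative structure.

```python
def time_per_program(instructions, cycles_per_instruction, seconds_per_cycle):
    """Iron law: (instructions/program) * (cycles/instruction) * (seconds/cycle).
    The instruction and cycle units cancel, leaving seconds per program."""
    return instructions * cycles_per_instruction * seconds_per_cycle

# Hypothetical program: 100 instructions, CPI of 1, 1 ns clock -> 100 ns total.
t = time_per_program(100, 1.0, 1e-9)

# The law is multiplicative and the factors are equally weighted:
# halving any one of them halves the total run time.
assert time_per_program(50, 1.0, 1e-9) == t / 2     # better compiler / ISA
assert time_per_program(100, 0.5, 1e-9) == t / 2    # better microarchitecture
assert time_per_program(100, 1.0, 0.5e-9) == t / 2  # better device technology
```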
So we had our microcoded architecture, and this took multiple clocks per instruction, but the cycle time is relatively short. Okay. We had our single-cycle unpipelined processor, where everything just flows through to the end and there are no pipeline stages in it. The clocks per instruction was one, but unfortunately the cycle time is long: we have to do all the work the microcoded machine was spreading over multiple cycles, but we need to do it in one cycle, so all those delays add together and we get a long cycle time. And then finally we have a fully pipelined processor, where the control is pipelined and the datapath is pipelined. Here, the clocks per instruction comes out to one, and the cycle time is short. And here's a question I want to ask. We talked briefly about the multi-cycle unpipelined-control processor: the processor where you have pipelined the datapath but you have not pipelined the control, and you launch an instruction that has to go down the pipe to the end before you can fetch the next instruction, so it takes multiple cycles. The question I have is: what should we put here for CPI, and what is the cycle time relative to the rest of these? I'll let you think about that.
And that's right: the clocks per instruction for this example is going to be more than one; in the five-stage pipeline it's going to be five for every instruction, but the cycle time is going to be short. So this is not great when you compare it against the fully pipelined processor, because the clocks per instruction is five versus one, and they have the same cycle time. This is going to be five times slower than the fully pipelined processor, on average. Okay, so let's graphically show some more clocks-per-instruction examples. If we look at the microcoded machine, we're going to execute three instructions here: instruction one, instruction two, instruction three, with time increasing as you go to the right. Let's say the first instruction takes seven cycles to execute, the next instruction takes five, and the third instruction takes ten. If you do the math, you end up with three instructions executing in 22 cycles, or a CPI of 7.33. If you look at the unpipelined processor, we have to flow all the data through to the end, and every instruction takes the same amount of time here, but the cycle time is long, as I have drawn. So you have three instructions, it takes three cycles, the CPI is one, but the cycle time is high.
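The microcoded numbers above (7, 5, and 10 cycles for three instructions) can be checked directly:

```python
def average_cpi(cycles_per_instruction):
    """Average CPI = total cycles / number of instructions executed."""
    return sum(cycles_per_instruction) / len(cycles_per_instruction)

# The lecture's microcoded machine: three instructions taking 7, 5, and 10 cycles.
cycles = [7, 5, 10]
total_cycles = sum(cycles)   # 22 cycles for 3 instructions
cpi = average_cpi(cycles)    # 22 / 3 = 7.333...
```

Rounding to two places gives the 7.33 quoted above.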
And in an ideal pipelined machine, we'll actually have multiple instructions executing at the same time. Time goes left to right on this graph. We'll be able to overlap the instructions, and in an ideal world we'll have a CPI of one, and our clock period will also be short. You can see this takes less total time than those other solutions, and this is why we start to think about pipelining; but we're using transistors to do this. Okay, so as we go along here, some of the questions that come up are: what are the assumptions we made about this pipeline, and why are we making these assumptions? If the assumptions were to change, we'd actually pipeline differently. One assumption is that you can put something like a cache in your processor and have very fast memory access, because main memory access is slow relative to the processor's speed; you probably want to put caches in your processor, and you might want to pipeline the memory. Another assumption is that you have fast arithmetic logic units to do your math, and that you have multi-port register files, which are maybe slower but you can still build them. If you take these technology assumptions all together, this design is reasonable for a five-stage MIPS pipeline.
But it changes whenever device technology changes, and you have to reevaluate this. In this class, and in previous classes, you've focused on five-stage pipelines, and we'll talk a fair amount about five-stage pipelines, but we'll also talk about much deeper pipelines and how to go about building those processors. To give you an example from commercial designs, and this is outdated: if you look at something like the Intel Pentium 4, I think there are about 44 stages in that pipeline. So they figured out how to cut up executing one instruction into 44 little pieces. Okay. So let's talk about pipeline diagrams. One of the important things we need now is to think about the nomenclature, and think about how to draw instructions executing down a pipeline. If you want to answer questions about where an instruction is or how an instruction executes, we need some way to draw it graphically, because if you don't draw it graphically, it's really hard to reason about multiple things going on at the same time. And this is what we're going to try to solve here: we're actually executing multiple instructions at the same time, there's parallelism going on, there are multiple instructions in flight, and you have lots of overlapping.
320 00:24:31,070 --> 00:24:39,001 So the first pipeline diagram we're going to use here will plot transactions, or 321 00:24:39,001 --> 00:24:44,075 instructions, versus time. So we're going to put time on the 322 00:24:44,075 --> 00:24:49,029 horizontal axis here. So we have time T0 to T7, and it's 323 00:24:49,029 --> 00:24:55,061 increasing. And now we're going to have 324 00:24:55,061 --> 00:25:01,045 an instruction flowing down the pipeline. And right now we'll look at an idealized 325 00:25:01,045 --> 00:25:05,070 pipeline with idealized instructions. We'll soon be looking at non-ideal 326 00:25:05,070 --> 00:25:09,070 pipelines with dependencies between the different instructions. 327 00:25:09,070 --> 00:25:14,071 And this slide is really important for this class, and if you don't understand this, go 328 00:25:14,071 --> 00:25:18,031 back and watch this again. Because understanding pipeline 329 00:25:18,031 --> 00:25:23,038 diagrams is gonna be key to this class, and being able to draw these yourself is going 330 00:25:23,038 --> 00:25:27,033 to be very, very important. So let's look at an instruction 331 00:25:27,033 --> 00:25:31,089 flowing down the pipeline here. So, we have instruction one: at time zero, 332 00:25:31,097 --> 00:25:35,006 it's in the instruction fetch stage, at time 333 00:25:35,006 --> 00:25:38,054 one, it's in the decode stage, at time two, it's in the execute stage, at time 334 00:25:38,054 --> 00:25:42,016 three, it's in the memory access stage, and at time four, it's in the write-back 335 00:25:42,016 --> 00:25:45,007 stage. Okay, that's basic. 336 00:25:45,007 --> 00:25:52,062 Now let's put another instruction on here. It goes through the same set of stages, but 337 00:25:52,062 --> 00:25:56,059 it's shifted by one, because it can't go and use this resource. 338 00:25:56,059 --> 00:26:02,000 It needs to be fetched in the next cycle.
So it's not gonna be fetched in cycle T0 339 00:26:02,000 --> 00:26:05,000 or time T0; it's gonna be fetched in the next cycle. 340 00:26:05,000 --> 00:26:10,008 And what you'll notice here is these resources, or these pipeline stages, don't 341 00:26:10,008 --> 00:26:13,047 have multiple instructions in them at the same time. 342 00:26:13,047 --> 00:26:17,064 They're restricted to only one instruction in each pipeline stage at a time. 343 00:26:18,042 --> 00:26:23,013 So we put more instructions on here. And this is what our pipeline diagrams are 344 00:26:23,013 --> 00:26:27,061 mostly going to look like in this class, because we're going to have transaction, or 345 00:26:27,079 --> 00:26:32,236 instruction, versus time pipeline diagrams. And we're going to plot these out, and 346 00:26:32,236 --> 00:26:35,092 you'll see that instructions sort of flow down the pipe. 347 00:26:36,050 --> 00:26:41,650 And what's really cool about this is, if you draw a vertical slice, or you 348 00:26:41,650 --> 00:26:46,212 draw one slice through here, you can freeze time and see what is in every 349 00:26:46,212 --> 00:26:50,016 stage of the pipe. So if we freeze time here, at T4, instruction 350 00:26:50,016 --> 00:26:55,028 five is in the instruction fetch stage, instruction four is in the decode stage, 351 00:26:55,028 --> 00:27:00,046 instruction three is in the execute stage, two is in the memory access stage, and 352 00:27:00,046 --> 00:27:03,096 then instruction one is writing back and finishing. 353 00:27:04,070 --> 00:27:09,053 And it's really important that when you're drawing these diagrams, only one 354 00:27:09,053 --> 00:27:14,025 instruction is in each stage at a time. If you ever draw something where, 355 00:27:14,025 --> 00:27:18,080 let's say, one instruction is in the write-back stage and another instruction is 356 00:27:18,080 --> 00:27:22,048 also in the write-back stage, that's a hazard, a structural hazard.
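The transaction-versus-time diagram described above can be generated with a short script. This is an illustrative sketch for an ideal five-stage pipeline (IF, ID, EX, MEM, WB); the stage names and the `stage_at` helper are assumptions for the example, not part of any real toolchain.

```python
# Sketch of a transaction-versus-time pipeline diagram: each row is an
# instruction, each column a cycle; instruction i enters fetch at cycle i.

STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def pipeline_diagram(n_instructions):
    rows = []
    for i in range(n_instructions):
        # Pad with blanks until this instruction's fetch cycle, then the stages.
        row = ["  . "] * i + [f"{s:>4}" for s in STAGES]
        rows.append(row)
    return rows

def stage_at(rows, instr, cycle):
    # Which stage instruction `instr` occupies at time `cycle` (or "" if none).
    row = rows[instr]
    return row[cycle].strip() if cycle < len(row) else ""

rows = pipeline_diagram(5)
for i, row in enumerate(rows, 1):
    print(f"I{i}:", "".join(row))

# Freeze time at T4: every stage holds exactly one instruction, so there is
# no structural hazard in this ideal diagram.
print([stage_at(rows, i, 4) for i in range(5)])
```

Slicing the table vertically at T4 gives `['WB', 'MEM', 'EX', 'ID', 'IF']` for instructions one through five, matching the freeze-frame walked through in the transcript.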
357 00:27:22,048 --> 00:27:26,097 You made a mistake somewhere in your diagram, or you had to stall the processor, 358 00:27:26,097 --> 00:27:31,010 or something else has to happen. So make sure that when you slice your 359 00:27:31,010 --> 00:27:35,065 time vertically here, or look at one time segment, you don't have multiple 360 00:27:35,065 --> 00:27:38,027 things in that time segment at the same time. 361 00:27:39,027 --> 00:27:43,048 We can also draw a different pipeline diagram. 362 00:27:43,048 --> 00:27:50,080 So this pipeline diagram is instead going to plot space versus time, or resource 363 00:27:50,080 --> 00:27:54,649 versus time. And this is going to be less common in 364 00:27:54,649 --> 00:27:58,041 this class, but it's still useful to think about. 365 00:27:58,041 --> 00:28:01,092 So, what we have here is we have time on the horizontal axis. 366 00:28:01,092 --> 00:28:04,042 And we have resource on the vertical axis. 367 00:28:04,042 --> 00:28:07,053 So we have here the fetch, decode, execute, memory access, 368 00:28:07,053 --> 00:28:11,084 and then write-back stages of the pipe. And we can see that we have different 369 00:28:11,084 --> 00:28:16,014 instructions flowing down the pipe. So instruction fetch has instruction one, 370 00:28:16,014 --> 00:28:20,078 then instruction two, and then instruction three, instruction four, instruction five 371 00:28:20,078 --> 00:28:23,082 flowing down it, and this is sometimes useful. 372 00:28:24,002 --> 00:28:29,002 Likewise, you can cut vertically here, and see at a certain time all of the 373 00:28:29,002 --> 00:28:33,007 instructions in your pipeline. So it's a useful diagram. 374 00:28:33,007 --> 00:28:36,085 But it's a little less useful than the previous one. 375 00:28:36,085 --> 00:28:41,078 And when I refer to things as pipeline diagrams, I will be focusing on 376 00:28:41,078 --> 00:28:50,034 transaction versus time pipeline diagrams. Okay.
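The space-versus-time view described above is just the transpose of the transaction-versus-time table: one row per pipeline stage, one column per cycle. A minimal sketch, again with assumed stage names and an ideal five-instruction stream:

```python
# Sketch of a resource-versus-time pipeline diagram: rows are stages,
# columns are cycles, and instruction i+1 reaches stage k at cycle i+k.

STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def resource_diagram(n_instructions):
    n_cycles = n_instructions + len(STAGES) - 1
    table = {s: ["  . "] * n_cycles for s in STAGES}
    for i in range(n_instructions):
        for k, s in enumerate(STAGES):
            table[s][i + k] = f"  I{i + 1}"
    return table

table = resource_diagram(5)
for stage, cells in table.items():
    print(f"{stage:>3}:", "".join(cells))
# The IF row reads I1, I2, I3, I4, I5: instructions flow down each resource,
# and a vertical cut again shows everything in the pipe at one instant.
```

Each row shows one resource handling a stream of instructions, which is why this view is handy for spotting a resource that two stages would have to share.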
377 00:28:50,034 --> 00:28:56,088 So now we're going to move off idealized pipelines, and talk about non-ideal 378 00:28:56,088 --> 00:29:00,092 pipelines. Or why pipelines are maybe not ideal. 379 00:29:03,081 --> 00:29:07,923 Then we're going to introduce the first piece of nomenclature here, called 380 00:29:07,923 --> 00:29:13,011 a hazard. So, a hazard is 381 00:29:13,057 --> 00:29:18,046 anything where you have multiple portions of your pipeline interacting in a way that 382 00:29:18,046 --> 00:29:22,013 will cause a problem. So there are three main types of 383 00:29:22,013 --> 00:29:24,089 hazards. We have structural hazards, data hazards, 384 00:29:24,089 --> 00:29:28,045 and control hazards. A structural hazard is when an instruction 385 00:29:28,045 --> 00:29:33,005 in the pipeline needs a resource which is used by a different stage in the 386 00:29:33,005 --> 00:29:35,058 pipeline, or some place else in the pipeline. 387 00:29:35,058 --> 00:29:40,000 A data hazard is when you have two instructions, and the two instructions are 388 00:29:40,000 --> 00:29:44,002 dependent on each other somehow. So one instruction generates some 389 00:29:44,002 --> 00:29:47,013 data, and the other instruction needs to use that data. 390 00:29:47,013 --> 00:29:52,020 Well, that's called a data hazard. You need to make sure you get the data 391 00:29:52,020 --> 00:29:57,087 which is computed by the earlier instruction and then somehow move it to the 392 00:29:57,087 --> 00:30:02,009 input of a later instruction. And that's kind of key to actually 393 00:30:02,009 --> 00:30:07,043 executing code on a processor, being able to do that, and we call this a data 394 00:30:07,043 --> 00:30:10,060 hazard. And then finally there are control hazards.
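The data hazard just described (one instruction produces a value a later instruction reads) can be sketched as a dependence check. This is a toy example: the tuple encoding of instructions and the register names are made up for illustration, not a real ISA.

```python
# Minimal sketch of detecting a read-after-write (RAW) data hazard between
# an earlier (producer) and a later (consumer) instruction. Each instruction
# is a made-up tuple: (destination register, tuple of source registers).

def raw_hazard(producer, consumer):
    # Hazard if the later instruction reads a register the earlier one writes.
    dest, _ = producer
    _, sources = consumer
    return dest in sources

add = ("r1", ("r2", "r3"))   # add r1, r2, r3  -- writes r1
sub = ("r4", ("r1", "r5"))   # sub r4, r1, r5  -- reads r1
xor = ("r6", ("r7", "r8"))   # xor r6, r7, r8  -- independent of both

print(raw_hazard(add, sub))  # True: sub needs the r1 that add produces
print(raw_hazard(add, xor))  # False: no shared register, no data hazard
```

A real pipeline would use a check like this per stage to decide when to stall or forward; here it only shows what "dependent on each other" means for a data hazard.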
395 00:30:10,060 --> 00:30:15,054 Control hazards are another type of dependency, but it's a dependency which 396 00:30:15,054 --> 00:30:19,019 comes from having to change the control flow of the program. 397 00:30:19,019 --> 00:30:23,048 So an example of this is that something like a branch or jump instruction can cause a 398 00:30:23,048 --> 00:30:26,055 control hazard. So it's a dependency based on some control 399 00:30:26,055 --> 00:30:30,094 decision made by an earlier instruction. So if you have an earlier instruction that 400 00:30:30,094 --> 00:30:35,018 branches, the later instructions have to be the ones at the target of the branch, 401 00:30:35,018 --> 00:30:38,725 we'll say, and you can't just execute the ones which are the fall-through for the