Let's look at a little more complicated case. We talked about control hazards due to jumps; now let's put the control hazard one cycle later. A good example of that is something like a conditional branch.

So here we have a piece of code: an add, then a branch. And it's branching to 200 bytes into the future. If you take the branch (branches on MIPS are relative to the subsequent instruction, so PC plus 4 plus the offset), we'll end up at 304.

And let's walk through this case here. We have I1, which is our first add. Then we get the branch sitting at the decode stage, and we have some decode logic which is saying, is this a branch?

Okay. The next question, though, is how do we compute whether this branch is being taken or not. Can we do this in the decode stage? Unfortunately we need to do a comparison here, and the hardware we want to use to do a comparison is in the ALU. It's a subtract operation, or a comparison operation, so this fits well within our arithmetic logic unit, and we have the zero wire coming out.

So what this is going to do is, instead of having one cycle of latency, or one cycle of kills being inserted, we might have to insert two cycles. Because now we have to wait for the branch to get to here, and if we're predicting PC plus 4, we're speculating PC plus 4, we'll actually have fetched more data.

So let's walk through this. The branch moves forward, one cycle forward here, and what you will see is that we're actually fetching 108. So we're fetching the next, next instruction. But these two are both potentially dead; we want to kill both of these. But we don't know whether to kill them until we get this zero wire here.

Okay, so at that point we can redirect the front of the pipe and we can insert a no-op. That's going to change the logic that connects to here. We are also going to have to insert a no-op here, because we fetched 104 at that time into this point of the pipe.

So if a branch is taken, we need to kill the next two instructions, and the instruction in the decode stage is effectively invalid, so we need to swing this mux here. So we're going to have to add an extra control line onto this multiplexer.
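Just to pin down the arithmetic and the kill behavior we just walked through, here is a tiny software sketch of the example. The numbers 100, 200, 104, 108, and 304 come straight from the slide; nothing here is the actual hardware, it is purely an illustration.

# Taken branch at PC 100 with a 200-byte offset, MIPS-style:
# the target is relative to the subsequent instruction (PC + 4).
branch_pc = 100
offset = 200
target = branch_pc + 4 + offset          # 104 + 200 = 304

# By the time the branch resolves in execute, we have speculatively
# fetched the next two sequential instructions.
speculative_fetches = [branch_pc + 4, branch_pc + 8]   # 104 and 108

branch_taken = True                      # comes from the ALU zero wire
if branch_taken:
    next_pc = target                     # redirect the front of the pipe
    killed = speculative_fetches         # both get turned into no-ops
else:
    next_pc = branch_pc + 12             # keep fetching sequentially
    killed = []

print(next_pc, killed)                   # 304 [104, 108] when taken

So two kills instead of one, because the branch resolves a stage later than the jump did.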
And now, stalling and killing makes a much bigger difference, so the stall signal has to effectively be killed, or nullified, at this point. And why is this important? Well, let's say we're stalling the front of the pipe and the stall were to take priority. That would stop this register from moving and stop this register from moving as these red lines come in. So if the stall were to take priority, you could have an instruction here which is, let's say, dependent on some stall condition, dependent on some data hazard or some other hazard, and you have a branch which is branching to someplace else.

So let's take this instruction here, 104, which is an add; it's depending on something and is stalling the front of the pipe. At the same time, you want to kill, because you took the branch; you want to kill that instruction. If you were to put the stall at the higher priority, you'd actually end up in a deadlock. Because you'd be stalling, but you would have no way of killing the previous instructions to clear the stall. So this is why it's really important that the instruction at the decode stage is now invalid.

So unlike the jump case, where you might be able to do it either way with the priority of the stall and the kill wires, here the kill has to take precedence, or the redirect has to take precedence. And that's going to redirect the front of the pipe and basically clear out everything here. If you were to stall it, you would not be clearing it out. So instead, you need to kill, sort of turn these into no-ops, and redirect the front of the pipe to go fetch the actual target of your branch, of your conditional branch.

Yeah, sorry, this is just a drawing now: this branch information has to go into this multiplexer and go into this multiplexer.

The other thing that's important for branches is that we need to figure out the address that we're actually branching to. We'll talk about this more when we get to branch prediction later in the course, but what you're going to have to do is take the PC+4 and add it to the branch offset, because on something like MIPS there's PC-relative branching. And the other thing is, we can't necessarily reuse this ALU to do that.
You could potentially reuse this ALU if you were microcoding your design, but unfortunately we need this ALU to do the branch comparison. So we need to add another adder here to come up with the target of our branch. Now, some people get smart about the way they build this, and you can actually see some designs where people will try to reuse the main processor pipeline ALU for the branch offset calculation and maybe try to do the comparison elsewhere, because if you have simple enough branches you can do the comparison with less hardware than a full adder. That's another option; let's hold off talking about that until a little bit later.

Okay, so now we get to talk about the stall signal in detail. We start off, in blue here at the top, with the stall signal we had from before. You need to check the data dependencies, and check for data hazards, between the source operand registers in the decode stage (these are RS and RT) and the other stages in the pipe: the execute stage, the memory stage, and the write-back stage. You check both that you're reading the registers and, via the write enables, that the respective instructions in the different pipeline stages are actually writing to those locations.

But now we add another term, a branch term. We need to know whether there's a branch going on here, because we need to un-stall the processor if the branch is taken. So we're going to AND in a NOT of this. Effectively, this term is: it's a branch-equal-zero and the value is zero. Likewise, this one is: it's a branch-not-equal-zero and the value is not zero. So these two terms are basically saying a branch is happening, it's in the execute stage, and the branch is being taken. If it's a branch-equal-zero and the result is zero, it's taken, or if it's a branch-not-equal-zero and the result is not zero, it's taken.

And why don't we stall if the branch is taken? This is the same question we had on the previous slide. You have a branch that's happening while we're stalling the front of the pipe. The instruction at the decode stage has to be un-stalled to get it out and clear it out. The instruction at the decode stage is now invalid, so we don't want to pay attention to any of its stall information. It's just a red herring; it's a piece of information you shouldn't be paying attention to.
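Here is a rough software sketch of that stall equation with the new branch term bolted on. The signal and field names are made up just for this illustration; the shape of the logic is the point: the data-hazard check gets ANDed with NOT(branch taken in execute), so a taken branch nullifies the stall.

from collections import namedtuple

# Hypothetical views of what sits in each downstream pipeline stage.
Stage = namedtuple("Stage", ["write_enable", "dest_reg"])
BranchInfo = namedtuple("BranchInfo", ["is_beqz", "is_bnez", "alu_zero"])

def stall_signal(rs, rt, ex, mem, wb, br):
    # Data hazard: a decode-stage source register matches the destination
    # of an instruction further down the pipe that will write the register file.
    def hazard(stage):
        return stage.write_enable and stage.dest_reg in (rs, rt)
    data_hazard = hazard(ex) or hazard(mem) or hazard(wb)

    # Branch term: a branch in execute that is actually being taken.
    branch_taken = (br.is_beqz and br.alu_zero) or (br.is_bnez and not br.alu_zero)

    # If the branch is taken, the instruction in decode is invalid, so its
    # stall request must be nullified (otherwise the pipe could deadlock).
    return data_hazard and not branch_taken

ex  = Stage(write_enable=True,  dest_reg=5)
mem = Stage(write_enable=False, dest_reg=0)
wb  = Stage(write_enable=True,  dest_reg=9)
taken_beqz = BranchInfo(is_beqz=True, is_bnez=False, alu_zero=True)
print(stall_signal(rs=5, rt=2, ex=ex, mem=mem, wb=wb, br=taken_beqz))   # False: the kill wins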
So you want the instruction at the decode stage to be turned into an invalid instruction.

Okay, so now we get to talk about the control equations for the program counter and the instruction register multiplexers. Let's start off by talking about PC source, and let's remember what that is. That's way up here; it's the input to this multiplexer, and it's going to select, not necessarily where we're branching to, but the program counter of the next instruction we're going to fetch on the next cycle. It's on the next cycle because this is a flip-flop.

There are a few control equations here, and I wanted to point out two basic things in them: the older instruction, the thing that's farther down the pipe, is going to get precedence over younger instructions, both for redirecting the PC and for the control signals going down the pipe. So if we look at this PC mux select, it actually has things coming out of different stages of the pipe. We have something which is redirecting for a jump out of the decode stage, and then we have these signals here which are coming out of the execute stage, which is strictly farther down the pipe. And this needs to take precedence over that. So if you have a branch which is directly followed by a jump, the jump can be seen trying to swing the control on the PC select mux, the next-program-counter mux. You can't let that happen; you have to let the branch that's actually executing take precedence over that. That's what I really want to get across here. And you'll see that, similarly, the branch takes precedence over the jump for the other control signals.

Let's look at this from a branch pipeline perspective. We'll draw the pipeline diagram for a branch executing in our example. What you notice real fast here is that if you actually execute the branch and take the branch, you're going to be adding two extra instructions here, in this case 104 and 108, and they get killed as they go down the pipe. And this is going to impact our CPI in a negative way. What's happening is that the branch, which is in the execute stage, is going to reach back and kill these instructions, turning them into NOPs.
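Going back to that mux precedence for a second, here's a minimal sketch of it, again with made-up signal names. The only point is that the redirect from the execute-stage branch wins over the redirect from the decode-stage jump, which in turn wins over the default PC+4 speculation.

def pc_source(branch_taken_in_ex, jump_in_decode):
    # Older instructions (farther down the pipe) take precedence.
    if branch_taken_in_ex:
        return "branch_target"   # redirect coming out of execute
    if jump_in_decode:
        return "jump_target"     # redirect coming out of decode
    return "pc_plus_4"           # default speculation

# A branch directly followed by a jump: the jump in decode is trying to
# swing the mux, but the taken branch in execute must win.
print(pc_source(branch_taken_in_ex=True, jump_in_decode=True))   # branch_target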
We can plot that pipeline diagram in the other dimension as well, but you see the idea. So now we need to think hard: is this a good thing? Well, we speculated that the next instruction was going to be PC+4, or rather that the address of the next instruction was going to be PC+4. So we took something where the CPI was, let's say, two, a machine CPI of two, and we cut it down a little bit in the common case. But for jumps the CPI still goes back up to two, and here the CPI is three. Now, in our no-speculation case, if we had a branch we would probably have to stall for three cycles anyway to know the destination of that branch, so this didn't really hurt us in that case; the CPI in that case was also three for a branch instruction.

Hm. Okay, well, this still isn't good. I don't want every single branch I take to have a CPI of three. So let's talk about some ways to mitigate this cost. How do we reduce the branch penalty?

Well, one way to do it is that we can actually add a comparator one stage earlier and have the branch resolve on the same timeline as something like the jump. So how do we do that? Well, here we have the register file; this is in the decode stage of the pipe, and this is the fetch stage. What do we really need to do for a branch? We need to know that it's a branch, so we do some decode. And we need to know whether it's a branch-equal-zero or a branch-not-equal-zero, for instance, if those are the only two branches in our architecture. And we need to know that, let's say, register one is zero or not zero.

So some people came up with this smart idea that we can add a zero detector onto the register file output. By adding the zero detector onto the register file output, we'll know which way the branch goes. One of the interesting things is that this zero detector is actually a little bit easier than doing a full comparison against zero. Because to do a zero detect, you can basically build an optimized OR tree and then take the negation of it; sometimes people do this with wired-ORs or more analog-style circuits. But this is effectively a big OR gate, where you OR in all 32 bits if you have a 32-bit processor, and then take the NOT of it, or the bit itself. This is easy for zero detect, but it can get harder for different types of branches.
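Just to pin down what that zero detect is doing logically, here's a little software sketch of it; the real thing is an optimized gate tree or a wired-OR, not a loop, but the function is the same.

def is_zero(value, width=32):
    # OR together all the bits of the register, then negate the result.
    any_bit_set = 0
    for i in range(width):
        any_bit_set |= (value >> i) & 1
    return not any_bit_set

print(is_zero(0))     # True:  a BEQZ on this register would be taken
print(is_zero(200))   # False: it would fall through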
But if you have, say, a branch-equal, or BEQ, instruction, you can't use a zero detector. What you have to do instead is actually do a full comparison between two of the source operands, and that usually takes more time.

So the nice thing about this is that it'll take all your branches from a CPI of three, when they mispredict, or when you misspeculate, down to two, the same timeline that you had with a jump, so you end up with the exact same pipeline that we had for jumps. The downside is that this is going to elongate your cycle time. You have this wire coming out, and the wire needs to go into the logic which computes PC source, and that's going to become a critical path. It's a critical path because you have to go through the decode (that's usually going on in parallel), you have to access the register file, then go through the zero detect, then go into the control logic which computes this, and then come around and latch, or flip-flop, the information. So you can really negatively impact your clock period.

And as we talked about with the iron law of processor performance, you can either get performance by lowering your clocks per instruction, or you can get it by increasing your clock frequency, or in other words lowering your clock period. So these two things trade off. You can elongate your cycle time, which will negatively impact your performance, or time per program, in our iron law of processor performance. But on the flip side, you will have branches resolve faster, so your cycles per instruction, or average cost per instruction, especially for branches, will go down.

So this is only one approach. Another approach is you can expose it to software. We can change the instruction set architecture. What is really cool here is that the instruction set architect can actually have an impact on what's going on here. This is not all microarchitecture issues, this is not all fancy branch predictors, which is what we'll be talking about later in this course; instead, you can expose this problem, this challenge of the time it takes to resolve a branch, to the software. So what we're going to do, well, I'll define a branch delay slot.
And a branch delay slot is an architectural change (when I say architectural I mean big-A architectural, an instruction set architecture change) which allows, or forces, some number of instructions after a branch to always execute. It could be, let's say, one, two, three, or four, some architecturally defined number. And the instruction in that delay slot is executed regardless of the direction of the branch.

What people try to do with this is take work that was above the branch and move it below the branch. Because if you know the work is going to be done anyway, you can just put it below the branch and everything's okay. But the problem with that is that the branch has to not be dependent on the work you're moving below it. So typically, for very short loops where there's not much work inside the loop, this becomes very problematic, because there's not enough work that the branch is not dependent on to move below the branch and have the loop run effectively. And what we're trying to do here is let the compiler have a hand in what's going on, or really the program writer, because that's who's doing the reordering of the instructions from above here to below here.

So let's take a look at an example where we have a branch and we have one delay slot. Some architectures have two or more, but right now we'll assume one delay slot. Let's also assume that we have this branch-penalty reduction optimization, so branches only have one dead cycle when we execute them. So what we can do is say, okay, the branch is at address 100, and this add here at address 104, in the delay slot, executes regardless of whether the branch gets taken or falls through. Then we don't actually have to kill the next instruction after the branch. We don't have to have all that fancy extra hardware there; we just have this always execute.

The tricky thing is that many times, as I've said, it's hard to fill this branch delay slot. It'd be great if you could have many, many branch delay slots. If you go look at something like the Pentium 4, I think they have a twenty-odd-cycle mispredict penalty; they could have added twenty delay slots to their architecture. That would have been really, really cool.
But at the same time, the compiler would have to go find twenty instructions for every branch to stick after the branch. And there's usually just not that much work to put after the branch which the branch is not dependent on. So we have a really hard time filling these branch delay slots. Roughly, if you go look at something like MIPS, across sort of old SPEC INT, to give you some example of how often the compiler can actually fill branch delay slots: I think the rough rule of thumb is that if you have one branch delay slot, you can usually fill it maybe about 70 percent of the time on something like SPEC INT. (SPEC INT is a common benchmark suite that's used throughout the computer architecture industry, or the computing industry, to measure performance.) And if you have two branch delay slots, that second branch delay slot only gets filled less than about half the time.

So, since you fill these slots relatively infrequently, you might be better served by coming up with other prediction mechanisms. We're going to talk later, not in this class but in a different lecture, about branch prediction, which is a technique to cover branch latency by making a branch decision early, or predicting whether the branch is taken or not taken. And this can be used to dramatically reduce the branch penalty.

So let's draw the pipeline diagram here of a branch delay slot; this architecture has one branch delay slot. We have the add flowing down the pipe, we have a branch flowing down the pipe, and now we're going to have this next add, and architecturally we've defined that it always executes, whether the branch is taken or not taken. So this pipeline diagram sequence here is the same whether the branch is taken or not taken. In this case we're going to say that the branch is taken, and we go execute this next add here. But you see that there are no bubbles; there are no NOPs being inserted. So we have actually improved the performance here.

This is assuming, though, that this add is doing useful work. That's really important to say, because the compiler or the program writer has to put something here. And if they can't put anything useful, they're going to have to put a NOP.
And that's not useful work; doing a no-operation, you might as well just have the hardware insert the NOPs for you and save some instruction encoding space. So there's a tough trade-off here between how often you fill the branch delay slot versus having a branch delay slot at all. But for right now we'll say branch delay slots are one good technique to put useful instructions after a branch and not have branches stall out.
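As a back-of-the-envelope illustration of that trade-off, using the rough fill rate mentioned above (these are just rule-of-thumb numbers, not measurements from this course):

# With one delay slot and a one-cycle branch penalty, a filled slot turns the
# dead cycle into useful work; an unfilled slot is a NOP and the cycle is wasted.
fill_rate = 0.70                 # rough rule of thumb for the first delay slot
branch_penalty_cycles = 1        # dead cycles per branch in the early-resolve design

wasted = (1 - fill_rate) * branch_penalty_cycles
print(wasted)                    # about 0.3 wasted cycles per branch, on average

So most of the one-cycle penalty gets hidden when the compiler can fill the slot, which is roughly why a single delay slot is attractive here, and why twenty of them would be so hard to make pay off.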