So let's look at some techniques to deal with outcome prediction for branches, and we're going to look at this mostly, first of all, in the static domain. So, things that are not trying to watch the dynamic instruction sequence, but instead basically statically analyze the instruction sequence and, with some high probability, try to predict the outcome.

The first technique is actually not a prediction at all. We talked about this a few lectures ago, actually back in lecture two: adding delay slots. So, instead of having that time just be dead cycles while we're trying to resolve the branch, if we don't resolve the branch until here, we have, say, one delay slot, two delay slots, three delay slots, and three instructions to fill them. Or maybe you can redirect at this point, out of execute, and you have two branch delay slots, depending on how you wire that in and whether you can make the cycle time.

Let's say we have two branch delay slots. What ends up happening is you have an instruction, BEQZ, here; let's say it's at address 100. Then you have two instructions that follow it, which execute no matter whether the branch is taken or not taken.
So these instructions are in the delay slots of the branch. And, you know, there are architectures which have things like load delay slots and other things like that, but for right now we'll talk about branches. And we can force these instructions always to execute.

Unfortunately, if you go look at the probability that you can actually fill these delay slots, it's pretty low. It's not always easy to find work to shove down here, because you're basically taking work that was above the branch and, in the compiler, reordering the code to put instructions that were before the branch after the branch.

Hm. Okay. Well, this is good. I mean, if you can actually make this work out, this is great. One of the problems with this, though, is that the probability of filling one branch delay slot is, let's say, 70%. The probability of filling two branch delay slots is even lower than that; it's maybe 40 to 50%. As we said before, if you have a monkey pulling out of a hat, or just some random process pulling out of a hat, you're going to be 50% correct.

And as we'll see today, if you use some fancier branch prediction, or outcome prediction techniques, you can get the prediction accuracy of a branch all the way up into the sort of 95, 96, 97 percent range.
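To put rough numbers on that trade-off, here's a back-of-the-envelope sketch in Python; the fill probabilities are just the ballpark figures from the lecture, not measured data:

```python
# Back-of-the-envelope expected dead cycles for a two-delay-slot branch.
# Assumption: an unfilled slot costs one dead cycle, and the fill
# probabilities are the rough lecture figures, not measurements.

p_fill_first = 0.70    # chance the compiler fills the first slot
p_fill_second = 0.45   # chance the second slot is also filled (lower)

expected_filled = p_fill_first + p_fill_second
expected_dead = 2 - expected_filled

print(f"expected filled slots: {expected_filled:.2f}")   # 1.15
print(f"expected dead cycles:  {expected_dead:.2f}")     # 0.85
```

So even with optimistic fill rates, this design still eats most of a dead cycle per branch, which is the cost a good predictor would mostly avoid.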
And if you look at the out-of-order superscalar on your desktop, your Core i7 these days from Intel, that's probably somewhere in the neighborhood of 98 to 99 percent correct. So being able to fill these delay slots with some probability which is less than 98% is not a good trade-off. If you care about performance, you would have been better served by a fancier prediction mechanism than by trying to fill delay slots. Now, if you care about complexity, or reducing complexity and area, a static prediction might be a good way to go.

So let's look at some basic rules of thumb here about static prediction. The overall probability that a branch is taken, in something like SPECint, is 60 to 70%. But it's not equally distributed: backward branches have a much higher probability of being taken than forward branches.

Okay, so we've got a question coming up here: why is this? Yes, so loops. With high probability, or really by definition, to be a loop you have to jump backwards. If you only ever jump forwards, it's pretty hard to loop. So if you jump backwards, it's a loop. And in fact, programs like to execute loops, and stay in loops for a while, because that's where a lot of the work is done.
So if you're in a loop, and you're just spinning in it, that increases the probability that the backwards branch is taken, and that's why backward branches are taken with high probability. Forward branches, on the other hand, are taken only about 50% of the time.

Hm. What's going on there? Forward branches, 50 percent: these are usually some swarm of data-dependent branches, like an if-then-else clause. So that's why the probability here is much, much lower. You sit in loops for long periods of time; forward branches are typically if-then-elses, where you're checking some data value and then either executing a piece of code or not.

Now, this gets a little trickier with some more advanced compiler techniques. Because sometimes, with advanced compiler techniques, you'll effectively convert a loop with a condition check at the end into an unconditional backwards branch, plus a little branch that goes around the piece of code which checks whether the loop is completing or not. But even that's actually not a horrible thing, because at least that backwards branch will always be predicted taken, a hundred percent, because it's just turned into a jump. So that's not bad for performance. It just, you know, might change these percentages a little bit.
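The loop intuition is easy to quantify: a loop body that runs N times takes its back-edge N-1 times and falls through exactly once. A tiny illustration:

```python
# A loop that runs N iterations takes its backward branch N-1 times
# and falls through once, so the back-edge's taken rate approaches 1
# as the trip count grows.

def backedge_taken_rate(iterations):
    return (iterations - 1) / iterations

print(backedge_taken_rate(10))    # 0.9
print(backedge_taken_rate(100))   # 0.99
```

This is why "backward branch" is such a strong static signal: any loop with a non-trivial trip count pushes its back-edge well past the 60-70% overall taken rate.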
But things do like to sit in loops; that is what you should take away from this slide.

Okay, so let's think about a technique to try to take advantage of this. One technique that takes advantage of this is to add extra bits to your instruction set, and allow the compiler to hint to the architecture whether the branch is taken or not taken. Now, it's just a hint: if it gets it wrong, we still get correct execution. If the branch is taken, we still want to execute the right piece of code; the performance might just be worse.

So let's take a look at two branches here. We have branch.T and branch.NT. Branch.T is a branch which is predicted taken, and branch.NT is a branch that's predicted not taken. So what do I want to say about this? Well, architectures do have this. The Itanium architecture actually has completely static branch prediction. Some things have sort of intermediate versions of this; for instance, the Motorola 68K kind of has something like this, but you can't do it with all branches. So this exists in real architectures.

And one of the things is that this can actually be very accurate, especially if you profile your code. So if you run the code once and you allow the compiler to see which way the branch went, it can do quite well.
And some of the insight here is that there are a lot of branches in a program which are there just to check error cases. And the probability that they are taken or not taken, one way or the other, can basically be determined statically at compile time. And definitely, if you have feedback information from one execution of the program, that's a very good indicator.

So for instance, if you had a loop, the compiler would denote that the backwards branch is taken, so we put a br.T on it. And if it's, let's say, some piece of code which checks an error condition and jumps over the error case, and 99 percent of the time you never fall into the error case, it would predict that branch taken, so we jump over that little piece of code. So you can actually have very high prediction accuracy.

Now, we're not up into the 99 percent range here yet, or anything like that. This isn't great; we're still doing static things here. And, you know, at this point it may actually still make sense to have a delay slot, because we're still not far over our 70 percent, so it's a close trade-off here. You have to get the static prediction correct, and your delay slot might actually still be a better approach, depending on how this accuracy compares to how often you can fill the delay slot.
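As a rough sketch of how profile-guided hints work, here is a toy Python model; the branch addresses, outcomes, and the simple majority-vote rule are all invented for illustration:

```python
# Toy model of profile-guided static hints: run the program once,
# record each branch's outcomes, then hint .T if it was taken on a
# majority of executions. Addresses and outcomes here are made up.
from collections import defaultdict

def assign_hints(profile):
    """profile: list of (branch_pc, taken_bool) from a training run."""
    taken = defaultdict(int)
    total = defaultdict(int)
    for pc, t in profile:
        total[pc] += 1
        taken[pc] += t
    # Hint taken (.T) if the branch was taken more often than not.
    return {pc: ".T" if taken[pc] * 2 > total[pc] else ".NT" for pc in total}

def hint_accuracy(hints, trace):
    """Fraction of branches in a run that the static hints get right."""
    hits = sum((hints[pc] == ".T") == t for pc, t in trace)
    return hits / len(trace)

# Loop back-edge at pc=100 (almost always taken) and an error check at
# pc=200 (almost never taken): static hints do very well on both.
trace = [(100, True)] * 99 + [(100, False)] + [(200, False)] * 99 + [(200, True)]
hints = assign_hints(trace)
print(hints)                        # {100: '.T', 200: '.NT'}
print(hint_accuracy(hints, trace))  # 0.99
```

The 99% number here is an artifact of the made-up trace, but it captures the lecture's point: branches that are strongly biased one way, like loop back-edges and error checks, are exactly the ones a one-time profile pins down.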
But let's take a look at what happens in the pipeline here. So here we have a branch predicted taken, and it is taken, to this target. We still end up with a dead cycle here, because we do not know where that target is until we've effectively decoded the instruction. So in the intermediate time we've just fetched, let's say, the fall-through case or something like that. But because we predicted it taken, we can do better here. We don't have to have two delay slots, or two dead cycles; we can fill one. We can actually get the correct next instruction, the target, in here.

For a branch predicted not taken, let's say we speculatively execute the fall-through. We don't actually have any penalty for a correctly predicted branch here. So if we predict this branch not taken, that's the fall-through case, we can just fetch PC+4, PC+8, and keep executing. And we have no penalty there.

Okay, so what if we get the hint wrong? What are we going to do? Well, we have effectively taken a mispredict penalty here. And if you look at the branch-taken case and you take a mispredict penalty, what's going to happen is you're actually going to end up with two dead cycles, because we don't determine whether we mispredicted until the execute stage.
And that's the soonest we can go redirect the front of the pipe. Now, if you look at this and squint really hard, you can see that we actually fetched op A twice here. This was the fall-through instruction. It is hypothetically possible that you might be able to avoid killing this instruction, and in this case you sort of have the pipe do something special and actually end up keeping the instruction after op A, or op B here, if you will. But that's pretty hard to do. So, for the base case, we'll just say that you end up with an extra cycle of mispredict penalty when you mispredict.

Yeah, so just to reiterate what I was trying to say here: if a branch is predicted taken, we first fetch the subsequent instruction in the fall-through case, then we fetch the target of the branch. We try to kill the fall-through instruction, but, lo and behold, we end up fetching the exact same instruction again. So you could, if you really wanted to, try to optimize around that and not fetch this twice, but then you kind of have things out of order in the pipe. Because you'd have this instruction, and this instruction is dead, and you have to figure out how to kill sub-portions of your pipe, or handle things out of order in your pipe. That gets pretty hard to do.
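As a rough model of what this penalty costs, here's a small sketch that charges zero extra cycles for a correctly hinted branch and two dead cycles for a mispredict resolved in execute; the accuracies plugged in are illustrative, and the model ignores the single dead cycle a correctly predicted taken branch may still pay before decode:

```python
# Average stall cycles per branch under static hints, assuming a
# correct prediction costs no extra cycles and a mispredict costs two
# dead cycles (resolved in the execute stage). Accuracies below are
# illustrative, not measurements.

def avg_branch_stall(accuracy, mispredict_penalty=2):
    return (1 - accuracy) * mispredict_penalty

# A coin flip, a decent static hint, and a very good one.
for acc in (0.50, 0.85, 0.97):
    print(f"accuracy {acc:.0%}: {avg_branch_stall(acc):.2f} stall cycles per branch")
```

Under this model, going from 50% to 97% accuracy cuts the average branch stall from a full cycle down to a few hundredths of a cycle, which is why prediction accuracy dominates the trade-off against delay slots.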
But this static, software-based branch prediction really does quite well.

Okay, so let's look at hardware branch prediction. Now, when we say hardware branch prediction, that does not necessarily mean dynamic hardware branch prediction. This could just mean that the hardware does the prediction, without any hints from the software. And we have a couple of different heuristics, if you will, that we can try to implement in hardware.

The first one: always predict not taken. This is what we've done so far in all of the pipelines that we've designed. We predict the fall-through, effectively, and we just fetch PC+4. We didn't put a name on it, but this is actually what we were doing: speculative execution with a static hardware branch prediction of PC+4. It's pretty simple to implement; it's what you guys have done in your labs so far. You know the fall-through address early. But the accuracy is not very good: you get all your loops wrong. All your backward branches in your loops are always predicted wrong.

Okay, let's do the inverse here: predict taken. Let's have that be our static strategy in hardware. Well, that's kind of hard to do here, because we don't really know the target of the branch until decode. So it's like going back to here.
If we predict everything taken, what do we fetch in this cycle? Well, I don't know; it's a big question mark there. It also doesn't do super well on if-then-elses, because a lot of the time those are forward branches over things, or, well, it depends on how you structure them. Some architectures which have static prediction techniques like these will actually restructure their code: the compiler will figure out the probability of a branch being taken or not taken, and then rework the code around it to make it work out. That's actually pretty common; the Motorola 68K had something similar to that, and the compilers there actually try to work around it.

So, hmm, well, this is definitely a bummer here. Maybe we can fix that. Like we said, the second part of today's lecture is about trying to fix this problem.

Okay, how about we use a heuristic that all backward branches are predicted taken and all forward branches are predicted not taken? Okay, so this does a lot better. This is our heuristic from what we were saying before: loops and things like that will get caught by this. A forward branch, which is not part of a loop, we'll predict not taken; we'll predict the fall-through. A backward branch we'll predict taken. This does pretty well. It's better accuracy.
It's still nowhere near the sort of 80 percent accuracy that we had when the compiler got involved, because the compiler case we were talking about before could actually implement this entire algorithm in software, or something much more sophisticated, and that's usually what ends up happening.
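To compare the three static hardware strategies from this section, here is a small simulation over a made-up branch trace: one loop back-edge taken 90% of the time and one forward if-then-else branch taken 30% of the time. The trace mix is invented purely for illustration.

```python
# Compare three static hardware strategies on a synthetic branch trace.
# Each record is (pc, target, taken); "backward" means target < pc.

def predict(strategy, pc, target):
    if strategy == "not_taken":
        return False          # always predict fall-through
    if strategy == "taken":
        return True           # always predict taken
    # "btfn": backward taken, forward not taken
    return target < pc

def accuracy(strategy, trace):
    hits = sum(predict(strategy, pc, tgt) == taken for pc, tgt, taken in trace)
    return hits / len(trace)

# Loop back-edge at pc=100 (taken 90% of the time) plus a forward
# if-then-else branch at pc=200 (taken 30% of the time).
trace = ([(100, 40, True)] * 9 + [(100, 40, False)]) * 10 \
      + ([(200, 260, True)] * 3 + [(200, 260, False)] * 7) * 10

for s in ("not_taken", "taken", "btfn"):
    print(f"{s}: {accuracy(s, trace):.2f}")
```

On this trace, always-not-taken scores 0.40, always-taken 0.60, and backward-taken/forward-not-taken 0.80, which matches the ordering argued above: the heuristic gets the loop right without also mispredicting the biased forward branch.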