Okay, so that was one way to figure out the outcome. Let's look at a slightly more dynamic way.

Okay, so we can try to build pieces of hardware which determine the outcome of a branch dynamically. We do this by adding extra little pieces of hardware somewhere in the front of the pipe, somewhere in our PC+4 calculation; it's going to be this multiplexer here in the PC+4 calculation. Effectively, we're going to use some pieces of hardware to feed into that, and these are our taken and not-taken prediction values. We'll talk about this in a second, but one of the questions here is: what feeds into this PC+4 hardware? Well, one input is going to be PC+4. Another should be your branch target. Unfortunately, as we said, we don't know the branch target up here, so we can't have it come around in one cycle. We need to hold off on that; that's the target address question. The first question is: how do we swing this multiplexer that selects our next PC?

Okay, so some solutions to this are dynamic, in hardware, and they try to increase the prediction accuracy by exploiting the nature of programs. What we're going to try to exploit first of all is temporal correlation of a branch: the correlation of a branch outcome with previous outcomes of that same branch. So if you've fallen into a loop and you have a branch that is taken, you can use that fact in the future to predict the outcome of that same branch. Here we're actually going to have a one-bit predictor, denoted by this finite state machine. Let's say we start in the not-taken state. If the outcome of a particular branch is not taken, we just stay here and continually predict not taken. If we then end up taking a branch, we transition over to the predict-taken state, and we sit there while branches are taken. Note that if a branch alternates between taken and not taken, we're going to predict the wrong outcome every single time. But we can implement this with one bit, one extra bit in our control logic.
It's not a lot of logic; it's literally one extra bit, and it just says which of the two states you're in.

Let's take a look at the performance of this. Does it help? Say we have a loop that iterates four times. We're going to graph the iteration number, the prediction coming out of this one-bit predictor, what actually happened, and whether we predicted right or wrong. In the first iteration, the predictor is in its start state, so it predicts not taken. The branch was actually taken, because we're going to go around this loop four times, so we mispredicted. Hm, that wasn't great. But the predictor then transitions from the predict-not-taken state to the predict-taken state, so the next few times around the loop we get it right. Now, this is a relatively short loop; we only run it four times. If you had a loop that executes 100 times, this might actually do okay. But then look at the end, at the fall-through case: we predict taken, but on the last iteration of the loop we actually fall through. That's not great. So if you look at the prediction accuracy here, it's 50 percent. That was not very good.

Let's execute the loop again. The reason I show a second execution of the exact same loop is to show why the start prediction was not taken: the second time we come back to this same loop, we left the predictor trained to not taken. So when we go to execute again, we come in predicting not taken and mispredict the first iteration, correctly predict the second and third, and mispredict the fall-through case. That's why you need to execute this loop twice to see the common case, which is the loop running back to back. Otherwise, if you were to change the start state, you might end up thinking this predictor does better than it actually does.

So there's not a whole lot of magic there.
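To make the mechanics concrete, here's a minimal software sketch of that one-bit predictor; the hardware is just one flip-flop of state, and the class name and the printed trace below are my own illustration, not from the slides:

    # One-bit predictor: remember the branch's last outcome,
    # predict that same outcome next time.
    class OneBitPredictor:
        def __init__(self):
            self.state = False          # False = predict not taken (start state)

        def predict(self):
            return self.state           # predict whatever the branch did last time

        def update(self, taken):
            self.state = taken          # one bit: remember the latest outcome

    # The 4-iteration loop from the slide, executed back to back:
    # taken, taken, taken, then the fall-through (not taken).
    p = OneBitPredictor()
    for i, actual in enumerate([True, True, True, False] * 2, 1):
        pred = p.predict()
        print(i, "predict", pred, "actual", actual, "ok" if pred == actual else "MISS")
        p.update(actual)

Running this prints the 50 percent steady-state accuracy from the slide: each pass over the loop mispredicts the first iteration and the fall-through.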
It's simple, and it does okay. Can we do better? Well, we're nowhere near 99 percent, so the answer is going to be yes.

Okay, so what happens if we use more bits to do prediction? Here, instead of adding one extra flip-flop to our processor pipeline, we add two bits: a two-bit saturating counter predictor. What do we mean by saturating? We start off, let's say, in the strong not-taken state. If the branch is taken, we move one state over; another taken outcome moves us over again; and another moves us to the end, where further taken outcomes leave us put. So we have a range, if you will: strong not taken, weak not taken, weak taken, strong taken, and we can move across that spectrum as we reinforce the taken-ness or not-taken-ness of a branch. Still not quite rocket science. Effectively we're introducing hysteresis into our prediction algorithm, and that can help.

Let's see how it helps by executing the same loop another time. We start off in strong not taken, which predicts not taken. The first time we execute the loop branch, it was actually taken, because loops like to be taken, so that's a mispredict. But we transition from strong not taken over to the weak not-taken state, so that mispredict was not in vain. The second iteration happens and we're still predicting not taken, because we're in the weak not-taken state. The branch was taken, so we mispredict again, but now we transition to the weak taken state. The third time, things get a little better: we predict taken, the branch actually is taken, and we predicted it correctly. Then finally, for the fall-through, we predict taken but the branch was actually not taken. (I'm sorry, the slide is incorrect here; this was actually a mispredict, so that entry should say yes.)

We can see that this is actually doing worse, or appears to be doing worse, than our one-bit predictor. Does life get better? Well, we've trained the predictor up a little bit at this point.
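Here's the same kind of sketch for the two-bit saturating counter; encoding the four states as the integers 0 through 3 is my own choice of representation:

    # Two-bit saturating counter: 0 = strong not taken, 1 = weak not taken,
    # 2 = weak taken, 3 = strong taken. Predict taken when the counter >= 2.
    class TwoBitPredictor:
        def __init__(self):
            self.counter = 0            # start in strong not taken

        def predict(self):
            return self.counter >= 2

        def update(self, taken):
            if taken:
                self.counter = min(3, self.counter + 1)   # saturate at strong taken
            else:
                self.counter = max(0, self.counter - 1)   # saturate at strong not taken

    # Same 4-iteration loop, run twice: the first pass trains the counter up
    # (miss, miss, hit, miss); the trained second pass hits 3 of 4.
    p = TwoBitPredictor()
    for i, actual in enumerate([True, True, True, False] * 2, 1):
        pred = p.predict()
        print(i, "predict", pred, "actual", actual, "ok" if pred == actual else "MISS")
        p.update(actual)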
So let's take a look now. Let's execute the loop another time, given that we left off in the strong taken state. Now we execute again, and we predict taken, taken, taken; it actually was taken, taken, taken, and we only mispredict the last one. What's nice about this is that when we go to execute again, because the mispredict at the end only moves us one state, we're actually going to start off in the weak taken state. (I'm sorry, this is another bug in the slide; I will fix it in what I upload. It should say weak taken, because that's where we left off at the end of this run. The prediction accuracy is the same either way; we're still going to predict taken.) So you can see that this predictor does not get tricked, if you will; it does not get untrained by the one not-taken case at the end. What's nice is that you can train it up, and the hysteresis will not get fooled by the one branch at the end of the loop falling through. That's positive.

You could even think of the fall-through case as a helpful signal. You could use it to trigger something in your finite state machine: if you see lots and lots of takens and then one not-taken on the exact same branch, that might mean something to you. People in fact have done things like that.

So even with two bits, you can think about having different state machines for the same job. Here we have one which jumps directly from weak to strong: when we're in the weak not-taken state and we see a taken outcome, we jump straight to strong taken, so we effectively train very fast. It's still hysteresis, but a faster-acting hysteresis. This looks pretty similar to what we had before, but now the weak not-taken state will jump straight to strong taken on a single taken outcome. But this is only one example. If you go look in the [inaudible] book, in the chapter that talks about this,
they give about three other examples that look similar to this: different heuristics that try to capture different behaviors that are common in programs. But none of these is going to be super great on its own, because they're only two bits.

You can then start to think about having a table of these predictors, so you have local predictions: we're not predicting globally, but instead predicting based on each individual branch. The way we do this is we take bits out of the program counter and index into a big table. In that table we store, let's say, two bits per entry, and we train each entry separately. Now, we're not going to want an entry in this table for every branch in the program, because that would be huge. Instead you take some number of bits, k bits in fact, maybe the low-order or middle-order bits of the program counter, and use that as an index, similar to how we use indexes in caches. You could also think about having multi-way associative versions of this; what's shown is a direct-mapped version, but a multi-way associative version would work too.

If you look at, say, a four-kilo-entry version of this (that's pretty big, actually), a 4K-entry branch history table with two bits per entry, now we're starting to get somewhere: we can get roughly 80 to 90 percent prediction accuracy. That's starting to get us there, and we only had to build eight kilobits worth of storage. We do have to somehow pipe it into our next-PC logic, which is a timing-sensitive path in our processor, but it can be very helpful for performance.
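As a rough sketch, here's what that direct-mapped, PC-indexed branch history table might look like in software; K = 12 matches the 4K-entry, 8-kilobit example, but the exact slicing of the program counter is an assumption of mine:

    # Direct-mapped branch history table (BHT): 2-bit saturating counters
    # indexed by K low-order bits of the (word-aligned) program counter.
    K = 12
    bht = [0] * (1 << K)                    # all counters start at strong not taken

    def bht_index(pc):
        return (pc >> 2) & ((1 << K) - 1)   # drop the byte offset, keep K bits

    def bht_predict(pc):
        return bht[bht_index(pc)] >= 2      # counter in {2, 3} means predict taken

    def bht_update(pc, taken):
        i = bht_index(pc)
        bht[i] = min(3, bht[i] + 1) if taken else max(0, bht[i] - 1)

Note the aliasing this implies: two branches whose PCs agree in those K bits share an entry and train against each other, just like a conflict in a direct-mapped cache.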
Before I move on, I want to make one point: there are other ways to organize this table. What we're really doing here is indexing into a table, running the entry through some sort of finite-state-machine decoder which tells us taken or not taken, and individually updating entries of the table for a particular branch. You could think about having other things index into this table: more complex hash functions, some other history information, previous states, or just some other bits up here. This is just the base case. We're going to show more complicated examples for spatially correlating predictors, but a lot of the ideas we'll talk about for spatial correlation can be applied back to these temporally correlating predictors. I just wanted to make that point: we're going to show much more sophisticated versions in our spatial correlation predictors, but you can apply the same ideas back to the temporal ones.

Okay, so let's look at spatial correlation. We talked about temporal correlation: in time, the direction one branch goes has a high probability of predicting whether that same branch will be taken or not taken in the future. That's what temporal prediction is. Spatial prediction, which this piece of code here is trying to get at, says that if you have one branch do something, a future branch right below it is going to be somehow correlated. A good example, and this is kind of the canonical example, is: if you check a value and it's less than five, it's also going to be less than seven. And you get code like this relatively often. So these branches are correlated, but they're not the same branch; they're different branches: this branch and this branch. We have some sort of spatial correlation here, and we want to try to exploit it somehow. Yale Patt and one of his students came up with this, not that long ago surprisingly, in the 90s.

The idea is that we're going to have an extra register, which we'll call a branch history register, that stores whether previous branches we've executed were taken or not taken. We then use that to index into some tables, or use it somehow to predict what's going on. It's actually going to be a shift register, where we shift in what each branch did. So it's not based on just the address.

Okay, so here's our picture of the branch history register.
The branch outcome, whether taken or not taken, goes in here. Every time we resolve a branch, we shift a new value in and let the oldest one fall off the end.

So what could we do with this? Well, we can use it to index a table. If we have a pattern history table, that table will be indexed by the outcomes of previous branches. Based on this pattern, we look in the table, and it outputs either some bits that go into our state machine, or maybe just one bit which says taken or not taken. In the general case it outputs n bits, maybe two bits, which feed into a state machine that updates itself, like what we had before.

What's cool about this is: let's say you have a loop which always runs four iterations, so the branch pattern is taken, taken, taken, not taken. We shift those outcomes into the register, and if we execute that whole loop multiple times, we can train up this pattern history table so it detects the pattern. If we've just seen taken, taken, taken, what do we think the next outcome is going to be? Not taken. And this predictor can detect that, and it can do it with only a one-bit state machine per entry, because when it sees taken, taken, taken, that pattern indexes into the table, and the output there is going to be not taken. So it can predict loops which have short runs. And as the branch history register gets longer, you can predict loops which have longer runs. Now, you're never going to make it 100 bits long, because a pattern history table with 2^100 entries is a very large structure. But say it's ten bits; that's very reasonable. You could then predict a loop that iterates ten times: it will get the fall-through case correct, it will get the start case correct, and you'll have 100 percent branch prediction accuracy in that loop. Wow, that's pretty cool. This is where we're starting to actually get some traction.
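Here's a small sketch of that idea: a branch history register indexing a pattern history table of 2-bit counters. A history length of 4 and the training loop below are my own illustrative choices; they're just enough to learn the taken, taken, taken, not-taken loop pattern:

    H = 4                           # history length: track the last 4 outcomes
    bhr = 0                         # branch history register, kept as an integer
    pht = [0] * (1 << H)            # pattern history table, one counter per pattern

    def predict():
        return pht[bhr] >= 2        # the recent outcome pattern selects a counter

    def update(taken):
        global bhr
        pht[bhr] = min(3, pht[bhr] + 1) if taken else max(0, pht[bhr] - 1)
        bhr = ((bhr << 1) | taken) & ((1 << H) - 1)   # shift the new outcome in

    # Train on the loop that always runs four times; once trained, the pattern
    # ending in three takens correctly predicts the not-taken fall-through.
    for _ in range(5):
        for actual in (1, 1, 1, 0):
            update(actual)

    hits = 0
    for actual in (1, 1, 1, 0):     # one more pass, now fully trained
        hits += predict() == bool(actual)
        update(actual)
    print(hits, "of 4 correct")     # prints 4 of 4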
And this is where branch prediction really starts taking off. This is where we start to get above 80 percent and on toward higher numbers.

If you want even better prediction, you can start to think about having a branch history register which indexes into multiple pattern history tables, and then choosing between those tables based on some bits of the program counter. Now why do we want to do this? Well, we want to know, based on our PC, which table to choose. This is trying to mix temporal and spatial correlation: if a certain branch gets executed, you want to use the prediction state for that particular branch. So some hash of bits of the program counter chooses between some number of these pattern history tables. This mixes our spatial correlation with our temporal correlation, and it's called two-level branch prediction, because we have multiple levels here: the history register, and then the pattern history tables selected by the program counter. And this can do quite well.

If you want a generalized two-level branch predictor, to make this do even better, you can add multiple branch history registers and have them be per branch, effectively. Now, you're never going to have one per branch; they're going to be somehow aliased. You have multiple shift registers, and only when a branch maps to a particular shift register do you actually shift that register. For relatively small numbers of m bits indexing into this branch history register table, and k bits choosing between different pattern history tables, you can get very high accuracy, something like 97 percent. We're still not at 99. Now, you might say: why do I care, 97 sounds pretty darn good? Not good enough, unfortunately. You actually need to go higher if you have a 22-stage pipe and want decent performance, because a 20- or 22-cycle mispredict penalty is really quite bad; it's pretty damaging when you work through the iron law math.

So before we go on, this is kind of one of the more complex slides we're going to see today.
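As a sketch of that generalized organization, here's one way it might look: 2^M aliased per-branch history registers and 2^K pattern history tables, both selected by PC bits. The sizes and the PC slicing are illustrative assumptions, not what's on the slide:

    M, H, K = 4, 6, 2

    bhrs = [0] * (1 << M)                              # per-branch (aliased) history registers
    phts = [[0] * (1 << H) for _ in range(1 << K)]     # 2-bit counters, one table per PC hash

    def select(pc):
        b = (pc >> 2) & ((1 << M) - 1)                 # which history register this branch uses
        t = (pc >> (2 + M)) & ((1 << K) - 1)           # which pattern history table
        return b, t

    def predict(pc):
        b, t = select(pc)
        return phts[t][bhrs[b]] >= 2

    def update(pc, taken):
        b, t = select(pc)
        c = phts[t][bhrs[b]]
        phts[t][bhrs[b]] = min(3, c + 1) if taken else max(0, c - 1)
        bhrs[b] = ((bhrs[b] << 1) | taken) & ((1 << H) - 1)   # shift only this branch's register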
Does anyone have any questions about this, about what's happening in the mechanics?

So let's talk about training this. As you're executing, only when a branch occurs do you shift its outcome into one of these registers, chosen by the PC of that branch. You're also transitioning the state machine which is embodied in the prediction, based on the outcome. So this is actually a pretty complex machine, and you probably don't want to be updating it, oh, we'll say, right in the fetch stage or right in the decode stage. Typically, how this works is you do the update later down the pipe, at the end of the pipe somewhere: you feed back the true outcome of all the branches, and that feeds back into the branch history register and the pattern history tables.

Let's go for the upside of this first. The upside is that you know the branch was actually executed and you know the actual outcome; you know it wasn't squashed by something else, and in an out-of-order pipeline, you know it actually committed. You don't really want to train your predictor on bogus data, so you don't want to train it on incorrectly fetched branches themselves. If, let's say, you have a branch in the shadow of a mispredicted branch, and it were to go and corrupt your prediction state, that's not good. So usually you wait until the end of the pipe. The downside of waiting until the end of the pipe is that there's more feedback and it takes longer to train up. It doesn't necessarily take longer in absolute time, but if you have a branch followed closely by a branch that wants to use that prediction state, the second one can't see the first one's outcome until it gets to the end of the pipe. So you'll be using old, stale data for a little while, about as long as your pipeline is deep. Not the end of the world, but something to be aware of.
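A minimal sketch of that timing split, predicting at fetch but only training at commit, might look like the following; the stage functions, the in-flight queue, and the toy per-PC counters are all my own illustration of the idea:

    from collections import deque

    counters = {}                       # pc -> 2-bit counter (stand-in for the real tables)
    inflight = deque()                  # branches fetched but not yet committed

    def fetch_stage(pc):
        pred = counters.get(pc, 0) >= 2     # read-only predictor lookup at fetch
        inflight.append((pc, pred))
        return pred

    def commit_stage(pc, actual):
        inflight.popleft()                  # this branch really executed
        c = counters.get(pc, 0)             # train only on the committed outcome
        counters[pc] = min(3, c + 1) if actual else max(0, c - 1)

    def squash():
        inflight.clear()                    # wrong-path branches are discarded untrained

The point of the split is visible in the code: a branch in the shadow of a mispredict passes through fetch_stage but never reaches commit_stage, so it never touches the predictor state.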
Okay, so now let's start talking about getting into the high 90s, higher than 97 percent; let's talk about getting to our 99-percent-accurate branch predictors. And this brings us to tournament predictors. It's a great name. It came up first in the Alpha 21264, and what they did is they actually had a predictor to predict which predictor to use.

One of the interesting things, going back to this two-level branch predictor, is that sometimes you want per-branch information, per-branch history, and sometimes you want global branch history. With per-branch history, you're basically training only on that branch's own previous outcomes; but in this example here, you're training on other branches. Going back here, this is a different branch than that branch, so you might actually want global history. And there are times, for instance in a loop, when you actually want local history. So tournament predictors end up using a predictor, we'll call it a choice predictor, that predicts which predictor to use. I show two examples here; one is choosing between a global predictor and a local predictor, which is kind of the canonical usage of this. Actually, in the 21264 they had something more complicated; anecdotally, I heard they have about ten predictors in there, and they're trying to choose between those ten different predictors. You train the choice predictor based on which predictor was correct. Basically, because you run loops and you run code a lot, you can train up the choice predictor to choose between the other predictors based on their accuracy. So if, for a particular branch, one of the predictors is always wrong, the choice predictor can say: just don't use that one, just X it out.

So this does quite well. You're starting to get into the high 90s; I don't actually know the full numbers here, but it's probably 98 to 99 percent accuracy for large individual predictors hooked up to this. The [inaudible] book that you have talks about some other things you might want to stick in here, but for right now we'll leave it at this: we have global predictors, local predictors, things that use a branch history register for both of these, and then something which predicts between the two.
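Here's a bare-bones sketch of the tournament idea: a global component, a local component, and a per-entry 2-bit choice counter nudged toward whichever component was right. This is a simplification of what the 21264 actually builds, and all sizes and names are my own illustrative choices:

    N = 1 << 10                          # entries in each table (illustrative)

    ghist = 0                            # global branch history register
    gpht = [0] * N                       # global component: counters indexed by history
    lbht = [0] * N                       # local component: counters indexed by PC
    choice = [0] * N                     # >= 2 means trust the global component

    def _sat(c, up):
        return min(3, c + 1) if up else max(0, c - 1)

    def predict(pc):
        i = (pc >> 2) & (N - 1)
        g = gpht[ghist] >= 2
        l = lbht[i] >= 2
        return g if choice[i] >= 2 else l

    def update(pc, taken):
        global ghist
        i = (pc >> 2) & (N - 1)
        g_ok = (gpht[ghist] >= 2) == taken
        l_ok = (lbht[i] >= 2) == taken
        if g_ok != l_ok:                          # move the choice only on disagreement
            choice[i] = _sat(choice[i], g_ok)     # toward the component that was right
        gpht[ghist] = _sat(gpht[ghist], taken)    # both components always train
        lbht[i] = _sat(lbht[i], taken)
        ghist = ((ghist << 1) | taken) & (N - 1)

The design choice worth noticing is that the choice counter only moves when exactly one component was right: if a component is always wrong for some branch, the choice predictor steadily learns to ignore it.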
So I'll just restate it one more time: you train up your choice predictor to determine which predictor gets used, and it has lots of training data. But this is all heuristics. If you go look in the book, there's even something that sounds like a joke but isn't; it's something someone has actually proposed. They call it the perceptron predictor: they hook a neural net into the predictor and the prediction strategy. So these things can get pretty sophisticated. I don't know if anyone has actually ever built a perceptron predictor for real, but at least it's been proposed.