Okay, so that was one way to figure out the outcome. Let's look at a more dynamic way. We can try to build pieces of hardware which determine the outcome of a branch dynamically, by adding extra little pieces of hardware somewhere in the front of the pipe. Somewhere in our PC+4 calculation there's a multiplexer, and effectively we're going to use some hardware to feed it our taken and not-taken prediction values. We'll talk about this in a second, but one of the questions here is: what feeds into this next-PC hardware? Well, one input is PC+4. Another should be your branch target. Unfortunately, as we said, we don't know the branch target up here, so we can't have that come around in one cycle; that's the target-address question, and we'll hold off on it. The first question is: how do we swing this multiplexer that chooses our next PC?

Okay, so some solutions to this are dynamic, in hardware, and they try to increase the prediction accuracy by exploiting the nature of programs. The first thing we're going to exploit is temporal correlation of a branch, that is, the correlation of a branch outcome with previous outcomes of that same branch. If you've fallen into a loop and you have a branch that is taken, you can use that fact in the future to predict the outcome of that same branch. Here we're going to have a one-bit predictor, denoted by this finite state machine. Let's say we start in the predict-not-taken state. As long as the outcome of this particular branch is not taken, we stay there and continually predict not-taken. If we then end up taking the branch, we transition over to the predict-taken state, and we sit there as long as the branch keeps being taken. If the branch alternates between taken and not taken, we may end up predicting wrong every single time. But we can implement this with one extra bit in our control logic. It's not a lot of logic; it's literally one bit that says which of the two states you're in.

Let's take a look at the performance of this. Does it help? Say we have a loop that iterates four times. We'll graph the iteration number, show the prediction coming out of this one-bit predictor, then look at what actually happened and see whether we predicted right or wrong. In the first iteration, the predictor starts in its not-taken start state, so we predict not-taken; the branch was actually taken, because we're going to go around this loop four times. So we mispredicted. Hm, that wasn't great. But the predictor transitions from the predict-not-taken state to the predict-taken state, so the next few times around the loop, we get it right. Now, this is a relatively short loop; we only run it four times. If you had a loop that executes 100 times, this might actually do okay.
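To make the state machine concrete, here's a minimal sketch in C; the function names and the software test driver are mine for illustration (real hardware is just one flip-flop feeding the next-PC mux). Running the lecture's four-iteration loop through it twice reproduces the accuracy we'll tally next.

```c
#include <stdbool.h>
#include <stdio.h>

/* One state bit: false = predict not-taken, true = predict taken. */
static bool state = false;

static bool predict_1bit(void) { return state; }

/* After the branch resolves, remember its most recent outcome. */
static void update_1bit(bool taken) { state = taken; }

int main(void) {
    /* The loop branch, run twice back to back: taken x3, then fall through. */
    bool outcomes[] = { true, true, true, false,
                        true, true, true, false };
    int correct = 0;
    for (int i = 0; i < 8; i++) {
        if (predict_1bit() == outcomes[i]) correct++;
        update_1bit(outcomes[i]);
    }
    printf("correct: %d of 8\n", correct);  /* 4 of 8, i.e. 50 percent */
    return 0;
}
```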
But then we look at the end of the loop, and, well, hm: the fall-through case. We predict taken, but we actually fall through. That's the last little bit of our loop, and that's not great. If you look at the prediction accuracy here, it's 50 percent. That was not very good. Let's execute the loop again, and I want to point something out here. The reason I show a second execution of the exact same loop is to show why the starting prediction was not-taken: the second time we come back to this loop, we left the predictor trained to not-taken. So when we go to execute again, we come in predicting not-taken, and we mispredict the first iteration, correctly predict the second, correctly predict the third, and mispredict the fall-through case. That's why you need to execute the loop twice to see the common case, which is the loop running back to back. Otherwise, if you were to change the start state, you might end up thinking this predictor does better than it actually does. So there's no great magic there. It's simple, and it does okay. Can we do better? Well, we're nowhere near 99 percent, so the answer is going to be yes.

Okay, so what happens if we use more bits for prediction? Here, instead of adding one extra flip-flop to our processor pipeline, we add two bits: a two-bit saturating predictor. What do we mean by saturating? We start off, say, in the strong not-taken state. If the branch is taken, we move one state over; another taken moves us another state; another taken moves us again, and then we stop, we've saturated. So we have a spectrum of states, strong not-taken, weak not-taken, weak taken, strong taken, and we move across it as we reinforce the taken-ness or not-taken-ness of a branch. Still not quite rocket science. Effectively, we're introducing hysteresis into our prediction algorithm, and that can help. Let's see how.

Let's execute the same loop again. We start off in strong not-taken, which predicts not-taken; the first time through, the branch was actually taken, because loops like to be taken. That's a mispredict, but we transition from strong not-taken to weak not-taken, so the mispredict was not in vain. The second iteration happens, we're still predicting not-taken because we're in the weak not-taken state, the branch is taken, and we mispredict again. But now we transition to the weak taken state. The third time, things get a little better: we predict taken, the branch actually is taken, and we get it right. Finally, we predict taken, but the branch was actually not taken. (Sorry, the slide is wrong here; that last entry should say yes, it's a mispredict.) So this actually appears to be doing worse than our one-bit predictor. Does life get better? Well, we have trained the predictor up a little at this point, so let's take a look.
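Here's the same style of sketch for the two-bit saturating counter; again the names and driver are mine, and the driver replays the same loop twice so you can watch the counter walk through the states the lecture traces.

```c
#include <stdbool.h>
#include <stdio.h>

/* 2-bit saturating counter: 0 = strong not-taken, 1 = weak not-taken,
 * 2 = weak taken, 3 = strong taken. Predict taken when counter >= 2. */
static int ctr = 0;

static bool predict_2bit(void) { return ctr >= 2; }

static void update_2bit(bool taken) {
    if (taken) { if (ctr < 3) ctr++; }   /* saturate at strong taken */
    else       { if (ctr > 0) ctr--; }   /* saturate at strong not-taken */
}

int main(void) {
    /* Two back-to-back executions of the four-iteration loop branch. */
    bool outcomes[] = { true, true, true, false,
                        true, true, true, false };
    for (int i = 0; i < 8; i++) {
        bool p = predict_2bit();
        printf("iter %d: predict %c, actual %c%s\n", i + 1,
               p ? 'T' : 'N', outcomes[i] ? 'T' : 'N',
               p == outcomes[i] ? "" : "  (mispredict)");
        update_2bit(outcomes[i]);
    }
    return 0;
}
```

Note that after the eighth iteration the counter sits at weak taken, which is exactly the state the next execution of the loop starts from.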
Let's execute the loop another time, given that we left off in this strong taken state. Now we execute again, and we predict taken, taken, taken; it actually was taken, taken, taken; and we only mispredict the last one. What's nice about this is that when we go to execute yet again, because we mispredicted at the end, we start off in the weak taken state. (Sorry, this is another bug in the slide; it should say weak taken, because that's where we left off at the end of this run. I'll fix it in the version I upload. The prediction accuracy is the same either way; we still predict taken.) So you can see that this predictor does not get tricked, does not get untrained, by the single not-taken at the end of the loop. You can train it up, and the hysteresis will not get fooled by one branch at the end of the loop falling through. That's positive.

You could even imagine the fall-through case being actively useful as a signal. If you see lots and lots of takens and then one not-taken on the exact same branch, that might mean something to you, and you could use it to trigger something in your finite state machine; people have in fact done things like that. So even with two bits, you can think about having different state machines for the same job. Here we have one which jumps directly from weak to strong: when we're in weak not-taken and we see a taken, we jump straight to strong taken. So it trains very fast. It's still hysteresis, just a faster-acting hysteresis. This looks pretty similar to what we had before, except that the weak not-taken state jumps straight to strong taken after a single taken outcome. And this is only one example; if you look in the book, in the chapter that talks about this, they give several other state machines that look similar but use different heuristics to try to capture different behaviors that are common in programs. But none of these is going to be super great on its own, because it's only two bits.

Then you can start to think about having a whole table of these predictors, so you have local predictions: not one global prediction, but a prediction per branch. The way we do this is we take bits out of the program counter and index into a big table; each entry stores, say, two bits, and we train each entry separately. Now, we don't want an entry in this table for every branch in the program, because the table would be huge. Instead you take some number of bits, K bits in fact, maybe the low-order or middle-order bits of the program counter, and use them as an index, similar to how we use indexes in caches. You could also think about multi-way set-associative versions of this; what's shown here is the direct-mapped version.
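A minimal sketch of that table, assuming a direct-mapped organization indexed by low-order PC bits; the size and the helper names are illustrative choices, not anything mandated by the slide.

```c
#include <stdbool.h>
#include <stdint.h>

/* Direct-mapped branch history table: 2^K entries of 2-bit saturating
 * counters, indexed like a cache by low-order bits of the branch PC. */
#define K 12                            /* 4K entries x 2 bits = 8 kilobits */
#define ENTRIES (1u << K)

static uint8_t bht[ENTRIES];            /* all start at 0 = strong not-taken */

static unsigned bht_index(uint32_t pc) {
    return (pc >> 2) & (ENTRIES - 1);   /* drop byte offset, keep low bits */
}

bool bht_predict(uint32_t pc) {
    return bht[bht_index(pc)] >= 2;
}

void bht_update(uint32_t pc, bool taken) {
    uint8_t *c = &bht[bht_index(pc)];
    if (taken) { if (*c < 3) (*c)++; }
    else       { if (*c > 0) (*c)--; }
}
```

Two different branches whose PCs share the same low-order bits will alias to the same counter, which is the price of keeping the table small.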
If you look at, say, a 4K-entry version of this, that's pretty big actually, but a 4K-entry branch history table with two bits per entry starts to get us somewhere: roughly 80 to 90 percent prediction accuracy. So that's starting to get us there, and all we had to build was eight kilobits worth of storage, plus somehow pipe it into our next-PC logic, which is kind of a sensitive path in our processor. But it can be quite helpful for performance.

Before I move on, I want to make one point: there are other ways to organize this table. What we're really doing here is indexing into a table, running the entry through some sort of finite-state-machine decoder which tells us taken or not-taken, and individually updating the entries of this table per branch. You could think about indexing this table with other things: more complex hash functions, other history information, previous states, or other bits. This is just the base case. We're going to show much more sophisticated versions when we get to spatially correlating predictors, but a lot of the ideas we'll talk about there can be applied back to these temporally correlating predictors too. I just wanted to make that point.

Okay, so let's look at spatial correlation. We've talked about temporal correlation: in time, the direction one branch goes has a high probability of predicting whether that same branch will be taken or not taken in the future. That's what temporal prediction is. Spatial prediction is trying to capture what this piece of code shows: if one branch does something, a nearby branch, say right below it, is somehow correlated with it. The canonical example: if you check a value and it's less than seven, it's also less than five. You see code like this relatively often. These branches are correlated, but they're not the same branch; they're different branches. So there's some sort of spatial correlation here, and we want to exploit it somehow. Yale Patt and one of his students came up with this, surprisingly not that long ago, in the 90s. The idea is to have an extra register, called a branch history register, which stores whether the previous branches we executed were taken or not taken, and then to use that to index into some tables, or use it somehow to predict what's going on. So it's not based on just the address. This is going to be a shift register into which we shift what each branch did. Here's the picture of our branch history register: the branch outcome, taken or not taken, goes in here; every time we execute a branch, we shift a new value in and let the oldest one fall off the end.
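As a sketch, the branch history register is just a few lines; the 10-bit width here is an arbitrary illustrative choice.

```c
#include <stdbool.h>
#include <stdint.h>

/* Branch history register: an N-bit shift register of recent branch
 * outcomes, 1 = taken, 0 = not taken. The oldest bit falls off the end. */
#define HIST_BITS 10
#define HIST_MASK ((1u << HIST_BITS) - 1)

static uint32_t bhr = 0;

/* Called once per executed branch with its resolved direction. */
void bhr_shift(bool taken) {
    bhr = ((bhr << 1) | (taken ? 1u : 0u)) & HIST_MASK;
}
```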
So what could we do with this? Well, we can use it to index a table. We'll have a pattern history table, which is indexed by the pattern of previous branch outcomes. Based on this pattern, we look in the table, and it outputs either some bits that go into our state machine, or maybe just one bit that says taken or not taken. In general it outputs n bits, maybe two bits, which feed a state machine that updates itself, like what we had before.

What's cool about this: say you have a loop which always runs four iterations, so its branch goes taken, taken, taken, not-taken. Those four outcomes get shifted into the register, and if we execute that whole loop multiple times, we can train up this pattern history table to recognize the pattern. If the history says taken, taken, taken, what do we think the next outcome will be? Not taken. And this predictor can detect that, actually with only one bit per entry, just a one-bit state machine: seeing taken, taken, taken indexes into the table, and the output of that entry is not-taken. So it can predict loops with short runs, and as the branch history register gets longer, you can predict loops with longer runs. Now, you're never going to make it 100 bits long, because a pattern history table with 2^100 entries is a very large number. But say it's ten bits; that's very reasonable. Then you can predict a loop that iterates ten times: it gets the fall-through case correct, it gets the start case correct, and you get 100 percent branch prediction accuracy in that loop. Wow, that's pretty cool. This is where we start to get some traction, where branch prediction really starts taking off; we get above 80 percent and start climbing toward higher numbers.

If we want even better prediction, we can have the branch history register index into multiple pattern history tables, and then choose between these tables based on some bits in the program counter. Why do we want to do this? Because based on the PC, we want to know which table to use for this branch; this is trying to mix temporal and spatial correlation. If a certain branch gets executed, you want to use the prediction for that particular branch, so some hash of bits in the program counter chooses between some number of these pattern history tables. This mixes our spatial correlation with our temporal correlation, and it's called two-level branch prediction, because we have multiple levels: a history register mixed with the program counter. This can do quite well.
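Here's a sketch of such a two-level scheme, with one shared global history register and a handful of PC-selected pattern history tables; all sizes and names are illustrative assumptions, not any particular machine's design.

```c
#include <stdbool.h>
#include <stdint.h>

/* Two-level sketch: a global branch history register indexes pattern
 * history tables of 2-bit counters; low PC bits pick which table, so
 * different branches can train different tables. */
#define HIST_BITS 10
#define HIST_MASK ((1u << HIST_BITS) - 1)
#define N_TABLES  4                     /* selected by two PC bits */

static uint32_t bhr = 0;                          /* shared history */
static uint8_t  pht[N_TABLES][1u << HIST_BITS];   /* 2-bit counters */

static unsigned table_of(uint32_t pc) { return (pc >> 2) & (N_TABLES - 1); }

bool twolevel_predict(uint32_t pc) {
    return pht[table_of(pc)][bhr] >= 2;
}

void twolevel_update(uint32_t pc, bool taken) {
    uint8_t *c = &pht[table_of(pc)][bhr];
    if (taken) { if (*c < 3) (*c)++; }
    else       { if (*c > 0) (*c)--; }
    /* shift the resolved outcome into the history register */
    bhr = ((bhr << 1) | (taken ? 1u : 0u)) & HIST_MASK;
}
```

With HIST_BITS = 10, each distinct history pattern of a ten-iteration loop gets its own counter, which is how the 100-percent-accuracy case above comes about.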
And if you want a generalized two-level branch predictor that does even better, you can start to add multiple branch history registers and have them be, effectively, per branch. Now, you're never going to have one per branch; branches are going to be somehow aliased into this structure. You have multiple shift registers, and only when a branch maps to a particular shift register do you actually shift that register. For relatively small numbers of bits, M indexing into this table of branch history registers and K indexing between the different pattern history tables, you can actually get very high accuracy, something like 97 percent. Now, we're still not at 99. You might say, why do I care, 97 sounds pretty darn good. Unfortunately, it's not good enough. You need to go higher if you have a 22-stage pipe and want decent performance, because a 20- or 22-cycle mispredict penalty is really quite damaging when you work through the iron-law math. Before we go on, this is one of the more complex slides we're going to see today; does anyone have any questions about the mechanics of what's happening here?

So let's talk about training this. As you're executing, whenever a branch occurs, you shift its outcome into one of these registers, based on the PC of that branch. You're also transitioning the state machine embodied in the prediction entry, based on the outcome. So this is actually a pretty complex machine, and you probably don't want to be updating it right in the fetch stage or the decode stage. Typically, the update happens later down the pipe, somewhere near the end: you feed the true outcomes of all the branches back into the branch history registers and the pattern history tables. Let's take the upside first: by waiting until the end of the pipe, you know the branch actually executed, you know the actual outcome, and you know it wasn't squashed by something else; in an out-of-order pipeline, you know it actually committed. You don't want to train your predictor on bogus data, on branches from the wrong path. If a branch in the shadow of a mispredicted branch were to go and corrupt your prediction state, that's not good. So usually you wait until the end of the pipe. The downside of waiting is that there's more feedback and it takes longer to train. It doesn't necessarily take longer in wall-clock time, but if you have a branch followed closely by another branch that wants to use its outcome, you can't actually use it until the first branch reaches the end of the pipe. So you'll be working with slightly stale data for a while, as long as your pipeline [inaudible]. Not the end of the world, but something to be aware of.
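As a sketch of that update discipline, reusing the two-level functions from the earlier sketch; the record type and the stage functions here are hypothetical, just to show where prediction and training sit in the pipe.

```c
#include <stdbool.h>
#include <stdint.h>

/* Deferred training: predict at fetch, update only at commit, so
 * wrong-path branches never corrupt the predictor state. */
typedef struct {
    uint32_t pc;
    bool     predicted;   /* guess made at fetch */
    bool     taken;       /* true outcome, resolved late in the pipe */
    bool     squashed;    /* was this branch on a mispredicted path? */
} branch_record;

extern bool twolevel_predict(uint32_t pc);   /* from the sketch above */
extern void twolevel_update(uint32_t pc, bool taken);

/* Fetch stage: read-only access to the predictor. */
bool fetch_stage(branch_record *b) {
    b->predicted = twolevel_predict(b->pc);
    return b->predicted;
}

/* Commit stage: only branches that really executed train the predictor. */
void commit_stage(const branch_record *b) {
    if (!b->squashed)
        twolevel_update(b->pc, b->taken);
}
```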
Okay, so now let's start talking about getting into the high 90s, higher than 97 percent; let's talk about getting to our 99-percent-accurate branch predictors. This brings us to tournament predictors. It's a great name. They came up first in the Alpha 21264, and what that machine did was have a predictor to predict which predictor to use.

One of the interesting things, going back to this two-level branch predictor, is that sometimes you want per-branch history, and sometimes you want global branch history. With per-branch history, you're basically training only on that one branch's past; but in this example here, you're training on other branches as well. Going back to the earlier code: this is a different branch than that branch, so you might actually want global history. But there are times, for instance in a loop, when you actually want local history. So tournament predictors end up using a predictor, we'll call it a choice predictor, that predicts which predictor to use. I show two examples here, choosing between a global predictor and a local predictor; that's the canonical usage. In the 21264 they actually had something more complicated; anecdotally, I've heard they had about ten predictors in there and were choosing among those ten. You train the choice predictor based on which predictor was correct. Because you run loops and you run code a lot, you can train the choice predictor to choose between the other predictors based on their accuracy. So if, for a particular branch, one of the predictors is always wrong, the choice predictor can say: just don't use that one, just X it out.

This does quite well. I don't actually know the exact numbers here, but it's probably 98 to 99 percent accuracy with large individual predictors hooked up to it. The book [inaudible] talks about some other things you might want to stick in here, but for right now we'll leave it at this: we have global predictors, local predictors, things that use a branch history register for both of those, and then something which predicts between the two. I'll restate it one more time: you train up your choice predictor to determine which predictor gets used, and it has lots of training data. But this is all heuristics. If you look in the book, there's even a joke (actually, it's not a joke; it's something someone has actually proposed) called the perceptron predictor: a neural net hooked into the predictor and the prediction strategy. So these things can get pretty sophisticated. I don't know if anyone has ever built a perceptron predictor for real, but at least it's been proposed.
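To close, here's a sketch of the choice-predictor mechanism: a small table of 2-bit counters, indexed by PC bits, arbitrating between a global and a local predictor. The component predictors are assumed to exist elsewhere, and none of this reflects the 21264's actual design; it's just the shape of the idea.

```c
#include <stdbool.h>
#include <stdint.h>

/* Choice table: per-PC 2-bit counters; >= 2 means "trust the global
 * predictor", < 2 means "trust the local predictor". */
#define CHOICE_BITS 12
#define CHOICE_MASK ((1u << CHOICE_BITS) - 1)

static uint8_t choice[1u << CHOICE_BITS];

extern bool global_predict(uint32_t pc);   /* e.g. history-indexed PHT */
extern bool local_predict(uint32_t pc);    /* e.g. per-PC counter table */

static unsigned cidx(uint32_t pc) { return (pc >> 2) & CHOICE_MASK; }

bool tournament_predict(uint32_t pc) {
    bool use_global = choice[cidx(pc)] >= 2;
    return use_global ? global_predict(pc) : local_predict(pc);
}

/* Train the choice counter only when the components disagree, nudging
 * it toward whichever predictor turned out to be right. */
void tournament_update(uint32_t pc, bool taken) {
    bool g = global_predict(pc), l = local_predict(pc);
    uint8_t *c = &choice[cidx(pc)];
    if (g != l) {
        if (g == taken) { if (*c < 3) (*c)++; }
        else            { if (*c > 0) (*c)--; }
    }
    /* the global and local predictors would also be trained here */
}
```

Training only on disagreement is what lets the choice predictor learn, per branch, to "X out" a component that is consistently wrong for that branch.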