Okay. So, an important question that comes up with something like a superscalar, where you're executing multiple instructions at a time, is what happens when you fetch two instructions and one of them takes an interrupt or exception as it goes down the pipeline. Let's take an example. Say we have a load and then a system-call instruction. Both of these instructions can effectively take interrupts or exceptions: the load can take something like a TLB miss or an alignment fault, and the SYSCALL instruction, by definition, causes an interrupt to occur. We fetch these two at the same time and they start marching down the pipes. In our pipeline diagram, we fetch at the same time and we decode at the same time. The load has to go to the B-pipe in this design, so it ends up in B, and the SYSCALL ends up in the A-pipe.

Well, what does it mean if the load is in the B pipeline but takes an interrupt, and it has to commit in order, first? Hm, actually, let's think about an even simpler question. What if the load does not take any faults, but the SYSCALL does take a fault? Which happened first, A or B? What should happen first in program order? The load should happen first, and then the instruction after the load, because in program order we go from top to bottom. But the load is in our B-pipe, and in our A-pipe we have an instruction which takes an interrupt. So what happens here is that the load should go down the pipe and complete. In order not to deadlock, your control logic either has to know about this, or very late in the pipe you have to have what we're going to call a commit point, which we'll be talking about later in today's lecture, and there you have to make a rational decision about which of these actually occurred first in program order. You somehow have to track that going down the pipe. Then you have to make a decision: oh, well, the A-pipe just took an interrupt, but in program order the B-pipe holds the earlier instruction. So, at the end of the pipe, you're going to need a little bit of control logic to make sure you don't take the interrupt for the SYSCALL and kill the load instruction that comes before the SYSCALL. One thing you could do is have both of them go down to the end of the pipe, not kill the load but commit it, and have the SYSCALL take the interrupt then. That's probably the highest-performing thing you can do in this case; lower-performing approaches would probably be easier to build.

Okay. So, we've introduced this two-way superscalar. One thing we need to think about is that we've added a lot more places that data could be coming from if we forward data. When we had one pipeline, we could bypass out of here, here, and there, so only three places. Now that we have two pipelines, you effectively double the places you can bypass out of, and you've got six places.
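To make that commit-point decision concrete, here is a minimal sketch in Python of how end-of-pipe exception logic could prioritize faults by program order rather than by which physical pipe an instruction landed in. The class, field names, and printed messages are all illustrative assumptions for this lecture's example, not from any real design:

```python
# Minimal sketch of commit-point exception prioritization in a
# two-way superscalar. Names and fields are illustrative, not from
# any real machine. Each in-flight instruction carries its program
# order (its "age") down the pipe alongside any pending fault.

class Instr:
    def __init__(self, name, program_order, fault=None):
        self.name = name
        self.program_order = program_order  # 0 = older in program order
        self.fault = fault                  # e.g. "TLB miss", "SYSCALL trap"

def commit(pipe_a_instr, pipe_b_instr):
    """Commit point: examine both pipes' results in program order."""
    # Sort by age, NOT by which physical pipe the instruction used.
    for instr in sorted([pipe_a_instr, pipe_b_instr],
                        key=lambda i: i.program_order):
        if instr.fault is None:
            print(f"commit {instr.name}")        # architecturally done
        else:
            print(f"take fault '{instr.fault}' on {instr.name}")
            break  # everything younger is killed, not committed

# The lecture's example: the load (older) landed in the B-pipe,
# the SYSCALL (younger) in the A-pipe, and the SYSCALL faults.
load    = Instr("lw",      program_order=0)
syscall = Instr("syscall", program_order=1, fault="SYSCALL trap")
commit(pipe_a_instr=syscall, pipe_b_instr=load)
# -> commit lw
# -> take fault 'SYSCALL trap' on syscall
```

The point of the sketch is the sort key: the hardware equivalent is tracking age bits down the pipe so the commit logic can order A against B regardless of steering.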
So, if you pull the steering logic off and make your multiplexers bigger here, where you're doing the bypassing, you end up with six different locations that you have to choose between for each input operand. And this is a relatively short pipe. As you go to bigger and bigger pipelines, either in depth or in width, and you want full bypassing, you're going to have much wider multiplexers and a lot more data being bypassed. So this actually becomes a problem that you need to think about really hard.

What are some solutions to this? Well, one solution people sometimes use is to not have full bypassing: you can only bypass at certain locations. That's one option. Another option, which we'll talk about a little later today, is to not have this pipeline register at all, and if you start to think about out-of-order processors, you could think about committing information back to the register file early. This pipe here has nothing happening in this stage, so couldn't we just shove the value into the register file? On first appearance that sounds great, but when you think about it a little more you start to get worried, because write-after-write hazards start showing up as real problems. If you issue an instruction here which writes to the same register as, say, this load operation, you could actually get out-of-order writes to the register file. So you need to be cognizant of that.

Another approach people take is what's called a clustered superscalar. A clustered superscalar might have, let's say, four pipelines, clustered into two groups of two. You allow full bypassing within each pair of pipes, and if you bypass between the two clusters, it takes an extra cycle, or you have to go through the register file or something like that. So there are approaches out there to try to mitigate the blowup of this bypassing network. You have to remember, in something like a 64-bit processor, each of these is a 64-bit bus, each one of these little wires here, so these things get pretty big, pretty quick. You have to worry about actually running these things around the chip, because all of a sudden you have hundreds and hundreds of bits of bypass wires running up and over, just for this simple pipeline, and if we go wider or longer, it's going to be much worse.

So, one thing people do a lot to handle this bypassing, from a critical-path perspective, because it starts to take a long time, is to break up decode and issue. We're going to get away from our five-stage pipes now; up to this point we've been doing things you've seen in the first Patterson and Hennessy book, and now we're going to start thinking about things that have longer pipelines. One good thing to do is to break the decode and the register-file access into two separate stages in the pipe, effectively making a six-stage pipeline. And what do we put in each of them?
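To put rough numbers on how fast the bypass network grows, here is a back-of-the-envelope sketch. The stage counts, operand counts, and helper name are assumptions chosen to match the lecture's three-sources-per-pipe picture, not parameters of any particular machine:

```python
# Back-of-the-envelope sizing of a full bypass network.
# Assumptions (illustrative): each pipeline offers `fwd_stages`
# stages a result can be forwarded from, each instruction reads
# `src_operands` register operands, and values are `data_bits` wide.

def full_bypass(width, fwd_stages, src_operands=2, data_bits=64):
    sources = width * fwd_stages        # places data can come from
    mux_inputs = sources + 1            # +1 for the register file itself
    wires = sources * data_bits         # result-bus wiring to route
    consumers = width * src_operands    # operand muxes that need all of it
    return sources, mux_inputs, wires, consumers

# The lecture's case: 1 pipe with 3 forwarding points -> 3 sources;
# going 2-wide doubles that to 6 sources feeding every operand mux.
for w in (1, 2, 4):
    s, m, bits, c = full_bypass(width=w, fwd_stages=3)
    print(f"{w}-wide: {s} bypass sources, {m}-input muxes, "
          f"{bits} result wires feeding {c} operand muxes")

# Clustering a 4-wide machine into two 2-wide clusters keeps each
# operand mux at the 2-wide size; cross-cluster values pay an extra
# cycle (or go through the register file) instead.
```

The takeaway is that wiring and mux width grow with width times depth, which is exactly why clustering or partial bypassing becomes attractive.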
Well, one thing we can do is break the decode into its own pipe stage, and we can try to figure out structural hazards in that stage. That's where people traditionally do the decode, and they also look to see whether you're going to have a structural hazard, let's say, on the write port of the register file at the end of the pipe. Then, in the issue stage, I for issue, you do the register-file read, and you probably swizzle, or cross over, or steer the instructions and the operands to the correct locations. And, of course, you do the bypassing there, if you have lots of bypass operands coming back.

To give a brief pipeline example here, we execute two instructions per cycle, and we can see that our pipeline now has an extra I in it, which is just an extra front-end stage. Okay. So, this has some negative aspects. Can anyone think of a negative aspect of putting extra pipeline stages in the front of our pipeline? Yes: branches. If the branch gets resolved coming out of the first execute stage of the pipe, A0, we've just increased the branch cost by one. So a branch mispredict that would have cost, let's say, two cycles just became three cycles. This can start hurting your performance, and it really starts to hurt as you go wide.

So, let's take this instruction sequence here, where we have this extra issue stage in front and a branch as the first instruction. We try to execute, and then we have fall-through code here, which we predict as fall-through. We don't realize that the branch is taken until A0, at which point we can redirect and kill everything we've already done in flight. But look at all the things that have gone in flight already. By the time we're sitting here, we've had one, two, three other cycles to go fetch instructions. We fetched these, we decoded them, we spent a lot of power, a lot of time, a lot of fetch bandwidth doing this. And then we kill it all and re-vector to the correct branch target. So, in this example, we've killed seven instructions. That can have a pretty negative impact on our clocks per instruction, if you will.

So, let's talk briefly about how to fix this. We're not going to fix it all today; we have a whole lecture dedicated to fixing this. But what could we possibly do to minimize the probability of all these dead cycles, all these killed instructions going down our two-way pipe? Well, hopefully, if we're lucky, we can try to predict the destination with some accuracy, and have a branch predictor which figures out where the destination of the actual branch is with high probability. Then, instead of executing, let's say, Op A here, which is a dead instruction down the incorrect branch path, we can try to fetch and execute the correct branch path. And we're going to have a whole lecture on how to get your branch prediction accuracy up. In modern-day processors, it's somewhere around 98% accurate, give or take a little bit.
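To see why those seven killed instructions matter, here is a rough cost model in Python. The slot-counting formula is my reconstruction of the lecture's picture, and the branch frequency and accuracy numbers in the demo are illustrative assumptions, not measurements:

```python
# Rough model of how front-end depth and machine width turn branch
# mispredicts into lost issue slots and extra CPI. All the concrete
# rates below are assumed, illustrative numbers.

def killed_on_mispredict(width, dead_cycles):
    # Wrong-path instructions: `dead_cycles` full fetch groups,
    # plus the branch's own same-cycle group-mates.
    return width * dead_cycles + (width - 1)

# The lecture's picture: 2-wide, branch resolves at A0, three cycles
# after wrong-path fetch starts -> 7 killed instructions.
print(killed_on_mispredict(width=2, dead_cycles=3))   # -> 7

def effective_cpi(base_cpi, branch_freq, mispredict_rate, penalty):
    # Classic penalty model: extra cycles per instruction spent on
    # redirects after mispredicted branches.
    return base_cpi + branch_freq * mispredict_rate * penalty

# Assume 1 in 5 instructions is a branch and a 3-cycle redirect
# penalty on an ideal 2-wide machine (base CPI 0.5):
for acc in (0.85, 0.98):
    cpi = effective_cpi(base_cpi=0.5, branch_freq=0.2,
                        mispredict_rate=1 - acc, penalty=3)
    print(f"{acc:.0%} prediction accuracy -> CPI ~ {cpi:.3f}")
```

Under these assumed numbers, going from mid-'80s to 98% accuracy recovers most of the redirect penalty, which is the motivation for the branch-prediction lecture.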
I actually don't know what the state of the art is on this, because branch predictors keep getting better and more complex, but there are pretty simple things you can do to get you into the mid-'80s range of prediction accuracy. Getting from the mid-'80s into the '90s takes a lot of effort and time. We'll have a whole lecture later in the course dedicated just to branch prediction. But I want to motivate here that if you have a longer front end in your pipe, and we're going to look at some pipelines with even more front-end stages than just fetch, decode, issue before the branch gets resolved, every extra pipe stage you add in the front is going to impact your performance. Because even if you have high prediction accuracy, it's not going to be 100%, and when you mispredict, you're going to have dead instructions going down the pipe, wasting time, energy, and utilization of the pipeline.
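As a concrete taste of those "pretty simple things," here is a minimal sketch of one classic scheme: a table of two-bit saturating counters indexed by branch address. The table size, indexing, and demo numbers are assumptions for illustration; real predictors are considerably more elaborate:

```python
# Minimal two-bit saturating-counter branch predictor, the kind of
# simple scheme that historically lands in the rough mid-80% accuracy
# range on typical code. The table size here is an arbitrary choice.

TABLE_SIZE = 1024
counters = [1] * TABLE_SIZE  # 2-bit counters: 0,1 = not-taken; 2,3 = taken

def predict(pc):
    """Predict taken (True) or not-taken (False) for the branch at pc."""
    return counters[pc % TABLE_SIZE] >= 2

def update(pc, taken):
    """Train the counter toward the actual outcome, saturating at 0 and 3."""
    i = pc % TABLE_SIZE
    if taken:
        counters[i] = min(counters[i] + 1, 3)
    else:
        counters[i] = max(counters[i] - 1, 0)

# Tiny demo: a loop branch at pc=0x40, taken 9 times then falling
# through once. The 2-bit hysteresis means the single not-taken at
# loop exit doesn't immediately flip the prediction.
outcomes = [True] * 9 + [False]
correct = sum(predict(0x40) == taken or update(0x40, taken)
              for taken in outcomes if update(0x40, taken) is None) \
    if False else 0
correct = 0
for taken in outcomes:
    correct += (predict(0x40) == taken)
    update(0x40, taken)
print(f"{correct}/{len(outcomes)} correct")
```

The two-bit hysteresis, as opposed to a one-bit "last outcome" scheme, is what keeps one loop exit from causing two mispredictions the next time through the loop.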