Okay. So, an important question that comes up with something like a superscalar, where you're executing multiple instructions at a time, is what happens when you fetch two instructions and one of them takes an interrupt or exception as it goes down the pipeline. Let's take an example. Say we have a load and then a system-call instruction. Both of these instructions can effectively take interrupts or exceptions: the load can take something like a TLB miss or an alignment fault, and the SYSCALL instruction, by definition, causes an interrupt to occur. We fetch these two at the same time and they start marching down the pipes. In our pipeline diagram, we fetch at the same time and we decode at the same time. The load has to go to the B-pipe in this design, so it ends up in B, and the SYSCALL ends up in the A-pipe.

Well, what does it mean if the load is in the B pipeline but takes an interrupt, and it has to commit in order, first? Hm, actually, let's think about an even simpler question. What if the load does not take any faults, but the SYSCALL does take a fault? Which happened first, A or B? What should happen first in program order? The load should happen first, and then the instruction after the load, because in program order we go from top to bottom. But the load is in our B-pipe, and in our A-pipe we have an instruction which takes an interrupt. So what happens here is that the load should go down the pipe and complete. In order not to deadlock, your control logic either has to know about this, or very late in the pipe you have to have what we're going to call a commit point, which we'll be talking about later in today's lecture, and there you have to make a rational decision about which of these actually occurred first in program order. You somehow have to track that going down the pipe. Then you have to make a decision: oh, well, the A-pipe just took an interrupt, but in program order the B-pipe holds the earlier instruction. So, at the end of the pipe, you're going to need a little bit of control logic to make sure you don't take the interrupt for the SYSCALL and kill the load instruction that comes before the SYSCALL. One thing you could do is have both of them go down to the end of the pipe, not kill the load but commit it, and have the SYSCALL take the interrupt then. That's probably the highest-performing thing you can do in this case; lower-performing approaches would probably be easier to build.

Okay. So, we've introduced this two-way superscalar. One thing we need to think about is that we've added a lot more places that data could be coming from if we forward data. When we had one pipeline, we could bypass out of here, here, and there, so only three places. Now that we have two pipelines, you effectively double the places you can bypass out of, and you've got six places.
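To make that commit-point decision concrete, here is a minimal sketch in Python of how end-of-pipe exception logic could prioritize faults by program order rather than by which physical pipe an instruction landed in. The class, field names, and printed messages are all illustrative assumptions for this lecture's example, not from any real design:

```python
# Minimal sketch of commit-point exception prioritization in a
# two-way superscalar. Names and fields are illustrative, not from
# any real machine. Each in-flight instruction carries its program
# order (its "age") down the pipe alongside any pending fault.

class Instr:
    def __init__(self, name, program_order, fault=None):
        self.name = name
        self.program_order = program_order  # 0 = older in program order
        self.fault = fault                  # e.g. "TLB miss", "SYSCALL trap"

def commit(pipe_a_instr, pipe_b_instr):
    """Commit point: examine both pipes' results in program order."""
    # Sort by age, NOT by which physical pipe the instruction used.
    for instr in sorted([pipe_a_instr, pipe_b_instr],
                        key=lambda i: i.program_order):
        if instr.fault is None:
            print(f"commit {instr.name}")        # architecturally done
        else:
            print(f"take fault '{instr.fault}' on {instr.name}")
            break  # everything younger is killed, not committed

# The lecture's example: the load (older) landed in the B-pipe,
# the SYSCALL (younger) in the A-pipe, and the SYSCALL faults.
load    = Instr("lw",      program_order=0)
syscall = Instr("syscall", program_order=1, fault="SYSCALL trap")
commit(pipe_a_instr=syscall, pipe_b_instr=load)
# -> commit lw
# -> take fault 'SYSCALL trap' on syscall
```

The point of the sketch is the sort key: the hardware equivalent is tracking age bits down the pipe so the commit logic can order A against B regardless of steering.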
So, if you pull the steering logic off and make your multiplexers bigger here, where you're doing the bypassing, you end up with six different locations that you have to choose between for each input operand. And this is a relatively short pipe. As you go to bigger and bigger pipelines, either in depth or in width, and you want full bypassing, you're going to have much wider multiplexers and a lot more data being bypassed. So this actually becomes a problem that you need to think about really hard.

What are some solutions to this? Well, one solution people sometimes use is to not have full bypassing: you can only bypass at certain locations. That's one option. Another option, which we'll talk about a little later today, is to not have this pipeline register at all, and if you start to think about out-of-order processors, you could think about committing information back to the register file early. This pipe here has nothing happening in this stage, so couldn't we just shove the value into the register file? On first appearance that sounds great, but when you think about it a little more you start to get worried, because write-after-write hazards start showing up as real problems. If you issue an instruction here which writes to the same register as, say, this load operation, you could actually get out-of-order writes to the register file. So you need to be cognizant of that.

Another approach people take is what's called a clustered superscalar. A clustered superscalar might have, let's say, four pipelines, clustered into two groups of two. You allow full bypassing within each pair of pipes, and if you bypass between the two clusters, it takes an extra cycle, or you have to go through the register file or something like that. So there are approaches out there to try to mitigate the blowup of this bypassing network. You have to remember, in something like a 64-bit processor, each of these is a 64-bit bus, each one of these little wires here, so these things get pretty big, pretty quick. You have to worry about actually running these things around the chip, because all of a sudden you have hundreds and hundreds of bits of bypass wires running up and over, just for this simple pipeline, and if we go wider or longer, it's going to be much worse.

So, one thing people do a lot to handle this bypassing, from a critical-path perspective, because it starts to take a long time, is to break up decode and issue. We're going to get away from our five-stage pipes now; up to this point we've been doing things you've seen in the first Patterson and Hennessy book, and now we're going to start thinking about things that have longer pipelines. One good thing to do is to break the decode and the register-file access into two separate stages in the pipe, effectively making a six-stage pipeline. And what do we put in each of them?
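To put rough numbers on how fast the bypass network grows, here is a back-of-the-envelope sketch. The stage counts, operand counts, and helper name are assumptions chosen to match the lecture's three-sources-per-pipe picture, not parameters of any particular machine:

```python
# Back-of-the-envelope sizing of a full bypass network.
# Assumptions (illustrative): each pipeline offers `fwd_stages`
# stages a result can be forwarded from, each instruction reads
# `src_operands` register operands, and values are `data_bits` wide.

def full_bypass(width, fwd_stages, src_operands=2, data_bits=64):
    sources = width * fwd_stages        # places data can come from
    mux_inputs = sources + 1            # +1 for the register file itself
    wires = sources * data_bits         # result-bus wiring to route
    consumers = width * src_operands    # operand muxes that need all of it
    return sources, mux_inputs, wires, consumers

# The lecture's case: 1 pipe with 3 forwarding points -> 3 sources;
# going 2-wide doubles that to 6 sources feeding every operand mux.
for w in (1, 2, 4):
    s, m, bits, c = full_bypass(width=w, fwd_stages=3)
    print(f"{w}-wide: {s} bypass sources, {m}-input muxes, "
          f"{bits} result wires feeding {c} operand muxes")

# Clustering a 4-wide machine into two 2-wide clusters keeps each
# operand mux at the 2-wide size; cross-cluster values pay an extra
# cycle (or go through the register file) instead.
```

The takeaway is that wiring and mux width grow with width times depth, which is exactly why clustering or partial bypassing becomes attractive.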
Well, one thing we can do is break the decode into its own pipe stage, and we can try to figure out structural hazards in that stage. That's where people traditionally do the decode, and they also look to see whether you're going to have a structural hazard, let's say, on the write port of the register file at the end of the pipe. Then, in the issue stage, I for issue, you do the register-file read, and you probably swizzle, or cross over, or steer the instructions and the operands to the correct locations. And, of course, you do the bypassing there, if you have lots of bypass operands coming back.

To give a brief pipeline example here, we execute two instructions per cycle, and we can see that our pipeline now has an extra I in it, which is just an extra front-end stage. Okay. So, this has some negative aspects. Can anyone think of a negative aspect of putting extra pipeline stages in the front of our pipeline? Yes: branches. If the branch gets resolved coming out of the first execute stage of the pipe, A0, we've just increased the branch cost by one. So a branch mispredict that would have cost, let's say, two cycles just became three cycles. This can start hurting your performance, and it really starts to hurt as you go wide.

So, let's take this instruction sequence here, where we have this extra issue stage in front and a branch as the first instruction. We try to execute, and then we have fall-through code here, which we predict as fall-through. We don't realize that the branch is taken until A0, at which point we can redirect and kill everything we've already done in flight. But look at all the things that have gone in flight already. By the time we're sitting here, we've had one, two, three other cycles to go fetch instructions. We fetched these, we decoded them, we spent a lot of power, a lot of time, a lot of fetch bandwidth doing this. And then we kill it all and re-vector to the correct branch target. So, in this example, we've killed seven instructions. That can have a pretty negative impact on our clocks per instruction, if you will.

So, let's talk briefly about how to fix this. We're not going to fix it all today; we have a whole lecture dedicated to fixing this. But what could we possibly do to minimize the probability of all these dead cycles, all these killed instructions going down our two-way pipe? Well, hopefully, if we're lucky, we can try to predict the destination with some accuracy, and have a branch predictor which figures out where the destination of the actual branch is with high probability. Then, instead of executing, let's say, Op A here, which is a dead instruction down the incorrect branch path, we can try to fetch and execute the correct branch path. And we're going to have a whole lecture on how to get your branch prediction accuracy up. In modern-day processors, it's somewhere around 98% accurate, give or take a little bit.
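To see why those seven killed instructions matter, here is a rough cost model in Python. The slot-counting formula is my reconstruction of the lecture's picture, and the branch frequency and accuracy numbers in the demo are illustrative assumptions, not measurements:

```python
# Rough model of how front-end depth and machine width turn branch
# mispredicts into lost issue slots and extra CPI. All the concrete
# rates below are assumed, illustrative numbers.

def killed_on_mispredict(width, dead_cycles):
    # Wrong-path instructions: `dead_cycles` full fetch groups,
    # plus the branch's own same-cycle group-mates.
    return width * dead_cycles + (width - 1)

# The lecture's picture: 2-wide, branch resolves at A0, three cycles
# after wrong-path fetch starts -> 7 killed instructions.
print(killed_on_mispredict(width=2, dead_cycles=3))   # -> 7

def effective_cpi(base_cpi, branch_freq, mispredict_rate, penalty):
    # Classic penalty model: extra cycles per instruction spent on
    # redirects after mispredicted branches.
    return base_cpi + branch_freq * mispredict_rate * penalty

# Assume 1 in 5 instructions is a branch and a 3-cycle redirect
# penalty on an ideal 2-wide machine (base CPI 0.5):
for acc in (0.85, 0.98):
    cpi = effective_cpi(base_cpi=0.5, branch_freq=0.2,
                        mispredict_rate=1 - acc, penalty=3)
    print(f"{acc:.0%} prediction accuracy -> CPI ~ {cpi:.3f}")
```

Under these assumed numbers, going from mid-'80s to 98% accuracy recovers most of the redirect penalty, which is the motivation for the branch-prediction lecture.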
I actually don't know what the state of the art is on this, because branch predictors keep getting better and more complex, but there are pretty simple things you can do to get you into the mid-'80s range of prediction accuracy. Getting from the mid-'80s into the '90s takes a lot of effort and time. We'll have a whole lecture later in the course dedicated just to branch prediction. But I want to motivate here that if you have a longer front end in your pipe, and we're going to look at some pipelines with even more front-end stages than just fetch, decode, issue before the branch gets resolved, every extra pipe stage you add in the front is going to impact your performance. Because even if you have high prediction accuracy, it's not going to be 100%, and when you mispredict, you're going to have dead instructions going down the pipe, wasting time, energy, and utilization of the pipeline.
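As a concrete taste of those "pretty simple things," here is a minimal sketch of one classic scheme: a table of two-bit saturating counters indexed by branch address. The table size, indexing, and demo numbers are assumptions for illustration; real predictors are considerably more elaborate:

```python
# Minimal two-bit saturating-counter branch predictor, the kind of
# simple scheme that historically lands in the rough mid-80% accuracy
# range on typical code. The table size here is an arbitrary choice.

TABLE_SIZE = 1024
counters = [1] * TABLE_SIZE  # 2-bit counters: 0,1 = not-taken; 2,3 = taken

def predict(pc):
    """Predict taken (True) or not-taken (False) for the branch at pc."""
    return counters[pc % TABLE_SIZE] >= 2

def update(pc, taken):
    """Train the counter toward the actual outcome, saturating at 0 and 3."""
    i = pc % TABLE_SIZE
    if taken:
        counters[i] = min(counters[i] + 1, 3)
    else:
        counters[i] = max(counters[i] - 1, 0)

# Tiny demo: a loop branch at pc=0x40, taken 9 times then falling
# through once. The 2-bit hysteresis means the single not-taken at
# loop exit doesn't immediately flip the prediction.
outcomes = [True] * 9 + [False]
correct = sum(predict(0x40) == taken or update(0x40, taken)
              for taken in outcomes if update(0x40, taken) is None) \
    if False else 0
correct = 0
for taken in outcomes:
    correct += (predict(0x40) == taken)
    update(0x40, taken)
print(f"{correct}/{len(outcomes)} correct")
```

The two-bit hysteresis, as opposed to a one-bit "last outcome" scheme, is what keeps one loop exit from causing two mispredictions the next time through the loop.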