Okay, let's look at the issue logic here and a pipeline diagram. So here we have ops A, B, C, D, E, F: straight-line code, no branches, and we have things flowing down the pipe in our nice pipeline diagram. One of the cool things is that, now that we have a two-wide superscalar, we can actually violate a rule we had before, which said two things cannot be in the same pipe stage at the same time, temporally (time runs from left to right in this diagram). So here we can have two operations, or two instructions, in the fetch stage at once. And because we don't have a great name for these things, we're going to call the execution unit stages A0 and A1 for one pipe, and B0 and B1 for the other. In an ideal world, this is pretty sweet: for this code, at least, we actually have a clocks per instruction of one half. That's pretty awesome. And as I said, we can have two instructions in the same stage of the pipe.

Okay, let's look at a little bit more complex code sequence here: an add, some loads, another add, a load. Your issue logic, that swapping logic, actually has to move instructions around in this case. First we have this add and this load. That's easy: the add goes to the A unit, the load goes to the B unit. No problems there. But next we have a load that we fetched into instruction register zero. That means it wants to go down the A pipe, so we need to swap the two instructions, and that's how we draw it here: the add goes to the A pipe even though it sits in the opposite slot. There are still no stalls at this point, at least in this example. And then finally we actually get a structural hazard, and the structural hazard introduces a stall.
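The swapping the issue logic does can be sketched in a few lines. This is a toy model with conventions of my own, not the lecture's exact hardware: pipe A executes ALU ops, pipe B executes loads, and a fetched pair is swapped or partially stalled to respect that.

```python
def issue(i0, i1):
    """Slot a fetched pair (in program order) into a 2-wide machine.

    Toy model: i0/i1 are 'alu' or 'load'; pipe A executes ALU ops,
    pipe B executes loads. Returns (pipe assignment, stalled ops).
    """
    if i0 == 'load' and i1 == 'load':
        # structural hazard: only one memory pipe, so the
        # second load stalls in decode for a cycle
        return {'A': None, 'B': i0}, [i1]
    if i0 == 'load':
        # the load was fetched into slot 0 but belongs in pipe B,
        # so the issue logic swaps the pair
        return {'A': i1, 'B': i0}, []
    if i1 == 'load':
        return {'A': i0, 'B': i1}, []
    # two ALU ops: under this toy model the second one also stalls
    return {'A': i0, 'B': None}, [i1]
```

In this model the add/load pair issues straight through, the load/add pair gets swapped, and the load/load pair is exactly the structural-hazard stall in the diagram.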
So we fetch these two loads simultaneously, but we can only execute one load at a time. So we need to stall one of the loads in the decode stage and push it out a cycle. That gives us a different pipeline diagram than the no-stall, no-structural-hazard example.

Okay, let's look at a little bit more complex example here: a dual-issue data hazard. What happens when you have data hazards? This first case is without any bypassing. The first two instructions here don't have any data hazards. But then we have a write to register five followed by a read from register five, and that's a read-after-write hazard. Because we're not bypassing in this pipeline yet, we actually have to stall the second instruction waiting for the first one, even though we could potentially have executed them at the same time; there's a real data hazard there. So we need to introduce stall cycles into the second instruction. Does this make sense to everybody? We're going to push out that add.

If we have full bypassing, we still potentially need stalling, but now we don't have to wait for the value to get all the way to the end of the pipe before picking it up in the ALU. We can pull it back earlier, because we can bypass, let's say, the add result after A0. What you see here is the same instruction sequence, but now it's bypassed from A0 into the decode stage and we can start going again quicker. So bypassing is really helping us here, and it combines with the superscalarness, if you will.

What we mean by "order matters" is that here we've interchanged the last two instructions. We just flipped them, and we turned what was a read-after-write hazard into a write-after-read hazard. Because of that, the sequence actually pulls in by one cycle and we don't get the stall.
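The read-after-write versus write-after-read distinction can be checked mechanically. A minimal sketch, with illustrative register names:

```python
def pair_hazard(first, second):
    """Classify the hazard between two instructions in program order.

    Each instruction is (dest_reg, src_regs). RAW (the second reads
    what the first writes) forces a stall or a bypass; WAR is harmless
    on this in-order pipeline because the read happens early in the
    pipe and the write happens late.
    """
    dest0, srcs0 = first
    dest1, srcs1 = second
    if dest0 in srcs1:
        return 'RAW'
    if dest1 in srcs0:
        return 'WAR'
    return 'none'

# the lecture's flip: writing r5 then reading r5 is RAW; swapping
# the two instructions turns it into WAR and the stall disappears
```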
So just by changing the ordering of the instructions, we change the data dependencies, and that actually changes the execution length. Does that make sense to everybody, why we can interchange two instructions and the data dependencies completely change, so that we have to worry about very different data hazards?

Okay, so I want to briefly wrap up with fetch logic and alignment. Someone was alluding to this earlier, I think. Let's look at some code here that takes jumps. This column is the address, and this is the instruction. We have a jump here to address 100 hexadecimal. Then we execute one instruction, op E, and jump to 204 hexadecimal; then we execute one instruction and jump to 30C hexadecimal, and then we just execute some stuff. Here is our cache, and let's say the block size is four instructions long. We're going to look at how many cycles this takes to execute.

Let's say there are no alignment constraints in the first case. In cycle zero, we fetch these two instructions from the instruction cache and execute them; they're aligned nicely together, nothing weird going on, we just pull them out. The next two instructions, at 8 and C, are next to each other, so that's great. Then we jump to 100 and execute two instructions that sit next to each other at the beginning of their line, so no problem there either. Okay, now we start to get some weird stuff: now we jump into the middle of a cache line. In this example we jump to address 204. Our block size is, as I said, four instructions, and we're jumping to something other than the first instruction in that block.
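Which slot of a cache block an address lands in is just arithmetic. A sketch, assuming 4-byte instructions and the lecture's four-instruction blocks:

```python
INSN_BYTES = 4   # assumed fixed 4-byte instructions
BLOCK_INSNS = 4  # block size from the example: four instructions

def block_offset(pc):
    """Decompose a byte address into (block index, slot within block)."""
    word = pc // INSN_BYTES
    return word // BLOCK_INSNS, word % BLOCK_INSNS
```

Address 0x100 decomposes to slot 0 of its block (an easy aligned fetch), while 0x204 lands at slot 1: a jump into the middle of a line.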
With a fully fleshed-out fetch unit, let's say you can fetch with any alignment. Then life is easy: we just fetch and execute these two instructions at the same time; in cycle three we fetch both of them. That could get harder if we tried to put some realistic constraints on it.

Okay, now we jump to the end of a cache block and try to fetch these two instructions at the same time. One is on this cache line, and one is on that cache line. Do we need to fetch two things from our cache at the same time? Yes, we do, if we actually want to execute this instruction and that instruction together. Let's say, for right now, the fetch logic allows us to do that; somehow it's a dual-ported instruction cache, we'll say. And then, finally, the op at 314 executes last; it just falls through, no jumps or anything happening.

So some things that can be really hard to make work out right are fetching across cache lines, and possibly even fetching from an arbitrary position inside a cache line, depending on your fetch unit logic. And, like I said, we might need extra ports on the cache. Here is this code executing, and as you can see we don't get any introduced stalls; we just execute this, then this, then this, and we execute two instructions every single cycle.

Now let's look at what happens with alignment constraints. Here's our original example, and let's look at what we could possibly try to execute. We jump into the middle of a line and only use these two instructions from it. So let's say we can only fetch an aligned half of a block at a time, or something like that, in each cycle, because that's how wide our cache port is.
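Whether a two-instruction fetch straddles a block boundary, and so needs that second cache port, follows from the same arithmetic. A sketch under the same assumptions (4-byte instructions, four-instruction blocks):

```python
INSN_BYTES = 4   # assumed fixed 4-byte instructions
BLOCK_INSNS = 4  # four instructions per cache block, as in the example

def fetch_pair(pc):
    """Block indices touched by fetching two consecutive instructions."""
    blk0 = (pc // INSN_BYTES) // BLOCK_INSNS
    blk1 = ((pc + INSN_BYTES) // INSN_BYTES) // BLOCK_INSNS
    return blk0, blk1, blk0 != blk1  # third element: straddle flag
```

A pair starting at the last slot of a block, say 0x10C, touches two blocks, so you either dual-port the cache or stall; a pair starting at 0x204 stays within one block even though it is unaligned.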
So what you might have to do in some architectures, if you have alignment issues like that and you're not allowed to straddle, is fetch extra data that you're just never going to use. You're just throwing away that bandwidth. And the cycle count changes, too. Let's look at the same code sequence and see what happens when we go to execute it. Going back to this: we execute op A and op B and just go down the pipe. Life is good. We get to address 8 hexadecimal; well, we're going to swap that pair, because the jump needs to go down pipe A, but otherwise things are okay. Now we jump to the middle of a line. Hmm, that starts to get more interesting, because we're basically going to end up wasting cycles. This now takes seven cycles, where before it took only five, because we've effectively introduced dead cycles in which we fetched instructions we just didn't use. The three X's here show the instructions we fetched but didn't use. For instance, the instruction at address 200: we fetched it and we're not using it. And we fetched these two and we aren't using either of them. So a fetch unit that is not fully alignment-capable can cause some serious problems in our performance. Let's stop here for today, and we'll talk about the rest next time.
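The five-versus-seven-cycle difference can be reproduced with a toy fetch model. This sketch assumes the fetch unit delivers one aligned two-instruction pair per cycle and throws away any slot that isn't the very next instruction needed; the trace is my reconstruction of the example's addresses, not taken verbatim from the slides.

```python
PAIR_BYTES = 8  # an aligned pair of two 4-byte instructions

def count_fetch_cycles(trace):
    """Cycles to fetch `trace` (addresses in execution order) when each
    cycle delivers one aligned pair; a slot that isn't the next needed
    instruction is fetched but thrown away (a dead slot)."""
    cycles, i = 0, 0
    while i < len(trace):
        cycles += 1
        base = trace[i] - trace[i] % PAIR_BYTES
        if (i + 1 < len(trace) and trace[i] == base
                and trace[i + 1] == base + 4):
            i += 2          # both slots of the pair are useful
        else:
            i += 1          # the other slot is wasted bandwidth
    return cycles

# reconstructed trace: fall through 0..C, jump to 100, jump to 204,
# jump to 30C; an ideal fetch unit does it in len(trace)/2 = 5 cycles
trace = [0x000, 0x004, 0x008, 0x00C, 0x100, 0x104,
         0x204, 0x208, 0x30C, 0x310]
```

Under the aligned-pair constraint, `count_fetch_cycles(trace)` comes out to 7: the jumps into 0x204 and 0x30C each land in the odd slot of their pair and waste the even slot, matching the dead cycles in the diagram.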