Okay, let's look at the issue logic here with a pipeline diagram. Here we have OP A, B, C, D, E, F: straight-line code, no branches, flowing down the pipe, and we have our nice pipeline diagram. One of the cool things is that now that we have a two-wide superscalar, we can actually violate a rule we had before, which said two things cannot be in the same pipe stage at the same time, temporally, because time runs from left to right in this diagram. Here we can have two operations, or two instructions, in the fetch stage at once. And because we don't have a great name for these things, we're going to call the execution unit stages A0 and A1 in one pipe, and B0 and B1 in the other. In an ideal world, this is pretty sweet: at least for this code, we actually achieve a clocks-per-instruction of one half. That's pretty awesome. And as I said, we can have two instructions in the same stage of the pipe.

Okay, let's look at a little bit more complex code sequence. We have an add, some loads, another add, a load. Your issue logic, that swapping logic, will actually have to move instructions around in this case. So we have this add and this load.
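As a rough sketch of the arithmetic behind that one-half CPI claim (the two-wide width and the six-instruction OP A through F sequence come from the example; the helper names are my own, not from any real simulator):

```python
from math import ceil

# Sketch: steady-state issue cost of hazard-free straight-line code on an
# n-wide superscalar. Ignores pipeline fill and drain cycles.
def issue_cycles(num_instructions: int, width: int = 2) -> int:
    """Cycles needed to issue all instructions, `width` per cycle."""
    return ceil(num_instructions / width)

def cpi(num_instructions: int, width: int = 2) -> float:
    """Clocks per instruction in the issue stage."""
    return issue_cycles(num_instructions, width) / num_instructions

print(issue_cycles(6))  # OP A..F issue in 3 cycles on the two-wide machine
print(cpi(6))           # 0.5, the one-half CPI from the example
```

With an odd instruction count the last cycle issues only one instruction, which is why real CPI only approaches one half on long runs.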
Well, this one is actually easy: the add goes to the A unit, the load goes to the B unit. No problems there. Okay, now we have the next load. Uh-oh: we fetched it and it's in instruction register zero, which means it wants to go to the A pipe, so we need to swap these two. You can see how we draw this: we say this add is going to the A pipe, which is the opposite of the order they were fetched in. There are still no stalls going on, at least in this example.

And then finally here we actually get a structural hazard, and the structural hazard introduces a stall. We fetch these two loads simultaneously, but we can only execute one load at a time, so we need to stall one of the loads in the decode stage and push it out a cycle. So this actually has a different pipeline diagram than the no-conflict, no-structural-hazard example.

Okay, let's look at a little bit more complex example: a dual-issue data hazard. What happens when you have data hazards? Unfortunately, this first case is without any bypassing. These first two instructions here don't have any data hazards.
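The swap-and-stall decision just described can be sketched as a small routing function. This is a hypothetical model, assuming (as in the lecture's machine) that only pipe B has the memory port while either pipe can take an ALU op:

```python
# Sketch of dual-issue slotting: given the two fetched opcodes, assign
# them to the (A-pipe, B-pipe) slots, swapping when needed, and report a
# structural hazard when both need the memory unit. "bubble" marks an
# empty slot; the second instruction then re-issues next cycle.
MEM_OPS = {"load", "store"}  # assumed: only pipe B can execute these

def issue_pair(i0: str, i1: str):
    """Returns ((a_slot, b_slot), stalled)."""
    if i0 in MEM_OPS and i1 in MEM_OPS:
        return ("bubble", i0), True      # structural hazard: i1 waits in decode
    if i0 in MEM_OPS:
        return (i1, i0), False           # swap: memory op to B, ALU op to A
    return (i0, i1), False               # natural order works

print(issue_pair("add", "load"))   # (('add', 'load'), False)
print(issue_pair("load", "add"))   # (('add', 'load'), False), swapped
print(issue_pair("load", "load"))  # (('bubble', 'load'), True), stall
```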
But here we have a write to register five and a read from register five: a read-after-write hazard. And because we're not bypassing in this pipeline yet, we actually have to stall the second instruction waiting for the first one, even though we could potentially have executed them at the same time; there's a real data hazard there. So we need to introduce stall cycles into the second instruction. Does this make sense to everybody? We're going to push out that add.

If we have full bypassing, we may still potentially need some stalling, but now we don't have to wait for the value to get to the end of the pipe before picking it up in the ALU. We can pull it back, because we can bypass, say, the add result after A0. What you see here is the same instruction sequence, but now the value is bypassed from A0 into the decode stage and we can get going again more quickly. So bypassing is really helping us here, crossed with the superscalar-ness, if you will.

What we mean by "order matters" is that here we've interchanged the last two instructions. We just flipped them, and we turned what was a read-after-write hazard into a write-after-read hazard. Because of that, the sequence actually pulls in by one cycle and we don't get the stall.
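The hazard classification that makes this reordering trick work can be sketched as a small checker. The register numbers below are illustrative stand-ins for the example's write-to-r5, read-from-r5 pair:

```python
# Sketch: classify the dependence between two adjacent instructions,
# each given as a (dest, src1, src2) register tuple.
def classify(first, second):
    """Return 'RAW', 'WAR', or None for two in-order instructions."""
    d1, s1a, s1b = first
    d2, s2a, s2b = second
    if d1 in (s2a, s2b):
        return "RAW"   # true dependence: stall (or bypass) required
    if d2 in (s1a, s1b):
        return "WAR"   # name dependence only: no stall in an in-order pipe
    return None

writer = ("r5", "r1", "r2")   # writes r5
reader = ("r6", "r5", "r3")   # reads r5
print(classify(writer, reader))  # RAW: must stall without bypassing
print(classify(reader, writer))  # WAR: flipping the pair removes the stall
```

The write-after-read case is harmless here precisely because the pipeline issues in order: the read happens in program order before the write ever commits.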
So just by changing the ordering of the instructions, we change the data dependencies, and that actually changes the execution length. Does that make sense to everybody? Why interchanging two instructions completely changes the data dependencies, so that we have to worry about very different data hazards?

Okay, I want to briefly wrap up with fetch logic and alignment. Someone was alluding to this earlier, I think. Let's look at some code that takes jumps. This column is the address, this is the instruction. We execute some instructions, and we have a jump here to address 100 hexadecimal. Then we execute one instruction, OP E, and jump to 204 hexadecimal; then we execute one instruction and jump to 30C hexadecimal, and then we just execute some straight-line stuff. Here is our cache, and let's say the cache block size is four instructions long. We're going to look at how many cycles this takes to execute.

In the first case, let's say there are no alignment constraints. So in cycle zero, we fetch these two instructions from the instruction cache and execute them, and they're aligned nicely together.
There's nothing weird going on; we just pull them out. Okay, the next two instructions, at 8 and C, are next to each other; that's great. Then we jump somewhere else, to 100, and we execute two instructions that are next to each other at the beginning of their line, so that's great, no problem there.

Hm. Okay. Now we start to get some weird stuff. Now we start to jump into the middle of a cache line. In this example we jump to address 204. Our block size is four instructions, and we're jumping to something other than the first instruction in that block. With a fully fleshed-out fetch unit, let's say, you can fetch with any alignment. Then life is easy: we can just fetch and execute those two instructions in the same cycle; in cycle three, we fetch both of them. Hm, that could get harder if we put some realistic constraints on it.

Okay, now let's jump to 30C, near the end of a cache block, and try to fetch these two instructions at the same time. One is on this cache line, and one is on that cache line. Do we need to fetch two things from our cache at the same time? Yeah, we do.
That is, if we actually want to execute this instruction and that instruction at the same time. Let's say for right now that the fetch logic allows us to do that; somehow it's a dual-ported instruction cache, we'll say. And then finally, the op at 314 executes last, and it just falls through; there are no jumps or anything happening.

So some things that can be really hard to make work out right are fetching across cache lines, and possibly even fetching from arbitrary points inside a cache line, depending on your fetch unit logic. And like I said, we might need extra ports on the cache. Here is this code executing, and as you can see we don't actually introduce any stalls: it just executes this, then this, then this, and we execute two instructions every single cycle.

Now let's look at what happens with alignment constraints. Here's our original example, and let's look at what we could possibly try to execute. We jump into the middle of a line, and we only use these two instructions from the middle of that line. So let's say we can only fetch an aligned half of a block, or something like that, in each cycle, because that's how wide our cache port is.
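The "do we need two cache ports?" question comes down to whether a fetch group crosses a block boundary. Here is a minimal sketch of that check, assuming (as in the example) blocks of four 4-byte instructions; the addresses are the ones from the jump example:

```python
# Sketch: detect when a 2-instruction fetch group straddles a cache block
# boundary, which would need a second cache port (or an extra cycle).
BLOCK_BYTES = 4 * 4   # assumed: 4 instructions of 4 bytes per block

def straddles(pc: int, width: int = 2, isize: int = 4) -> bool:
    """True when [pc, pc + width*isize) spans two cache blocks."""
    first_block = pc // BLOCK_BYTES
    last_block = (pc + (width - 1) * isize) // BLOCK_BYTES
    return first_block != last_block

print(straddles(0x100))  # False: 0x100 and 0x104 share a block
print(straddles(0x204))  # False: middle of a block, but still one block
print(straddles(0x30C))  # True: 0x30C and 0x310 are in different blocks
```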
What you might have to do in some architectures, if you have alignment issues like that and you're not allowed to straddle, is fetch extra data that you're just never going to use. You're just throwing away that bandwidth. And the cycle counts change, too.

So let's look at this same code sequence and what happens when we go to execute it. Going back to the beginning: we execute OP A and OP B and just go down the pipe. Okay, life is good. We get to address 8 hexadecimal; well, we're going to swap that pair, because the jump needs to go down pipe A, but otherwise things are okay. Now we jump to the middle of a line. Hm, that starts to get more interesting: we're basically going to end up wasting cycles. This will take seven cycles, where before we had it taking only five, because we've effectively introduced dead cycles where we fetched instructions we just didn't use. The three X's here show up as instructions we fetched; for instance, the instruction at address 200 is one of them: we fetched it and we're not using it.
And we fetched these two as well and didn't use either of them. So having a fetch unit that isn't fully alignment-capable, one that can't fetch across arbitrary boundaries, can cause some serious problems for performance. Let's stop here for today, and we'll talk about the rest next time.
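The two fetch units can be compared with a small simulation. The trace below is my reconstruction of the example's addresses, so the exact counts depend on that reconstruction: with these eleven instructions the aligned-only unit takes seven cycles, matching the lecture's seven, while the unconstrained unit takes six here rather than five because of the odd final instruction.

```python
# Sketch: fetch cycle counts for an alignment-free unit versus one that
# can only fetch an aligned pair each cycle. Assumes 4-byte instructions
# and sequential fall-through within each aligned pair.
def cycles_unconstrained(trace, width=2):
    """Any two consecutive trace entries fetch together, even straddling."""
    return (len(trace) + width - 1) // width

def cycles_aligned(trace, width=2, isize=4):
    """Each cycle fetches the aligned pair holding the next needed PC;
    pair slots we jump over or out of are wasted."""
    cycles, i = 0, 0
    pair_bytes = width * isize
    while i < len(trace):
        base = trace[i] - (trace[i] % pair_bytes)
        while i < len(trace) and base <= trace[i] < base + pair_bytes:
            i += 1            # consume the useful part of this pair
        cycles += 1
    return cycles

# Reconstructed trace: fall-through at 0..C, then jumps to 0x100, 0x204,
# 0x30C, then fall-through to 0x314.
trace = [0x0, 0x4, 0x8, 0xC, 0x100, 0x104, 0x204, 0x208, 0x30C, 0x310, 0x314]
print(cycles_unconstrained(trace))  # 6
print(cycles_aligned(trace))        # 7: dead slots at 0x200, 0x20C, 0x308
```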