Let's look at a little more complicated case. We talked about control hazards due to jumps; now let's put the control hazard one cycle later. A good example of that is something like a conditional branch.

So here we have a piece of code: an add, then a branch. And it's branching to 200 bytes into the future. If you take the branch (branches on MIPS are relative to the subsequent instruction, so PC plus 4 plus the offset), we'll end up at 304.

And let's walk through this case here. We have I1, which is our first add. Then we get the branch sitting at the decode stage, and we have some decode logic which is saying, is this a branch?

Okay. The next question, though, is how do we compute whether this branch is being taken or not. Can we do this in the decode stage? Unfortunately we need to do a comparison here, and the hardware we want to use to do a comparison is in the ALU. It's a subtract operation, or a comparison operation, so this fits well within our arithmetic logic unit, and we have the zero wire coming out.

So what this is going to do is, instead of having one cycle of latency, or one cycle of kills being inserted, we might have to insert two cycles. Because now we have to wait for the branch to get to here, and if we're predicting PC plus 4, we're speculating PC plus 4, we'll actually have fetched more data.

So let's walk through this. The branch moves forward, one cycle forward here, and what you will see is that we're actually fetching 108. So we're fetching the next, next instruction. But these two are both potentially dead; we want to kill both of these. But we don't know whether to kill them until we get this zero wire here.

Okay, so at that point we can redirect the front of the pipe and we can insert a no-op. That's going to change the logic that connects to here. We are also going to have to insert a no-op here, because we fetched 104 at that time into this point of the pipe.

So if a branch is taken, we need to kill the next two instructions, and the instruction in the decode stage is effectively invalid, so we need to swing this mux here. So we're going to have to add an extra control line onto this multiplexer.
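Just to pin down the arithmetic and the kill behavior we just walked through, here is a tiny software sketch of the example. The numbers 100, 200, 104, 108, and 304 come straight from the slide; nothing here is the actual hardware, it is purely an illustration.

# Taken branch at PC 100 with a 200-byte offset, MIPS-style:
# the target is relative to the subsequent instruction (PC + 4).
branch_pc = 100
offset = 200
target = branch_pc + 4 + offset          # 104 + 200 = 304

# By the time the branch resolves in execute, we have speculatively
# fetched the next two sequential instructions.
speculative_fetches = [branch_pc + 4, branch_pc + 8]   # 104 and 108

branch_taken = True                      # comes from the ALU zero wire
if branch_taken:
    next_pc = target                     # redirect the front of the pipe
    killed = speculative_fetches         # both get turned into no-ops
else:
    next_pc = branch_pc + 12             # keep fetching sequentially
    killed = []

print(next_pc, killed)                   # 304 [104, 108] when taken

So two kills instead of one, because the branch resolves a stage later than the jump did.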
And now, stalling and killing makes a much bigger difference, so the stall signal has to effectively be killed, or nullified, at this point. And why is this important? Well, let's say we're stalling the front of the pipe and the stall were to take priority. That would stop this register from moving and stop this register from moving as these red lines come in. So if the stall were to take priority, you could have an instruction here which is, let's say, dependent on some stall condition, dependent on some data hazard or some other hazard, and you have a branch which is branching to someplace else.

So let's take this instruction here, 104, which is an add; it's depending on something and is stalling the front of the pipe. At the same time, you want to kill, because you took the branch; you want to kill that instruction. If you were to put the stall at the higher priority, you'd actually end up in a deadlock. Because you'd be stalling, but you would have no way of killing the previous instructions to clear the stall. So this is why it's really important that the instruction at the decode stage is now invalid.

So unlike the jump case, where you might be able to do it either way with the priority of the stall and the kill wires, here the kill has to take precedence, or the redirect has to take precedence. And that's going to redirect the front of the pipe and basically clear out everything here. If you were to stall it, you would not be clearing it out. So instead, you need to kill, sort of turn these into no-ops, and redirect the front of the pipe to go fetch the actual target of your branch, of your conditional branch.

Yeah, sorry, this is just a drawing now: this branch information has to go into this multiplexer and go into this multiplexer.

The other thing that's important for branches is that we need to figure out the address that we're actually branching to. We'll talk about this more when we get to branch prediction later in the course, but what you're going to have to do is take the PC+4 and add it to the branch offset, because on something like MIPS there's PC-relative branching. And the other thing is, we can't necessarily reuse this ALU to do that.
You could potentially reuse this ALU if you were microcoding your design, but unfortunately we need this ALU to do the branch comparison. So we need to add another adder here to come up with the target of our branch. Now, some people get smart about the way they build this, and you can actually see some designs where people will try to reuse the main processor pipeline ALU for the branch offset calculation and maybe try to do the comparison elsewhere, because if you have simple enough branches you can do the comparison with less hardware than a full adder. That's another option; let's hold off talking about that until a little bit later.

Okay, so now we get to talk about the stall signal in detail. We start off, in blue here at the top, with the stall signal we had from before. You need to check the data dependencies, and check for data hazards, between the source operand registers in the decode stage (these are RS and RT) and the other stages in the pipe: the execute stage, the memory stage, and the write-back stage. You check both that you're reading the registers and, via the write enables, that the respective instructions in the different pipeline stages are actually writing to those locations.

But now we add another term, a branch term. We need to know whether there's a branch going on here, because we need to un-stall the processor if the branch is taken. So we're going to AND in a NOT of this. Effectively, this term is: it's a branch-equal-zero and the value is zero. Likewise, this one is: it's a branch-not-equal-zero and the value is not zero. So these two terms are basically saying a branch is happening, it's in the execute stage, and the branch is being taken. If it's a branch-equal-zero and the result is zero, it's taken, or if it's a branch-not-equal-zero and the result is not zero, it's taken.

And why don't we stall if the branch is taken? This is the same question we had on the previous slide. You have a branch that's happening while we're stalling the front of the pipe. The instruction at the decode stage has to be un-stalled to get it out and clear it out. The instruction at the decode stage is now invalid, so we don't want to pay attention to any of its stall information. It's just a red herring; it's a piece of information you shouldn't be paying attention to.
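Here is a rough software sketch of that stall equation with the new branch term bolted on. The signal and field names are made up just for this illustration; the shape of the logic is the point: the data-hazard check gets ANDed with NOT(branch taken in execute), so a taken branch nullifies the stall.

from collections import namedtuple

# Hypothetical views of what sits in each downstream pipeline stage.
Stage = namedtuple("Stage", ["write_enable", "dest_reg"])
BranchInfo = namedtuple("BranchInfo", ["is_beqz", "is_bnez", "alu_zero"])

def stall_signal(rs, rt, ex, mem, wb, br):
    # Data hazard: a decode-stage source register matches the destination
    # of an instruction further down the pipe that will write the register file.
    def hazard(stage):
        return stage.write_enable and stage.dest_reg in (rs, rt)
    data_hazard = hazard(ex) or hazard(mem) or hazard(wb)

    # Branch term: a branch in execute that is actually being taken.
    branch_taken = (br.is_beqz and br.alu_zero) or (br.is_bnez and not br.alu_zero)

    # If the branch is taken, the instruction in decode is invalid, so its
    # stall request must be nullified (otherwise the pipe could deadlock).
    return data_hazard and not branch_taken

ex  = Stage(write_enable=True,  dest_reg=5)
mem = Stage(write_enable=False, dest_reg=0)
wb  = Stage(write_enable=True,  dest_reg=9)
taken_beqz = BranchInfo(is_beqz=True, is_bnez=False, alu_zero=True)
print(stall_signal(rs=5, rt=2, ex=ex, mem=mem, wb=wb, br=taken_beqz))   # False: the kill wins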
So you want the instruction at the decode stage to be turned into an invalid instruction.

Okay, so now we get to talk about the control equations for the program counter and the instruction register multiplexers. Let's start off by talking about PC source, and let's remember what that is. That's way up here; it's the input to this multiplexer, and it's going to select, not necessarily where we're branching to, but the program counter of the next instruction we're going to fetch on the next cycle. It's on the next cycle because this is a flip-flop.

There are a few control equations here, and I wanted to point out two basic things in them: the older instruction, the thing that's farther down the pipe, is going to get precedence over younger instructions, both for redirecting the PC and for the control signals going down the pipe. So if we look at this PC mux select, it actually has things coming out of different stages of the pipe. We have something which is redirecting for a jump out of the decode stage, and then we have these signals here which are coming out of the execute stage, which is strictly farther down the pipe. And this needs to take precedence over that. So if you have a branch which is directly followed by a jump, the jump can be seen trying to swing the control on the PC select mux, the next-program-counter mux. You can't let that happen; you have to let the branch that's actually executing take precedence over that. That's what I really want to get across here. And you'll see that, similarly, the branch takes precedence over the jump for the other control signals.

Let's look at this from a branch pipeline perspective. We'll draw the pipeline diagram for a branch executing in our example. What you notice real fast here is that if you actually execute the branch and take the branch, you're going to be adding two extra instructions here, in this case 104 and 108, and they get killed as they go down the pipe. And this is going to impact our CPI in a negative way. What's happening is that the branch, which is in the execute stage, is going to reach back and kill these instructions, turning them into NOPs.
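Going back to that mux precedence for a second, here's a minimal sketch of it, again with made-up signal names. The only point is that the redirect from the execute-stage branch wins over the redirect from the decode-stage jump, which in turn wins over the default PC+4 speculation.

def pc_source(branch_taken_in_ex, jump_in_decode):
    # Older instructions (farther down the pipe) take precedence.
    if branch_taken_in_ex:
        return "branch_target"   # redirect coming out of execute
    if jump_in_decode:
        return "jump_target"     # redirect coming out of decode
    return "pc_plus_4"           # default speculation

# A branch directly followed by a jump: the jump in decode is trying to
# swing the mux, but the taken branch in execute must win.
print(pc_source(branch_taken_in_ex=True, jump_in_decode=True))   # branch_target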
We can plot that pipeline diagram in the other dimension as well, but you see the idea. So now we need to think hard: is this a good thing? Well, we speculated that the next instruction was going to be PC+4, or rather that the address of the next instruction was going to be PC+4. So we took something where the CPI was, let's say, two, a machine CPI of two, and we cut it down a little bit in the common case. But for jumps the CPI still goes back up to two, and here the CPI is three. Now, in our no-speculation case, if we had a branch we would probably have to stall for three cycles anyway to know the destination of that branch, so this didn't really hurt us in that case; the CPI in that case was also three for a branch instruction.

Hm. Okay, well, this still isn't good. I don't want every single branch I take to have a CPI of three. So let's talk about some ways to mitigate this cost. How do we reduce the branch penalty?

Well, one way to do it is that we can actually add a comparator one stage earlier and have the branch resolve on the same timeline as something like the jump. So how do we do that? Well, here we have the register file; this is in the decode stage of the pipe, and this is the fetch stage. What do we really need to do for a branch? We need to know that it's a branch, so we do some decode. And we need to know whether it's a branch-equal-zero or a branch-not-equal-zero, for instance, if those are the only two branches in our architecture. And we need to know that, let's say, register one is zero or not zero.

So some people came up with this smart idea that we can add a zero detector onto the register file output. By adding the zero detector onto the register file output, we'll know which way the branch goes. One of the interesting things is that this zero detector is actually a little bit easier than doing a full comparison against zero. Because to do a zero detect, you can basically build an optimized OR tree and then take the negation of it; sometimes people do this with wired-ORs or more analog-style circuits. But this is effectively a big OR gate, where you OR in all 32 bits if you have a 32-bit processor, and then take the NOT of it, or the bit itself. This is easy for zero detect, but it can get harder for different types of branches.
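Just to pin down what that zero detect is doing logically, here's a little software sketch of it; the real thing is an optimized gate tree or a wired-OR, not a loop, but the function is the same.

def is_zero(value, width=32):
    # OR together all the bits of the register, then negate the result.
    any_bit_set = 0
    for i in range(width):
        any_bit_set |= (value >> i) & 1
    return not any_bit_set

print(is_zero(0))     # True:  a BEQZ on this register would be taken
print(is_zero(200))   # False: it would fall through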
But if you have, say, a branch-equal, or BEQ, instruction, you can't use a zero detector. What you have to do instead is actually do a full comparison between two of the source operands, and that usually takes more time.

So the nice thing about this is that it'll take all your branches from a CPI of three, when they mispredict, or when you misspeculate, down to two, the same timeline that you had with a jump, so you end up with the exact same pipeline that we had for jumps. The downside is that this is going to elongate your cycle time. You have this wire coming out, and the wire needs to go into the logic which computes PC source, and that's going to become a critical path. It's a critical path because you have to go through the decode (that's usually going on in parallel), you have to access the register file, then go through the zero detect, then go into the control logic which computes this, and then come around and latch, or flip-flop, the information. So you can really negatively impact your clock period.

And as we talked about with the iron law of processor performance, you can either get performance by lowering your clocks per instruction, or you can get it by increasing your clock frequency, or in other words lowering your clock period. So these two things trade off. You can elongate your cycle time, which will negatively impact your performance, or time per program, in our iron law of processor performance. But on the flip side, you will have branches resolve faster, so your cycles per instruction, or average cost per instruction, especially for branches, will go down.

So this is only one approach. Another approach is you can expose it to software. We can change the instruction set architecture. What is really cool here is that the instruction set architect can actually have an impact on what's going on here. This is not all microarchitecture issues, this is not all fancy branch predictors, which is what we'll be talking about later in this course; instead, you can expose this problem, this challenge of the time it takes to resolve a branch, to the software. So what we're going to do, well, I'll define a branch delay slot.
And a branch delay slot is an architectural change (when I say architectural I mean big-A architectural, an instruction set architecture change) which allows, or forces, some number of instructions after a branch to always execute. It could be, let's say, one, two, three, or four, some architecturally defined number. And the instruction in that delay slot is executed regardless of the direction of the branch.

What people try to do with this is take work that was above the branch and move it below the branch. Because if you know the work is going to be done anyway, you can just put it below the branch and everything's okay. But the problem with that is that the branch has to not be dependent on the work you're moving below it. So typically, for very short loops where there's not much work inside the loop, this becomes very problematic, because there's not enough work that the branch is not dependent on to move below the branch and have the loop run effectively. And what we're trying to do here is let the compiler have a hand in what's going on, or really the program writer, because that's who's doing the reordering of the instructions from above here to below here.

So let's take a look at an example where we have a branch and we have one delay slot. Some architectures have two or more, but right now we'll assume one delay slot. Let's also assume that we have this branch-penalty reduction optimization, so branches only have one dead cycle when we execute them. So what we can do is say, okay, the branch is at address 100, and this add here at address 104, in the delay slot, executes regardless of whether the branch gets taken or falls through. Then we don't actually have to kill the next instruction after the branch. We don't have to have all that fancy extra hardware there; we just have this always execute.

The tricky thing is that many times, as I've said, it's hard to fill this branch delay slot. It'd be great if you could have many, many branch delay slots. If you go look at something like the Pentium 4, I think they have a twenty-odd-cycle mispredict penalty; they could have added twenty delay slots to their architecture. That would have been really, really cool.
But at the same time, the compiler would have to go find twenty instructions for every branch to stick after the branch. And there's usually just not that much work to put after the branch which the branch is not dependent on. So we have a really hard time filling these branch delay slots. Roughly, if you go look at something like MIPS, across sort of old SPEC INT, to give you some example of how often the compiler can actually fill branch delay slots: I think the rough rule of thumb is that if you have one branch delay slot, you can usually fill it maybe about 70 percent of the time on something like SPEC INT. (SPEC INT is a common benchmark suite that's used throughout the computer architecture industry, or the computing industry, to measure performance.) And if you have two branch delay slots, that second branch delay slot only gets filled less than about half the time.

So, since you fill these slots relatively infrequently, you might be better served by coming up with other prediction mechanisms. We're going to talk later, not in this class but in a different lecture, about branch prediction, which is a technique to cover branch latency by making a branch decision early, or predicting whether the branch is taken or not taken. And this can be used to dramatically reduce the branch penalty.

So let's draw the pipeline diagram here of a branch delay slot; this architecture has one branch delay slot. We have the add flowing down the pipe, we have a branch flowing down the pipe, and now we're going to have this next add, and architecturally we've defined that it always executes, whether the branch is taken or not taken. So this pipeline diagram sequence here is the same whether the branch is taken or not taken. In this case we're going to say that the branch is taken, and we go execute this next add here. But you see that there are no bubbles; there are no NOPs being inserted. So we have actually improved the performance here.

This is assuming, though, that this add is doing useful work. That's really important to say, because the compiler or the program writer has to put something here. And if they can't put anything useful, they're going to have to put a NOP.
And that's not useful work; doing a no-operation, you might as well just have the hardware insert the NOPs for you and save some instruction encoding space. So there's a tough trade-off here between how often you fill the branch delay slot versus having a branch delay slot at all. But for right now we'll say branch delay slots are one good technique to put useful instructions after a branch and not have branches stall out.
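As a back-of-the-envelope illustration of that trade-off, using the rough fill rate mentioned above (these are just rule-of-thumb numbers, not measurements from this course):

# With one delay slot and a one-cycle branch penalty, a filled slot turns the
# dead cycle into useful work; an unfilled slot is a NOP and the cycle is wasted.
fill_rate = 0.70                 # rough rule of thumb for the first delay slot
branch_penalty_cycles = 1        # dead cycles per branch in the early-resolve design

wasted = (1 - fill_rate) * branch_penalty_cycles
print(wasted)                    # about 0.3 wasted cycles per branch, on average

So most of the one-cycle penalty gets hidden when the compiler can fill the slot, which is roughly why a single delay slot is attractive here, and why twenty of them would be so hard to make pay off.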