So let's look at some techniques to deal with outcome prediction for branches, and we're going to look at this mostly, first of all, in the static domain. So, things that are not trying to watch the dynamic instruction sequence, but instead basically statically analyze the instruction sequence and, with some high probability, try to predict the outcome.

The first technique is actually not a prediction at all. We talked about this a few lectures ago, actually back in lecture two: adding delay slots. So, instead of having that time just be dead cycles while we're trying to resolve the branch, if we don't resolve the branch until here, we have, say, one delay slot, two delay slots, three delay slots, and three instructions to fill them. Or maybe you can redirect at this point, out of execute, and you have two branch delay slots, depending on how you wire that in and whether you can make the cycle time.

Let's say we have two branch delay slots. What ends up happening is you have an instruction, BEQZ, here; let's say it's at address 100. Then you have two instructions that follow it, which execute no matter whether the branch is taken or not taken.
So these instructions are in the delay slots of the branch. And, you know, there are architectures which have things like load delay slots and other things like that, but for right now we'll talk about branches. And we can force these instructions always to execute.

Unfortunately, if you go look at the probability that you can actually fill these delay slots, it's pretty low. It's not always easy to find work to shove down here, because you're basically taking work that was above the branch and, in the compiler, reordering the code to put instructions that were before the branch after the branch.

Hm. Okay. Well, this is good. I mean, if you can actually make this work out, this is great. One of the problems with this, though, is that the probability of filling one branch delay slot is, let's say, 70%. The probability of filling two branch delay slots is even lower than that; it's maybe 40 to 50%. As we said before, if you have a monkey pulling out of a hat, or just some random process pulling out of a hat, you're going to be 50% correct.

And as we'll see today, if you use some fancier branch prediction, or outcome prediction techniques, you can get the prediction accuracy of a branch all the way up into the sort of 95, 96, 97 percent range.
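To put rough numbers on that trade-off, here's a back-of-the-envelope sketch in Python; the fill probabilities are just the ballpark figures from the lecture, not measured data:

```python
# Back-of-the-envelope expected dead cycles for a two-delay-slot branch.
# Assumption: an unfilled slot costs one dead cycle, and the fill
# probabilities are the rough lecture figures, not measurements.

p_fill_first = 0.70    # chance the compiler fills the first slot
p_fill_second = 0.45   # chance the second slot is also filled (lower)

expected_filled = p_fill_first + p_fill_second
expected_dead = 2 - expected_filled

print(f"expected filled slots: {expected_filled:.2f}")   # 1.15
print(f"expected dead cycles:  {expected_dead:.2f}")     # 0.85
```

So even with optimistic fill rates, this design still eats most of a dead cycle per branch, which is the cost a good predictor would mostly avoid.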
And if you look at the out-of-order superscalar on your desktop, your Core i7 these days from Intel, that's probably somewhere in the neighborhood of 98 to 99 percent correct. So being able to fill these delay slots with some probability which is less than 98% is not a good trade-off. If you care about performance, you would have been better served by a fancier prediction mechanism than by trying to fill delay slots. Now, if you care about complexity, or reducing complexity and area, a static prediction might be a good way to go.

So let's look at some basic rules of thumb here about static prediction. The overall probability that a branch is taken, in something like SPECint, is 60 to 70%. But it's not equally distributed: backward branches have a much higher probability of being taken than forward branches.

Okay, so we've got a question coming up here: why is this? Yes, so loops. With high probability, or really by definition, to be a loop you have to jump backwards. If you only ever jump forwards, it's pretty hard to loop. So if you jump backwards, it's a loop. And in fact, programs like to execute loops, and stay in loops for a while, because that's where a lot of the work is done.
So if you're in a loop, and you're just spinning in it, that increases the probability that the backwards branch is taken, and that's why backward branches are taken with high probability. Forward branches, on the other hand, are taken only about 50% of the time.

Hm. What's going on there? Forward branches, 50 percent: these are usually some swarm of data-dependent branches, like an if-then-else clause. So that's why the probability here is much, much lower. You sit in loops for long periods of time; forward branches are typically if-then-elses, where you're checking some data value and then either executing a piece of code or not.

Now, this gets a little trickier with some more advanced compiler techniques. Because sometimes, with advanced compiler techniques, you'll effectively convert a loop with a condition check at the end into an unconditional backwards branch, plus a little branch that goes around the piece of code which checks whether the loop is completing or not. But even that's actually not a horrible thing, because at least that backwards branch will always be predicted taken, a hundred percent, because it's just turned into a jump. So that's not bad for performance. It just, you know, might change these percentages a little bit.
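The loop intuition is easy to quantify: a loop body that runs N times takes its back-edge N-1 times and falls through exactly once. A tiny illustration:

```python
# A loop that runs N iterations takes its backward branch N-1 times
# and falls through once, so the back-edge's taken rate approaches 1
# as the trip count grows.

def backedge_taken_rate(iterations):
    return (iterations - 1) / iterations

print(backedge_taken_rate(10))    # 0.9
print(backedge_taken_rate(100))   # 0.99
```

This is why "backward branch" is such a strong static signal: any loop with a non-trivial trip count pushes its back-edge well past the 60-70% overall taken rate.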
But things do like to sit in loops; that is what you should take away from this slide.

Okay, so let's think about a technique to try to take advantage of this. One technique that takes advantage of this is to add extra bits to your instruction set, and allow the compiler to hint to the architecture whether the branch is taken or not taken. Now, it's just a hint: if it gets it wrong, we still get correct execution. If the branch is taken, we still want to execute the right piece of code; the performance might just be worse.

So let's take a look at two branches here. We have branch.T and branch.NT. Branch.T is a branch which is predicted taken, and branch.NT is a branch that's predicted not taken. So what do I want to say about this? Well, architectures do have this. The Itanium architecture actually has completely static branch prediction. Some things have sort of intermediate versions of this; for instance, the Motorola 68K kind of has something like this, but you can't do it with all branches. So this exists in real architectures.

And one of the things is that this can actually be very accurate, especially if you profile your code. So if you run the code once and you allow the compiler to see which way the branch went, it can do quite well.
And some of the insight here is that there are a lot of branches in a program which are there just to check error cases. And the probability that they are taken or not taken, one way or the other, can basically be determined statically at compile time. And definitely, if you have feedback information from one execution of the program, that's a very good indicator.

So for instance, if you had a loop, the compiler would denote that the backwards branch is taken, so we put a br.T on it. And if it's, let's say, some piece of code which checks an error condition and jumps over the error case, and 99 percent of the time you never fall into the error case, it would predict that branch taken, so we jump over that little piece of code. So you can actually have very high prediction accuracy.

Now, we're not up into the 99 percent range here yet, or anything like that. This isn't great; we're still doing static things here. And, you know, at this point it may actually still make sense to have a delay slot, because we're still not far over our 70 percent, so it's a close trade-off here. You have to get the static prediction correct, and your delay slot might actually still be a better approach, depending on how this accuracy compares to how often you can fill the delay slot.
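As a rough sketch of how profile-guided hints work, here is a toy Python model; the branch addresses, outcomes, and the simple majority-vote rule are all invented for illustration:

```python
# Toy model of profile-guided static hints: run the program once,
# record each branch's outcomes, then hint .T if it was taken on a
# majority of executions. Addresses and outcomes here are made up.
from collections import defaultdict

def assign_hints(profile):
    """profile: list of (branch_pc, taken_bool) from a training run."""
    taken = defaultdict(int)
    total = defaultdict(int)
    for pc, t in profile:
        total[pc] += 1
        taken[pc] += t
    # Hint taken (.T) if the branch was taken more often than not.
    return {pc: ".T" if taken[pc] * 2 > total[pc] else ".NT" for pc in total}

def hint_accuracy(hints, trace):
    """Fraction of branches in a run that the static hints get right."""
    hits = sum((hints[pc] == ".T") == t for pc, t in trace)
    return hits / len(trace)

# Loop back-edge at pc=100 (almost always taken) and an error check at
# pc=200 (almost never taken): static hints do very well on both.
trace = [(100, True)] * 99 + [(100, False)] + [(200, False)] * 99 + [(200, True)]
hints = assign_hints(trace)
print(hints)                        # {100: '.T', 200: '.NT'}
print(hint_accuracy(hints, trace))  # 0.99
```

The 99% number here is an artifact of the made-up trace, but it captures the lecture's point: branches that are strongly biased one way, like loop back-edges and error checks, are exactly the ones a one-time profile pins down.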
But let's take a look at what happens in the pipeline here. So here we have a branch predicted taken, and it is taken, to this target. We still end up with a dead cycle here, because we do not know where that target is until we've effectively decoded the instruction. So in the intermediate time we've just fetched, let's say, the fall-through case or something like that. But because we predicted it taken, we can do better here. We don't have to have two delay slots, or two dead cycles; we can fill one. We can actually get the correct next instruction, the target, in here.

For a branch predicted not taken, let's say we speculatively execute the fall-through. We don't actually have any penalty for a correctly predicted branch here. So if we predict this branch not taken, that's the fall-through case, we can just fetch PC+4, PC+8, and keep executing. And we have no penalty there.

Okay, so what if we get the hint wrong? What are we going to do? Well, we have effectively taken a mispredict penalty here. And if you look at the branch-taken case and you take a mispredict penalty, what's going to happen is you're actually going to end up with two dead cycles, because we don't determine whether we mispredicted until the execute stage.
And that's the soonest we can go redirect the front of the pipe. Now, if you look at this and squint really hard, you can see that we actually fetched op A twice here. This was the fall-through instruction. It is hypothetically possible that you might be able to avoid killing this instruction, and in this case you sort of have the pipe do something special and actually end up keeping the instruction after op A, or op B here, if you will. But that's pretty hard to do. So, for the base case, we'll just say that you end up with an extra cycle of mispredict penalty when you mispredict.

Yeah, so just to reiterate what I was trying to say here: if a branch is predicted taken, we first fetch the subsequent instruction in the fall-through case, then we fetch the target of the branch. We try to kill the fall-through instruction, but, lo and behold, we end up fetching the exact same instruction again. So you could, if you really wanted to, try to optimize around that and not fetch this twice, but then you kind of have things out of order in the pipe. Because you'd have this instruction, and this instruction is dead, and you have to figure out how to kill sub-portions of your pipe, or handle things out of order in your pipe. That gets pretty hard to do.
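As a rough model of what this penalty costs, here's a small sketch that charges zero extra cycles for a correctly hinted branch and two dead cycles for a mispredict resolved in execute; the accuracies plugged in are illustrative, and the model ignores the single dead cycle a correctly predicted taken branch may still pay before decode:

```python
# Average stall cycles per branch under static hints, assuming a
# correct prediction costs no extra cycles and a mispredict costs two
# dead cycles (resolved in the execute stage). Accuracies below are
# illustrative, not measurements.

def avg_branch_stall(accuracy, mispredict_penalty=2):
    return (1 - accuracy) * mispredict_penalty

# A coin flip, a decent static hint, and a very good one.
for acc in (0.50, 0.85, 0.97):
    print(f"accuracy {acc:.0%}: {avg_branch_stall(acc):.2f} stall cycles per branch")
```

Under this model, going from 50% to 97% accuracy cuts the average branch stall from a full cycle down to a few hundredths of a cycle, which is why prediction accuracy dominates the trade-off against delay slots.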
But this static, software-based branch prediction really does quite well.

Okay, so let's look at hardware branch prediction. Now, when we say hardware branch prediction, that does not necessarily mean dynamic hardware branch prediction. This could just mean that the hardware does the prediction, without any hints from the software. And we have a couple of different heuristics, if you will, that we can try to implement in hardware.

The first one: always predict not taken. This is what we've done so far in all of the pipelines that we've designed. We predict the fall-through, effectively, and we just fetch PC+4. We didn't put a name on it, but this is actually what we were doing: speculative execution with a static hardware branch prediction of PC+4. It's pretty simple to implement; it's what you guys have done in your labs so far. You know the fall-through address early. But the accuracy is not very good: you get all your loops wrong. All your backward branches in your loops are always predicted wrong.

Okay, let's do the inverse here: predict taken. Let's have that be our static strategy in hardware. Well, that's kind of hard to do here, because we don't really know the target of the branch until decode. So it's like going back to here.
If we predict everything taken, what do we fetch in this cycle? Well, I don't know; it's a big question mark there. It also doesn't do super well on if-then-elses, because a lot of the time those are forward branches over things, or, well, it depends on how you structure them. Some architectures which have static prediction techniques like these will actually restructure their code: the compiler will figure out the probability of a branch being taken or not taken, and then rework the code around it to make it work out. That's actually pretty common; the Motorola 68K had something similar to that, and the compilers there actually try to work around it.

So, hmm, well, this is definitely a bummer here. Maybe we can fix that. Like we said, the second part of today's lecture is about trying to fix this problem.

Okay, how about we use a heuristic that all backward branches are predicted taken and all forward branches are predicted not taken? Okay, so this does a lot better. This is our heuristic from what we were saying before: loops and things like that will get caught by this. A forward branch, which is not part of a loop, we'll predict not taken; we'll predict the fall-through. A backward branch we'll predict taken. This does pretty well. It's better accuracy.
It's still nowhere near the sort of 80 percent accuracy that we had when the compiler got involved, because the compiler case we were talking about before could actually implement this entire algorithm in software, or something much more sophisticated, and that's usually what ends up happening.
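To compare the three static hardware strategies from this section, here is a small simulation over a made-up branch trace: one loop back-edge taken 90% of the time and one forward if-then-else branch taken 30% of the time. The trace mix is invented purely for illustration.

```python
# Compare three static hardware strategies on a synthetic branch trace.
# Each record is (pc, target, taken); "backward" means target < pc.

def predict(strategy, pc, target):
    if strategy == "not_taken":
        return False          # always predict fall-through
    if strategy == "taken":
        return True           # always predict taken
    # "btfn": backward taken, forward not taken
    return target < pc

def accuracy(strategy, trace):
    hits = sum(predict(strategy, pc, tgt) == taken for pc, tgt, taken in trace)
    return hits / len(trace)

# Loop back-edge at pc=100 (taken 90% of the time) plus a forward
# if-then-else branch at pc=200 (taken 30% of the time).
trace = ([(100, 40, True)] * 9 + [(100, 40, False)]) * 10 \
      + ([(200, 260, True)] * 3 + [(200, 260, False)] * 7) * 10

for s in ("not_taken", "taken", "btfn"):
    print(f"{s}: {accuracy(s, trace):.2f}")
```

On this trace, always-not-taken scores 0.40, always-taken 0.60, and backward-taken/forward-not-taken 0.80, which matches the ordering argued above: the heuristic gets the loop right without also mispredicting the biased forward branch.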