Okay, so that was one way to figure out the outcome. Let's look at a slightly more dynamic way.

Okay, so we can try to build pieces of hardware which determine the outcome of a branch dynamically. We do this by adding extra little pieces of hardware somewhere in the front of the pipe, somewhere in our PC+4 calculation; it's going to be this multiplexer here in the PC+4 calculation. Effectively, we're going to use some pieces of hardware to feed into that, and these are our taken and not-taken prediction values. We'll talk about this in a second, but one of the questions here is: what feeds into this PC+4 hardware? Well, one input is going to be PC+4. Another should be your branch target. Unfortunately, as we said, we don't know the branch target up here, so we can't have it come around in one cycle. We need to hold off on that; that's the target address question. The first question is: how do we swing this multiplexer that selects our next PC?

Okay, so some solutions to this are dynamic, in hardware, and they try to increase the prediction accuracy by exploiting the nature of programs. What we're going to try to exploit first of all is temporal correlation of a branch: the correlation of a branch outcome with previous outcomes of that same branch. So if you've fallen into a loop and you have a branch that is taken, you can use that fact in the future to predict the outcome of that same branch. Here we're actually going to have a one-bit predictor, denoted by this finite state machine. Let's say we start in the not-taken state. If the outcome of a particular branch is not taken, we just stay here and continually predict not taken. If we then end up taking a branch, we transition over to the predict-taken state, and we sit there while branches are taken. Note that if a branch alternates between taken and not taken, we're going to predict the wrong outcome every single time. But we can implement this with one bit, one extra bit in our control logic.
It's not a lot of logic; it's literally one extra bit, and it just says which of the two states you're in.

Let's take a look at the performance of this. Does it help? Say we have a loop that iterates four times. We're going to graph the iteration number, the prediction coming out of this one-bit predictor, what actually happened, and whether we predicted right or wrong. In the first iteration, the predictor is in its start state, so it predicts not taken. The branch was actually taken, because we're going to go around this loop four times, so we mispredicted. Hm, that wasn't great. But the predictor then transitions from the predict-not-taken state to the predict-taken state, so the next few times around the loop we get it right. Now, this is a relatively short loop; we only run it four times. If you had a loop that executes 100 times, this might actually do okay. But then look at the end, at the fall-through case: we predict taken, but on the last iteration of the loop we actually fall through. That's not great. So if you look at the prediction accuracy here, it's 50 percent. That was not very good.

Let's execute the loop again. The reason I show a second execution of the exact same loop is to show why the start prediction was not taken: the second time we come back to this same loop, we left the predictor trained to not taken. So when we go to execute again, we come in predicting not taken and mispredict the first iteration, correctly predict the second and third, and mispredict the fall-through case. That's why you need to execute this loop twice to see the common case, which is the loop running back to back. Otherwise, if you were to change the start state, you might end up thinking this predictor does better than it actually does.

So there's not a whole lot of magic there.
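To make the mechanics concrete, here's a minimal software sketch of that one-bit predictor; the hardware is just one flip-flop of state, and the class name and the printed trace below are my own illustration, not from the slides:

    # One-bit predictor: remember the branch's last outcome,
    # predict that same outcome next time.
    class OneBitPredictor:
        def __init__(self):
            self.state = False          # False = predict not taken (start state)

        def predict(self):
            return self.state           # predict whatever the branch did last time

        def update(self, taken):
            self.state = taken          # one bit: remember the latest outcome

    # The 4-iteration loop from the slide, executed back to back:
    # taken, taken, taken, then the fall-through (not taken).
    p = OneBitPredictor()
    for i, actual in enumerate([True, True, True, False] * 2, 1):
        pred = p.predict()
        print(i, "predict", pred, "actual", actual, "ok" if pred == actual else "MISS")
        p.update(actual)

Running this prints the 50 percent steady-state accuracy from the slide: each pass over the loop mispredicts the first iteration and the fall-through.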
It's simple, and it does okay. Can we do better? Well, we're nowhere near 99 percent, so the answer is going to be yes.

Okay, so what happens if we use more bits to do prediction? Here, instead of adding one extra flip-flop to our processor pipeline, we add two bits: a two-bit saturating counter predictor. What do we mean by saturating? We start off, let's say, in the strong not-taken state. If the branch is taken, we move one state over; another taken outcome moves us over again; and another moves us to the end, where further taken outcomes leave us put. So we have a range, if you will: strong not taken, weak not taken, weak taken, strong taken, and we can move across that spectrum as we reinforce the taken-ness or not-taken-ness of a branch. Still not quite rocket science. Effectively we're introducing hysteresis into our prediction algorithm, and that can help.

Let's see how it helps by executing the same loop another time. We start off in strong not taken, which predicts not taken. The first time we execute the loop branch, it was actually taken, because loops like to be taken, so that's a mispredict. But we transition from strong not taken over to the weak not-taken state, so that mispredict was not in vain. The second iteration happens and we're still predicting not taken, because we're in the weak not-taken state. The branch was taken, so we mispredict again, but now we transition to the weak taken state. The third time, things get a little better: we predict taken, the branch actually is taken, and we predicted it correctly. Then finally, for the fall-through, we predict taken but the branch was actually not taken. (I'm sorry, the slide is incorrect here; this was actually a mispredict, so that entry should say yes.)

We can see that this is actually doing worse, or appears to be doing worse, than our one-bit predictor. Does life get better? Well, we've trained the predictor up a little bit at this point.
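Here's the same kind of sketch for the two-bit saturating counter; encoding the four states as the integers 0 through 3 is my own choice of representation:

    # Two-bit saturating counter: 0 = strong not taken, 1 = weak not taken,
    # 2 = weak taken, 3 = strong taken. Predict taken when the counter >= 2.
    class TwoBitPredictor:
        def __init__(self):
            self.counter = 0            # start in strong not taken

        def predict(self):
            return self.counter >= 2

        def update(self, taken):
            if taken:
                self.counter = min(3, self.counter + 1)   # saturate at strong taken
            else:
                self.counter = max(0, self.counter - 1)   # saturate at strong not taken

    # Same 4-iteration loop, run twice: the first pass trains the counter up
    # (miss, miss, hit, miss); the trained second pass hits 3 of 4.
    p = TwoBitPredictor()
    for i, actual in enumerate([True, True, True, False] * 2, 1):
        pred = p.predict()
        print(i, "predict", pred, "actual", actual, "ok" if pred == actual else "MISS")
        p.update(actual)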
So let's take a look now. Let's execute the loop another time, given that we left off in the strong taken state. Now we execute again, and we predict taken, taken, taken; it actually was taken, taken, taken, and we only mispredict the last one. What's nice about this is that when we go to execute again, because the mispredict at the end only moves us one state, we're actually going to start off in the weak taken state. (I'm sorry, this is another bug in the slide; I will fix it in what I upload. It should say weak taken, because that's where we left off at the end of this run. The prediction accuracy is the same either way; we're still going to predict taken.) So you can see that this predictor does not get tricked, if you will; it does not get untrained by the one not-taken case at the end. What's nice is that you can train it up, and the hysteresis will not get fooled by the one branch at the end of the loop falling through. That's positive.

You could even think of the fall-through case as a helpful signal. You could use it to trigger something in your finite state machine: if you see lots and lots of takens and then one not-taken on the exact same branch, that might mean something to you. People in fact have done things like that.

So even with two bits, you can think about having different state machines for the same job. Here we have one which jumps directly from weak to strong: when we're in the weak not-taken state and we see a taken outcome, we jump straight to strong taken, so we effectively train very fast. It's still hysteresis, but a faster-acting hysteresis. This looks pretty similar to what we had before, but now the weak not-taken state will jump straight to strong taken on a single taken outcome. But this is only one example. If you go look in the [inaudible] book, in the chapter that talks about this,
they give about three other examples that look similar to this: different heuristics that try to capture different behaviors that are common in programs. But none of these is going to be super great on its own, because they're only two bits.

You can then start to think about having a table of these predictors, so you have local predictions: we're not predicting globally, but instead predicting based on each individual branch. The way we do this is we take bits out of the program counter and index into a big table. In that table we store, let's say, two bits per entry, and we train each entry separately. Now, we're not going to want an entry in this table for every branch in the program, because that would be huge. Instead you take some number of bits, k bits in fact, maybe the low-order or middle-order bits of the program counter, and use that as an index, similar to how we use indexes in caches. You could also think about having multi-way associative versions of this; what's shown is a direct-mapped version, but a multi-way associative version would work too.

If you look at, say, a four-kilo-entry version of this (that's pretty big, actually), a 4K-entry branch history table with two bits per entry, now we're starting to get somewhere: we can get roughly 80 to 90 percent prediction accuracy. That's starting to get us there, and we only had to build eight kilobits worth of storage. We do have to somehow pipe it into our next-PC logic, which is a timing-sensitive path in our processor, but it can be very helpful for performance.
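As a rough sketch, here's what that direct-mapped, PC-indexed branch history table might look like in software; K = 12 matches the 4K-entry, 8-kilobit example, but the exact slicing of the program counter is an assumption of mine:

    # Direct-mapped branch history table (BHT): 2-bit saturating counters
    # indexed by K low-order bits of the (word-aligned) program counter.
    K = 12
    bht = [0] * (1 << K)                    # all counters start at strong not taken

    def bht_index(pc):
        return (pc >> 2) & ((1 << K) - 1)   # drop the byte offset, keep K bits

    def bht_predict(pc):
        return bht[bht_index(pc)] >= 2      # counter in {2, 3} means predict taken

    def bht_update(pc, taken):
        i = bht_index(pc)
        bht[i] = min(3, bht[i] + 1) if taken else max(0, bht[i] - 1)

Note the aliasing this implies: two branches whose PCs agree in those K bits share an entry and train against each other, just like a conflict in a direct-mapped cache.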
Before I move on, I want to make one point: there are other ways to organize this table. What we're really doing here is indexing into a table, running the entry through some sort of finite-state-machine decoder which tells us taken or not taken, and individually updating entries of the table for a particular branch. You could think about having other things index into this table: more complex hash functions, some other history information, previous states, or just some other bits up here. This is just the base case. We're going to show more complicated examples for spatially correlating predictors, but a lot of the ideas we'll talk about for spatial correlation can be applied back to these temporally correlating predictors. I just wanted to make that point: we're going to show much more sophisticated versions in our spatial correlation predictors, but you can apply the same ideas back to the temporal ones.

Okay, so let's look at spatial correlation. We talked about temporal correlation: in time, the direction one branch goes has a high probability of predicting whether that same branch will be taken or not taken in the future. That's what temporal prediction is. Spatial prediction, which this piece of code here is trying to get at, says that if you have one branch do something, a future branch right below it is going to be somehow correlated. A good example, and this is kind of the canonical example, is: if you check a value and it's less than five, it's also going to be less than seven. And you get code like this relatively often. So these branches are correlated, but they're not the same branch; they're different branches: this branch and this branch. We have some sort of spatial correlation here, and we want to try to exploit it somehow. Yale Patt and one of his students came up with this, not that long ago surprisingly, in the 90s.

The idea is that we're going to have an extra register, which we'll call a branch history register, that stores whether previous branches we've executed were taken or not taken. We then use that to index into some tables, or use it somehow to predict what's going on. It's actually going to be a shift register, where we shift in what each branch did. So it's not based on just the address.

Okay, so here's our picture of the branch history register.
The branch outcome, whether taken or not taken, goes in here. Every time we resolve a branch, we shift a new value in and let the oldest one fall off the end.

So what could we do with this? Well, we can use it to index a table. If we have a pattern history table, that table will be indexed by the outcomes of previous branches. Based on this pattern, we look in the table, and it outputs either some bits that go into our state machine, or maybe just one bit which says taken or not taken. In the general case it outputs n bits, maybe two bits, which feed into a state machine that updates itself, like what we had before.

What's cool about this is: let's say you have a loop which always runs four iterations, so the branch pattern is taken, taken, taken, not taken. We shift those outcomes into the register, and if we execute that whole loop multiple times, we can train up this pattern history table so it detects the pattern. If we've just seen taken, taken, taken, what do we think the next outcome is going to be? Not taken. And this predictor can detect that, and it can do it with only a one-bit state machine per entry, because when it sees taken, taken, taken, that pattern indexes into the table, and the output there is going to be not taken. So it can predict loops which have short runs. And as the branch history register gets longer, you can predict loops which have longer runs. Now, you're never going to make it 100 bits long, because a pattern history table with 2^100 entries is a very large structure. But say it's ten bits; that's very reasonable. You could then predict a loop that iterates ten times: it will get the fall-through case correct, it will get the start case correct, and you'll have 100 percent branch prediction accuracy in that loop. Wow, that's pretty cool. This is where we're starting to actually get some traction.
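Here's a small sketch of that idea: a branch history register indexing a pattern history table of 2-bit counters. A history length of 4 and the training loop below are my own illustrative choices; they're just enough to learn the taken, taken, taken, not-taken loop pattern:

    H = 4                           # history length: track the last 4 outcomes
    bhr = 0                         # branch history register, kept as an integer
    pht = [0] * (1 << H)            # pattern history table, one counter per pattern

    def predict():
        return pht[bhr] >= 2        # the recent outcome pattern selects a counter

    def update(taken):
        global bhr
        pht[bhr] = min(3, pht[bhr] + 1) if taken else max(0, pht[bhr] - 1)
        bhr = ((bhr << 1) | taken) & ((1 << H) - 1)   # shift the new outcome in

    # Train on the loop that always runs four times; once trained, the pattern
    # ending in three takens correctly predicts the not-taken fall-through.
    for _ in range(5):
        for actual in (1, 1, 1, 0):
            update(actual)

    hits = 0
    for actual in (1, 1, 1, 0):     # one more pass, now fully trained
        hits += predict() == bool(actual)
        update(actual)
    print(hits, "of 4 correct")     # prints 4 of 4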
And this is where branch prediction really starts taking off. This is where we start to get above 80 percent and on toward higher numbers.

If you want even better prediction, you can start to think about having a branch history register which indexes into multiple pattern history tables, and then choosing between those tables based on some bits of the program counter. Now why do we want to do this? Well, we want to know, based on our PC, which table to choose. This is trying to mix temporal and spatial correlation: if a certain branch gets executed, you want to use the prediction state for that particular branch. So some hash of bits of the program counter chooses between some number of these pattern history tables. This mixes our spatial correlation with our temporal correlation, and it's called two-level branch prediction, because we have multiple levels here: the history register, and then the pattern history tables selected by the program counter. And this can do quite well.

If you want a generalized two-level branch predictor, to make this do even better, you can add multiple branch history registers and have them be per branch, effectively. Now, you're never going to have one per branch; they're going to be somehow aliased. You have multiple shift registers, and only when a branch maps to a particular shift register do you actually shift that register. For relatively small numbers of m bits indexing into this branch history register table, and k bits choosing between different pattern history tables, you can get very high accuracy, something like 97 percent. We're still not at 99. Now, you might say: why do I care, 97 sounds pretty darn good? Not good enough, unfortunately. You actually need to go higher if you have a 22-stage pipe and want decent performance, because a 20- or 22-cycle mispredict penalty is really quite bad; it's pretty damaging when you work through the iron law math.

So before we go on, this is kind of one of the more complex slides we're going to see today.
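As a sketch of that generalized organization, here's one way it might look: 2^M aliased per-branch history registers and 2^K pattern history tables, both selected by PC bits. The sizes and the PC slicing are illustrative assumptions, not what's on the slide:

    M, H, K = 4, 6, 2

    bhrs = [0] * (1 << M)                              # per-branch (aliased) history registers
    phts = [[0] * (1 << H) for _ in range(1 << K)]     # 2-bit counters, one table per PC hash

    def select(pc):
        b = (pc >> 2) & ((1 << M) - 1)                 # which history register this branch uses
        t = (pc >> (2 + M)) & ((1 << K) - 1)           # which pattern history table
        return b, t

    def predict(pc):
        b, t = select(pc)
        return phts[t][bhrs[b]] >= 2

    def update(pc, taken):
        b, t = select(pc)
        c = phts[t][bhrs[b]]
        phts[t][bhrs[b]] = min(3, c + 1) if taken else max(0, c - 1)
        bhrs[b] = ((bhrs[b] << 1) | taken) & ((1 << H) - 1)   # shift only this branch's register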
Does anyone have any questions about this, about what's happening in the mechanics?

So let's talk about training this. As you're executing, only when a branch occurs do you shift its outcome into one of these registers, chosen by the PC of that branch. You're also transitioning the state machine which is embodied in the prediction, based on the outcome. So this is actually a pretty complex machine, and you probably don't want to be updating it, oh, we'll say, right in the fetch stage or right in the decode stage. Typically, how this works is you do the update later down the pipe, at the end of the pipe somewhere: you feed back the true outcome of all the branches, and that feeds back into the branch history register and the pattern history tables.

Let's go for the upside of this first. The upside is that you know the branch was actually executed and you know the actual outcome; you know it wasn't squashed by something else, and in an out-of-order pipeline, you know it actually committed. You don't really want to train your predictor on bogus data, so you don't want to train it on incorrectly fetched branches themselves. If, let's say, you have a branch in the shadow of a mispredicted branch, and it were to go and corrupt your prediction state, that's not good. So usually you wait until the end of the pipe. The downside of waiting until the end of the pipe is that there's more feedback and it takes longer to train up. It doesn't necessarily take longer in absolute time, but if you have a branch followed closely by a branch that wants to use that prediction state, the second one can't see the first one's outcome until it gets to the end of the pipe. So you'll be using old, stale data for a little while, about as long as your pipeline is deep. Not the end of the world, but something to be aware of.
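A minimal sketch of that timing split, predicting at fetch but only training at commit, might look like the following; the stage functions, the in-flight queue, and the toy per-PC counters are all my own illustration of the idea:

    from collections import deque

    counters = {}                       # pc -> 2-bit counter (stand-in for the real tables)
    inflight = deque()                  # branches fetched but not yet committed

    def fetch_stage(pc):
        pred = counters.get(pc, 0) >= 2     # read-only predictor lookup at fetch
        inflight.append((pc, pred))
        return pred

    def commit_stage(pc, actual):
        inflight.popleft()                  # this branch really executed
        c = counters.get(pc, 0)             # train only on the committed outcome
        counters[pc] = min(3, c + 1) if actual else max(0, c - 1)

    def squash():
        inflight.clear()                    # wrong-path branches are discarded untrained

The point of the split is visible in the code: a branch in the shadow of a mispredict passes through fetch_stage but never reaches commit_stage, so it never touches the predictor state.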
Okay, so now let's start talking about getting into the high 90s, higher than 97 percent; let's talk about getting to our 99-percent-accurate branch predictors. And this brings us to tournament predictors. It's a great name. It came up first in the Alpha 21264, and what they did is they actually had a predictor to predict which predictor to use.

One of the interesting things, going back to this two-level branch predictor, is that sometimes you want per-branch information, per-branch history, and sometimes you want global branch history. With per-branch history, you're basically training only on that branch's own previous outcomes; but in this example here, you're training on other branches. Going back here, this is a different branch than that branch, so you might actually want global history. And there are times, for instance in a loop, when you actually want local history. So tournament predictors end up using a predictor, we'll call it a choice predictor, that predicts which predictor to use. I show two examples here; one is choosing between a global predictor and a local predictor, which is kind of the canonical usage of this. Actually, in the 21264 they had something more complicated; anecdotally, I heard they have about ten predictors in there, and they're trying to choose between those ten different predictors. You train the choice predictor based on which predictor was correct. Basically, because you run loops and you run code a lot, you can train up the choice predictor to choose between the other predictors based on their accuracy. So if, for a particular branch, one of the predictors is always wrong, the choice predictor can say: just don't use that one, just X it out.

So this does quite well. You're starting to get into the high 90s; I don't actually know the full numbers here, but it's probably 98 to 99 percent accuracy for large individual predictors hooked up to this. The [inaudible] book that you have talks about some other things you might want to stick in here, but for right now we'll leave it at this: we have global predictors, local predictors, things that use a branch history register for both of these, and then something which predicts between the two.
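Here's a bare-bones sketch of the tournament idea: a global component, a local component, and a per-entry 2-bit choice counter nudged toward whichever component was right. This is a simplification of what the 21264 actually builds, and all sizes and names are my own illustrative choices:

    N = 1 << 10                          # entries in each table (illustrative)

    ghist = 0                            # global branch history register
    gpht = [0] * N                       # global component: counters indexed by history
    lbht = [0] * N                       # local component: counters indexed by PC
    choice = [0] * N                     # >= 2 means trust the global component

    def _sat(c, up):
        return min(3, c + 1) if up else max(0, c - 1)

    def predict(pc):
        i = (pc >> 2) & (N - 1)
        g = gpht[ghist] >= 2
        l = lbht[i] >= 2
        return g if choice[i] >= 2 else l

    def update(pc, taken):
        global ghist
        i = (pc >> 2) & (N - 1)
        g_ok = (gpht[ghist] >= 2) == taken
        l_ok = (lbht[i] >= 2) == taken
        if g_ok != l_ok:                          # move the choice only on disagreement
            choice[i] = _sat(choice[i], g_ok)     # toward the component that was right
        gpht[ghist] = _sat(gpht[ghist], taken)    # both components always train
        lbht[i] = _sat(lbht[i], taken)
        ghist = ((ghist << 1) | taken) & (N - 1)

The design choice worth noticing is that the choice counter only moves when exactly one component was right: if a component is always wrong for some branch, the choice predictor steadily learns to ignore it.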
So I'll just restate it one more time: you train up your choice predictor to determine which predictor gets used, and it has lots of training data. But this is all heuristics. If you go look in the book, there's even something that sounds like a joke but isn't; it's something someone has actually proposed. They call it the perceptron predictor: they hook a neural net into the predictor and the prediction strategy. So these things can get pretty sophisticated. I don't know if anyone has actually ever built a perceptron predictor for real, but at least it's been proposed.