Okay. So that gets us out of the outcome part of today's lecture. Let's start talking about how to figure out where we're going. This brings us back to this picture, and I've taken some stuff off this picture now. What we need to do now is look at the targets, the role of targets. And I said we don't know the target for a branch, a jump, or a jump-and-link until we actually do the PC-plus-offset calculation somewhere in the decode stage, and until we at least have the instruction. For jump register and jump-and-link register, we don't know the target until even later. But this is just peeking at when we know the value, not at what the outcome is. And it's important because, even though we have the prediction of whether the branch is taken or not taken, and we can do that, let's say, with 99 percent accuracy, this is the harder problem: we're trying to pick from two to the 32 possible numbers. Can we actually do a good job here?

So something not on the slides, but something I wanted to talk about briefly here, is that we can use a static approach, similar to what we've seen with the other predictors, and do this calculation through the compiler, if you will. There have been a few architectures that do this. One example is the Hitachi, now Renesas, SH-5 architecture, which is sort of a media processor, we'll say. They actually have an instruction which loads the target of a future branch. So they decouple the branch into two instructions: the actual branch, and a prepare-branch instruction that executes earlier in the instruction sequence and says, "I'm planning to do a branch, and its address is such-and-such." It's actually called prepare target, or PTA, in their architecture. Because of this, they don't have this problem; all they need to do is actually execute the branch. They know the target of the branch in the fetch stage and can just go, because this instruction says: load up this special register with the target of my branch, and the actual taking of the branch is decoupled to sometime in the future. So this is an interesting idea. I'm pretty sure one of the CDC machines did this a long time ago as well, but I couldn't find the reference in time for today's lecture.
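To make that decoupling concrete, here is a minimal C++ sketch of the idea — this is just a model of a fetch stage with a prepared-target register, not real SH-5 syntax or semantics, and every name in it is made up:

```cpp
// A hypothetical model of decoupled branching, loosely inspired by the
// SH-5 "prepare target" idea -- not real SH-5 syntax or semantics.
#include <cstdint>

struct FetchState {
    uint32_t pc = 0;
    uint32_t prepared_target = 0;  // loaded early by the prepare-target instruction
    bool     target_valid = false;
};

// Executed well before the branch: "I'm planning to branch, and here is
// its address," so fetch already knows where it may need to redirect.
void prepare_target(FetchState& f, uint32_t target) {
    f.prepared_target = target;
    f.target_valid = true;
}

// The branch itself: fetch can redirect immediately from the prepared
// register instead of waiting for a decode-stage PC+offset calculation.
uint32_t next_pc(const FetchState& f, bool taken) {
    return (taken && f.target_valid) ? f.prepared_target : f.pc + 4;
}
```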
But I know the SH-5 definitely has this instruction. So all I want to get at here is that there are static, software ways to go about doing this. In addition, now let's talk about something dynamic.

Okay. So the most common way to do this is to implement something called a branch target buffer. With a branch target buffer, you actually do a lookup in parallel with both the branch outcome prediction and the PC-plus-four calculation. You put the current PC into it, and it's going to tell you, somehow, one out of two to the 32 possible targets. How does it do this? Well, it just keeps a table of where that branch has gone in the past. So the first time the branch executes, it gets it wrong. But that first execution loads the table with a predicted target. So we load the predicted target based on where this branch went in the past, we index into this table based on our current program counter, and then we check that we actually match. We take the whole PC and do a tag check, very similar to a cache, because this is effectively a cache. And if we hit in this small cache of targets, we have the new program counter, provided we predicted taken. If we predicted not taken, we just want to use PC plus four. So in effect, we can predict the target of the branch based on where this branch has gone in the past. Wait, that's a lot easier than expected — I thought this was going to be really hard. Well, it's actually not that bad.

One thing I should say is that the size of this table matters. If you alias too many places into it, you may end up pulling out wrong data, or you might end up missing quite often. So you want to size this table appropriately. But you don't want to make it too big, because you're carrying the entire program counter in here, and that can get big. So the data in each entry is quite large.
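Here's a minimal sketch of a direct-mapped branch target buffer lookup. The 1K-entry size and 32-bit PCs are assumptions for illustration only; real designs vary.

```cpp
// Minimal sketch of a direct-mapped branch target buffer.
#include <cstdint>

constexpr uint32_t kEntries = 1024;  // assumed table size

struct BTBEntry {
    bool     valid  = false;
    uint32_t tag    = 0;  // the full PC, stored for the tag check
    uint32_t target = 0;  // where this branch went last time
};

BTBEntry btb[kEntries];

// Fetch-stage lookup, done in parallel with computing PC + 4.
uint32_t predict_next_pc(uint32_t pc, bool predict_taken) {
    const BTBEntry& e = btb[(pc >> 2) % kEntries];  // index by low PC bits
    if (predict_taken && e.valid && e.tag == pc)    // cache-style tag check
        return e.target;                            // hit: reuse the old target
    return pc + 4;                                  // otherwise fall through
}

// Called when a taken branch or jump resolves, to train the table.
void update_btb(uint32_t pc, uint32_t actual_target) {
    BTBEntry& e = btb[(pc >> 2) % kEntries];
    e.valid = true;
    e.tag = pc;
    e.target = actual_target;
}
```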
So one way to solve this is that people typically make these tables set-associative. You'll have multiple PC tags and predicted targets per set, because that actually gives a little better performance: you can check multiple entries in parallel rather than just making the table larger. I mean, you could just make it larger, but typically having a little bit of associativity helps. And again, you're really executing this lookup in parallel with the PC-plus-four logic. One other thing that makes this a little better is that the branch target buffer does not hold every instruction. If it's not a branch or a jump, you don't put it in this table. So you really only put instructions that are branches into this table, which helps save a lot of space.
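A hedged sketch of the set-associative variant — here two ways per set, with sizes again assumed. Hardware compares both tags at once; the loop below just models that parallel comparison.

```cpp
// Sketch of a 2-way set-associative BTB lookup: each set holds two
// (tag, target) pairs that hardware would compare simultaneously.
#include <cstdint>

constexpr uint32_t kSets = 512;  // assumed
constexpr int      kWays = 2;

struct Way {
    bool     valid  = false;
    uint32_t tag    = 0;
    uint32_t target = 0;
};

Way btb2[kSets][kWays];

uint32_t predict_next_pc_2way(uint32_t pc, bool predict_taken) {
    const uint32_t set = (pc >> 2) % kSets;
    if (predict_taken)
        for (int w = 0; w < kWays; ++w)
            if (btb2[set][w].valid && btb2[set][w].tag == pc)
                return btb2[set][w].target;  // hit in one of the ways
    return pc + 4;
}
```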
Okay, so we have a question here: how do we achieve this effect without decoding the instruction? We have a branch target buffer, and we don't want to decode the instruction. Can we just look in the branch target buffer — is that good enough? Is all the information we need in that table? Because we're executing it in parallel with the PC-plus-four; we haven't even fetched the current instruction yet. Is that good enough, or are we missing something? Yeah, this is kind of what I was trying to get at: because we check the program counter here with a valid bit and a full tag match, this is good enough. We don't actually need to look at anything else.

This question was also here to make a point about something called a line predictor. Sometimes you want to get rid of this valid bit and PC tag. Then, coming out of here, instead of predicting the program counter — you don't really need to know the program counter — you just need to know where to get the instruction from your cache: sort of, which entry in the cache to go fetch. And this is called a line predictor. So a full branch target buffer will give you an actual PC. At some point you do need to know the PC, but you can wait to know it until later down the pike. All we really need to know is which line in the cache we want to fetch next cycle. And we can do that by just removing this check and, ad hoc, fetching whatever it gets pointed to from here — which might be wrong, so we have to check that later. So anyway, this is a trick question. We can achieve this with a full BTB, a full branch target buffer. If we don't have this check, it turns into something called a line predictor: we're predicting which line in the instruction memory, or which line in the instruction cache, to go fetch from.

Okay. So this brings us to jump register. Hm. This one gets harder to train. We can try to train it the same way; we can try to use the BTB. So let's look at the different cases. What is a jump register used for? It's used for switch statements. It's used for C++ virtual function pointers, or dynamic function calls — so, if you're doing a jump through a function pointer. And it's used for returns from subroutines. So how well does a BTB work in these different cases?

For switch statements: if you use the same switch statement and the same case gets lit up, this works pretty well. It's kind of like training any other predictor. If the case you're choosing out of the switch statement is highly data-dependent, you're probably not going to predict it very well, and there's probably no great solution to that. Some people do a sort of data prediction and try to look at the register values to do a little better, but it gets very hard.

Virtual function calls: well, this works very well. In C++ there are function pointers all over the place, and this is to implement polymorphism. When you do a method call on an object in C++, you're actually indexing through a jump table inside the data structure. And it just so happens that this is basically always correct, unless you actually reassign that object pointer to something else, or that object changes to a different type. So, an example of that: you have a class Animal, and you have Dog and Cat, which subclass from Animal. And you have a pointer to an Animal, and you have a cat and a dog. Let's say you assign the cat to this object pointer, which is an Animal, and you call some method on it, like run. Well, that's going to train up very well. But all of a sudden, if you change that cat to a dog, the virtual function pointer changes, because what you actually execute when you call run on a cat versus run on a dog is different — they do different things. Cats like to sleep a lot; dogs like to run around and bark. The code in the middle of the run function of the cat and the dog is different, so we no longer jump to the same location, and our BTB would be wrong. But it will retrain pretty fast, because you're probably not changing the object pointer very often. So that's decent.
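Here's that example as actual C++ — a minimal sketch of how the indirect jump target tracks the object's dynamic type; the class and method names are just the ones from the lecture:

```cpp
// A virtual call compiles to an indirect jump through the object's
// vtable, so the jump-register target follows the dynamic type.
#include <iostream>

struct Animal {
    virtual void run() = 0;
    virtual ~Animal() = default;
};

struct Cat : Animal {
    void run() override { std::cout << "cat mostly sleeps\n"; }
};

struct Dog : Animal {
    void run() override { std::cout << "dog runs around and barks\n"; }
};

int main() {
    Cat cat;
    Dog dog;
    Animal* a = &cat;
    a->run();  // indirect jump lands in Cat::run; the BTB trains on it
    a = &dog;  // reassigning the pointer changes the jump target...
    a->run();  // ...so the first call after the switch would mispredict
}
```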
Okay. So, subroutine return calls. Jump registers are used there a lot. You call into a function with a jump-and-link; to come back out, you come out with a jump register. Well, BTBs are okay if the place you're calling from is locked down — if you always call from that one location. But if you're calling a function from different locations quite often, effectively what happens is that one function gets called from lots of different places, and you're not going to be able to predict the return location very accurately. So if one leaf function gets called from lots of different locations, that's not particularly good for a BTB.

One interesting thing about the subroutine return case is that the jump register at the end of our leaf function has one fixed program counter. So when you go to index the BTB, you're using the address of that jump register instruction, and its entry might point back to whatever other function called into it in the past — which has no correlation with whoever calls it in the future. That's why this one is really hard to do.

So can we do better for jump registers? Yes. The idea is a subroutine return call stack, or subroutine return stack: you have a small predictor specifically just for jump-register subroutine returns. What you do is track function calls: you push the return address onto a stack when you call, and then when you do a return, you pop and use that as the prediction.
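A minimal sketch of such a return address stack — the depth of eight entries is an assumption for illustration:

```cpp
// Minimal return address stack sketch. If the call depth exceeds the
// stack depth, entries get overwritten and predictions start going wrong.
#include <cstdint>

constexpr int kDepth = 8;  // assumed depth
uint32_t ras[kDepth];
int top = 0;  // number of pushes minus pops

// On a call (jump-and-link): push the address right after the call site,
// since that's where the matching return will land.
void on_call(uint32_t call_pc) {
    ras[top % kDepth] = call_pc + 4;  // assumes 4-byte instructions
    ++top;
}

// On a return (jump register): pop and use it as the predicted target.
uint32_t on_return() {
    if (top > 0) --top;
    return ras[top % kDepth];
}
```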
So let's say we have these functions here: function A calls function B, function B calls function C, and function C calls function D, as denoted here. We have some stack that we're going to push entries onto when we do jump-and-links — so we know there's something, at the link, to load this predictor with. When the function call gets executed, or the jump-and-link gets executed, we push the address of what comes after, effectively, the function call site. I'll denote it here by the address of the call to FB; that's for when FA calls into FB. In reality, it's the instruction right after that call that we're going to return to. When FB then calls FC, we push the return location after the call to FC. And when FC calls FD, we push that. And these come off in the inverse order, because it's a stack. So when we return, we're going to pop return addresses — pop, pop, pop, like that. We pop off the addresses after the calls to FD, FC, and FB, and we're going to predict the return of each function, or the leaf function, correctly, every single time. Where this gets a little hard is if our call depth — or recursion depth, if you will — is deeper than how many entries we have in our return stack predictor. Then we start predicting wrong. But otherwise we can do pretty well with this.
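Continuing the return-stack sketch from above, here is that exact call chain driven through it; the call-site addresses are made up for illustration:

```cpp
// FA calls FB, FB calls FC, FC calls FD, then everything returns.
#include <cassert>

int main() {
    on_call(0x1000);  // FA calls FB: push 0x1004
    on_call(0x2000);  // FB calls FC: push 0x2004
    on_call(0x3000);  // FC calls FD: push 0x3004
    // Returns pop in the inverse order, each one predicted correctly:
    assert(on_return() == 0x3004);  // FD returns into FC
    assert(on_return() == 0x2004);  // FC returns into FB
    assert(on_return() == 0x1004);  // FB returns into FA
}
```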
In fact, some architectures make this a little easier on the hardware by having a special instruction for the jump-and-link: the call instruction. For instance, x86 has a call, which basically tells the hardware to load the predictor. Otherwise, with a jump-and-link, there are too many other uses — it's not always a function call. It's almost always a function call, but not always. A call instruction on x86, though, is basically always a call, so you know to load this predictor. And then on x86 there's a return instruction, which is always a return from a function. So there, the semantics are tied to the predictor very well. Okay.

So there's actually one other topic I want to talk about today that's not in the notes, and it correlates with line prediction, which we talked about. The one other thing I wanted to talk about is way prediction. We haven't talked about this yet, but say you have, let's say, a two-way set-associative instruction cache. We talked about which line to go look at in the instruction cache, but one of the things that's just as hard is this: if you have a two-way instruction cache and you're trying to fetch the next instruction, which way do you shove into the decode stage of your pipe — way zero or way one? You just don't know. This is one of the reasons why people would build direct-mapped instruction caches: it's very hard to know which way is the correct way. But a predictor, just like our branch predictors, helps with this. We can build a way predictor. We can use the same prediction architectures we've talked about in today's lecture — dynamic, multi-level predictors, sometimes even baked into the branch predictor — and use that to predict whether we choose, let's say, way zero or way one; or, if we have a four-way set-associative cache, way zero, one, two, or three. Anyway, that's what I wanted to talk about today, so let's wrap up here.