Okay. So now let's start talking about how to leverage more techniques to get a lot of the performance that we got out of our superscalar, but in a VLIW processor setting.

I want to break this into two different things.

First, dynamic events: things that happen in the processor that you can't really do anything about statically. It's really hard to do anything about them statically; you just don't know what's going to happen. Something like an interrupt or an exception happening: how can a compiler know where the exceptions are going to happen? It's just not really possible.

And then we'll talk about speculation, which an out-of-order superscalar does, and a lot of things fall into this case. We can actually add things to classical VLIWs to get us closer to the performance improvements, or the instruction-level parallelism improvements, that out-of-order machines get from speculation.

We're going to talk about two different types of speculation today. First, how to speculatively move instructions over branches. In particular, this is going to get really interesting with loads moving across branches, because you have a load and you want to move the load as early as possible. You want to pull the load up early in your program so that if it misses in the cache, or it has to go out to main memory, you can hide the latency while you're doing other things.

We're also going to talk about speculating that two memory addresses are different: one of a load, let's say, and one of a store, though you could also think about this as loads with loads and stores with stores. We want to reorder those, for a similar reason: we want to get loads started as early as possible and stores started as late as possible, so we have enough time to do the computation when we do the scheduling. And remember, in VLIWs we're doing all this scheduling statically, in the compiler.

As you may recall from the out-of-order superscalar perspective, when you try to do load-store speculation, and reorder loads and stores dynamically in an out-of-order superscalar, you actually need an extra structure to figure out when you made a mistake, when you reordered something and it was not a valid reordering. And we have to add that same structure, or a very similar structure, into our VLIW processor.
So let's start off by talking about what the compiler has to do, or how the compiler even starts to think about this speculation. I want to introduce a topic here from compilers. This is an introduction to how we schedule for VLIW: the nitty-gritty of what the instruction scheduler, the back end of the compiler, is going to be doing.

So here are two pieces of code: before code motion and after code motion. What is code motion? Well, code motion is the reordering of instructions in order to tolerate the latencies of the instructions, or to do something else useful. It might not just be tolerating instruction latencies; it might also be reordering for alignment issues. Sometimes branches can only happen on certain alignments, or the performance is better that way. But roughly, we're trying to reorder the code; or rather, do valid reorderings of the code. We can't just move an instruction across a read-after-write dependence. If instruction A reads the result of instruction B, we can't move A before B, because it just wouldn't make any sense. But a lot of the time, the compiler has some flexibility in what it can move around.

Okay, so let's look at this example case here. We've got a little bit of code, and there are two things to note. One, this branch at the end is dependent on register 16. Where does register 16 get computed? Does it actually get computed anywhere in here? No. We're also saving away register 16 with a store. So does register 16 get computed here? Maybe not, but we probably want to give it as much time to be computed as possible. The other thing we have here is a load. Okay, well, what if the load misses in the cache? This next instruction here is directly dependent on that load, so we probably want to give the load as much time as possible to execute. So, traditionally in code motion, you want to push the loads up and push the stores down.
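To make "valid reordering" concrete, here is a minimal sketch, in Python, of the legality test a scheduler applies before hoisting one instruction above another. The instruction records and register names are made up for illustration; they are not from the lecture's example.

    # Minimal sketch of the dependence test behind code motion (illustrative).
    # Each instruction lists the registers it reads and writes.

    def can_swap(early, late):
        """True if 'late' may be hoisted above 'early' without breaking
        register dependences (RAW, WAR, WAW)."""
        raw = early["writes"] & late["reads"]    # late reads what early writes
        war = early["reads"] & late["writes"]    # late overwrites what early reads
        waw = early["writes"] & late["writes"]   # both write the same register
        return not (raw or war or waw)

    add  = {"op": "add",  "reads": {"r2", "r3"}, "writes": {"r1"}}
    load = {"op": "load", "reads": {"r4"},       "writes": {"r5"}}
    use  = {"op": "mul",  "reads": {"r5", "r6"}, "writes": {"r7"}}

    print(can_swap(add, load))  # True: independent, so the load can be hoisted
    print(can_swap(load, use))  # False: RAW, the mul needs the load's result

This only covers register dependences; memory dependences between loads and stores are exactly the harder problem the rest of this section is about.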
Now, if we push the loads up as high as possible and the stores down as far as possible, we're going to move a load and a store past each other. And the question comes up: what if these two addresses are the same? Well, let's assume that they're not, and let's assume the compiler can prove they're not. Then you should be able to do code motion here, where we rejumble all the instructions: we push the load up, push the store down, and leave the branch at the end. So we've done some code motion.

The next thing we want to do is bundle, or schedule, the instructions. When we say bundle, we're going to take several operations and pack them together into one VLIW instruction, one VLIW bundle. We denote that here with these braces, which say these operations execute in parallel, at the same time. So first you want to do scheduling, and then you want to somehow bundle nondependent instructions together. There have been a couple of algorithms that tried to do scheduling, code motion, and bundling all together, but those don't always work out as well. So most VLIW compilers now will actually do the code motion and the scheduling first, and then they'll do the bundling at the end.
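As a rough illustration of that last step, here is a hedged sketch of greedy bundling: walk the already-scheduled instruction list and pack operations into a bundle until one depends on something already in the bundle, or the machine's issue width is full. This is my own toy algorithm, not any production VLIW back end, and the issue width of three is assumed.

    # Illustrative greedy bundler (not a production algorithm).
    # Packs already-scheduled instructions into fixed-width bundles,
    # starting a new bundle whenever a dependence or the width forces it.

    WIDTH = 3  # assumed issue width of our hypothetical VLIW

    def depends_on(instr, bundle):
        written = set().union(*(b["writes"] for b in bundle)) if bundle else set()
        return bool(written & (instr["reads"] | instr["writes"]))

    def bundle_schedule(instrs):
        bundles, current = [], []
        for instr in instrs:
            if len(current) == WIDTH or depends_on(instr, current):
                bundles.append(current)
                current = []
            current.append(instr)
        if current:
            bundles.append(current)
        return bundles

    i1 = {"reads": {"r2"}, "writes": {"r1"}}
    i2 = {"reads": {"r4"}, "writes": {"r5"}}
    i3 = {"reads": {"r5"}, "writes": {"r6"}}   # depends on i2
    print(len(bundle_schedule([i1, i2, i3])))  # 2 bundles: {i1, i2}, then {i3}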
Okay. So, let's look at our first technique to get at instruction-level parallelism, and our first type of speculation in VLIW processors. What's our challenge here? What's our problem? Well, our problem is that branches, or the presence of a branch rather, restrict the forms of code motion the compiler can do. So, we have a branch, and we have some code after the branch. If you have, let's say, an add instruction, it's pretty easy to move it before the branch. Okay. How about an instruction which can take a fault or take an interrupt?

Loads are a good example of one of those sorts of instructions. How is that possible? Well, a load can take a page miss or a page fault. It can try to access some value which is not mapped in by the operating system. So, let's say we take this load and we move it above the branch. Okay, that's not too bad. Now, what happens if that load takes a trap or takes an interrupt? That was never supposed to happen. In correct program sequence order, the branch was going to branch around the load, so the load was never going to execute. But all of a sudden, the load got moved up, the load executes, and it takes an interrupt. Well, your program crashes: it takes either a SEGV, or a bus error if you're on something like a SPARC system, for the reason of taking a segmentation fault, even though nothing actually went wrong. This branch was supposed to protect this load from ever executing. R1 here, let's say, just gets loaded with some bogus value if the branch is taken; the compiler knew that, and the programmer knew that. But all of a sudden the compiler decides to reorder things and has moved the load up, because it wanted to try to get more performance.

This is something that your out-of-order superscalars do also: an out-of-order superscalar will pull nondependent loads up in the program and try to move them across branches. But in an out-of-order superscalar, if you pull a load up and it does something wrong, it's just speculative state; you throw it away and don't care about it. So we need some way to have loads that don't hurt the system. It's a similar sort of idea, but as things stand, we can't move the load above the branch, because it might cause some sort of exception.

So the solution to this is that we're going to add special instructions, special versions of the instructions, which don't take faults. If the fault were to happen, the instruction sort of remembers that the fault happened, but it doesn't actually interrupt the process. This allows us to pull some of these instructions above branches, to perform this code motion statically at compile time.

Okay, so let's take a look at this example here. How would we go about doing this? We're going to move the load up; let's say we move the load as high as possible, above these two instructions, instruction one and instruction two, here. And what do we need to do? Well, if you look, we don't have the word "load" here anymore. We have Load.s. It's a speculative load; it has different semantics.
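To make the idea concrete before the Itanium details, here is a hedged Python model of a non-faulting load. The names, the dictionary-as-memory, and the NaT sentinel object are all my own illustration, not Itanium's actual encoding; real hardware tracks this with a poison bit per register.

    # Toy model of a non-faulting ("speculative") load. Instead of trapping
    # on a bad address, it returns a poison value marking the destination
    # register as "not a thing" (NaT). Entirely illustrative.

    NAT = object()            # poison sentinel standing in for the NaT bit
    MEMORY = {0x1000: 42}     # pretend address space

    def load(addr):
        """Ordinary load: faults (raises) on an unmapped address."""
        if addr not in MEMORY:
            raise RuntimeError("SEGV")  # the trap we must not take early
        return MEMORY[addr]

    def load_s(addr):
        """Speculative load: never faults; remembers the fault as NaT."""
        if addr not in MEMORY:
            return NAT
        return MEMORY[addr]

    r1 = load_s(0xDEAD)   # hoisted above the branch: no crash, r1 is poisoned
    print(r1 is NAT)      # True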
The Load.s never causes an exception, even if it takes a page fault, or dereferences a null pointer, or has any bad underlying memory reference, anything like that. It's not going to actually take that fault. Instead, it just sort of remembers it. And this is typically done, if you look at something like Itanium (this is the sort of code sequence out of Itanium processors), as follows. In Itanium, the destination of the load, in this case R1, is going to be set with a poison bit. In Itanium this is actually called a NaT. What it means is that the destination register of the load is marked as "Not a Thing": it's not there, it's not a valid value. And what's important about this is that it's when you go to use that value that you take the exception.

So let's take a look at this example. Down here, where the load originally was, we put in a check-speculation instruction, and it uses R1. What this is going to do is check the value of R1, and if R1 took a trap or an exception, if there was some problem with it, we can jump to fix-up code. So there will be some other piece of code somewhere else we can branch to, and the fix-up code is going to re-execute the load, make sure everything is okay, and then jump back to the use. So it's speculation: we're betting that in the common case, when we pull this load up, nothing bad is going to happen. But we put a check in just to make sure that this is the case.

One other thing I want to point out is that these poison bits, these NaTs, propagate. Why is this important? Well, actually, let me correct what I said a little bit earlier: if you try to use R1 somewhere else before you do a check, the NaT propagates; it doesn't take a trap. It only takes a trap, or jumps to fix-up code, when you actually get to the check. That's either a special check instruction, or there's another kind of load you can put there, a non-speculative load to the same address, that will re-execute the load if the original load didn't work. Okay, so why is this propagation of NaTs important? Well, it allows you to pull the code down here up above this branch as well.
So, we can pull the load up, and we can pull dependent instructions up too, just as long as we do the check at some point: a check to make sure that the load didn't take an exception and there was no problem. We can basically speculate that the load is fine, and we can speculate that the instructions that depend on the load are going to be fine. And even if the instructions that depend on the load are not fine, well, we've speculated on those too. At some point in our code we need to introduce this check instruction, which sees whether the value is a NaT.

So suppose it propagates all the way through your code: you have a load, then an add, then a subtract, all of which somehow read from register one, a whole dependence tree hanging off register one through that code sequence. At some point, after the branch has actually resolved, you want to check whether everything's okay. And because it all propagates, you can basically do the work early and then check later. When you do the check later, either you see: oh, everything's okay, the register has a real value in it; or it has a NaT in it, the poison bit is set and has propagated all the way through. At that point, you can jump to fix-up code, which re-executes the load and all of the code you pulled up, all that speculative code. And that's why you need fix-up code and not just a re-execution of the load: the fix-up code might have to re-execute more than just the load. It might have to execute the load and all the other code you pulled up above the branch.
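Continuing the toy model from before (again with my own illustrative names, not Itanium syntax), propagation and the deferred check might look like this. The value 5 in the fix-up is an assumption standing in for what the real load would have returned.

    # Toy model of NaT propagation and a chk.s-style check (illustrative).

    NAT = object()

    def alu(op, a, b):
        """Any ALU op propagates poison instead of trapping."""
        if a is NAT or b is NAT:
            return NAT
        return op(a, b)

    def chk(value, fixup):
        """At the load's original program point: recover if poison arrived."""
        if value is NAT:
            return fixup()  # re-execute the load AND the hoisted dependents
        return value

    # Speculated early, above the branch:
    r1 = NAT                                # pretend the hoisted load_s faulted
    r2 = alu(lambda a, b: a + b, r1, 10)    # r2 becomes NaT, no trap yet
    r3 = alu(lambda a, b: a - b, r2, 3)     # still NaT, still no trap

    # After the branch resolves, at the original program point:
    r3 = chk(r3, fixup=lambda: (5 + 10) - 3)  # fix-up redoes load and dependents
    print(r3)  # 12

Note that the fix-up recomputes the whole hoisted chain, which is exactly why a bare reload would not be enough.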
Okay. So, let's move on to another kind of speculation, and this is data speculation. In data speculation, our problem is that memory hazards are going to limit our scheduling, especially since we have to do all the scheduling statically. If you're conservative in your compiler, and you can't prove anything about the different addresses, then all of a sudden every single load is dependent on every single other load, and every single store is dependent on every single other store. Now, why do I say loads depend on other loads? Most loads you might be able to reorder pretty easily, but you have to worry a little bit about one taking a trap before the other; I guess we kind of already talked about how to solve that problem. But once you reorder a load and a store, or a store and a store, or anything involving a store, then you start to have real problems. All of a sudden, some state changed in your memory system, and you have to make sure that the later instructions in original program order actually pick up the result of that: later loads and stores have to honor it. So this is really focused on moving loads up past stores.

So what's our solution? Well, our solution is to add a hardware check to guarantee that the pointers, the two memory addresses, when you reorder a load and a store, or a store and a store, are not the same. And this is actually the same thing that we did in our load-store queue, the memory disambiguation that we talked about in out-of-order processors. It's the same sort of thing going on here. But we need to actually change the instruction set to handle this in a VLIW setting, whereas in the superscalar it was all microarchitecture, below the instruction set, that no one ever had to look at.
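Here is a tiny illustration, with a made-up example rather than the lecture's code, of why reordering a load above a store is only safe when the addresses differ:

    # Why load/store reordering needs an aliasing guarantee (illustrative).

    def run(mem, reorder, load_addr, store_addr):
        """Original order: store then load. Reordered: load then store."""
        if reorder:
            r1 = mem.get(load_addr, 0)   # hoisted load
            mem[store_addr] = 99         # the store it was hoisted above
        else:
            mem[store_addr] = 99
            r1 = mem.get(load_addr, 0)
        return r1

    # Different addresses: the reordering is invisible.
    print(run({0x10: 7}, False, 0x10, 0x20),
          run({0x10: 7}, True,  0x10, 0x20))  # 7 7

    # Same address (aliasing): the hoisted load reads a stale value.
    print(run({0x10: 7}, False, 0x10, 0x10),
          run({0x10: 7}, True,  0x10, 0x10))  # 99 7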
Okay, so let's look at how to do this. We have a code sequence here: some instructions, a store, and then a load. The compiler can't prove anything about these stores and this load. But an out-of-order superscalar could guess, or speculate, that the load and the store don't share the same address; and if they did, it could basically roll back the code and re-execute the load. We're going to emulate something like that, but doing it all statically with the compiler.

Okay, so how do we do this? Well, this is something else that Itanium added, and that also appeared in earlier research VLIW, or explicitly parallel instruction, processors. What you're going to do here is move the load up above the store. This is basically speculating that the load and the stores don't hit the same address. And in much the same way that an out-of-order superscalar, when it reorders loads above stores, adds the load address to a load-store queue or a memory disambiguation queue, we're going to do the same sort of thing here. When we move this load up, we're going to use a different instruction, an advanced load, Load.a. What this is going to do is perform the load and add the address to a special table. What does ALAT stand for? Advanced Load Address Table; the ALAT is what Itanium calls this. It's an associative structure where you can put addresses in it, and then check other addresses, from other instructions, against it. And if you get a hit, you know that something bad happened, and you can re-execute.

So we're going to see here that we have the advanced load; we get the load out to the memory system early, and we can overlap its time in the memory system with executing other instructions. Then we have a store here. Now, the store is going to take its own address, and we've changed the instruction set: we've added these two new instructions and changed the semantics of a store. The store is going to check in the ALAT, the Advanced Load Address Table, to see if the address the store is going to is already there. And if it is, it's going to change the table somehow; we'll talk in a second about what exactly happens, but roughly, it's going to take that address out of the table. And then when this load check happens down here, it checks to see if the address is still in the table. If it's been bumped out of the table, it jumps to fix-up code, which is basically going to re-execute the load, or re-execute the load and all the dependent instructions of the load. So we can effectively do data speculation, but all in software. This does require us to have an associative hardware block, some sort of content-addressable memory, or CAM, that all load addresses and all store addresses are passed across. This is similar to the memory disambiguation queue in our out-of-order superscalar.
Okay, so let's look at the ALAT, the Advanced Load Address Table. What do we have in here? Well, three things. We store the address. We store the size; this is important so that we know whether it's a word, double word, half word, or byte when we go to do the matching. And when a Load.a happens and adds an entry into this table, we also record the register number that is the target of the load. That's useful when we go to do the check: it's matched up to make sure it's the same load from before, and we can effectively remove the address from the table when we do the check.

Okay, so we added an address. Now we're running along, and we execute stores, but say the stores don't hit on that address. When a store executes, what does it do? Well, it looks in this table and CAMs, does a comparison, against all the addresses in the table, with size matching and everything. And if its address is in the table, it removes it from the table. So if, by the time the load check happens, that address is not in the table, the load check looks in the table and says: oh, I'm not in the table anymore; this address is not in the table anymore. At that point, you know you have to re-execute that load. But if it's still in the table, that means no store hit it and bumped it out of the table in the intervening time, so the speculative load was fine.

So we've effectively done what out-of-order superscalars can do: we can now move loads above stores. And we did it through a changed instruction set architecture, by adding these special load check and check instructions, by changing the semantics of stores, and by adding an advanced load, the Load.a instruction, here. We can have data speculation in our program.
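Putting the pieces together, here is a hedged software model of the ALAT mechanics just described. It is deliberately simplified: it matches on exact addresses only, whereas a real ALAT is a CAM that does size-aware overlap matching, and Itanium's actual ld.c/chk.a behavior differs in details.

    # Toy software model of an Advanced Load Address Table (illustrative).

    MEMORY = {0x100: 1}

    class ALAT:
        def __init__(self):
            self.entries = {}  # target register -> (address, size)

        def load_a(self, reg, addr, size=4):
            """Advanced load: do the load early and record its address."""
            self.entries[reg] = (addr, size)
            return MEMORY.get(addr, 0)

        def store(self, addr, value, size=4):
            """Store: evict any advanced-load entry it conflicts with."""
            MEMORY[addr] = value
            for reg, (a, s) in list(self.entries.items()):
                if a == addr:            # real hardware checks byte-range overlap
                    del self.entries[reg]

        def load_c(self, reg, addr):
            """Load check: if the entry survived, the speculation was fine;
            otherwise re-execute the load (the fix-up)."""
            if reg in self.entries:
                return None              # still valid, keep the early value
            return MEMORY.get(addr, 0)   # bumped out: redo the load

    alat = ALAT()
    r1 = alat.load_a("r1", 0x100)        # hoisted above the store
    alat.store(0x100, 42)                # aliases, so the entry gets evicted
    redo = alat.load_c("r1", 0x100)
    if redo is not None:
        r1 = redo
    print(r1)  # 42: the check caught the conflict and re-executed the load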
Okay. Now, a couple of other things that are fun to add to classical VLIWs to improve performance; I want to breeze through these next two a little bit faster.

The first one is multiway branches. The problem: you have a very long instruction word architecture, and let's say you have ten operations in one instruction, in one bundle, but you have short branch distances, or maybe multiple branch decisions happening; you want a branch roughly every three instructions. Well, then you have lots of wasted slots, if you can't somehow have one instruction branch in multiple directions. You might say: branching in multiple directions? That's a really weird idea. How do you have an instruction go multiple ways? Branches go one of two places: they either fall through or go someplace else. Well, we can actually change the instruction set to have a branch that can go multiple directions.

So, let's look at this with a full predication example. We have two bundles of code here. This first one is three instructions, all executed in parallel, and it's going to do three different compares. This is like our full predication example: it writes these predicate registers with the logically positive and logically negative versions of each predicate comparison. Okay, so that stages us to do three different branch decisions. And then, down here, we have three branches that all branch based on those different predicates we already set up, and they can branch to three different places: label one, label two, and label three. There's also fall-through code, if none of the branches is taken.

And there's one little wrinkle here where you have to figure out the semantics; I should say Itanium supports this, you can buy real chips with this in them. If you have three branches together, you need some ordering over these branches. Because let's say they all resolve to actually being taken; well, which one do you take? You could branch to three different places, and you want deterministic execution. So typically, when you have an instruction set with multiway branching like this, you define some order: maybe the first one has priority over the second one, which has priority over the third one, and so on.
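As a last sketch, that priority rule can be modeled like this; the code is my own toy illustration, not Itanium semantics verbatim:

    # Toy model of a prioritized multiway branch bundle (illustrative).
    # The first branch in priority order whose predicate is true wins;
    # otherwise execution falls through.

    def multiway_branch(branches, fallthrough):
        """branches: list of (predicate, target) in priority order."""
        for predicate, target in branches:
            if predicate:
                return target
        return fallthrough

    p1, p2, p3 = False, True, True   # say two predicates resolved taken
    print(multiway_branch([(p1, "L1"), (p2, "L2"), (p3, "L3")], "fall"))
    # -> "L2": p2 outranks p3, so execution is deterministic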