1
00:00:03,025 --> 00:00:09,071
Okay, so we left off talking a little bit
about how to encode instructions in VLIW

2
00:00:09,071 --> 00:00:15,902
or how VLIW's encode instructions.
And as you may recall, in a VILW

3
00:00:15,902 --> 00:00:20,483
instruction encoding, you've a very long
instruction work.

4
00:00:20,483 --> 00:00:23,286
So it could be I don't know many, many
bytes.

5
00:00:23,286 --> 00:00:26,748
So, if you go and get something like the
multi flow processors, multi flow trace

6
00:00:26,748 --> 00:00:30,036
processors.
They had very long instruction words.

7
00:00:30,036 --> 00:00:34,084
They could, I think, execute something
like and their largest machine

8
00:00:34,084 --> 00:00:37,115
configuration twenty something
instructions.

9
00:00:37,115 --> 00:00:42,674
And they had to encode this but one of the
challenges was that you didn't necessarily

10
00:00:42,674 --> 00:00:46,066
want to have all that encoding space used
all the time.

11
00:00:46,066 --> 00:00:51,007
So if you are executing a simple
instruction, or something which doesn't

12
00:00:51,007 --> 00:00:55,985
actually need all the encoding space.
You end up of a whole lot of no ops.

13
00:00:55,985 --> 00:00:59,808
Or if you have a very long dependency
string in your execution trace if you

14
00:00:59,808 --> 00:01:01,392
will.
So you have a instruction which is

15
00:01:01,392 --> 00:01:04,815
dependent on the next instruction, which
is dependent on the next instruction,

16
00:01:04,815 --> 00:01:07,025
which is dependent on the next
instruction.

17
00:01:07,025 --> 00:01:09,023
And you cannot fill any of the extra
slots.

18
00:01:09,023 --> 00:01:11,096
You're gonna end up having a lot of no ops
in your program.

19
00:01:11,096 --> 00:01:16,095
So how do we go about solving this.
Couple, couple different solutions here.

20
00:01:17,239 --> 00:01:23,107
Cydra-5 went off and added a particular
instruction, which is a single operation

21
00:01:23,107 --> 00:01:27,098
instruction.
And used much smaller amount of memory.

22
00:01:27,098 --> 00:01:33,021
So this would effectively reduce your
instruction cache pressure.

23
00:01:34,004 --> 00:01:38,186
There was compressed formats in memory, so
you can hold your instruction sequences in

24
00:01:38,186 --> 00:01:42,053
memory, and then when you go and pull it
into your instruction cache, it would

25
00:01:42,053 --> 00:01:46,070
expand to the fully expanded version.
And this is kind of like a compression

26
00:01:46,070 --> 00:01:49,028
algorithm.
You're basically, compressing your VLIW

27
00:01:49,028 --> 00:01:52,055
instruction sequences.
Because otherwise these things are very,

28
00:01:52,055 --> 00:01:55,066
very large and have lots and lots of
zero's or no ops in them.

29
00:01:55,066 --> 00:01:59,083
Because, the probability you're gonna
fill, sort of twenty instructions per

30
00:01:59,083 --> 00:02:02,901
cycle, is low, but you want that
capability there, in case you'd actually

31
00:02:02,901 --> 00:02:06,006
do want to execute twenty instructions per
cycle.

32
00:02:06,870 --> 00:02:15,029
Another solution here is what's actually
used in the Intel Itanium or IA64

33
00:02:15,029 --> 00:02:22,009
processor line and also the TI DSP's, the
C6000 series.

34
00:02:22,009 --> 00:02:29,041
They, they have the ability to effectively
mark where a bundle starts and stops.

35
00:02:29,041 --> 00:02:34,097
So it's something sort of between a,
variable length instruction and a fixed

36
00:02:34,097 --> 00:02:40,041
format, VLIW instruction.
So we'll talk more later in today's class

37
00:02:40,041 --> 00:02:43,511
about how IA64 or Intel's Itanium handles
this.

38
00:02:43,511 --> 00:02:49,332
But roughly what happens is, they have a
fixed instruction format and you can fit

39
00:02:49,332 --> 00:02:52,951
sort of, three instructions per bundle, or
what they call a bundle.

40
00:02:52,951 --> 00:02:57,438
And then, if you wanna go over that, you
can keep going over that, and they have

41
00:02:57,438 --> 00:03:01,936
special bits in their instruction where it
says, oh, well this is gonna express some

42
00:03:01,936 --> 00:03:06,000
more parallelism later.
So you need to look at the next bundle,

43
00:03:06,000 --> 00:03:09,792
and try to fetch that bundle, and that's
all can be executed parallel.

44
00:03:09,792 --> 00:03:14,053
So it's still a, a parallel execution
semantics, but the formats are not

45
00:03:14,053 --> 00:03:17,278
necessarily fixed.
And this is, sort of, outside the bound of

46
00:03:17,278 --> 00:03:23,006
classical VLIW now it's going to a little
bit more modified VLIWs, or, or more

47
00:03:23,006 --> 00:03:27,006
contemporary VLIW's.
So today, we're gonna as I said, talk

48
00:03:27,006 --> 00:03:32,062
about how to go about getting all the
performance that out of order superscalars

49
00:03:32,062 --> 00:03:37,534
can get, but in the context of a VLIW.
So we're gonna slowly build up all the

50
00:03:37,534 --> 00:03:42,884
different techniques that out of order
superscalars have, and all the ways that

51
00:03:42,884 --> 00:03:46,191
they can get instruction level
parallelism.

52
00:03:46,191 --> 00:03:51,857
And then apply that inside of VLIW or how
we make small changes to classical VLIW's

53
00:03:51,857 --> 00:03:57,032
to get a lot of the performance back.
Okay, so let's talk about the first, the

54
00:03:57,032 --> 00:04:02,037
first trick here.
One of the questions that comes up or the

55
00:04:02,037 --> 00:04:08,662
problems that comes up here is if you have
branches, and you mis-predict.

56
00:04:08,662 --> 00:04:13,482
This can limit your instruction level
parallelism in a VLIW.

57
00:04:13,482 --> 00:04:17,532
Okay, so why is this worse than in a out
of order superscalar?

58
00:04:17,532 --> 00:04:23,303
Well an out of order superscalar can
dynamically schedule around branch miss

59
00:04:23,303 --> 00:04:26,834
predicts.
So if your branch miss predict you can

60
00:04:26,834 --> 00:04:31,946
basically schedule other code, or
speculatively schedule code depending on

61
00:04:31,946 --> 00:04:35,823
your prediction.
But on something like a VLIW processor,

62
00:04:35,823 --> 00:04:40,301
that's a lot harder to do because you
don't have a dynamic scheduling engine

63
00:04:40,301 --> 00:04:42,990
behind there.
The compiler had to go do all that

64
00:04:42,990 --> 00:04:48,700
scheduling statically at compile time.
So how do we, how do we go about fixing

65
00:04:48,700 --> 00:04:51,514
this?
Well, one solution is just to eliminate

66
00:04:51,514 --> 00:04:56,011
all the hard to predict branches.
So you take out the branches that might be

67
00:04:56,011 --> 00:04:59,043
hard to predict.
So your branch predictor is trying really

68
00:04:59,043 --> 00:05:03,085
hard, you could still have a branch
predictor in your VLIW processor, but some

69
00:05:03,085 --> 00:05:07,057
of these things, you don't know.
It's like a data dependent branch.

70
00:05:07,057 --> 00:05:10,004
How do you know which way it's going to
go.

71
00:05:10,004 --> 00:05:14,032
And, you know, this is a problem even for
superscalars, but at least in a

72
00:05:14,032 --> 00:05:19,022
superscalar, you can try to sort of back
filled instructions, or try to, try to do

73
00:05:19,022 --> 00:05:23,081
something, or move around, if you will,
loads or long laid instructions around

74
00:05:23,081 --> 00:05:25,236
branches.
So in a super, in a out-of-order

75
00:05:25,236 --> 00:05:29,043
superscalar you can potentially move a
load beyond a branch.

76
00:05:29,043 --> 00:05:34,027
But in a VLIW processor, that's not easy
to do, because compiler has to make that

77
00:05:34,027 --> 00:05:38,044
decision, whether that branch is gonna
predict take in or not take in.

78
00:05:38,044 --> 00:05:42,006
And it may, that may change over the
lifetime of the program.

79
00:05:42,006 --> 00:05:51,006
And it may also change as the depending on
the sort of input sets to the program.

80
00:05:51,006 --> 00:05:55,079
So it may not be something that's compiler
time even known knowable.

81
00:05:55,079 --> 00:06:01,931
Okay, so our first technique here that
we're gonna use is called predication.

82
00:06:01,931 --> 00:06:06,000
So let's, let's define what predication
is.

83
00:06:06,000 --> 00:06:14,112
Well, first of all, predication is a
technique, that allows you to basically

84
00:06:14,112 --> 00:06:20,270
change the result of a register, based on
some other register.

85
00:06:20,270 --> 00:06:27,782
And it's going to let us transform hard to
predict branches into data flow or data

86
00:06:27,782 --> 00:06:32,315
dependencies.
So we're gonna add something to our

87
00:06:32,315 --> 00:06:38,831
instruction set and it's gonna take what
looks like a small branch or short

88
00:06:38,831 --> 00:06:44,775
distance branch and we're going to
basically execute both sides of those

89
00:06:44,775 --> 00:06:49,035
branches or the taken and the non taken
code path.

90
00:06:49,035 --> 00:06:53,020
And then at the end, decide which one was
the right one, with a, something like a

91
00:06:53,020 --> 00:06:56,027
select operator from C.
So this is like the question mark colon

92
00:06:56,027 --> 00:06:59,098
operator, which no one ever uses in C.
This is sort of the equivalent of that,

93
00:06:59,098 --> 00:07:03,025
but we're gonna do it in one instruction,
or maybe two instructions.

94
00:07:03,025 --> 00:07:09,040
So let's, let's, let's take a look at how
this helps with small branch regions or

95
00:07:09,040 --> 00:07:14,015
branches which are hard to predict.
So, we're gonna introduce two

96
00:07:14,015 --> 00:07:19,573
instructions, and I also want to point
out, that predication sometimes shows up

97
00:07:19,573 --> 00:07:25,027
in superscalar's processors or a, a
sequential instruction sets also.

98
00:07:25,027 --> 00:07:30,054
So a good example of this is, two
instructions very similar to this, are

99
00:07:30,054 --> 00:07:34,085
actually in x86.
They're called c-move or conditional move.

100
00:07:34,709 --> 00:07:41,965
And we're gonna introduce conditional move
as the most basic form of predication.

101
00:07:41,965 --> 00:07:47,728
So let's look at the semantics of these
two instructions we add for a conditional

102
00:07:47,728 --> 00:07:52,929
move.
First instruction here, move zero.

103
00:07:52,929 --> 00:08:00,072
So what does move zero do?
Well, move zero test whether one of these

104
00:08:00,072 --> 00:08:07,701
input operands here, rt, is equal to zero.
And if it's equal to zero, it's going to

105
00:08:07,701 --> 00:08:13,883
move rs into the result register, the
destination register here.

106
00:08:13,883 --> 00:08:21,164
So it's gonna move rs and rt, or rs and rd
if rt is equal to zero.

107
00:08:21,164 --> 00:08:28,758
And, if rt is not equal to zero, it's just
going to leave rd alone.

108
00:08:28,758 --> 00:08:35,091
Okay, well, on first appearance is that
seems like a very simple instruction.

109
00:08:35,091 --> 00:08:37,920
It's basically just doing a copy
operation.

110
00:08:37,920 --> 00:08:42,176
It's a, it's a gated copy.
Looks a lot simpler than something like

111
00:08:42,176 --> 00:08:45,325
add or subtract.
We don't have to do any math, at least.

112
00:08:45,325 --> 00:08:49,920
There's no compare, I mean the, the
compare is easy as the equal, it's not a

113
00:08:49,920 --> 00:08:52,894
less than equals.
So it's pretty simple instruction.

114
00:08:52,894 --> 00:08:55,423
And we're also gonna introduce move not
zero.

115
00:08:55,423 --> 00:08:58,338
So it's kind of a complimentary one of
this.

116
00:08:58,338 --> 00:09:03,913
And, we're going to want this because when
we go to execute code, we're going to want

117
00:09:03,913 --> 00:09:09,915
to, let's say, have a if then else, and
we're going to have, want to execute the,

118
00:09:09,915 --> 00:09:14,799
the, the then and else at the same time or
roughly at the same time, or

119
00:09:14,799 --> 00:09:17,292
indiscriminately not depending on the
branch.

120
00:09:17,292 --> 00:09:20,521
And at the end, we're going to decide
which one was the correct one.

121
00:09:20,521 --> 00:09:25,654
And we need sort of two instructions here
to choose what was, was the positive one

122
00:09:25,654 --> 00:09:29,981
the right one, or the one where the, the
conditional value is true or false the

123
00:09:29,981 --> 00:09:32,987
right one, or the one where it's true, or
vice versa.

124
00:09:32,987 --> 00:09:37,121
So, move not zero.
Move not zero is going to do the same

125
00:09:37,121 --> 00:09:40,949
thing here.
It's going to see except the sense of the,

126
00:09:40,949 --> 00:09:44,025
of the condition code is going to be
different.

127
00:09:44,025 --> 00:09:49,425
It's going to sense if rt is not equal to
zero, and if rt is not equal to zero then

128
00:09:49,425 --> 00:09:55,095
rs is going to go into rd, else we're just
going to leave rd alone with the original

129
00:09:55,095 --> 00:09:57,027
value of rd.
Okay.

130
00:09:57,027 --> 00:10:01,086
So let's, let's walk through a, a quick
code example here.

131
00:10:01,086 --> 00:10:07,364
And see, see how this helps.
So here we have a code example, we have

132
00:10:07,364 --> 00:10:11,598
some C code.
We have non-predicated MIPS-like assembly

133
00:10:11,598 --> 00:10:14,710
code.
And then we have a MIPS-like assembly code

134
00:10:14,710 --> 00:10:18,015
here with these fancy two predication
instructions.

135
00:10:18,015 --> 00:10:22,371
And they were all three of these are doing
the same thing.

136
00:10:22,371 --> 00:10:27,899
So let's, let's look at this piece of code
here and the C code is probably the

137
00:10:27,899 --> 00:10:31,935
easiest to understand.
It's gonna start off it's gonna say if a

138
00:10:31,935 --> 00:10:36,777
is less than b, so is a condition, and
there's a then and a else.

139
00:10:36,777 --> 00:10:43,885
So what does this actually implement?
Well this is a minimization function or a

140
00:10:43,885 --> 00:10:47,524
min function.
X gets the minimum of a or b.

141
00:10:47,524 --> 00:10:53,292
If a is less than b, it gets a.
If a is greater than b, it gets b.

142
00:10:53,292 --> 00:10:59,674
So it's going to be assigned to x, the
smallest value of a and b.

143
00:10:59,674 --> 00:11:06,825
And, these sort of, if and else is, you
can see here it's pretty short, we don't

144
00:11:06,825 --> 00:11:12,503
have a whole lot of code in either the,
the then case or the else case.

145
00:11:12,503 --> 00:11:17,022
And that's gonna be important in a second,
you'll see why.

146
00:11:17,219 --> 00:11:22,583
Mainly it revolves around if you have lots
and lots of code or one of the sides of

147
00:11:22,583 --> 00:11:27,276
the branches is, is, is long, it may not
make sense to actually predicate this, cuz

148
00:11:27,276 --> 00:11:33,290
there's some cost to predication.
Okay, so here's some assembly code does a

149
00:11:33,290 --> 00:11:37,687
similar sort of thing.
Set less than, this is a comparison

150
00:11:37,687 --> 00:11:42,051
operation in MIPS.
So we'll do set less than and these are

151
00:11:42,051 --> 00:11:47,233
going to be, R2 and R3 are going to have
our two a and b values here.

152
00:11:47,233 --> 00:11:52,733
If it's less than, we're going to put it
into R1, if not this gets set to one or

153
00:11:52,733 --> 00:11:55,938
zero.
And then we have a branch zero here

154
00:11:55,938 --> 00:12:01,311
effectively otherwise this is a branch.
Okay, so branch equal if zero, it's a

155
00:12:01,311 --> 00:12:04,104
little odd.
We probably should have just put branch

156
00:12:04,104 --> 00:12:08,652
zero there instead.
But is going to check to see whether this

157
00:12:08,652 --> 00:12:12,415
value is zero or one.
And if it's the one direction it's going

158
00:12:12,415 --> 00:12:16,050
to jump to L1.
If it's the other direction it's going to

159
00:12:16,050 --> 00:12:20,464
fall through and then jump over this move.
And these two moves here are the two

160
00:12:20,464 --> 00:12:27,860
assignment operators, the x = a or x = b.
Okay, that's, that's not so bad.

161
00:12:27,860 --> 00:12:34,096
Let's analyze how long this takes.
How many cycles this will take to execute

162
00:12:34,096 --> 00:12:41,074
on a, on a sequential machine.
Okay, it was a set less than.

163
00:12:41,074 --> 00:12:44,044
It's one.
This is branch.

164
00:12:44,091 --> 00:12:52,046
So that's two.
Let's say the branch is not taken or falls

165
00:12:52,046 --> 00:12:54,098
through.
Three, four.

166
00:12:56,022 --> 00:13:00,096
And then the jump over.
So it's gonna be one, two, three

167
00:13:00,096 --> 00:13:04,247
instructions in the one, cuz there was
like, one, two, three, four instructions

168
00:13:04,247 --> 00:13:07,002
in the one case.
1,2,3,4 instructions in the one case.

169
00:13:07,002 --> 00:13:12,031
And the other case it's going to take the
branch to jump here.

170
00:13:12,031 --> 00:13:17,075
So it's going to be one, two, jump over
all the stuff, three instructions.

171
00:13:18,048 --> 00:13:20,085
Okay.
That's, that's not so bad.

172
00:13:20,085 --> 00:13:24,097
So, the one case is four and the other
case is three.

173
00:13:24,097 --> 00:13:29,016
They're different.
Now something else that makes this

174
00:13:29,016 --> 00:13:33,057
interesting is let's say, this branch is
mispredicted.

175
00:13:33,057 --> 00:13:38,067
It's a data dependent branch, a and b.
So what, what's a good thing to predict

176
00:13:38,067 --> 00:13:41,082
here?
Can our branch predictor do a good job on

177
00:13:41,082 --> 00:13:45,008
this.
Hm, I don't know.

178
00:13:45,008 --> 00:13:49,055
It depends on what you know about a and b.
If the compiler knows nothing, or the

179
00:13:49,055 --> 00:13:53,714
hardware knows nothing, you're not going
to be able to do a particularly great job.

180
00:13:53,884 --> 00:13:58,079
So let's say it's a data dependent branch
and it's 50 percent the one direction, and

181
00:13:58,079 --> 00:14:02,050
50 percent the other direction.
It's just going slow down our execution.

182
00:14:03,027 --> 00:14:06,038
Well sure.
It's going to slow down our execution,

183
00:14:06,038 --> 00:14:12,024
because all of a sudden, when we take a
branch one of the directions whichever way

184
00:14:12,024 --> 00:14:17,013
we predict the mispredict path is going to
add some mispredict penalty.

185
00:14:17,041 --> 00:14:23,058
So instead of the one path being three and
the other path being four, let's say the

186
00:14:23,058 --> 00:14:29,038
branch is predicted oh, I don't know.
Let's say the branch has predicted not

187
00:14:29,038 --> 00:14:32,058
taken.
So it predicts the fall through case.

188
00:14:33,053 --> 00:14:38,029
So, all of a sudden, the fall through case
is going to be one, two, three, four

189
00:14:38,029 --> 00:14:41,085
instructions.
And let's say we have a two cycle branch

190
00:14:41,085 --> 00:14:45,021
mispredict penalty.
So let's look at the other case.

191
00:14:45,021 --> 00:14:49,097
So the other case is going to be one, two,
and then two cycles of mispredict.

192
00:14:49,097 --> 00:14:53,026
So that gets us up to four.
Branch over, five.

193
00:14:53,026 --> 00:14:57,033
So it can be five cycles.
So all of a sudden, we have our branch

194
00:14:57,033 --> 00:15:00,210
mispredict penalty coming into the
equation here of.

195
00:15:00,210 --> 00:15:04,372
And, and, you know, two cycle branch
mispredicts are relatively benign.

196
00:15:04,372 --> 00:15:09,710
If you, if that starts to go longer, then
you really start to have problems with

197
00:15:09,710 --> 00:15:10,368
this.
Okay.

198
00:15:10,368 --> 00:15:15,330
So let's take this code sequence and look
about how to, how to predicate it.

199
00:15:15,330 --> 00:15:20,605
So if we're going to go predicate this,
we're actually going to execute, both

200
00:15:20,605 --> 00:15:29,084
sides regardless of the value of the if
clause or the conditional clause here.

201
00:15:30,027 --> 00:15:35,031
So, if we look at what we do, we've
actually restructured the code here a

202
00:15:35,031 --> 00:15:39,624
little bit.
And this is our compare, this is our if

203
00:15:39,624 --> 00:15:50,024
still up here, we have a set less then.
And then we say, if R1 is zero, move R2 to

204
00:15:50,024 --> 00:16:01,302
R4, else, just leave it alone.
And then here if, R1 is not zero move R3

205
00:16:01,302 --> 00:16:06,525
to R4.
And what's important to know here is, this

206
00:16:06,525 --> 00:16:12,435
non destructive operation, especially here
in this last instruction is really

207
00:16:12,435 --> 00:16:18,824
important, cuz if the move zero was
executed in that, copied the new value in

208
00:16:18,824 --> 00:16:22,279
R4.
And then this move not zero, or to somehow

209
00:16:22,279 --> 00:16:26,084
change R4, you lose that result from this
previous ones.

210
00:16:26,084 --> 00:16:32,034
This really depends on these operations
being non-destructive if, if their, if

211
00:16:32,034 --> 00:16:35,013
their condition is, is not true.
Okay.

212
00:16:35,013 --> 00:16:39,070
So let's, let's analyze this from a number
of instructions perspective.

213
00:16:40,005 --> 00:16:47,175
First question is, does this matter what
happens to the branch, or what happens to

214
00:16:47,175 --> 00:16:50,094
the condition?
Is it going to have different numbers of

215
00:16:50,094 --> 00:16:55,743
instructions, or different number of
cycles dependents on the, the direction of

216
00:16:55,743 --> 00:17:01,003
the, the condition.
Well, no, there's no branch, nothing's

217
00:17:01,003 --> 00:17:03,732
going to change.
So, in all cases, this is going to take

218
00:17:03,732 --> 00:17:07,038
three cycles.
So all of a sudden we transform something

219
00:17:07,038 --> 00:17:12,018
which can take, as I said if, let's say
you have a branch with probably four

220
00:17:12,018 --> 00:17:16,745
cycles or five cycles so, an average,
let's say, the branch is taking you 50

221
00:17:16,745 --> 00:17:21,003
percent of the time either the direction
we'll, we'll split the difference there,

222
00:17:21,003 --> 00:17:24,080
and we'll call it four, four and a half
cycles, on average.

223
00:17:24,080 --> 00:17:27,064
We turned it into something which takes
three.

224
00:17:27,064 --> 00:17:33,168
Well, that's a great win.
So where, where else is this useful?

225
00:17:33,435 --> 00:17:39,184
Cuz you always need to have if and else
clauses to make this work out.

226
00:17:39,184 --> 00:17:43,018
So, know, this also works good for small
code hammocks.

227
00:17:43,018 --> 00:17:49,509
So I say a small code hammock is just a if
which jumps around a small piece of code.

228
00:17:49,509 --> 00:17:55,049
If you can't predict that if very well
then it makes sense just to go execute

229
00:17:55,049 --> 00:17:59,184
effectively what's inside that code
hammock, that small piece of code, always.

230
00:17:59,184 --> 00:18:03,211
If, if, if for instance, you have two
instructions inside of a code hammock and

231
00:18:03,211 --> 00:18:07,440
your mispredict penalty is like ten cycles
and you don't know whether the, the branch

232
00:18:07,440 --> 00:18:11,328
is being taken 50 percent of the time
either way, all of a sudden your average

233
00:18:11,328 --> 00:18:14,624
mispredict penalty is going to be
something like five cycles.

234
00:18:14,624 --> 00:18:18,691
You could have just executed the code in
the middle, that would have been faster.

235
00:18:18,691 --> 00:18:21,839
You could have executed the two
instructions and then just done a

236
00:18:21,839 --> 00:18:26,988
conditional move at the end and that would
have been basically always better.

237
00:18:27,618 --> 00:18:33,098
Okay, so we have some questions here.
What if the if then else has lots of

238
00:18:33,098 --> 00:18:38,637
instructions?
Hm, so let's say we have if then else here

239
00:18:38,637 --> 00:18:45,896
but there's a thousand instructions here
and a thousand instructions there.

240
00:18:45,896 --> 00:18:51,679
Should we predicate this?
Let's say their branch mispredict finally,

241
00:18:51,679 --> 00:18:56,529
let's say five, for know, there's ten
cycles and it's 50 percent either way.

242
00:18:56,529 --> 00:19:04,791
But there's a thousand instructions in the
if the, the then clause and the else

243
00:19:04,791 --> 00:19:09,876
clause.
Well, no, that would, that would never

244
00:19:09,876 --> 00:19:13,648
make sense to do.
Because, you're basically doubling your

245
00:19:13,648 --> 00:19:19,428
code, and if there's a lot of code there
you'd be better served by just doing the

246
00:19:19,428 --> 00:19:22,639
branch.
Because you'd be taking something that

247
00:19:22,639 --> 00:19:27,584
was, could be 1000 instructions, and
turning it into always 2000 instructions.

248
00:19:27,584 --> 00:19:32,956
Or doubling the number of instructions
executing, executing, versus just adding,

249
00:19:32,956 --> 00:19:35,961
let's say, ten.
Cycles of branch [inaudible], or an

250
00:19:35,961 --> 00:19:39,701
average [inaudible] cycles of branch
[inaudible] 1,000 that you have to

251
00:19:39,701 --> 00:19:42,914
execute.
So all the sudden, you're, you're adding a

252
00:19:42,914 --> 00:19:46,072
little overhead.
So where, where this predication really

253
00:19:46,072 --> 00:19:50,576
helps is for, for small pieces of code
where you don't necessarily know the

254
00:19:50,576 --> 00:19:55,031
branch.
Outcome or can't predict the branch

255
00:19:55,031 --> 00:20:00,254
outcome very well.
Okay what if the, the code is unbalanced.

256
00:20:00,254 --> 00:20:07,008
So you have an if then and an else.
But the then clause and the else clause

257
00:20:07,008 --> 00:20:11,147
are very different in size.
So, one is three instructions, and the

258
00:20:11,147 --> 00:20:15,052
other is a thousand instructions.
Does this make sense?

259
00:20:15,052 --> 00:20:22,846
Well, it goes back to the same argument we
had before with the very large if then

260
00:20:22,846 --> 00:20:28,052
else clauses.
If the code is really unbalanced you have

261
00:20:28,052 --> 00:20:35,051
a lot in the then clause and a little bit
in the else clause or something like that.

262
00:20:36,092 --> 00:20:42,036
And let's say the branch is taking 50
percent of the time, it might make sense

263
00:20:42,036 --> 00:20:47,095
just to use the an actual branch there and
deal from the mispredict penalty cost

264
00:20:47,095 --> 00:20:53,026
verses trying to somehow predicate it.
Because in fact, what's going to happen is

265
00:20:53,026 --> 00:20:57,502
let's say you have three instructions and
a thousand instructions on the two

266
00:20:57,502 --> 00:21:02,853
different sides, you're going to be
adding, you're going to be executing if

267
00:21:02,853 --> 00:21:07,032
you predicate a thousand and three
instructions every single execution time

268
00:21:07,032 --> 00:21:12,260
plus maybe some sort of conditional moves
at the end, some overhead involved versus,

269
00:21:12,260 --> 00:21:17,261
if you have 50-50, and if you have to add
a little bit of branch overhead, penalty,

270
00:21:17,261 --> 00:21:21,773
your 50-50 is gonna 50-50 chance is gonna
say, well, I'm taking a thousand plus

271
00:21:21,773 --> 00:21:26,950
three, and we're going to add those two
things together, and take the average of

272
00:21:26,950 --> 00:21:30,084
them, basically.
So, you know, it's going to be what's the

273
00:21:30,084 --> 00:21:34,454
average of that, it's going to be like
501, or something like about 502 cycles.

274
00:21:34,454 --> 00:21:40,025
That's going to be better than always
executing a thousand cycles.

275
00:21:40,025 --> 00:21:46,092
Let's see, anything else we wanted to say
here.

276
00:21:51,239 --> 00:21:55,416
You know, one, one, one last thing I
wanted to say here before we move off to

277
00:21:55,416 --> 00:21:58,956
this slide.
Move zero and move not zero or sometimes

278
00:21:58,956 --> 00:22:03,363
called c moves, conditional move
instructions are what are called partial

279
00:22:03,363 --> 00:22:06,614
predication.
And we are now going to move on to full

280
00:22:06,614 --> 00:22:11,857
predication where we can actually put
conditions on every single instruction in

281
00:22:11,857 --> 00:22:15,477
our register every single instruction in
our ISA.

282
00:22:15,477 --> 00:22:19,150
But this is a little bit easier to do than
full predication.

283
00:22:19,150 --> 00:22:23,961
This is sort of the first step when people
typically implement and it's called

284
00:22:23,961 --> 00:22:27,941
partial predication.
And just to, just to hammer this home one

285
00:22:27,941 --> 00:22:33,775
more time, predication is really turning
what was control flow into a data flow

286
00:22:33,775 --> 00:22:40,058
operation or data operation.
Okay, so let's take a look at full

287
00:22:40,058 --> 00:22:46,077
predication.
Full predication, we're actually going to

288
00:22:46,077 --> 00:22:52,679
change the instruction set, such that all
of the instructions can be or almost all

289
00:22:52,679 --> 00:22:58,297
the instructions are executed
conditionally based on some predi,

290
00:22:58,297 --> 00:23:04,087
predicate.
If the predicate's false, the instructions

291
00:23:04,087 --> 00:23:09,428
are not going to do anything.
It's not going to have any, any side

292
00:23:09,428 --> 00:23:13,009
effect.
So let's look at a, a code sequence here.

293
00:23:13,009 --> 00:23:20,838
So this code sequence we have an if then
and else and then some code after, after

294
00:23:20,838 --> 00:23:24,706
we're done.
So here we have if.

295
00:23:24,706 --> 00:23:31,614
A little bit different this is not
actually in MIPS code, this is something

296
00:23:31,614 --> 00:23:39,359
just sort of serial code here, but it's
going to say, this is our if here; if a =

297
00:23:39,359 --> 00:23:46,098
b, then branch to b2, else fall through.
Here we're going to execute our else

298
00:23:46,098 --> 00:23:50,015
clause.
And when we're done we're gonna branch

299
00:23:50,015 --> 00:23:55,556
over the then clause.
So not, not, not anything interesting here

300
00:23:55,556 --> 00:24:00,412
for basic blocks.
I don't know if we've introduced this term

301
00:24:00,412 --> 00:24:06,548
yet in, in this class.
But I wanted to introduce the term basic

302
00:24:06,548 --> 00:24:10,620
block.
Basic block are is a, is a piece of code

303
00:24:10,620 --> 00:24:14,904
which has strictly one exit point, and one
entry point.

304
00:24:14,904 --> 00:24:20,091
So you could enter it from other places,
you jump to it from other pieces of code.

305
00:24:20,091 --> 00:24:24,533
But once you enter the piece of code
you're gonna execute the code, all the way

306
00:24:24,533 --> 00:24:27,387
to the end, and there's one place you can
leave.

307
00:24:27,387 --> 00:24:31,375
So you're going to jump to someplace else
or branch, and you can, you can fall out

308
00:24:31,375 --> 00:24:34,381
to, to two different places, but the exit
point, is only one place.

309
00:24:34,381 --> 00:24:38,399
So, we're, what we're going to is, big
differentiation here is, the piece of code

310
00:24:38,399 --> 00:24:42,344
can have for instance, two branches that,
that's not called a basic block.

311
00:24:42,344 --> 00:24:46,579
That, that's called a bigger type of block
but this is a compiler term.

312
00:24:46,579 --> 00:24:50,666
And it's just good to know.
Okay, so let's apply full predication

313
00:24:50,666 --> 00:24:53,050
here.
So, these are relatively short, code

314
00:24:53,050 --> 00:24:58,228
hammocks, two instructions here, two
instructions there and two instructions in

315
00:24:58,228 --> 00:25:02,830
the else, two instructions in the then
clause, this is like a, really great

316
00:25:02,830 --> 00:25:05,035
[inaudible] place something about
predicating.

317
00:25:06,038 --> 00:25:10,061
Okay, so lets, let's take a look at the,
the predication here.

318
00:25:10,061 --> 00:25:14,981
And I want, before, before we do that, I
wanted to just say that the, one of the

319
00:25:14,981 --> 00:25:20,332
big reasons that we are trying to
predicate on a VLIW, in particular, is in

320
00:25:20,332 --> 00:25:25,697
a VLIW, we have lots of extra, or we could
potentially have extra parallel execution

321
00:25:25,697 --> 00:25:28,726
slots.
So, in our instruction sequence or

322
00:25:28,726 --> 00:25:34,678
instruction encoding we have extra places
to put code so it may not actually hurt us

323
00:25:34,678 --> 00:25:37,233
to, to predicate.
Because the last time when we go to

324
00:25:37,233 --> 00:25:41,017
predicate we're basically going to execute
both sides of the if, then and else.

325
00:25:41,017 --> 00:25:45,038
And it's bad if we have to execute both
sides of the if, then and else, and they

326
00:25:45,038 --> 00:25:48,064
are big, sequentially.
But, if you try to, sort of, slide them up

327
00:25:50,066 --> 00:25:52,069
next to each other, if you have extra
no-op slots or extra slots in your VLIW,

328
00:25:52,069 --> 00:25:56,031
you can actually decrease the [inaudible]
path through your program.

329
00:25:56,075 --> 00:26:01,112
So let's look at this, we're actually
going to do this in this example.

330
00:26:01,112 --> 00:26:06,534
So in this example here, we took this
piece of code and added full predication.

331
00:26:06,534 --> 00:26:12,174
So when we say full predication, we added
some extra things to these instructions.

332
00:26:12,174 --> 00:26:16,669
They look a little strange at first.
So here are instruction three, instruction

333
00:26:16,669 --> 00:26:20,834
four, instruction five, instruction six.
These are same ones that were back over

334
00:26:20,834 --> 00:26:24,762
here.
But instead, we added some extra words in

335
00:26:24,762 --> 00:26:30,812
front of them here or these, these
parentheses with p's in front of it.

336
00:26:30,812 --> 00:26:33,995
What that is, is these are predicate
registers.

337
00:26:33,995 --> 00:26:40,965
So we're going to add a second register
file into our processor where we can store

338
00:26:40,965 --> 00:26:46,510
effectively one bid values, sort of true
or false values, that are going to

339
00:26:46,510 --> 00:26:51,962
determine whether this instruction
executes or doesn't execute.

340
00:26:51,962 --> 00:26:57,218
Excuse me.
So, what this means is if p1 is one,

341
00:26:57,218 --> 00:27:04,099
execute this.
If p1 is zero, nullify this instruction

342
00:27:04,099 --> 00:27:10,068
three here.
And we have, we have a double pipe here.

343
00:27:10,068 --> 00:27:14,090
What am I trying to show with this?
What I'm trying to show this is parallel.

344
00:27:14,090 --> 00:27:19,030
This and this are executing at the same
time, because we have a very long

345
00:27:19,030 --> 00:27:23,063
instruction where processor can execute
multiple things at the same time.

346
00:27:23,063 --> 00:27:28,057
And what we've done here is actually slid
this up next to the, the then up next to

347
00:27:28,057 --> 00:27:31,065
the else.
And we are gonna execute them at the same

348
00:27:31,065 --> 00:27:33,075
time.
Dependent on these predicates.

349
00:27:33,075 --> 00:27:38,039
So this is the else block, and this is the
then block, and we're gonna execute

350
00:27:38,039 --> 00:27:41,055
concurrently.
Okay.

351
00:27:44,228 --> 00:27:50,023
Sort of, stuff before the if, stuff after
the, if then else is all done, that's sort

352
00:27:50,023 --> 00:27:53,239
of from that, that four basic block
example.

353
00:27:53,239 --> 00:27:57,451
What's this instruction?
This instruction has very interesting

354
00:27:57,451 --> 00:28:01,343
semantics.
And this is something that you, probably

355
00:28:01,343 --> 00:28:06,305
have to add if you're going to try have
full predication.

356
00:28:06,305 --> 00:28:10,504
Hm, okay so let's compare.
This is our comparison.

357
00:28:10,504 --> 00:28:13,509
If a = b and then it writes into two
outputs.

358
00:28:13,509 --> 00:28:16,559
Ooh, that looks broken, that looks wrong
somehow.

359
00:28:16,559 --> 00:28:19,834
How can we have one instruction write to
two things?

360
00:28:19,834 --> 00:28:24,402
And what is the semantics of it?
What does it write to here and here?

361
00:28:24,402 --> 00:28:29,647
Well, what this is actually doing is this
is writing the positive outcome of this

362
00:28:29,647 --> 00:28:35,090
compare to the one and the inverse, or
the, the negation of that to the other.

363
00:28:35,090 --> 00:28:40,381
And this is useful, cuz then we can go use
these predicates down here.

364
00:28:40,381 --> 00:28:45,537
We can use the, the positive one and,
let's say, the negative one, and that can

365
00:28:45,724 --> 00:28:50,594
drive whether we execute this, sort of,
else clause or the, then clause.

366
00:28:50,594 --> 00:28:56,689
So lots of instruction sets that have full
predication have either something like

367
00:28:56,689 --> 00:28:59,357
this.
This is one option, you can actually have,

368
00:28:59,357 --> 00:29:03,006
sort of, two outputs, and you load to
predicate registers.

369
00:29:03,006 --> 00:29:07,085
The other option is, when you go to en,
actually encode the instructions, you have

370
00:29:07,085 --> 00:29:11,011
a sort of a not bit.
So it denotes the predicate register.

371
00:29:11,011 --> 00:29:15,084
But when we denote the predicate register,
we also denote whether we take the

372
00:29:15,084 --> 00:29:20,051
predicate register being true as, or, or
the predicate register being false as what

373
00:29:20,051 --> 00:29:24,066
drives one of the instruction executes in
our full predication scheme.

374
00:29:25,055 --> 00:29:31,221
So in effect this is computing p and p bar
at the same time and running of two

375
00:29:31,221 --> 00:29:36,051
different registers, and then we can use
those seperately down here.

376
00:29:37,057 --> 00:29:48,978
And interesting result here in is going to
95, Scot Malky showed that you can, on

377
00:29:48,978 --> 00:29:56,017
average let's say remove 50 percent of
your branches by using full predication.

378
00:29:56,017 --> 00:29:58,090
Now, full predication is not easy to
build.

379
00:29:59,009 --> 00:30:03,006
Very few instruction sets have actually
had full predication.

380
00:30:03,200 --> 00:30:08,073
Itanium was probably the closest to a real
processor that was built that has almost

381
00:30:08,073 --> 00:30:12,037
full predication.
It doesn't have quite full predication.

382
00:30:12,225 --> 00:30:16,093
The HP PlayDoh instruction set, I think,
actually had full predication.

383
00:30:16,093 --> 00:30:21,081
I think this is what Scott Malkey was
working with, or a derivative of that.

384
00:30:22,001 --> 00:30:25,715
So you can actually have, sort of just
some people who've thought about they've

385
00:30:25,715 --> 00:30:28,052
had full predication but not all of things
are easy.

386
00:30:28,052 --> 00:30:31,088
So first thing you need to add an extra
register file for the predicates, you need

387
00:30:31,088 --> 00:30:34,001
to bypass those predicates, you need to
compute them.

388
00:30:34,001 --> 00:30:52,008
There's some overheads involved and we're
going talk a little bit more about them.