Okay, so: very long instruction word processors. We still want high performance, but we took all the hardware that was in an out-of-order superscalar and threw it out. Well, something has to make up for it. And what makes up for it is a very, very smart compiler. So we put a lot of emphasis on the compiler in these sorts of architectures. The compiler really has to do the scheduling. It has to do all of the dependency checking. It has to avoid all the different data hazards, and we're just getting started: next lecture we're going to talk mostly about all the different optimizations a compiler can do to try to approximate what an out-of-order superscalar does, but statically. It's a pretty cool trick: you take all that hardware, you put it in the compiler, and you run it once. Then, every time you go to execute the code, you don't have to recalculate all the dependencies. Sounds good. Okay.

So let's see how we execute some code here, and what the performance aspects are of executing loop code on a very long instruction word processor. Here we have a very basic array increment. We're going to take every element of this array and increment it by the value C. We run it through our compiler, and here's the sequential code sequence; this has not been scheduled yet for our VLIW architecture over here. So: we load the value, we increment our counters, we do the floating-point add, we store the value back, we increment the array index, and then we loop. Seems simple enough.
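For reference, here is a minimal C sketch of the loop being discussed; the names x, c, and N are illustrative, not taken from the slide.

    /* Add a constant to every element of an array: one load, one
       floating-point add, and one store per element, plus loop overhead. */
    void add_const(float *x, float c, long N) {
        for (long i = 0; i < N; i++)
            x[i] = x[i] + c;
    }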
Let's see how this gets scheduled. Because the compiler knows about the latencies in this architecture (let's say this load has a few cycles of latency), it's going to schedule the add later. And the add is a floating-point add, which has a couple of cycles of latency of its own, so the compiler schedules the consumer of that result, the instruction that uses F2, later still. The array-index and counter additions you can sprinkle in more or less anywhere, since the schedule has lots of open slots; let's say they're scheduled at the same time as the load and the store. So this is pretty cool: we're actually executing two instructions wide in parallel here, and we didn't need all the extra overhead of an out-of-order superscalar. Oh, and we can just put the branch somewhere.

Okay. First question: how many floating-point operations per cycle are we doing? Are we doing well here? It looks pretty poor. We have one floating-point operation, just this add, and we have one, two, three, four, five, six, seven, eight cycles. So we're getting 0.125 floating-point operations per cycle. That's not great. We are executing three instructions on our best cycle, which is better than nothing, but we're not really using the machine very well. An out-of-order superscalar would probably take instructions from the iterations below this one, intermix them, reorder a bunch of things, and run faster.

So what's a technique to go faster? As I said, we put a lot of emphasis on compilers in this unit, and one of the things the compiler can do is unroll the loop. So here we have our loop; we unroll it four times, and now we take the loop overhead and factor it out, so that it only happens once every four iterations. That sounds good: we do more work per trip around the loop. But things are a little more complicated. What happens if N, the uppercase N that is our terminating value, is not a multiple of four? Well, we need to do something about that. Before we execute the unrolled body, we need to check that enough iterations remain (if N is big, we can run the unrolled body for a long time), and on the last iteration we'll have to clean up. So we need to generate some cleanup code, and the compiler is responsible for doing this. These compiler optimizations do take some effort.
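Here is a sketch of what unrolling by four looks like at the source level, including the cleanup loop the compiler must generate when N is not a multiple of four (same illustrative names as before):

    void add_const_unrolled(float *x, float c, long N) {
        long i = 0;
        /* Main unrolled loop: four elements per trip, so the loop
           overhead (index update and branch) is paid once per four adds. */
        for (; i + 3 < N; i += 4) {
            x[i]     = x[i]     + c;
            x[i + 1] = x[i + 1] + c;
            x[i + 2] = x[i + 2] + c;
            x[i + 3] = x[i + 3] + c;
        }
        /* Cleanup code for the final N % 4 elements. */
        for (; i < N; i++)
            x[i] = x[i] + c;
    }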
Okay, so let's look at the scheduling of our loop-unrolled code. We can do a bunch of loads up front: we've intermingled these loop iterations, and what's kind of cool is that the loads get pulled out and thrown to the top, the stores get pushed down to the bottom, all the adds go somewhere in the middle, and the index updates get sprinkled in wherever they fit. When we go to actually schedule that, we do something similar: the loads issue first, then we execute the floating-point adds, then the floating-point stores of the results. And what you'll notice here is that we're starting to get some overlap. Because we've unrolled, we can overlap this load with the first floating-point addition, since we've effectively covered the latency of our functional units by filling that time with other loop iterations. If you compare this schedule with the schedule back here, we've simply taken those dead cycles and put other loop iterations into them.

In this loop-unrolled case, we're not incrementing the counters and indexes by four anymore. We're incrementing by the unroll factor times the element size, so we're incrementing by sixteen now. Does that make sense? In the earlier code we incremented R2 by four, because a single value is four bytes, so we had to move our array index over by four. But now, because we're batching up all this work together, we have to move the index by a bigger value: four, because we've unrolled four times, times the size of the data value, which is four. So we move it by sixteen. And one of the nice things here is that in both the loads and the stores, we're using a register-plus-offset addressing mode to fold in the offsets. So we might offset by twelve from the base register R1 to figure out where we're actually loading from. It's just a convenient way to avoid computing a bunch of addresses.
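The addressing trick can be sketched in C as well, assuming 4-byte floats: one base pointer is bumped by sixteen bytes per trip, and the four accesses use the fixed offsets 0, 4, 8, and 12 from that base, mirroring register-plus-offset addressing.

    void add_const_offsets(float *x, float c, long N) {
        char *p   = (char *)x;                  /* base register, counted in bytes */
        char *end = (char *)(x + (N & ~3L));    /* stop at a multiple of four      */
        for (; p < end; p += 16) {              /* one index update per 4 elements */
            *(float *)(p + 0)  += c;            /* offset 0 from the base          */
            *(float *)(p + 4)  += c;            /* offset 4                        */
            *(float *)(p + 8)  += c;            /* offset 8                        */
            *(float *)(p + 12) += c;            /* offset 12                       */
        }
        /* (Cleanup for the final N % 4 elements omitted, as before.) */
    }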
Okay, so going back here, we can see we're starting to overlap actual operations with other loop iterations. That's really cool; we're starting to get some performance. So let's look at the performance, and I'll ask the same question: how many floating-point operations per cycle? Hopefully it's higher. One, two, three, four operations, divided by one, two, three, four, five, six, seven, eight, nine, ten, eleven cycles. Okay, so that's 0.36, which is a lot better than 0.125. This is good; loop unrolling is helping us. But is this everything, or could we do more in our compiler? Well, the compiler people came up with an even fancier idea, called software pipelining.

It uses a term we've seen before, pipelining, but does it in software. The idea is that instead of just having one loop, unrolling it, and overlapping the iterations inside the unrolled body, we're going to take multiple copies of that schedule and interleave them, to try to fill some of the empty slots. So let's look at this. We have the code unrolled four times, the same piece of code from the previous slide, and we draw the schedule we had before, shown here in purple. Okay, that doesn't look so bad. But now we schedule it together with another iteration of the exact same four-times-unrolled loop, shown here in green. So we've just overlapped this with another iteration of the loop. Are we done? Not quite; we still have some open spots. So let's overlap yet another iteration, shown here in red.

Now, the fix-up code you need to make this correct gets more complex, because all of a sudden you're overlapping multiple iterations. But as long as you don't speculatively modify some value, as long as you don't do a speculative store, you're probably okay, because you're only doing extra loads, extra work. You're doing extra work and filling slots, betting that nothing goes wrong, that the trip count N, if you will, is a large multiple of four and that you're not at the end.

So let's put some names to these things. We call the run-up at the beginning the prologue. Here in the middle we have the actual steady-state iterations; you can see (sorry, this is in green, which doesn't show up very well) that there are instructions here, there are adds, and it's pretty full. We're doing a lot of work on our machine here. And the epilogue is when we're done, when we're falling out on the last iteration of the outer loop, if you will.

So let's do some math and look at the performance of this. Same question: how many floating-point operations per cycle? We look over here: we have one, two, three, four operations, and we have four cycles in our tight steady-state loop. That's one floating-point operation per cycle.
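Here is a rough source-level sketch of software pipelining, ignoring the unrolling for clarity and assuming N >= 3: the steady-state body stores iteration i, adds for iteration i+1, and loads for iteration i+2, so the load and add latencies are hidden by work from neighboring iterations.

    void add_const_swp(float *x, float c, long N) {
        /* Prologue: start iterations 0 and 1 before entering steady state. */
        float v0 = x[0] + c;       /* iteration 0: loaded and added */
        float v1 = x[1];           /* iteration 1: loaded only      */
        for (long i = 0; i < N - 2; i++) {
            x[i] = v0;             /* store for iteration i         */
            v0 = v1 + c;           /* add for iteration i + 1       */
            v1 = x[i + 2];         /* load for iteration i + 2      */
        }
        /* Epilogue: drain the last two in-flight iterations. */
        x[N - 2] = v0;
        x[N - 1] = v1 + c;
    }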
That looks pretty good. That's cool; we just got a bunch of performance. But we had to do a lot of compiler optimization to make this work: to fill the machine, we overlapped three different iterations of a loop that we had also already transformed in software by unrolling it four times. So this is called software pipelining.

And there's a nice picture here to show what's going on visually. We put time on the horizontal axis, and activity, roughly how many instructions are executing, on the vertical axis, shown here as performance. When you just run the loop-unrolled iterations back to back, each one has a start-up ramp, a stretch where you're actually in the loop, and then a ramp down. That's worse than paying the start-up and wind-down once, with a long, full loop-iteration portion in the middle. When we look at the software-pipelined version, we can overlap one iteration's wind-down with the next iteration's start-up: we execute our prologue once, execute many iterations back to back in the tight steady state, and then run our epilogue once at the end. So software pipelining pays its start-up and wind-down costs once per execution of the loop, not once per iteration. So that's fun. That's cool. We're getting performance.

If only the world were dense loops, life would be easy. Alas, the world is not all loops. If we had a processor that only did array calculations, and all the problems in the world were dense array computations, life would be really easy. But they're not. A lot of the time, code has lots of branches; it has if-then-else clauses. Here we graphically show something like an if-then-else: some piece of code makes a decision and executes the code on the left or the code on the right, based on an if statement. So this is the if-true clause, and this is the else clause.
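In C, the shape in question is just an ordinary conditional whose direction depends on a loaded value; the names here are made up for illustration.

    /* Which side executes depends on the data in xi, so a static
       scheduler cannot know the branch direction at compile time. */
    float select(float xi, float threshold, float a, float b) {
        if (xi > threshold)
            return xi * a;     /* then-clause */
        else
            return xi + b;     /* else-clause */
    }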
Data-dependent branches like these are typically a problem for very long instruction word processors. Now, why is that? Hm. Well, in an out-of-order processor, you can try to execute code around the branches, moving instructions above the branch and below the branch. But if you're doing static scheduling, when you hit this branch, and let's say it's a hard-to-predict branch, you can't really do anything, because you've packed a bunch of instructions next to each other and they need to execute atomically. So it's hard to move code up and down across that branch. A superscalar can do that, because it has its instruction window and a bunch of hardware techniques that let it. But in our VLIW processor, that's a problem.

So I want to introduce an important piece of compiler nomenclature for this class: the basic block. What is a basic block? A basic block is a piece of code which has a single entry and a single exit. So this is a basic block: it has one entry and one exit. Why is single entry important? If you can jump into the middle of a piece of code, the compiler cannot necessarily reorder the instructions inside that block. And if you have multiple exits, let's say you can exit here, the compiler can't push instructions below that exit point. But if you have a basic block, the compiler knows that this instruction sequence is going to execute effectively atomically (not actually atomically; other things can be going on inside the machine), so from the compiler's perspective it can reorder the instructions within the block to get better performance.
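A small illustration of how code breaks into basic blocks, with the block boundaries marked in comments (the function itself is made up):

    int f(int a, int b) {
        int t = a * 2;        /* block B0: entry, ends at the branch    */
        if (t > b)
            t = t - b;        /* block B1: single entry, single exit    */
        else
            t = t + b;        /* block B2: single entry, single exit    */
        return t;             /* block B3: the join point after the if  */
    }

Within any one of B0 through B3 the compiler may reorder freely; across the branch and the join point it cannot, at least not without the tricks that follow.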
211 00:18:17,412 --> 00:18:22,889 So, let's say profiling, this is not a something hardware is doing at run time, 212 00:18:22,889 --> 00:18:27,421 this is something that you do with the program while you are sort of still back 213 00:18:27,421 --> 00:18:31,443 in the compiler stage. You take the program, the compiler goes 214 00:18:31,443 --> 00:18:35,895 and runs it on some input given, given input set and comes up with the 215 00:18:35,895 --> 00:18:41,300 probabilities of which way things go. And then what you do is, you come up, you 216 00:18:41,300 --> 00:18:47,253 take this profile information and you come up with some guess at what is the most 217 00:18:47,253 --> 00:18:51,501 probable one. And we're going to circle that here. 218 00:18:51,501 --> 00:18:56,084 And say. These darkened edges are the most probable 219 00:18:56,084 --> 00:19:01,070 sort of path through the squirrelly piece of code given this is the entry point. 220 00:19:03,007 --> 00:19:09,035 Now, this doesn't mean that you can't have branches that sort of branch out of this, 221 00:19:09,035 --> 00:19:14,612 but if you do you need to have some fix-up code 'cause what we're about to do is 222 00:19:14,612 --> 00:19:18,082 we're gonna take this entire sort of, big [inaudible] of code here. 223 00:19:18,082 --> 00:19:25,082 We're gonna remove all of the branches and we're gonna schedule for our VLIW 224 00:19:25,082 --> 00:19:29,054 processor as one big monolithic piece of chunk code. 225 00:19:29,054 --> 00:19:34,033 And by doing this we can move instructions, let's say, that are down 226 00:19:34,033 --> 00:19:39,731 here, which this opens last to executing your early portion of this codes sequence, 227 00:19:39,731 --> 00:19:43,908 we can move them up. And likewise we can move things that use 228 00:19:43,908 --> 00:19:50,150 the resolve of long latency instructions up here and push it down across branches. 229 00:19:50,150 --> 00:19:56,586 So our out-of-order superscalar does this with branch speculation, but our compiler 230 00:19:56,586 --> 00:20:00,431 can do this on our VLIW processor using trace schedule. 231 00:20:00,431 --> 00:20:05,968 But when do this, which be careful because there's always a possibility that while 232 00:20:05,968 --> 00:20:09,038 unlikely you can still branch the other way. 233 00:20:09,038 --> 00:20:15,936 So typically, the way this is done is you have some form of fix-up code that you 234 00:20:15,936 --> 00:20:21,640 branch away, you have to sort of fix up anything that was after the branch that 235 00:20:21,640 --> 00:20:26,408 made a committed change if you will, to the, the processor state. 236 00:20:26,408 --> 00:20:31,475 And you sort of roll that back somehow. So, we're basically in software, doing the 237 00:20:31,475 --> 00:20:36,527 rollback case from number our out-of-order superscalar. 238 00:20:36,527 --> 00:20:41,044 So instead of taking the, architectural register file and copying it to the 239 00:20:41,044 --> 00:20:46,625 physical register file on branch mispredict, instead our compiler generates 240 00:20:46,625 --> 00:20:51,634 a code sequence which does that same operation if you were to branch away 241 00:20:51,634 --> 00:20:54,976 there. And we'll roll back, only the only the 242 00:20:54,976 --> 00:20:59,150 certain register that needs to be rolled back and only the memory state that needs 243 00:20:59,150 --> 00:21:01,757 to be rolled back. 
So that's pretty cool: we can basically take all the functionality that was done in our out-of-order superscalar and put it in software, using trace scheduling.