1
00:00:03,072 --> 00:00:08,033
Okay, so now we get to move on to even
more complicated processors.

2
00:00:08,093 --> 00:00:12,023
In order issue.
Or excuse me, in order front end.

3
00:00:12,023 --> 00:00:14,097
In order issue.
Out of order write back.

4
00:00:14,097 --> 00:00:19,622
And in order commit.
Okay, so this is going to solve some problems

5
00:00:19,622 --> 00:00:23,825
that we have.
The biggest problem it's going to solve,

6
00:00:23,825 --> 00:00:28,034
is it's going to solve a problem of
precise exceptions.

7
00:00:28,034 --> 00:00:33,400
We can now take exceptions all the way
at the end, because we're

8
00:00:33,400 --> 00:00:41,057
committing data in order.
So let's take a look. There should

9
00:00:41,057 --> 00:00:47,563
probably be a line there; scribble that on
your own drawings.

10
00:00:47,563 --> 00:00:53,366
Okay, so let's take a look
at some other structures we've added to

11
00:00:53,366 --> 00:00:58,212
this diagram to make life a little bit
more interesting.

12
00:00:58,212 --> 00:01:01,348
Okay, the front end looks pretty much the
same.

13
00:01:01,348 --> 00:01:05,692
We split the load and the store apart
into two separate pipes here.

14
00:01:05,692 --> 00:01:10,596
A load pipe and a store pipe.
And the store pipe is shorter because it

15
00:01:10,596 --> 00:01:13,204
just has to basically do the store, we'll
say.

16
00:01:13,204 --> 00:01:16,275
Maybe it's two stages; it doesn't
matter that much.

17
00:01:16,275 --> 00:01:19,475
It's not that material in this
drawing.

18
00:01:19,646 --> 00:01:24,877
But something interesting to look at here
is we added a bunch of extra boxes over

19
00:01:24,877 --> 00:01:27,841
here on the right side of this
foil.

20
00:01:27,841 --> 00:01:30,456
So let's define these
things.

21
00:01:30,456 --> 00:01:34,639
So we had our architectural register file,
which is the committed state of the

22
00:01:34,639 --> 00:01:39,745
processor.
And we added a second register file,

23
00:01:39,745 --> 00:01:44,815
typically called a physical register file,
or PRF.

24
00:01:44,815 --> 00:01:51,653
Sometimes people call this a future file
and you'll see that in the literature

25
00:01:51,653 --> 00:01:56,520
there are some papers published about
future files and the reason it's called

26
00:01:56,520 --> 00:02:00,591
future file is, it's basically executing
speculatively in the future.

27
00:02:00,591 --> 00:02:03,917
The values in here have not been committed
to the processor.

28
00:02:03,917 --> 00:02:08,720
They can be thrown out if you take an
exception, if a branch happens for a

29
00:02:08,720 --> 00:02:13,241
variety of reasons.
These are speculative, you're not

30
00:02:13,241 --> 00:02:18,943
guaranteed to actually keep those.
The architectural register file, though, is

31
00:02:18,943 --> 00:02:22,278
committed state.
Okay, we added, we added two other

32
00:02:22,278 --> 00:02:27,454
structures here, something we call ROB or
Re-Order Buffer.

33
00:02:27,454 --> 00:02:36,288
And we added a finished store buffer.
So let's, let's talk about the reorder

34
00:02:36,288 --> 00:02:42,007
buffer first.
So in this pipeline, we actually want

35
00:02:42,007 --> 00:02:48,522
instructions to basically execute and
write the physical register file out of

36
00:02:48,522 --> 00:02:51,589
order.
This is an out of order processor.

37
00:02:51,589 --> 00:02:56,213
We want that to happen.
We're basically making the execution and

38
00:02:56,213 --> 00:03:01,846
the write back out of order.
But, we want the commit to be in order.

39
00:03:01,846 --> 00:03:07,393
So we need some structure that is going to
guarantee that the write back, the write

40
00:03:07,393 --> 00:03:10,481
to the architectural register file happens
in order.

41
00:03:10,481 --> 00:03:16,246
And that's what the ROB is going to do.
So it's going to keep

42
00:03:16,246 --> 00:03:22,783
completed instructions.
And they can come in out of order and

43
00:03:22,783 --> 00:03:30,768
are going to leave in order.
So, things come into this out of order,

44
00:03:30,768 --> 00:03:35,891
and they go out of it in order.
And this is a reordering structure.

45
00:03:35,891 --> 00:03:42,438
It's typically a table that is sort of
written, well, we'll talk about that in a

46
00:03:42,438 --> 00:03:47,864
second. It's written in different places in
the pipe for a couple of

47
00:03:47,864 --> 00:03:53,937
different reasons, but you typically
want to keep track of the instructions in

48
00:03:53,937 --> 00:03:57,489
order somehow.
And then when you go to pull out of the

49
00:03:57,489 --> 00:04:00,670
reorder buffer, you want to pull in order
out of it.

50
00:04:00,670 --> 00:04:07,234
But the writes and the tracking of the
information that happens to it can be out

51
00:04:07,234 --> 00:04:13,049
of order.
And the other thing here is this finished

52
00:04:13,049 --> 00:04:16,379
store buffer.
The reason we have this finished store

53
00:04:16,379 --> 00:04:21,574
buffer is if we have a store operation, we
don't want to have to have the commit

54
00:04:21,574 --> 00:04:26,182
point like here so early in the pipe.
Because once you store into main memory,

55
00:04:26,182 --> 00:04:30,361
it's really hard to go get that back,
possibly even impossible, probably is

56
00:04:30,361 --> 00:04:33,117
impossible.
If you had

57
00:04:33,117 --> 00:04:37,229
the old value and you overwrite it
with the new value, the old value's

58
00:04:37,229 --> 00:04:40,465
forever gone in your main memory.
You can't get it back.

59
00:04:40,465 --> 00:04:45,139
So, the solution to that is, instead of doing
the store here, you have the store happen

60
00:04:45,139 --> 00:04:48,535
later in the pipe.
And you sort of remember what you're

61
00:04:48,535 --> 00:04:52,956
supposed to do, the address and the data
that's supposed to be happening.

62
00:04:52,956 --> 00:04:58,590
And, for anybody who cares, that
store has happened if it hits

63
00:04:58,590 --> 00:05:03,887
this finished store buffer.
So you probably need to have your loads

64
00:05:03,887 --> 00:05:08,927
check that finished store buffer with higher
priority than your cache.

65
00:05:08,927 --> 00:05:13,328
Because there could be a store living in
that location.
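
The load-side check described here can be sketched roughly as follows. This is a minimal illustration, not the lecture's actual design: it assumes a single-entry finished store buffer, whole-word addresses, and a Python dict standing in for the cache. The names `FinishedStoreBuffer`, `load`, and `drain` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class FinishedStoreBuffer:
    """Hypothetical single-entry buffer holding a store that has not reached memory yet."""
    valid: bool = False
    address: int = 0
    data: int = 0


def load(address: int, fsb: FinishedStoreBuffer, cache: Dict[int, int]) -> int:
    """Check the finished store buffer before the cache (assumed word-granular)."""
    if fsb.valid and fsb.address == address:
        return fsb.data            # the newest value is still sitting in the buffer
    return cache.get(address, 0)   # otherwise fall back to the cache model


def drain(fsb: FinishedStoreBuffer, cache: Dict[int, int]) -> None:
    """Post the buffered store to memory and free the entry."""
    if fsb.valid:
        cache[fsb.address] = fsb.data
        fsb.valid = False
```

Until `drain` runs, the cache still holds the old value, which is why the load has to give the buffer priority.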

66
00:05:13,598 --> 00:05:18,096
Okay, so those are the
structures here.

67
00:05:18,096 --> 00:05:21,386
Let's talk about where things get read and
written.

68
00:05:21,386 --> 00:05:27,128
This one is really interesting: our
architectural register file isn't read

69
00:05:27,128 --> 00:05:31,437
anywhere.
What's up with that?

70
00:05:31,437 --> 00:05:35,405
Well, we're going to use the physical
register file for all the intermediate

71
00:05:35,405 --> 00:05:38,907
values in our pipeline and the
architectural register file is only there

72
00:05:38,907 --> 00:05:41,758
if we take some sort of, let's say, branch
or interrupt.

73
00:05:41,758 --> 00:05:46,084
That's the only time we actually need to
go take this information, and we're probably

74
00:05:46,084 --> 00:05:50,510
going to dump it into the physical register
file, or dump it into the future file, when

75
00:05:50,510 --> 00:05:53,888
an interrupt happens or when a branch
mispredict happens.

76
00:05:53,888 --> 00:05:59,081
But otherwise, it doesn't have to be read.
Scoreboard is the same as usual.

77
00:05:59,081 --> 00:06:06,794
Read and written in your register fetch
stage, written at the write back stage, and

78
00:06:06,794 --> 00:06:11,949
that's no longer tracking architectural
register file registers.

79
00:06:11,949 --> 00:06:16,624
It's now tracking physical register file
registers.

80
00:06:16,624 --> 00:06:23,290
Re-order Buffer.
This one, well, a whole bunch of different

81
00:06:23,290 --> 00:06:27,811
places, it gets read and written.
Primarily, what's going to happen is when

82
00:06:27,811 --> 00:06:32,593
the instruction is issued, so goes from
the decode stage to the issue stage,

83
00:06:32,593 --> 00:06:38,011
that's going to allocate a location in the
Re-Order Buffer for the entry in the

84
00:06:38,011 --> 00:06:41,077
Re-Order Buffer.
And then at the end of the pipe, once the

85
00:06:41,077 --> 00:06:46,090
value completes we have to change some
state information in the Re-Order Buffer,

86
00:06:46,090 --> 00:06:51,072
saying, oh, the output register for a
particular instruction is now ready.

87
00:06:51,072 --> 00:06:56,073
And then, once we actually go to do the
commit, we have to basically clean that

88
00:06:56,073 --> 00:07:04,071
instruction out of the reorder buffer.
The finished store buffer is written just

89
00:07:04,071 --> 00:07:10,093
sort of at the end of the pipe here, and
cleaned when the store is actually posted to

90
00:07:10,093 --> 00:07:14,053
memory.
It's a, it's a little hard to draw this,

91
00:07:14,053 --> 00:07:18,065
but you need that information
somehow from the memory system. If you have

92
00:07:18,065 --> 00:07:23,055
a load that reads from that, it will
probably read it in either L0 or L1 in a

93
00:07:23,055 --> 00:07:27,048
sort of bypassing mode if you will.
It'll go check that structure.

94
00:07:27,048 --> 00:07:33,014
We'll talk more about that next class.
Okay so here is sort of a basic reorder

95
00:07:33,014 --> 00:07:36,026
buffer.
If you go looking in some books, they have a

96
00:07:36,026 --> 00:07:41,067
lot more data stored in the reorder buffer;
this is kind of the minimal reorder buffer you

97
00:07:41,067 --> 00:07:50,003
need for an out of order pipe.
And this reorder buffer is used to keep

98
00:07:50,003 --> 00:07:56,188
track of in order committing instructions,
but things will be put into it out of

99
00:07:56,188 --> 00:07:59,060
order.
So just, let's first talk about sort of

100
00:07:59,060 --> 00:08:02,061
the information here.
We keep track of state.

101
00:08:02,061 --> 00:08:07,020
So what do we mean by state?
So this is the state of an instruction.

102
00:08:07,020 --> 00:08:12,054
So each one of these entries here is a
different in flight instruction in the

103
00:08:12,054 --> 00:08:16,703
pipeline.
And we're actually going to store in order

104
00:08:16,703 --> 00:08:22,395
into the reorder buffer and we're going to
keep it sort of as a, as a queue.

105
00:08:22,395 --> 00:08:28,910
So in this picture here, for the state, we'll
say - means free, and P means pending

106
00:08:28,910 --> 00:08:35,045
and F means finished. I probably should not
have chosen two F words there.

107
00:08:35,045 --> 00:08:40,596
That's a little confusing.
But the newest instruction, if we have

108
00:08:40,596 --> 00:08:46,207
a new instruction execute, it's going to
end up here in this entry.

109
00:08:46,207 --> 00:08:53,240
And when an instruction commits, or
retires, it's going to remove this entry,

110
00:08:53,240 --> 00:08:58,504
the bottom entry.
So we basically have a, sort of circular

111
00:08:58,504 --> 00:09:05,216
buffer running around, with a head and a
tail pointer, sort of chasing each other

112
00:09:05,216 --> 00:09:08,376
in this data structure.
So, tail, head.

113
00:09:08,679 --> 00:09:15,825
What's interesting about this, and why is
this cool? Well, let's take a look

114
00:09:15,825 --> 00:09:21,361
at this instruction right here.
This instruction has an F, which means

115
00:09:21,361 --> 00:09:25,806
it's finished.
It's not pending in the pipe.

116
00:09:25,806 --> 00:09:32,586
It's hit the reorder buffer, the data is
stored in the physical register file, and,

117
00:09:32,586 --> 00:09:38,722
but instructions that are older than it,
these two instructions, are still

118
00:09:38,722 --> 00:09:42,358
pending in the pipe.
Let's say these two are multiplies and

119
00:09:42,358 --> 00:09:46,012
this is an add.
So this add is basically already done.

120
00:09:46,012 --> 00:09:50,603
These two instructions, which are these
long latency instructions, are still

121
00:09:50,603 --> 00:09:54,543
pending in the pipe.
In this cycle, we cannot commit anything.

122
00:09:54,543 --> 00:10:01,044
So we only commit instructions when the
oldest instruction becomes finished.

123
00:10:01,087 --> 00:10:07,049
And that's when we can commit and remove
something from the reorder buffer.
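
The head-and-tail behavior just described can be sketched as a small circular buffer. This is a minimal model under assumed names (`ReorderBuffer`, `allocate`, `finish`, `commit` are hypothetical), tracking only the state field: entries are allocated in program order, may be marked finished in any order, and leave only when the oldest entry is finished.

```python
from typing import List, Optional

FREE, PENDING, FINISHED = "-", "P", "F"


class ReorderBuffer:
    """Hypothetical minimal reorder buffer that tracks only the per-entry state."""

    def __init__(self, size: int) -> None:
        self.state: List[str] = [FREE] * size
        self.head = 0    # oldest in-flight instruction
        self.tail = 0    # next free slot for a newly issued instruction
        self.count = 0   # number of occupied entries

    def allocate(self) -> Optional[int]:
        """Allocate an entry in program order; None means the buffer is full."""
        if self.count == len(self.state):
            return None
        slot = self.tail
        self.state[slot] = PENDING
        self.tail = (self.tail + 1) % len(self.state)
        self.count += 1
        return slot

    def finish(self, slot: int) -> None:
        """Mark an entry finished; this may happen out of program order."""
        assert self.state[slot] == PENDING
        self.state[slot] = FINISHED

    def commit(self) -> Optional[int]:
        """Retire the oldest entry only if it is finished; otherwise commit nothing."""
        if self.count == 0 or self.state[self.head] != FINISHED:
            return None
        slot = self.head
        self.state[slot] = FREE
        self.head = (self.head + 1) % len(self.state)
        self.count -= 1
        return slot
```

With two multiplies allocated ahead of an add, finishing the add first leaves `commit` returning `None` until the older multiplies finish, which matches the cycle described above where nothing can commit.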

124
00:10:07,049 --> 00:10:11,008
Some other things we need to keep track of
here.

125
00:10:11,030 --> 00:10:16,035
We have a bit here S for speculative.
So what this means is if you have

126
00:10:16,035 --> 00:10:21,094
something like a branch.
You mark instructions that are newer than

127
00:10:21,094 --> 00:10:26,008
the branch with a speculative
bit.

128
00:10:26,008 --> 00:10:30,066
So what this is saying is if that branch
mispredicts, it just gives you a

129
00:10:30,066 --> 00:10:35,056
convenient place to go find all the
instructions dependent on it to go flush

130
00:10:35,056 --> 00:10:38,057
and kill.
So if you have a, if you have, let's say,

131
00:10:38,057 --> 00:10:43,155
one branch is allowed in the pipe at a
time and the branch misdpredicts, what you

132
00:10:43,155 --> 00:10:47,380
can do is basically look for all the
entries in here that have ones, and just

133
00:10:47,380 --> 00:10:50,438
invalidate them ad hoc, and just flush the
entire pipe.

134
00:10:50,438 --> 00:10:55,089
You don't have to worry about there being
some value you need to, to worry about.

135
00:10:55,089 --> 00:10:59,562
So it's just a convenient way to figure out
which instruction is speculative.

136
00:10:59,562 --> 00:11:02,721
And if the branch mispredicted what you
have to kill.
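
A rough sketch of that flush, assuming (as in the lecture's simplification) that only one branch is in flight at a time, so every entry with the speculative bit set depends on that branch. The `Entry` fields and the `flush_speculative` helper are hypothetical names for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Entry:
    """Hypothetical reorder buffer entry reduced to the fields needed for a flush."""
    name: str
    valid: bool = True
    speculative: bool = False   # the S bit: issued after an unresolved branch


def flush_speculative(rob: List[Entry]) -> List[str]:
    """On a mispredict, invalidate every entry whose S bit is set."""
    killed = []
    for entry in rob:
        if entry.valid and entry.speculative:
            entry.valid = False
            killed.append(entry.name)
    return killed
```

In hardware this loop would be a parallel clear of the valid bits rather than a sequential scan.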

137
00:11:02,721 --> 00:11:06,353
Stores.
We'll be talking about this in a few more

138
00:11:06,353 --> 00:11:09,431
slides later.
But store bit, what we're really going to

139
00:11:09,431 --> 00:11:13,799
do is this is going to say.
If this instruction is a store it knows

140
00:11:13,799 --> 00:11:19,026
that we need to do something else with it.
We need to do something with the finished

141
00:11:19,026 --> 00:11:24,063
store buffer when it gets to the end of
the pipe sort of a meeting place to put

142
00:11:24,063 --> 00:11:27,045
it.
And here is the actual business, the

143
00:11:27,045 --> 00:11:31,519
business part of the reorder buffer.
V, which says that the instruction

144
00:11:31,519 --> 00:11:37,200
actually writes a register and then
finally once the instruction goes to the

145
00:11:37,200 --> 00:11:43,709
end of the pipe, it is going to fill in a
location in here which is the physical

146
00:11:43,709 --> 00:11:48,096
register file entry that is the
destination of that value.

147
00:11:48,096 --> 00:11:54,847
So this basically allows the pipeline to
know where to go find the actual value.

148
00:11:54,847 --> 00:11:57,588
We don't actually store the actual values
in here.

149
00:11:57,588 --> 00:12:02,652
We just store a pointer into the physical
register file, because it's fewer bits.

150
00:12:02,652 --> 00:12:07,361
And this can tell us, oh, well, go
look, let's say, when this

151
00:12:07,361 --> 00:12:12,462
instruction here which is already finished
is ready to go retire or it's ready to go

152
00:12:12,637 --> 00:12:17,048
commit, go look in physical register file
number seven or something like that.

153
00:12:17,048 --> 00:12:20,037
And it goes and pulls that value out from
there.
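
The commit step described here can be sketched as follows. The reorder buffer entry holds only a pointer into the physical register file, and commit uses it to copy the value into the architectural register file. The `arch_reg` field is an assumption added for illustration; the lecture does not say how the destination architectural register is recorded.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FinishedEntry:
    """Hypothetical view of a finished reorder buffer entry at commit time."""
    writes_register: bool   # the V bit: does this instruction write a register?
    prf_index: int          # pointer into the physical register file
    arch_reg: int           # assumed field: destination architectural register


def commit_entry(entry: FinishedEntry, prf: List[int], arf: List[int]) -> None:
    """Copy the speculative value into committed state, if there is one."""
    if entry.writes_register:
        arf[entry.arch_reg] = prf[entry.prf_index]
```

Storing `prf_index` rather than the value itself keeps each entry a few bits wide, which is the reason the lecture gives for using a pointer.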

154
00:12:22,089 --> 00:12:31,003
So a good discussion of this is in the
Shen and Lipasti book, which is sort of

155
00:12:31,003 --> 00:12:34,561
supplementary for this class.
Okay.

156
00:12:34,561 --> 00:12:40,413
So, let's talk about... does anyone
actually have questions first, before we

157
00:12:40,413 --> 00:12:44,920
move on, 'cuz the reorder buffer is a key
data structure here, and it's a

158
00:12:44,920 --> 00:12:48,403
complicated one.
Okay great.

159
00:12:48,403 --> 00:12:53,545
The next structure we added was the
finished store buffer.

160
00:12:53,545 --> 00:12:58,055
And this could actually be multiple
entries but for this pipe let's say

161
00:12:58,055 --> 00:13:01,465
there's only one.
So we are only allowed to have one store

162
00:13:01,465 --> 00:13:05,024
pending in this pipeline cuz it makes life
a little bit easier.

163
00:13:05,024 --> 00:13:09,796
Things you sort of need to actually have
here is you need to have both the address

164
00:13:09,796 --> 00:13:15,011
and the data, and whether it's valid.
Probably the opcode; the opcode will tell

165
00:13:15,011 --> 00:13:19,035
you if it's store byte, store word, sort of
data width types of things.

166
00:13:19,035 --> 00:13:25,495
And that's most of what I wanted to say
here. This is what I was

167
00:13:25,495 --> 00:13:32,471
saying before: if you allow multiple loads
and stores in the pipe at the same time,

168
00:13:32,471 --> 00:13:39,045
you're going to have to bypass from the
finished store buffer to the loads.

169
00:13:39,045 --> 00:13:42,043
And possibly to stores, if they have to be
write combined.

170
00:13:42,043 --> 00:13:46,060
So, if you store to different parts
of a word, you may have to bypass that,

171
00:13:46,060 --> 00:13:50,050
depending on how the pipe works.
Or, you can assume that there is only one

172
00:13:50,050 --> 00:13:54,077
memory instruction valid in the pipe at a
time, you'll have one of these entries,

173
00:13:54,077 --> 00:13:57,022
and no loads can happen while a store
happens.

174
00:13:57,022 --> 00:14:01,012
That's not very good performance.
People probably would not have actually

175
00:14:01,012 --> 00:14:03,095
built that, but that's, that's something
to think about.

176
00:14:03,095 --> 00:14:06,003
Okay.
So, now we get some more pipeline

177
00:14:06,003 --> 00:14:09,599
diagrams, more of those pipeline diagrams.
And we're going to see how this is

178
00:14:09,599 --> 00:14:13,046
different, and, and what happens in the
reorder buffer.

179
00:14:14,304 --> 00:14:20,013
First thing I wanted to say here is this
little r that you see show up in

180
00:14:20,013 --> 00:14:24,049
these diagrams.
That means that we've written the reorder

181
00:14:24,049 --> 00:14:29,031
buffer, but we're not ready to commit.
So from here to there, we basically have

182
00:14:29,031 --> 00:14:34,044
this add has written the reorder buffer,
we're waiting for it to commit at the end

183
00:14:34,044 --> 00:14:37,063
of the pipe.
But we can only commit in order, so you

184
00:14:37,063 --> 00:14:40,075
can sort of see these Cs are all lined up
in time.

185
00:14:40,075 --> 00:14:46,020
So we're only able to commit from left to
right and we can't reorder those, those Cs

186
00:14:46,020 --> 00:14:53,009
relative to another C.
Let's see, what did I want to say here.

187
00:14:53,009 --> 00:14:59,410
That was the main thing. The dependency is
the same, it's the same code we've looked

188
00:14:59,410 --> 00:15:04,737
at before.
That's what I wanted to say.

189
00:15:04,737 --> 00:15:08,044
Which one is that?
Yes, it's this one.

190
00:15:09,079 --> 00:15:15,052
Okay, so here we have this add, which writes
register twelve.

191
00:15:15,055 --> 00:15:22,096
Right there.
This add goes and reads register twelve, so

192
00:15:22,096 --> 00:15:29,461
we have a read after write happening.
What's interesting here, is this read

193
00:15:29,461 --> 00:15:34,091
after write that's happening.
The write happens there.

194
00:15:34,091 --> 00:15:41,014
The read happens, let's say here.
That data isn't in a bypass anywhere; it's

195
00:15:41,014 --> 00:15:44,039
not in the forwarding logic of the
processor.

196
00:15:45,060 --> 00:15:52,022
That value is actually in the physical
register file.

197
00:15:52,022 --> 00:15:56,064
So this is kinda showing an example here
that data, when you're doing the bypass,

198
00:15:56,064 --> 00:16:01,022
can come from bypass network locations, it
can come from the physical register file,

199
00:16:01,192 --> 00:16:04,031
and those are sort of the two places it
can come from.

200
00:16:04,031 --> 00:16:06,951
But everything else
actually in here,

201
00:16:06,951 --> 00:16:11,363
surprisingly, is basically coming from
bypass, except for that one location.

202
00:16:11,363 --> 00:16:15,974
So bypasses end up being really important.
But you can have data coming from the

203
00:16:15,974 --> 00:16:23,519
physical register file.
So could the C be here? Could this C move

204
00:16:23,519 --> 00:16:25,669
over one?
So, it commits in order.

205
00:16:25,669 --> 00:16:30,666
And we can only commit
one thing at a time in, in this basic

206
00:16:30,666 --> 00:16:33,171
pipe.
More complex pipes, we're going to allow

207
00:16:33,171 --> 00:16:37,033
multiple commits at the same time.
When we start to mix superscalar with

208
00:16:37,033 --> 00:16:41,322
out of order, at the end of today's talk,
we're going to be able to think about

209
00:16:41,322 --> 00:16:44,248
trying to commit multiple things at the
same time.

210
00:16:44,248 --> 00:16:47,909
But we can't really commit out of order.
So this has to be monotonically going that

211
00:16:47,909 --> 00:16:50,974
way.
Brief example here.

212
00:16:50,974 --> 00:16:56,189
This is kinda, kinda fun.
This is trying to show different entries

213
00:16:56,189 --> 00:17:00,706
into the reorder buffer and when those
things get allocated.

214
00:17:00,706 --> 00:17:07,158
And largely what's going to happen is for
a destination, so let's say instruction

215
00:17:07,158 --> 00:17:11,505
zero here allocates the reorder buffer and
R1 becomes active.

216
00:17:11,505 --> 00:17:16,672
And it's a long latency multiply.
The circles

217
00:17:16,672 --> 00:17:19,764
here mean that the instruction is
finished.

218
00:17:19,764 --> 00:17:24,157
It's gone to the end of the pipe, and it's
ready to go.

219
00:17:24,157 --> 00:17:30,470
You could have other things, like this is
an add that happens to register eleven.

220
00:17:30,473 --> 00:17:34,988
It allocates, it finishes early, but it
doesn't commit till late.

221
00:17:34,988 --> 00:17:40,532
So it has to stay in the reorder buffer.
So it takes up space in the reorder

222
00:17:40,532 --> 00:17:43,221
buffer.
And, you can sort of see other examples

223
00:17:43,221 --> 00:17:46,126
that these adds here finish
relatively quickly.

224
00:17:46,126 --> 00:17:49,211
But they have to wait to
commit in order.

225
00:17:49,211 --> 00:17:54,227
And they're basically dependent on this
instruction here, committing before they

226
00:17:54,227 --> 00:17:58,588
can go commit.
So it's a nice little structure that can

227
00:17:58,588 --> 00:18:07,894
track all those things.
Okay let's look at commit points and if

228
00:18:07,894 --> 00:18:15,043
exceptions occur.
We are going to have the same example

229
00:18:15,043 --> 00:18:23,151
we had before.
The mul here is going along and it writes

230
00:18:23,151 --> 00:18:29,041
back to the physical register file.
Now, you'll say, woah, it wrote the

231
00:18:29,041 --> 00:18:32,023
register file.
How can it take an exception at this

232
00:18:32,023 --> 00:18:34,077
point.
If it was going to take an exception, it wasn't

233
00:18:34,077 --> 00:18:38,048
supposed to write the register file, but
we have two register files.

234
00:18:38,048 --> 00:18:41,908
So, it writes the speculative state
register file or the future file or the

235
00:18:42,079 --> 00:18:46,000
physical register file.
And this slash here means we don't

236
00:18:46,000 --> 00:18:49,081
actually commit that instruction.
So, commit doesn't, it doesn't happen.

237
00:18:51,018 --> 00:18:56,031
Now we get to go look at, sort of other
in-flight instructions to see what's

238
00:18:56,031 --> 00:19:00,078
going, what's going on here.
Can these other in-flight instructions

239
00:19:00,078 --> 00:19:05,072
potentially write information out of order
relative to where the current commit point would be?

240
00:19:06,070 --> 00:19:12,477
Well, here's this add that before, in the
previous example, wrote to the register

241
00:19:12,477 --> 00:19:17,285
file, and now it writes to the physical
register file, but does not write the

242
00:19:17,285 --> 00:19:21,515
architectural register file.
Instead, it enters the reorder buffer

243
00:19:21,515 --> 00:19:26,402
here, denoted by the little r, and just
sits there until it actually gets the

244
00:19:26,402 --> 00:19:31,638
chance to commit in order.
But, that doesn't get a chance to commit

245
00:19:31,638 --> 00:19:34,749
because a previous instruction kills
it.

246
00:19:34,749 --> 00:19:39,579
It kills it because it takes an exception and
kills everything.

247
00:19:39,579 --> 00:19:44,510
And then you can go and start some new
instruction here.

248
00:19:44,510 --> 00:19:51,421
Let's say that is the exception handler.
And fetch that, you know, out here.

249
00:19:51,678 --> 00:19:59,094
One interesting thing about this example
actually that I want to say is, sort of in

250
00:19:59,094 --> 00:20:05,012
this transition, lots of stuff, lots of
state has to change in the machine.

251
00:20:05,012 --> 00:20:10,487
You've taken an exception, the
architectural register file is correct,

252
00:20:10,487 --> 00:20:16,081
the physical register file potentially has
many incorrect information, er, many

253
00:20:16,081 --> 00:20:20,778
incorrect values in it.
So, on this transition, what's really

254
00:20:20,778 --> 00:20:24,973
going to happen is you're going to copy
all of the state of the architectural

255
00:20:24,973 --> 00:20:29,119
register file, all registers, over on top of
the physical register file.

256
00:20:29,119 --> 00:20:33,827
So you basically roll back all of your
speculative state in machine, in one fell

257
00:20:33,827 --> 00:20:37,623
swoop.
Obviously that can maybe be a little

258
00:20:37,623 --> 00:20:39,954
expensive.
But you don't take exceptions that often.

259
00:20:39,954 --> 00:20:42,321
You do take branches relatively
often.

260
00:20:42,321 --> 00:20:45,533
We'll talk about that in a second.
But that's logically

261
00:20:45,533 --> 00:20:48,156
what's happening.
Sometimes people will actually co-mingle

262
00:20:48,156 --> 00:20:51,175
the architectural register file and the
physical register file.

263
00:20:51,175 --> 00:20:54,440
And they just sort of keep pointers to
different pieces of information.

264
00:20:54,440 --> 00:20:58,404
So you don't actually have to sort of roll
back information, you just sort of change

265
00:20:58,404 --> 00:21:00,981
the pointers.
But for right now, let's model it as two

266
00:21:00,981 --> 00:21:06,029
completely separate register files, where you
copy all the state from the architectural

267
00:21:06,029 --> 00:21:12,113
register file to the physical register
file on some form of roll back on an

268
00:21:12,113 --> 00:21:15,451
exception or a branch.
Branches.

269
00:21:15,451 --> 00:21:18,784
So, how do we make the
branch latency better?

270
00:21:18,784 --> 00:21:22,200
What do we do on a branch, first of
all?

271
00:21:22,200 --> 00:21:24,490
So, sort of ignore these bottom examples
here.

272
00:21:24,490 --> 00:21:30,436
This is a different code sequence than we
have looked at; it's not the multiply, add,

273
00:21:30,436 --> 00:21:33,725
multiply, add code sequence.
Instead this is a branch.

274
00:21:33,725 --> 00:21:37,677
So, we have a branch.
The branch commits.

275
00:21:37,677 --> 00:21:42,835
We know the branch is good.
But these instructions here, are the fall

276
00:21:42,835 --> 00:21:46,384
through case for the branch.
This instruction here is the target for

277
00:21:46,384 --> 00:21:48,442
the branch.
So, we need to squash all these

278
00:21:48,442 --> 00:21:52,054
instructions in the reorder buffer.
Conveniently we have a bit in the reorder

279
00:21:52,054 --> 00:21:56,085
buffer that says all the things that were
dependent on the branch, if the branch is

280
00:21:56,085 --> 00:21:59,935
misspeculated, just remove them from the
reorder buffer and basically throw

281
00:21:59,935 --> 00:22:04,068
everything out of our reorder, or throw
those entries out of the reorder buffer,

282
00:22:04,068 --> 00:22:09,698
invalidate them in the reorder buffer.
What gets a little interesting here is

283
00:22:09,698 --> 00:22:16,311
when do we start to execute the target?
Well, let's say we compute the branch

284
00:22:16,311 --> 00:22:23,607
information here in the execute stage, and
we can sort of redirect the fetch stage.

285
00:22:23,607 --> 00:22:27,548
That's okay, but the squash is a little
bit odd.

286
00:22:27,548 --> 00:22:32,484
Because what this really says, from a
pipeline perspective, is that you have to

287
00:22:32,484 --> 00:22:36,407
invalidate multiple entries in the reorder
buffer in one cycle.

288
00:22:36,407 --> 00:22:41,157
And this, to some extent, is a structural
hazard on the reorder buffer.

289
00:22:41,157 --> 00:22:46,450
You might need, you know, many, many ports
into that reorder

290
00:22:46,450 --> 00:22:51,976
buffer, or you need to at least keep the
valid bits in some other extremely highly

291
00:22:51,976 --> 00:22:57,904
ported structure.
You could think about doing something even

292
00:22:57,904 --> 00:23:01,553
more interesting where you kill
instructions early.

293
00:23:01,553 --> 00:23:08,469
So the difference between this picture and
this picture is once we compute and figure

294
00:23:08,469 --> 00:23:13,478
out that the branch is taken, we just
instantaneously squash all these

295
00:23:13,478 --> 00:23:16,521
instructions, and we change the re-order
buffer.

296
00:23:16,521 --> 00:23:21,616
Or we, we write to the reorder buffer,
killing all the speculative instructions.

297
00:23:21,616 --> 00:23:25,701
Now if you note, this doesn't actually
help performance in this case.

298
00:23:25,701 --> 00:23:30,485
Places where this can help performance is
if you have an out of order processor

299
00:23:30,677 --> 00:23:35,432
that's also a superscalar processor.
You could think about trying to put

300
00:23:35,432 --> 00:23:40,550
other instructions in these locations in
the pipe or try to restart earlier or have

301
00:23:40,550 --> 00:23:45,210
other things go on in the pipe and you're
just using less resources in the pipe.

302
00:23:45,210 --> 00:23:49,478
So this is gonna be the highest
performance case, this is, sort of, going

303
00:23:49,478 --> 00:23:53,446
to be medium performance.
Low performance, you can have a way that

304
00:23:53,446 --> 00:23:57,533
you don't actually have to add extra ports
to your reorder buffer.

305
00:23:57,533 --> 00:24:02,406
And the way you can do that is you let the
in-flight instructions that are dead

306
00:24:02,406 --> 00:24:08,098
continue going down the pipe until they
get to the commit stage and only then you

307
00:24:08,098 --> 00:24:13,533
clean them out of the pipe.
And you clean out the reorder buffer.

308
00:24:13,533 --> 00:24:18,384
So, you, sort of, are waiting for these
speculative instructions to reach the commit

309
00:24:18,384 --> 00:24:22,136
stage and squash them there.
In this example the performance of all

310
00:24:22,136 --> 00:24:26,244
three of these are the same, but I will
say, this is going to be the lowest

311
00:24:26,244 --> 00:24:31,702
performance if you have a more complicated
code sequence, cuz you are basically using

312
00:24:31,702 --> 00:24:36,304
up a lot of pipeline resources, you're
using entries in the reorder buffer,

313
00:24:36,304 --> 00:24:41,775
you're using locations in the pipes that
you could try to reuse for something else.

314
00:24:41,775 --> 00:24:45,966
Okay.
So as we said, we sort of have these three

315
00:24:45,966 --> 00:24:55,493
different cases, in increasing complexity
but you get some performance.

316
00:24:55,493 --> 00:25:03,323
I'm sorry, in decreasing complexity but
also decreasing performance.

317
00:25:03,323 --> 00:25:07,018
So, so, I think one thing that definitely
comes up.

318
00:25:07,018 --> 00:25:12,051
And this is probably going to make this
multi-ported issue come up, is if you have

319
00:25:12,051 --> 00:25:15,046
multiple branches in the pipe at the same
time.

320
00:25:16,007 --> 00:25:20,079
Then the simple case of just moving the
top pointer is not really going to work

321
00:25:20,079 --> 00:25:25,075
because you might mispredict one of the
branches but not the other branch, that's

322
00:25:25,075 --> 00:25:31,050
going to mess you up a little bit.
Okay.
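
The lowest-performance recovery option described above can be sketched in a few lines. This is only a toy model under stated assumptions: the names RobEntry, mark_younger_dead and commit_one are hypothetical, the reorder buffer is modeled as a simple queue, and marking entries dead is done in one loop for brevity rather than the way real hardware would do it.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class RobEntry:
    name: str           # hypothetical instruction label
    dead: bool = False  # set once an older branch is known to be mispredicted


def mark_younger_dead(rob, branch_index):
    # Flag every entry younger than the mispredicted branch; nothing is removed yet.
    for position, entry in enumerate(rob):
        if position > branch_index:
            entry.dead = True


def commit_one(rob, committed):
    # Retire the oldest entry in order; dead entries are squashed here, not earlier.
    entry = rob.popleft()
    if not entry.dead:
        committed.append(entry.name)


rob = deque(RobEntry(n) for n in ["add", "beq", "sub", "lw", "or"])
mark_younger_dead(rob, branch_index=1)  # assume "beq" was mispredicted
entries_still_held = len(rob)           # dead entries keep occupying the buffer

committed = []
while rob:
    commit_one(rob, committed)
print(committed)
```

The dead entries stay in the buffer until they reach the head, which is the cost the lecture points out: they use up reorder buffer entries that could have gone to useful work.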

323
00:25:31,050 --> 00:25:45,011
So, let's keep moving on here: avoiding
stalls due to store misses.

324
00:25:45,066 --> 00:25:53,504
Okay, so you've got a store in the pipe.
It takes a cache miss, and now it's

325
00:25:53,504 --> 00:25:56,808
clogging up the commit point of the
processor.

326
00:25:56,808 --> 00:26:02,864
Because, depending on how you want to look
at this, maybe you don't want to commit

327
00:26:02,864 --> 00:26:08,439
until that store has actually reached main
memory, cuz that's where you're going to

328
00:26:08,439 --> 00:26:13,024
call commit for that store.
So you can actually pull it out of the

329
00:26:13,251 --> 00:26:18,086
future store buffer, because it's able to
commit. You

330
00:26:18,086 --> 00:26:22,348
try to pull it out of the future store
buffer and write it to main memory.

331
00:26:22,348 --> 00:26:27,316
Or you write it to your cache, and it
misses your cache and takes a couple of

332
00:26:27,316 --> 00:26:29,981
extra cycles.
So we'll see it like this: here's a store

333
00:26:29,981 --> 00:26:34,467
word and let's say it takes a few extra
cycles here, three extra cycles stalling

334
00:26:34,467 --> 00:26:37,880
to actually go and write the level two
cache we'll say.

335
00:26:37,880 --> 00:26:42,052
Or pull in the data from the level two
cache into the L1 cache and merge there.

336
00:26:42,052 --> 00:26:47,094
So there's, there's a way to solve that.
And, and what, what's bad about this, is

337
00:26:47,094 --> 00:26:52,095
because we're doing in-order commit, it
pushes out the rest of these instructions

338
00:26:52,095 --> 00:26:55,029
later.
And that, that's kind of bad.

339
00:26:55,029 --> 00:27:00,738
So, what you can think about doing is
adding an extra stage in the pipe and just

340
00:27:00,738 --> 00:27:06,942
allowing the store to miss and basically
moving past the commit stage as if this store

341
00:27:06,942 --> 00:27:11,017
has committed.
You, sort of, mark it down and say, well

342
00:27:11,017 --> 00:27:15,005
it's committed, I don't have to worry
about this anymore.

343
00:27:15,005 --> 00:27:20,053
And you basically can decouple the end of
the pipe here from the store actually

344
00:27:20,053 --> 00:27:25,066
happening to memory until later.
And all you do is, you just have commit in

345
00:27:25,066 --> 00:27:30,059
order.
You can pull back these things earlier.

346
00:27:31,423 --> 00:27:40,271
This looks like a typo; this should
probably be back one. And then you

347
00:27:40,271 --> 00:27:45,331
can commit in order, and have that store
sort of still outstanding out to main

348
00:27:45,331 --> 00:27:48,071
memory.
One important thing you need to do here,

349
00:27:48,071 --> 00:27:52,681
as I've said before, if you let
another load into the pipe, or a store

350
00:27:52,681 --> 00:27:58,254
into the pipe, you're going to have to
bypass out of this data structure, and

351
00:27:58,254 --> 00:28:03,240
that data structure now, back to the load
stage of the pipe or the store stage of

352
00:28:03,240 --> 00:28:06,059
the pipe.
And that adds extra wires into your

353
00:28:06,059 --> 00:28:12,068
processor.
But we've basically decoupled store commit

354
00:28:12,068 --> 00:28:18,091
from the write to memory. It's technically committed once it
gets past this point.

355
00:28:18,091 --> 00:28:24,663
But it's not in main memory.
But to everyone else and to the

356
00:28:24,663 --> 00:28:27,661
processor, it looks like it's been
committed.

357
00:28:27,661 --> 00:28:33,038
Because if you try to go read the value,
it looks like it's committed.
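
The decoupling of store commit from the memory write can be sketched as follows. This is only an illustration under stated assumptions: PostCommitStoreBuffer and its methods are hypothetical names, memory is a plain dictionary, and the cache miss is modeled simply as the store waiting in a list until it is drained.

```python
class PostCommitStoreBuffer:
    """Toy model: stores count as committed once they enter this buffer."""

    def __init__(self, memory):
        self.memory = memory  # dict of address -> value, standing in for cache/memory
        self.pending = []     # committed stores not yet written out, oldest first

    def commit_store(self, addr, value):
        # The store leaves the commit point right away, even if the write would miss.
        self.pending.append((addr, value))

    def load(self, addr):
        # Bypass: the youngest pending store to this address wins over memory.
        for pending_addr, value in reversed(self.pending):
            if pending_addr == addr:
                return value
        return self.memory.get(addr, 0)

    def drain_one(self):
        # Later, when the miss is resolved, the oldest store reaches memory.
        if self.pending:
            addr, value = self.pending.pop(0)
            self.memory[addr] = value


memory = {0x100: 1}
buffer = PostCommitStoreBuffer(memory)
buffer.commit_store(0x100, 42)

value_seen = buffer.load(0x100)  # the load sees the committed store via bypass
memory_before = memory[0x100]    # memory itself still holds the old value
buffer.drain_one()
memory_after = memory[0x100]     # the store has now actually reached memory
```

The search in load stands in for the extra bypass wires mentioned above: every later load has to check the outstanding stores before it can trust what the cache or memory returns.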