1
00:00:03,046 --> 00:00:07,027
In-order fetch.
So, we have an in-order frontend and we

2
00:00:07,027 --> 00:00:10,067
have an out-of-order issue, writeback and
commit.

3
00:00:12,001 --> 00:00:18,280
So, what this is going to try and do is
we're going to try and fix some of the

4
00:00:18,280 --> 00:00:25,770
problems that we saw, oops, here, where
we'll actually had, let's say this

5
00:00:25,770 --> 00:00:33,173
instruction here, this add is waiting to
issue because our issue was in order.

6
00:00:33,173 --> 00:00:39,960
So, that instruction could issue.
All of its inputs are ready.

7
00:00:40,168 --> 00:00:43,432
R11 was written right there, or something
like that.

8
00:00:43,432 --> 00:00:45,789
So, R11 is ready.
It's ready to issue.

9
00:00:45,789 --> 00:00:51,496
But because we have in-order issue, by
definition, we can't issue out-of-order.

10
00:00:51,496 --> 00:00:55,934
So, now let's look at a machine where we
can issue out-of-order.

11
00:00:55,934 --> 00:01:01,106
So, we actually move it to the execute
units, do our register fetch,

12
00:01:01,106 --> 00:01:05,574
out-of-order.
So, to do that, we actually have to add

13
00:01:05,574 --> 00:01:10,642
another data structure, and this data
structure we're going to call the issue

14
00:01:10,642 --> 00:01:14,007
queue.
And this is going to be something like an

15
00:01:14,007 --> 00:01:17,491
associative FIFO, but it's not going to be
FIFO, we're going to basically going to

16
00:01:17,491 --> 00:01:21,078
store instructions into it, and then we're
going to, we're going to put instructions

17
00:01:21,078 --> 00:01:24,757
into it in order, because our fetch is in
order, or frontend is in order.

18
00:01:24,757 --> 00:01:27,455
But we're going to pull out of it out of
order.

19
00:01:27,455 --> 00:01:31,965
And, but we have to be a little careful,
because when we go to pull out of that

20
00:01:31,965 --> 00:01:34,283
data structure, we have to make sure that
the.

21
00:01:34,283 --> 00:01:37,609
All of the appropriate registers are, are
ready.

22
00:01:37,609 --> 00:01:43,855
So, all of our dependencies are ready or
at least somewhere in the bypass or we can

23
00:01:43,855 --> 00:01:48,147
go pick it off the bypass by the time it
actually needs the value.

24
00:01:48,147 --> 00:01:54,255
Or in our, in our bypass stage.
Let's take a look here.

25
00:01:54,255 --> 00:02:00,003
We still have architectural register file.
We got rid of all that complex reorder

26
00:02:00,003 --> 00:02:03,400
buffer stuff on this example.
So, we don't have a reorder buffer.

27
00:02:03,400 --> 00:02:07,022
We don't have a store buffer.
We have out-of-order commit.

28
00:02:07,022 --> 00:02:10,848
So, we're going to have the same problems
we had before with precise exceptions in

29
00:02:10,848 --> 00:02:15,749
this pipe but we, it's going to give us
sort of easy introduction to understanding

30
00:02:15,749 --> 00:02:21,658
what the issue queue looks like.
So, the issue queue gets written when

31
00:02:21,658 --> 00:02:26,017
instructions enter it.
Sort of at the decode stage.

32
00:02:26,017 --> 00:02:31,065
This is, itself is a register structure.
So, it's like, flip flops in here and a

33
00:02:31,065 --> 00:02:36,028
bunch of logic and stuff.
So, you can just go from this stage to

34
00:02:36,028 --> 00:02:40,019
this stage in the pipe without having any,
registers.

35
00:02:40,019 --> 00:02:45,977
So it gets read in there and things are
basically going to update into this.

36
00:02:45,977 --> 00:02:52,192
So, when things actually end up in the
architecture register file, we're going to

37
00:02:52,192 --> 00:02:58,279
mark bits in the issue queue, saying, oh,
that register you were waiting on, it's

38
00:02:58,279 --> 00:03:01,459
now ready.
And if you get, let's say, two registers

39
00:03:01,459 --> 00:03:05,403
that are ready and you're dependent on two
registers, you can issue.

40
00:03:05,403 --> 00:03:11,416
And we can issue out-of-order.
Okay.

41
00:03:11,416 --> 00:03:14,036
So, I have read and write here in the, the
issue stage.

42
00:03:14,036 --> 00:03:18,015
Why do I, why do I have that?
Well, when you actually go to issue it you

43
00:03:18,015 --> 00:03:22,032
basically want to mark that instruction
as, oh, yeah, I actually did issue this

44
00:03:22,032 --> 00:03:24,820
instruction.
You want to, sometimes people build this

45
00:03:24,820 --> 00:03:27,736
as actually remove the instruction from
the issue queue.

46
00:03:27,889 --> 00:03:32,264
In this basic case, we're gonna leave it
in the issue queue until the end of the

47
00:03:32,264 --> 00:03:36,427
pipe because we need some way to track
the, the liveness of the instruction.

48
00:03:36,427 --> 00:03:41,250
But the, the next processor we're
basically gonna think of it as sort of a.

49
00:03:41,250 --> 00:03:47,029
I, I won't say FIFO, because it's not
strictly first in first out, but it's,

50
00:03:47,029 --> 00:03:52,771
it's a data structure where you can put
stuff in and then remove stuff out of.

51
00:03:52,771 --> 00:03:59,009
So, it's a buffer or a multi entry buffer.
Okay, let's, let's take a look inside the

52
00:03:59,009 --> 00:04:02,046
issue queue.
And this is assuming something like MIPS.

53
00:04:02,046 --> 00:04:05,075
So, in our issue queue here we're going to
have.

54
00:04:07,105 --> 00:04:09,633
First of all, we're going to actually
going to have the Opcode, because we're

55
00:04:09,633 --> 00:04:13,033
going to have multiple instructions sort
of ganging up in here.

56
00:04:13,033 --> 00:04:16,760
So, it's possible it could be like, I
don't know, in this case one, two, three,

57
00:04:16,760 --> 00:04:19,023
four, five.
There's five instructions just sort of

58
00:04:19,023 --> 00:04:22,227
sitting in this, in this data structure.
So, just sort of slacking.

59
00:04:22,227 --> 00:04:26,061
There's a buffer and a slack in the front
of your pipe, effectively.

60
00:04:26,061 --> 00:04:30,577
You might need the immediate's values,
because that's sort of part of the Opcode,

61
00:04:30,577 --> 00:04:35,431
or part of the instruction, and then, then
we're going to basically have things for

62
00:04:35,431 --> 00:04:41,548
the three different registers.
And what were going to put in here is we

63
00:04:41,548 --> 00:04:44,265
are going to put the register identifiers,
so.

64
00:04:44,265 --> 00:04:49,465
Here let's look, let's look at the sources
first, because that makes a little more

65
00:04:49,465 --> 00:04:52,031
sense.
The V bit is going to say that this

66
00:04:52,031 --> 00:04:59,070
instruction needs the source.
So, it's possible that, as we said, you

67
00:04:59,070 --> 00:05:03,493
know, there's some like an immediate
instructions don't read both source

68
00:05:03,493 --> 00:05:07,004
operands.
So, for an immediate, only one, probably

69
00:05:07,004 --> 00:05:13,606
this valid bit is going to be set and this
one is going to be zero.

70
00:05:13,606 --> 00:05:22,247
The P bit is pending.
So, what that means is somewhere, later in

71
00:05:22,247 --> 00:05:27,411
the pipe there is a in-flight instruction
that writes to that register.

72
00:05:27,411 --> 00:05:31,978
So, we're going to basically track that
with a bit here saying that it, that iis

73
00:05:31,978 --> 00:05:35,097
still pending.
Now, the reason we actually have to keep

74
00:05:35,097 --> 00:05:41,062
that pending information and the validity
informations are separate and why need to

75
00:05:41,062 --> 00:05:46,080
keep both of them is it's possible that
multiple instructions could light up

76
00:05:46,080 --> 00:05:52,050
simultaneously as being ready to go.
So, if you have, let's say, two

77
00:05:52,050 --> 00:05:58,091
instructions that are both dependent only
on register five and register five is not

78
00:05:58,091 --> 00:06:02,042
ready, it's, it's the destination of a
multiply.

79
00:06:03,057 --> 00:06:07,770
But then register five gets ridden.
It's going to actually clear the pending

80
00:06:07,770 --> 00:06:11,172
bits.
So, it's no longer pending and it might be

81
00:06:11,172 --> 00:06:14,958
in multiple places.
So, it might be like here and here or

82
00:06:14,958 --> 00:06:18,568
maybe even there.
It's going to clear the pending bits, and

83
00:06:18,568 --> 00:06:23,891
whoever is trying to read register five,
and then, because we're only let's say

84
00:06:23,891 --> 00:06:27,011
going to have single issue on this
processor.

85
00:06:27,011 --> 00:06:32,083
You can only pull one thing out at a time.
So, you need to sort of pull one thing out

86
00:06:32,083 --> 00:06:38,082
and you need someway to leave the other
instruction in the issue queue having all

87
00:06:38,082 --> 00:06:42,016
of it's operands ready.
So, that's how you do that.

88
00:06:42,043 --> 00:06:45,068
Okay, so how do we figure out if an
instruction is ready?

89
00:06:48,030 --> 00:06:55,852
Well, we see if the if the, if it actually
uses it, that the operand, uses the,

90
00:06:55,852 --> 00:07:02,148
excuse me, the, particular upper and
identifier and we see if it's pending.

91
00:07:02,148 --> 00:07:06,959
And then we have to check the other, other
for, for two inputs.

92
00:07:06,959 --> 00:07:10,942
We need to check the two inputs, and then
we also have to make sure there's no

93
00:07:10,942 --> 00:07:15,051
structural hazard somewhere else in the
pipe, like if we're trying to schedule

94
00:07:15,051 --> 00:07:17,721
write, write ports.
And we're going to use the scoreboard to

95
00:07:17,721 --> 00:07:20,061
do that.
The other thing is if we want high

96
00:07:20,061 --> 00:07:24,372
performance, we probably don't want to
have to wait for values to get to the end

97
00:07:24,372 --> 00:07:26,883
of the pipe.
So, in reality, this, this, this is even

98
00:07:26,883 --> 00:07:30,885
going to get more complicated because it's
going to be, these things plus information

99
00:07:30,885 --> 00:07:34,799
coming from the scoreboard which is when,
when a particular register identifier is

100
00:07:34,799 --> 00:07:37,522
ready or when a particular register is
ready.

101
00:07:37,708 --> 00:07:42,417
Let's talk about the destination here.
This just tracks whether an instruction

102
00:07:42,417 --> 00:07:46,316
writes a destination or not.
And like I said, in these, in these basic

103
00:07:46,316 --> 00:07:50,082
pipelines, we're going to leave
instructions in the instruction queue

104
00:07:50,082 --> 00:07:54,469
until they get to the end of the pipe.
And when it gets to the end of the pipe,

105
00:07:54,469 --> 00:07:59,772
this is where we go to check to see, what
other locations we need to, sort of, clear

106
00:07:59,772 --> 00:08:02,080
out.
So, if there's an instruction, let's say,

107
00:08:02,080 --> 00:08:06,413
sitting here, which has the valid bits
set, and its range of register five is the

108
00:08:06,413 --> 00:08:10,776
destination.
When that instruction commits, we're going

109
00:08:10,776 --> 00:08:16,003
to clear it out of, the issue queue.
And, we're going to say, oh well it wrote

110
00:08:16,003 --> 00:08:19,687
register five.
Let's scan through all these other places

111
00:08:19,687 --> 00:08:22,445
and look for places that say register
five.

112
00:08:22,445 --> 00:08:28,101
And if they say register five, we are
going to flip the pending bid from pending

113
00:08:28,101 --> 00:08:33,664
to not pending anymore and we're going to
remove the instruction from the issue

114
00:08:33,664 --> 00:08:37,048
queue.
Okay, so, centralized versus distributed

115
00:08:37,048 --> 00:08:41,048
issue queues.
In a, in a, in a sort of logical sense, in

116
00:08:41,048 --> 00:08:46,138
a perfect world it's probably nice to have
a big centralized issue queue.

117
00:08:46,348 --> 00:08:51,662
You can scan over all the instructions.
You don't have to sort of look around, but

118
00:08:51,662 --> 00:08:56,267
this can sometimes be harder to implement,
because you have to put all of your

119
00:08:56,267 --> 00:09:00,352
instructions in one location.
And sometimes they don't necessarily sort

120
00:09:00,352 --> 00:09:03,624
of cross communicate very well.
Like floating points units.

121
00:09:03,624 --> 00:09:07,939
They have floating point registers versus
integer register files is one type.

122
00:09:07,939 --> 00:09:11,035
They don't necessarily need a whole lot of
communication.

123
00:09:11,035 --> 00:09:16,450
So, you could have distributed instruction
queues where you sort of steer let's say

124
00:09:16,450 --> 00:09:21,058
here the execute or ALU and memory ops to
one of these for these functional units.

125
00:09:21,058 --> 00:09:24,950
They have a different instruction queue
just for, multiplies or something like

126
00:09:24,950 --> 00:09:27,861
that.
I'm going to, put a paper on the website

127
00:09:27,861 --> 00:09:32,034
for you guys all to read.
It's Tomasulo algorithm.

128
00:09:32,034 --> 00:09:37,033
It's a very famous paper, and in that they
talk about distributed instruction queues.

129
00:09:37,033 --> 00:09:42,014
Strangely enough, sort of, the first place
in literature that these instruction

130
00:09:42,014 --> 00:09:46,083
queues showed up was around that time, or
in that paper, and they actually go

131
00:09:46,083 --> 00:09:51,040
straight to the distributed version and
kind of skip the centralized version,

132
00:09:51,040 --> 00:09:56,052
which I always found a little bit odd but
lots of people build centralized ones

133
00:09:56,069 --> 00:09:59,051
today.
One of the, one of the reasons that the I

134
00:09:59,051 --> 00:10:04,233
think the Tomasulo algorithm people went
for this distributing one first is because

135
00:10:04,233 --> 00:10:08,318
they were actually implementing it in
multiple discreet chips.

136
00:10:08,318 --> 00:10:12,222
So they had a issue cube per chip.
So, they had a floating point chip and a

137
00:10:12,222 --> 00:10:14,826
integer chip and they steered the
instructions.

138
00:10:14,826 --> 00:10:19,539
They didn't wanna have to have one data
structure that cross, cross two chips.

139
00:10:19,539 --> 00:10:21,712
Today we have, you know, lots of
integration.

140
00:10:21,712 --> 00:10:24,098
So, we don't have to worry about that as
much.

141
00:10:26,039 --> 00:10:31,827
Okay, so, I just wanted to briefly say,
this is a question that came up last time,

142
00:10:32,047 --> 00:10:34,170
and scoreboards.
If you have to worry about

143
00:10:34,170 --> 00:10:38,083
write-after-write hazards in the pipe.
The things we are talking about today, we

144
00:10:38,083 --> 00:10:41,820
didn't have to really worry about that.
But this is something to talk about next

145
00:10:41,820 --> 00:10:45,001
time, we're gonna have to think about
better scoreboards.

146
00:10:45,001 --> 00:10:48,068
It looks the same as the previous
scoreboard, but you might need to keep the

147
00:10:48,068 --> 00:10:52,059
functional unit, or you need to keep the
functional unit number, or some bits that

148
00:10:52,059 --> 00:10:56,022
represent the functional unit, and then
those march down the pipe versus 1s

149
00:10:56,022 --> 00:10:59,011
marching down the pipe.
You have, let's say, different numbers,

150
00:10:59,011 --> 00:11:02,035
like one, two, three, or you have other
things marching down the pipe.

151
00:11:02,035 --> 00:11:05,039
But that's only if you have to track
write-after-write hazards.

152
00:11:05,039 --> 00:11:06,017
Okay.
So.

153
00:11:06,017 --> 00:11:10,918
That was a quick aside.
Let's look at this in order, fetch,

154
00:11:10,918 --> 00:11:18,325
out-of-order issue, out-of-order,
writeback and out-of-order commit

155
00:11:18,325 --> 00:11:23,877
processor in a pipeline diagram.
And I'm going to show a new little thing

156
00:11:23,877 --> 00:11:28,032
in our pipeline diagram here, we have a
lowercase i.

157
00:11:28,032 --> 00:11:32,010
Which means the instruction enters the
issue queue.

158
00:11:32,039 --> 00:11:37,000
And then, when it exits the issue queue,
it, it just starts going down into the

159
00:11:37,000 --> 00:11:42,005
issue stage of the pipe.
And, trying to think where at, where am I?

160
00:11:42,005 --> 00:11:48,649
This is not that complicated of a drawing.
Here I'm actually showing the, the issue

161
00:11:48,649 --> 00:11:51,071
queue.
What is in the issue queue.

162
00:11:51,071 --> 00:11:54,648
So I have two, or excuse me, three issue
queue slots.

163
00:11:54,648 --> 00:11:57,071
Because, this is a relatively simplistic
pipe.

164
00:11:58,044 --> 00:12:07,098
The, this first instruction which goes to
execute register two and register three

165
00:12:07,098 --> 00:12:11,047
are just already ready, so we don't have
to worry about it.

166
00:12:11,047 --> 00:12:19,026
But an example of something that's more
interesting, let's say, is, I don't know.

167
00:12:19,026 --> 00:12:21,041
Where is it going.
Here we go.

168
00:12:22,025 --> 00:12:27,080
This will enter the issue queue.
But register twelve does not become ready

169
00:12:27,080 --> 00:12:31,072
until late.
So, once that becomes, comes ready, we can

170
00:12:31,072 --> 00:12:37,019
basically start pulling things out of the
issue queue, and, and issue that

171
00:12:37,019 --> 00:12:39,080
instruction.
So that, that shows up.

172
00:12:39,080 --> 00:12:44,073
Let's see if we can show that here,
register, what is this, fourteen, twelve?

173
00:12:44,080 --> 00:12:50,065
Okay.
So this one here, we're waiting for, that

174
00:12:50,065 --> 00:12:52,367
to happen.
Let's see.

175
00:12:52,367 --> 00:13:02,591
Yeah, we're waiting for no, where is it
twelve, just add.

176
00:13:02,591 --> 00:13:19,044
[inaudible] this long.
Wrong.

177
00:13:20,546 --> 00:13:40,401
Yes.
That is actually is what I want to show

178
00:13:40,401 --> 00:13:42,056
here.
This is interesting.

179
00:13:43,512 --> 00:13:50,065
That value actually becomes ready, early.
But.

180
00:13:51,003 --> 00:13:59,012
Is that what's actually happening here,
the value becomes ready at the end of this

181
00:13:59,012 --> 00:14:02,024
execute stage.
But it can't go to issue.

182
00:14:02,077 --> 00:14:05,058
At that cycle because there's something
else issuing.

183
00:14:05,058 --> 00:14:09,243
So, it has to be pushed out.
So, we're only doing a single issue per

184
00:14:09,243 --> 00:14:12,005
cycle.
If we had multiple issues, you could think

185
00:14:12,005 --> 00:14:14,044
of something more interesting happening
here.

186
00:14:24,086 --> 00:14:30,034
Yeah, so this is actually a really
complicated case because if we pull this

187
00:14:30,034 --> 00:14:34,016
back let's say to here we get a writeback
hazard.

188
00:14:34,016 --> 00:14:39,065
The W would conflict with the other W.
If we try to issue here, sorry, that was a

189
00:14:39,065 --> 00:14:42,097
reissue here.
If we try to issue, here we get a

190
00:14:42,097 --> 00:14:48,017
structural hazard on the issue.
So, it actually gets pushed way out there.

191
00:14:48,038 --> 00:14:53,008
So, it's kind of a, kind of a bummer.
So, sometimes you just lose.

192
00:14:58,053 --> 00:15:08,041
Okay, so, here's a interesting case.
So, let's assume that all the instructions

193
00:15:08,041 --> 00:15:12,092
are preloaded into the issue queue.
So, showing this, we have fetch, decode,

194
00:15:12,092 --> 00:15:17,086
issue, issue, issue, and basically this
thing's just sort of sitting in the issue

195
00:15:17,086 --> 00:15:20,067
queue and then we start looking.
What happens?

196
00:15:20,096 --> 00:15:24,030
Those the performance get better than the
previous example?

197
00:15:24,030 --> 00:15:28,067
So, interesting enough, even you if you
sort of issue everything into the pipe

198
00:15:28,067 --> 00:15:31,185
early and then just sort of fill up your
issue queue.

199
00:15:31,185 --> 00:15:35,876
Pulling out of the issue queue can still
be limiter and the number of ALUs can

200
00:15:35,876 --> 00:15:39,020
still be a limiter.
So, there is other structural hazards.

201
00:15:39,020 --> 00:15:43,518
The performance of this code in this
pipeline does not actually get any better

202
00:15:43,518 --> 00:15:47,412
which is sort of interesting.
You think, Oh, well, I don't have to wait

203
00:15:47,412 --> 00:15:50,008
for things to dribble into my instruction
queue.

204
00:15:50,008 --> 00:15:52,093
I just sort of preload my instruction
queue.

205
00:15:52,093 --> 00:15:55,996
But the performance really, really doesn't
end up being better.

206
00:15:55,996 --> 00:16:02,256
So this motivates us going to both out of
order mixed with duplication of

207
00:16:02,256 --> 00:16:05,873
structures.
Like, maybe we can issue two instructions

208
00:16:05,873 --> 00:16:09,030
at the same time.
And maybe we can have two ALUs.

209
00:16:09,030 --> 00:16:12,090
So this is, motivates us having a
superscalar.