So we're going to continue today talking about VLIW, or Very Long Instruction Word, processors. This is an alternative way to exploit parallelism in a processor: instead of doing the scheduling dynamically, as we've been discussing for out-of-order processors and superscalars, we're looking at ways to do it statically, in the compiler, and at architectures that can exploit that. But let's now start talking about Very Long Instruction Word processors. We've been talking about superscalar processors, and superscalar processors have a fair amount of complexity in them, especially out-of-order processors. Where does this complexity come from, and can we remove all of it? That's what we're going to be talking about today.

Here I've drawn a picture where I have the issue width of a superscalar processor, which we represent as W, and the lifetime of an instruction. Together those give you a matrix of things in flight in the processor. Now, these may be sitting in your instruction queue.
We're not going to count instructions that, let's say, get all the way to the reorder buffer, because by that time, depending on how you build your processor, you probably don't need to keep checking them against the other instructions coming in; you've already broadcast that result back to your reservation stations or your instruction queue to issue further instructions. But think about the complexity of what you're really doing when you try to issue an instruction: you take these in-flight instructions and, as each one completes, you have to look it up against basically all of the other instructions in your instruction window, across all the registers.

Now, we built some structures so that you don't have to do a complete all-to-all check. The first example is the scoreboard. Instead of keeping track of where everything is, as in our original design, where the destination registers flowed down the pipe and we did an all-to-all compare, we just keep track of the location and latency of the most recent write to each register.
That turns an all-to-all compare into an indexed lookup based on the register number. So that's a little bit nicer, but if you think about what's happening in the issue queue, as an instruction retires it still has to broadcast the registers it completes against the instruction queue, trying to wake up instructions and see which ones are ready to go. And, as I said, that roughly scales like this, because you're going to be retiring multiple instructions per cycle, and if you have a bunch of entries in the instruction queue, you have to check all of them. This can get painful.

Let's first take a look at the complexity of this for an in-order machine. An in-order machine is not so bad: L is the pipeline length, you have your width, but you basically just need a scoreboard to figure this out, and as we said, that's an index-based structure. Now, when you start to go out of order, you start to have the complexity that all the results coming back have to be broadcast, either to all the reservation stations, if you distribute instructions that way, or, if you have a single instruction queue, against all the instructions that could possibly issue.
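To make the scoreboard point concrete, here's a minimal Python sketch. Everything in it is illustrative (the register count, class and method names, and latencies are not from any real machine); the idea is just that recording the most recent write per register turns the readiness check into one indexed lookup instead of a compare against every in-flight instruction.

```python
# Scoreboard sketch: one entry per architectural register, holding the
# cycle at which the most recent write's result becomes available.
# All names and numbers here are illustrative, not from a real design.

NUM_REGS = 32

class Scoreboard:
    def __init__(self):
        # None means no outstanding write: the register file is current.
        self.ready_cycle = [None] * NUM_REGS

    def issue_write(self, dest_reg, current_cycle, latency):
        # Track only the most recent write to each register.
        self.ready_cycle[dest_reg] = current_cycle + latency

    def source_ready(self, src_reg, current_cycle):
        # One indexed lookup replaces comparing src_reg against every
        # in-flight destination (the all-to-all check).
        rc = self.ready_cycle[src_reg]
        return rc is None or rc <= current_cycle

sb = Scoreboard()
sb.issue_write(5, current_cycle=0, latency=3)  # result ready at cycle 3
print(sb.source_ready(5, 1))   # False: still in flight
print(sb.source_ready(5, 3))   # True: write has completed
print(sb.source_ready(7, 1))   # True: no outstanding write to r7
```

The check is O(1) per source register, which is what makes this structure cheap even as the machine gets wider.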
And what's interesting is that one of the big challenges of building these structures is not quite what you would think. There's an interesting challenge in that too many things can wake up in one cycle. Let's say all of the instructions in your instruction queue are waiting for register five to become valid. All of a sudden some add writes register five and gets to the end of the pipe, entering the reorder buffer. Well, now you have to wake up all of those instructions and choose some subset, up to the width of your machine, that can issue. Also, you still need to make sure you can satisfy all the machine's requirements on the functional-unit mix. If you have, say, two adders, two multipliers, and one load/store unit, you can't just pick randomly; you need to pick exactly that mix. So this becomes a big mess of combinational logic, and it's why the front ends of out-of-order processors can get longer and longer.
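The wakeup and select steps can be sketched the same way. This is a toy model under assumed parameters (a four-entry queue, a functional-unit mix of two ALU slots and one memory slot per cycle, oldest-first picking as a common heuristic); no real machine's numbers are implied.

```python
# Wakeup/select sketch: a completing write broadcasts its destination
# register against every waiting entry (wakeup), then select picks
# ready entries oldest-first subject to the functional-unit mix.
# The queue contents and the 2-ALU / 1-memory mix are assumptions.

FU_SLOTS = {"alu": 2, "mem": 1}

# Entries: [age, op name, functional unit, source regs still outstanding]
queue = [
    [0, "load", "mem", set()],       # already ready
    [1, "add",  "alu", {5}],
    [2, "mul",  "alu", {5, 6}],
    [3, "add2", "alu", {5}],
]

def wakeup(dest_reg):
    # One tag compare per entry: this broadcast is the cost that grows
    # with the size of the instruction queue.
    for entry in queue:
        entry[3].discard(dest_reg)

def select():
    # Oldest-first (lowest age) to help forward progress, while never
    # exceeding the per-cycle functional-unit slots.
    slots = dict(FU_SLOTS)
    picked = []
    for age, op, fu, waiting in sorted(queue):
        if not waiting and slots.get(fu, 0) > 0:
            slots[fu] -= 1
            picked.append(op)
    return picked

wakeup(5)          # some add writes register five: three entries wake
print(select())    # ['load', 'add', 'add2']: mul still waits on r6
```

Note that one broadcast woke three entries at once, which is exactly the "too many things wake up in one cycle" problem: select then has to whittle them down to what the functional units can accept.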
You get more and more stages in there to do this; as we talked about last time, the Pentium 4 actually has multiple stages out in front to do this, and unfortunately you have to pipeline that, which makes it even more challenging. So roughly, if you look at the out-of-order control logic here, it's growing somewhere between the width squared and the width cubed, and that is not very good.

Sometimes there are a couple of circuit-level tricks that can help you. If you look at people who do full-custom logic for processor designs, they'll make something they call a picker. Instead of using straight combinational logic to compute which instruction is ready to execute, a picker is a much more analog circuit: some custom circuit that is almost a CAM, a content-addressable memory. It's not quite a CAM; it's more of a circuit that chooses which instruction is ready to go, and typically you want a heuristic that picks the oldest instruction. This is to prevent deadlock: you don't really want to issue arbitrary instructions, you want to issue the oldest instructions, because you want to make forward progress in the machine.
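As a back-of-envelope illustration of that growth, here's an assumed cost model (not measurements from any design): W results are broadcast per cycle, each checked against two source tags in every window entry, and the window itself is scaled linearly with the width, since a wider machine tends to need a deeper window to stay busy. Under this model the comparator count grows as W squared; extra ports and checks push real designs toward W cubed.

```python
# Assumed cost model for the wakeup network (illustrative only):
# each cycle, W results broadcast their tags against every window
# entry's two source tags, and the window is scaled with the width
# because a wider machine needs a deeper window to find parallelism.

def wakeup_comparators(width, entries_per_way=16, srcs_per_entry=2):
    window = width * entries_per_way     # wider -> deeper window
    return width * window * srcs_per_entry

for w in (2, 4, 8):
    print(w, wakeup_comparators(w))      # 128, 512, 2048: ~W^2 growth
```

Doubling the width quadruples the comparator count in this model, which is the blow-up factor being described.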
So what we're trying to get at here is that this is a lot of hardware complexity. It grows with the width of the machine, and some things actually make it even worse. I have a little note here: as you increase the width of your machine, say you have a three-wide machine and you want to go to, let's say, a four- or five-wide machine, you typically also need a larger instruction queue, or instruction window, to look across in order to find enough parallelism to keep the machine busy. So as you go wider, you also typically have to make the machine deeper, or at least make your instruction queue deeper, to find enough parallelism to feed your functional units. Hm. That's not good. This causes these blow-up factors, and while you can build these things, they're hard. So are there alternatives? Yes.

But before we move on to the alternatives, let's look at the complexity of this in a real processor. Here we have a die photo, or die micrograph, of a MIPS R10000 processor. This was used in workstations by SGI for many years. It's a real out-of-order superscalar, and one of the interesting things is that if you look at how much area is dedicated just to control logic, it's pretty substantial.
So in this processor, here's our actual datapath; it's relatively modest. The instruction cache and data cache are pretty big structures; you can see those. We have, let's say, the data tags, the instruction tags, the TLB; those are biggish things, and all this other stuff here is quite large. Let's look at what's inside it. We have the next-instruction logic. We have the free list, which we talked about, right? We talked about the free list that tracks which registers are free, if you will, to reuse in your out-of-order processor. We have a big block that just does register renaming. We have different queues here; this is a distributed-instruction-queue, or distributed-reservation-station, processor, so there's one queue just for address-calculation instructions, one for integer instructions, and one for floating-point instructions. But the interesting thing to see here is the percentage of the die that's actually just doing compute, let's say the integer datapath, the floating-point datapath, the floating-point multiplier: it pales in comparison to all the overhead. So this is just to get that idea across.
Another problem with out-of-order superscalars, or superscalars in general, is thinking about how we express parallelism. I alluded to this in a previous lecture, but it's worth repeating. So let's take a look at what happens for sequential code on an out-of-order superscalar. We start off here with some piece of C code. We put it into our fancy superscalar compiler. That generates some sort of data-flow graph, and probably also a control-flow graph, inside the compiler. So there are instructions, there are dependencies between the different instructions, and the compiler is trying to find independent operations, because it's going to come up with a schedule. Unfortunately, it has to come up with a sequential schedule, because the instruction sets of sequential processors require you to come up with some order, even though it's very possible that there is no one perfect order. Instead of over-constraining the problem here, why are we coming up with some ordering when we don't need to? But with a sequential instruction set, we have to. We write it out to disk somehow, or save a file with this code, and then we have to go and actually try to execute it.
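The information loss being described can be sketched as follows. The three operations and their dependence edges are made up for illustration: a and b are independent, yet the sequential instruction set forces the compiler to commit to one total order, which the processor must later un-flatten.

```python
# The compiler's view: a dependence graph (edges are made up for
# illustration). a and b are independent, but a sequential ISA makes
# the compiler emit exactly one total order anyway.

deps = {
    "a": set(),        # a = x + y
    "b": set(),        # b = u * v   (independent of a)
    "c": {"a", "b"},   # c = a - b   (needs both)
}

def any_topological_order(deps):
    # Any order respecting the edges is legal; the binary encodes one.
    order, done = [], set()
    while len(order) < len(deps):
        for op in sorted(deps):
            if op not in done and deps[op] <= done:
                order.append(op)
                done.add(op)
    return order

print(any_topological_order(deps))   # ['a', 'b', 'c'] with this tie-break
# The processor then has to rediscover, in hardware, that 'a' and 'b'
# were independent all along.
```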
Now, what's funny, and I always get a chuckle out of this, is that the compiler knew what the dependencies were. And then we have our superscalar processor, which goes and takes the sequential code apart, tries to reorder the instructions, and tries to find parallelism in the instruction sequence; but the compiler knew that. We have a breakdown here in the middle: we just weren't able to express that between the compiler and the processor. The compiler knew about the parallelism; the compiler knew about all the ordering requirements. It had to come up with some sequential code sequence and provide it to the processor. The processor takes that sequential code sequence back apart and does out-of-order scheduling, like we did on the midterm. We built a bunch of hardware here with high complexity, as we were talking about, all that combinational logic checking the dependencies, and then it comes up with some schedule. This is done via our reservation stations, or our instruction queues: something is going to fire, it's going to come up with some schedule, and it's going to execute those instructions while preserving read-after-write dependencies.
And hopefully, if our superscalar processor is fancy, we have a register renamer so we can break some of the other dependencies too, the write-after-write and write-after-read dependencies, and come up with some schedule that's high performance. Now, there's no guarantee that this schedule is what the compiler would have done if it were smarter, or if it could somehow express parallelism across this interface. But, you know, it does something. And what's funny about this is that there's a whole lot of power being wasted, a whole lot of effort being wasted in the processor. There's a lot of effort by the designers of the processor, who are effectively designing all these circuits to take apart the sequential code sequence and do something useful with it. So there's a lot of work being done there. Is there a better way?