So one of the first optimizations we are going to look at is what we call vector chaining. Vector chaining was introduced in the Cray-1. The idea is that you don't have to wait for the first operation, or the whole first vector, to complete before you start the second vector. Because if we went back and looked at this picture here, this load creates element zero of the vector, we'll say. This operation here looks at the third element in the vector, so it cannot execute right after this load is done; it has to wait for this one. But this multiply here can execute right after this load happens. So chaining in a vector architecture is, to some extent, the moral equivalent of register bypassing.

And what this looks like is, first of all, you don't actually have to add intra-functional-unit bypassing. The most basic notion of chaining is that you bypass just between functional units. So what will happen here is we're going to do a load, and we can take the value of the load as it's heading for the vector register file and also feed it through our bypass network into a different pipeline, for instance the multiply pipeline in the example we've been looking at. And you can take the result of the multiply when it's done and feed that into, say, the store pipeline, if you have a store pipeline that's broken apart into stages, or, in this example here, we're actually feeding a multiply into an add.

So the advantage here, as I said, is that instead of having to wait for the load to complete and then the multiply to complete before the next instruction, we can overlap portions of the dependent instructions as their operands become available.

So let's look at how this works out here where we're doing chaining, but through the register file. Chaining through the register file, what I mean by that is, we're not going to have a bypass directly from the output here to the read stage. We're actually going to have to wait for it to write and then read it, but at least we can overlap. In our previous example we were not even able to do this overlap; we had to wait for the entire vector register to be written. So we've effectively pulled this forward from here to there. But what you might notice here is, well, this value is already here, so we could even try to pull it a little bit farther forward by actually adding bypassing, or more traditional bypassing. But this is just chaining through the register file.
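To make that element-wise dependence concrete, here is a tiny sketch in C. The latencies are made-up assumptions for illustration, not the lecture's exact pipeline numbers; the point is that the chained multiply on element i only needs the load's element i, not the whole vector.

```c
#include <stdio.h>

/* Sketch of the element-wise dependence behind chaining; the vector
 * length and load latency below are assumed, illustrative values. */
int main(void) {
    int VL = 4;                        /* vector length */
    int L_load = 3;                    /* assumed load pipe latency */
    for (int i = 0; i < VL; i++) {
        int load_done = L_load + i;    /* cycle when load writes element i */
        int mul_start = load_done + 1; /* chained multiply can start here,
                                          without waiting for the rest */
        printf("element %d: load done @ cycle %d, multiply can start @ %d\n",
               i, load_done, mul_start);
    }
    return 0;
}
```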
Okay, so now let's add chaining through the bypass network. And... it didn't change. Well, what happened? That's a bummer. Nothing happened. I was expecting this to go faster. You know, I want to make my architecture go faster here.

Well, what happens is you start to get structural hazards on the read ports and write ports of the vector register file. So to make this go faster, we're actually going to have to add more read ports and more write ports to the vector register file. And if you were smart, what you really want to do is add as many ports per pipeline as that pipeline needs. So if each pipeline has two input operands and one output operand, you're probably going to add two read ports and one write port to the register file per pipeline. And this actually lets things go a beat faster.

So now we have bypassing, and we have more register file bandwidth, that is, more ports into our vector register file. What you'll note here is that now we have multiple things in the read stage at the same time and multiple things in the write stage at the same time. And we can do that because we have multiple read ports and multiple write ports to our vector register file.

The other thing you should note here is that this multiply here, which is on element zero, needs the result of the load. The result gets created here, it gets bypassed down to here, and then it just goes and executes. And then they just step along in lockstep here. Similarly, the store here needs the output of the multiply in Y3 for the first element, so it can bypass it there and then start doing the store.

So we are able to get some parallelism here, even without having multiple copies of the same execution unit; we are able to get parallelism in our vector architecture. And what's really cool is we only had to issue three instructions to do this. So, very low instruction bandwidth. And I wanted to make the point here that this is kind of a contrived case, because our vector length is very short. It almost doesn't make sense to use vector instructions to do this.
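As a rough back-of-the-envelope model of the three schemes we just walked through (the model and all the latencies are my own assumptions, not the lecture's exact pipeline depths), you can compare when the last element of the final result is ready:

```c
#include <stdio.h>

/* Rough cycle-count model: no chaining, chaining through the register
 * file, and chaining with bypassing. All numbers are assumptions. */
int main(void) {
    int VL = 4;                     /* vector length */
    int L1 = 3, L2 = 3, L3 = 3;     /* assumed load, multiply, add depths */

    /* No chaining: each instruction waits for the whole previous vector. */
    int no_chain  = (L1 + VL) + (L2 + VL) + (L3 + VL);

    /* Chaining through the register file: a dependent instruction reads
     * element 0 one cycle after it is written, then runs in lockstep. */
    int chain_rf  = (L1 + 1) + (L2 + 1) + L3 + VL;

    /* Chaining with bypassing (and enough register file ports): the extra
     * write-then-read cycle per chain link is bypassed away. */
    int chain_byp = L1 + L2 + L3 + VL;

    printf("no chaining: %d cycles\nchain via regfile: %d cycles\n"
           "chain + bypass: %d cycles\n", no_chain, chain_rf, chain_byp);
    return 0;
}
```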
But here is a picture of the same sort of thing happening, and the overlap you can get when you increase your vector length. So here we're going to set our vector length register to eight. And we're going to get much more overlap of the execution, and you can basically have this instruction here, the element-zero operation, starting three cycles after that first load. And this would be as if you had, let's say, eight independent scalar processors sort of running in parallel, but we could do it with almost no encoding space.

So, questions about that? So far, a little technical with the pictures. There's lots of bypassing happening here through the bypass network, but the bypassing is relatively restricted: we only need to bypass one value at a time from one location to another location. And if you do it through a register file, it pushes each bypass out one cycle, so it's not so bad.

Okay. So this brings up the question: what happens if your vector length, the size of your array, is not a nice multiple of the vector length register? Or let's say the vector that you're trying to multiply is much, much larger than the maximum value the vector length register can hold. So we use this notion of strip mining. And the idea is that you do big chunks, and then you do the remainder by resetting the vector length register. Now, you could do this one of two ways: you could either do the remainder first and then do the big chunks, or you could do the big chunks and then do the remainder. And both designs have actually been used. To some extent this is a software technique here, because the architecture just allows you to set the vector length, but doesn't have any special hardware support for this.

So, if you look at a piece of code here, this added a lot of stuff to our loop. Now all of a sudden, we still have load, load, multiply, store; these are the old instructions that we had before. But we also have other stuff here that's bumping pointers and looping back around, and we also have to write the vector length register every single time through this loop. And the reason we write the vector length register here is, the way this is actually coded, we do the remainder first and then do the big chunks.
So we're going to write the vector length register with the remainder first, N mod 64. Do that chunk first, and then we're going to do the rest of them with 64. What I really want everyone to take away from this is: when you actually have to do strip mining, and you know that the array size is not a nice even multiple of the vector length, you actually have to do more work in your loop. The other thing I want you to take away here is that the vector length register allows you to do this, but you actually have to change it; you can't just set it to one value.

So let's look at a code example here. Oh yes, I wanted to say: all of this stuff here is overhead, sort of extra overhead. And for short vector lengths, this can be a problem. For longer vector lengths, this starts to go away.

So here we have vector strip mining, and we have the same piece of code that we saw before. We see the load, load, multiply, store, and then we have some junk down here. And this junk is the extra code you need to handle the looping around, the checking and bumping of pointers, and to handle the fact that N may not be a fixed number. One other thing I wanted to point out about vector strip mining is that it also allows you to handle variable-length arrays whose size you don't know at compile time, because you can just use any N and this code snippet will do the right thing.

One other thing I wanted to point out here, which I didn't draw because it's actually very hard to draw and it wouldn't fit on the slide, is that this overhead is substantial when looking at a small vector length like four. If we make our vector length 64, there's a lot more work up here, and this overhead starts to be small and unimportant to some extent. So the overhead of this loop, the checking and the maintenance of the loop, starts to get small when you have very long vectors.
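Here is a minimal strip-mining sketch in C, assuming a maximum vector length of 64 as in the lecture's example. The function name and the set_vl() comment are mine; they stand in for the real vector instructions and the write to the vector length register.

```c
/* Strip-mined elementwise multiply: remainder chunk first, then full
 * 64-element chunks, matching the ordering described in the lecture. */
void vmul_stripmined(int n, const float *a, const float *b, float *c) {
    int low = 0;
    int vl = n % 64;                    /* remainder chunk first: N mod 64 */
    for (int j = 0; j <= n / 64; j++) {
        /* set_vl(vl);  <- on a real machine, write the vector length
           register; the inner loop models vl-element vector instructions */
        for (int i = low; i < low + vl; i++)
            c[i] = a[i] * b[i];         /* load, load, multiply, store */
        low += vl;                      /* bump pointers past this chunk */
        vl = 64;                        /* all later chunks use the full MVL */
    }
}
```

Note that the vector length changes between the first trip and the rest, which is exactly the "write the vector length register every time through the loop" overhead mentioned above.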
Okay, so now we're going to start to look at parallelism a little more deeply. So here we have an instruction, and we have a deeply pipelined ALU. We're going to have multiple things flowing down the pipe at a time, multiple results, and we're going to take, you know, A[3] and B[3], and it's going to take six cycles to do this multiply. This is assuming that we have one pipelined functional unit.

What if, all of a sudden, we start to have multiple copies of the same functional unit? Well, we can actually start to think about executing these things in parallel, and interleaving the respective operations. So we have result C[0] here, C[1] here, C[2] here, C[3] here, and then, modulo that, it wraps around. So this is effectively going to decrease our time, because we're going to add more functional units, more parallel functional units, more copies of the same functional unit, and that will allow us to speed up our execution.

So here is an example of that same computation we were looking at, but now instead of having one copy of everything, we have two copies. And typically these things are called lanes. So we're going to call this lane one and this lane two. Take a look at the same piece of code, this vector strip-mining loop with a vector length of four, with two lanes. And we can see this basically pulls things forward, and we can overlap multiple loads at the same time, so two things are in L0 at once. And also in this multiply here, we can have basically two multiplies going down the pipe at the same time, like that.

Now, one thing to note here is that different architectures can do the interleaving in different ways. This is one possible way to do the interleaving. You can also think about these two pulled forward and these two delayed, or something like that. Both of those architectures work. But I will say that typically you'll try to interleave your lanes in a low-order interleaved manner. So what that would mean is not this picture; this is a sort of high-order interleaved manner. A low-order interleaved manner would be element zero, element one, element two, element three across the lanes, and then you wrap around.
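Here is a small sketch of the difference between the two mappings; the constants and the mapping formulas are mine, for illustration, matching the two-lane, vector-length-four case we've been drawing.

```c
#include <stdio.h>

#define LANES 2   /* number of lanes, as in the two-lane example */
#define VL    4   /* vector length */

/* Low-order interleaving assigns element i to lane i mod LANES;
 * high-order interleaving gives each lane a contiguous half. */
int main(void) {
    for (int i = 0; i < VL; i++) {
        int low_order  = i % LANES;         /* 0,1,0,1: wraps around */
        int high_order = i / (VL / LANES);  /* 0,0,1,1: contiguous halves */
        printf("element %d: low-order lane %d, high-order lane %d\n",
               i, low_order, high_order);
    }
    return 0;
}
```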
This is a picture here showing an architecture which has four lanes, and one of the things I wanted to point out here is that you can actually partition your register file. From a hardware perspective, you don't need to have one monolithic register file, because you know that a given lane will only ever access, let's say, elements zero, four, eight, and so on of your vector registers, or, excuse me, elements of the vector register. So if you have 64 elements in your vector register, elements zero, four, eight, and so on of those can go here. So you're basically interleaving the respective sub-portions of a vector register across these different lanes. The key takeaway point is that you can partition the register file.

I drew this to show two different things going on. Let's say, as in the examples we've been looking at, one of these is the multiply pipe, one of these is the add pipe, and here we have our load/store pipe or something like that. You can have multiple pipes in a lane. And then a functional unit sort of runs the other way; so this is all of the adder units, we'll say.

And in fact, this is basically exactly the architecture of the T0 vector processor from Berkeley. And, as you can see in their layout here, they striped the respective elements of one register in the vector register file over the lanes. And then they actually mirrored the data path: they have two copies of the same thing on either side of the lane. And up here is their scalar processor, their control processor if you will. This is where they execute the scalar operations like the shifts and the branches and things like that.

Okay, so let's recap here. Vector instruction sets have quite a few advantages. The main one is that they're very compact: one instruction can generate lots and lots of work to do. And they're relatively expressive. What I mean by expressive is that they very concisely state exactly the constraints between all the different operations they generate. So, for instance, each of the different element operations is independent. This allows us to stripe the registers across lanes. It allows us to know that we don't need to bypass between element operations. It allows us to access all these registers in a disjoint way.

Our memory access pattern, at least for unit-stride operations, is very well understood and very easy to plan for. We know that it's going to bank very well. So if we have a cache, for instance, we can actually bank our cache, because we know, if we're doing unit-stride operations and we have multiple lanes, for instance here, that we're only going to be accessing the zeroth element of a vector here, the first element there, the second element there, etc. And this works out very well from a banking perspective, though sometimes it may not always work out.
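As a quick sketch of that banking argument (the bank count, element size, and base address are assumptions for illustration), consecutive unit-stride elements map to distinct banks under low-order interleaving, so the lanes don't collide on a bank in the same beat:

```c
#include <stdio.h>

#define BANKS 4                    /* say, one bank per lane */

/* Unit-stride accesses under low-order bank interleaving: element i
 * lands in bank i mod BANKS, cycling through all banks evenly. */
int main(void) {
    unsigned base = 0x1000;        /* arbitrary base address */
    for (int i = 0; i < 8; i++) {
        unsigned addr = base + 4u * i;   /* 4-byte elements, stride 1 */
        printf("element %d -> address 0x%x -> bank %u\n",
               i, addr, (addr / 4) % BANKS);
    }
    return 0;
}
```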
And finally, these sorts of architectures can be scalable. What I mean by scalable is that you can add more lanes. And as long as you don't add more lanes than your maximum vector length, you're probably doing okay; or you might want to back off from that a little bit, by the depth of your pipeline, say. But you can add more and more hardware, and in fact people have actually built families of these processors with different numbers of lanes in the same architecture. And this is actually common in our GPUs today, where they have multiple copies of the same units and will actually build different variants of the same processor pipeline, ones that have, let's say, one lane versus fifteen lanes or something like that. And you can sell a high-end part which has fifteen lanes, and a low-end part that has one lane.
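A back-of-the-envelope way to see why lanes scale only up to the vector length (the formula is my own, not from the lecture): with N lanes, a vector of VL elements issues in ceil(VL/N) beats, so once N exceeds VL the extra lanes just sit idle.

```c
#include <stdio.h>

/* Issue beats per vector instruction as lane count grows; VL is an
 * assumed maximum vector length of 64 for illustration. */
int main(void) {
    int VL = 64;
    for (int lanes = 1; lanes <= 128; lanes *= 2) {
        int beats = (VL + lanes - 1) / lanes;   /* ceil(VL / lanes) */
        printf("%3d lanes -> %2d issue beats\n", lanes, beats);
    }
    return 0;   /* beats stop improving once lanes > VL */
}
```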