So one of the first optimizations we're going to look at is what we call vector chaining. Vector chaining was introduced in the Cray-1, and the idea is that you don't have to wait for the first operation, or the whole first vector, to complete before you start the second one. If we go back and look at this picture here, this load creates element zero of the vector, we'll say. This operation here uses the third element of the vector, so it can't execute right after this load is done; it has to wait for this one. But this multiply here can execute right after this load happens. So chaining in a vector architecture is, to some extent, the moral equivalent of register bypassing.

What this looks like is, first of all, you don't actually have to add intra-functional-unit bypassing. The most basic notion of chaining is that you bypass just between functional units. So what will happen here is we do a load, and we can take the value of the load as it's heading for the vector register file and also feed it through our bypass network into a different pipeline, for instance the multiply pipeline in the example we've been looking at. And you can take the result of the multiply when it's done and feed that into, say, the store pipeline, if you have a store pipeline broken apart like that; or, in this example here, we're feeding a multiply into an add. The advantage is that instead of waiting for the load, then the multiply, then the next instruction each to complete entirely, we can overlap portions of the dependent instruction as its operands become available.

So let's look at how this works out when we do chaining, but through the register file. What I mean by chaining through the register file is that we're not going to have a bypass directly from the output here to the read stage. We actually have to wait for the value to be written and then read it, but at least we can overlap. In our previous example we couldn't even do that overlap; we had to wait for the entire vector register to be written. So we've effectively pulled this forward from here to there. But what you might notice is that this value is already available here, so we could pull it a little farther forward by adding bypassing, more traditional bypassing. That was just chaining through the register file.

Okay, so now let's add chaining through the bypass network. And... it didn't change. Well, what happened? That's a bummer. Nothing happened, and I was expecting this to go faster; I want to make my architecture go faster here. What happens is that you start to get structural hazards on the read and write ports of the vector register file. So to make this go faster we actually have to add more read ports and more write ports to the vector register file, and if you're smart about it, what you really want to do is add as many ports as each pipeline needs. So if a pipeline has two input operands and one output operand, you're probably going to add two read ports and one write port to the register file per pipeline. And this actually lets things go an extra beat faster. So now we have bypassing and we have more register file bandwidth.
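To make the timing argument concrete, here is a little back-of-the-envelope model, a sketch with made-up latency numbers rather than the numbers from the slides. For each element of a four-element vector load feeding a dependent multiply, it prints the earliest cycle the multiply could read that element with no chaining, with chaining through the register file, and with chaining through the bypass network. It deliberately ignores register-file port conflicts, which is exactly the structural hazard that forces us to add the extra ports.

    /* Rough chaining model.  VL and LOAD_LAT are illustrative values only;
     * register-file port conflicts (the structural hazard discussed above)
     * are ignored here. */
    #include <stdio.h>

    #define VL       4   /* vector length                                   */
    #define LOAD_LAT 3   /* cycles from issuing element i's load to writeback */

    int main(void) {
        /* The load issues one element per cycle, so element i writes back
         * at cycle i + LOAD_LAT. */
        int last_wb = (VL - 1) + LOAD_LAT;
        for (int i = 0; i < VL; i++) {
            int wb        = i + LOAD_LAT;
            int no_chain  = last_wb + 1;  /* wait for the whole vector, then read */
            int chain_rf  = wb + 1;       /* chaining through the register file   */
            int chain_byp = wb;           /* chaining through the bypass network  */
            printf("elem %d: no-chain %2d  chain-via-RF %2d  chain+bypass %2d\n",
                   i, no_chain, chain_rf, chain_byp);
        }
        return 0;
    }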
So, more ports to our vector register file. What you'll note here is that now we have multiple things in R at the same time and multiple things in W at the same time, and we can do that because we have multiple read ports and multiple write ports to our vector register file. The other thing you should note is that this multiply here, which is on element zero, needs the result of the load. The load result gets created here, gets bypassed down to here, and then the multiply just goes and executes, and the rest step along in lockstep. Similarly, the store here needs the output of the multiply in Y3 for the first element, so it can bypass it there and then start doing the store. So we're able to get some parallelism here, even without having multiple copies of the same execution unit, and what's really cool is we only had to issue three instructions to do this. Very low instruction bandwidth.

I wanted to make the point that this is a somewhat contrived case, because our vector length is very short; it almost doesn't make sense to use vector instructions for this. But here is a picture of the same sort of thing happening, and the overlap you can get, when you increase your vector length. Here we set our vector length register to eight, and we get much more overlap of the execution: the element-zero operation of this instruction can start three cycles after that first load. This would be as if you had, say, eight independent scalar processors running in parallel, but we can do it with almost no encoding space. Questions about that? It's a little technical with the pictures. There's lots of bypassing happening through the bypass network, but the bypassing is relatively restricted: we only ever need to move one value at a time from one location to another. And if you do it through the register file, it just pushes each bypass out one cycle, so it's not so bad.

Okay. So this brings up the question: what happens if your vector length, or the size of your array, is not a nice multiple of the maximum vector length? Or say the vector you're trying to multiply is much, much larger than the vector length register can express. We use this notion of strip mining. The idea is that you do big chunks and then you do the remainder by resetting the vector length register. You could do this one of two ways: either do the remainder first and then the big chunks, or do the big chunks and then the remainder, and both designs have actually been used. To some extent this is a software technique, because the architecture just lets you set the vector length; there's no special hardware support for this. So, if you look at a piece of code here, this added a lot of stuff to our loop. We still have the load, load, multiply, store, the instructions we had before, but now we also have extra code that bumps the pointers and branches back to the top, and we also have to write the vector length register every single time through the loop.
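Here is a minimal, runnable C sketch of the shape of that strip-mined loop. MAXVL and the inner scalar loop are stand-ins: the inner loop plays the role of the single vector load, load, multiply, store that the real code issues per chunk, and this version does the odd-sized remainder chunk first and then full chunks, like the version on the slide.

    /* Strip-mining sketch for C[i] = A[i] * B[i], i = 0..n-1.
     * MAXVL stands in for the maximum vector length; the inner loop stands
     * in for one vector load/load/multiply/store at the current vector length. */
    #define MAXVL 64

    void vmul_stripmine(const double *A, const double *B, double *C, long n) {
        long vl = n % MAXVL;              /* size of the odd-sized first piece */
        if (vl == 0) vl = MAXVL;          /* n was an exact multiple of MAXVL  */
        for (long i = 0; i < n; ) {
            /* "write the vector length register" with vl, then issue the
             * vector instructions for this chunk: */
            for (long j = 0; j < vl; j++)
                C[i + j] = A[i + j] * B[i + j];
            i += vl;                      /* bump the pointers to the next chunk */
            vl = MAXVL;                   /* every later trip is a full chunk    */
        }
    }

The same structure works the other way around, full chunks first and the remainder last; the only thing that changes is when the short vector length gets written.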
And the reason we're writing the vector length register here is that, the way this is coded right now, we do the remainder first and then the big chunks. So we write the vector length register with the remainder, N mod 64, do that chunk first, and then we do the rest of the chunks with a vector length of 64. What I really want everyone to take away from this is that when you actually have to do strip mining, when you don't know that the array size is a nice even multiple of the vector length, you have to do more work in your loop. The other takeaway is that the vector length register is what allows you to do this, but you actually have to change it; you can't just leave it at one value.

So let's look at a code example here. Oh, yes, one thing I wanted to say: all of this stuff here is overhead, extra overhead. For short vector lengths this can be a problem; for longer vector lengths it starts to go away. So here we have vector strip mining, and we have the same piece of code we saw before, the load, load, multiply, store, and then we have some extra code down here. That extra code is what you need to handle the loop-around, the checking, the bumping of pointers, and the fact that n may not be a fixed number. One other thing I wanted to point out about vector strip mining is that it lets you handle variable-length arrays whose size you don't know at compile time, because you can pass in any n and this code snippet will do the right thing. And one more point, which I didn't draw because it's very hard to draw and wouldn't fit on the slide: this overhead is substantial when you're looking at a small vector length like four, but if we make our vector length 64, there's a lot more useful work up here, and the overhead starts to become small and, to some extent, unimportant. So the overhead of the loop checking and loop maintenance gets small when you have very long vectors.

Okay, so now we're going to look at parallelism a little more deeply. Here we have an instruction, and we have a deeply pipelined ALU, and we have multiple things flowing down the pipe at a time, multiple results. We take, say, A of three and B of three, and it takes six cycles to do this multiply. This is assuming we have one pipelined functional unit. What if, all of a sudden, we have multiple copies of the same functional unit? Well, then we can start to think about executing these operations in parallel and interleaving them: result C of zero goes here, C of one here, C of two here, C of three there, and then it wraps around modulo the number of units. This is effectively going to decrease our time, because we're adding more parallel copies of the same functional unit, and that lets us speed up our execution. So here is an example of that same computation we were looking at, but now instead of having one copy of everything, we have two copies. Typically these are called lanes, so we'll call this lane one and this lane two. And let's take a look at the same piece of code, the vector strip-mining loop with a vector length of four, on two lanes.
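Before walking through the timing diagram, here is a rough cycle estimate of why adding lanes shortens the vector's execution. The pipeline depth here is just the illustrative six-cycle multiply from above, not a number from the slides: with NLANES copies of a functional unit, NLANES elements enter the pipes every cycle.

    /* Back-of-the-envelope lane model: a vector of length vl on a pipeline of
     * depth DEPTH with NLANES identical copies finishes in roughly
     * DEPTH + ceil(vl / NLANES) - 1 cycles. */
    #include <stdio.h>

    static int vec_cycles(int vl, int depth, int nlanes) {
        int beats = (vl + nlanes - 1) / nlanes;   /* cycles spent issuing elements */
        return depth + beats - 1;                 /* last element drains the pipe  */
    }

    int main(void) {
        printf("VL=8, depth=6, 1 lane : %d cycles\n", vec_cycles(8, 6, 1));  /* 13 */
        printf("VL=8, depth=6, 2 lanes: %d cycles\n", vec_cycles(8, 6, 2));  /*  9 */
        printf("VL=8, depth=6, 4 lanes: %d cycles\n", vec_cycles(8, 6, 4));  /*  7 */
        return 0;
    }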
And what we can see is this basically pulls things forward, and we can overlap multiple loads at the same time, so two things are in L0 at once. Also, in this multiply here, we can have two multiplies going down the pipe at the same time like that. Now, one thing to note is that different architectures can do the interleaving in different ways; this is one possible way. You could also think about these two being pulled forward and these two being delayed, or something like that. Both of those arrangements work. But I will say that typically you'll try to interleave across your lanes in a low-order interleaved manner, and what that would mean is not this picture; this is a sort of high-order interleave. A low-order interleave would be element zero, element one, element two, element three, and then you wrap around.

Here is a picture showing an architecture with four lanes, and one of the things I wanted to point out is that you can partition your register file. From a hardware perspective, you don't need one monolithic register file, because you know that a given lane will only ever access, let's say, elements zero, four, eight, and so on of each vector register. So if you have 64 elements in a vector register, elements zero, four, eight, and onward can go here. You're basically interleaving the respective sub-portions of a vector register across these different lanes. So you can partition the register file; that's the key takeaway. I drew this to show two different things going on: say, as in the examples we've been looking at, one of these is the multiply pipe, one of these is the add pipe, and here we have the load/store pipe or something like that. So you can have multiple functional units in a lane, and then a given functional unit type runs the other way; this row is all of the adder units, we'll say.

In fact, this is basically exactly the architecture of the T0 vector processor from Berkeley. As you can see in their layout, they striped the elements of each vector register across the lanes, and they actually mirrored the datapath, so they have two copies of the same thing on either side of each lane. Up here is their scalar processor, their control processor if you will, and this is where they execute the scalar operations like the shifts and the branches and things like that.

Okay, so, a recap. Vector instruction sets have quite a few advantages. The main one is that they're very compact: one instruction can generate lots and lots of work. And they're relatively expressive; what I mean by expressive is that they very concisely state all the constraints between the operations they generate, for instance that each of those operations is independent. That lets us stripe the registers, it lets us know we don't need to bypass between elements, and it lets us access all these registers in a striped, disjoint way. Our memory access pattern, at least for unit-stride operations, is very well understood and very easy to plan for: we know it will bank very well. So if we have a cache, for instance, we can bank our cache, because we know that with unit-stride operations and multiple lanes, we'll only ever be accessing the zeroth element of a vector here, the first element there, the second element there, and so on, which works out very well from a banking perspective, though sometimes it may not work out.
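To make that striping concrete, here is a small runnable sketch, my own illustration rather than the T0 design itself, of low-order striping of one vector register across lanes: element e lives in lane e mod 4, at local slot e div 4, so lane 0 holds elements 0, 4, 8, and so on.

    /* Partitioning a 64-element vector register across 4 lanes with
     * low-order interleaving. */
    #include <stdio.h>

    #define NLANES 4
    #define MVL    64   /* elements per vector register */

    int main(void) {
        int lane_storage[NLANES][MVL / NLANES];     /* per-lane slice of one register */

        for (int e = 0; e < MVL; e++)
            lane_storage[e % NLANES][e / NLANES] = e;   /* element e's home lane/slot */

        for (int l = 0; l < NLANES; l++) {
            printf("lane %d holds elements:", l);
            for (int s = 0; s < MVL / NLANES; s++)
                printf(" %d", lane_storage[l][s]);
            printf("\n");
        }
        return 0;
    }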
And finally, these sorts of architectures can be scalable. What I mean by scalable is that you can add more lanes, and as long as you don't add more lanes than your maximum vector length, you're probably doing okay, maybe off by a little bit depending on the depth of your pipeline. You can keep adding more hardware, and in fact people have actually built families of these processors with different numbers of lanes implementing the same architecture. This is actually common in our GPUs today, where they have multiple copies of the same units and build different variants of the same processor pipeline, ones that have, let's say, one lane versus, I don't know, fifteen lanes or something like that. And you can sell a high-end part that has fifteen lanes and a low-end part that has one lane.