Okay. So now we get into some more fun pictures here. Let's take a look at a processor based on the in-order processors we've been looking at up to this point. We fetch the code, we've added or renamed a stage here, which we'll talk about in a second, and then we have different pipelines and one writeback stage. We have two register files: a scalar register file, which is what we're calling our old register file, and a vector register file, which has lots and lots of data in it. The vector length register sort of sits out in front here in our register fetch stage. And we make a special stage for this because it does a lot of work. What is it going to do? Well, when an operation gets to this stage, it's going to start reading the register file. And if you have vector registers, it's going to sit there and read the first element, the second element, the third element in time, and start shoving them down one of the pipes. So let's say this is our multiply pipe here, it's four stages long, and we do a multiply with a vector length of 64. It's going to do 64 sequential reads out of here and then send 64 operations down this multiply pipe. Note, we are not looking at any parallelism yet in this example; we're doing everything sequentially here. One thing you will note is that instructions can stop here and basically generate more work at that point, up to the maximum vector length, or whatever the vector length register is currently set to.

So let's look at a basic operation here. We're going to have a piece of code which is doing the same operation: we're taking vector A and vector B and multiplying them element by element, and the vector length is four. I use a small length here because it's hard to draw these things with a vector length of 64. [LAUGH] Just lots of instructions to draw. If we go look at the assembly code, it's the same assembly code we had before, but we load the vector length with four instead of 64. So: load vector, load vector, multiply vector-vector double, and store vector. Let's look at the second load, the multiply, and the store. I don't have the first load or the load immediate on here because it just took up too much space. This load vector here, it's going to start off, it's going to fetch, decode, and then it's usually going to sit at the R stage for a while inserting loads, loads, loads, loads down the pipeline.

Okay. In this basic vector execution, we don't have any bypassing, and we do register dependency checking through the register file on whole registers. So we stall an instruction if the whole vector is not ready. Now, we're going to look at ways to make that better in a second. But what that means is, we wait for all of the values of this load to write back to the register file before we go and start to do the register fetches of the next instruction. Then we do the multiplies; multiplies have longer pipeline lengths. We wait for all of those multiplies to write to the register file before we start to go do the store operations. So scoreboarding and bypassing here are very, very limited. You basically have a check to make sure everything is in the register file before you go ahead.

I do want to introduce one piece of nomenclature, and your book calls this out: it's called a chime. A chime is how long it takes to execute one vector instruction on the architecture.
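To make that "everything waits for the whole vector" behavior concrete, here is a minimal C sketch of a cycle count for the four-instruction sequence with no chaining: each vector instruction occupies its pipe for roughly (pipe depth + vector length) cycles before the next one may even start its register fetch. The pipe depths and the VMIPS-style mnemonics in the strings are illustrative assumptions, not the exact numbers or listing from the slides.

```c
#include <stdio.h>

#define VL 4   /* vector length register set to 4 instead of 64 */

/* One vector instruction with no chaining: fill the pipe, then push
 * one element per cycle, and nothing downstream starts until every
 * element has written back to the vector register file. */
static int vector_instr(const char *name, int pipe_depth) {
    int cycles = pipe_depth + VL;
    printf("%-12s %d cycles\n", name, cycles);
    return cycles;
}

int main(void) {
    int total = 0;
    total += vector_instr("LV    V1,A", 3);   /* load vector A (assumed 3-stage load pipe) */
    total += vector_instr("LV    V2,B", 3);   /* load vector B                              */
    total += vector_instr("MULVV V3",   4);   /* 4-stage multiply pipe from the example     */
    total += vector_instr("SV    V3,C", 3);   /* store the result                           */
    printf("total with no overlap: %d cycles\n", total);
    return 0;
}
```

The point of the sketch is just that the occupancies add up serially here; the overlapping we talk about next is about letting some of those occupancies run at the same time in different functional units.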
So, what we're going to see in a little bit is we're going to have some architectures where you can actually overlap some portion of the execution, say by having different functional units, and decrease the chime. For this architecture here, the chime is four because it takes, basically, an occupancy of four for a vector length of four, and we only have one ALU, effectively, that can be used at a time here.

So now, let's take a look at how to make things run a little bit faster. We've exploited no parallelism yet. The only real advantage we've taken here is that we've decreased our instruction fetch memory bandwidth. But that's not a really great reason to go do anything by itself; it probably reduces power a little bit, but we want to go fast. So we can start to think about how to overlap. If we have different functional units — the load unit, the store unit, the ALU, and the multiply unit — we can start to think about overlapping them in both space and time. So here's an example where we're executing 32 elements as our vector length, and we actually have multiple copies of the functional units, and we have multiple functional units. So we can overlap different executions with each other, and we'll look at some more detailed examples in a second. So we can start to add parallelism by using multiple units at the same time: our adder unit, our multiply unit, and our load unit — this was L, this was Y, and this was X — and we can actually put multiple copies of those. We're going to call those lanes.
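As a rough picture of what lanes buy you, here is a small C sketch in which the elements of one vector multiply are striped across several identical copies of the multiplier; the lane count and the element-to-lane mapping (element i to lane i mod LANES) are assumptions for illustration, not a claim about the exact machine in the slides.

```c
#include <stdio.h>

#define VL    32   /* vector length from the 32-element example        */
#define LANES 4    /* assumed number of copies of each functional unit */

int main(void) {
    double a[VL], b[VL], c[VL];
    for (int i = 0; i < VL; i++) { a[i] = i; b[i] = 2.0 * i; }

    /* Each outer iteration is one issue step in which all LANES lanes
     * work in parallel, so the element loop takes VL / LANES steps
     * instead of VL sequential ones. */
    int steps = 0;
    for (int base = 0; base < VL; base += LANES, steps++) {
        for (int lane = 0; lane < LANES; lane++) {
            int i = base + lane;          /* element i handled by this lane   */
            c[i] = a[i] * b[i];           /* each lane has its own multiplier */
        }
    }
    printf("%d elements finished in %d steps with %d lanes\n", VL, steps, LANES);
    return 0;
}
```

With four lanes the 32-element operation needs only eight issue steps, which is the same kind of win the lecture is about to walk through in the more detailed lane examples.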