Okay. So now we get into some more fun pictures here. Let's take a look at a processor based on the in-order processors we've been looking at up to this point. We fetch the code, we've added or renamed a stage here, which we'll talk about in a second, and then we have different pipelines and one writeback stage. We have two register files: a scalar register file, which is what we're calling our old register file, and a vector register file, which has lots and lots of data in it. The vector length register sort of sits out in front here in our register fetch stage. And we make a special stage for this because it does a lot of work. What is it going to do? Well, when an operation gets to this stage, it's going to start reading the register file. And if you have vector registers, it's going to sit there and read the first element, the second element, the third element in time, and start shoving them down one of the pipes. So let's say this is our multiply pipe here, it's four stages long, and we do a multiply with a vector length of 64. It's going to do 64 sequential reads out of here and then send 64 operations down this multiply pipe. Note, we are not looking at any parallelism yet in this example; we're doing everything sequentially here. One thing you will note is that instructions can stop here and basically generate more work at that point, up to the maximum vector length, or whatever the vector length register is currently set to.

So let's look at a basic operation here. We're going to have a piece of code which is doing the same operation: we're taking vector A and vector B and multiplying them element by element, and the vector length is four. I use a small length here because it's hard to draw these things with a vector length of 64. [LAUGH] Just lots of instructions to draw. If we go look at the assembly code, it's the same assembly code we had before, but we load the vector length with four instead of 64. So: load vector, load vector, multiply vector-vector double, and store vector. Let's look at the second load, the multiply, and the store. I don't have the first load or the load immediate on here because it just took up too much space. This load vector here, it's going to start off, it's going to fetch, decode, and then it's usually going to sit at the R stage for a while inserting loads, loads, loads, loads down the pipeline.

Okay. In this basic vector execution, we don't have any bypassing, and we do register dependency checking through the register file on whole registers. So we stall an instruction if the whole vector is not ready. Now, we're going to look at ways to make that better in a second. But what that means is, we wait for all of the values of this load to write back to the register file before we go and start to do the register fetches of the next instruction. Then we do the multiplies; multiplies have longer pipeline lengths. We wait for all of those multiplies to write to the register file before we start to go do the store operations. So scoreboarding and bypassing here are very, very limited. You basically have a check to make sure everything is in the register file before you go ahead.

I do want to introduce one piece of nomenclature, and your book calls this out: it's called a chime. A chime is how long it takes to execute one vector instruction on the architecture.
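To make that "everything waits for the whole vector" behavior concrete, here is a minimal C sketch of a cycle count for the four-instruction sequence with no chaining: each vector instruction occupies its pipe for roughly (pipe depth + vector length) cycles before the next one may even start its register fetch. The pipe depths and the VMIPS-style mnemonics in the strings are illustrative assumptions, not the exact numbers or listing from the slides.

```c
#include <stdio.h>

#define VL 4   /* vector length register set to 4 instead of 64 */

/* One vector instruction with no chaining: fill the pipe, then push
 * one element per cycle, and nothing downstream starts until every
 * element has written back to the vector register file. */
static int vector_instr(const char *name, int pipe_depth) {
    int cycles = pipe_depth + VL;
    printf("%-12s %d cycles\n", name, cycles);
    return cycles;
}

int main(void) {
    int total = 0;
    total += vector_instr("LV    V1,A", 3);   /* load vector A (assumed 3-stage load pipe) */
    total += vector_instr("LV    V2,B", 3);   /* load vector B                              */
    total += vector_instr("MULVV V3",   4);   /* 4-stage multiply pipe from the example     */
    total += vector_instr("SV    V3,C", 3);   /* store the result                           */
    printf("total with no overlap: %d cycles\n", total);
    return 0;
}
```

The point of the sketch is just that the occupancies add up serially here; the overlapping we talk about next is about letting some of those occupancies run at the same time in different functional units.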
So, what we're going to see in a little bit is we're going to have some architectures where you can actually overlap some portion of the execution, say by having different functional units, and decrease the chime. For this architecture here, the chime is four because it takes, basically, an occupancy of four for a vector length of four, and we only have one ALU, effectively, that can be used at a time here.

So now, let's take a look at how to make things run a little bit faster. We've exploited no parallelism yet. The only real advantage we've taken here is that we've decreased our instruction fetch memory bandwidth. But that's not a really great reason to go do anything by itself; it probably reduces power a little bit, but we want to go fast. So we can start to think about how to overlap. If we have different functional units — the load unit, the store unit, the ALU, and the multiply unit — we can start to think about overlapping them in both space and time. So here's an example where we're executing 32 elements as our vector length, and we actually have multiple copies of the functional units, and we have multiple functional units. So we can overlap different executions with each other, and we'll look at some more detailed examples in a second. So we can start to add parallelism by using multiple units at the same time: our adder unit, our multiply unit, and our load unit — this was L, this was Y, and this was X — and we can actually put multiple copies of those. We're going to call those lanes.
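As a rough picture of what lanes buy you, here is a small C sketch in which the elements of one vector multiply are striped across several identical copies of the multiplier; the lane count and the element-to-lane mapping (element i to lane i mod LANES) are assumptions for illustration, not a claim about the exact machine in the slides.

```c
#include <stdio.h>

#define VL    32   /* vector length from the 32-element example        */
#define LANES 4    /* assumed number of copies of each functional unit */

int main(void) {
    double a[VL], b[VL], c[VL];
    for (int i = 0; i < VL; i++) { a[i] = i; b[i] = 2.0 * i; }

    /* Each outer iteration is one issue step in which all LANES lanes
     * work in parallel, so the element loop takes VL / LANES steps
     * instead of VL sequential ones. */
    int steps = 0;
    for (int base = 0; base < VL; base += LANES, steps++) {
        for (int lane = 0; lane < LANES; lane++) {
            int i = base + lane;          /* element i handled by this lane   */
            c[i] = a[i] * b[i];           /* each lane has its own multiplier */
        }
    }
    printf("%d elements finished in %d steps with %d lanes\n", VL, steps, LANES);
    return 0;
}
```

With four lanes the 32-element operation needs only eight issue steps, which is the same kind of win the lecture is about to walk through in the more detailed lane examples.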