Okay. So now we get into some more fun pictures here. Let's take a look at a processor based on the in-order processors we've been looking at up to this point. We fetch the code, we decode, we've added or renamed a stage here, which we'll talk about in a second, and then we have different pipelines and one writeback. We have two register files. We have a scalar register file, which is what we're calling our old register file, and we have the vector register file, which has lots and lots of data in it.

The vector length register sort of sits out in front here in our register fetch stage. And we make a special stage for this because it does a lot of work. What is it going to do? Well, when an operation gets to this stage, it's going to start reading the register file. And if you have vector registers, it's going to sit there and read the first element, the second element, the third element, in time, and start shoving them down one of the pipes. So let's say this is our multiply pipe here, and it's four stages long, and we do a multiply with a vector length of 64. It's going to do 64 sequential reads out of here, and then send 64 operations down this multiply pipe. Note, we are not looking at any parallelism yet in this example. We're doing everything sequentially here.
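The sequential issue the lecture describes can be sketched as a tiny timing model. This is a hypothetical sketch, not a real simulator: the names `VL` and `MUL_STAGES` and the pipe depth of four are taken from the example in the lecture, and the cycle count ignores fetch and decode.

```python
# Hypothetical sketch: the register-fetch (R) stage reads one vector
# element per cycle and pushes one multiply operation down the pipe.
VL = 64          # vector length register
MUL_STAGES = 4   # depth of the multiply pipe (from the lecture's example)

def sequential_vector_multiply(a, b):
    """Issue VL element-wise multiplies, one per cycle, no parallelism."""
    result = []
    for i in range(VL):              # 64 sequential register-file reads
        result.append(a[i] * b[i])   # one operation enters the multiply pipe
    # cycles to issue all elements plus drain the pipe
    cycles = VL + MUL_STAGES - 1
    return result, cycles

res, cycles = sequential_vector_multiply(list(range(64)), [2] * 64)
print(cycles)  # 67: 64 issue cycles plus 3 cycles to drain
```

The point of the model is that one vector instruction occupies the multiply pipe's issue port for the full vector length, which is exactly the "everything sequential" behavior above.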
One thing you will note is, instructions can stop here, and they basically generate more work at that point, up to the maximum vector length, or whatever the vector length register is currently set at. So let's look at a basic operation here. We're going to have a piece of code which is doing the same operation: we're taking vector A and vector B and element-wise multiplying them, and i is four. I use a small i here because it's hard to draw these things with an i of 64. [LAUGH] Just lots of instructions to draw. If we go look at the assembly code, it's the same assembly code we had before, but we load the vector length with four instead of 64. So: load vector, load vector, multiply vector-vector double, and store. Let's look at the second load, the multiply, and the store; I don't have the other load, or the load immediate, on here because it just took up too much space. This load vector here is going to start off by fetching and decoding, and then it's usually going to sit at the R stage for a while, inserting loads, loads, loads, loads down the pipeline. Okay. In this basic vector execution, we don't have any bypassing, and we do register dependency checking through the register file and on whole registers. So, we stall the instruction if the whole vector is not ready.
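The whole-register dependency rule can be captured with a hedged cost model: because there is no bypassing, each vector instruction must completely write back before the next one begins its register fetches, so the instructions serialize. The opcodes mirror the lecture's sequence, but the pipe depths here are assumptions for illustration.

```python
# Hedged model: with whole-register dependency checking and no bypassing,
# each vector instruction fully drains before the next one starts.
VL = 4                                        # vector length from the example
PIPE_DEPTH = {"LV": 3, "MULVV": 4, "SV": 3}   # assumed stage counts

def serialized_cycles(program):
    total = 0
    for op in program:
        # VL issue cycles, then wait for the pipe to drain before
        # the next instruction may read the register file
        total += VL + PIPE_DEPTH[op] - 1
    return total

print(serialized_cycles(["LV", "LV", "MULVV", "SV"]))  # 25
```

Under these assumed depths, a four-element load/load/multiply/store sequence costs 25 cycles, with most of the time spent waiting on full writebacks rather than doing useful overlap.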
Now, we're going to look at ways to make that better in a second. But what that means is, we wait for all of the values of this load to write back to the register file before we go and start to do the register fetches of the next instruction. Then we do the multiplies, and multiplies have longer pipeline lengths. We wait for all of those multiplies to write to the register file before we start to go do the store operations. So, scoreboarding and bypassing here are very, very limited: you basically have a check to make sure everything is in the register file before you go ahead. I do want to introduce one piece of nomenclature, and your book calls this out: it's called a chime. A chime is how long it takes to execute one vector instruction on the architecture. What we're going to see in a little bit is that we're going to have some architectures where you can actually overlap some portion of the execution, let's say by having different functional units, and decrease the chime. So, for this architecture here, the chime is four, because it takes, basically, an occupancy of four for a vector length of four, and we only have one ALU, effectively, that can be used at a time here. So now, let's take a look at how to make things run a little bit faster.
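Chime accounting can be sketched numerically. This is a rough model in the style of the textbook the lecture refers to: instructions that can execute together form a convoy, a program taking m convoys runs in roughly m chimes, and each chime costs about VL cycles, ignoring pipeline startup. The convoy counts below are illustrative assumptions, not measurements.

```python
# Hedged chime estimate: total cycles ~= convoys * VL, ignoring startup.
VL = 4

def estimated_cycles(convoys, vl=VL):
    # each convoy occupies the machine for about one chime (vl cycles)
    return convoys * vl

no_overlap = estimated_cycles(4)   # LV, LV, MULVV, SV each run alone
overlapped = estimated_cycles(2)   # assume loads overlap with compute
print(no_overlap)  # 16
print(overlapped)  # 8
```

This matches the lecture's claim: with one effective ALU the chime per instruction is four, and overlapping work across functional units is what shrinks the total chime count.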
We've exploited no parallelism. The only real advantage we've taken here is that we've decreased our instruction fetch memory bandwidth. But that's not a real great reason to go do anything; it probably reduces power a little bit, but we want to go fast. So, we can start to think about how to overlap. If we have different functional units (the ALU, the load unit, the store unit, and the multiply unit), we can start to think about overlapping them in both space and time. So, here's an example where we're executing 32 elements as our vector length, and we actually have multiple copies of the functional units, so we can overlap different executions with each other; we'll look at some more detailed examples in a second. So, we can start to add parallelism by using multiple units at the same time: our adder unit, our multiply unit, and our load unit (this was L, this was Y, and this was X), and we can actually put multiple copies of those. We're going to call those lanes.
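The benefit of lanes can be sketched with one line of arithmetic: if each of L lanes handles every L-th element, a vector instruction occupies its functional unit for about ceil(VL / L) cycles instead of VL. A minimal sketch, with the 32-element vector length from the example:

```python
# Hedged lane model: L lanes each process every L-th vector element,
# so issue occupancy drops from VL cycles to about ceil(VL / L).
import math

def lane_cycles(vl, lanes):
    return math.ceil(vl / lanes)

print(lane_cycles(32, 1))  # 32 cycles with a single lane
print(lane_cycles(32, 4))  # 8 cycles with four lanes
```

This is the space-time overlap the lecture is building toward: multiple functional units give overlap across instructions, while multiple lanes give overlap within one vector instruction.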