So one of the first optimizations we are going to look at is what we call vector chaining. Vector chaining was introduced in the Cray-1. The idea is that you don't have to wait for the first operation, or the whole first vector, to complete before you start the second vector. Because if we went back and looked at this picture here, this load creates element zero of the vector, we'll say. This operation here looks at the third element in the vector, so it cannot execute right after this load is done; it has to wait for this one. But this multiply here can execute right after this load happens. So chaining in a vector architecture is, to some extent, the moral equivalent of register bypassing.

And what this looks like is, first of all, you don't actually have to add intra-functional-unit bypassing. The most basic notion of chaining is that you bypass just between functional units. So what will happen here is we're going to do a load, and we can take the value of the load as it's heading for the vector register file and also feed it through our bypass network into a different pipeline, for instance the multiply pipeline in the example we've been looking at. And you can take the result of the multiply when it's done and feed that into, say, the store pipeline, if you have a store pipeline that's broken apart into stages, or, in this example here, we're actually feeding a multiply into an add.

So the advantage here, as I said, is that instead of having to wait for the load to complete and then the multiply to complete before the next instruction, we can overlap portions of the dependent instructions as their operands become available.

So let's look at how this works out here where we're doing chaining, but through the register file. Chaining through the register file, what I mean by that is, we're not going to have a bypass directly from the output here to the read stage. We're actually going to have to wait for it to write and then read it, but at least we can overlap. In our previous example we were not even able to do this overlap; we had to wait for the entire vector register to be written. So we've effectively pulled this forward from here to there. But what you might notice here is, well, this value is already here, so we could even try to pull it a little bit farther forward by actually adding bypassing, or more traditional bypassing. But this is just chaining through the register file.
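To make that element-wise dependence concrete, here is a tiny sketch in C. The latencies are made-up assumptions for illustration, not the lecture's exact pipeline numbers; the point is that the chained multiply on element i only needs the load's element i, not the whole vector.

```c
#include <stdio.h>

/* Sketch of the element-wise dependence behind chaining; the vector
 * length and load latency below are assumed, illustrative values. */
int main(void) {
    int VL = 4;                        /* vector length */
    int L_load = 3;                    /* assumed load pipe latency */
    for (int i = 0; i < VL; i++) {
        int load_done = L_load + i;    /* cycle when load writes element i */
        int mul_start = load_done + 1; /* chained multiply can start here,
                                          without waiting for the rest */
        printf("element %d: load done @ cycle %d, multiply can start @ %d\n",
               i, load_done, mul_start);
    }
    return 0;
}
```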
Okay, so now let's add chaining through the bypass network. And... it didn't change. Well, what happened? That's a bummer. Nothing happened. I was expecting this to go faster. You know, I want to make my architecture go faster here.

Well, what happens is you start to get structural hazards on the read ports and write ports of the vector register file. So to make this go faster, we're actually going to have to add more read ports and more write ports to the vector register file. And if you were smart, what you really want to do is add as many ports per pipeline as that pipeline needs. So if each pipeline has two input operands and one output operand, you're probably going to add two read ports and one write port to the register file per pipeline. And this actually lets things go a beat faster.

So now we have bypassing, and we have more register file bandwidth, that is, more ports into our vector register file. What you'll note here is that now we have multiple things in the read stage at the same time and multiple things in the write stage at the same time. And we can do that because we have multiple read ports and multiple write ports to our vector register file.

The other thing you should note here is that this multiply here, which is on element zero, needs the result of the load. The result gets created here, it gets bypassed down to here, and then it just goes and executes. And then they just step along in lockstep here. Similarly, the store here needs the output of the multiply in Y3 for the first element, so it can bypass it there and then start doing the store.

So we are able to get some parallelism here, even without having multiple copies of the same execution unit; we are able to get parallelism in our vector architecture. And what's really cool is we only had to issue three instructions to do this. So, very low instruction bandwidth. And I wanted to make the point here that this is kind of a contrived case, because our vector length is very short. It almost doesn't make sense to use vector instructions to do this.
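As a rough back-of-the-envelope model of the three schemes we just walked through (the model and all the latencies are my own assumptions, not the lecture's exact pipeline depths), you can compare when the last element of the final result is ready:

```c
#include <stdio.h>

/* Rough cycle-count model: no chaining, chaining through the register
 * file, and chaining with bypassing. All numbers are assumptions. */
int main(void) {
    int VL = 4;                     /* vector length */
    int L1 = 3, L2 = 3, L3 = 3;     /* assumed load, multiply, add depths */

    /* No chaining: each instruction waits for the whole previous vector. */
    int no_chain  = (L1 + VL) + (L2 + VL) + (L3 + VL);

    /* Chaining through the register file: a dependent instruction reads
     * element 0 one cycle after it is written, then runs in lockstep. */
    int chain_rf  = (L1 + 1) + (L2 + 1) + L3 + VL;

    /* Chaining with bypassing (and enough register file ports): the extra
     * write-then-read cycle per chain link is bypassed away. */
    int chain_byp = L1 + L2 + L3 + VL;

    printf("no chaining: %d cycles\nchain via regfile: %d cycles\n"
           "chain + bypass: %d cycles\n", no_chain, chain_rf, chain_byp);
    return 0;
}
```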
But here is a picture of the same sort of thing happening, and the overlap you can get when you increase your vector length. So here we're going to set our vector length register to eight. And we're going to get much more overlap of the execution, and you can basically have this instruction here, the element-zero operation, starting three cycles after that first load. And this would be as if you had, let's say, eight independent scalar processors sort of running in parallel, but we could do it with almost no encoding space.

So, questions about that? So far, a little technical with the pictures. There's lots of bypassing happening here through the bypass network, but the bypassing is relatively restricted: we only need to bypass one value at a time from one location to another location. And if you do it through a register file, it pushes each bypass out one cycle, so it's not so bad.

Okay. So this brings up the question: what happens if your vector length, the size of your array, is not a nice multiple of the vector length register? Or let's say the vector that you're trying to multiply is much, much larger than the maximum value the vector length register can hold. So we use this notion of strip mining. And the idea is that you do big chunks, and then you do the remainder by resetting the vector length register. Now, you could do this one of two ways: you could either do the remainder first and then do the big chunks, or you could do the big chunks and then do the remainder. And both designs have actually been used. To some extent this is a software technique here, because the architecture just allows you to set the vector length, but doesn't have any special hardware support for this.

So, if you look at a piece of code here, this added a lot of stuff to our loop. Now all of a sudden, we still have load, load, multiply, store; these are the old instructions that we had before. But we also have other stuff here that's bumping pointers and looping back around, and we also have to write the vector length register every single time through this loop. And the reason we write the vector length register here is, the way this is actually coded, we do the remainder first and then do the big chunks.
So we're going to write the vector length register with the remainder first, N mod 64. Do that chunk first, and then we're going to do the rest of them with 64. What I really want everyone to take away from this is: when you actually have to do strip mining, and you know that the array size is not a nice even multiple of the vector length, you actually have to do more work in your loop. The other thing I want you to take away here is that the vector length register allows you to do this, but you actually have to change it; you can't just set it to one value.

So let's look at a code example here. Oh yes, I wanted to say: all of this stuff here is overhead, sort of extra overhead. And for short vector lengths, this can be a problem. For longer vector lengths, this starts to go away.

So here we have vector strip mining, and we have the same piece of code that we saw before. We see the load, load, multiply, store, and then we have some junk down here. And this junk is the extra code you need to handle the looping around, the checking and bumping of pointers, and to handle the fact that N may not be a fixed number. One other thing I wanted to point out about vector strip mining is that it also allows you to handle variable-length arrays whose size you don't know at compile time, because you can just use any N and this code snippet will do the right thing.

One other thing I wanted to point out here, which I didn't draw because it's actually very hard to draw and it wouldn't fit on the slide, is that this overhead is substantial when looking at a small vector length like four. If we make our vector length 64, there's a lot more work up here, and this overhead starts to be small and unimportant to some extent. So the overhead of this loop, the checking and the maintenance of the loop, starts to get small when you have very long vectors.
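Here is a minimal strip-mining sketch in C, assuming a maximum vector length of 64 as in the lecture's example. The function name and the set_vl() comment are mine; they stand in for the real vector instructions and the write to the vector length register.

```c
/* Strip-mined elementwise multiply: remainder chunk first, then full
 * 64-element chunks, matching the ordering described in the lecture. */
void vmul_stripmined(int n, const float *a, const float *b, float *c) {
    int low = 0;
    int vl = n % 64;                    /* remainder chunk first: N mod 64 */
    for (int j = 0; j <= n / 64; j++) {
        /* set_vl(vl);  <- on a real machine, write the vector length
           register; the inner loop models vl-element vector instructions */
        for (int i = low; i < low + vl; i++)
            c[i] = a[i] * b[i];         /* load, load, multiply, store */
        low += vl;                      /* bump pointers past this chunk */
        vl = 64;                        /* all later chunks use the full MVL */
    }
}
```

Note that the vector length changes between the first trip and the rest, which is exactly the "write the vector length register every time through the loop" overhead mentioned above.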
Okay, so now we're going to start to look at parallelism a little more deeply. So here we have an instruction, and we have a deeply pipelined ALU. We're going to have multiple things flowing down the pipe at a time, multiple results, and we're going to take, you know, A[3] and B[3], and it's going to take six cycles to do this multiply. This is assuming that we have one pipelined functional unit.

What if, all of a sudden, we start to have multiple copies of the same functional unit? Well, we can actually start to think about executing these things in parallel, and interleaving the respective operations. So we have result C[0] here, C[1] here, C[2] here, C[3] here, and then, modulo that, it wraps around. So this is effectively going to decrease our time, because we're going to add more functional units, more parallel functional units, more copies of the same functional unit, and that will allow us to speed up our execution.

So here is an example of that same computation we were looking at, but now instead of having one copy of everything, we have two copies. And typically these things are called lanes. So we're going to call this lane one and this lane two. Take a look at the same piece of code, this vector strip-mining loop with a vector length of four, with two lanes. And we can see this basically pulls things forward, and we can overlap multiple loads at the same time, so two things are in L0 at once. And also in this multiply here, we can have basically two multiplies going down the pipe at the same time, like that.

Now, one thing to note here is that different architectures can do the interleaving in different ways. This is one possible way to do the interleaving. You can also think about these two pulled forward and these two delayed, or something like that. Both of those architectures work. But I will say that typically you'll try to interleave your lanes in a low-order interleaved manner. So what that would mean is not this picture; this is a sort of high-order interleaved manner. A low-order interleaved manner would be element zero, element one, element two, element three across the lanes, and then you wrap around.
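Here is a small sketch of the difference between the two mappings; the constants and the mapping formulas are mine, for illustration, matching the two-lane, vector-length-four case we've been drawing.

```c
#include <stdio.h>

#define LANES 2   /* number of lanes, as in the two-lane example */
#define VL    4   /* vector length */

/* Low-order interleaving assigns element i to lane i mod LANES;
 * high-order interleaving gives each lane a contiguous half. */
int main(void) {
    for (int i = 0; i < VL; i++) {
        int low_order  = i % LANES;         /* 0,1,0,1: wraps around */
        int high_order = i / (VL / LANES);  /* 0,0,1,1: contiguous halves */
        printf("element %d: low-order lane %d, high-order lane %d\n",
               i, low_order, high_order);
    }
    return 0;
}
```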
This is a picture here showing an architecture which has four lanes, and one of the things I wanted to point out here is that you can actually partition your register file. From a hardware perspective, you don't need to have one monolithic register file, because you know that a given lane will only ever access, let's say, elements zero, four, eight, and so on of your vector registers, or, excuse me, elements of the vector register. So if you have 64 elements in your vector register, elements zero, four, eight, and so on of those can go here. So you're basically interleaving the respective sub-portions of a vector register across these different lanes. The key takeaway point is that you can partition the register file.

I drew this to show two different things going on. Let's say, as in the examples we've been looking at, one of these is the multiply pipe, one of these is the add pipe, and here we have our load/store pipe or something like that. You can have multiple pipes in a lane. And then a functional unit sort of runs the other way; so this is all of the adder units, we'll say.

And in fact, this is basically exactly the architecture of the T0 vector processor from Berkeley. And, as you can see in their layout here, they striped the respective elements of one register in the vector register file over the lanes. And then they actually mirrored the data path: they have two copies of the same thing on either side of the lane. And up here is their scalar processor, their control processor if you will. This is where they execute the scalar operations like the shifts and the branches and things like that.

Okay, so let's recap here. Vector instruction sets have quite a few advantages. The main one is that they're very compact: one instruction can generate lots and lots of work to do. And they're relatively expressive. What I mean by expressive is that they very concisely state exactly the constraints between all the different operations they generate. So, for instance, each of the different element operations is independent. This allows us to stripe the registers across lanes. It allows us to know that we don't need to bypass between element operations. It allows us to access all these registers in a disjoint way.

Our memory access pattern, at least for unit-stride operations, is very well understood and very easy to plan for. We know that it's going to bank very well. So if we have a cache, for instance, we can actually bank our cache, because we know, if we're doing unit-stride operations and we have multiple lanes, for instance here, that we're only going to be accessing the zeroth element of a vector here, the first element there, the second element there, etc. And this works out very well from a banking perspective, though sometimes it may not always work out.
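As a quick sketch of that banking argument (the bank count, element size, and base address are assumptions for illustration), consecutive unit-stride elements map to distinct banks under low-order interleaving, so the lanes don't collide on a bank in the same beat:

```c
#include <stdio.h>

#define BANKS 4                    /* say, one bank per lane */

/* Unit-stride accesses under low-order bank interleaving: element i
 * lands in bank i mod BANKS, cycling through all banks evenly. */
int main(void) {
    unsigned base = 0x1000;        /* arbitrary base address */
    for (int i = 0; i < 8; i++) {
        unsigned addr = base + 4u * i;   /* 4-byte elements, stride 1 */
        printf("element %d -> address 0x%x -> bank %u\n",
               i, addr, (addr / 4) % BANKS);
    }
    return 0;
}
```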
And finally, these sorts of architectures can be scalable. What I mean by scalable is that you can add more lanes. And as long as you don't add more lanes than your maximum vector length, you're probably doing okay; or you might want to back off from that a little bit, by the depth of your pipeline, say. But you can add more and more hardware, and in fact people have actually built families of these processors with different numbers of lanes in the same architecture. And this is actually common in our GPUs today, where they have multiple copies of the same units and will actually build different variants of the same processor pipeline, ones that have, let's say, one lane versus fifteen lanes or something like that. And you can sell a high-end part which has fifteen lanes, and a low-end part that has one lane.
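A back-of-the-envelope way to see why lanes scale only up to the vector length (the formula is my own, not from the lecture): with N lanes, a vector of VL elements issues in ceil(VL/N) beats, so once N exceeds VL the extra lanes just sit idle.

```c
#include <stdio.h>

/* Issue beats per vector instruction as lane count grows; VL is an
 * assumed maximum vector length of 64 for illustration. */
int main(void) {
    int VL = 64;
    for (int lanes = 1; lanes <= 128; lanes *= 2) {
        int beats = (VL + lanes - 1) / lanes;   /* ceil(VL / lanes) */
        printf("%3d lanes -> %2d issue beats\n", lanes, beats);
    }
    return 0;   /* beats stop improving once lanes > VL */
}
```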