1 00:00:00,000 --> 00:00:04,311 Okay. 2 00:00:04,311 --> 00:00:10,740 So now we're going to move off of vectors and talk about a near cousin of 3 00:00:10,740 --> 00:00:14,111 vectors: how you can have vector 4 00:00:14,111 --> 00:00:21,675 computing in your desktop today. A lot of this was 5 00:00:21,675 --> 00:00:29,220 done by Ruby Lee here at Princeton. She added a lot of multimedia 6 00:00:29,220 --> 00:00:36,780 extensions to the HP PA-RISC architecture. There were a couple of other people involved 7 00:00:36,780 --> 00:00:43,022 in this, but she was pretty influential in getting this done. 8 00:00:43,022 --> 00:00:49,421 The idea here is that if you have a wide register, so if you're doing, let's 9 00:00:49,421 --> 00:00:55,067 say, 64-bit additions, and you don't want to have to do 64-bit 10 00:00:55,067 --> 00:01:00,413 additions, or don't actually have 64-bit data lying around, you could cut it in 11 00:01:00,413 --> 00:01:03,477 half and do two 32-bit operations at the same time, 12 00:01:03,477 --> 00:01:07,980 or you can use that same ALU and do four 16-bit 13 00:01:07,980 --> 00:01:13,215 or eight 8-bit operations. So, this is called SIMD, or Single 14 00:01:13,215 --> 00:01:19,846 Instruction, Multiple Data, or short SIMD instructions here, because 15 00:01:19,846 --> 00:01:24,034 typically the vector length is pretty short, 16 00:01:24,034 --> 00:01:30,055 or multimedia extensions. You have an instruction which says, I 17 00:01:30,055 --> 00:01:34,680 want to do two 32-bit adds, we'll say, at the same time. 18 00:01:36,400 --> 00:01:42,555 This was popularized in x86, at least, by MMX, which was the first 19 00:01:42,555 --> 00:01:48,182 implementation of this. And it's gone on from there 20 00:01:48,182 --> 00:01:51,348 to SSE, SSE2, SSE3, SSE4, and now Intel AVX.
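To make the idea of cutting one wide register into narrower lanes concrete, here is a minimal sketch. It models a 64-bit register as a plain Python integer; the helper names split_lanes and join_lanes are made up for illustration and do not correspond to any real instruction set.

```python
# Illustrative sketch only: a 64-bit register modeled as a Python int.
# split_lanes and join_lanes are hypothetical helper names.

REGISTER_BITS = 64

def split_lanes(register, lane_bits):
    """Return the lanes of a 64-bit value, lowest lane first."""
    mask = (1 << lane_bits) - 1
    return [(register >> shift) & mask
            for shift in range(0, REGISTER_BITS, lane_bits)]

def join_lanes(lanes, lane_bits):
    """Pack lanes (lowest first) back into one 64-bit value."""
    mask = (1 << lane_bits) - 1
    register = 0
    for index, lane in enumerate(lanes):
        register |= (lane & mask) << (index * lane_bits)
    return register

reg = 0x0123456789ABCDEF
for width in (32, 16, 8):
    lanes = split_lanes(reg, width)
    # Same 64 bits, viewed as two, four, or eight independent values.
    print(width, len(lanes), [hex(lane) for lane in lanes])
```

The same 64 bits come back as two 32-bit, four 16-bit, or eight 8-bit values, which is all a short SIMD instruction assumes about its operands.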
21 00:01:51,348 --> 00:01:58,500 And the differences between MMX and all the different SSEs largely have to do 22 00:01:58,500 --> 00:02:03,266 with the length of the register and how many instructions they had. 23 00:02:03,266 --> 00:02:08,245 So in AVX we've gone to 256-bit registers, wider registers, and it's 24 00:02:08,245 --> 00:02:11,660 extensible to, I think, 1024 bits. 25 00:02:13,180 --> 00:02:19,311 One thing I do want to point out about this, which is interesting, is that this 26 00:02:19,311 --> 00:02:25,282 requires changes to your data path. If you have an adder, and you have a 64- 27 00:02:25,282 --> 00:02:31,575 bit add, and now you want to do eight 8-bit adds, you need to cut the carry 28 00:02:31,575 --> 00:02:38,993 chain in seven places. Now, that's if you have a basic adder. 29 00:02:38,993 --> 00:02:44,900 It gets a little more complicated if you have something like a 30 00:02:46,200 --> 00:02:50,959 carry-lookahead adder, or something like that, 31 00:02:50,959 --> 00:02:56,223 because you may not have a simple place to go snip the carry chain. 32 00:02:56,223 --> 00:03:00,771 There is still some place to cut it, but in your original design you 33 00:03:00,771 --> 00:03:04,140 might have propagated across a boundary where now you need to cut. 34 00:03:04,140 --> 00:03:06,493 So, this is definitely a challenge. 35 00:03:06,493 --> 00:03:10,611 Also, for things like multiplies, if you want to do eight 8-bit multiplies, 36 00:03:10,611 --> 00:03:13,820 the structure looks a little bit different there. 37 00:03:13,820 --> 00:03:17,670 But the big insight here is, you had that logic anyway. 38 00:03:17,670 --> 00:03:22,817 You're just effectively adding muxes on the carry chains to the data 39 00:03:22,817 --> 00:03:26,296 path. And some operations you don't even need 40 00:03:26,296 --> 00:03:29,620 to add.
Obviously, if you're operating on 41 00:03:29,620 --> 00:03:34,990 something like eight 8-bit values, and you want to do the logical OR of them, 42 00:03:34,990 --> 00:03:37,720 you don't need to add a special instruction for that. 43 00:03:41,000 --> 00:03:46,684 From an implementation perspective, this is what I was trying to get at here. 44 00:03:46,684 --> 00:03:51,953 You have independent adds going on, and they all happen in parallel. So why 45 00:03:51,953 --> 00:03:57,846 do we like multimedia extensions, or these vector instructions, or short vector 46 00:03:57,846 --> 00:04:01,451 instructions? Let's compare them to our big vector 47 00:04:01,451 --> 00:04:04,848 machines. So, one of the major differences is that 48 00:04:04,848 --> 00:04:10,711 you can't control the vector length. The vector length is fixed by the length 49 00:04:10,711 --> 00:04:15,610 of the native data word, 50 00:04:15,610 --> 00:04:21,338 that is, the length of the native data type for your instruction 51 00:04:21,338 --> 00:04:24,040 set. And 52 00:04:24,040 --> 00:04:27,593 strided, scatter-gather, these other operations are hard to do, 53 00:04:27,593 --> 00:04:30,797 because typically you just have a single load and store. 54 00:04:30,797 --> 00:04:34,176 And you use the processor's load and store instructions, 55 00:04:34,176 --> 00:04:38,487 because the processor doesn't care. It's just the same way that unary 56 00:04:38,487 --> 00:04:43,147 operations or logical operations don't need special instructions to do short 57 00:04:43,147 --> 00:04:46,293 vector, or single instruction multiple data, operations. 58 00:04:46,293 --> 00:04:51,012 You don't need special instructions for SIMD data to be able to do loads and 59 00:04:51,012 --> 00:04:53,020 stores. You just load the data 60 00:04:53,020 --> 00:04:57,937 and store the data.
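The point about cutting the carry chain can be sketched in software. The following is a rough model, not how any particular ALU is built: it treats a Python integer as a 64-bit register and performs eight 8-bit adds by masking so that no carry crosses a lane boundary. The names packed_add8 and lanewise_add8 are invented for this example. It also shows why a logical OR needs no special handling.

```python
# Illustrative sketch: a 64-bit register modeled as a Python int.
# packed_add8 and lanewise_add8 are hypothetical names, not a real ISA's API.

HIGH_BITS = 0x8080808080808080  # bit 7 of every 8-bit lane
LOW_BITS = 0x7F7F7F7F7F7F7F7F   # bits 0..6 of every 8-bit lane
MASK_64 = 0xFFFFFFFFFFFFFFFF

def packed_add8(a, b):
    """Eight independent 8-bit adds; no carry crosses a lane boundary."""
    # Add the low 7 bits of each lane. The sum per lane is at most 0xFE,
    # so the carry lands in bit 7 of the same lane and goes no further.
    low_sum = (a & LOW_BITS) + (b & LOW_BITS)
    # Fold in the top bit of each lane with XOR; the carry out of bit 7
    # is dropped, which is the software analogue of cutting the chain.
    return (low_sum ^ ((a ^ b) & HIGH_BITS)) & MASK_64

def lanewise_add8(a, b):
    """Reference: add lane by lane, wrapping each lane modulo 256."""
    result = 0
    for shift in range(0, 64, 8):
        lane = (((a >> shift) & 0xFF) + ((b >> shift) & 0xFF)) & 0xFF
        result |= lane << shift
    return result

a = 0x00000000000001FF  # lanes (low to high): 0xFF, 0x01, 0x00, ...
b = 0x0000000000000101  # lanes (low to high): 0x01, 0x01, 0x00, ...

print(hex((a + b) & MASK_64))  # ordinary 64-bit add: carry leaks into lane 1
print(hex(packed_add8(a, b)))  # packed add: lane 0 wraps to 0x00, lane 1 is 0x02
print(hex(a | b))              # logical OR is already lane-independent
```

The ordinary add and the packed add disagree exactly where a carry would have crossed from one lane into the next, while the OR gives the same answer whether the register is viewed as one 64-bit value or eight 8-bit values.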
This is actually starting to change a 61 00:04:57,937 --> 00:05:02,199 little bit. Some of the new versions of SSE actually 62 00:05:02,199 --> 00:05:06,420 do have some scatter-gather modifications. 63 00:05:06,420 --> 00:05:13,800 It's a little bit harder, if you think about it, because you can't hold a 64 00:05:13,800 --> 00:05:20,200 full address, if you will, in a vector. So it's not like you can actually do 65 00:05:20,200 --> 00:05:24,160 indexed addressing, because you can't 66 00:05:24,160 --> 00:05:26,740 necessarily hold the full address in there. 67 00:05:26,740 --> 00:05:31,780 But, in essence, they've come up with some way to do scatter and gather 68 00:05:31,780 --> 00:05:38,259 operations. One consequence of the vector 69 00:05:38,259 --> 00:05:44,197 register length being limited is that you can't do as much work in one 70 00:05:44,197 --> 00:05:48,043 operation. So, you can't necessarily do 64 71 00:05:48,043 --> 00:05:53,981 operations in one instruction, like we did with our vector length of 64. 72 00:05:53,981 --> 00:05:57,577 So that is a problem. 73 00:05:57,577 --> 00:06:03,598 And unfortunately, what happens here is you end up having to do more 74 00:06:03,598 --> 00:06:10,757 operations and issue more instructions. And you're effectively increasing the 75 00:06:10,757 --> 00:06:16,394 bandwidth out of your fetch unit. So it's not as 76 00:06:16,394 --> 00:06:19,796 good. And finally, I just wanted to say 77 00:06:19,796 --> 00:06:25,044 that these multimedia extensions are 78 00:06:25,044 --> 00:06:30,790 starting to move a little bit towards vector processors as they add richer 79 00:06:30,790 --> 00:06:34,620 instruction sets.
So, as we get to SSE4, for instance, or 80 00:06:34,620 --> 00:06:40,081 SSE4.2, there are more instructions in x86 that can do fancier 81 00:06:40,081 --> 00:06:43,486 things. And the vector length is even getting 82 00:06:43,486 --> 00:06:47,600 longer, up to 1024 bits.
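The cost of a short, fixed vector length can be put into rough numbers. The sketch below counts how many SIMD instructions one elementwise operation over 64 values would take at several register widths. The 32-bit element size is an assumption chosen for illustration, the 64-element count mirrors the vector length of 64 mentioned above, and instructions_needed is a made-up helper.

```python
# Illustrative sketch; instructions_needed is a hypothetical helper and
# the 32-bit element size is an assumption for the example.

def instructions_needed(num_elements, element_bits, register_bits):
    """SIMD instructions for one elementwise operation over num_elements values."""
    lanes = register_bits // element_bits       # elements handled per instruction
    return (num_elements + lanes - 1) // lanes  # ceiling division

for register_bits in (64, 128, 256, 1024):
    print(register_bits, instructions_needed(64, 32, register_bits))
```

A vector machine with a vector length of 64 covers the same work with a single instruction, so the narrower the register, the more instructions the fetch unit has to deliver for the same loop.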