So, an important aspect of all this vector work is: how do you compile for it? Thankfully, we actually have compilers that can do automatic vectorization. One of the challenges here, if you look at this element-wise multiply, is that you have one loop running and another loop running, and your compiler needs to figure out that it can merge those loops and run them at the same time.

Compilers have actually gotten pretty sophisticated. If you look at the Cray compiler now, it can do outer-loop parallelism, it can handle certain kinds of loop-carried dependencies, and it can vectorize all of this. But it requires some pretty deep compiler analysis. This works especially well for things like Fortran codes, where you don't have random pointers pointing in different places; C codes get a little bit harder.

So, what if you don't want to execute the same code on all the elements of your vector? Well, that could be a problem. Here we have a piece of code, this is C code, which loops over some big vector. It checks whether the value is greater than zero, and only if it's greater than zero does it do the next operation. To handle this, there have been extensions to vector processors that effectively allow predicates, or masked operations, on a per-element basis of the vector. The way you would do this is: you load the entire vector, set a mask register that holds a one or a zero as the result of this comparison on an element-by-element basis, and then do the operation. You can put this together with these bit-by-bit comparisons and get slightly different control flow for the different elements within a vector.

Just to show the implementation of this: if we look at how to actually implement masking, one way to do it is to perform every operation anyway. Say you're doing a multiply and your vector length is 64. You do all 64, but you simply disable the write to the register file for the elements whose mask bit is turned off. Or you could have a much fancier implementation which removes the work that doesn't have to be done, but the control for that is quite a bit harder. I would say the simple implementation is probably the more common one.
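To make the masking idea concrete, here is a small C sketch of that simple implementation: compute a 0/1 mask from the comparison, do the multiply on every element anyway, and let the mask decide which results are actually written back. The array names a and b, the multiply as the guarded operation, and the fixed length of 64 are assumptions for illustration, not the lecture's exact code.

    #include <stddef.h>

    #define VLEN 64   /* assumed vector length, matching the 64-element example */

    /* Original conditional loop: only update a[i] when a[i] is positive. */
    void conditional_scalar(float a[VLEN], const float b[VLEN])
    {
        for (size_t i = 0; i < VLEN; i++) {
            if (a[i] > 0.0f)
                a[i] = a[i] * b[i];
        }
    }

    /* The same computation arranged the way a masked vector machine runs it:
     * (1) compare every element and record a 0/1 mask bit,
     * (2) do the multiply on every element regardless,
     * (3) let the mask decide whether each result is written back.
     * Step (3) models "disable the write to the register file when the
     * mask bit is off". */
    void conditional_masked(float a[VLEN], const float b[VLEN])
    {
        unsigned char mask[VLEN];
        float result[VLEN];

        for (size_t i = 0; i < VLEN; i++)   /* set the mask register */
            mask[i] = (a[i] > 0.0f);

        for (size_t i = 0; i < VLEN; i++)   /* do the operation on all elements */
            result[i] = a[i] * b[i];

        for (size_t i = 0; i < VLEN; i++)   /* masked write-back */
            if (mask[i])
                a[i] = result[i];
    }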
Taking the work out is harder largely because, if you have the resources anyway, say if you have multiple lanes, it might just make sense to go ahead and execute what amounts to a null operation.

Some other things that are pretty common with vectors are reductions. What I mean by a reduction is: say you have an array and you want to add all of its elements into a single variable. That's a vector-to-scalar operation, and you can't really do it with what we've discussed so far; there's no vector operation that operates across all of these values and combines them into something useful. But what you can do is use software tricks. One trick is to take the whole vector and instead treat it as two vectors: cut it in half, overlap the two halves, and do parallel adds. Then you take the result, cut it in half again, overlap the two parts, and add again. So you can do lots of parallel adds and effectively build a reduction operation by building a tree of adds. If we have our vector here, we cut it in half and add this part to that part, and the result is half the size. We cut that in half and add again, and the result is half the size again. We keep cutting and adding, and in that way we can use our ordinary vector arithmetic to effectively do a reduction.

We're about out of time here, so let's talk briefly about scatter-gather. The idea isn't that deep, although the implementation can be very hard. The access pattern is something like A[D[i]]: we want to index based on an index that is itself stored in a vector. That's called a gather. Scatter is the other direction, where you do a store through an index of an index, a sort of double indirection. In the instruction set in your book, there's actually an instruction to do this, LVI. What it basically does is take each element of vector D, use it to index into vector C, and that gives you the result. The problem with this, of course, is that your memory accesses are not going to be nicely laid out; you're going to be jumping around in memory.

Let's stop here for today, and we'll talk a little bit more about vectors and GPUs next time.
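As a footnote to the reduction and gather discussion above, here is a small C sketch of both ideas. The power-of-two length of 64, the in-place folding of the input array, and the array names are assumptions for illustration; the point is that the inner loop of the reduction is a plain element-wise add, which is exactly what the halving trick relies on.

    #include <stdio.h>

    #define VLEN 64   /* assumed power-of-two vector length */

    /* Tree reduction in software: repeatedly fold the upper half of the
     * vector onto the lower half with element-wise adds.  Each pass halves
     * the live length, so summing 64 elements takes log2(64) = 6 passes,
     * each of which is an ordinary vectorizable add.  Note this overwrites
     * the contents of v. */
    float tree_reduce_sum(float v[VLEN])
    {
        for (int half = VLEN / 2; half >= 1; half /= 2)
            for (int i = 0; i < half; i++)
                v[i] = v[i] + v[i + half];
        return v[0];
    }

    /* Gather written out in scalar C: each element of the index vector d
     * selects an element of c, i.e. dst[i] = c[d[i]].  An indexed vector
     * load (LVI in the book's instruction set) performs this pattern;
     * scatter is the store version of the same pattern. */
    void gather(float dst[VLEN], const float c[], const int d[VLEN])
    {
        for (int i = 0; i < VLEN; i++)
            dst[i] = c[d[i]];
    }

    int main(void)
    {
        float v[VLEN];
        for (int i = 0; i < VLEN; i++)
            v[i] = 1.0f;                     /* 64 ones should sum to 64 */
        printf("sum = %f\n", tree_reduce_sum(v));
        return 0;
    }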