So, an important aspect of all this vector work is how you compile for it. Thankfully, we actually have compilers that can do automatic vectorization. One of the challenges here, if you look at this element-wise multiply, is that you have one loop running and another loop running, and your compiler needs to figure out that it can merge those loops and run them at the same time. Compilers have actually gotten pretty sophisticated. If you look at the Cray compiler now, it can do outer-loop parallelism, it can handle certain types of loop-carried dependencies, and it can vectorize all of this. But it requires some pretty deep compiler analysis. This works especially well for things like Fortran codes, where you don't have random pointers pointing in different places. C code gets a little bit harder.

So, what if you don't want to execute the same code on all the elements of your vector? Well, that could be a problem. Here we have a piece of C code which loops over some big vector and checks whether each value is greater than zero, and only if it's greater than zero does it do the next operation. There have been extensions to vector processors that effectively allow predicates, or masked operations, on a per-element basis of the vector. The way you would do this is you would load the entire vector, set a mask register where each bit is a one or a zero as the result of that comparison on an element-by-element basis, and then do the operation under the mask. You can put these bit-by-bit comparisons together and get slightly different control flow for the different elements within a vector.

Just to show the implementation of this: if we look at how to actually implement masking, one way to do it is to do every operation anyway. Say you're doing a multiply and your vector length is 64: you do all 64 multiplies, but you disable the write to the register file for the elements whose mask bit is turned off. Or you could have a much fancier implementation that takes out the work that doesn't have to be done, but the control for that is quite a bit harder. I would say the simple implementation is probably the more common one. The fancier approach is harder to justify largely because, if you have the resources anyway (say you have multiple lanes), it might just make sense to go execute what amounts to a null operation.

Some other things that are pretty common in vector code are reductions. What I mean by a reduction is: say you have an array and you want to add all the elements of the array into one variable. It's sort of a vector-to-scalar operation. You can't really do this with what we've discussed so far; there's no vector operation that operates across all of these values and combines them into something useful. But what you can do is use some software tricks. One trick is to take the whole vector and instead treat it as two vectors: cut it in half, overlap the halves, and do parallel adds. Then you take the result of that, cut it in half again, overlap the two parts, and add again. So you can do lots of parallel adds and effectively build a reduction operation by building a tree of adds.
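(A small sketch in plain C of the halving trick just described; this is an illustration added here, not code from the lecture, and the array name, the fill values, and the power-of-two length are assumptions. Each pass of the outer loop folds the top half of the live region onto the bottom half; on a vector machine each inner loop would be a single vector add of half the current length.)

    #include <stdio.h>

    #define N 64                     /* assume a power-of-two vector length */

    int main(void) {
        float a[N];
        for (int i = 0; i < N; i++)
            a[i] = 1.0f;             /* fill with something easy to check */

        /* Tree reduction: repeatedly add the top half onto the bottom half.
           Each inner loop is independent element-wise work, so a vector
           machine can do it as one vector add of length "half". */
        for (int half = N / 2; half >= 1; half /= 2) {
            for (int i = 0; i < half; i++)
                a[i] = a[i] + a[i + half];
        }

        printf("sum = %f\n", a[0]);  /* a[0] now holds the full sum (64.0) */
        return 0;
    }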
So, concretely: if we have our vector, we cut it in half and add the first half to the second half, and the result is half the size. We cut that in half again, add the two parts, and the result is half the size again, and we keep doing adds. So we can use our vector arithmetic to effectively do a reduction.

We're about out of time here, so let me briefly talk about scatter-gather. The idea isn't that deep, though the implementation of it can be very hard. Say we want A[D[i]]: we want to index based on an index that is itself stored in a vector. This is called a gather. A scatter is the other direction, where you do a store indexed by an index vector. And in the instruction set in your book, they actually have an instruction to do this, LVI. What that basically does is take each element of vector D, use it to index into vector C, and that loaded value is the result. The problem with this, of course, is that your data is not going to be all nicely laid out in memory; you're going to be sort of jumping around in memory.

Let's stop here for today, and we'll talk a little bit more about vectors and GPUs next time.
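(To pin down the gather semantics mentioned above, here is a small sketch in plain C; this is an illustration added here rather than the lecture's code, and it reuses the A, C, and D names from the example with made-up data. The loop body is what an indexed vector load like LVI does for every element at once, and the irregular C[D[i]] accesses are exactly why the memory behavior is hard.)

    #include <stdio.h>

    #define N 8

    int main(void) {
        float c[N] = {10, 11, 12, 13, 14, 15, 16, 17};  /* data vector C */
        int   d[N] = {7, 3, 3, 0, 5, 1, 6, 2};          /* index vector D */
        float a[N];

        /* Gather: A[i] = C[D[i]].  A vector machine does all N of these
           loads with one indexed-load instruction, but the addresses it
           touches are scattered around memory rather than unit-stride.
           A scatter is the store direction: C[D[i]] = A[i]. */
        for (int i = 0; i < N; i++)
            a[i] = c[d[i]];

        for (int i = 0; i < N; i++)
            printf("a[%d] = %g\n", i, a[i]);
        return 0;
    }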