So, an important aspect of all this vector work is: how do you compile for it? Thankfully, we actually have compilers that can do automatic vectorization. One of the challenges here, if you look at this element-wise multiply, is that you have one loop running and another loop running, and your compiler needs to figure out that it can merge those loops and run them at the same time.

Compilers have actually gotten pretty sophisticated. If you look at the Cray compiler now, it can do outer-loop parallelism, it can handle certain kinds of loop-carried dependencies, and it can vectorize all of this. But it requires some pretty deep compiler analysis. This works especially well for things like Fortran codes, where you don't have random pointers pointing in different places; C codes get a little bit harder.

So, what if you don't want to execute the same code on all the elements of your vector? Well, that could be a problem. Here we have a piece of code, this is C code, which loops over some big vector. It checks whether the value is greater than zero, and only if it's greater than zero does it do the next operation. To handle this, there have been extensions to vector processors that effectively allow predicates, or masked operations, on a per-element basis of the vector. The way you would do this is: you load the entire vector, set a mask register that holds a one or a zero as the result of this comparison on an element-by-element basis, and then do the operation. You can put this together with these bit-by-bit comparisons and get slightly different control flow for the different elements within a vector.

Just to show the implementation of this: if we look at how to actually implement masking, one way to do it is to perform every operation anyway. Say you're doing a multiply and your vector length is 64. You do all 64, but you simply disable the write to the register file for the elements whose mask bit is turned off. Or you could have a much fancier implementation which removes the work that doesn't have to be done, but the control for that is quite a bit harder. I would say the simple implementation is probably the more common one.
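To make the masking idea concrete, here is a small C sketch of that simple implementation: compute a 0/1 mask from the comparison, do the multiply on every element anyway, and let the mask decide which results are actually written back. The array names a and b, the multiply as the guarded operation, and the fixed length of 64 are assumptions for illustration, not the lecture's exact code.

    #include <stddef.h>

    #define VLEN 64   /* assumed vector length, matching the 64-element example */

    /* Original conditional loop: only update a[i] when a[i] is positive. */
    void conditional_scalar(float a[VLEN], const float b[VLEN])
    {
        for (size_t i = 0; i < VLEN; i++) {
            if (a[i] > 0.0f)
                a[i] = a[i] * b[i];
        }
    }

    /* The same computation arranged the way a masked vector machine runs it:
     * (1) compare every element and record a 0/1 mask bit,
     * (2) do the multiply on every element regardless,
     * (3) let the mask decide whether each result is written back.
     * Step (3) models "disable the write to the register file when the
     * mask bit is off". */
    void conditional_masked(float a[VLEN], const float b[VLEN])
    {
        unsigned char mask[VLEN];
        float result[VLEN];

        for (size_t i = 0; i < VLEN; i++)   /* set the mask register */
            mask[i] = (a[i] > 0.0f);

        for (size_t i = 0; i < VLEN; i++)   /* do the operation on all elements */
            result[i] = a[i] * b[i];

        for (size_t i = 0; i < VLEN; i++)   /* masked write-back */
            if (mask[i])
                a[i] = result[i];
    }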
Taking the work out is harder largely because, if you have the resources anyway, say if you have multiple lanes, it might just make sense to go ahead and execute what amounts to a null operation.

Some other things that are pretty common with vectors are reductions. What I mean by a reduction is: say you have an array and you want to add all of its elements into a single variable. That's a vector-to-scalar operation, and you can't really do it with what we've discussed so far; there's no vector operation that operates across all of these values and combines them into something useful. But what you can do is use software tricks. One trick is to take the whole vector and instead treat it as two vectors: cut it in half, overlap the two halves, and do parallel adds. Then you take the result, cut it in half again, overlap the two parts, and add again. So you can do lots of parallel adds and effectively build a reduction operation by building a tree of adds. If we have our vector here, we cut it in half and add this part to that part, and the result is half the size. We cut that in half and add again, and the result is half the size again. We keep cutting and adding, and in that way we can use our ordinary vector arithmetic to effectively do a reduction.

We're about out of time here, so let's talk briefly about scatter-gather. The idea isn't that deep, although the implementation can be very hard. The access pattern is something like A[D[i]]: we want to index based on an index that is itself stored in a vector. That's called a gather. Scatter is the other direction, where you do a store through an index of an index, a sort of double indirection. In the instruction set in your book, there's actually an instruction to do this, LVI. What it basically does is take each element of vector D, use it to index into vector C, and that gives you the result. The problem with this, of course, is that your memory accesses are not going to be nicely laid out; you're going to be jumping around in memory.

Let's stop here for today, and we'll talk a little bit more about vectors and GPUs next time.
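As a footnote to the reduction and gather discussion above, here is a small C sketch of both ideas. The power-of-two length of 64, the in-place folding of the input array, and the array names are assumptions for illustration; the point is that the inner loop of the reduction is a plain element-wise add, which is exactly what the halving trick relies on.

    #include <stdio.h>

    #define VLEN 64   /* assumed power-of-two vector length */

    /* Tree reduction in software: repeatedly fold the upper half of the
     * vector onto the lower half with element-wise adds.  Each pass halves
     * the live length, so summing 64 elements takes log2(64) = 6 passes,
     * each of which is an ordinary vectorizable add.  Note this overwrites
     * the contents of v. */
    float tree_reduce_sum(float v[VLEN])
    {
        for (int half = VLEN / 2; half >= 1; half /= 2)
            for (int i = 0; i < half; i++)
                v[i] = v[i] + v[i + half];
        return v[0];
    }

    /* Gather written out in scalar C: each element of the index vector d
     * selects an element of c, i.e. dst[i] = c[d[i]].  An indexed vector
     * load (LVI in the book's instruction set) performs this pattern;
     * scatter is the store version of the same pattern. */
    void gather(float dst[VLEN], const float c[], const int d[VLEN])
    {
        for (int i = 0; i < VLEN; i++)
            dst[i] = c[d[i]];
    }

    int main(void)
    {
        float v[VLEN];
        for (int i = 0; i < VLEN; i++)
            v[i] = 1.0f;                     /* 64 ones should sum to 64 */
        printf("sum = %f\n", tree_reduce_sum(v));
        return 0;
    }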