1
00:00:00,000 --> 00:00:04,311
Okay. 

2
00:00:04,311 --> 00:00:10,740
So now we're going to move off of vectors 
and talk about sort of a near cousin of 

3
00:00:10,740 --> 00:00:14,111
vectors,
or how you can have vector

4
00:00:14,111 --> 00:00:21,675
computing in your desktop today.
So actually, a lot of this was

5
00:00:21,675 --> 00:00:29,220
done by Ruby Lee here at
Princeton. She added a lot of multimedia

6
00:00:29,220 --> 00:00:36,780
extensions to the HP PA-RISC architecture.
There are a couple of other people involved

7
00:00:36,780 --> 00:00:43,022
in this, but she was actually pretty
influential in getting this done.

8
00:00:43,022 --> 00:00:49,421
The idea here is that if you have a
wide register, so if you're doing, let's

9
00:00:49,421 --> 00:00:55,067
say, 64-bit additions,
and you don't want to have to do 64-bit

10
00:00:55,067 --> 00:01:00,413
additions, or don't actually have 64-bit
data lying around, you could cut it in

11
00:01:00,413 --> 00:01:03,477
half and do two 32-bit operations at the
same time, 

12
00:01:03,477 --> 00:01:07,980
or you can use that same ALU and try and 
do four 16-bit operations,

13
00:01:07,980 --> 00:01:13,215
or eight 8-bit operations. 
So, this is called SIMD, or Single

14
00:01:13,215 --> 00:01:19,846
Instruction, Multiple Data,
or short SIMD instructions here, because

15
00:01:19,846 --> 00:01:24,034
typically the vector length is
pretty short, 

16
00:01:24,034 --> 00:01:30,055
or multimedia extensions. 
And you have an instruction which says, I

17
00:01:30,055 --> 00:01:34,680
want to do two 32-bit adds, we'll say, at
the same time. 

18
00:01:36,400 --> 00:01:42,555
This was popularized, in x86 at least,
by MMX, which was the first

19
00:01:42,555 --> 00:01:48,182
implementation of this. 
And it's sort of gone on from there

20
00:01:48,182 --> 00:01:51,348
to SSE, SSE2, SSE3, SSE4, and now Intel
AVX. 

21
00:01:51,348 --> 00:01:58,500
And the differences between MMX and all
the different SSE versions largely have to do

22
00:01:58,500 --> 00:02:03,266
with the length of the register and how 
many instructions they had. 

23
00:02:03,266 --> 00:02:08,245
So in AVX we've gone to 256-bit
registers, wider registers, and it's 

24
00:02:08,245 --> 00:02:11,660
extensible to, I think, 1,000 bits, or
1024 bits. 

25
00:02:13,180 --> 00:02:19,311
One thing I do want to point out about 
this which is interesting is this 

26
00:02:19,311 --> 00:02:25,282
requires changes to your data path. 
If you have an adder, and you have a 64

27
00:02:25,282 --> 00:02:31,575
bit add, and now you want to do eight
8-bit adds, you need to cut the carry

28
00:02:31,575 --> 00:02:38,993
chain in seven places. 
Now, that's if you have a basic adder. 

29
00:02:38,993 --> 00:02:44,900
I guess it gets a little more complicated 
if you have something like a 

30
00:02:46,200 --> 00:02:50,959
propagate, or a carry-lookahead adder,
or something like that, 

31
00:02:50,959 --> 00:02:56,223
because you may not have a simple place 
to go snip the carry chains.

32
00:02:56,223 --> 00:03:00,771
There is still some place to cut it, 
but in your original design, you

33
00:03:00,771 --> 00:03:04,140
might have propagated across, 
where now you need to cut at the boundary.

34
00:03:04,140 --> 00:03:06,493
So this is definitely a
challenge. 

35
00:03:06,493 --> 00:03:10,611
Also, for things like multiplies, if you 
want to do eight 8-bit multiplies,

36
00:03:10,611 --> 00:03:13,820
the structure looks a little
bit different there. 

37
00:03:13,820 --> 00:03:17,670
But the big insight
here is, you had that logic anyway.

38
00:03:17,670 --> 00:03:22,817
You're just effectively adding muxes on 
the carry chains in the data

39
00:03:22,817 --> 00:03:26,296
path. 
And some operations you don't even need 

40
00:03:26,296 --> 00:03:29,620
to add. 
Obviously if you're operating on 

41
00:03:29,620 --> 00:03:34,990
something like eight 8-bit values,
and you want to do the logical OR of them,

42
00:03:34,990 --> 00:03:37,720
you don't need to add a special
instruction for that. 
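The carry-chain cut described above can be imitated in software. This is only a rough sketch, assuming eight 8-bit lanes packed into one 64-bit word; the names `packed_add8`, `pack8`, and `unpack8` are made up for illustration and do not correspond to any real instruction. Masking off the top bit of each lane plays the role of the seven cuts in the carry chain, while the logical OR works lane by lane with no special handling.

```python
MASK64 = 0xFFFF_FFFF_FFFF_FFFF
HIGH_BITS = 0x8080_8080_8080_8080  # top bit of each 8-bit lane
LOW_BITS = 0x7F7F_7F7F_7F7F_7F7F   # low seven bits of each 8-bit lane


def pack8(lanes):
    """Pack eight 8-bit values into one 64-bit word (lane 0 is lowest)."""
    word = 0
    for i, lane in enumerate(lanes):
        word |= (lane & 0xFF) << (8 * i)
    return word


def unpack8(word):
    """Split a 64-bit word back into its eight 8-bit lanes."""
    return [(word >> (8 * i)) & 0xFF for i in range(8)]


def packed_add8(a, b):
    """Add eight 8-bit lanes at once, wrapping around inside each lane."""
    # Add only the low seven bits of each lane, so a carry can reach
    # bit 7 of its own lane but can never spill into the next lane.
    low_sum = (a & LOW_BITS) + (b & LOW_BITS)
    # Fold in the top bits with XOR: a carry-free add that discards
    # each lane's carry-out, which is the software version of the cut.
    return (low_sum ^ ((a ^ b) & HIGH_BITS)) & MASK64


if __name__ == "__main__":
    x = pack8([255, 1, 128, 127, 0, 200, 16, 99])
    y = pack8([1, 2, 128, 1, 0, 100, 240, 1])
    print(unpack8(packed_add8(x, y)))  # per-lane wrapped sums
    print(unpack8(x | y))              # plain OR is already lane-wise
```

A plain 64-bit add of the same two words would let the carry out of lane 0 leak into lane 1, which is exactly what the cut prevents.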

43
00:03:41,000 --> 00:03:46,684
From an implementation perspective, this
is what I was trying to get at here. You 

44
00:03:46,684 --> 00:03:51,953
have independent adds going on,
and they all happen in parallel. So,

45
00:03:51,953 --> 00:03:57,846
why do we like multimedia extensions, or 
these vector instructions or short vector 

46
00:03:57,846 --> 00:04:01,451
instructions? 
And let's compare them to our big vector 

47
00:04:01,451 --> 00:04:04,848
machines. 
So, one of the major differences is that 

48
00:04:04,848 --> 00:04:10,711
you can't control the vector length. 
The vector length is just the length

49
00:04:10,711 --> 00:04:15,610
of the native data word, or the
length of the instruction set,

50
00:04:15,610 --> 00:04:21,338
or rather, the length of the
native data type for your instruction 

51
00:04:21,338 --> 00:04:24,040
set. 
And, 

52
00:04:24,040 --> 00:04:27,593
strided, scatter-gather, these other 
operations are hard to do, 

53
00:04:27,593 --> 00:04:30,797
because typically you just have a single 
load and store.

54
00:04:30,797 --> 00:04:34,176
And you use the processor's load and 
store instructions.

55
00:04:34,176 --> 00:04:38,487
Because the processor doesn't care. 
It's just the same way that unary

56
00:04:38,487 --> 00:04:43,147
operations or logical operations don't 
need special instructions to do short 

57
00:04:43,147 --> 00:04:46,293
vector, or single instruction multiple 
data operations. 

58
00:04:46,293 --> 00:04:51,012
You don't need special instructions for 
SIMD data to be able to do loads and

59
00:04:51,012 --> 00:04:53,020
stores. 
You just load the data. 

60
00:04:53,020 --> 00:04:57,937
And store the data. 
This is actually starting to change a

61
00:04:57,937 --> 00:05:02,199
little bit. 
Some of the new versions of SSE actually 

62
00:05:02,199 --> 00:05:06,420
do have some scatter-gather
modifications. 

63
00:05:06,420 --> 00:05:13,800
It's a little bit harder if you
think about it because you can't hold a 

64
00:05:13,800 --> 00:05:20,200
full address, if you will, in a vector.
So it's not like you can actually do sort 

65
00:05:20,200 --> 00:05:24,160
of indexed addressing,
indexed addresses, because you can't

66
00:05:24,160 --> 00:05:26,740
necessarily hold the full address in 
there. 

67
00:05:26,740 --> 00:05:31,780
But, in essence, they've sort of come up 
with some way to do scatter and gather

68
00:05:31,780 --> 00:05:38,259
operations. 
A couple of things about having the vector

69
00:05:38,259 --> 00:05:44,197
register length limited: one is that
you can't do as much work in one 

70
00:05:44,197 --> 00:05:48,043
operation. 
So, you can't necessarily do 64

71
00:05:48,043 --> 00:05:53,981
operations in one instruction, like we 
did with our vector length of 64. 

72
00:05:53,981 --> 00:05:57,577
So that just is a
problem. 
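To put rough numbers on this, here is a back-of-the-envelope sketch. It assumes one instruction per register-wide operation and ignores loads, stores, and loop overhead; the function name `simd_ops_needed` is hypothetical.

```python
def simd_ops_needed(n_elements, register_bits, element_bits):
    """Count register-wide operations needed to cover n_elements."""
    lanes = register_bits // element_bits  # elements handled per instruction
    return -(-n_elements // lanes)         # ceiling division


if __name__ == "__main__":
    # 64 elements of 32 bits each, for two assumed register widths.
    for width in (128, 256):
        print(width, simd_ops_needed(64, width, 32))
```

Under these assumptions a 128-bit register needs 16 instructions and a 256-bit register needs 8 to cover the 64 elements that a vector machine with a vector length of 64 handles in a single instruction.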

73
00:05:57,577 --> 00:06:03,598
And unfortunately, what happens here
is you end up having to do more 

74
00:06:03,598 --> 00:06:10,757
operations and issue more instructions. 
And you're effectively increasing the 

75
00:06:10,757 --> 00:06:16,394
bandwidth out of your fetch unit.
So it's not as

76
00:06:16,394 --> 00:06:19,796
good. 
And finally, I just wanted to say

77
00:06:19,796 --> 00:06:25,044
that processors are starting to move, 
that these multimedia extensions are 

78
00:06:25,044 --> 00:06:30,790
starting to move a little bit towards 
vector processors as they add richer

79
00:06:30,790 --> 00:06:34,620
instruction sets. 
So, as we get to SSE4, for instance, or

80
00:06:34,620 --> 00:06:40,081
SSE4.2, there are more instructions in
x86 that can do fancier

81
00:06:40,081 --> 00:06:43,486
things. 
And the vector length is even

82
00:06:43,486 --> 00:06:47,600
getting longer, up to 124 bits. 
Or, excuse me, 1024 bits.