Okay. So, all here. So, let's get started. We're continuing our ELE 475 experience, and we're going to pick up where we left off last time, talking about vectors and vector machines.

Just to recap, because we went through this really fast at the end of lecture last time: when you have a vector computer, the easy thing to do is to add two vectors of numbers. But what if you want to do work inside of a vector? Say you want to take a vector and sum all of the elements in the vector. We call this a reduction, a vector reduction. If you're trying to do this with a vector machine, you'd need some special instruction which looks at all the different elements, and that's probably a bad thing to build, because you would lose all the advantages of the lane structure you partitioned the elements across: a reduction would require, let's say, one ALU to consume elements from all of the different lanes, and that would be sad. So if you want to do a reduction, one of the ways to go about it is to still use vectors, but to use them, sort of, temporally.
You can use, if you will, a binary tree algorithm here. You start off with a big long vector whose subparts you want to sum. The first step is to just cut it in half: you take this half of the vector and that half of the vector and add them, and you end up with the partial sums, in a vector half the length. And again, you add this half with that half, and you can use vector instructions to do that, and you get something half the length. Continue, and at some point you end up with a scalar, which is the sum. So, this technique is pretty widely used to do vector reductions.

At the end of last class's lecture, we also briefly touched on more interesting addressing modes. The vector addressing modes and vector loads and stores we've been talking about up to this point bank very well: you could assign, let's say, different regions of memory to, sort of, different lanes, and a load would always just read out from the bank that was attached to a particular lane. Well, that works well for very well-structured memory accesses. But all of a sudden, let's say you want to do an operation where you have C[D[i]].
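The halving procedure just described can be sketched in a few lines. This is an illustrative model, not the lecture's actual vector code, and the power-of-two length restriction is an assumption to keep the sketch simple:

```python
def vector_reduce_sum(v):
    """Sum the elements of v by repeatedly adding its top half to its
    bottom half; each pass corresponds to one vector add whose vector
    length is half the previous one."""
    n = len(v)
    assert n > 0 and n & (n - 1) == 0, "sketch assumes a power-of-two length"
    v = list(v)
    while n > 1:
        half = n // 2
        # One "vector add": in hardware this loop runs lane-parallel.
        for i in range(half):
            v[i] += v[i + half]
        n = half
    return v[0]  # the final scalar sum
```

So a 64-element vector takes six vector adds of lengths 32, 16, 8, 4, 2, and 1, rather than 63 scalar adds feeding through one ALU.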
So, you have a vector, D, and you want to index into that vector. So, it's a vector of addresses, or rather, a vector of indexes. And then you want to take each index and use it to index into C. This is something you commonly want to do, but you need special support for it, and a basic vector architecture may not have it. But you can add it. The vector MIPS architecture developed in the Hennessy and Patterson book has an instruction for this, called LVI, load vector indirect, where you actually have two vector registers, one indexes into the other, and then you have a destination vector register. We call this a gather. But because you don't know the addressing a priori, if you will, your memory system might get big and complex: you need to have all the lanes in your vector processor be able to talk to all of the memory. And that's probably a good thing to do anyway, to make your machine a little more flexible and to allow, sort of, vectors that don't have to align to a particular address. But you have to make your memory system much more complicated to be able to do these sorts of gather operations. And the scatter operation is the inverse of this.
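As a sketch of the semantics (the function names and the flat `memory` list are my illustration, not the actual ISA), the gather just described, and its scatter inverse, behave like this:

```python
def gather(memory, base, index_vec):
    """Gather (LVI-style): result[i] = memory[base + index_vec[i]].
    An index vector register selects arbitrary memory locations."""
    return [memory[base + idx] for idx in index_vec]

def scatter(memory, base, index_vec, src_vec):
    """Scatter (SVI-style): memory[base + index_vec[i]] = src_vec[i]."""
    for idx, val in zip(index_vec, src_vec):
        memory[base + idx] = val
```

So reading C[D[i]] is a gather with D as the index vector, and writing C[D[i]] is the scatter. Because D can hold anything, every lane may touch any memory bank, which is exactly why the memory system gets complicated.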
It would be SVI, a store vector indirect, which does the store with an indirect index, for when the C[D[i]] is on the left-hand side of an assignment operation.

Okay. So now we get to talk about a couple of examples. Well, we'll touch on one example of a vector machine right now. And this is what I was trying to say when I was coming in: if you're going to build a really fast computer, and it could cost millions of dollars, you're going to make it look cool. So, the picture on the right here is the Cray-1. I've had the pleasure of seeing a couple of these, and of sitting on a couple of these: it has a nice little seat built into it. You can actually sit down on it, and it's warm, because this is a water-cooled machine. They later went to something called Fluorinert to cool these machines; the Cray-1 was never Fluorinert-cooled, but the Cray-2 I think was, and the Cray-3 definitely was. But the idea is that you use water, and you can have a nice place to sit, so the operator has a nice place to sit down while he or she is working on the machine.
And it's heated because these machines run quite hot, and part of the power supplies are actually under the bench here. The other fun thing about these is you'll notice they're shaped like the letter C, for Cray. No one really knows if that's true; I think Seymour Cray claimed it was to somehow make the distance across the backplane shorter. But it is shaped like a C, and Seymour Cray, who's the founder of Cray, does have a C as the first letter of his name.

But for a little bit more perspective on what's actually inside of here: the Cray-1 did not actually have lots of different lanes. Instead, it was a vector computer that had very long pipelines, or long for the time, and it had a couple of pipelines for different functional units. And it was a vector-register style machine. Some of the interesting things about it: it didn't have any caches, and it didn't have virtual memory or any of that other stuff, because this is really, sort of, a supercomputer; you're using this to solve some big problem. So you didn't need all this fancy multi-tasking and virtualization.
You ran one really big problem on it; you were trying to, I don't know, somehow model nuclear weapons, or use it to crack codes, or something like that.

Here's the micro-architecture of the Cray-1. What we see is that it has eight vector registers with 64 elements each. So their vector length is 64; their maximum vector length is 64. They also have a bunch of scalar registers, and they have a separate address register bank, and you can only do loads and stores based on these address registers. What I was trying to get at here is you can see that they basically had only one pipe for each of the different operations, but these pipes were relatively long. So, to give you an idea, something like the multiply took six cycles, which today sounds unremarkable: things are pipelined pretty deep, we have lots of transistors. But, you know, it's 1976; there weren't that many transistors, and this thing was physically large, so building a pipeline that long took space. Another example here: I think the reciprocal took about fourteen cycles, and that was pipelined. And this machine did not have interlocking between the different pipe stages.
And it didn't have to have bypassing within a pipeline, because the vector length was so long that you didn't need to bypass from one place in the pipe to some other place in the pipe. They did have chaining, so there was inter-pipeline bypassing, but intra-pipeline bypassing wasn't really there.

A couple of other things: this machine ran really pretty fast for its day. 80 megahertz was, I'm sure, the fastest clock tick of the day. Today, that sounds pretty slow, but that was pretty good for 1976.
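A hypothetical back-of-the-envelope timing model (my illustration; the latencies below are assumptions, not figures from the lecture) shows why the chaining mentioned above matters: each vector instruction has a functional-unit startup latency and then delivers one element per cycle.

```python
def time_unchained(latencies, vl):
    """Without chaining: each dependent vector instruction waits for the
    previous one to drain all vl elements before it can start."""
    return sum(lat + vl for lat in latencies)

def time_chained(latencies, vl):
    """With chaining: a dependent instruction starts consuming elements
    as soon as the first one is produced, so only the startup latencies
    add up, plus a single drain of vl elements."""
    return sum(latencies) + vl
```

For example, with a vector length of 64 and a 6-cycle multiply feeding a hypothetical 4-cycle add, the unchained time would be (6 + 64) + (4 + 64) = 138 cycles, while the chained time would be 6 + 4 + 64 = 74 cycles.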