Okay, all here? So let's get started. We're continuing our ELE 475 experience, picking up where we left off last time talking about vectors and vector machines. Just to recap, because we went through this really fast at the end of lecture last time: when you have a vector computer, the easy thing to do is to add two vectors of numbers elementwise. But what if you want to do work inside of a vector? Say you want to take a vector and sum all of the elements in it. We call this a reduction, a vector reduction. If you try to do this on a vector machine with some special instruction that looks at all the different elements, that's probably a bad thing to do, because you would lose the advantages of having a lane structure: a reduction would need, say, one ALU to consume elements from all the different lanes, and that would be sad. So one way to do a reduction is to still use vectors, but use them sort of temporally, with a binary-tree algorithm. You start off with a big long vector whose sub-parts you want to sum. The first step is to cut it in half: you take this half of the vector and that half of the vector and add them, and you end up with a vector of partial sums which is half the length. Then you add this half with that half again — and you can use vector instructions to do that — and get something half that length. Continue, and at some point you end up with a scalar, which is the sum. This is pretty widely used to do vector reductions.
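To make the halving idea concrete, here is a small sketch in plain Python — the function name and the power-of-two length assumption are mine, not from the lecture, and the list comprehension stands in for one full-width vector add per step:

```python
def vector_reduce_sum(v):
    """Binary-tree reduction: each step does one full-width 'vector add'
    of the upper half onto the lower half, so summing n elements takes
    about log2(n) vector instructions instead of n scalar adds.
    (Sketch only; assumes the length is a power of two.)"""
    v = list(v)
    while len(v) > 1:
        half = len(v) // 2
        # One vector add: lower half += upper half, elementwise.
        v = [v[i] + v[i + half] for i in range(half)]
    return v[0]

print(vector_reduce_sum([1, 2, 3, 4, 5, 6, 7, 8]))  # → 36
```

Each pass through the loop corresponds to one vector instruction on half-length operands, so a 64-element sum takes six vector adds rather than 63 scalar ones.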
At the end of last class's lecture, we also briefly touched on more interesting addressing modes for the vector loads and stores we've been talking about. Up to this point, you could bank memory very well: you could assign, say, different regions of memory to different lanes, and a load would always just read out of the bank attached to a particular lane. That works well for very well-structured memory accesses. But all of a sudden, let's say you want to do an operation like C[D[i]]. So you have a vector D — a vector of indices — and you want to take each index and use it to index into C. This is something you commonly want to do, but you need special support for it, and a basic vector architecture may not have it. You can add it, though. The vector MIPS architecture developed in the Hennessy and Patterson book has an instruction called LVI, load vector indexed, where a vector register of indices determines the addresses, and the loaded elements go into a destination vector register. We call this a gather. But because you don't know the addressing a priori, your memory system might get big and complex: you need all the lanes in your vector processor to be able to talk to all of the memory. That's probably a good thing to do anyway, to make your machine a little more flexible and to allow vectors that don't have to align to a particular address, but you do have to make your memory system much more complicated to support these gather operations. And the scatter operation is the inverse of this.
It would be SVI, store vector indexed, which does the indexed store — this is what you'd use when the indexed access is on the left-hand side of an assignment. Okay, so now we get to talk about an example of a vector machine. This is what I was trying to say when I was coming in: if you're going to build a really fast computer, and it costs millions of dollars, it's going to look cool. The picture on the right here is the Cray-1. I've had the pleasure of seeing a couple of these and sitting on a couple of these — it has a nice little seat built into it. You can actually sit down on it, and it's warm, because this is a liquid-cooled machine and part of the power supplies are actually under the bench. They later went to something called Fluorinert to cool these machines; the Cray-1 was never Fluorinert-cooled, but the Cray-2 I think was, and the Cray-3 definitely was. The idea is that the operator has a nice, heated place to sit while he or she is working on the machine, since these machines run quite hot. The other fun thing about these is you'll notice they're shaped like the letter C, for Cray. No one really knows if that's true — I think Seymour Cray claimed it was to somehow make the back-plane wiring distances shorter. But it is shaped like a C, and Seymour Cray, the founder of Cray, does have a C as the first letter of his name. For a little more perspective on what's actually inside of here: the Cray-1 did not actually have lots of different lanes.
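To pin down what gather (LVI) and scatter (SVI) actually compute, here is a minimal sketch in plain Python — the function names are mine, and the list `d` plays the role of the vector register of indices:

```python
def gather(c, d):
    """Gather (LVI-style): result[i] = c[d[i]].
    The index vector d supplies a memory location per element."""
    return [c[i] for i in d]

def scatter(c, d, vals):
    """Scatter (SVI-style): c[d[i]] = vals[i] — the store-side inverse,
    used when the indexed access is on the left-hand side."""
    for i, x in zip(d, vals):
        c[i] = x

c = [10, 20, 30, 40]
d = [3, 0, 2]
print(gather(c, d))       # → [40, 10, 30]
scatter(c, d, [7, 8, 9])
print(c)                  # → [8, 20, 9, 7]
```

Notice that the addresses depend on the data in `d`, which is exactly why the hardware can't pre-assign each access to a fixed lane-local memory bank.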
Instead, it was a vector computer with very long pipelines — long for the time — with a separate pipeline for each of the different functional units. And it was a vector-register-to-register style machine. Some of the interesting things about it: it didn't have any caches, and it didn't have virtual memory or any of that other stuff, because this is really a supercomputer — you're using it to solve some big problem. You didn't need all that fancy multi-tasking and virtualization. You ran one really big problem on it; you were trying to, I don't know, somehow model nuclear weapons, or use it to crack codes, or something like that. Here's the micro-architecture of the Cray-1. What we see is that it has eight vector registers with 64 elements each — the vector length, the maximum vector length, is 64. It also has a bunch of scalar registers, and a separate bank of address registers, and you can only do loads and stores based on these address registers. What I was trying to get at here is that it basically had only one pipe for each of the different operations, but those pipes were relatively long. To give you an idea, the multiply took six cycles, which today sounds like, well, things are pipelined pretty deep, we have lots of transistors. But it's 1976 — there weren't that many transistors, and this thing was physically large, so building a pipeline that long took space. Another example: I think the reciprocal took about fourteen cycles, and that was pipelined. And this machine did not have interlocking between the different pipe stages, and it didn't have to have bypassing within a pipeline, because the vector length was so long — you didn't have to bypass from some place in the pipe to some place else in the pipe.
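A rough back-of-the-envelope model shows why these deep pipelines work out fine with 64-element vectors: the pipeline fill time is amortized over the whole vector. The formula and numbers below are my own illustration, using the six-cycle multiply mentioned above:

```python
def vector_op_cycles(pipe_depth, vlen):
    """A fully pipelined functional unit accepts one element per cycle,
    so a vector op costs the pipeline fill time plus one cycle for each
    remaining element."""
    return pipe_depth + (vlen - 1)

# Cray-1-ish numbers: a 6-stage multiply over a 64-element vector register.
cycles = vector_op_cycles(6, 64)
print(cycles)       # → 69
print(64 / cycles)  # ~0.93 results per cycle despite the deep pipe
```

With a 64-element vector, even a fourteen-cycle reciprocal pipeline still delivers close to one result per cycle in steady state, which is why the long pipes don't need element-by-element bypassing.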
It did have chaining, so there was inter-pipeline bypassing, but intra-pipeline bypassing wasn't really there. A couple of other things: this machine ran really pretty fast for the day. 80 megahertz was, I'm sure, the fastest clock rate of the day. Today that sounds pretty slow, but it was pretty good for 1976.