Okay. So, all here. So, let's get started. We're continuing our ELE 475 experience, and we're going to pick up where we left off last time, talking about vectors and vector machines.

Just to recap, because we went through this really fast at the end of lecture last time: when you have a vector computer, the easy thing to do is to add two vectors of numbers. But what if you want to do work inside of a vector? Say you want to take a vector and sum all of the elements in the vector. We call this a reduction, a vector reduction. If you're trying to do this with a vector machine, you'd need some special instruction which looks at all the different elements, and that's probably a bad thing to build, because you would lose all the advantages of the lane structure you partitioned the elements across: a reduction would require, let's say, one ALU to consume elements from all of the different lanes, and that would be sad. So if you want to do a reduction, one of the ways to go about it is to still use vectors, but to use them, sort of, temporally.
You can use, if you will, a binary tree algorithm here. You start off with a big long vector whose subparts you want to sum. The first step is to just cut it in half: you take this half of the vector and that half of the vector and add them, and you end up with the partial sums, in a vector half the length. And again, you add this half with that half, and you can use vector instructions to do that, and you get something half the length. Continue, and at some point you end up with a scalar, which is the sum. So, this technique is pretty widely used to do vector reductions.

At the end of last class's lecture, we also briefly touched on more interesting addressing modes. The vector addressing modes and vector loads and stores we've been talking about up to this point bank very well: you could assign, let's say, different regions of memory to, sort of, different lanes, and a load would always just read out from the bank that was attached to a particular lane. Well, that works well for very well-structured memory accesses. But all of a sudden, let's say you want to do an operation where you have C[D[i]].
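The halving procedure just described can be sketched in a few lines. This is an illustrative model, not the lecture's actual vector code, and the power-of-two length restriction is an assumption to keep the sketch simple:

```python
def vector_reduce_sum(v):
    """Sum the elements of v by repeatedly adding its top half to its
    bottom half; each pass corresponds to one vector add whose vector
    length is half the previous one."""
    n = len(v)
    assert n > 0 and n & (n - 1) == 0, "sketch assumes a power-of-two length"
    v = list(v)
    while n > 1:
        half = n // 2
        # One "vector add": in hardware this loop runs lane-parallel.
        for i in range(half):
            v[i] += v[i + half]
        n = half
    return v[0]  # the final scalar sum
```

So a 64-element vector takes six vector adds of lengths 32, 16, 8, 4, 2, and 1, rather than 63 scalar adds feeding through one ALU.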
So, you have a vector, D, and you want to index into that vector. So, it's a vector of addresses, or rather, a vector of indexes. And then you want to take each index and use it to index into C. This is something you commonly want to do, but you need special support for it, and a basic vector architecture may not have it. But you can add it. The vector MIPS architecture developed in the Hennessy and Patterson book has an instruction for this, called LVI, load vector indirect, where you actually have two vector registers, one indexes into the other, and then you have a destination vector register. We call this a gather. But because you don't know the addressing a priori, if you will, your memory system might get big and complex: you need to have all the lanes in your vector processor be able to talk to all of the memory. And that's probably a good thing to do anyway, to make your machine a little more flexible and to allow, sort of, vectors that don't have to align to a particular address. But you have to make your memory system much more complicated to be able to do these sorts of gather operations. And the scatter operation is the inverse of this.
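As a sketch of the semantics (the function names and the flat `memory` list are my illustration, not the actual ISA), the gather just described, and its scatter inverse, behave like this:

```python
def gather(memory, base, index_vec):
    """Gather (LVI-style): result[i] = memory[base + index_vec[i]].
    An index vector register selects arbitrary memory locations."""
    return [memory[base + idx] for idx in index_vec]

def scatter(memory, base, index_vec, src_vec):
    """Scatter (SVI-style): memory[base + index_vec[i]] = src_vec[i]."""
    for idx, val in zip(index_vec, src_vec):
        memory[base + idx] = val
```

So reading C[D[i]] is a gather with D as the index vector, and writing C[D[i]] is the scatter. Because D can hold anything, every lane may touch any memory bank, which is exactly why the memory system gets complicated.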
It would be SVI, a store vector indirect, which does the store with an indirect index, for when the C[D[i]] is on the left-hand side of an assignment operation.

Okay. So now we get to talk about a couple of examples. Well, we'll touch on one example of a vector machine right now. And this is what I was trying to say when I was coming in: if you're going to build a really fast computer, and it could cost millions of dollars, you're going to make it look cool. So, the picture on the right here is the Cray-1. I've had the pleasure of seeing a couple of these, and of sitting on a couple of these: it has a nice little seat built into it. You can actually sit down on it, and it's warm, because this is a water-cooled machine. They later went to something called Fluorinert to cool these machines; the Cray-1 was never Fluorinert-cooled, but the Cray-2 I think was, and the Cray-3 definitely was. But the idea is that you use water, and you can have a nice place to sit, so the operator has a nice place to sit down while he or she is working on the machine.
And it's heated because these machines run quite hot, and part of the power supplies are actually under the bench here. The other fun thing about these is you'll notice they're shaped like the letter C, for Cray. No one really knows if that's true; I think Seymour Cray claimed it was to somehow make the distance across the backplane shorter. But it is shaped like a C, and Seymour Cray, who's the founder of Cray, does have a C as the first letter of his name.

But for a little bit more perspective on what's actually inside of here: the Cray-1 did not actually have lots of different lanes. Instead, it was a vector computer that had very long pipelines, or long for the time, and it had a couple of pipelines for different functional units. And it was a vector-register style machine. Some of the interesting things about it: it didn't have any caches, and it didn't have virtual memory or any of that other stuff, because this is really, sort of, a supercomputer; you're using this to solve some big problem. So you didn't need all this fancy multi-tasking and virtualization.
You ran one really big problem on it; you were trying to, I don't know, somehow model nuclear weapons, or use it to crack codes, or something like that.

Here's the micro-architecture of the Cray-1. What we see is that it has eight vector registers with 64 elements each. So their vector length is 64; their maximum vector length is 64. They also have a bunch of scalar registers, and they have a separate address register bank, and you can only do loads and stores based on these address registers. What I was trying to get at here is you can see that they basically had only one pipe for each of the different operations, but these pipes were relatively long. So, to give you an idea, something like the multiply took six cycles, which today sounds unremarkable: things are pipelined pretty deep, we have lots of transistors. But, you know, it's 1976; there weren't that many transistors, and this thing was physically large, so building a pipeline that long took space. Another example here: I think the reciprocal took about fourteen cycles, and that was pipelined. And this machine did not have interlocking between the different pipe stages.
And it didn't have to have bypassing within a pipeline, because the vector length was so long that you didn't need to bypass from one place in the pipe to some other place in the pipe. They did have chaining, so there was inter-pipeline bypassing, but intra-pipeline bypassing wasn't really there.

A couple of other things: this machine ran really pretty fast for its day. 80 megahertz was, I'm sure, the fastest clock tick of the day. Today, that sounds pretty slow, but that was pretty good for 1976.
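A hypothetical back-of-the-envelope timing model (my illustration; the latencies below are assumptions, not figures from the lecture) shows why the chaining mentioned above matters: each vector instruction has a functional-unit startup latency and then delivers one element per cycle.

```python
def time_unchained(latencies, vl):
    """Without chaining: each dependent vector instruction waits for the
    previous one to drain all vl elements before it can start."""
    return sum(lat + vl for lat in latencies)

def time_chained(latencies, vl):
    """With chaining: a dependent instruction starts consuming elements
    as soon as the first one is produced, so only the startup latencies
    add up, plus a single drain of vl elements."""
    return sum(latencies) + vl
```

For example, with a vector length of 64 and a 6-cycle multiply feeding a hypothetical 4-cycle add, the unchained time would be (6 + 64) + (4 + 64) = 138 cycles, while the chained time would be 6 + 4 + 64 = 74 cycles.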