Okay. So let's talk about some of the other things that were on our list of limiters to instruction-level parallelism, and ask ourselves: are these things we can solve very easily? Well, things start to get harder pretty fast here. Like I said, dynamic events are basically things the compiler has no way of reacting to. Now, the instruction set might be able to react to them; you can add things into the instruction set, and you might take a branch dependent on a dynamic event. But let's look at some dynamic events. The first negative event here is a cache miss. Well, it's really hard for the compiler, all things being equal, to know whether a load is going to take a cache miss or not. You might have some guesses that could influence the code, but actually doing something about it is another story. If you think about what an out-of-order superscalar does: if you take a cache miss, it'll reschedule the code around the cache miss. It'll take the instructions that are not dependent on the load, pull those up, and try to execute those. But if the load hits in the cache, you want a totally different schedule: you want to start executing the instructions that are dependent on that load as soon as possible.
So an out-of-order superscalar has dynamic means to do this; it has dynamic instruction execution. But our VLIW processor, which is statically scheduled, can't really do this. So what are some techniques to go after this? One technique is something called informing loads. As far as I know, this has not actually been built in any hardware, but it's been proposed, at least in computer architecture academic circles, in the computer architecture literature. The original paper about informing loads was written in part by one of our faculty members here at Princeton, Margaret Martonosi; she's one of the authors of that paper. The basic idea is that if you have a load that misses in the cache, you do not execute the subsequent instructions: if the load misses in the cache, you basically nullify those instructions. And this allows you to change the code sequence depending on whether the load hits in the cache or misses in the cache.
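The informing-loads idea above can be sketched in a few lines. This is a minimal toy model of my own, not a real ISA: the load reports whether it hit, and on a miss the code that would normally follow the load is skipped in favor of a sequence scheduled for the long-latency case. All the names and the tiny cache model are invented for illustration.

```python
class Cache:
    """Tiny cache model: just remembers which addresses are resident."""
    def __init__(self):
        self.resident = set()

    def load(self, memory, addr):
        hit = addr in self.resident
        self.resident.add(addr)          # fill the line on a miss
        return memory[addr], hit

def informing_load(memory, cache, addr):
    # An "informing load" returns the value *and* hit/miss status,
    # letting software pick a different code sequence on a miss.
    value, hit = cache.load(memory, addr)
    if hit:
        # Hit path: the dependent instructions run immediately.
        schedule = "hit-schedule"
    else:
        # Miss path: the dependent instructions are nullified and we
        # fall into code scheduled around the miss latency instead.
        schedule = "miss-schedule"
    return value * 2, schedule           # same result either way

memory = {0x40: 7}
cache = Cache()
r1, s1 = informing_load(memory, cache, 0x40)   # cold cache: miss path
r2, s2 = informing_load(memory, cache, 0x40)   # warm cache: hit path
```

Note that both paths compute the same answer; only the schedule around the load changes, which is the whole point of the technique.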
And this was done with Professor Todd Mowry, I think when he was a graduate student, and Mark Horowitz from Stanford was on the paper; I'm trying to remember who the last author, the professor on this work, was. Another option is something the people behind the Elbrus processor did. You've probably never heard of this processor. It was built, well, I actually don't know if it was ever finished, so perhaps only prototyped, in Soviet Russia, right when the Soviet Union was breaking down. This was the design house in Soviet Russia which made all of the military processors, and after the fall of the Soviet Union that same design team went off to build commercial processors. So they went to build this processor, and it's a VLIW, a very long instruction word processor, and they had an instruction in there which tried to solve this dynamic-event problem around cache misses. What it said is: if the load misses in the cache, go execute a different piece of code, with a different schedule, than if the load had hit in the cache.
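That Elbrus-style "branch on cache miss" can be sketched as follows. This is a hedged illustration, not the actual Elbrus instruction: the compiler emits two schedules for the same region, and hardware selects between them based on the load's hit/miss outcome. Function names and the filler work are invented.

```python
def schedule_for_hit(value):
    # Hit schedule: the dependent instruction sits right after the load.
    return value + 1

def schedule_for_miss(value, independent):
    # Miss schedule: independent work is hoisted above the use of the
    # loaded value, hiding some of the miss latency.
    filler = sum(independent)            # non-dependent work runs first
    return value + 1 + 0 * filler        # same result, different order

def branch_on_miss(load_value, hit, independent):
    # The hardware "branch on miss" picks which compiler-generated
    # schedule to run for this region of code.
    if hit:
        return schedule_for_hit(load_value)
    return schedule_for_miss(load_value, independent)

hit_result = branch_on_miss(41, True, [1, 2, 3])
miss_result = branch_on_miss(41, False, [1, 2, 3])
```

Both schedules produce the same value; the miss schedule just orders the work so the processor has something useful to do while the miss is outstanding.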
So you could effectively have two different code paths here, because the compiler could actually generate two different code sequences, and you can get back a lot of the performance, get back to almost exactly what [inaudible] could do. This processor never really made it commercially, and later the company went under. I actually don't know if the assets were bought by Intel, but at least all the people who worked at this company now effectively work at Intel; they still live in Russia. So that's a funny story there: it didn't work out from a commercial perspective, but they had some cool ideas in there, like actually trying to schedule around cache misses. Okay, so some other things: branch mispredicts. Well, we already talked about one technique here; you can add predication. But that doesn't help if you have big pieces of code, big code sequences and big code hammocks; you can't necessarily predicate your entire program. So what do you do instead? Well, one solution that people have come up with for this, and this is the hard one to deal with here, is you can add branch delay slots. So you have a VLIW processor; let's say it's three wide.
And you add branch delay slots to your instruction set, and you use predication in the branch delay slots. What this allows you to do is effectively mask some of the branch mispredict penalty and change the code schedule a little bit dependent on the prediction, or even change it completely based on which way the branch goes. The way this works is that in the delay slots you use the same predicate that the branch is branching on. So if the branch is branching on, let's say, A equals B, we'll use that same condition in the predication, and effectively you can schedule the code differently depending on whether the branch is taken or not taken. And because you're putting it in the delay slots, you can get around some of the problems of whether it's taking one direction or the other, no matter which way the mispredict happens, or whether the branch is predicted correctly or not. What we're doing is basically putting code in the delay slots that will always execute, but it can be predicated, and you can actually pull code up from the two destinations of the branch. So it's sort of a way to get around branch mispredict [inaudible] here. This was actually done in a research processor that was built at MIT.
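The predicated delay slots can be modeled like this. This is my own toy sketch (not the M-Machine's actual encoding): every delay-slot instruction issues, but each carries the branch predicate or its complement, so work hoisted from either branch target only commits on its own path.

```python
def run_delay_slots(a, b, regs):
    p = (a == b)                       # the condition the branch tests
    # Three delay slots, each a (predicate, destination, operation) triple.
    slots = [
        (p,     "x", lambda r: r["a"] + 1),   # hoisted from the taken target
        (not p, "x", lambda r: r["a"] - 1),   # hoisted from the fall-through
        (True,  "y", lambda r: r["a"] * 2),   # useful on both paths
    ]
    for pred, dst, op in slots:
        result = op(regs)              # the slot always "executes"...
        if pred:
            regs[dst] = result         # ...but only commits if its predicate is true
    return regs

taken = run_delay_slots(3, 3, {"a": 10})       # branch condition true
not_taken = run_delay_slots(3, 4, {"a": 10})   # branch condition false
```

Either way the slots are filled with work, so the delay cycles are not wasted regardless of which direction the branch resolves.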
And I think it's probably also done in some of the HP processors. But the MIT processor at least was called the M-Machine, out of Bill Dally's group. He's now at Stanford, but I think the M-Machine was built right when he was moving from MIT to Stanford. They had, I think, three delay slots, and they were three wide, and they could predicate the instructions in the delay slots. The last thing is exceptions. You take an exception; you want to schedule different code. And these are basically impossible to predict; the compiler has no way to predict them. But this is hard on a superscalar also. When you have a traditional superscalar and an exception happens, they usually end up flushing the pipe anyway. And it doesn't happen that often, so it probably doesn't hurt your performance that much. So no one's going to lose too much sleep over this one. So briefly, I wanted to say something about how to build really wide VLIWs. As we go to wider and wider VLIWs, with lots of instructions executing at the same time, you have to start thinking about what the register file and the bypass network look like. So in this drawing here we actually have a figure of the C6000-series processors. These are the TI DSPs, sort of the flagship of TI's DSP line.
And this is a sort of block-level diagram of what they have. What they've actually done is divide the machine into local register files, and they bypass within what are called clusters. So clustered VLIWs are similar to clustered superscalars, which also divide the register file, but here the splitting or dividing is architectural: it's exposed at the ISA level. In something like the C6400, they actually have four instructions per cluster, so they are executing eight instructions at the same time. You can bypass values between the four ALUs within a cluster, but if you want to take a value from one cluster and move it to the other, you have a very low-bandwidth bypass path, and it takes an instruction: you effectively have to have a move instruction to move between the two different clusters. So there's lower bandwidth between the clusters and higher latency between the clusters, but inside a cluster it's very fast. And what's important to note here is that these are not two processors. These are all executing one instruction at the same time, so it's an eight-wide instruction executing on these eight different ALUs. And this is used in the sort of high-end TI DSPs; it's also used in HP and STMicro's LX processor. This is probably a processor you've never heard about, but this was actually what Josh Fisher went on to do at HP Labs after Multiflow. So this is after some of the original VLIW work; the same person went and built this LX processor, a joint collaboration between STMicro and HP. And this shows up in printers today. So the LX processor is probably not something you're going to have on your desktop machine.
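The clustered register file organization described above can be sketched with a small model. This is an illustrative abstraction in the spirit of the two-cluster C6400 description, not TI's actual design; the register names and the simple instruction count are invented. The key property it shows: ALU operations read only their own cluster's register file, and crossing clusters costs an explicit move instruction.

```python
class ClusteredVLIW:
    """Two-cluster machine: each cluster has a private register file."""
    def __init__(self):
        self.regfile = {0: {}, 1: {}}   # one local register file per cluster
        self.instructions_issued = 0

    def alu_add(self, cluster, dst, a, b):
        # Intra-cluster add: both sources must live in this cluster's
        # file, so the result can be bypassed cheaply within the cluster.
        rf = self.regfile[cluster]
        rf[dst] = rf[a] + rf[b]
        self.instructions_issued += 1

    def cross_move(self, src_cluster, src, dst_cluster, dst):
        # Inter-cluster transfer: a real instruction slot is spent,
        # modeling the low-bandwidth, higher-latency cross path.
        self.regfile[dst_cluster][dst] = self.regfile[src_cluster][src]
        self.instructions_issued += 1

m = ClusteredVLIW()
m.regfile[0].update({"A0": 5, "A1": 7})
m.regfile[1].update({"B0": 100})
m.alu_add(0, "A2", "A0", "A1")      # stays entirely inside cluster 0
m.cross_move(0, "A2", 1, "B1")      # pay one instruction to cross clusters
m.alu_add(1, "B2", "B0", "B1")      # now cluster 1 can use the value
```

The compiler's job on such a machine is to partition the code so that values rarely need to cross, since every crossing burns an issue slot that could have done useful work.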