This is a rich, rich field in the compiler research world. There have been a lot of problems with VLIWs, or classical VLIWs, and people have built things that sit somewhere between superscalars and classical VLIWs to solve some of these problems. People have come up with fancy compiler optimizations to solve some of them, and some are still open.

First on my list here: object-code compatibility. On a superscalar, because we emit one serialized instruction sequence and the architecture does all the scheduling, you can change the number of functional units under the hood in the microarchitecture of your processor and no one is ever the wiser. It will still execute the piece of code. It may not be optimal, but it will still execute. That's not necessarily the case for classical VLIWs: you have to recompile the code when you change the microarchitecture. So there's a very tight coupling between the architecture and the microarchitecture, because our instruction encoding now says exactly, let's say, two integer operations, two memory operations, and two floating-point operations, or something like that.
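As a minimal sketch of this tight coupling, here is a toy model of a classical-VLIW schedule baked to one functional-unit mix. The slot types, opcodes, and bundle shape are all invented for illustration; no real ISA is being modeled.

```python
# Hypothetical sketch: a classical-VLIW schedule is compiled against one
# fixed functional-unit mix, so it cannot be validated against a machine
# with a different mix. All names here are illustrative.

# Slot layout the compiler assumed: 2 integer, 2 memory, 2 floating-point
COMPILED_SLOT_TYPES = ("int", "int", "mem", "mem", "fp", "fp")

def runs_on(machine_slot_types, bundle):
    """A bundle executes only if each op sits in a slot of matching type."""
    if len(bundle) != len(machine_slot_types):
        return False  # the instruction width itself changed
    return all(op == "nop" or op_type == slot
               for (op, op_type), slot in zip(bundle, machine_slot_types))

# One bundle scheduled for the original mix: 2 int ops, 1 load, rest NOPs
bundle = [("add", "int"), ("sub", "int"), ("ld", "mem"),
          ("nop", None), ("nop", None), ("nop", None)]

print(runs_on(COMPILED_SLOT_TYPES, bundle))          # True: the assumed mix
print(runs_on(("int", "mem", "mem", "fp"), bundle))  # False: a different mix
```

The same binary that fit the compiled-for mix simply does not decode on a machine with a different slot layout, which is the recompilation problem in miniature.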
But if all of a sudden you build a machine which has a different mix, your schedule is completely wrong, so it's going to have bad performance, and it's probably just not going to execute, because you probably changed the instruction encoding standard when you went and made that different VLIW.

Another big problem is code size. As you can imagine, there are a fair number of no-operation instructions, or no-operation operations, inside a VLIW bundle. If you can't fill a slot on a superscalar, you just don't emit an instruction. On a VLIW, if you can't fill the slot, you have to put a NOP there, because you've got to put something there. This causes some serious problems. What hurts this even more is the fancy techniques we talked about: loop unrolling and software pipelining bloat the code size. We're replicating code; we've unrolled the code; we're using more space. This hurts our instruction cache size and instruction cache footprint.

We'll talk a little more about this in a few slides, but variable-latency operations are very hard to deal with. If you have a load, you don't know whether it's going to take a cache miss or not, so your schedule may be wrong if you guessed wrong.
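The NOP-padding cost described above can be sketched with made-up numbers: a fixed-format VLIW pays for every slot in every bundle, while a serial superscalar encoding stores only the operations that exist. Bundle width and operation size below are arbitrary assumptions.

```python
# Hypothetical sketch of VLIW NOP padding. A fixed-format VLIW must pad
# every bundle to full width; a serial encoding stores only the real ops.
# WIDTH and OP_BYTES are invented numbers for illustration.

WIDTH = 4      # operation slots per bundle
OP_BYTES = 4   # bytes per operation slot

def vliw_bytes(bundles):
    # every bundle occupies all WIDTH slots, filled or NOP-padded
    return len(bundles) * WIDTH * OP_BYTES

def superscalar_bytes(bundles):
    # a serial encoding stores only the operations actually present
    return sum(len(b) for b in bundles) * OP_BYTES

# a sparse schedule: most bundles use only 1-3 of the 4 slots
schedule = [["add", "ld"], ["mul"], ["add", "sub", "st"], ["br"]]
print(vliw_bytes(schedule), superscalar_bytes(schedule))  # 64 28
```

Even in this tiny example the padded form is more than twice the size, and unrolling or software pipelining multiplies the number of bundles on top of that.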
You can guess with some high probability: you can say, I think this load usually takes a cache miss, or, I think this load usually doesn't take a cache miss. But we can guess wrong. Similar sorts of things happen with branch mispredictions.

Scheduling for statically unpredictable branches gets very hard. There are some techniques to solve this. There's something called predication, which we'll talk about in a few slides, that helps to solve this. So you can add things back into the architecture, into your VLIW architecture, to deal with short branches that are very hard to predict, or data-dependent branches.

And as I said, depending on your design, precise interrupts can be challenging, to say the least. If you're actually using the EQ model, you probably have a hard time figuring out what to do on a single step, or if you actually take a branch in the middle while you have a pending operation going on. It's sort of undefined; it's icky. It's similar to having branch delay slots and taking a fault in your branch delay slot: what do you really do with that?

Also, and this is an interesting point here: if you have a fault on, let's say, one operation in a bundle, does the entire bundle fault, or just that operation?
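To preview what predication buys you, here is a toy if-conversion of an absolute-value branch: both arms issue, and a predicate decides which result commits. This is a sketch of the idea in plain Python, not the syntax or semantics of any real predicated ISA.

```python
# Hypothetical if-conversion sketch: a data-dependent branch is replaced
# by predicated operations that always execute but only commit their
# result when their predicate is true. A toy model, not a real ISA.

def run_predicated(x, regs):
    p = x < 0                                # set predicate instead of branching
    # both arms issue; the predicate selects which result commits
    regs["r1"] = -x if p else regs["r1"]     # (p)  r1 <- -x
    regs["r1"] = x if not p else regs["r1"]  # (!p) r1 <- x
    return regs["r1"]

print(run_predicated(-5, {"r1": 0}))   # 5 -- abs() with no branch to predict
print(run_predicated(3,  {"r1": 0}))   # 3
```

The hard-to-predict branch disappears entirely, which is exactly why predication helps a static scheduler: the schedule no longer depends on which way the branch goes.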
Or take an interrupt. Typically, the way people implement this is to make the entire bundle, or the entire instruction, atomic. If anything in the bundle takes a fault, you don't allow any of the sub-operations to commit. That sounds like the most rational thing to do. People have done things in the middle; I probably don't recommend building any of those machines. In the VLIWs that I've built, an entire bundle is atomic. That makes traps a lot easier.

But people have not always done that. One of the interesting cases, if you think about it: if you have, let's say, five operations in a bundle and only one of them faults, maybe you handle that one but let the other ones commit, and then when you come back you use some sort of mask to say which ones you need to re-execute. People have built things like that; they get tricky.

Okay. For the rest of today's lecture and next lecture, we're going to talk about techniques to solve a lot of these problems, or a lot of these challenges. Some of them are compiler techniques, some of them are hardware, and some of them are both.
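The bundle-atomic policy described above can be sketched as an all-or-nothing commit: results are staged, and if any operation faults, none of them reach architectural state. The register names and the fault mechanism below are invented for illustration.

```python
# Hypothetical sketch of bundle-atomic trap handling: if any operation in
# a bundle faults, no results commit, so the trap handler sees a clean
# pre-bundle architectural state. All names are illustrative.

class Fault(Exception):
    pass

def execute_bundle(regs, ops):
    staged = {}                      # results held back until all ops succeed
    for dest, fn in ops:
        try:
            staged[dest] = fn(regs)
        except Exception:
            raise Fault(dest)        # staged results are discarded: nothing commits
    regs.update(staged)              # all ops succeeded: commit the whole bundle

regs = {"r1": 10, "r2": 0}
try:
    execute_bundle(regs, [
        ("r3", lambda r: r["r1"] + 1),        # would succeed on its own
        ("r4", lambda r: r["r1"] // r["r2"]),  # divide by zero: faults
    ])
except Fault:
    pass
print(regs)   # {'r1': 10, 'r2': 0} -- state unchanged, the bundle was atomic
```

The partial-commit-plus-mask scheme mentioned above would instead commit `r3` and record that only the faulting slot needs re-execution, which is where the trickiness comes from.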
The first thing people try to do is come up with compressed instruction encodings, or fancier instruction encodings, but when you go to do this, it makes the front end more complicated. So here we have, let's say, some instruction, but inside of the instruction we can have different groups: inside of a group, operations execute in parallel, but between groups they do not. Something like the Itanium processor, or the IA-64 processor, from Intel, actually looks something like that.

Another thing you can do is have compressed instruction formats, and then when you go to actually execute, you decompress the NOPs into your instruction memory, maybe. That's what the Multiflow TRACE processor did. Marking parallel groups is what I was talking about before.

Cydrome had an interesting solution to this. They actually had a single-operation view of a VLIW instruction. To save space, they had their wide instructions, but if you had a case where you were only going to execute one operation in an instruction, there was a special encoding format just for that case. That saved a lot of encoding space, or a lot of instruction space, if you will.

Another example of this is a processor I worked on, the Tilera TILE64 processor, which is a 3-wide VLIW.
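The parallel-group idea can be sketched with a stop bit, loosely in the spirit of how IA-64 delimits instruction groups: operations are stored serially with a bit that closes each parallel group, so no NOPs need to be stored in memory. The stream format below is invented, not the real IA-64 bundle encoding.

```python
# Hypothetical stop-bit sketch: instructions are stored serially, and a
# stop bit marks the end of each parallel group, so empty slots never
# appear in memory. Loosely inspired by IA-64 groups; format is invented.

def split_groups(stream):
    """stream: list of (op, stop_bit); returns lists of parallel ops."""
    groups, current = [], []
    for op, stop in stream:
        current.append(op)
        if stop:                 # stop bit closes the current parallel group
            groups.append(current)
            current = []
    if current:                  # trailing group without a stop bit
        groups.append(current)
    return groups

stream = [("add", 0), ("ld", 1),       # group 1: add || ld
          ("mul", 1),                  # group 2: mul alone, no NOPs stored
          ("sub", 0), ("st", 0), ("br", 1)]
print(split_groups(stream))
# [['add', 'ld'], ['mul'], ['sub', 'st', 'br']]
```

The front-end cost the lecture mentions shows up here: the decoder must now scan for stop bits to reassemble groups, instead of reading fixed-width bundles.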
We had an encoding standard, or we have an encoding standard, which allows you to execute either two operations at a time or three operations at a time. When you're executing two operations at a time, you have a richer palette of instructions you can execute. So it's something in the middle: it gives you some better code density benefits, but without the complexity of having fully compressed formats and things like that.

Okay, so that's one way to deal with the instruction encoding challenges. One thing you can think about, though, is just having a bigger instruction cache, and a wider bus from your instruction cache onto your memory system. That does solve a lot of these problems. It costs hardware, but it's a simple, stupid solution to the problem versus a smart solution. These sorts of things on the list here are complex, smart solutions; the simple solution is just to have a bigger instruction cache. And if you have bigger code sequences, you won't feel the performance hit as much from that.
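The dual-format idea can be sketched as a choice between two bundle shapes: three operations from a restricted set, or two operations from the full palette. The opcode lists and the format-selection rule below are invented for illustration and are not the actual TILE64 encoding.

```python
# Hypothetical sketch of a dual-format bundle encoding in the spirit of a
# 2-op/3-op VLIW: a bundle holds either 3 ops from a restricted opcode
# set or 2 ops from a richer set. Opcode lists are invented.

RESTRICTED = {"add", "sub", "ld", "st"}   # ops available in the 3-op format

def pick_format(ops):
    if len(ops) == 3 and all(o in RESTRICTED for o in ops):
        return "3-op"
    if len(ops) <= 2:
        return "2-op"                     # full opcode palette available
    raise ValueError("bundle must be split and re-scheduled")

print(pick_format(["add", "ld", "st"]))   # 3-op: dense common case
print(pick_format(["mulhi", "add"]))      # 2-op: rich op needs the wide format
```

The density win comes from the common case fitting the narrow 3-op format, while the rare, richer operations fall back to the 2-op format instead of forcing every bundle to carry full-width encoding space.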