This is a rich, rich field in the compiler research world. There have been a lot of problems with classical VLIWs, and people have built things that sit somewhere between superscalars and classical VLIWs to solve some of these problems. People have also come up with fancy compiler optimizations to solve some of them, and some are still open.

First on my list here is object-code compatibility. On a superscalar, because we hand it one serialized instruction sequence and the hardware does all the scheduling, you can change the number of functional units under the hood in the microarchitecture of your processor, and no one is ever the wiser. It will still execute the piece of code. It may not be optimal, but it will still execute. That's not the case for classical VLIWs: you have to recompile the code when you change the microarchitecture. There is a very tight coupling between the architecture and the microarchitecture, because our instruction encoding now says exactly, let's say, two integer operations, two memory operations, and two floating-point operations, or something like that. But if all of a sudden you build a machine with a different mix, your schedule is completely wrong, so it's going to have bad performance, and it's probably just not going to execute, because you changed the instruction encoding standard when you went and made that different VLIW.

Another big problem is code size. As you can imagine, there are a fair number of no-operation operations inside a VLIW bundle. If you can't fill a slot on a superscalar, you just don't issue an instruction. On a VLIW, if you can't fill a slot, you have to put a NOP there, because you have to put something there. This causes some serious problems. What hurts even more is the fancy techniques we talked about: loop unrolling and software pipelining bloat the code size. We're replicating code, we've unrolled loops, we're using more space, and that hurts our instruction cache footprint. We'll talk a little more about this in a few slides.

Variable-latency operations are also very hard to deal with. If you have a load, you don't know whether it's going to take a cache miss or not, so your schedule may be wrong if you guessed wrong. You can guess with some high probability: you can say, "I think this load usually takes a cache miss," or, "I think this load usually doesn't take a cache miss." But you can guess wrong. Similar sorts of things happen with branch mispredictions: scheduling for statically unpredictable branches gets very hard. There are some techniques to solve this. One is called predication, which we'll talk about in a few slides, and there's a small sketch of the idea just below. You can add things back into your VLIW architecture to deal with short branches that are very hard to predict, or data-dependent branches.

And as I said, depending on your design, precise interrupts can be challenging, to say the least. If you're actually using the EQ model, you'll probably have a hard time figuring out what to do on a single step, or if a branch arrives in the middle while you have a pending operation going on. It's sort of undefined; it's icky.
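(Since predication came up, here's a minimal sketch of the if-conversion idea in C. The function names are made up for illustration and aren't from any real ISA; this just shows the transformation a predicating compiler would do.)

```c
/* Branchy version: a short, data-dependent branch that's hard to
 * predict and hard to schedule around on a VLIW. */
int abs_diff_branchy(int a, int b) {
    if (a > b)
        return a - b;
    return b - a;
}

/* If-converted (predicated) version: both sides compute
 * unconditionally and a predicate selects the result, so the
 * compiler can pack the operations into wide bundles with no
 * branch at all. */
int abs_diff_predicated(int a, int b) {
    int p  = a > b;      /* predicate */
    int t1 = a - b;      /* runs regardless of p */
    int t2 = b - a;      /* runs regardless of p */
    return p ? t1 : t2;  /* conditional select, not a branch */
}
```

The predicated version has nothing for the hardware to mispredict; the cost is that both sides always execute, which is why this only pays off for short branches.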
It's sort of similar to having branch delay slots and taking a fault in your branch delay slot: what do you really do with that? There's also an interesting point here. If you have a fault on, let's say, one operation in a bundle, does the entire bundle fault, or just that operation? The same question applies to interrupts. Typically, the way people implement this is to make the entire bundle, the entire instruction, atomic. If anything in the bundle takes a fault, you don't allow any of the sub-operations to commit. That sounds like the most rational thing to do. People have done things in the middle, and I probably don't recommend building any of those machines. In the VLIWs I've built, an entire bundle is atomic; that makes traps a lot easier. But people have not always done that. One interesting case, if you think about it: if you have, let's say, five operations in a bundle and only one of them faults, maybe you handle that one but let the others commit, and then when you come back you use some sort of mask to say which ones need to re-execute. People have built things like that; they get tricky. There's a sketch of the mask idea at the end of this section.

Okay. For the rest of today's lecture and next lecture, we're going to talk about techniques to solve a lot of these problems, a lot of these challenges. Some of them are compiler techniques, some are hardware, and some are both. The first thing people tried was compressed instruction encodings, or fancier instruction encodings, but when you go to do this it makes the front end more complicated. For example, inside an instruction you can have different groups, where everything inside a group executes in parallel, but execution between groups is not parallel. The Itanium, Intel's IA-64 processor, actually looks something like that. Another thing you can do is have compressed instruction formats, and then when you go to actually execute, you uncompress the NOPs into your instruction memory, or maybe your instruction cache. That's what the Multiflow TRACE processor did. Marking parallel groups is what I was talking about before. Cydrome had an interesting solution to this: to save space, they kept their wide instructions, but for the case where you were only going to execute one operation in an instruction, there was a special single-operation encoding format just for that case. That saved a lot of encoding space, or a lot of instruction space, if you will.

Another example of this is a processor I worked on, the Tilera TILE64. It was a 3-wide VLIW, and we had an encoding standard that allows you to execute either two operations at a time or three operations at a time. When you're executing two at a time, you have a richer palette of instructions you can execute. So it's something in the middle: it gives you some of the code density benefits, but without the complexity of having fully compressed formats and things like that. Okay, so that's one way to deal with the instruction encoding challenges.
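(To make the NOP-compression idea concrete, here's a minimal sketch in C, in the spirit of what the Multiflow TRACE did. The 4-slot format and the header byte are hypothetical, not any real machine's encoding.)

```c
#include <stdint.h>
#include <string.h>

enum { SLOTS = 4, OP_NOP = 0 };  /* hypothetical 4-slot machine */

/* Expand one compressed bundle. The compressed form is a 1-byte
 * presence mask (bit i set means slot i holds a real operation)
 * followed by one 32-bit word per present slot. Returns a pointer
 * just past this bundle, ready for the next one. */
const uint8_t *expand_bundle(const uint8_t *stream, uint32_t out[SLOTS]) {
    uint8_t mask = *stream++;
    for (int slot = 0; slot < SLOTS; slot++) {
        if (mask & (1u << slot)) {
            memcpy(&out[slot], stream, sizeof(uint32_t));  /* real op */
            stream += sizeof(uint32_t);
        } else {
            out[slot] = OP_NOP;  /* NOP re-inserted at fetch/fill time */
        }
    }
    return stream;
}
```

Memory and disk hold only the occupied slots plus the mask; the NOPs get re-materialized when the bundle is expanded into the instruction cache, which is where the front-end complexity comes from.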
One other thing you can do, though, is just have a bigger instruction cache, or a wider bus from your instruction cache to your memory system. That does solve a lot of these problems. It costs hardware, but it's a simple, stupid solution to the problem versus a smart solution: the things on this list are complex, smart solutions, while the simple solution is just a bigger instruction cache. And if you have bigger code sequences, you won't feel the performance hit from that as much.
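(And as promised, a minimal sketch of the per-operation fault mask mentioned earlier. The struct and the masks here are a hypothetical illustration of the bookkeeping, not any real machine's trap architecture.)

```c
#include <stdint.h>
#include <stdbool.h>

enum { SLOTS = 4 };  /* hypothetical bundle width */

typedef struct {
    uint64_t pc;         /* address of the faulting bundle */
    uint8_t  redo_mask;  /* bit i set => slot i still needs to run */
} trap_state;

/* Called when some slot of the bundle at `pc` traps. `done_mask`
 * marks the slots that already committed this cycle. */
void take_trap(trap_state *ts, uint64_t pc, uint8_t done_mask) {
    ts->pc = pc;
    /* Slots that committed must not run again; the faulting slot
     * and any slot that never issued get re-executed on return. */
    ts->redo_mask = (uint8_t)(~done_mask & ((1u << SLOTS) - 1u));
}

/* On trap return, the front end consults the mask for each slot. */
bool slot_should_issue(const trap_state *ts, int slot) {
    return (ts->redo_mask >> slot) & 1u;
}
```

The tricky part this sketch hides is everything around it: the mask has to be saved and restored with the rest of the trap state, and a partially committed bundle complicates precise-interrupt reasoning, which is exactly why bundle-atomic designs make traps so much easier.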