We left off last time talking about some implementation issues around multithreading. A little bit of background on what we're doing today: we're going to be finishing up multithreading. We're going to be using cutlery, and you'll see why we're using cutlery in a minute. That's right, a whole box of forks. And we'll be talking about synchronization and synchronization primitives.

So let's go back to multithreading and look at actual simultaneous multithreading, which is where we left off. Just to recap, we talked about different types of multithreading. We talked about non-multithreaded processors: a superscalar, where we're trying to fill in the empty issue slots here. We can do fine-grain multithreading, but that's not simultaneous multithreading. We can do coarse-grain multithreading. We could even possibly think about trying to do this in some fast software layer. You could also think about just cutting your cluster in half and using half your ALUs for one thread and half your ALUs for another thread; that is technically a version of multithreading. It is executing at the same time, but it's not necessarily using all the resources. And then you could think about actual simultaneous multithreading, where you're issuing instructions from different threads into different functional units at the same time.

We talked about this last time, and one of the things that came up was how you go about getting parallelism. If you have thread-level parallelism, you can mix and match instruction slots from different threads and get parallelism that way. And one of the great things about simultaneous multithreading is that if you have a fast superscalar machine and you use it for simultaneous multithreading, you can reduce the number of threads when you don't have enough thread-level parallelism and go after instruction-level parallelism instead. So here we see the blue crosshatch, thread zero's instructions, and this is the same thread running in the same time period. If you have lots of threads running, you can intermix the different threads and run the blue crosshatch thread slower than if you were to use the parallel resources to run one thread faster. That flexibility is one of the big insights that simultaneous multithreading lets us go after.

We briefly talked about this, and flew through it, at the end of lecture last time: what is the cost of implementing simultaneous multithreading? Conveniently, there is a really good example of this, the Power 4 and the Power 5 by IBM. They're very similar architectures, except the Power 5 has two-way simultaneous multithreading and the Power 4 is almost the same architecture without it. So if you look at these two pictures, one of the big things they added is more fetch bandwidth, so you can fetch from two different program counters at the same time. You can decode more instructions. And then they added an extra pipe stage here for what they call group formation, which is basically the scheduling of the two different threads' instructions together by the time they reach the execution units.
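Before digging further into the Power 4 versus Power 5 comparison, here's a minimal toy simulation of the slot-filling idea from the recap above. This is my own sketch, not a model of any real machine; the four-wide width and the stall patterns are made-up numbers for illustration.

```python
# Toy model of issue-slot filling (my own sketch, not any real machine):
# a 4-wide issue stage, where each thread reports how many instructions
# it can issue in a given cycle (0 on a stall).
WIDTH = 4

def slots_used_single(ready):
    # One thread alone: its stall cycles leave all four slots empty.
    return sum(min(r, WIDTH) for r in ready)

def slots_used_smt(ready_by_thread):
    # SMT: each cycle, fill the four slots greedily across all threads.
    used = 0
    for cycle in zip(*ready_by_thread):
        slots = WIDTH
        for r in cycle:
            take = min(r, slots)
            slots -= take
            used += take
    return used

thread_a = [4, 0, 0, 2, 4, 0, 1, 3]   # stalls (e.g. cache misses) on some cycles
thread_b = [2, 3, 4, 0, 1, 4, 2, 0]

print(slots_used_single(thread_a))            # 14 of 32 slots used
print(slots_used_smt([thread_a, thread_b]))   # 27 of 32 slots used
```

The point of the toy numbers: thread A alone wastes most of the machine on its stall cycles, while mixing in thread B fills those holes, even though each individual thread now moves through the machine more slowly.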
Also, over here, as we've talked about before, even without simultaneous multithreading it's pretty useful to have more registers: more physical registers and more architectural registers. You're minimally going to need more architectural registers, and more physical registers is probably good for performance.

So we're talking about implementation details, and one of the questions that comes up is: what is the cost of actually implementing simultaneous multithreading on this processor? Another question that comes up here is: why two threads instead of four threads, or eight threads, or more? It has costs; with more threads they would have to start replicating more data structures. For the pipeline they had roughly carried over from the Power 4, they had enough compute and bandwidth through their execution units to handle two threads. But if they wanted to go to more threads, they would basically bottleneck somewhere in their architecture. I don't know exactly where they bottlenecked; the paper talks about this, but it basically just says that they bottleneck somewhere in their execution, so I guess they leave it a little vague. So they decided it wasn't worth adding more than two threads, because it's not going to give you any performance increase unless you have threads that have a lot of stalls. If you have threads with enough stalls, you probably have enough holes to intermix another thread in there.

So, some of the changes they made in the hardware. They increased their cache size and they increased the associativity of their caches; the L2 here went to 1.92 megabytes, versus 1.44 on the Power 4. And you can see here that one of the interesting things is they start to actually separate data structures, because if you have multiple threads running and you mix them all into one big hardware data structure, one of the problems that comes up is that one of the threads may hog the data structure. So threading actually introduces a lot of complexity having to do with figuring out how to be fair in the usage of the structures. One solution is to just replicate: anywhere it was hard to get the sharing right, or there were a lot of conflicts, you could just replicate. So they added separate instruction fetch, prefetch, and buffering. And then they added more physical registers (they call them virtual registers here, but it's effectively more physical registers), so more places to schedule into.

So there were some added hardware costs to this. To give you an idea of the area: a 24% area increase, and a lot of that is just due to the cache, because they made the cache so much larger. So you have to ask yourself: does it make sense to use 24% more area to run two threads, or at some point does it make sense to just plop down a second processor? It's a tough trade. Well, if it's 24% bigger and you get a two-times performance boost, it was a good trade. If it's 24% bigger and you get less than a 24% performance boost, it's pretty questionable.
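Going back to the hogging problem for a second, here's a minimal sketch contrasting one shared structure with replicated per-thread copies. This is my own illustration, not anything from the Power 5 design; the buffer sizes and the try_insert interface are invented for the example.

```python
# Minimal sketch (my illustration, not IBM's design): a shared structure
# versus replicated per-thread copies. With sharing, a busy thread can
# fill the whole structure and starve the other; with replication, each
# thread can only exhaust its own copy. Sizes are arbitrary.

class SharedBuffer:
    """One pooled buffer: better utilization, but no fairness guarantee."""
    def __init__(self, total_entries):
        self.entries = []
        self.capacity = total_entries

    def try_insert(self, thread, item):
        if len(self.entries) < self.capacity:
            self.entries.append((thread, item))
            return True
        return False          # may fail because another thread hogged it

class ReplicatedBuffers:
    """Private per-thread buffers: fair by construction, costs more area."""
    def __init__(self, n_threads, entries_each):
        self.bufs = [[] for _ in range(n_threads)]
        self.capacity = entries_each

    def try_insert(self, thread, item):
        if len(self.bufs[thread]) < self.capacity:
            self.bufs[thread].append(item)
            return True
        return False          # can only fail due to this thread itself

# Thread 0 floods a shared 8-entry buffer; thread 1 then finds it full.
shared = SharedBuffer(8)
for i in range(8):
    shared.try_insert(0, i)
print(shared.try_insert(1, "x"))   # False: thread 1 is starved

rep = ReplicatedBuffers(2, 4)
for i in range(8):
    rep.try_insert(0, i)           # thread 0 can only fill its own copy
print(rep.try_insert(1, "x"))      # True: thread 1 is unaffected
```

The design choice is exactly the area trade-off above: the replicated version guarantees fairness by construction, but you pay for entries that may sit idle when one thread is inactive.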
So let's look at another example, and then we'll wrap up on the performance of this and see if they did get their 24% performance boost. The first implementation of simultaneous multithreading (there have been many implementations of multithreading, but this was the first simultaneous multithreading) was the Pentium 4 processor. Now, they called it hyperthreading; that's an Intel marketing term. It is simultaneous multithreading, but on the Pentium 4 they didn't actually replicate very many structures. They didn't really change the processor very much. In fact, there are modes to turn off the simultaneous multithreading they were paying for, and it was not a flagship feature when the processor was first shipped. So the basic idea in the Pentium 4 is that the little bit of extra hardware they added came from duplicating some resources here. Overall they increased their die area by about 5%, so a very, very small amount. Now the question is: what is the performance boost they can get from this?

One of the interesting things here is that a big problem started to come up with the Pentium 4: when you were trying to run one thread on it, you weren't guaranteed that the performance of that one thread would be equal to what you'd get if you turned off hyperthreading (sorry, simultaneous multithreading). So they had a mode switch they could turn off. There were some structures they would partition 50/50 when they were in multithreading mode, so you didn't have deadlock problems and you didn't have resource contention problems on the shared structures. And when you turned this mode switch off, some programs got faster. To some extent, what that meant is they didn't size those structures correctly to be running in multithreading mode at all times. And the switch was not a little switch. It was a big switch, as in reboot the computer and change the switch: it was a boot-time parameter to the chip. This left such a bad taste in Intel's mouth that simultaneous multithreading got kicked out of Intel land for a few generations. The Pentium M, Core Duo, and Core 2 Duo all kicked out simultaneous multithreading, and it didn't work its way back in until the Nehalem Core i5 and i7 sorts of processors recently.

So it's interesting to see that you can use a little bit of hardware, but you have to be careful. Now, what was the biggest problem that they had here? The biggest problem was that the load/store queue on the Pentium 4 was split in half when running in two-thread mode. The first Pentium 4 architecture that shipped with simultaneous multithreading turned on just split the structure in half statically. It was sized to run one thread, and when they cut it in half and tried to run two threads, it didn't have enough bandwidth, or didn't have enough entries, for lots of programs. It was effectively not providing enough performance: they couldn't get enough loads out to the memory system from one thread once they cut that structure in half. So it really starts to bring up the question of what the right allocation of resources is. Should you use some sort of round-robin scheme? Should you use dynamic allocation? Should you use static allocation? Should you replicate? If you replicate, it costs more area. These are some of the big challenges here.
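To see why statically halving the load/store queue hurts a single thread so badly, here's a minimal sketch using Little's law. The 48-entry size and 100-cycle miss latency are numbers I made up for illustration, not the Pentium 4's actual parameters.

```python
# Minimal sketch (my model, made-up numbers): Little's law view of why
# halving a load queue halves memory-level parallelism. With E queue
# entries and an L-cycle miss latency, at most E misses can be in
# flight, so the sustainable rate is E / L loads per cycle.

def load_rate(entries, miss_latency):
    # Little's law: occupancy = rate * latency  =>  rate = occupancy / latency
    return entries / miss_latency

full   = load_rate(48, 100)       # queue sized for one thread
halved = load_rate(48 // 2, 100)  # the same thread in two-thread mode
print(full, halved)               # 0.48 vs 0.24 loads/cycle: half the MLP
```

So a memory-hungry single thread running in two-thread mode sees its achievable memory-level parallelism cut in half, even if the second thread is doing nothing useful.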
Okay, so let's look at some of the performance here. The numbers were pretty abysmal for the Pentium 4, even when everything was going well. For example, the Pentium 4 Extreme Edition with simultaneous multithreading gives you a 1% speedup when you're running two threads on SPECint rate. (You're running multiple copies of the same program; that's how SPEC rate is done.) That was the integer one. For the floating-point one, it did a little bit better: a 7% improvement. Now, that's not horrible, considering it was only a 5% area increase, but still, that's not saying a whole lot. I mean, you have to question whether this whole level of complexity was really worth it, and whether that 5% of area could have been used for something else to get a few percent performance increase. And this was, to some extent, pretty typical across applications for the Pentium 4. One of the interesting things about the Pentium 4 was that because the cache wasn't huge (especially the L1 cache, which was kept relatively small so it could operate at very high clock frequencies), you'd get a lot of pollution in there from the two threads, which would actually destructively interfere in the level-one cache. So that was one of the big things they had to try to make up for. This is a problem with any multithreading, not just simultaneous multithreading: the threads can fight for space in the cache, and you end up with both capacity and conflict misses.

The Power 5 did a lot better here. They thought a little bit harder about sharing structures, duplicating structures, and the allocation of the different resources in there. So on SPECint rate, they were getting a 23% improvement for their 24% area. Okay, well, this might actually be a good idea; it might have some value. For the floating-point version they're not doing great, but they're doing better here. The floating-point apps had a lot of cache conflicts: if you go look at the SPECfp apps on the inside, they typically have very large data sets, so they were basically conflicting in their larger, last-level caches and things like that.
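It's worth doing the quick arithmetic on these numbers. The sketch below just divides the quoted speedups by the quoted area increases; the "efficiency" framing is mine, not an official metric, and the second-core baseline is the idealized assumption that doubling area doubles rate throughput.

```python
# Back-of-the-envelope on the numbers quoted above (my framing, using
# the speedup and area figures from lecture).

def efficiency(speedup_pct, area_pct):
    # Performance gained per unit of area spent.
    return speedup_pct / area_pct

print(efficiency(1, 5))      # Pentium 4, SPECint rate: 0.2
print(efficiency(7, 5))      # Pentium 4, SPECfp rate: 1.4
print(efficiency(23, 24))    # Power 5, SPECint rate: ~0.96
print(efficiency(100, 100))  # idealized second core on rate workloads: 1.0
```

Against the idealized second-core baseline of 1.0, the Pentium 4's integer result is clearly a loss, its floating-point result is actually a win for such a cheap feature, and the Power 5 is roughly break-even on area while being much cheaper than a whole second core in design effort.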
Finally, I wanted to give a little bit of color to this idea of picking fairly and not having starvation in the different resources of a simultaneous multithreading pipeline. One processor that is sort of the famous simultaneous multithreading processor that was never built was the EV8, Digital Equipment Corporation's last Alpha, also known as the 21464. It would have had eight-way simultaneous multithreading: note that it was never built, but it was a wide superscalar with eight-way simultaneous multithreading, and it was also a very aggressive processor. And when I say it was never built, I mean it was never shipped commercially; they went pretty far in the design of it.

So one of the questions they came upon here is: what is the right way, in an out-of-order pipeline, to make sure that you're getting correct utilization of the pipeline when you're issuing instructions from different threads? In this example here, we have, we'll say, four different threads, and you need to choose which thread to pick from to go into your multi-issue out-of-order pipeline. By definition, you want to try to fill in the holes of one thread with work from another thread. So complete fairness here, lockstep round-robin choosing for instance, is not what you want to do; it's completely the wrong thing to do, because you want to fill in when one thread is stalled with other threads' work. But you also want to make sure that a thread which, let's say, has no dependencies, could run very fast, and doesn't go to memory very often (it just doesn't stall very often) doesn't hog the processor, because it's never going to introduce stall cycles by itself.

So they came up with this idea called icount, a choosing policy which basically looked at the number of instructions retiring out of the back of the pipeline, with something like a moving window, to estimate how many instructions from each thread were completing, and then fed that back around to the front of the pipeline to determine which thread to issue from. The idea I wanted to get across is that if you try to intermix different threads in a single data structure, it's relatively simple: you can either cut the data structure in half or have some round-robin allocation. It gets much more complicated when you have a full processor pipeline that you need to figure out how to issue into, and then, you know, twenty stages later, if something happens, there's so much stuff in flight that you need to worry about this a little bit harder. So here, they were basically looking at how many instructions were in flight and how many instructions had recently committed from each thread to determine what to do, and on average they were hoping to have some fairness between the threads.
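As a concrete illustration, here's a minimal sketch of an icount-style chooser. It follows the simplest formulation I know of (pick the thread with the fewest instructions in the front of the pipeline), which is a simplification of the feedback scheme described above; the thread counts below are made up.

```python
# Minimal sketch of an icount-style fetch policy (a simplification of
# the scheme described above, not the EV8 design): each cycle, fetch
# from the thread with the fewest instructions currently occupying the
# front-end queues. A thread whose instructions pile up (stalled,
# clogging the machine) gets throttled; a thread that drains quickly
# gets priority, without ever being allowed to hog fetch outright.

def icount_pick(in_flight):
    """in_flight: dict of thread id -> instructions currently in the
    decode/rename/issue queues. Returns the thread to fetch from."""
    return min(in_flight, key=in_flight.get)

# Thread 2 is stalled on memory and has filled the queues; thread 1 is
# committing quickly and has few instructions in flight, so it wins.
counts = {0: 12, 1: 3, 2: 30, 3: 9}
print(icount_pick(counts))   # -> 1
```

Note how this gets the balance the lecture is describing: the fast thread is favored only as long as its instructions keep draining, and the moment it starts accumulating entries in the shared structures, its priority drops and the other threads get fetch slots.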