Okay, so that was coarse-grain multithreading. We're going to move on to mixing instructions from different threads in the pipeline at the same time — issuing instructions from two different threads in the same cycle to different pipelines. This is called simultaneous multithreading, and let me start off with a picture of what it looks like; we'll work backwards in the slide deck for a second. Here we have a four-issue processor — this is our issue width — with time increasing going down, and each pattern represents a different thread. The idea in simultaneous multithreading is that you can execute instructions from different threads in different pipelines simultaneously. Now, this gets quite a bit harder than the basic design, because now we have to read the register file for four different threads simultaneously, and we have to fetch code from four different program counters simultaneously. Where simultaneous multithreading came from: people were building complex out-of-order superscalars, and those machines already had all this logic to track dependencies between different instructions and to restart sub-portions of instruction sequences. For example, on a branch mispredict you have to kill all the instructions dependent on the mispredicted branch and leave the independent ones alone. With an out-of-order mechanism, the extra logic to figure that out is already there. Dean Tullsen, Susan Eggers, and Hank Levy came up with the idea: what if we utilize all the dead slots in our out-of-order superscalar by intermixing instructions from different threads simultaneously to fill the time?
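To make the "dead slots" argument concrete, here is a toy model, in Python, of issue-slot utilization on a four-wide machine. The per-cycle ready-instruction counts are entirely made up for illustration; the point is only that a single thread rarely fills all four slots, while an SMT machine can top up the leftover slots from other threads.

```python
# Toy model of issue-slot utilization on a 4-wide machine (hypothetical trace).
# Each thread reports how many instructions it has ready each cycle; a
# single-threaded machine draws only from thread 0, while an SMT machine
# greedily fills the remaining slots from the other threads.

ISSUE_WIDTH = 4

# ready[t][c] = instructions thread t could issue in cycle c (made-up numbers)
ready = [
    [4, 1, 0, 2, 1],  # thread 0
    [2, 3, 1, 0, 4],  # thread 1
    [0, 2, 3, 1, 2],  # thread 2
]

def utilization(threads):
    """Fraction of issue slots filled over the whole trace."""
    issued = 0
    cycles = len(threads[0])
    for c in range(cycles):
        slots = ISSUE_WIDTH
        for t in threads:            # fill slots thread by thread
            take = min(slots, t[c])
            issued += take
            slots -= take
    return issued / (cycles * ISSUE_WIDTH)

single = utilization(ready[:1])  # only thread 0: lots of dead slots
smt = utilization(ready)         # all threads share the slots

print(f"single-thread utilization: {single:.0%}")  # 40%
print(f"SMT utilization:           {smt:.0%}")     # 95%
```

With this particular trace the single-thread machine fills 40% of its slots and the SMT machine 95% — in the same spirit as the under-20% utilization numbers measured in the real study discussed next.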
So they did this study, published at ISCA in '95, that ran a bunch of different applications, and the rightmost bar here is the composite, or average. What you should take from it is that the black bar on the bottom is how long the processor is busy actually doing work, and the rest is the different reasons the processor was stalled: instruction cache misses, branch mispredictions, load delays, pipeline interlocks, memory conflicts, and other things. We're only using this processor less than 20% of the time. To show this a different way: we have our multi-issue processor, time running down, and all these purple boxes which are just dead time. We might be able to use subsets of those slots. This cycle is very good — we actually issued four instructions. But here we only issued two; here, one; here, two to the two pipelines in the middle — maybe those are the two ALU pipes, with a load pipe here and a branch pipe there, or something like that. This is kind of a disaster from an IPC perspective. Can we try to reuse that hardware? We talked about coarse-grain multithreading, which effectively slices up the cycles temporally: you run one thread, switch to a different thread, and keep switching between threads over time. Another approach — actually taken in the failed Sun Millennium processor, what was going to be the UltraSPARC V — was this: they had a clustered superscalar (not a clustered VLIW), with a mode where you could flip a bit and cut the two clusters apart, running them as two separate processors on two different threads.
So let's say you had this width-four processor, and instead you split it in half: two functional units running one thread, and two functional units running another. This has some good effects. You don't have to achieve really high IPC — you never have to reach four instructions per clock in this design — because we've narrowed the two pipeline widths. And you could think about putting them back together; that's what the Millennium processor, the UltraSPARC V, tried to do: you can switch between this mode and this mode. But you still have a lot of open slots you can't use, so it still leaves some vertical waste. Another downside is that in this mode one thread can't easily use all of the resources — thread one can't, say, use this functional unit over here in this very static design. So this brings us to full simultaneous multithreading, or SMT. In SMT we can mix and match all these different instructions, but this changes our processor pipeline quite a bit. Okay, so let's look at what this does to a processor pipeline. All of a sudden we need to fetch multiple instructions at a time, from different threads — that's definitely a harder thing to do. Conveniently, we can use our instruction queue, or our reservation stations if you will, to find different instructions to execute at the issue stage of our out-of-order processor. What we do is tag each instruction with its thread, and make that tag part of the issue logic, so the machine knows that thread one's register one is different from thread two's register one.
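The thread-tagging idea above can be sketched very directly: if the rename table is keyed by the pair (thread ID, architectural register) instead of the register number alone, the shared issue queue can never confuse the two threads' registers. This is a minimal illustrative sketch, not any real design's rename logic — all names and sizes are assumptions.

```python
# Sketch of thread-tagged register renaming: architectural registers are
# looked up by (thread_id, arch_reg), so thread 0's r1 and thread 1's r1
# map to different physical registers. Purely illustrative.

class RenameTable:
    def __init__(self, num_phys):
        self.free = list(range(num_phys))  # free physical registers
        self.map = {}                      # (tid, arch_reg) -> phys_reg

    def read(self, tid, arch_reg):
        """Look up the current physical register for a source operand."""
        return self.map[(tid, arch_reg)]

    def write(self, tid, arch_reg):
        """Allocate a fresh physical register for a destination operand."""
        phys = self.free.pop(0)
        self.map[(tid, arch_reg)] = phys
        return phys

rt = RenameTable(num_phys=8)
p0 = rt.write(tid=0, arch_reg=1)  # thread 0 writes r1
p1 = rt.write(tid=1, arch_reg=1)  # thread 1 writes r1
assert p0 != p1                   # same architectural name, different physical regs
assert rt.read(0, 1) == p0 and rt.read(1, 1) == p1
```

Because both threads' instructions end up in one physical register space with no aliasing, the downstream wakeup-and-select logic doesn't need to care which thread an instruction came from.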
Then you use the same logic we already had in our out-of-order superscalar's issue queue to find instructions from multiple threads and send them down the respective pipes — and conveniently, we have most of this logic already. What's also nice is that if you're only running one thread at a time, when you look in your instruction queue, or issue window, only that one thread shows up, so you don't need to look at other threads and that one thread can use the full machine. This lets you use the same machine two ways: if you have lots of thread parallelism, the different threads fill all the slots; and if you don't, and you want to exploit instruction-level parallelism, one thread uses the slots and you just accept the dead slots, the purple ones. You can see here we're issuing three instructions from the blue checkerboard thread, and here we're issuing four. It could even be the exact same program on the same cycle: the machine can issue different amounts of instruction-level parallelism from a thread depending on whether multiple threads are running. That's what these simultaneous multithreading machines attempt to do. There's one other point I want to make here. When you're doing simultaneous multithreading, you have to worry about priorities between threads, because you want some sort of equal-progress guarantee, and round-robin is probably not good enough. You need to figure out how to keep one thread from hogging the machine and have some fairness between the threads. So I wanted to show a few examples. Here on the top we have the IBM Power 4 processor pipeline.
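One well-known answer to the fairness question is an ICOUNT-style fetch policy (from Tullsen et al.'s follow-up SMT work): each cycle, fetch for the thread with the fewest instructions already in flight in the front end. A thread that is stalled and piling up instructions automatically gets throttled, without any fixed round-robin schedule. Here is a minimal sketch, with made-up in-flight counts:

```python
# Sketch of an ICOUNT-style fetch policy: prefer the thread with the fewest
# instructions already in the pipeline, which naturally limits how much one
# stalled or greedy thread can hog the machine. Counts are hypothetical.

def pick_fetch_thread(inflight):
    """inflight[tid] = instructions that thread currently has in flight.
    Returns the thread ID to fetch for this cycle."""
    return min(range(len(inflight)), key=lambda tid: inflight[tid])

# Thread 0 is stalled on a long-latency load and has piled up 12 instructions;
# thread 1 has only 3 in flight, so it wins the fetch slot this cycle.
inflight = [12, 3]
assert pick_fetch_thread(inflight) == 1
```

The design choice here is that fairness is enforced at fetch rather than at issue: by starving the over-represented thread of new instructions, the issue queue stays populated with a balanced mix from all threads.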
And that was a non-multithreaded machine. The Power 5 architecture actually looks very similar to the Power 4 architecture, except they added a second hardware thread. What they had to do is add more fetch bandwidth, plus this notion of group formation. Group formation was the picking between the two threads — figuring out what could actually execute simultaneously — and they effectively had extra pipe stages out in front to do that. And at the end, you have to commit to two different program counters at the same time. You can see there is definitely complexity in building this: all of a sudden, when one thread branch-mispredicts, you have to roll back the reorder buffer, or kill that sub-thread, but not the rest. We already talked about how to do that in a superscalar, where we can kill a subsection of instructions without killing the whole instruction sequence. Here's a different view of the Power 5, in a physical chip view. One of the interesting things to see is that they went with two threads, because they figured out that with four threads they would be using the resources too much and bottlenecking in different locations in the pipe. Basically, they didn't have enough resources over here to keep the machine filled with more than two threads executing anyway, so there were diminishing returns past that point. We're almost done, but I think we're out of time. Actually, let me skip forward here and summarize all the different types of multithreading. You have your superscalar processor. You have very fine-grain multithreading, where you switch threads on each cycle, and coarser-grain, where you switch every few cycles.
You can think about cutting the processor in half, using some of the functional units for one thread and some for another. And then we have our full simultaneous multithreading. We'll pick up on this next time and finish up with some of the implementations — the Pentium 4 and how they did multithreading. But we'll stop here for today.