So, the idea here is that if we can guarantee there's no dependence between instructions, we can think about executing those instructions in parallel. And one really good way to guarantee independence is to take the instructions from completely different programs, or completely different threads. The naive approach here is that we interleave instructions from different threads to cover latency. So we send a load out here, and while this load is off resolving in the memory system — this is our thread 1 that we had here before; we load into R1 and we want to go use the result, also in thread 1 — we put other work into our processor pipeline, and by the time the data comes back from memory, it's ready to go. This is called multithreading.

Now, one of the big caveats here is that you need to have enough other work to do. These other threads add some complexity to your processor design, and they may cause some locality challenges. If you're trying to exploit temporal locality in, let's say, a cache or a translation lookaside buffer or some other buffer, and you start putting other data accesses or other operations in the middle here, that might destroy your locality. But let's think about this in a happy way for right now: if you can find other work to do, you can just shove it into the slots while you're waiting for this load to come back.

The most basic version of multithreading is going to be some sort of very basic interleaving. We run thread 1, then thread 2, then thread 3, then thread 4; thread 1, thread 2, thread 3, thread 4. We don't pay attention to the latencies or dependencies of the instructions when we go to schedule. And if you come around to a thread's slot and there is nothing to do — let's say this load missed in the cache and took a thousand cycles to come back — you just don't schedule anything. You have a dead cycle there.

One of the nice things about multithreading that you can take advantage of is that you don't have to worry about bypassing anymore. If you know that you're not going to execute another instruction from the same thread until a couple of cycles later, why do you need a fast bypass from the ALU back to itself? You just don't care. Let's see, do we actually have an example of that? Here's an example: say thread 1 has a write to R1, and then there's a read of R1. What value does thread 2 get here?

The first multithreading machines tried to partition the register file, so you compiled your programs differently. That's not really common anymore, though we'll talk about an example where they actually did that. For a while, that was actually a limiter on the number of threads, until people came up with the idea of what you just suggested, which is to somehow change the namespace of the registers. The first implementation of that, which we'll talk about in a little bit, was on SPARC, and it used the SPARC register windows, which give you different namespaces for different registers. That then evolved into actually having a thread identifier and having copies of the entire register space, so each thread has a different register set. And that's what we're going to show in this picture here.
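To make that register-namespace question concrete, here's a minimal C sketch — my own illustration, written as a software model rather than real hardware, with made-up array names — of the difference between one shared register file and per-thread copies indexed by a thread ID:

```c
#include <stdint.h>
#include <stdio.h>

#define NTHREADS 4
#define NREGS    32

/* One shared register file: every thread's "R1" is the same storage,
 * so thread 2 would observe thread 1's write. */
static uint32_t shared_gpr[NREGS];

/* Per-thread copies selected by a thread ID: each thread gets its own
 * architectural R1, so the threads cannot clobber each other. */
static uint32_t threaded_gpr[NTHREADS][NREGS];

int main(void) {
    /* Thread 1 (thread ID 0 here) writes its R1. */
    shared_gpr[1]      = 42;
    threaded_gpr[0][1] = 42;

    /* Thread 2 (thread ID 1) then reads "R1". */
    printf("shared file: thread 2 reads R1 = %u\n", (unsigned)shared_gpr[1]);
    printf("per-thread:  thread 2 reads R1 = %u\n", (unsigned)threaded_gpr[1][1]);
    return 0;
}
```

With the shared file, thread 2 sees thread 1's 42; with per-thread copies, it sees its own untouched R1 — the thread ID is just extra addressing in front of the register file, which is exactly what the picture coming up shows.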
As shown here, you have to copy the general-purpose register file four times, and you have to copy the program counter four times. Then you have an incrementer out here on the front of the pipe, which chooses the thread ID, or the thread select. In our simple case here, we're just going to keep incrementing this and choose 1, 2, 3, 4; 1, 2, 3, 4; 1, 2, 3, 4, and it just continually does that. Likewise, that's an index into which general-purpose register file we're supposed to be accessing — or which logical general-purpose register file, because most of these architectures will actually put this together into one larger physical register file and then have this just be some extra addressing bits that are not architecturally visible.

But you bring up a good point: there are a couple of userland threading libraries out there. They use this notion of threading, but not on a multithreaded computer architecture — you don't have this and this — and there are still some advantages to that, because you can still swap between different threads and try to cover memory latency, since your load-to-use distance will be longer. Those threading libraries largely still do try to partition the register file space. You can actually go download one of these on Linux and run it; people still use these userland thread libraries to do exactly that. The other way to do it is to swap out the entire register space, but people typically try not to do that, because if you're trying to do fine-grained interleaving and you want to save your entire register space to memory, that's very expensive. Instead, these threading libraries work together with the compiler and tell the compiler: don't use all the registers for thread 1; use half the registers for thread 1 and half the registers, let's say, for thread 2.

So this is our simple multithreading pipeline. It's a relatively small set of changes so far: replicate the register file, replicate the PC, and this can help us recover utilization on our ALU, for instance. And what we wanted to point out is that to software, this looks like multiple processors. If you go use something like a modern-day Core i7, those have two hardware threads per core, and when you go look at it, it'll look like there are twice as many cores in the machine as you think there are. So, for instance, if you have a four-core Core i7 and you open up the little process-management dialog box in Windows, you'll see eight little bars in there, because there are two virtual processors per physical core in the machine. So it actually looks like there are multiple, slower CPUs.

Okay, so what are the costs? The easy costs are replicating the program counter and the general-purpose registers. Things that start to become harder to replicate, but that you're going to have to replicate if you want full isolation between these virtual CPUs, are the pieces of system state: things like page tables, page table base registers, all the state about where the interrupt handlers live, things like the exception PC. This actually gets a little bit hard to do, and some processors have a fair amount of system state. If you look at x86, there's a lot of system state. MIPS is relatively minimal — it's basically the exception PC and the interrupt mask register.
It can still be a bit subtle, though: on MIPS the TLB is software-managed, so you don't even have a virtual-memory page table base register to replicate. The thing to take away is that you have to replicate all of this state per thread. So there is some cost to it, but you still get to hide latencies to memory, and you still get to reuse your ALUs and increase their efficiency. Personally, I think these are the smaller things — this one is very small, and this one is bigger than the first one, but mostly just harder to do.

The big things show up when you start to have temporal locality conflicts and spatial locality conflicts. If you have four threads all executing and they're fighting for capacity in your cache, that can be a problem. Or let's say you have sixteen hardware threads and an 8-way set-associative cache, and they all want to access, let's say, index 0 in the cache — this is a common thing; your stack always starts there or something like that. Well, you only have 8 ways of associativity, and if you're time-multiplexing threads 1 through 16 and then coming back around to thread 1, you're guaranteed that by the time you come back to thread 1, you've bumped out the data it needed, because you only have an 8-way set-associative cache. So you get no temporal locality: by the time you come back, it's not in your cache anymore. You have a conflict going on in your cache.

Okay. So that's the negative side — the threads can conflict with each other. But if two threads are working together, you can actually get positive cache interference, or constructive interference, in the cache, and it definitely happens. That was actually one of the reasons people originally did this multithreading: if you have a program where different threads are working on roughly the same data, this is actually very good, because the threads will be prefetching data for each other. You have to make sure, though — it's a fine line to walk — between the threads fighting for shared resources and the threads collaborating on pulling in shared data and exploiting the locality structures in the processor. The cache is one example; another example is translation lookaside buffer entries. One solution is that you can actually partition the cache, or partition the TLB, and then you can guarantee that the threads do not fight. Unfortunately, if you do this with a fixed-size cache, you effectively cut the cache by N for N threads — in half for two threads, say. So typically people don't try to statically partition these things, but you could think about doing that if you want no interference between the different threads.

Okay. So let's look at a couple of different ways that you can try to schedule threads; they've gone from simple, to a little more complicated, to even more complicated over time. Simple is what we've been talking about up to this point: there's some fixed interleaving of N hardware threads, and you basically execute one instruction from each thread and then change to the next thread — our first case. This fixed interleaving dates back to the 1960s, and it's pretty simple. What's nice about it is that you can remove some of the interlocking and some of the bypassing from your processor.
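Here's a minimal software model of that fixed-interleave front end — per-thread PCs and registers, a thread-select counter that just increments, and a dead cycle when the selected thread is stalled. This is my own sketch of the idea in C, not any real machine's hardware description; all the names are made up:

```c
#include <stdbool.h>
#include <stdint.h>

#define NTHREADS 4
#define NREGS    32

/* Replicated architectural state: one PC and one register file per thread. */
static uint32_t pc[NTHREADS];
static uint32_t gpr[NTHREADS][NREGS];

/* Set while a thread's load is still off in the memory system. */
static bool waiting_on_miss[NTHREADS];

/* The thread-select counter on the front of the pipe. */
static unsigned tsel = 0;

/* Stubs standing in for the rest of the pipeline model. */
static uint32_t fetch(uint32_t addr) { (void)addr; return 0; }
static void issue(unsigned tid, uint32_t inst) {
    (void)inst;
    gpr[tid][0] = 0;   /* R0 hard-wired to zero, MIPS-style */
}

/* One cycle of the fixed-interleave front end: pick the next thread in
 * strict round-robin order no matter what.  If that thread has nothing
 * to do (say it is still waiting on a cache miss), issue nothing -- a
 * dead cycle -- rather than hunting for other work. */
static void fetch_cycle(void) {
    unsigned tid = tsel;
    tsel = (tsel + 1) % NTHREADS;      /* 1, 2, 3, 4, 1, 2, 3, 4, ... */

    if (waiting_on_miss[tid])
        return;                        /* bubble */

    uint32_t inst = fetch(pc[tid]);
    pc[tid] += 4;
    issue(tid, inst);                  /* operands come from gpr[tid][...] */
}

int main(void) {
    for (int cycle = 0; cycle < 8; cycle++)
        fetch_cycle();
    return 0;
}
```

Because the thread select does nothing but increment, by the time a given thread issues again its previous instruction has long since finished — which is exactly why the bypass network and much of the interlocking can be removed.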
The next thing you can think about is allocating different positions in your scheduling quantum to different threads. For instance, we know that the cyan thread here is the highest-priority thread and needs the largest number of slots, so we can do an uneven distribution. Over our slots here, we can interleave and say the cyan thread gets every other slot, the orange one gets a smaller percentage, and the blue one gets a smaller percentage. It's still a fixed interleaving to some extent — a fixed ordering — but it's controllable by software, or by the operating system, depending on the priorities of the different threads. So it's a little more flexible than the completely fixed-interleave design, and it does require a little bit of change here, because you can't just have this counter incrementing your thread ID; instead you need something that acts as a picker for which thread you go execute, but it's still relatively simple (there's a small sketch of such a slot table below). And because it's software-programmable, you can choose a time quantum and then reallocate, so the OS can change the allocation at a different time and, let's say, make orange a higher priority instead of cyan.

Then we start to get to something a little more complicated — those are still relatively fixed priorities. You could think about a design where the hardware actually makes decisions about which thread to execute. You could even have it go as far as detecting that the thread it's executing has hit a long-latency operation and switching to another thread at that point. So it purposely backfills work from other threads when one thread gets swapped out or goes off to do something long-latency. You can think about designs like that, and the HEP processor, which we'll be talking about in a minute or two, was one of the first early examples of that.

So just to recap here: we're going to call this coarse-grain multithreading. The reason we call it coarse-grain is that in any one cycle, only one thread is running. You might scratch your head and ask, how could you have multiple threads running in one cycle? Well, as we'll show, if you go to a superscalar, or multi-issue, processor, you can have the different pipelines executing instructions from different threads simultaneously — that's called simultaneous multithreading, and we'll talk about that in a second. But in our coarse-grain approach here, just to recap, you can swap threads on cache misses, and you can take advantage of bubbles in the processor to do other work.

Okay, so a little bit of a history lesson here. The HEP processor: Burton Smith, who's now at Microsoft Research, was the chief architect of this back in the 80s. This processor is actually pretty interesting, because there were lots and lots of threads and a small number of processors — 120 threads per processor in this machine. It had a relatively modest clock rate for the 80s, but what they were trying to do here was deal with memory latency.
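Before getting into the HEP's memory system, here's the slot-table sketch mentioned above — a minimal C model of software-programmable interleaving, where the OS fills in a schedule table and the hardware just walks it. This is my own illustration; the thread names and the table size are invented:

```c
#include <stdio.h>

#define NSLOTS 8

enum thread_id { CYAN = 0, ORANGE = 1, BLUE = 2 };

/* Schedule table written by the OS: cyan gets every other slot, and
 * orange and blue split the remaining slots.  Rewriting this table at
 * the next time quantum changes the priorities -- say, promoting
 * orange over cyan -- with no other hardware change. */
static enum thread_id schedule[NSLOTS] = {
    CYAN, ORANGE, CYAN, BLUE, CYAN, ORANGE, CYAN, BLUE
};

static unsigned slot = 0;

/* The "picker" that replaces the simple incrementing counter: each
 * cycle it hands back whichever thread owns the current slot. */
static enum thread_id pick_next_thread(void) {
    enum thread_id tid = schedule[slot];
    slot = (slot + 1) % NSLOTS;
    return tid;
}

int main(void) {
    for (int cycle = 0; cycle < 16; cycle++)
        printf("cycle %2d -> thread %d\n", cycle, pick_next_thread());
    return 0;
}
```

The only change from the pure round-robin counter is that the thread select is read out of a small software-visible table instead of being computed by an incrementer.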
So the HEP had a very high-bandwidth memory system, and what would happen is, effectively, you were allowed to have a load or a store in every instruction, and your performance would never degrade even if you did. Because this machine had 120 threads per processor and the memory latency was less than 120 cycles, if you had a load in every instruction and you had enough bandwidth to feed all those loads, you could execute a load from each thread, and each thread was independent of the others. You could basically go out to the memory system, pipeline the latency out to memory, pipeline the memory access, and by the time you came back around to execute the next instruction from that thread — which would be 120 cycles later — the memory result would be there.

This insight was carried forward: Burton went and started a company called Tera, which later went on to buy Cray, strangely enough, because I always thought Cray was the bigger company than Tera, and it definitely was. They called it a merger, but in reality Tera acquired Cray. Tera had a similar sort of idea here, a further evolution of the HEP processor: 128 active threads per processor, and lots of processors — 256 of them — so lots of active programs. You had to find enough thread-level parallelism in your program, and this architecture has no caches, because it doesn't need them: if you have enough bandwidth out to your memory system, you can issue a load every cycle and cover the latency with other threads.

Why do you care? Well, some interesting things about this: it may not be good for power. You are not exploiting locality in your data references at all, and you're going out to the memory system every cycle, and that can be far away — and it is far away in this machine. So you've got to be a little careful with these machines. The second issue is that you have to come up with lots of threads. We're now talking about similar numbers of threads in modern-day manycore machines, but it's still a fair number of threads to be able to use the machine at its best performance.

What I wanted to say about this architecture is that where it mostly got used was in applications that had no temporal or spatial locality in their data references anyway. Good examples were things like data mining: huge data sets, arbitrary access to the data sets, where you couldn't have a prefetcher predict where your next memory reference is going to be. If you're basically not going to be able to fit in a cache or use a cache effectively anyway, remove the cache and go multithreaded — that was the idea behind these machines. And they actually saw some big speedups for applications with data sets like that. This still lives on today in the Cray XMT, the extreme multithreading architecture, and you can go buy this machine from what's nowadays Cray Inc. — Tera having bought Cray. It has a modest clock speed by modern-day standards, only 500 MHz.
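To put a rough number on that latency-hiding argument — this is just my back-of-the-envelope sketch, assuming strict round-robin issue of one instruction per thread per round:

```latex
% With N hardware threads issuing round-robin, a given thread issues only
% once every N cycles, so a memory operation of latency t_mem is fully
% hidden from that thread whenever
t_{\mathrm{mem}} \;\le\; N \quad \text{(cycles)}
% HEP:      t_mem < 120 cycles, N = 120  => the result is always back in time
% Tera/XMT: t_mem ~ 150 cycles, N = 128  => mostly hidden, not guaranteed
```

That's also why, with the roughly 150-cycle latency and 128 threads quoted in a moment, those threads are typically, rather than always, not waiting on memory.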
But they can intermix these XMT chips with Opterons and other processors, because they standardized on the AMD bus protocol, so you can plug AMD chips and these XMT chips into the same system. Just to recap what their memory system looked like: the instruction pipelines were very long, and they didn't have to bypass, because you could never execute an instruction and then a subsequent instruction that would use its result back to back. The memory latencies were about 150 cycles and they had 128 threads, so the threads were typically not waiting on the memory system and they could effectively pipeline their memory operations.

Another little tidbit of history here — this is the machine I was talking about that is part of my academic lineage. When I first showed up at MIT, there was this machine in the corner called the MIT Alewife machine, which was a multiprocessor. It went up to 128 nodes, which was a lot in 1990. One of the little tidbits about this machine is that it had SPARC processors, and a SPARC processor has this notion of a register window. What a register window is: every time you do a function call, you basically change your addressing into the register file. You have a larger physical register file and a smaller window onto it, and you slide this window across the register file on function calls and function returns. The MIT Alewife processor used this and extended the register window idea, such that they had a special instruction which would change how much of the register file they were looking at at a time. That let them swap the entire register set very quickly, which allowed them to multithread very effectively with four threads. This was one of the early multithreaded processors, and they introduced this notion of thread switch on remote memory access: if you had a memory access that had to go to one of the other nodes in this machine — if you were on this node here and you needed to go down to this one, you'd have to send a message over there, in the multiprocessor sense, and get a response back, and that took a long time — you would actually switch threads at that point.

Multithreading lives on today, especially this coarse-grain multithreading. A good example is the Oracle — what used to be Sun, and before that Afara — Niagara processors. Afara was the name of the startup company that made this; then Sun acquired them, and then Oracle acquired Sun. The Sun Niagara processor was designed for throughput computing, such that you can have lots of threads running, all independent, on a multithreaded processor. The Niagara 1 had 8 cores with 4 threads per core, and they've turned that up over time. The Niagara 1 was what was called the Sun T1, the Niagara 2 the T2, and now the Niagara 3 is the T3: 16 cores with 8 threads per core. So there's a lot of parallelism going on here, and this coarse-grain multithreading continues. Here's a die photo of the Niagara 3 processor. You might look at this picture and say there appear to be 16 cores — and I say 16 cores on the slide here — but one of the interesting things is that they actually conjoin two cores together internally.
It's a strange sort of design decision: they mix the threads and the cores together, and the two cores share — I think it was the floating-point unit — between them. So these cores are, to some extent, conjoined cores.