So the idea here is that if we can guarantee there's no dependence between two instructions, we can think about executing those instructions in parallel. And one really good way to guarantee independence is to have completely different programs, or completely different threads. The naive approach is to interleave instructions from different threads to cover latency. So we send a load out, and while that load is off resolving in the memory system (this is our thread 1 from before, which loads into R1 and then wants to use the result, also in thread 1), we put other work into our processor pipeline. By the time the data comes back from memory, it's all ready to go. This is called multithreading.

Now, one of the big insights here is that you need to have enough other work to do. These other threads add some complexity to your processor design, and they may cause some locality challenges. If you're trying to exploit temporal locality in, say, a cache or a translation lookaside buffer or some other buffer, and you start putting other data accesses or other operations in the middle, that might destroy your locality. But let's think about this in a happy way for right now: if you can find other work to do, you can just shove it into the slots while you're waiting for this load to come back.

The most basic version of multithreading is some sort of very basic interleaving. We run thread 1, then thread 2, then thread 3, then thread 4; then thread 1, thread 2, thread 3, thread 4 again. We don't pay attention to the latencies or dependencies of the instructions when we schedule. And if you come around to a thread's slot and there is nothing to do, say because its load missed in the cache and took a thousand cycles to come back, you just don't schedule anything. You have a dead cycle there.

One of the nice things you can take advantage of with this kind of multithreading is that you don't have to worry about bypassing anymore. If you know you're not going to execute another instruction from the same thread until several cycles later, why do you need a fast bypass from the ALU back to itself? You just don't care.
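To make the fixed-interleave idea concrete, here is a minimal cycle-level sketch of my own (not from the lecture): a single-issue pipeline, four hardware threads picked round-robin, and the pessimistic assumption that every instruction is a load. All names and numbers are illustrative.

```c
#include <stdio.h>

#define N_THREADS    4   /* hardware threads, issued round-robin */
#define LOAD_LATENCY 3   /* cycles until a load's result is back */

/* Cycle at which each thread's outstanding load resolves. */
static int ready_at[N_THREADS];

int main(void) {
    int issued = 0, bubbles = 0;
    for (int cycle = 0; cycle < 32; cycle++) {
        int t = cycle % N_THREADS;      /* fixed interleave: 1,2,3,4,1,2,... */
        if (cycle < ready_at[t]) {
            bubbles++;                  /* thread t still waiting: dead cycle */
        } else {
            issued++;                   /* issue one instruction from thread t */
            ready_at[t] = cycle + LOAD_LATENCY;  /* pretend it was a load */
        }
    }
    printf("issued=%d bubbles=%d\n", issued, bubbles);
    return 0;
}
```

With N_THREADS at or above LOAD_LATENCY, the loop reports zero bubbles: a thread's load has always resolved by the time its slot comes around again, which is also why the same-thread bypass path can be dropped. Raise LOAD_LATENCY to model a cache miss and the dead cycles appear.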
So let's see, do we actually have an example of that? Here's one: say you have a write to R1 and then a read of R1. What value does thread 2 get here?

The first versions of multithreaded machines tried to partition the register file, so you compiled your programs differently for each thread. That's not really common anymore, though we'll talk about an example where they actually did it. For a while that was actually a limiter on the number of threads, until people came up with the idea you just suggested, which is to somehow change the namespace of the registers. The first implementation of that, which we'll talk about in a little bit, was on SPARC, actually. It used SPARC's register windows, which gave you different namespaces for different sets of registers. That finally evolved into having a thread identifier and copies of the entire register space, so each thread has its own register set. And that's what we're showing in this picture. As shown here, you have to copy the general-purpose register file four times, and you have to copy the program counter four times. Then you have an incrementer out on the front of the pipe which chooses the thread ID, the thread select. In our simple case we just keep incrementing it and choose 1, 2, 3, 4; 1, 2, 3, 4; continually. Likewise, that's an index into which general-purpose register file we're supposed to be accessing. Or rather which logical general-purpose register file, because most of these architectures actually put them together into one larger physical register file, with the thread ID acting as addressing bits that are not architecturally visible.

But you bring up a good point: there are a couple of userland threading libraries out there. They use this notion of threading, but not on a multithreaded computer architecture, so without the replicated PCs and register files. There are still some advantages to that, because you can still swap between different threads and try to cover memory latency, since your load-to-use distance will be longer.
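Stepping back to the hardware picture for a moment, here is a rough sketch of the replicated state just described: per-thread PCs, four logical register files folded into one physical array, and a thread-select counter. This is my own modeling, not any real machine's layout.

```c
#include <stdint.h>

#define N_THREADS 4
#define N_REGS    32

/* All four logical register files live in one physical array; the
 * thread ID is just extra addressing bits the program never sees. */
static uint32_t regfile[N_THREADS * N_REGS];
static uint32_t pc[N_THREADS];     /* one program counter per thread */
static unsigned thread_sel;        /* thread select, bumped each cycle */

/* Read or write architectural register r of the selected thread. */
static uint32_t read_reg(unsigned r) {
    return regfile[thread_sel * N_REGS + r];
}
static void write_reg(unsigned r, uint32_t v) {
    regfile[thread_sel * N_REGS + r] = v;
}

/* The fixed interleave: the incrementer on the front of the pipe. */
static void next_cycle(void) {
    thread_sel = (thread_sel + 1) % N_THREADS;
}
```

In this scheme, thread 1's write to R1 and thread 2's read of R1 land in different slices of `regfile`, so thread 2 simply never sees thread 1's value.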
Coming back to those threading libraries: they largely do still try to partition the register file space. You can actually go download one of these (get Linux and go run it) and people still use these userland thread libraries to do exactly that. The other way to do it is to swap out the entire register space, but people typically try not to, because if you're doing fine-grain interleaving, saving your entire register space to memory is very expensive. Instead, these threading libraries work together with the compiler and tell it: don't use all the registers for thread 1. Use half the registers for thread 1 and half the registers, let's say, for thread 2.

So this is our simple multithreading pipeline. The changes are relatively small so far: replicate the register file, replicate the PC. And this can help us recover utilization on our ALU, for instance.

What we wanted to point out is what this looks like to software: it looks like multiple processors. If you use something like a modern-day Core i7, those have two hardware threads per core, and when you go look at the machine, it appears to have twice as many cores as you'd think. For instance, if you have a four-core Core i7 and you open Windows' process-management dialog box, you'll see eight little bars in there, because there are two virtual processors per physical core in the machine. So hardware threads look like multiple, slower CPUs.

Okay, so what are the costs? The easy costs are replicating the program counter and the general-purpose registers. Things that start to become harder to replicate, but that you're going to have to replicate if you want full isolation between these virtual CPUs, are the pieces of system state: things like page table base registers, and all the state around interrupt handling, such as the exceptional PC. This actually gets a little bit hard to do, and some processors have a fair amount of system state. If you look at x86, there's a lot of system state.
MIPS is relatively minimal: it's basically the exceptional PC and the interrupt mask register. And because the TLB on MIPS is software-managed, you don't even have a virtual memory page table base register to replicate. So this is something to think about: you have to replicate all of this state per thread. There is some cost to it, but you still get to hide latencies to memory, and you still get to reuse your ALUs and increase their efficiency.

Personally, I think those are the smaller issues. Replicating the PC and registers is very small; replicating the system state is bigger than the first one, but mostly it's just harder to do. The big issues show up when you start to have temporal locality conflicts and spatial locality conflicts. If you have four threads all executing, and they're fighting for capacity in your cache, that can be a problem. Or say you have sixteen hardware threads and an 8-way set-associative cache, and they always want to access, let's say, index 0 in the cache. This is a common thing; your stack always starts there, or something like that. Well, you only have 8 ways of associativity, and if you're time-multiplexing threads 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16 and then coming back around to 1, you're guaranteed that by the time you get back to thread 1, with only an 8-way set-associative cache, you've bumped out the data it needed. You get no temporal locality: by the time you come back, your data is not in the cache, so you have a conflict going on in your cache.

Okay, so that's the negative side: the threads can conflict with each other. But if two threads are working together, you can actually get positive, constructive interference in the cache, and it definitely happens. That was actually one of the original reasons people did multithreading: if you have a program with different threads working on roughly the same data, this is very, very good, because the threads will be precaching data for each other.
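Here is a small simulation of the worst case described a moment ago: sixteen threads, each touching its own line in the same 8-way set, in strict round-robin. The code is illustrative; FIFO replacement stands in for LRU, and the two behave identically under this rotating access pattern because every access misses.

```c
#include <stdio.h>

#define WAYS      8     /* 8-way set-associative         */
#define N_THREADS 16    /* all mapping to the same index */

int main(void) {
    int set[WAYS];                      /* tags currently in the set */
    for (int w = 0; w < WAYS; w++) set[w] = -1;
    int next = 0, misses = 0, accesses = 0;

    /* Each thread touches its own line (tag = thread ID) in the
     * same set, round-robin, for a few full rotations. */
    for (int round = 0; round < 4; round++) {
        for (int t = 0; t < N_THREADS; t++) {
            accesses++;
            int hit = 0;
            for (int w = 0; w < WAYS; w++)
                if (set[w] == t) hit = 1;
            if (!hit) {
                misses++;
                set[next] = t;          /* evict the oldest line */
                next = (next + 1) % WAYS;
            }
        }
    }
    printf("%d misses out of %d accesses\n", misses, accesses);
    return 0;
}
```

This prints 64 misses out of 64 accesses: fifteen other threads always fill the set before a thread's slot comes back around, so its line is guaranteed gone. Drop N_THREADS to 8 or below and every access after the first rotation hits.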
Well, you have to make sure; it's a fine line to walk between the threads fighting for shared resources and the threads collaborating, pulling shared data in for each other and exploiting the locality structures in the processor. The cache is one example; translation lookaside buffer entries are another. One solution is that you can actually partition the cache, or partition the TLB, and then you can guarantee the threads do not fight. Unfortunately, if you do this with a fixed-size cache, you effectively cut the cache by N for N threads, in half for two threads, say. So people typically don't try to statically partition these things, but you could think about doing it if you want no interference between the different threads.

Okay, so let's look at a couple of different ways you can try to schedule threads. They've gone from simple, to a little more complicated, to even more complicated over time. Simple is what we've been talking about up to this point: there's some fixed interleaving of N hardware threads, and you basically execute one instruction from each thread and then change to the next thread. That's our first case. This fixed interleaving dates back to the 60s, and it's pretty simple. What's nice about it is that you can remove some of the interlocking and some of the bypassing from your processor.

The next thing you can think about is allocating different slots in your scheduling quantum to different threads. For instance, we know that the cyan thread here is the highest-priority thread; it needs the largest number of slots. So we can do an uneven distribution. Across our slots here, we can interleave and say the cyan thread gets every other slot, the orange one gets a smaller percentage, and the blue one gets a smaller percentage. It's still a fixed interleaving to some extent, a fixed ordering, but it's controllable by software, by the operating system, depending on the priorities of the different threads.
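A sketch of what that software-programmable interleave might look like: the OS fills in a slot table, and the hardware picker just cycles through it. The table contents are a hypothetical allocation of my own, with cyan as thread 0, orange as thread 1, and blue as thread 2.

```c
#define SLOTS 8

/* Cyan (thread 0) gets every other slot; orange (1) and blue (2)
 * split the remainder.  The OS rewrites this table to change
 * priorities; the hardware never makes a decision of its own. */
static unsigned slot_table[SLOTS] = { 0, 1, 0, 2, 0, 1, 0, 2 };

static unsigned pick_thread(unsigned cycle) {
    return slot_table[cycle % SLOTS];
}
```

Reprogramming the table at a timer tick is all it takes to, say, promote orange over cyan for the next quantum.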
So this is a little bit more flexible in hardware than the completely fixed interleave design, and it does require a small change. You can't just have a counter incrementing your thread ID; instead, you need something that acts as a picker for which thread you go execute. But it's still relatively simple. And because it's software-programmable, you can actually choose a time quantum and then reallocate, so the OS can change the allocation for a different period and, let's say, make orange the higher priority instead of cyan.

Then we start to go to something a little more complicated. These are still relatively fixed priorities. You could think about a design where the hardware actually makes decisions about which thread to go execute. You could even have it go as far as determining that the thread you're executing has hit a long-latency operation, and switching to another thread at that point. So the hardware purposely backfills work from other threads whenever one thread goes off to do something with a long latency. You can think about designs like that, and the HEP processor, which we'll talk about in a minute or two, was one of the first early examples.

So just to recap here: we're going to call this coarse-grain multithreading. The reason we call it coarse-grain is that in any one cycle, only one thread is running. And you might scratch your head and ask, how do you have multiple threads running in one cycle? Well, if you go to a superscalar, multi-issue processor, you can think about having the different pipelines execute instructions from different threads simultaneously. That's called simultaneous multithreading, and we'll talk about it in a second. But in our coarse-grain approach here, just to recap: you can swap threads on hardware misses, and you can take advantage of bubbles in the processor to do other work.

Okay, so a little bit of a history lesson here. The HEP processor was Burton Smith's machine; he's now at Microsoft Research, and he was the chief architect of this back in the 80s.
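Before we get to the HEP details, here is a minimal sketch of the switch-on-long-latency policy just described. This is a hypothetical picker of my own, not HEP's actual mechanism.

```c
#define N_THREADS 4

static unsigned current;             /* thread that owns the pipeline */
static int      stalled[N_THREADS];  /* set on a miss or other long-latency event */

/* Coarse-grain policy: keep issuing from one thread until it hits a
 * long-latency event, then rotate to the next runnable thread. */
static unsigned pick_thread(void) {
    if (!stalled[current])
        return current;              /* stay on the same thread */
    for (unsigned i = 1; i <= N_THREADS; i++) {
        unsigned t = (current + i) % N_THREADS;
        if (!stalled[t]) {
            current = t;             /* switch on the stall */
            return t;
        }
    }
    return current;                  /* everyone stalled: dead cycle */
}
```

The miss-handling logic would set `stalled[t]` when thread t's load misses and clear it when the data returns; the picker itself stays this simple.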
The HEP is actually a pretty interesting processor, because it had lots and lots of threads and a small number of processors: there were 120 threads per processor in this machine. It had a relatively modest clock rate for the 80s, but what they were trying to do here was deal with memory latency. This machine had a very high-bandwidth memory system, and what would happen is, effectively, you were allowed to have a load or a store in every instruction, and your performance would never degrade even if you did have a load or a store in every instruction. That's because this machine had 120 threads per processor, and the memory latency was less than 120 cycles. So if you had a load in every instruction, and you had enough bandwidth to feed all those loads, you could execute a load from each thread (each thread being independent of the others), go out to the memory system, and pipeline the latency out to memory along with the memory access itself. By the time you came back around to execute the next instruction from that thread, which would be 120 cycles later, the memory result would be there.

This insight was carried forward. Burton went and started a company called Tera, which later went on to buy Cray, strangely enough, because I always thought of Cray as a bigger company than Tera, and it definitely was. We'll call it a merger, and I think that's what they called it, but in reality Tera acquired Cray. Tera's machine had a similar sort of idea; it was a further evolution of the HEP processor, with more processors: 128 active threads per processor, and lots of processors, 256 of them, so lots of active programs. You had to find enough thread-level parallelism in your program, and this architecture has no caches, because it doesn't need them. If you have enough bandwidth out to your memory system, you can have a load every cycle and cover the latency with other threads.

Why do you care? Well, some interesting things about this: it may not be good for power. You are not exploiting locality in your data references at all, and you're going out to the memory system every cycle. And that memory could be far away; in fact, it is far away in this machine.
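The arithmetic behind that claim is worth a quick check: under strict round-robin, a thread issues once every `threads` cycles, so loads are fully covered whenever the memory latency is below the thread count. Here is a small sanity check using the lecture's numbers (the exact latency value is my assumption; the lecture only says it was less than 120 cycles):

```c
#include <stdio.h>

int main(void) {
    int threads     = 120;   /* hardware threads per HEP processor */
    int mem_latency = 119;   /* assumed: "less than 120 cycles"    */

    /* Thread t issues at cycles t, t + threads, t + 2*threads, ...
     * A load issued at cycle c is ready at c + mem_latency, which
     * precedes t's next issue slot at c + threads.                */
    if (mem_latency < threads)
        printf("every load resolves before its thread issues again\n");
    else
        printf("short by %d threads (or need more memory bandwidth)\n",
               mem_latency - threads + 1);
    return 0;
}
```

The same inequality explains Tera's 128 threads; the Cray XMT numbers mentioned below (about 150 cycles of latency against 128 threads) come close, which is why those threads were only typically, not always, free of memory waits.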
So you've got to be a little careful about these machines. And the second idea here is that you have to come up with lots of threads. Now, we're talking about numbers of threads similar to modern-day many-core machines, but it's still a fair number of threads to be able to use the machine to its best performance.

What I wanted to say about this architecture is that where it mostly got used was in applications that had no temporal or spatial locality in their data references anyway. Good examples of this were things like data mining: huge data sets, arbitrary access to those data sets. You couldn't have a prefetcher predict where your next memory reference was going to be. So if you're basically not going to be able to fit in a cache, or use a cache effectively anyway, remove the cache and go multithreaded. That was the idea behind these machines, and they actually saw some big speedups for applications with data sets like that.

This still lives on today in the Cray XMT, the extreme multithreading architecture. You can go buy this machine from what's nowadays called Cray Inc., which is Tera after it bought Cray. It has a modest clock speed by modern-day standards, only 500 MHz, but they can intermix these with Opterons and other processors, because they standardized on the AMD bus protocol, so you can plug AMD chips and these XMT chips into the same system.

Just to recap what their memory system looked like: the instruction pipelines were very long, and they didn't have to bypass, because you could never execute an instruction immediately followed by a subsequent instruction that used its result. The memory latencies were about 150 cycles, and they had 128 threads, so the threads were typically not waiting for the memory system, and they could effectively pipeline their memory operations.

Another little tidbit of history here. This is the machine I was talking about that's part of my academic lineage. When I first showed up at MIT, there was this machine in the corner called the MIT Alewife machine, which was a multiprocessor machine. It went up to 128 nodes, which was a lot in 1990.
And one of the little tidbits about this machine is that it had SPARC processors, and a SPARC processor has this notion of a register window. What a register window is: every time you make a function call, you basically change your addressing into the register file. So you have a larger register file and a smaller window onto it, and you slide this window across the register file on function calls and function returns. The MIT Alewife processor used this, and they extended the register window idea so that they could have a special instruction which actually changed how much of the register file they were looking at at a time. That meant they could swap the entire register file very quickly, and this actually allowed them to multithread very effectively with four threads. This was one of the early multithreaded processors, and it introduced this notion of thread switch on remote memory access. If you had a memory access that had to go to one of the other nodes in this machine (you're on this node here, you need to get to that one over there, so in multiprocessor fashion you have to send a message and get a response back, which takes a long time), you would actually switch threads at that point.

Multithreading lives on today, and especially this coarse-grain multithreading lives on today. A good example of it is the Oracle, formerly Sun, and before that Afara, Niagara processors. Afara was the name of the startup company that made this, and then Sun acquired them, and then Oracle acquired Sun. The Sun Niagara processor was designed for throughput computing, such that you could have lots of threads running, all independent, on a multithreaded processor. Niagara 1 had 8 cores and 4 threads per core, and they've turned that up over time. Niagara 1 is what was called the Sun T1; then came the Sun T2; and now there's the Sun T3, the Niagara 3, which has 16 cores with 8 threads per core. So there's a lot of parallelism going on here, and this coarse-grain multithreading lives on. So here's a die photo of the Niagara 3 processor.
We can see the different cores here. You might look at this picture and say there are what look to be 16 cores, and I say 16 cores on the slide, but one of the interesting things is that they actually conjoin two cores together internally. It's a strange sort of design decision: they mix the threads and the cores together, and they share, I think it was the floating-point unit, between the two cores. So these cores are, to some extent, conjoined.