Okay, so that was coarse-grain multithreading. Now we're going to move on to mixing instructions from different threads in the pipeline at the same time, issuing instructions from two different threads in the same cycle to different pipelines. The idea here is called simultaneous multithreading, and actually, let me start with a picture of what this looks like. Let's work backwards in the slide deck for a second.

So we have a four-issue processor; that's our issue width, and time increases going down. Each of the different patterns represents a different thread. The idea in simultaneous multithreading is that you can execute instructions from different threads in different pipelines simultaneously. Now, this gets quite a bit harder than the basic design, because now we actually have to read the register file for four different threads simultaneously, and we have to fetch code from different program counters simultaneously in this processor.

Where simultaneous multithreading came out of was that people were building complex out-of-order superscalars. These complex out-of-order superscalars had all this logic to track dependencies between different instructions and to restart sub-portions of instruction sequences. So, say you take a branch mispredict: you had to kill all the instructions that were dependent on the mispredicted branch and leave the ones that were not dependent alone. When you have this out-of-order mechanism, you already have all this extra logic there to figure that out.

Dean Tullsen, Susan Eggers, and Hank Levy came up with the idea: what if we try to utilize all the dead slots in our out-of-order superscalar, and intermix different threads in there simultaneously to fill the time? They did a study, back at ISCA in '95, that ran a bunch of different applications, and the rightmost bar here is the composite, the average. What you should take away is that the black bar on the bottom is how long the processor is busy actually doing work, and the rest of the bar is the different reasons the processor was stalled.
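To make that busy-versus-stalled breakdown concrete, here is a minimal sketch, in Python, of how issue-slot utilization could be tallied from a per-cycle issue trace on a four-wide machine. The trace numbers are invented for illustration (the study's figure, discussed next, was under 20% busy); only the bookkeeping of busy slots, fully empty cycles (vertical waste), and partially filled cycles (horizontal waste) is the point.

```python
ISSUE_WIDTH = 4

def slot_breakdown(issued_per_cycle):
    """Split a trace's issue slots into busy, vertical waste, horizontal waste."""
    total = ISSUE_WIDTH * len(issued_per_cycle)
    busy = sum(issued_per_cycle)
    # Vertical waste: cycles where nothing issued at all.
    vertical = ISSUE_WIDTH * sum(1 for n in issued_per_cycle if n == 0)
    # Horizontal waste: unused slots in cycles that did issue something.
    horizontal = total - busy - vertical
    return busy / total, vertical / total, horizontal / total

if __name__ == "__main__":
    # Hypothetical per-cycle issue counts for a 4-wide machine.
    trace = [4, 2, 0, 1, 0, 2, 0, 0, 3, 1]
    busy, vert, horiz = slot_breakdown(trace)
    print(f"busy {busy:.0%}  vertical waste {vert:.0%}  horizontal waste {horiz:.0%}")
```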
The processor was stalled on instruction cache misses, branch mispredictions, load delays, plain pipeline interlocks, memory conflicts, and other sorts of things, and it was busy less than 20% of the time.

So, to show this a different way: we have our multi-issue processor, with time going down, and all these purple boxes are just dead time. We might be able to use subsets of the slots. This cycle is very good, we actually issued four instructions; but here we only issued two, here we issued one, and here we issued two to the two pipelines in the middle. Maybe those are the two ALU pipes, and there's a load pipe here and a branch pipe there, or something like that. This is kind of a disaster from an IPC perspective. Can we try to reuse that hardware?

We talked about coarse-grain multithreading, which was effectively temporally slicing up the cycles: you run one thread, switch to a different thread, run that one, and you temporally switch between the threads.

But here is another approach, and this was actually done in the failed Sun Millennium processor, what was going to be the UltraSPARC V. They had a clustered VLIW, or, excuse me, a clustered superscalar, and they had a mode where you could flip a bit and basically cut the two clusters apart and run them as two separate processors, for two different threads. So let's say you had this width-four processor; instead, you split it in half, and you have two functional units running one thread and two functional units running another thread.

Well, this has some good effects. You don't have to reach really high IPC, really high instructions per clock; you never have to reach four in this design, because we've narrowed the two pipeline widths. And you could think about putting them back together. That is what the Millennium processor, the UltraSPARC V, tried to do: you can switch between this mode and this mode. But you still have a lot of open slots here that you can't go use, so it still leaves vertical waste. Another downside is that you can't easily have one thread use all of the resources in this mode.
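Here is a minimal sketch, with made-up per-cycle instruction counts, of why the statically split machine leaves slots unused compared to sharing the full issue width: the split machine caps each thread at half the width even when the other thread has nothing to issue that cycle. It ignores real structural and dependence constraints; it only illustrates the slot-capping effect.

```python
WIDTH = 4  # total issue width of the machine

def split_issue(ready_a, ready_b):
    """Statically partitioned machine: each thread owns a fixed 2-wide half."""
    return sum(min(a, WIDTH // 2) + min(b, WIDTH // 2)
               for a, b in zip(ready_a, ready_b))

def smt_issue(ready_a, ready_b):
    """Fully shared issue stage: only the total width of 4 is a limit."""
    return sum(min(a + b, WIDTH) for a, b in zip(ready_a, ready_b))

if __name__ == "__main__":
    # Invented per-cycle counts of ready instructions for two threads.
    thread_a = [4, 3, 0, 4, 1]   # bursty thread that could sometimes use the whole machine
    thread_b = [0, 1, 2, 0, 3]
    slots = WIDTH * len(thread_a)
    print("split:", split_issue(thread_a, thread_b), "of", slots, "slots used")
    print("SMT  :", smt_issue(thread_a, thread_b), "of", slots, "slots used")
```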
In that very static design you can't have, say, thread one use a functional unit over on the other side. So this brings us to full simultaneous multithreading, or SMT. In SMT we are able to mix and match all these different instructions, but this is going to change our processor pipeline quite a bit.

Okay, so let's look at what this does to a processor pipeline. Well, all of a sudden we need to be fetching multiple instructions at a time, from different threads, and that's definitely a harder thing to do. Conveniently, though, we can use our instruction queue, or our reservation stations if you will, to find different instructions to execute at the issue stage of our out-of-order processor. What we do is tag the threads independently and make that tag part of the instruction issue logic, so that the machine knows thread one, register one is different from thread two, register one. Then you use the same logic we already had in our out-of-order superscalars, in the issue queue, to find instructions from multiple threads to send down the pipe, or down the respective multiple pipes. Conveniently, we have most of this logic already.

What's also nice is that if you're only running one thread at a time, when you go look in your instruction queue, or your issue window, only that one thread shows up. So you know you don't need to look at any other threads, and that one thread can use the full machine.

This lets you use the same machine both ways: if you have lots of threads, you can run the different threads and fill all the slots; and if you don't have that parallelism and want to exploit instruction-level parallelism, one thread can use the slots in a different allocation and just leave the dead slots, the purple slots. You can see here we are issuing three instructions from the checkerboard-blue thread, and here we are issuing four. It could be the exact same program executing the same cycle; the machine will extract different amounts of instruction-level parallelism from a thread depending on whether multiple threads are running. That is what these truly simultaneous multithreading machines attempt to do.
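Here is a toy, illustrative sketch of the thread-tagging idea just described (not the design of any real machine): entries in a shared issue queue carry a thread ID, so thread one's register one and thread two's register one are distinct tags, and a single select step can pick ready instructions from any thread.

```python
from dataclasses import dataclass, field

@dataclass
class IQEntry:
    thread: int                      # thread ID distinguishes same-named registers
    dst: str                         # architectural destination register, e.g. "r1"
    srcs: list                       # (thread, reg) tags this entry still depends on
    ready: set = field(default_factory=set)

    def is_ready(self):
        return all(s in self.ready for s in self.srcs)

class IssueQueue:
    def __init__(self, width):
        self.width = width
        self.entries = []
        self.done = set()            # (thread, reg) tags already produced

    def insert(self, thread, dst, srcs):
        e = IQEntry(thread, dst, [(thread, r) for r in srcs])
        e.ready = {t for t in e.srcs if t in self.done}
        self.entries.append(e)

    def issue(self):
        # Select up to `width` ready entries regardless of which thread they
        # belong to: this shared select across threads is the SMT part.
        picked = [e for e in self.entries if e.is_ready()][: self.width]
        for e in picked:
            self.entries.remove(e)
            self.done.add((e.thread, e.dst))
            for other in self.entries:          # wake up dependents
                if (e.thread, e.dst) in other.srcs:
                    other.ready.add((e.thread, e.dst))
        return picked

if __name__ == "__main__":
    iq = IssueQueue(width=4)
    iq.insert(thread=1, dst="r1", srcs=[])       # thread 1 produces its r1
    iq.insert(thread=2, dst="r1", srcs=[])       # thread 2's r1 is a separate tag
    iq.insert(thread=1, dst="r2", srcs=["r1"])   # depends only on thread 1's r1
    print([(e.thread, e.dst) for e in iq.issue()])  # both r1 producers issue together
    print([(e.thread, e.dst) for e in iq.issue()])  # then thread 1's dependent r2
```

The small usage block at the bottom shows the two threads' independent r1 producers issuing in the same cycle, followed by thread one's dependent instruction once its own r1 tag is marked done.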
There's one other point I want to make here. When you're doing simultaneous multithreading, you have to worry about priorities between threads, because you want some sort of equal-progress guarantee, and round robin is probably not good enough. You need to figure out how to keep one thread from hogging the machine, and provide some fairness between the different threads.

I wanted to show a few examples. On the top here we have the IBM Power4 processor pipeline, which was a non-multithreaded machine. The Power5 architecture actually looks very similar to the Power4 architecture, except they added a second hardware thread. What they had to do is add more fetch bandwidth, and then they had this notion of group formation. Group formation is the picking between the two threads, figuring out what they could actually execute simultaneously, and they effectively added extra pipe stages out in front to do that. And at the end you have to commit to different program counters at the same time. You can see there is definitely complexity in building this: all of a sudden you have to use the reorder buffer to kill one thread's instructions when it takes a branch mispredict, but not the other thread's. We already talked about how to do that in a superscalar, where we can kill a subsection of instructions without killing the whole instruction sequence.

Here's a different view of the Power4, here in a physical chip view. One of the interesting things to see is that they went with two threads, because they figured out that with four threads they would be using the resources too much and bottlenecking in different places in the pipe. Basically, the execution resources would already be filled with two threads running, so there were diminishing returns past that point.

We're almost done, but I think we're out of time.
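On that fairness point, here is a toy sketch of one simple policy for keeping a thread from hogging the machine: favor the thread with the fewest instructions currently in flight (the idea behind the ICOUNT heuristic from the follow-on SMT work), shown next to plain round-robin. The in-flight counts are invented.

```python
def pick_icount(in_flight):
    """Return the index of the thread with the fewest in-flight instructions."""
    return min(range(len(in_flight)), key=lambda t: in_flight[t])

def pick_round_robin(cycle, n_threads):
    """Ignore machine state entirely and just rotate between threads."""
    return cycle % n_threads

if __name__ == "__main__":
    in_flight = [12, 3]   # thread 0 is clogging the window with stalled instructions
    for cycle in range(4):
        print("cycle", cycle,
              "round-robin ->", pick_round_robin(cycle, len(in_flight)),
              "icount ->", pick_icount(in_flight))
```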
Actually, before we stop, let me skip forward and summarize all the different types of multithreading. You have your plain superscalar processor. You have very fine-grain multithreading, where you switch threads on each cycle. You have coarse-grain multithreading, where you switch every few cycles. You can think about cutting the processor in half and using some of the functional units for one thread and some for another. And then we have full simultaneous multithreading. We'll pick up on this next time and finish up with some of the implementations, the Pentium 4 and how it did multithreading. But we'll stop here for today.