Okay, so that was coarse-grain multithreading. Now we're going to move on to mixing instructions from different threads in the pipeline at the same time, issuing instructions from two different threads in the same cycle to different pipelines. The idea here is called simultaneous multithreading, and actually, let me start with a picture of what this looks like. Let's work backwards in the slide deck for a second.

So we have a four-issue processor; that's our issue width, and time increases going down. Each of the different patterns represents a different thread. The idea in simultaneous multithreading is that you can execute instructions from different threads in different pipelines simultaneously. Now, this gets quite a bit harder than the basic design, because now we actually have to read the register file for four different threads simultaneously, and we have to fetch code from different program counters simultaneously in this processor.

Where simultaneous multithreading came out of was that people were building complex out-of-order superscalars. These complex out-of-order superscalars had all this logic to track dependencies between different instructions and to restart sub-portions of instruction sequences. So, say you take a branch mispredict: you had to kill all the instructions that were dependent on the mispredicted branch and leave the ones that were not dependent alone. When you have this out-of-order mechanism, you already have all this extra logic there to figure that out.

Dean Tullsen, Susan Eggers, and Hank Levy came up with the idea: what if we try to utilize all the dead slots in our out-of-order superscalar, and intermix different threads in there simultaneously to fill the time? They did a study, back at ISCA in '95, that ran a bunch of different applications, and the rightmost bar here is the composite, the average. What you should take away is that the black bar on the bottom is how long the processor is busy actually doing work, and the rest of the bar is the different reasons the processor was stalled.
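To make that busy-versus-stalled breakdown concrete, here is a minimal sketch, in Python, of how issue-slot utilization could be tallied from a per-cycle issue trace on a four-wide machine. The trace numbers are invented for illustration (the study's figure, discussed next, was under 20% busy); only the bookkeeping of busy slots, fully empty cycles (vertical waste), and partially filled cycles (horizontal waste) is the point.

```python
ISSUE_WIDTH = 4

def slot_breakdown(issued_per_cycle):
    """Split a trace's issue slots into busy, vertical waste, horizontal waste."""
    total = ISSUE_WIDTH * len(issued_per_cycle)
    busy = sum(issued_per_cycle)
    # Vertical waste: cycles where nothing issued at all.
    vertical = ISSUE_WIDTH * sum(1 for n in issued_per_cycle if n == 0)
    # Horizontal waste: unused slots in cycles that did issue something.
    horizontal = total - busy - vertical
    return busy / total, vertical / total, horizontal / total

if __name__ == "__main__":
    # Hypothetical per-cycle issue counts for a 4-wide machine.
    trace = [4, 2, 0, 1, 0, 2, 0, 0, 3, 1]
    busy, vert, horiz = slot_breakdown(trace)
    print(f"busy {busy:.0%}  vertical waste {vert:.0%}  horizontal waste {horiz:.0%}")
```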
The processor was stalled on instruction cache misses, branch mispredictions, load delays, plain pipeline interlocks, memory conflicts, and other sorts of things, and it was busy less than 20% of the time.

So, to show this a different way: we have our multi-issue processor, with time going down, and all these purple boxes are just dead time. We might be able to use subsets of the slots. This cycle is very good, we actually issued four instructions; but here we only issued two, here we issued one, and here we issued two to the two pipelines in the middle. Maybe those are the two ALU pipes, and there's a load pipe here and a branch pipe there, or something like that. This is kind of a disaster from an IPC perspective. Can we try to reuse that hardware?

We talked about coarse-grain multithreading, which was effectively temporally slicing up the cycles: you run one thread, switch to a different thread, run that one, and you temporally switch between the threads.

But here is another approach, and this was actually done in the failed Sun Millennium processor, what was going to be the UltraSPARC V. They had a clustered VLIW, or, excuse me, a clustered superscalar, and they had a mode where you could flip a bit and basically cut the two clusters apart and run them as two separate processors, for two different threads. So let's say you had this width-four processor; instead, you split it in half, and you have two functional units running one thread and two functional units running another thread.

Well, this has some good effects. You don't have to reach really high IPC, really high instructions per clock; you never have to reach four in this design, because we've narrowed the two pipeline widths. And you could think about putting them back together. That is what the Millennium processor, the UltraSPARC V, tried to do: you can switch between this mode and this mode. But you still have a lot of open slots here that you can't go use, so it still leaves vertical waste. Another downside is that you can't easily have one thread use all of the resources in this mode.
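Here is a minimal sketch, with made-up per-cycle instruction counts, of why the statically split machine leaves slots unused compared to sharing the full issue width: the split machine caps each thread at half the width even when the other thread has nothing to issue that cycle. It ignores real structural and dependence constraints; it only illustrates the slot-capping effect.

```python
WIDTH = 4  # total issue width of the machine

def split_issue(ready_a, ready_b):
    """Statically partitioned machine: each thread owns a fixed 2-wide half."""
    return sum(min(a, WIDTH // 2) + min(b, WIDTH // 2)
               for a, b in zip(ready_a, ready_b))

def smt_issue(ready_a, ready_b):
    """Fully shared issue stage: only the total width of 4 is a limit."""
    return sum(min(a + b, WIDTH) for a, b in zip(ready_a, ready_b))

if __name__ == "__main__":
    # Invented per-cycle counts of ready instructions for two threads.
    thread_a = [4, 3, 0, 4, 1]   # bursty thread that could sometimes use the whole machine
    thread_b = [0, 1, 2, 0, 3]
    slots = WIDTH * len(thread_a)
    print("split:", split_issue(thread_a, thread_b), "of", slots, "slots used")
    print("SMT  :", smt_issue(thread_a, thread_b), "of", slots, "slots used")
```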
In that very static design you can't have, say, thread one use a functional unit over on the other side. So this brings us to full simultaneous multithreading, or SMT. In SMT we are able to mix and match all these different instructions, but this is going to change our processor pipeline quite a bit.

Okay, so let's look at what this does to a processor pipeline. Well, all of a sudden we need to be fetching multiple instructions at a time, from different threads, and that's definitely a harder thing to do. Conveniently, though, we can use our instruction queue, or our reservation stations if you will, to find different instructions to execute at the issue stage of our out-of-order processor. What we do is tag the threads independently and make that tag part of the instruction issue logic, so that the machine knows thread one, register one is different from thread two, register one. Then you use the same logic we already had in our out-of-order superscalars, in the issue queue, to find instructions from multiple threads to send down the pipe, or down the respective multiple pipes. Conveniently, we have most of this logic already.

What's also nice is that if you're only running one thread at a time, when you go look in your instruction queue, or your issue window, only that one thread shows up. So you know you don't need to look at any other threads, and that one thread can use the full machine.

This lets you use the same machine both ways: if you have lots of threads, you can run the different threads and fill all the slots; and if you don't have that parallelism and want to exploit instruction-level parallelism, one thread can use the slots in a different allocation and just leave the dead slots, the purple slots. You can see here we are issuing three instructions from the checkerboard-blue thread, and here we are issuing four. It could be the exact same program executing the same cycle; the machine will extract different amounts of instruction-level parallelism from a thread depending on whether multiple threads are running. That is what these truly simultaneous multithreading machines attempt to do.
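Here is a toy, illustrative sketch of the thread-tagging idea just described (not the design of any real machine): entries in a shared issue queue carry a thread ID, so thread one's register one and thread two's register one are distinct tags, and a single select step can pick ready instructions from any thread.

```python
from dataclasses import dataclass, field

@dataclass
class IQEntry:
    thread: int                      # thread ID distinguishes same-named registers
    dst: str                         # architectural destination register, e.g. "r1"
    srcs: list                       # (thread, reg) tags this entry still depends on
    ready: set = field(default_factory=set)

    def is_ready(self):
        return all(s in self.ready for s in self.srcs)

class IssueQueue:
    def __init__(self, width):
        self.width = width
        self.entries = []
        self.done = set()            # (thread, reg) tags already produced

    def insert(self, thread, dst, srcs):
        e = IQEntry(thread, dst, [(thread, r) for r in srcs])
        e.ready = {t for t in e.srcs if t in self.done}
        self.entries.append(e)

    def issue(self):
        # Select up to `width` ready entries regardless of which thread they
        # belong to: this shared select across threads is the SMT part.
        picked = [e for e in self.entries if e.is_ready()][: self.width]
        for e in picked:
            self.entries.remove(e)
            self.done.add((e.thread, e.dst))
            for other in self.entries:          # wake up dependents
                if (e.thread, e.dst) in other.srcs:
                    other.ready.add((e.thread, e.dst))
        return picked

if __name__ == "__main__":
    iq = IssueQueue(width=4)
    iq.insert(thread=1, dst="r1", srcs=[])       # thread 1 produces its r1
    iq.insert(thread=2, dst="r1", srcs=[])       # thread 2's r1 is a separate tag
    iq.insert(thread=1, dst="r2", srcs=["r1"])   # depends only on thread 1's r1
    print([(e.thread, e.dst) for e in iq.issue()])  # both r1 producers issue together
    print([(e.thread, e.dst) for e in iq.issue()])  # then thread 1's dependent r2
```

The small usage block at the bottom shows the two threads' independent r1 producers issuing in the same cycle, followed by thread one's dependent instruction once its own r1 tag is marked done.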
There's one other point I want to make here. When you're doing simultaneous multithreading, you have to worry about priorities between threads, because you want some sort of equal-progress guarantee, and round robin is probably not good enough. You need to figure out how to keep one thread from hogging the machine, and provide some fairness between the different threads.

I wanted to show a few examples. On the top here we have the IBM Power4 processor pipeline, which was a non-multithreaded machine. The Power5 architecture actually looks very similar to the Power4 architecture, except they added a second hardware thread. What they had to do is add more fetch bandwidth, and then they had this notion of group formation. Group formation is the picking between the two threads, figuring out what they could actually execute simultaneously, and they effectively added extra pipe stages out in front to do that. And at the end you have to commit to different program counters at the same time. You can see there is definitely complexity in building this: all of a sudden you have to use the reorder buffer to kill one thread's instructions when it takes a branch mispredict, but not the other thread's. We already talked about how to do that in a superscalar, where we can kill a subsection of instructions without killing the whole instruction sequence.

Here's a different view of the Power4, here in a physical chip view. One of the interesting things to see is that they went with two threads, because they figured out that with four threads they would be using the resources too much and bottlenecking in different places in the pipe. Basically, the execution resources would already be filled with two threads running, so there were diminishing returns past that point.

We're almost done, but I think we're out of time.
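On that fairness point, here is a toy sketch of one simple policy for keeping a thread from hogging the machine: favor the thread with the fewest instructions currently in flight (the idea behind the ICOUNT heuristic from the follow-on SMT work), shown next to plain round-robin. The in-flight counts are invented.

```python
def pick_icount(in_flight):
    """Return the index of the thread with the fewest in-flight instructions."""
    return min(range(len(in_flight)), key=lambda t: in_flight[t])

def pick_round_robin(cycle, n_threads):
    """Ignore machine state entirely and just rotate between threads."""
    return cycle % n_threads

if __name__ == "__main__":
    in_flight = [12, 3]   # thread 0 is clogging the window with stalled instructions
    for cycle in range(4):
        print("cycle", cycle,
              "round-robin ->", pick_round_robin(cycle, len(in_flight)),
              "icount ->", pick_icount(in_flight))
```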
Actually, before we stop, let me skip forward and summarize all the different types of multithreading. You have your plain superscalar processor. You have very fine-grain multithreading, where you switch threads on each cycle. You have coarse-grain multithreading, where you switch every few cycles. You can think about cutting the processor in half and using some of the functional units for one thread and some for another. And then we have full simultaneous multithreading. We'll pick up on this next time and finish up with some of the implementations, the Pentium 4 and how it did multithreading. But we'll stop here for today.