So the idea here is that if we can guarantee there's no dependence between two instructions, we can think about executing those instructions in parallel. And one really good way to guarantee independence is to have completely different programs, or completely different threads. The naive approach is to interleave instructions from different threads to cover latency. So we send a load out, and while that load is off resolving in the memory system (this is our thread 1 from before, which loads into R1 and then wants to use the result, also in thread 1), we put other work into our processor pipeline. By the time the data comes back from memory, it's all ready to go. This is called multithreading.

Now, one of the big insights here is that you need to have enough other work to do. These other threads add some complexity to your processor design, and they may cause some locality challenges. If you're trying to exploit temporal locality in, say, a cache or a translation lookaside buffer or some other buffer, and you start putting other data accesses or other operations in the middle, that might destroy your locality. But let's think about this in a happy way for right now: if you can find other work to do, you can just shove it into the slots while you're waiting for this load to come back.

The most basic version of multithreading is some sort of very basic interleaving. We run thread 1, then thread 2, then thread 3, then thread 4; then thread 1, thread 2, thread 3, thread 4 again. We don't pay attention to the latencies or dependencies of the instructions when we schedule. And if you come around to a thread's slot and there is nothing to do, say because its load missed in the cache and took a thousand cycles to come back, you just don't schedule anything. You have a dead cycle there.

One of the nice things you can take advantage of with this kind of multithreading is that you don't have to worry about bypassing anymore. If you know you're not going to execute another instruction from the same thread until several cycles later, why do you need a fast bypass from the ALU back to itself? You just don't care.
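To make the fixed-interleave idea concrete, here is a minimal cycle-level sketch of my own (not from the lecture): a single-issue pipeline, four hardware threads picked round-robin, and the pessimistic assumption that every instruction is a load. All names and numbers are illustrative.

```c
#include <stdio.h>

#define N_THREADS    4   /* hardware threads, issued round-robin */
#define LOAD_LATENCY 3   /* cycles until a load's result is back */

/* Cycle at which each thread's outstanding load resolves. */
static int ready_at[N_THREADS];

int main(void) {
    int issued = 0, bubbles = 0;
    for (int cycle = 0; cycle < 32; cycle++) {
        int t = cycle % N_THREADS;      /* fixed interleave: 1,2,3,4,1,2,... */
        if (cycle < ready_at[t]) {
            bubbles++;                  /* thread t still waiting: dead cycle */
        } else {
            issued++;                   /* issue one instruction from thread t */
            ready_at[t] = cycle + LOAD_LATENCY;  /* pretend it was a load */
        }
    }
    printf("issued=%d bubbles=%d\n", issued, bubbles);
    return 0;
}
```

With N_THREADS at or above LOAD_LATENCY, the loop reports zero bubbles: a thread's load has always resolved by the time its slot comes around again, which is also why the same-thread bypass path can be dropped. Raise LOAD_LATENCY to model a cache miss and the dead cycles appear.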
So let's see, do we actually have an example of that? Here's one: say you have a write to R1 and then a read of R1. What value does thread 2 get here?

The first versions of multithreaded machines tried to partition the register file, so you compiled your programs differently for each thread. That's not really common anymore, though we'll talk about an example where they actually did it. For a while that was actually a limiter on the number of threads, until people came up with the idea you just suggested, which is to somehow change the namespace of the registers. The first implementation of that, which we'll talk about in a little bit, was on SPARC, actually. It used SPARC's register windows, which gave you different namespaces for different sets of registers. That finally evolved into having a thread identifier and copies of the entire register space, so each thread has its own register set. And that's what we're showing in this picture. As shown here, you have to copy the general-purpose register file four times, and you have to copy the program counter four times. Then you have an incrementer out on the front of the pipe which chooses the thread ID, the thread select. In our simple case we just keep incrementing it and choose 1, 2, 3, 4; 1, 2, 3, 4; continually. Likewise, that's an index into which general-purpose register file we're supposed to be accessing. Or rather which logical general-purpose register file, because most of these architectures actually put them together into one larger physical register file, with the thread ID acting as addressing bits that are not architecturally visible.

But you bring up a good point: there are a couple of userland threading libraries out there. They use this notion of threading, but not on a multithreaded computer architecture, so without the replicated PCs and register files. There are still some advantages to that, because you can still swap between different threads and try to cover memory latency, since your load-to-use distance will be longer.
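Stepping back to the hardware picture for a moment, here is a rough sketch of the replicated state just described: per-thread PCs, four logical register files folded into one physical array, and a thread-select counter. This is my own modeling, not any real machine's layout.

```c
#include <stdint.h>

#define N_THREADS 4
#define N_REGS    32

/* All four logical register files live in one physical array; the
 * thread ID is just extra addressing bits the program never sees. */
static uint32_t regfile[N_THREADS * N_REGS];
static uint32_t pc[N_THREADS];     /* one program counter per thread */
static unsigned thread_sel;        /* thread select, bumped each cycle */

/* Read or write architectural register r of the selected thread. */
static uint32_t read_reg(unsigned r) {
    return regfile[thread_sel * N_REGS + r];
}
static void write_reg(unsigned r, uint32_t v) {
    regfile[thread_sel * N_REGS + r] = v;
}

/* The fixed interleave: the incrementer on the front of the pipe. */
static void next_cycle(void) {
    thread_sel = (thread_sel + 1) % N_THREADS;
}
```

In this scheme, thread 1's write to R1 and thread 2's read of R1 land in different slices of `regfile`, so thread 2 simply never sees thread 1's value.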
Coming back to those threading libraries: they largely do still try to partition the register file space. You can actually go download one of these (get Linux and go run it) and people still use these userland thread libraries to do exactly that. The other way to do it is to swap out the entire register space, but people typically try not to, because if you're doing fine-grain interleaving, saving your entire register space to memory is very expensive. Instead, these threading libraries work together with the compiler and tell it: don't use all the registers for thread 1. Use half the registers for thread 1 and half the registers, let's say, for thread 2.

So this is our simple multithreading pipeline. The changes are relatively small so far: replicate the register file, replicate the PC. And this can help us recover utilization on our ALU, for instance.

What we wanted to point out is what this looks like to software: it looks like multiple processors. If you use something like a modern-day Core i7, those have two hardware threads per core, and when you go look at the machine, it appears to have twice as many cores as you'd think. For instance, if you have a four-core Core i7 and you open Windows' process-management dialog box, you'll see eight little bars in there, because there are two virtual processors per physical core in the machine. So hardware threads look like multiple, slower CPUs.

Okay, so what are the costs? The easy costs are replicating the program counter and the general-purpose registers. Things that start to become harder to replicate, but that you're going to have to replicate if you want full isolation between these virtual CPUs, are the pieces of system state: things like page table base registers, and all the state around interrupt handling, such as the exceptional PC. This actually gets a little bit hard to do, and some processors have a fair amount of system state. If you look at x86, there's a lot of system state.
MIPS is relatively minimal: it's basically the exceptional PC and the interrupt mask register. And because the TLB on MIPS is software-managed, you don't even have a virtual memory page table base register to replicate. So this is something to think about: you have to replicate all of this state per thread. There is some cost to it, but you still get to hide latencies to memory, and you still get to reuse your ALUs and increase their efficiency.

Personally, I think those are the smaller issues. Replicating the PC and registers is very small; replicating the system state is bigger than the first one, but mostly it's just harder to do. The big issues show up when you start to have temporal locality conflicts and spatial locality conflicts. If you have four threads all executing, and they're fighting for capacity in your cache, that can be a problem. Or say you have sixteen hardware threads and an 8-way set-associative cache, and they always want to access, let's say, index 0 in the cache. This is a common thing; your stack always starts there, or something like that. Well, you only have 8 ways of associativity, and if you're time-multiplexing threads 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16 and then coming back around to 1, you're guaranteed that by the time you get back to thread 1, with only an 8-way set-associative cache, you've bumped out the data it needed. You get no temporal locality: by the time you come back, your data is not in the cache, so you have a conflict going on in your cache.

Okay, so that's the negative side: the threads can conflict with each other. But if two threads are working together, you can actually get positive, constructive interference in the cache, and it definitely happens. That was actually one of the original reasons people did multithreading: if you have a program with different threads working on roughly the same data, this is very, very good, because the threads will be precaching data for each other.
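Here is a small simulation of the worst case described a moment ago: sixteen threads, each touching its own line in the same 8-way set, in strict round-robin. The code is illustrative; FIFO replacement stands in for LRU, and the two behave identically under this rotating access pattern because every access misses.

```c
#include <stdio.h>

#define WAYS      8     /* 8-way set-associative         */
#define N_THREADS 16    /* all mapping to the same index */

int main(void) {
    int set[WAYS];                      /* tags currently in the set */
    for (int w = 0; w < WAYS; w++) set[w] = -1;
    int next = 0, misses = 0, accesses = 0;

    /* Each thread touches its own line (tag = thread ID) in the
     * same set, round-robin, for a few full rotations. */
    for (int round = 0; round < 4; round++) {
        for (int t = 0; t < N_THREADS; t++) {
            accesses++;
            int hit = 0;
            for (int w = 0; w < WAYS; w++)
                if (set[w] == t) hit = 1;
            if (!hit) {
                misses++;
                set[next] = t;          /* evict the oldest line */
                next = (next + 1) % WAYS;
            }
        }
    }
    printf("%d misses out of %d accesses\n", misses, accesses);
    return 0;
}
```

This prints 64 misses out of 64 accesses: fifteen other threads always fill the set before a thread's slot comes back around, so its line is guaranteed gone. Drop N_THREADS to 8 or below and every access after the first rotation hits.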
Well, you have to make sure; it's a fine line to walk between the threads fighting for shared resources and the threads collaborating, pulling shared data in for each other and exploiting the locality structures in the processor. The cache is one example; translation lookaside buffer entries are another. One solution is that you can actually partition the cache, or partition the TLB, and then you can guarantee the threads do not fight. Unfortunately, if you do this with a fixed-size cache, you effectively cut the cache by N for N threads, in half for two threads, say. So people typically don't try to statically partition these things, but you could think about doing it if you want no interference between the different threads.

Okay, so let's look at a couple of different ways you can try to schedule threads. They've gone from simple, to a little more complicated, to even more complicated over time. Simple is what we've been talking about up to this point: there's some fixed interleaving of N hardware threads, and you basically execute one instruction from each thread and then change to the next thread. That's our first case. This fixed interleaving dates back to the 60s, and it's pretty simple. What's nice about it is that you can remove some of the interlocking and some of the bypassing from your processor.

The next thing you can think about is allocating different slots in your scheduling quantum to different threads. For instance, we know that the cyan thread here is the highest-priority thread; it needs the largest number of slots. So we can do an uneven distribution. Across our slots here, we can interleave and say the cyan thread gets every other slot, the orange one gets a smaller percentage, and the blue one gets a smaller percentage. It's still a fixed interleaving to some extent, a fixed ordering, but it's controllable by software, by the operating system, depending on the priorities of the different threads.
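A sketch of what that software-programmable interleave might look like: the OS fills in a slot table, and the hardware picker just cycles through it. The table contents are a hypothetical allocation of my own, with cyan as thread 0, orange as thread 1, and blue as thread 2.

```c
#define SLOTS 8

/* Cyan (thread 0) gets every other slot; orange (1) and blue (2)
 * split the remainder.  The OS rewrites this table to change
 * priorities; the hardware never makes a decision of its own. */
static unsigned slot_table[SLOTS] = { 0, 1, 0, 2, 0, 1, 0, 2 };

static unsigned pick_thread(unsigned cycle) {
    return slot_table[cycle % SLOTS];
}
```

Reprogramming the table at a timer tick is all it takes to, say, promote orange over cyan for the next quantum.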
So this is a little bit more flexible in hardware than the completely fixed interleave design, and it does require a small change. You can't just have a counter incrementing your thread ID; instead, you need something that acts as a picker for which thread you go execute. But it's still relatively simple. And because it's software-programmable, you can actually choose a time quantum and then reallocate, so the OS can change the allocation for a different period and, let's say, make orange the higher priority instead of cyan.

Then we start to go to something a little more complicated. These are still relatively fixed priorities. You could think about a design where the hardware actually makes decisions about which thread to go execute. You could even have it go as far as determining that the thread you're executing has hit a long-latency operation, and switching to another thread at that point. So the hardware purposely backfills work from other threads whenever one thread goes off to do something with a long latency. You can think about designs like that, and the HEP processor, which we'll talk about in a minute or two, was one of the first early examples.

So just to recap here: we're going to call this coarse-grain multithreading. The reason we call it coarse-grain is that in any one cycle, only one thread is running. And you might scratch your head and ask, how do you have multiple threads running in one cycle? Well, if you go to a superscalar, multi-issue processor, you can think about having the different pipelines execute instructions from different threads simultaneously. That's called simultaneous multithreading, and we'll talk about it in a second. But in our coarse-grain approach here, just to recap: you can swap threads on hardware misses, and you can take advantage of bubbles in the processor to do other work.

Okay, so a little bit of a history lesson here. The HEP processor was Burton Smith's machine; he's now at Microsoft Research, and he was the chief architect of this back in the 80s.
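Before we get to the HEP details, here is a minimal sketch of the switch-on-long-latency policy just described. This is a hypothetical picker of my own, not HEP's actual mechanism.

```c
#define N_THREADS 4

static unsigned current;             /* thread that owns the pipeline */
static int      stalled[N_THREADS];  /* set on a miss or other long-latency event */

/* Coarse-grain policy: keep issuing from one thread until it hits a
 * long-latency event, then rotate to the next runnable thread. */
static unsigned pick_thread(void) {
    if (!stalled[current])
        return current;              /* stay on the same thread */
    for (unsigned i = 1; i <= N_THREADS; i++) {
        unsigned t = (current + i) % N_THREADS;
        if (!stalled[t]) {
            current = t;             /* switch on the stall */
            return t;
        }
    }
    return current;                  /* everyone stalled: dead cycle */
}
```

The miss-handling logic would set `stalled[t]` when thread t's load misses and clear it when the data returns; the picker itself stays this simple.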
The HEP is actually a pretty interesting processor, because it had lots and lots of threads and a small number of processors: there were 120 threads per processor in this machine. It had a relatively modest clock rate for the 80s, but what they were trying to do here was deal with memory latency. This machine had a very high-bandwidth memory system, and what would happen is, effectively, you were allowed to have a load or a store in every instruction, and your performance would never degrade even if you did have a load or a store in every instruction. That's because this machine had 120 threads per processor, and the memory latency was less than 120 cycles. So if you had a load in every instruction, and you had enough bandwidth to feed all those loads, you could execute a load from each thread (each thread being independent of the others), go out to the memory system, and pipeline the latency out to memory along with the memory access itself. By the time you came back around to execute the next instruction from that thread, which would be 120 cycles later, the memory result would be there.

This insight was carried forward. Burton went and started a company called Tera, which later went on to buy Cray, strangely enough, because I always thought of Cray as a bigger company than Tera, and it definitely was. We'll call it a merger, and I think that's what they called it, but in reality Tera acquired Cray. Tera's machine had a similar sort of idea; it was a further evolution of the HEP processor, with more processors: 128 active threads per processor, and lots of processors, 256 of them, so lots of active programs. You had to find enough thread-level parallelism in your program, and this architecture has no caches, because it doesn't need them. If you have enough bandwidth out to your memory system, you can have a load every cycle and cover the latency with other threads.

Why do you care? Well, some interesting things about this: it may not be good for power. You are not exploiting locality in your data references at all, and you're going out to the memory system every cycle. And that memory could be far away; in fact, it is far away in this machine.
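The arithmetic behind that claim is worth a quick check: under strict round-robin, a thread issues once every `threads` cycles, so loads are fully covered whenever the memory latency is below the thread count. Here is a small sanity check using the lecture's numbers (the exact latency value is my assumption; the lecture only says it was less than 120 cycles):

```c
#include <stdio.h>

int main(void) {
    int threads     = 120;   /* hardware threads per HEP processor */
    int mem_latency = 119;   /* assumed: "less than 120 cycles"    */

    /* Thread t issues at cycles t, t + threads, t + 2*threads, ...
     * A load issued at cycle c is ready at c + mem_latency, which
     * precedes t's next issue slot at c + threads.                */
    if (mem_latency < threads)
        printf("every load resolves before its thread issues again\n");
    else
        printf("short by %d threads (or need more memory bandwidth)\n",
               mem_latency - threads + 1);
    return 0;
}
```

The same inequality explains Tera's 128 threads; the Cray XMT numbers mentioned below (about 150 cycles of latency against 128 threads) come close, which is why those threads were only typically, not always, free of memory waits.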
So you've got to be a little careful about these machines. And the second idea here is that you have to come up with lots of threads. Now, we're talking about numbers of threads similar to modern-day many-core machines, but it's still a fair number of threads to be able to use the machine to its best performance.

What I wanted to say about this architecture is that where it mostly got used was in applications that had no temporal or spatial locality in their data references anyway. Good examples of this were things like data mining: huge data sets, arbitrary access to those data sets. You couldn't have a prefetcher predict where your next memory reference was going to be. So if you're basically not going to be able to fit in a cache, or use a cache effectively anyway, remove the cache and go multithreaded. That was the idea behind these machines, and they actually saw some big speedups for applications with data sets like that.

This still lives on today in the Cray XMT, the extreme multithreading architecture. You can go buy this machine from what's nowadays called Cray Inc., which is Tera after it bought Cray. It has a modest clock speed by modern-day standards, only 500 MHz, but they can intermix these with Opterons and other processors, because they standardized on the AMD bus protocol, so you can plug AMD chips and these XMT chips into the same system.

Just to recap what their memory system looked like: the instruction pipelines were very long, and they didn't have to bypass, because you could never execute an instruction immediately followed by a subsequent instruction that used its result. The memory latencies were about 150 cycles, and they had 128 threads, so the threads were typically not waiting for the memory system, and they could effectively pipeline their memory operations.

Another little tidbit of history here. This is the machine I was talking about that's part of my academic lineage. When I first showed up at MIT, there was this machine in the corner called the MIT Alewife machine, which was a multiprocessor machine. It went up to 128 nodes, which was a lot in 1990.
And one of the little tidbits about this machine is that it had SPARC processors, and a SPARC processor has this notion of a register window. What a register window is: every time you make a function call, you basically change your addressing into the register file. So you have a larger register file and a smaller window onto it, and you slide this window across the register file on function calls and function returns. The MIT Alewife processor used this, and they extended the register window idea so that they could have a special instruction which actually changed how much of the register file they were looking at at a time. That meant they could swap the entire register file very quickly, and this actually allowed them to multithread very effectively with four threads. This was one of the early multithreaded processors, and it introduced this notion of thread switch on remote memory access. If you had a memory access that had to go to one of the other nodes in this machine (you're on this node here, you need to get to that one over there, so in multiprocessor fashion you have to send a message and get a response back, which takes a long time), you would actually switch threads at that point.

Multithreading lives on today, and especially this coarse-grain multithreading lives on today. A good example of it is the Oracle, formerly Sun, and before that Afara, Niagara processors. Afara was the name of the startup company that made this, and then Sun acquired them, and then Oracle acquired Sun. The Sun Niagara processor was designed for throughput computing, such that you could have lots of threads running, all independent, on a multithreaded processor. Niagara 1 had 8 cores and 4 threads per core, and they've turned that up over time. Niagara 1 is what was called the Sun T1; then came the Sun T2; and now there's the Sun T3, the Niagara 3, which has 16 cores with 8 threads per core. So there's a lot of parallelism going on here, and this coarse-grain multithreading lives on. So here's a die photo of the Niagara 3 processor.
We can see the different cores here. You might look at this picture and say there are what look to be 16 cores, and I say 16 cores on the slide, but one of the interesting things is that they actually conjoin two cores together internally. It's a strange sort of design decision: they mix the threads and the cores together, and they share, I think it was the floating-point unit, between the two cores. So these cores are, to some extent, conjoined.