Okay, so: multiple logical communication channels. All these different message types we just talked about — the ten or so drawn here — can't all go on the same network at the same time if you have one interconnection network, or one logical network. You can end up with cases where responses get queued behind requests that will create more traffic. In these protocols we have multiple phases of communication: we send messages from point A to point B, then from B to C, then from C back to B, and then B back to A, or something like that. So if you try to overlay this onto an interconnection network and you do head-of-line processing — you process the message at the head of the queue first — it's very common to have a case where a response sneaks in which is dependent on some other transaction happening somewhere else in the system, or a new request comes in that sits in front of a response, and all of a sudden you're going to have a deadlock.

So one way people go about solving this is to look at all the different message types you have and figure out which types can coexist on the network at the same time. A good example here: if you look at a relatively basic cache coherence protocol, there are maybe twelve different message types. You can try to shrink that down to maybe three or four different classes of traffic, such that those classes, when intermixed, will never introduce a deadlock. If you wanted to be really safe, you'd just have twelve separate networks — twelve logically different networks — but that's expensive to build, so people try to come up with equivalence classes of traffic and have only that number of networks. Either way, you're going to have to segregate these flows into different logical or physical channels to make sure you don't have deadlock happening.
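To make the class-splitting idea concrete, here's a minimal sketch in C. It assumes a hypothetical protocol whose message types collapse into three classes (requests, forwarded requests, responses); the type names and the three-way split are invented for illustration, not the exact protocol from the lecture.

```c
/* Hypothetical illustration: collapsing many coherence message types
 * into a few deadlock-free traffic classes, each carried on its own
 * virtual (or physical) network. */
enum msg_type {
    MSG_READ_REQ, MSG_WRITE_REQ, MSG_UPGRADE_REQ,      /* cache -> directory */
    MSG_FWD_READ, MSG_FWD_INVALIDATE,                  /* directory -> owner/sharers */
    MSG_DATA_REPLY, MSG_ACK, MSG_NACK, MSG_WRITEBACK,  /* terminating replies */
};

enum traffic_class { CLASS_REQUEST, CLASS_FORWARD, CLASS_RESPONSE };

/* Responses get their own class so a response can never be queued
 * behind a request that is itself waiting on that response. */
enum traffic_class classify(enum msg_type t)
{
    switch (t) {
    case MSG_READ_REQ: case MSG_WRITE_REQ: case MSG_UPGRADE_REQ:
        return CLASS_REQUEST;
    case MSG_FWD_READ: case MSG_FWD_INVALIDATE:
        return CLASS_FORWARD;
    default:
        return CLASS_RESPONSE;   /* data, acks, nacks, writebacks */
    }
}
```

The key property is that the response class can always drain without waiting on the other classes, which breaks the head-of-line cycles described above.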
[COUGH] Another thing I wanted to point out here: we just distributed our memory, so we need to start worrying about the memory ordering point — for a given memory address, where do you go, and in what order do memory operations happen? Just like in a bus-based protocol you have some sort of arbiter which determines who wins the bus, in a directory-based system the directory for a given address is typically the ultimate arbiter. So if two messages are coming in from different cores that both want write access to a particular address, one of them is going to get to the directory first, and whichever gets to the directory first wins, or ultimately will win. Sometimes this can be unfair, so in some of these systems you try to have something which prevents the core that happens to be close to the directory from always winning access to that contended line, we'll say — you may have some sort of back-off protocol coming into effect there. But this is what effectively guarantees our atomicity: the directory guarantees that a particular line can only be transitioning from one state to another at a given time.

So we have the directory as the ordering point, and whichever message gets to the home directory first, let's say, wins. Subsequent requests for that line are going to lose. So what do you do on a lose? Well, the directory is probably going to want to send a negative acknowledgement, or NACK — or send a retry; it's the same thing here. It's going to say: I can't handle this right now, this line is already being transitioned, someone else won the transitioning of this line, go retry that memory transaction again in the future. Now, this gets pretty complicated back at the cache controller, because it's going to get a retry, or a NACK, back, and it could potentially have a message coming in for that exact same cache line. It needs to give the directory's message priority over the pipeline — the main processor — and order its own request after it, because the directory is the ultimate ordering point for the memory location.
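As a minimal sketch of that win-or-NACK behavior, here's some illustrative C; the busy bit, field names, and functions are hypothetical, not from any particular machine.

```c
/* Hypothetical directory entry acting as the ordering point for one
 * line: the first request to arrive wins and starts a transition; any
 * request arriving mid-transition is NACKed and must retry later. */
struct dir_entry {
    int busy;          /* nonzero while a transition is in flight */
    int state;         /* stable state, e.g. INVALID / SHARED / MODIFIED */
    unsigned sharers;  /* sharer bit vector (small machine assumed) */
};

enum dir_reply { DIR_PROCEED, DIR_NACK_RETRY };

enum dir_reply directory_accept(struct dir_entry *e)
{
    if (e->busy)
        return DIR_NACK_RETRY;  /* someone else won; retry in the future */
    e->busy = 1;                /* this requester wins the ordering race */
    return DIR_PROCEED;         /* ...carry out the state transition... */
}

void directory_complete(struct dir_entry *e, int new_state)
{
    e->state = new_state;
    e->busy = 0;                /* line may transition again */
}
```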
Finally, you have to worry about forward progress guarantees. What I mean by this is that it's pretty easy to build a memory system where your cache controller pulls the data into your cache, but before you're able to do the load or store to that actual address, the line gets bumped out of your cache. All of a sudden you have a cache line just bouncing between two caches, and that's a livelock scenario. So in directory-based coherence systems — and this also happens with bus-based protocols, though there it's usually less likely — you have to have some sort of forward progress guarantee. The reason it's less likely on a bus is that in a directory-based system the communication time is usually longer, so the window of vulnerability is longer. A forward progress guarantee means that once the line gets into your cache, you need to do at least one memory operation to it — to guarantee you've won — before you relinquish it back to the directory controller. So if you're doing a load, you actually read the cache line you brought in shared, performing that one load, before you're allowed to cough it up — you don't respond with a reply to, let's say, an invalidation request from the directory controller until you've done your one memory transaction. Likewise, if you bring in a line to modify it, do that one store before you cough the data back up, or release the data back to the memory controller. That's really important, to make sure you have some forward progress in the system and don't end up with a livelock.
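Here's a minimal sketch of how a cache controller might enforce that guarantee — deferring the invalidation reply until the first local access has completed. The state fields and handler names are invented for illustration.

```c
/* Hypothetical forward-progress logic: once a line arrives, perform at
 * least one load or store to it before honoring an invalidation (or
 * other relinquish) request from the directory. */
struct cache_line_state {
    int valid;
    int accessed_once;      /* set after the first load/store completes */
    int pending_invalidate; /* directory has asked for the line back */
};

void on_local_access(struct cache_line_state *l)
{
    l->accessed_once = 1;   /* the one guaranteed operation has happened */
    if (l->pending_invalidate) {
        l->valid = 0;       /* now it is safe to cough the line back up */
        l->pending_invalidate = 0;
        /* ...send invalidation ack to the directory... */
    }
}

void on_invalidate_request(struct cache_line_state *l)
{
    if (!l->accessed_once) {
        l->pending_invalidate = 1;  /* defer: avoids livelock ping-pong */
        return;
    }
    l->valid = 0;                   /* already made progress; release now */
    /* ...send invalidation ack to the directory... */
}
```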
Okay, so now we're going to get into the more future-looking stuff here. We've been talking about what's called a full-map directory, where there's a bit per processor core. If you have 1,000 cores in the system, that's a very large bit map, and it's pretty uncommon that a thousand cores in the system will all be reading one particular cache line. So that can be wasteful, and a full-map directory grows at order N, which is not particularly good. So people have looked into different protocols here; I just want to touch on this, since it's largely an area of research and future directions. One idea is a limited-pointer-based directory. Instead of having a bit mask as the sharer list — right, this here is the sharer list, sort of a bit mask of the sharers — you have base-two-encoded core numbers, up to some point. This is why it's called a limited directory, or limited-pointer directory: you can't have all the nodes in the system on the list, because there's a limited number of entries here, but you can name any of them, because you have a base-two encoding of the actual core number. And if the sharer set gets bigger than the entire list — say you have one, two, three, four, five entries, and all of a sudden a sixth sharer wants to take a copy of the line — there's usually an overflow bit here which says there are more sharers than pointers. When that bit is set and the line has to be invalidated — on a transition to modified, say — you're just going to send a broadcast message, an invalidate, to every single cache in the system. But usually this can be a good trade-off, because it's pretty uncommon to have extremely widely shared lines. So it's an interesting trade-off: storage space versus sending more messages in the system.

Likewise, there's an interesting protocol here called LimitLESS. Same idea — it's a limited directory with an overflow bit — but if it overflows, you start to keep the directory, or rather the sharer list, in software, in a structure in main memory. This requires you to have very fast traps, such that when an overflow happens while you're servicing a cache line, you interrupt the main processor and the main processor provides the rest of the sharer list. And there's a bunch of stuff in between, and future research still being done actively in this space.
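To make the storage trade-off concrete, here's a sketch contrasting the two entry formats; the sizes — 1,024 cores, five pointers — are made-up numbers for the example.

```c
#include <stdint.h>

#define NCORES 1024

/* Full-map: one presence bit per core, so storage grows O(N) per line. */
struct full_map_entry {
    uint64_t presence[NCORES / 64];  /* 1024 bits = 128 bytes per line */
};

/* Limited-pointer: a handful of base-2-encoded core IDs plus an
 * overflow bit.  10 bits name any of 1024 cores, so five pointers and
 * the overflow flag take only a few bytes per line. */
struct limited_ptr_entry {
    uint16_t ptr[5];       /* up to 5 sharers, each a 10-bit core ID */
    uint8_t  count;        /* how many pointers are in use */
    uint8_t  overflow;     /* set once a 6th sharer shows up */
};

/* Once overflowed, the exact sharer set is lost: an invalidation must
 * be broadcast to every cache (or, LimitLESS-style, trap to software
 * and keep the rest of the list in main memory). */
int must_broadcast(const struct limited_ptr_entry *e)
{
    return e->overflow;
}
```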
Beyond simple directory coherence, we have some pretty cool on-chip coherence. The reason this is actually being studied a lot right now is that people built these massively parallel directory-based machines in the past; they got some use and were very good for some applications, but now we're starting to put more and more cores onto a given chip. So we're starting to see on-chip coherence, and people are figuring out how to leverage the fast on-chip communication alongside directories to make more interesting coherence protocols. There's also something called COMA systems, or Cache Only Memory Architectures, where instead of having the data in main memory, you don't have main memory at all and the directories move around. That's beyond the scope of this course — go look at the KSR-1 if you want to read about that kind of research — and scaling the sharer list is an active area as well.

Briefly, I wanted to talk about the most up-to-date versions of these things. We have the SGI UV 1000, which is a descendant of the Origin and Origin 2000 machines from SGI. Lots of cores here — 2,560 cores that are all kept cache coherent. They use a directory-based coherence protocol, it's very non-uniform, and it's all connected together by a multi-chassis 2D torus. This is one chassis, and there can actually be up to eight of these. Princeton has one of these in the HPCRC center that I think is four chassis, which is about half the maximum size.

[COUGH] An on-chip example here is the TILE64Pro, which has 64 cores. Each of the cores is itself a directory home, and it runs a directory-based protocol. And remember I was talking about dividing communication traffic into different flows — well, there are three different memory networks here, so we had to come up with three different classes of traffic that would not deadlock among themselves, if you will. There are also four memory controllers connected into the interconnect system, and because of this the communication latency differs: a core here talking to this memory controller is very fast, whereas that core there takes longer — but maybe that core is close to this other memory controller. So there's non-uniformity here.
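As a toy illustration of that non-uniformity, here's a sketch that picks the nearest of four memory controllers on an 8x8 mesh by hop count; the controller coordinates are invented, not TILE64Pro's actual floorplan.

```c
#include <stdlib.h>

struct xy { int x, y; };

/* Hypothetical positions of four memory controllers on an 8x8 mesh. */
static const struct xy mem_ctrl[4] = {
    {0, 3}, {3, 0}, {7, 4}, {4, 7}
};

/* On a mesh, latency tracks Manhattan distance (hop count). */
int hops(struct xy a, struct xy b)
{
    return abs(a.x - b.x) + abs(a.y - b.y);
}

/* Each core has one "close" controller and several far ones. */
int nearest_controller(struct xy core)
{
    int best = 0;
    for (int i = 1; i < 4; i++)
        if (hops(core, mem_ctrl[i]) < hops(core, mem_ctrl[best]))
            best = i;
    return best;   /* the fast controller for this core */
}
```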
Okay, so this is our last slide of the term here: Beyond ELE 475. If you want to go on and do more, well, start reading some papers from the different computer architecture conferences. The proceedings of the International Symposium on Computer Architecture, ISCA, are a good place to start — that's probably the top major architecture conference in the field. The International Symposium on Microarchitecture, MICRO, is the top microarchitecture conference, so look there if you're trying to look inside a processor at some of the smaller microarchitectural details. ASPLOS, Architectural Support for Programming Languages and Operating Systems, covers the area between software and hardware. And HPCA looks at — or used to look at — big, high-performance computer systems, but now a lot of general computer architecture work ends up in that conference as well. Beyond reading papers, go do research: build chips, build some test [INAUDIBLE] FPGA, learn more about parallel computer architecture. And indeed, you can come back in the Fall and take ELE 580A, which is going to be a graduate-level primary-sources paper-reading course — a more traditional graduate course where you'll learn about all the different parallel computing systems. It's called Parallel Computation because it covers both parallel computer architecture and the programming models that hook together with it — parallel programming together with the architectures, because they go very much hand in hand. So let's stop here for today, and stop here for the course.