So let's think about the snoopy protocols we've talked about, our bus-based protocols, and their performance and asymptotic performance requirements. What are the challenges of a snooping protocol? As we discussed before, as you add more processors to the system, you have more entities shouting on one shared medium, and you need to hear all of the shouting. You can't just ignore some shout, because you need to snoop it against your local cache. Whenever another core takes a cache miss, you need to snoop that against your cache and make sure you don't have a copy of that line, or, if you do, that you invalidate it, or perhaps write back some data, or make whatever other adjustment the invalidation-based coherence protocol requires.

So what's annoying about this: if you look at the amount of bandwidth you require on your bus, every cache miss needs to go across that bus, everyone needs to look at it, and everyone needs a snoop port with enough bandwidth to check all of those transactions against their cache. So if we want to keep up with the same amount of cache-miss bandwidth per core as we add more cores, the bus is going to have to grow as order N, where N is the number of processors. You can see this because, say, every core has the same cache-miss rate and you want to sustain it, so you just multiply by N; each core contributes that much traffic. Well, that's fine when N is eight, but if N goes to a thousand or a million, you're going to have some serious problems: that's a very, very big bus. And it's not just raw bandwidth; you also need to arbitrate for the bus somewhere, and you need atomic transactions going across that bus, so even if you have a very high-bandwidth bus, you may not have enough bus cycles to arbitrate and complete all of those transactions.
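To make that order-N growth concrete, here is a rough back-of-the-envelope sketch. The miss rate and line size are made-up numbers, not figures from the lecture; the point is only how the totals scale with the core count N.

```python
# Back-of-the-envelope sketch (made-up numbers) of why a snooped bus must
# scale with the core count N: every core's misses appear on the shared
# bus, and every other cache must snoop all of them against its tags.

line_bytes = 64                  # assumed cache line size
misses_per_sec_per_core = 10e6   # assumed per-core miss rate

def bus_bandwidth(n_cores):
    # Aggregate traffic the shared bus must carry: grows as O(N)
    return n_cores * misses_per_sec_per_core * line_bytes

def snoop_bandwidth_per_cache(n_cores):
    # Each cache must check every other core's misses
    return (n_cores - 1) * misses_per_sec_per_core * line_bytes

for n in (8, 1024, 1_000_000):
    print(f"N={n:>9}: bus {bus_bandwidth(n)/1e9:14.1f} GB/s, "
          f"snoop port {snoop_bandwidth_per_cache(n)/1e9:14.1f} GB/s")
```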
A solution to this is something we're going to call directory cache coherence, or directory protocols. The key idea in a directory protocol is that instead of broadcasting your invalidations to every other core, or every cache, in the system, you go talk to a location that we're going to call the directory. This directory is going to keep track of which caches have that data. What's nice about this is that when you take a cache miss, you can go ask the directory who has this cache line. And if only one other core has the cache line, say readable, and you're trying to take it into exclusive access because you want to write it, you only need to invalidate that one location instead of sending messages to all N processors in your system. So we've cut down what was a broadcast system into a point-to-point system. The overhead we have to pay now is that we need to track, in the directory, all of the caches which could have a particular cache line. We'll go through a much more detailed example of that later, but that's the overall key idea: this turns what was a broadcast into point-to-point communication, and we can use point-to-point interconnects for it.

Another good point for scalability is that you can actually have different directories; you don't have to have one big monolithic directory. Instead, you can segment the address space somehow, and depending on the address you're accessing, you go to a different directory. By spreading requests across these different directories, you can actually increase the bandwidth to your directories.

Okay, so let's see how this fits into a block diagram. We have CPUs trying to communicate with other CPUs via shared memory, and they check their own cache first. If the data is not in the cache, before, they would have had to cross a bus and everyone would have to look at that traffic. Instead, with directory cache coherence, the cache sends a message to the directory controller associated with that address. The basic directory controller here keeps track, for every single line in memory, of the list of other caches which could potentially have that piece of data, and we're going to call that the sharer list.
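To make the sharer list concrete, here is a minimal sketch of a directory entry and the invalidation fan-out on a write miss. This is not any real machine's protocol; it assumes a simple three-state entry and an explicit set of sharer IDs, and it ignores writebacks from an exclusive owner.

```python
# Minimal directory sketch: per memory line, the directory records a
# state plus the set of caches that may hold the line (the sharer list).
from dataclasses import dataclass, field

@dataclass
class DirEntry:
    state: str = "Uncached"                          # "Uncached", "Shared", "Exclusive"
    sharers: set[int] = field(default_factory=set)   # IDs of caches holding the line

class Directory:
    def __init__(self):
        self.entries = {}                            # line address -> DirEntry

    def entry(self, addr):
        return self.entries.setdefault(addr, DirEntry())

    def read_miss(self, addr, requester):
        e = self.entry(addr)
        # (A real protocol would first make an Exclusive owner write back.)
        e.state = "Shared"
        e.sharers.add(requester)
        return e.sharers

    def write_miss(self, addr, requester):
        e = self.entry(addr)
        # Point-to-point invalidations: only the recorded sharers get a
        # message, instead of a broadcast to all N caches.
        targets = e.sharers - {requester}
        e.state, e.sharers = "Exclusive", {requester}
        return targets                               # caches that must be invalidated

d = Directory()
d.read_miss(0x40, requester=0)
d.read_miss(0x40, requester=1)
print(d.write_miss(0x40, requester=2))               # -> {0, 1}: two invalidates, not N-1
```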
Now, if we look at this, it might still be a uniform communication network. Let's say you have some omega network in here: anyone can talk to anyone else, and the latency through it is fixed. So this is still a uniform memory access system; we didn't have to go non-uniform here. No cache is necessarily closer to or farther away from any piece of memory in a system like this. So this is kind of our naive directory cache coherence protocol. But what's still nice here is that we don't have to broadcast. With, say, an omega network, or a mesh network, or something else on the inside, we'd like to send from this cache directly to this directory controller. If no other cache has the line, readable or writable or anything like that, the directory can just respond back with the data from memory. If not, instead of broadcasting an invalidate to all other cores, the directory can just say: oh, this cache and this cache have copies; I need to send two messages, one to each of those caches, to invalidate them, wait for the responses, and then reply back with the data. So in the common case we can decrease the bandwidth we use across our interconnection network.

Now I'm going to show a slightly different picture, which is pretty similar to the previous one. What you'll notice is that the memory and the directory are now connected to an individual CPU. Why do we do this? Well, if you're building one of these scalable systems, some sort of supercomputer, it might be a good property that as you add more CPUs to the system, you also add more RAM, and maybe more directory storage. Another positive of a design like this is that each CPU is now actually close to its own memory bank, and we can try to take advantage of that.
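Once memory and directory storage are distributed per node like this, each physical address needs a "home" node whose directory tracks it. Here is a small sketch of one common way to pick the home: interleaving cache lines round-robin across nodes. The constants are assumptions for illustration, not values from the lecture.

```python
# Hedged sketch: one common way to segment the address space across
# per-node directories is to interleave by cache-line address.

LINE_BYTES = 64      # assumed cache line size
N_NODES = 6          # assumed node count (matches the six CPUs in the figure)

def home_node(addr: int, n_nodes: int = N_NODES) -> int:
    # Drop the offset within the line, then spread lines round-robin
    # across the nodes' memories and directories.
    return (addr // LINE_BYTES) % n_nodes

# A miss on this address is sent to this node's directory controller:
print(home_node(0x12345))    # -> some node index in [0, N_NODES)
```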
So one question comes up: how can we take advantage of that? Anyone have any thoughts? Okay, so for shared data, we don't know where it's going to be accessed; it could be accessed by all six CPUs and all six caches here. But it's very common that the stack for your program is only going to be accessed locally, and the instruction memory for your program is only going to be accessed locally. So you can potentially get performance benefits by putting the instructions and the stack, and maybe even some portion of the heap, close to this core, because then you can access them really quickly, and only shared data has to go across the interconnect.

In fact, that has a fancy name. Systems where some data is close and some data is far away are called Non-Uniform Memory Access, or NUMA, systems, and you might see this even in desktop processors, which are actually moving towards NUMA designs. I believe the AMD chips today are already NUMA systems, even on a single chip with multiple dies or something like that; there are actually two NUMA nodes inside of them, one for one memory controller and one for the other memory controller. So if you go into something like Linux and look around the proc filesystem, you can actually find the NUMA information, and it will tell you the configuration of the different memories. The OS can take advantage of this: it can put, for instance, the stack and the instruction memory for a program that's being run by a particular core close to that core, and then maybe place other data by some other policy.

Now, I want to make a point here: just because the latency to memory is different does not mean that your system is a directory-based cache-coherent NUMA system. You can still have non-directory-based systems where some memory is close and some memory is far away. You could still have, basically, a bus or some other interconnection network in there which is effectively a snooping protocol, but where some data is close and some data is far away. If you see this in the literature, people talking about directory-based cache-coherent NUMA systems will usually call them CC-NUMA, or cache-coherent NUMA, systems. That usually means a cache-coherent non-uniform memory access architecture, and it usually implies a directory-based protocol, though there may be other protocols that people use out there as well.
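As a quick aside on inspecting this yourself: on recent Linux kernels the per-node topology the lecture alludes to is typically exposed under /sys/devices/system/node rather than a dedicated /proc subdirectory; exact files vary by kernel version, so treat this as a sketch.

```python
# Hedged sketch: list NUMA nodes, their CPUs, and their memory sizes as
# exposed by a typical Linux kernel under /sys/devices/system/node.
import glob, os

for node in sorted(glob.glob("/sys/devices/system/node/node[0-9]*")):
    name = os.path.basename(node)
    try:
        with open(os.path.join(node, "cpulist")) as f:
            cpus = f.read().strip()
        mem = ""
        with open(os.path.join(node, "meminfo")) as f:
            for line in f:
                if "MemTotal" in line:
                    mem = " ".join(line.split()[-2:])   # e.g. "16318312 kB"
                    break
    except OSError:
        continue                                        # not a NUMA-aware kernel/node
    print(f"{name}: CPUs {cpus}, MemTotal {mem}")
```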
Okay, so I want to go back one slide, because I want to finish off by talking about one topology which is interesting. The difference between these two slides is that we went from a single CPU here to multiple CPUs, so this is a multi-core chip now. Where this gets interesting is that you might have a directory-based cache coherence system connecting multiple chips, but then inside of a chip you may have something like a bus-based snooping protocol. So we actually mix and match these two things. The way we go about doing this is that if cores inside of this one chip want to get data from each other, they can just effectively snoop on each other; but beyond the chip, your cache controller, or maybe the L3 cache for this particular chip, is going to respond to messages coming from the directories, like invalidation requests, and do something about them. So there's basically a transducer there between a directory-based cache coherence protocol and a bus-based snoopy protocol. And this is pretty common these days, especially given that you have a fair number of multi-core chips showing up and being used in these larger directory-based cache coherence systems. We'll talk about one of them at the end, actually: the SGI UV, or UV 1000, system, which uses off-the-shelf Intel parts, modern-day Core i7-class parts, mixed together with a NUMA, directory-based coherence system to connect all the chips together. So there's a transducer from the external snoop bus protocol to the directory-based coherence protocol.
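To illustrate the transducer idea, here is a purely conceptual sketch, not SGI's actual design: an invalidation arriving from the external directory protocol is turned into an on-chip snoop of every core's private cache.

```python
# Conceptual sketch of a two-level scheme: directory protocol between
# chips, snooping within a chip. The "transducer" turns an external
# directory invalidation into an on-chip snoop broadcast.

class Chip:
    def __init__(self, n_cores):
        # Each core's private cache modeled as the set of line addresses it holds
        self.private_caches = [set() for _ in range(n_cores)]

    def handle_directory_invalidate(self, addr):
        # Inside the chip we fall back on snooping: every core checks the
        # address against its own cache, as it would on a shared bus.
        for cache in self.private_caches:
            cache.discard(addr)       # invalidate the line if present
        return "inval_ack"            # acknowledgement back to the home directory

chip = Chip(n_cores=4)
chip.private_caches[2].add(0x80)
print(chip.handle_directory_invalidate(0x80))   # -> "inval_ack"
```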