So, today we're going to continue our adventure in computer architecture and talk more about parallel computer architecture.

Last time we talked about coherence, memory coherence and cache coherence systems, and we differentiated that from memory consistency models: a consistency model is a model of how memory is supposed to work, versus the underlying coherence algorithms that try to keep memory consistent and implement the consistency model.

We left off last time talking about MESI, also known as the Illinois protocol, and we walked through all of the different arcs through here. If you recall, what we did was split the shared state from the MSI protocol into two states, shared and exclusive. The insight here is that it's very common for programs to read a memory address, which pulls it into your cache, and then go modify that same memory address. For instance, if you want to increment a number, you do a load, which brings the value into your register set but also into your cache; you increment the number, and then you do a store back to the exact same location. That's pretty common in imperative programming languages. More declarative languages like Scheme may at times copy everything, but in imperative languages it's pretty common to actually change state in place (a short sketch of this pattern appears below). Because of that, you can bring the line right into this exclusive state, and then when you go to modify it you don't have to broadcast on the bus. Otherwise you would have to talk to everybody: you would effectively have to send an intent-to-write message across the bus and wait for that address to be snooped, or seen, by all the other entities on the bus.

Note that I say entities on the bus. We've been talking primarily about processors so far, but there can be other entities on the bus that want to snoop it. Examples sometimes include coherent IO devices.
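Coming back to the read-modify-write pattern mentioned above, here is a minimal sketch in C of why the exclusive state pays off; the function and variable names are made up for this example, and the interesting part is the cache-state commentary, not the code itself.

```c
/* Increment a counter in place, the common imperative pattern. */
void increment(int *counter)
{
    int tmp = *counter;  /* load: a read miss with no other sharers
                          * installs the line in Exclusive (E), not
                          * Shared (S), under MESI                    */
    tmp = tmp + 1;
    *counter = tmp;      /* store: the line is already in E, so it can
                          * be upgraded to Modified (M) silently; no
                          * intent-to-write has to be broadcast, unlike
                          * plain MSI where the read leaves the line in S */
}
```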
Now, coherent IO isn't very popular right now, but I think it will become much more popular as soon as we start to have GPUs, Graphics Processing Units or general-purpose GPUs, sitting effectively very close to our processor on the same bus and wanting to take part in the processor's coherence traffic. A GPU like that is going to want to read and write the same memory addresses that the processor is reading and writing, and take part in the cache coherence protocol. At a minimum, your IO devices usually need to tell the processor when they're doing a memory transaction that the processor should know about. So typically, when you're moving data from an IO device to main memory, that transfer is going to have to go across the bus, and everyone is going to have to invalidate their caches: they all have to snoop that memory traffic from the IO device.

So we had talked about MESI as an enhancement to MSI. We left off last time about to talk about two more enhancements that are pretty common. One has been used widely in AMD Opterons; I think AMD still uses this, or something similar to it, is my understanding. The idea is that you add an extra state here, called ownership, or the owned state. Effectively, this looks just like our MESI protocol from before, but now, instead of keeping data in the modified state, when, let's say, another processor needs to go access that data, you don't have to send all of that data back to main memory, write the line out, and fetch it back from main memory. Instead, you can do a direct cache-to-cache transfer. This is basically an optimization: you don't have to write the data back to main memory, and in fact you can allow main memory to be stale, and just transfer the data across the bus from the one cache to the cache which needs it.

So in this example here, we're going to look at this edge here. Another processor wants to read the data. We see that other processor's intent to read for a particular cache line, and our processor currently has that line in the modified state.
And we're actually going to provide the data out of our cache, not write it back to main memory, and transition the line in our cache to this owned state. The other processors can now take the data in, in the shared state, so they will have a read-only copy. Note this is only for read-only access; we'll talk about what happens if another processor wants to write the data in a second.

So we have the line in the owned state, and what we're trying to do here is have this processor track that the data needs to be written back to main memory at some point. That's the whole purpose of this state: we've basically designated one processor which owns the data, which owns the modified state. The processors which take it read-only get it in the shared state, and if they need to invalidate the line they don't need to contact anybody; because they hold a shared, read-only copy, they don't need to make any bus transactions. If you think about it, suppose one core reads this dirty state from the other core, and then at some point the line just gets invalidated in that second core. If main memory is not up to date, you lose the changes. So by one processor keeping the line in the owned state, it keeps track that, when the line eventually does get evicted or invalidated out of that processor's cache, it needs to be written out to main memory to keep memory up to date.

Now, there are a couple of other arcs here. You can transition from the owned state back to the modified state if the processor which has the line in the owned state wants to do a write. It can't write while it's in the owned state, because other processors may have shared copies of it. So when P1 wants to do a write here, it needs to invalidate everyone else's copies across the bus: it sends an intent-to-write for that line, everyone else snoops that traffic and transitions to the invalid state, and then this processor can transition to the modified state and actually modify the data (these transitions are sketched in code below). Okay.
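Here is a rough sketch of the owned-state transitions just walked through, written as a per-line state machine in C. The state and event names are invented for this example, and a real controller has many more cases (evictions, write-back queues, write misses, and so on); this only covers the handful of arcs discussed above.

```c
typedef enum { INVALID, SHARED, EXCLUSIVE, OWNED, MODIFIED } moesi_state_t;

/* React to a snooped bus request for a line we hold.  Returns 1 if this
 * cache must source the data itself (cache-to-cache transfer) rather
 * than letting main memory answer.                                      */
int snoop_request(moesi_state_t *st, int remote_intent_to_write)
{
    switch (*st) {
    case MODIFIED:
    case OWNED:
        /* We hold the only up-to-date copy, so we supply the data.  On a
         * remote read, M demotes to O (memory is stale; we now own the
         * write-back responsibility).  On a remote write we give up the
         * line entirely.                                                 */
        *st = remote_intent_to_write ? INVALID : OWNED;
        return 1;
    case EXCLUSIVE:
    case SHARED:
        /* Clean copies: drop them on a remote write, otherwise stay (or
         * become) shared.  No data needs to be supplied by this cache.   */
        *st = remote_intent_to_write ? INVALID : SHARED;
        return 0;
    default:
        return 0;   /* INVALID: nothing to do */
    }
}

/* Our own processor wants to write the line. */
void local_write(moesi_state_t *st, void (*broadcast_intent_to_write)(void))
{
    if (*st == SHARED || *st == OWNED)
        broadcast_intent_to_write();  /* others snoop this and invalidate */
    /* From EXCLUSIVE the upgrade is silent; from INVALID a real
     * controller would also have to fetch the data (a write miss).      */
    *st = MODIFIED;
}
```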
So we've got this arc here, which we sort of already talked about: if you're in the owned state, anyone else can get read-only, shared copies of the line. They can't get an exclusive copy, because that would violate this notion: they would then be able to upgrade to modified without telling anybody, and we don't want that. But they can get shared, read-only copies of the data. And then there's this arc here from owned to invalid, for when some other processor wants to write the data. Processor 1, P1 here, will see the intent to write from another processor, it will snoop that traffic, and at that point it will transition to the invalid state. Note that on this intent to write we may need to provide the data across the bus while we're in the owned state, because if we're the only cache that has the data, and the other processor is going straight into the modified state via a write miss, we're going to need to provide the data.

Okay, so, questions about MOESI so far? It's basically an extra optimization, because we don't have to write back to memory; we can transfer the data around. One cache can have a cache line in the owned state, and later some other cache can have the exact same cache line in the owned state, and it can bounce around without ever having to go out to main memory. This decreases our bandwidth out to the main memory system.

Okay. Then we're going to talk about MESIF, which is actually used in the Core i7, in the most up-to-date Intel processors. It looks very similar to MOESI, except we're going to see an extra little letter in this one bubble here. Effectively, what's going on is we add an extra state called the forward state. This is similar to the optimization we saw in MOESI, except the forwarding cache can't keep the data in a dirty, writable form; main memory stays up to date. What happens in this protocol is that the first cache which takes a read miss on a line of widely shared data is elected and gets the data in this forward state. Then, if other caches want read-only copies, they bring it in shared, and instead of those requests having to go out to main memory, the cache that has the line in the forward state is going to provide that data across the bus.
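Here is a minimal sketch of how a read miss might get serviced under this forward-state scheme, using the simple fall-back discussed next (if no cache holds the line in F, main memory answers). Only the read-shared cases are modeled; lines held modified or exclusive elsewhere are out of scope, and the structure and names are invented for illustration, not Intel's documented implementation.

```c
#define NCACHES 4   /* hypothetical small snooping system */

typedef enum { INV, SHR, EXC, FWD, MOD } mesif_state_t;

/* Returns the index of the cache that supplied the data for another
 * processor's read miss, or -1 if it had to come from main memory.   */
int service_read_miss(mesif_state_t state[NCACHES], int requester)
{
    for (int c = 0; c < NCACHES; c++) {
        if (c != requester && state[c] == FWD) {
            state[c] = SHR;           /* old forwarder keeps a shared copy */
            state[requester] = FWD;   /* newest reader takes over F        */
            return c;                 /* data moved cache-to-cache         */
        }
    }
    /* No forwarder (e.g. it evicted the line): memory, which is never
     * stale for shared lines in MESIF, supplies the data, and this
     * reader is elected as the forwarder for later read misses.        */
    state[requester] = FWD;
    return -1;
}
```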
So this is going to effectively decrease our bandwidth to main memory, by providing the data out of another processor's cache instead. Now, this is a little bit of a simplification. There is a question here: if you're in the forward state and you invalidate the data, who has it? Does anyone provide the data? There are roughly two choices. One choice is that no one has it in the forward state, so when a snoop request for the line comes along, it just has to go out to main memory. That's the easy case. The other case is that you could try to build a protocol where, when one cache invalidates its forward copy, it chooses another cache to take over. But probably the simplest thing to do is: when the forwarding core invalidates the data, for whatever reason, you just go back out to main memory, because there's always an up-to-date copy in main memory, since the caches are only keeping read-only copies.

Yeah, you're right, you're probably going to enter the exclusive state. That's a good question. I've read two different versions of this in different books, so I'm not quite sure; Intel doesn't really document what they do here. But you're probably right: if you're the only reader, you probably enter straight into the exclusive state. Then what happens is that where you would otherwise transition from E to S, you instead transition from E to F, and you end up in the F state. So the first cache that downgrades is always going to end up in the F state. But like I said, I've seen other references where people implement something similar with some election that figures out who the forwarding node is; probably the easiest thing to do is just to downgrade from E to F.

For the rest of the course, we're going to look at how to scale beyond these broadcast and invalidate protocols that have to snoop on a bus. So, some of the problems of building these
snooping systems: they really affect how you design your processor. First of all, you're going to have to add more bandwidth into your cache, or at least more bandwidth into your tag array. One choice is to dual-port your tags. Another choice is to steal cycles for snoops. What I mean by stealing cycles is that if there is a bus transaction happening and you need to check it against your tags, you block the main processor associated with that cache from accessing the cache that cycle; you generate a stall signal to the cache, or to the main pipe. One of the things that gets a little tricky here, and this will affect your design, is that if you have a multilevel cache, you usually want to put your L2 tag array on the bus and snoop against the L2 tags. But if a snoop hits there and you figure out that you have to invalidate something, you're going to have to invalidate down the entire cache hierarchy, all the way down to the level-one cache. So this can actually affect the throughput of your level-one cache, and it's also annoying to do, because it effectively has to reach down and touch the tag array of your L1 cache. And as I mentioned briefly last time, if you're thinking about something like an exclusive cache, where the L2 tags don't cover what's in the L1, you're going to have to check both tag arrays for every snoop transaction, and that can be pretty painful to do. Or you have to copy the L1 tags, but that's effectively the same thing as just having an inclusive cache, maybe with a little less data storage.

Okay, so what limits our performance? Why can't we just build 1,000 processors on a big bus? Well, it's the same idea as if you had 1,000 people in this room all trying to shout at each other at the same time. At some point you run out of bandwidth, and more importantly you need some way to coordinate them. Also, because you're required to serialize, the occupancy on the bus goes up. If you have one bus with two people talking on it, and they each talk 10% of the time, then you have a 20% utilized bus.
Well, all of a sudden, if you have ten people on this bus, you have a 100% utilized bus, and if you have 1,000 people, you have an oversubscribed bus. So you have to worry about both bandwidth and occupancy, because we do need to make these different bus transactions atomic; it's not quite just a bandwidth problem. What I mean by that is, you could make the bus wider to increase the bandwidth, but it's not going to solve our problems, because there's also an occupancy challenge: you effectively need atomic transactions to happen across the bus in order to keep the cache coherence protocol correct.

Okay, so before we move off this topic and on to the interconnection networks we're going to talk about today, I want to talk about one of the challenges that comes up in simple cache coherence systems, and that's false sharing. Caches like to track information at a particular block size. We've talked about caches which have 64-byte lines, or 64-byte block sizes, and they can be bigger or smaller than that. Now, one thing that's pretty unpleasant in these coherence protocols is this: let's say you take a piece of data which is shared, needs to be coherent between two different processors, and gets communicated relatively often, and you put some other piece of critical data right next to it, on the same cache line. All of a sudden, because they're packed into one cache line and we only track coherence information on a per-cache-line basis, whenever that first piece of data, let's say a four-byte integer holding a lock or something like that, gets bounced around between caches, you're going to bounce around the other data with it, even though that other four-byte integer is not shared at all. So this can hurt your common-case performance for non-shared data, purely because of the true sharing of the data packed next to it. And this is not something that typically happens in a uniprocessor cache system: there you bring the whole line in and you simply get spatial locality out of it. And if you
pump a line out, you can get conflict misses, which are sort of equivalent to this, but it's a little bit of a different idea: there the conflicting data is never in the same line. With false sharing, we do see this. Now, false sharing is interesting because people have come up with a whole host of techniques to avoid it. So, anyone have an idea, one really good technique to avoid false sharing? What we can do, and this is pretty common, is that either the programmer or the compiler detects that this is happening and pads the information out: waste memory around highly contended pieces of data, and co-locate them with nothing that is shared.

One of the better examples of why you really have to care about this is something like your stack. If you were to have, let's say, a lock on your stack, there's a lot of data there which you need to use often, and it's all local; stacks are local to their threads. But if you have some sort of variable that you pass to someone else, which is a struct, and inside that struct is a lock or something like that, all of a sudden you're going to be bouncing around a line which is part of your stack, and other people are going to be invalidating your stack. One way to solve this is that when you declare a lock, the compiler can sometimes recognize it, because in some languages you can designate memory addresses as locks with special keywords. When you do that, it will say: don't put this next to anything else, or maybe only co-locate this data with other locks, because those may have bad sharing performance anyway. Really, what you want here is not to have a false sharing case at all.

Now, the analog of false sharing is actually true sharing. There are cases where you have multiple pieces of data that are each genuinely, and widely, shared. An example of this is an array of locks, where different processors will be grabbing these locks more or less randomly. You can use techniques similar to the false-sharing ones here: you probably don't want all of those locks to be on the same cache line.
Because the locks are basically going to be bouncing around, and everyone is going to be contending for that one cache line, trying to get it into the modified, M, state in their own cache. So what you can think about doing is applying the same technique and putting each of those locks on a separate cache line, as sketched below. Okay, so let's switch gears here.
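As referenced just above, here is a minimal sketch of the padding trick in C, assuming a 64-byte cache line and POSIX threads; `CACHE_LINE`, `padded_lock_t`, and `lock_table` are names made up for this example.

```c
#include <stdalign.h>
#include <pthread.h>

#define CACHE_LINE 64   /* assumed line size; query the real hardware value */

/* Give each highly contended lock its own cache line, so the line that
 * bounces between caches whenever the lock is taken never drags any
 * innocent, otherwise-private data along with it.                       */
typedef struct {
    alignas(CACHE_LINE) pthread_mutex_t mutex;
} padded_lock_t;

_Static_assert(sizeof(padded_lock_t) % CACHE_LINE == 0,
               "each lock occupies a whole number of cache lines");

/* The same idea applied to an array of locks: processors grabbing
 * different locks now contend for different cache lines.  Each mutex
 * still needs pthread_mutex_init() before use.                         */
padded_lock_t lock_table[16];
```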