We move on to our final topic of ELE 475, which is directory-based cache coherence. It's a little bit of a warm-up here. We remember the three Cs of caches: we had compulsory misses, capacity misses, and conflict misses. Well, we're going to think about adding a new miss type here, a coherence miss. A coherence miss is some other cache, or some other entity, reaching down into our cache and invalidating something there. So this is strictly different from compulsory, capacity, and conflict. If anything it looks closest to a compulsory miss, because it's effectively like a first miss, except some other entity bumped the data out of your cache. But it's not any of those either. So there's this communication coming from other cores, that is, in a snooping protocol or a symmetric shared-memory multiprocessor, that other traffic comes in and will actually bump things out of your cache. You need to worry about this. Now we're going to take these coherence misses and put them into two categories. True sharing, which we talked about briefly at the end of a couple of lectures ago. We're going to categorize this as two different categories of misses.
True sharing misses are the ones where, if you were to have a cache whose block size or cache line size were exactly one byte, the minimal size you can have on your machine, you would still have that miss. So that's a true miss. A true miss happens when you're actually sharing data. If one processor writes some data, and another processor wants to go read that data and needs to pull it into its cache, that's a true sharing miss. You need to do that communication. Now, to contrast that, we talk about false sharing, or false sharing misses. A false sharing miss is one that goes away if you were to reduce the sharing size, the block size, down to say one byte or one word. If you run the same program, and the miss occurs with the larger block size but not when the block size is one, then that is a false sharing miss. The block size was too big, and you had two pieces of data in the same cache line that are effectively causing sharing, even though there was no true sharing of the data going on. It just so happens they're packed into the same cache line.
Now, it's a little bit more than that. We're also going to say that false sharing can happen when data gets moved around or gets invalidated but is not being communicated; it may be shared later in the program, but that exact miss was not because of data being communicated. So it's a little bit broader than what we said last time. It does still happen because there are two pieces of data packed into the same line, but what I'm trying to get at here is that you could have data bouncing around between different caches, where the same instruction sequence, the same load and store sequence, would not cause the miss if you had a very small cache line size, but does cause it with a large cache line size. So let's look at an example and try to categorize these different misses. Let's start off here. The initial conditions are that X1 and X2, which are two words of data, are packed into the same cache block, the same cache line. P1 and P2 have both read the data, and it's readable in both caches at the beginning of time. Okay, so all of a sudden we do a write to X1. What do we have to do? Well, we're going to have to invalidate X1 in P2. And this is a true miss, because the data was in both caches. We need to pull it out of one of them; we need to actually invalidate it.
Because this is actual real data. Okay. So the next thing that happens is P2 goes and executes a read of X2. Well, what you may notice here is that at the beginning of time X2 was in the cache of P2, but it got bumped out here. P1 never wanted to write X2, so this is a false sharing miss. The line got pulled exclusive into P1's cache, and this read is going to pull it out of that cache. So we call this a false miss, because X1 was irrelevant to P2 for this miss. Okay, so now we see another write to X1. Well, P2 didn't actually touch X1, so likewise this is completely false sharing. Now we see a write to X2. Well, we didn't see any communication of that data going on here, so this is also a false sharing miss. Now, finally, we have something that is real here. We're going to read X2. When we wrote X2 there and we read X2 here, we are actually communicating data. So this is true sharing. And that's okay, but we want to try to minimize these false sharing patterns. This is just a warm-up, to motivate us into directory-based coherence a little bit. Okay, so let's motivate this a little bit more, and let's look at something like an online transaction processing workload. This is a database workload, a multiprocessor database workload.
It's using threads. And what we're going to see here is that we run the same workload on a four-processor system with four different cache sizes. This data is taken from a paper referenced in your book. What you'll notice is that as you increase the cache size, our false sharing and our true sharing misses don't really change. You still need to communicate data, and you're still going to get false sharing. Making the cache bigger didn't change the block size, so you're still going to get the same false sharing patterns. But as you increase the cache size, the instruction misses, the capacity misses, the conflict misses, and the compulsory misses, which we call cold misses here, go down, because the cache is getting bigger. So for the non-shared cache lines, the number of memory cycles, the amount of time you spend on memory misses, is going down. But the rest of this is not changing. Hmm. Well, okay, this is interesting, so the second question that comes up is: what happens if we increase the number of cores in our system? This is a relatively small system here. Let's plot the number of cores from one to eight, with the same workload, and look what happens.
Something else is invariant here. This is for a fixed cache size; we're going to plot the number of processors down here now. The number of memory cycles per instruction for instruction misses, capacity misses, conflict misses, and compulsory misses doesn't change; it stays the same, because it's basically uniprocessor behavior. But as you add more cores, you get both more true sharing and more false sharing. Hmm, well, this is a little scary, because our performance is basically going down as we add more cores. And this is only up to eight; what happens if we're way out here at 100 cores? What do we think is going to happen? Well, our performance is probably going to be dominated by the true sharing and the false sharing and these misses. Huh. Well, we're going to start thinking about that one, figure out how to solve it, and think about the scalability of coherence systems.