Okay, so now we get to talk about issues of sequential consistency and how to actually go about implementing these things.

So going back to our valid interleavings, we start to ask ourselves: how do we build a sequentially consistent model with out-of-order processors, or out-of-order execution of memory? Well, we already said in our out-of-order system that we can reorder loads to different addresses. We can reorder loads to the same address; it doesn't really matter. Well, if you go and do that, all of a sudden it just breaks sequential consistency. We said we could reorder a store and a load in our out-of-order processor if they were to different addresses. We said we could reorder a load and a store the other way. We said we could reorder stores with stores. So out-of-order memory systems and out-of-order processors effectively break sequential consistency. So how do you implement a processor which is useful and is out of order? And we do want to go out of order for performance reasons.

Caches also start to become a big problem here for performance. If we want true sequential consistency, you're going to have to somehow figure out how to not store any data in your cache. Because the second you store data in your cache — let's say it's a write-back cache — if you do a store to your local cache and you write a one to it, and some other processor tries to read it, well, it's not going to be able to see that data. So we need to be very careful here about when data is actually visible to other processors if you have a cache involved in a multiprocessor system.

So we're going to move towards somewhat weaker, or relaxed, memory models. As I said last class, almost no processor actually implements a truly sequentially consistent memory system. They're all weaker than that. Now the question is how weak you want to go, and this starts to come up because it's a trade-off between what the programmer can wrap their head around versus, effectively, performance in an out-of-order system.
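To make that store/load reordering problem concrete, here is a small sketch — not from the lecture — of the classic "store buffering" litmus test, written with C++11 atomics. The names x, y, r1, and r2 are made up for illustration, and relaxed atomics are used so the compiler and hardware are allowed exactly the reorderings being discussed.

```cpp
#include <atomic>
#include <thread>
#include <cstdio>

// Store-buffering litmus test. Under sequential consistency at least one
// thread must observe the other's store, so r1 == 0 && r2 == 0 is impossible.
// If each thread's store can be reordered with its later load to a
// *different* address, both loads can return 0.
std::atomic<int> x{0}, y{0};   // relaxed atomics: no ordering enforced
int r1, r2;

void thread0() {
    x.store(1, std::memory_order_relaxed);    // store to x
    r1 = y.load(std::memory_order_relaxed);   // load from y (may be hoisted above the store)
}

void thread1() {
    y.store(1, std::memory_order_relaxed);    // store to y
    r2 = x.load(std::memory_order_relaxed);   // load from x (may be hoisted above the store)
}

int main() {
    std::thread t0(thread0), t1(thread1);
    t0.join(); t1.join();
    // Sequential consistency allows 01, 10, or 11 here -- never 00.
    // On a machine that reorders a store with a later load, 00 can appear.
    std::printf("r1=%d r2=%d\n", r1, r2);
}
```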
So what we can do is think about going to weaker memory systems, and when we do that, we are going to make the programmer decide when a load and a store can be reordered, or when a store and a store can be reordered, or when a load and a load can be reordered. So let's look at these four different cases independently: reordering a load with a load, reordering a load with a store after it, reordering a store with a load after it, and reordering two stores. And note we're not talking about the same memory addresses here at all; we're talking about two different addresses. But sequential consistency says that even for different addresses we need to worry about this.

So how many of these extra dependencies can we remove? We're going to introduce this notion of a memory fence, or what we sometimes call a memory barrier operation, which is a serialization point — well, let me be careful what I say there. A memory fence is actually a serialization operation which says that all of the memory accesses, or at least the named memory accesses, before a certain point have completed before the memory fence completes, and that all of the operations after it have not started or completed before the memory fence completes. So this really is a barrier on your memory operations. The reason I'm being a little bit hesitant here is that the basic memory fence says all memory operations before it complete before the fence completes, and none after it have started when the fence completes. But people have introduced even weaker versions of the memory fence over time, of course, for performance. You start to see things like load-only memory fences, which act as a barrier against all the previous loads. Or you have one-directional memory fences, which say: only look at the stores that happen after this point, and don't allow any of them to happen before this point. So people have actually introduced instructions in modern architectures that have all of these things. And if you actually want to go look at a really weak memory model, go read about the Itanium memory model — it's extremely weak, and code written for it needs fences all over the place.
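The lecture describes these fences abstractly; as a rough sketch of how they surface to a programmer, C++11's std::atomic_thread_fence offers both a full fence and weaker one-directional flavors. The function names below are illustrative, and the mapping onto the lecture's "load-only" and "store-only" fences is approximate rather than exact.

```cpp
#include <atomic>

// Full fence: memory operations before it complete before any memory
// operation after it starts (the "basic memory fence" described above).
void full_barrier() {
    std::atomic_thread_fence(std::memory_order_seq_cst);
}

// Weaker, one-directional variants:
void load_barrier() {
    // Acquire fence: later reads and writes may not be reordered before
    // atomic loads that precede this fence.
    std::atomic_thread_fence(std::memory_order_acquire);
}

void store_barrier() {
    // Release fence: earlier reads and writes may not be reordered after
    // atomic stores that follow this fence.
    std::atomic_thread_fence(std::memory_order_release);
}
```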
And the trade-off here, the performance trade-off, is how many fences you need to put in versus the performance you get from reordering memory operations.

So let's name a few of these things, because everyone likes naming. These started as computer architecture research names, but they're common across the industry at this point. One of them is called total store ordering. Total store ordering is not a big step away from sequential consistency. In total store ordering, loads do not get reordered with other loads, loads do not get reordered with other stores, and stores don't get reordered with other stores. But a store followed by a load can be reordered — a load can be moved above a store here. And if you want to enforce the order between the store and the load, you need to write it explicitly: you do a store word, then a fence, then a load word, and that guarantees the load is going to happen after that store in the processor.

Partial store ordering is a little bit weaker than that. In partial store ordering, loads with other loads don't get reordered, and loads with other stores don't need fences, but stores followed by loads — which is up here — and stores followed by stores also need fences to guarantee ordering.

And then finally we get this class of weak ordering, and frankly, most processors these days actually fall into some category of weak ordering — there are different questions about exactly how weak, but we're not going into that level of detail here. And here you actually need a fence to enforce all of these orderings, so it's kind of interesting to think about.

The nice thing about fences is that while these operations are expensive and are big serialization points, you only need to pay for them when you care about the ordering. When you don't care about the ordering, the computer architecture can reorder things and make everything go faster. It's only exactly when you care about it that you need to introduce the fences. One thing I was going to say is that some compilers will do some of this work for you and introduce the fences for you, and usually one of the ways that's done is when they see an atomic operation.
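To sketch why an atomic operation can stand in for an explicit fence, here is a minimal spin lock built on an atomic test-and-set, written with C++ atomics. The names lock and shared_counter are made up for illustration; this is just one way the idea can look, not the lecture's code.

```cpp
#include <atomic>

std::atomic_flag lock = ATOMIC_FLAG_INIT;
int shared_counter = 0;   // illustrative shared data protected by the lock

void locked_increment() {
    // The atomic test-and-set acquires the lock and, with acquire ordering,
    // also acts as a fence: accesses inside the critical section cannot
    // float up above it.
    while (lock.test_and_set(std::memory_order_acquire)) {
        /* spin until the lock is free */
    }
    ++shared_counter;                          // ordered after the acquire
    lock.clear(std::memory_order_release);     // release: the increment cannot sink below this
}
```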
So if a compiler sees something like a lock operation, or a P or a V, in your code, that's sort of a special way to signal back to the compiler that it needs to make sure the memory operations have all been serialized at that point. Usually atomic operations like test-and-set are also fences in their own right, which helps, but sometimes compilers can even figure out where other fences are needed for other memory addresses.

And one of the things that I want to say here is that a fence doesn't actually take an argument. It's not going to take an address to fence on, and the reason for this is that what we really want is to make sure all the previous memory operations are ordered — in the question of consistency we haven't said anything about addresses. We just said that previous memory operations in one thread don't move after a subsequent memory operation.

Okay, so let's look at an example of where we might need to introduce different levels of fences. Here we have our producer-consumer model. One of the things we were worried about in our original examples was these two stores getting reordered. Well, we can force that ordering by putting a memory fence operation in here and saying it's a store-store fence. So no stores can pass any other stores, and this will guarantee that these happen in order. But we don't really care about this other pair — well, this one actually already has an arc, so that's not going to get reordered, because this register is dependent there. But let's go over here and take a look. If we have full sequential consistency, we would not be able to reorder these two loads. Well, for performance maybe we want to reorder those loads, because one of the input registers isn't available — it's the result of a long multiply, for instance, or something like that. So you might want to reorder them in your out-of-order processor for performance reasons. In true sequential consistency you wouldn't be able to do that. But with this weaker memory model we can reorder them, and then guarantee correctness by introducing a fence operation here. And this fence operation is going to say: these loads need to complete before this load starts.
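As a rough sketch of where those two fences sit in the producer-consumer pattern, here is a one-slot, single-producer / single-consumer version in C++ atomics. The names buffer, head, and tail are illustrative, not the lecture's exact code, and a real queue would also check that the slot is free before the producer overwrites it.

```cpp
#include <atomic>

int buffer;                        // the data slot
std::atomic<int> head{0}, tail{0}; // publish/consume counters

void producer(int value) {
    buffer = value;                                        // store the data
    std::atomic_thread_fence(std::memory_order_release);   // store-store fence: data is visible before head moves
    head.store(head.load(std::memory_order_relaxed) + 1,
               std::memory_order_relaxed);                 // publish by bumping head
}

int consumer() {
    int t = tail.load(std::memory_order_relaxed);
    while (head.load(std::memory_order_relaxed) == t) {    // spin until the producer publishes
        /* wait */
    }
    std::atomic_thread_fence(std::memory_order_acquire);   // load-load fence: the head/tail loads complete before the data load
    int value = buffer;                                     // now safe to read the data
    tail.store(t + 1, std::memory_order_relaxed);
    return value;
}
```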
So, putting it together, you can sort of guarantee that if our head equals our tail, you know at that point that you're safe to go execute the rest of this code here.

Any questions about that so far, about the fences and why it's kind of a pay-as-you-go scheme, if you will?