Okay, so now we get to talk about issues of sequential consistency and how to actually go about implementing these things.

So going back to our valid interleavings, we start to ask ourselves: how do we build a sequentially consistent model with out-of-order processors, or out-of-order execution of memory? Well, we already said in our out-of-order system that we can reorder loads to different addresses. We can reorder loads to the same address; it doesn't really matter. Well, if you go and do that, all of a sudden it just breaks sequential consistency. We said we could reorder a store and a load in our out-of-order processor if they were to different addresses. We said we could reorder a load and a store the other way. We said we could reorder stores with stores. So out-of-order memory systems and out-of-order processors effectively break sequential consistency. So how do you implement a processor which is useful and is out of order? And we do want to go out of order for performance reasons.

Caches also start to become a big problem here for performance. If we want true sequential consistency, you're going to have to somehow figure out how to not store any data in your cache. Because the second you store data in your cache — let's say it's a write-back cache — if you do a store to your local cache and you write a one to it, and some other processor tries to read it, well, it's not going to be able to see that data. So we need to be very careful here about when data is actually visible to other processors if you have a cache involved in a multiprocessor system.

So we're going to move towards somewhat weaker, or relaxed, memory models. As I said last class, almost no processor actually implements a truly sequentially consistent memory system. They're all weaker than that. Now the question is how weak you want to go, and this starts to come up because it's a trade-off between what the programmer can wrap their head around versus, effectively, performance in an out-of-order system.
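To make that store/load reordering problem concrete, here is a small sketch — not from the lecture — of the classic "store buffering" litmus test, written with C++11 atomics. The names x, y, r1, and r2 are made up for illustration, and relaxed atomics are used so the compiler and hardware are allowed exactly the reorderings being discussed.

```cpp
#include <atomic>
#include <thread>
#include <cstdio>

// Store-buffering litmus test. Under sequential consistency at least one
// thread must observe the other's store, so r1 == 0 && r2 == 0 is impossible.
// If each thread's store can be reordered with its later load to a
// *different* address, both loads can return 0.
std::atomic<int> x{0}, y{0};   // relaxed atomics: no ordering enforced
int r1, r2;

void thread0() {
    x.store(1, std::memory_order_relaxed);    // store to x
    r1 = y.load(std::memory_order_relaxed);   // load from y (may be hoisted above the store)
}

void thread1() {
    y.store(1, std::memory_order_relaxed);    // store to y
    r2 = x.load(std::memory_order_relaxed);   // load from x (may be hoisted above the store)
}

int main() {
    std::thread t0(thread0), t1(thread1);
    t0.join(); t1.join();
    // Sequential consistency allows 01, 10, or 11 here -- never 00.
    // On a machine that reorders a store with a later load, 00 can appear.
    std::printf("r1=%d r2=%d\n", r1, r2);
}
```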
So what we can do is think about going to weaker memory systems, and when we do that, we are going to make the programmer decide when a load and a store can be reordered, or when a store and a store can be reordered, or when a load and a load can be reordered. So let's look at these four different cases independently: reordering a load with a load, reordering a load with a store after it, reordering a store with a load after it, and reordering two stores. And note we're not talking about the same memory addresses here at all; we're talking about two different addresses. But sequential consistency says that even for different addresses we need to worry about this.

So how many of these extra dependencies can we remove? We're going to introduce this notion of a memory fence, or what we sometimes call a memory barrier operation, which is a serialization point — well, let me be careful what I say there. A memory fence is actually a serialization operation which says that all of the memory accesses, or at least the named memory accesses, before a certain point have completed before the memory fence completes, and that all of the operations after it have not started or completed before the memory fence completes. So this really is a barrier on your memory operations. The reason I'm being a little bit hesitant here is that the basic memory fence says all memory operations before it complete before the fence completes, and none after it have started when the fence completes. But people have introduced even weaker versions of the memory fence over time, of course, for performance. You start to see things like load-only memory fences, which act as a barrier against all the previous loads. Or you have one-directional memory fences, which say: only look at the stores that happen after this point, and don't allow any of them to happen before this point. So people have actually introduced instructions in modern architectures that have all of these things. And if you actually want to go look at a really weak memory model, go read about the Itanium memory model — it's extremely weak, and code written for it needs fences all over the place.
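The lecture describes these fences abstractly; as a rough sketch of how they surface to a programmer, C++11's std::atomic_thread_fence offers both a full fence and weaker one-directional flavors. The function names below are illustrative, and the mapping onto the lecture's "load-only" and "store-only" fences is approximate rather than exact.

```cpp
#include <atomic>

// Full fence: memory operations before it complete before any memory
// operation after it starts (the "basic memory fence" described above).
void full_barrier() {
    std::atomic_thread_fence(std::memory_order_seq_cst);
}

// Weaker, one-directional variants:
void load_barrier() {
    // Acquire fence: later reads and writes may not be reordered before
    // atomic loads that precede this fence.
    std::atomic_thread_fence(std::memory_order_acquire);
}

void store_barrier() {
    // Release fence: earlier reads and writes may not be reordered after
    // atomic stores that follow this fence.
    std::atomic_thread_fence(std::memory_order_release);
}
```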
And the trade-off here, the performance trade-off, is how many fences you need to put in versus the performance you get from reordering memory operations.

So let's name a few of these things, because everyone likes naming. These started as computer architecture research names, but they're common across the industry at this point. One of them is called total store ordering. Total store ordering is not a big step away from sequential consistency. In total store ordering, loads do not get reordered with other loads, loads do not get reordered with other stores, and stores don't get reordered with other stores. But a store followed by a load can be reordered — a load can be moved above a store here. And if you want to enforce the order between the store and the load, you need to write it explicitly: you do a store word, then a fence, then a load word, and that guarantees the load is going to happen after that store in the processor.

Partial store ordering is a little bit weaker than that. In partial store ordering, loads with other loads don't get reordered, and loads with other stores don't need fences, but stores followed by loads — which is up here — and stores followed by stores also need fences to guarantee ordering.

And then finally we get this class of weak ordering, and frankly, most processors these days actually fall into some category of weak ordering — there are different questions about exactly how weak, but we're not going into that level of detail here. And here you actually need a fence to enforce all of these orderings, so it's kind of interesting to think about.

The nice thing about fences is that while these operations are expensive and are big serialization points, you only need to pay for them when you care about the ordering. When you don't care about the ordering, the computer architecture can reorder things and make everything go faster. It's only exactly when you care about it that you need to introduce the fences. One thing I was going to say is that some compilers will do some of this work for you and introduce the fences for you, and usually one of the ways that's done is when they see an atomic operation.
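To sketch why an atomic operation can stand in for an explicit fence, here is a minimal spin lock built on an atomic test-and-set, written with C++ atomics. The names lock and shared_counter are made up for illustration; this is just one way the idea can look, not the lecture's code.

```cpp
#include <atomic>

std::atomic_flag lock = ATOMIC_FLAG_INIT;
int shared_counter = 0;   // illustrative shared data protected by the lock

void locked_increment() {
    // The atomic test-and-set acquires the lock and, with acquire ordering,
    // also acts as a fence: accesses inside the critical section cannot
    // float up above it.
    while (lock.test_and_set(std::memory_order_acquire)) {
        /* spin until the lock is free */
    }
    ++shared_counter;                          // ordered after the acquire
    lock.clear(std::memory_order_release);     // release: the increment cannot sink below this
}
```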
So if a compiler sees something like a lock operation, or a P or a V, in your code, that's sort of a special way to signal back to the compiler that it needs to make sure the memory operations have all been serialized at that point. Usually atomic operations like test-and-set are also fences in their own right, which helps, but sometimes compilers can even figure out where other fences are needed for other memory addresses.

And one of the things that I want to say here is that a fence doesn't actually take an argument. It's not going to take an address to fence on, and the reason for this is that what we really want is to make sure all the previous memory operations are ordered — in the question of consistency we haven't said anything about addresses. We just said that previous memory operations in one thread don't move after a subsequent memory operation.

Okay, so let's look at an example of where we might need to introduce different levels of fences. Here we have our producer-consumer model. One of the things we were worried about in our original examples was these two stores getting reordered. Well, we can force that ordering by putting a memory fence operation in here and saying it's a store-store fence. So no stores can pass any other stores, and this will guarantee that these happen in order. But we don't really care about this other pair — well, this one actually already has an arc, so that's not going to get reordered, because this register is dependent there. But let's go over here and take a look. If we have full sequential consistency, we would not be able to reorder these two loads. Well, for performance maybe we want to reorder those loads, because one of the input registers isn't available — it's the result of a long multiply, for instance, or something like that. So you might want to reorder them in your out-of-order processor for performance reasons. In true sequential consistency you wouldn't be able to do that. But with this weaker memory model we can reorder them, and then guarantee correctness by introducing a fence operation here. And this fence operation is going to say: these loads need to complete before this load starts.
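As a rough sketch of where those two fences sit in the producer-consumer pattern, here is a one-slot, single-producer / single-consumer version in C++ atomics. The names buffer, head, and tail are illustrative, not the lecture's exact code, and a real queue would also check that the slot is free before the producer overwrites it.

```cpp
#include <atomic>

int buffer;                        // the data slot
std::atomic<int> head{0}, tail{0}; // publish/consume counters

void producer(int value) {
    buffer = value;                                        // store the data
    std::atomic_thread_fence(std::memory_order_release);   // store-store fence: data is visible before head moves
    head.store(head.load(std::memory_order_relaxed) + 1,
               std::memory_order_relaxed);                 // publish by bumping head
}

int consumer() {
    int t = tail.load(std::memory_order_relaxed);
    while (head.load(std::memory_order_relaxed) == t) {    // spin until the producer publishes
        /* wait */
    }
    std::atomic_thread_fence(std::memory_order_acquire);   // load-load fence: the head/tail loads complete before the data load
    int value = buffer;                                     // now safe to read the data
    tail.store(t + 1, std::memory_order_relaxed);
    return value;
}
```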
So, putting it together, you can sort of guarantee that if our head equals our tail, you know at that point that you're safe to go execute the rest of this code here.

Any questions about that so far, about the fences and why it's kind of a pay-as-you-go scheme, if you will?