1 00:00:03,320 --> 00:00:09,594 Okay. So, let's introduce a model of, and 2 00:00:09,594 --> 00:00:15,368 introduce some semantics to describe ordering of memory operations between 3 00:00:15,368 --> 00:00:20,064 different processors. One of the most common ones that you're 4 00:00:20,064 --> 00:00:25,991 going to see is called Sequential Consistencies by gnoming the only one out 5 00:00:25,991 --> 00:00:29,400 there. This is a very strong ordering. 6 00:00:29,400 --> 00:00:35,631 compared to any of the processors you guys run on your computers today, this is 7 00:00:35,631 --> 00:00:39,150 extremely strong. None of your computers actually implement 8 00:00:39,150 --> 00:00:43,026 this strong of an ordering. But, it's a good basis to think reason 9 00:00:43,026 --> 00:00:47,559 about and reason about parallel programs. So, we're going to study sequential 10 00:00:47,559 --> 00:00:50,541 consistency sums. So, up here, we have one, two, three, 11 00:00:50,541 --> 00:00:54,060 four, five processors, one big memory, still very basic model. 12 00:00:54,060 --> 00:00:58,891 We don't have, we've not introduced any caches or anything crazy like that yet 13 00:00:58,891 --> 00:01:02,231 because that makes all these things a little bit harder. 14 00:01:02,231 --> 00:01:08,995 Either caches are out of order. And, what sequential consistency is, is 15 00:01:08,995 --> 00:01:15,294 the idea that you take all of the instructions and all of the programs 16 00:01:15,294 --> 00:01:22,999 executing on all of the processors, and you guarantee that the execution 17 00:01:22,999 --> 00:01:33,037 sequence that is seen by all of the processors is some valid in order 18 00:01:33,037 --> 00:01:41,057 interleaving of those instructions of the relative processors themselves at the 19 00:01:41,057 --> 00:01:43,920 beginning. So, to give you an example. 20 00:01:45,240 --> 00:01:56,403 If you have processor one, [SOUND] [COUGH] and processor two. 21 00:01:56,403 --> 00:02:00,382 And processor one goes one, two, three, four. 22 00:02:00,382 --> 00:02:06,920 And processor two goes execution instruction five, six, seven, eight, 23 00:02:08,200 --> 00:02:13,620 you're guaranteed that some intervening here is what happened in sequential 24 00:02:13,620 --> 00:02:17,120 consistency. So, let's say, 25 00:02:17,120 --> 00:02:20,260 one and two happens. And then, five happens, 26 00:02:20,260 --> 00:02:23,550 and then three happens, and then six happens, 27 00:02:23,550 --> 00:02:27,740 and then seven, and then eight, and then four. 28 00:02:27,740 --> 00:02:31,120 That is a valid sequential consistent order. 29 00:02:31,120 --> 00:02:38,452 Likewise, another one is easier to reason out is one, two, three, four, five, six, 30 00:02:38,452 --> 00:02:43,754 seven, eight. That's valid in sequential consistent, 31 00:02:43,754 --> 00:02:46,720 sequential consistency. Likewise, 32 00:02:46,720 --> 00:02:59,555 I'm going to say five, six, seven, one, two, three, four, eight is valid order. 33 00:02:59,555 --> 00:03:11,840 What is not valid is going to be something like [SOUND] five, one, 34 00:03:11,840 --> 00:03:21,120 I don't know. Three, two, four, seven, eight. 35 00:03:22,520 --> 00:03:26,440 I see it's missing one. five. 36 00:03:26,440 --> 00:03:29,207 No, five, six, no, six. Okay. 37 00:03:29,207 --> 00:03:36,037 We'll put six in there. Because we've reordered one of the 38 00:03:36,037 --> 00:03:42,120 individual address sequences, two and three should be ordered. 39 00:03:42,120 --> 00:03:47,720 So, this gives us some guarantees. 40 00:03:47,720 --> 00:03:53,485 so the, the basic idea is that, it's an arbitrary, order-preserving 41 00:03:53,485 --> 00:03:58,936 interweaving. So, each processors addresses' are preserved in order, but 42 00:03:58,936 --> 00:04:03,920 can be interweaved in between the different threads arbitrarily. 43 00:04:05,620 --> 00:04:09,078 Okay. I want that to sink in for a second. 44 00:04:09,078 --> 00:04:14,813 Anyone have questions about what the model is? Or, why this is a good model? 45 00:04:14,813 --> 00:04:31,839 [SOUND] Okay. I'm going to, I'm going to ask a few questions here. 46 00:04:31,839 --> 00:04:36,220 Why is this a model that almost no one ever implements? 47 00:04:38,460 --> 00:04:41,420 48 00:04:45,460 --> 00:04:50,054 So, the reason no one actually goes to try to do this is as all the things we 49 00:04:50,054 --> 00:04:55,006 talked about when we try out of our processors, you want to try to re-order 50 00:04:55,006 --> 00:04:58,264 your lows in stores for performance reasons. 51 00:04:58,264 --> 00:05:03,811 and the second, you try to introduce something like cache into your processor, 52 00:05:03,811 --> 00:05:08,527 this gets very hard because communication becomes a real bottleneck. 53 00:05:08,527 --> 00:05:14,075 You might have one processor that did a, for instance, a store into its cache and 54 00:05:14,075 --> 00:05:19,622 not everyone has seen that value update. So, unless you actually want to have one 55 00:05:19,622 --> 00:05:24,823 monolithic memory that you play all loads and stores against in some interleaved 56 00:05:24,823 --> 00:05:31,202 yet in order to the processor order, and everyone sees that exact ordering, 57 00:05:31,202 --> 00:05:35,680 you're not going to want to actually implement that. That's very hard to do. 58 00:05:35,680 --> 00:05:43,138 Okay. So, [SOUND] fun, funny example here of 59 00:05:43,138 --> 00:05:49,920 sequential consistency. Let's look at two threads, T1 and T2. 60 00:05:49,920 --> 00:05:54,522 And two variables, actually, we'll talk four variables here. 61 00:05:54,522 --> 00:06:00,300 X, Y, X prime, and Y prime. X prime and Y prime are the outputs here 62 00:06:00,300 --> 00:06:06,780 so we do stores to those. [COUGH] At the beginning of time, 63 00:06:06,780 --> 00:06:13,920 we are going to say X and Y are initialized to zero and ten, 64 00:06:15,000 --> 00:06:23,905 respectively as shown. Okay. Let's, let's talk about what our 65 00:06:23,905 --> 00:06:30,480 valid sequential consistent outcomes here for X prime and Y prime. 66 00:06:31,600 --> 00:06:35,965 Now, this is not actually easy to detect right off the get go. 67 00:06:35,965 --> 00:06:41,475 Does anyone know which of these four, I think the answer is on your sheet, but 68 00:06:41,475 --> 00:06:45,912 don't go look at it. it's not actually easy to detect which 69 00:06:45,912 --> 00:06:51,064 one is sequentially consistent, which one is not sequentially consistent. 70 00:06:51,064 --> 00:06:56,360 So, let's, let's walk through a few interweavings here to see what happens. 71 00:07:00,460 --> 00:07:02,740 72 00:07:07,620 --> 00:07:17,300 Okay. So, we're going to have thread one, thread two, X, Y and X, Y primes. 73 00:07:18,500 --> 00:07:26,999 Let's say, we execute T1, and then T2 in time. 74 00:07:26,999 --> 00:07:33,260 So we're, we'll draw time this way. So we do, 75 00:07:33,260 --> 00:07:39,408 we do a store of one. We do a store word eleven. 76 00:07:39,408 --> 00:07:50,600 And then, we do a load word of Y. A store word of Y prime, 77 00:07:50,600 --> 00:07:56,226 load word of X and a store word of X prime. 78 00:07:56,226 --> 00:08:03,544 We do this store here. at the beginning of time, as I said, X 79 00:08:03,544 --> 00:08:13,750 has zero and Y has ten. We do a store here and this is going to 80 00:08:13,750 --> 00:08:19,360 be storing to X one, so we're going to update the value of X there. 81 00:08:19,360 --> 00:08:30,246 Now, we're going to store eleven to Y. Okay. So, eleven gets loaded there in 82 00:08:30,246 --> 00:08:33,069 time. [SOUND] We do this load. 83 00:08:33,069 --> 00:08:41,154 Nothing, no, nothing changes. We do a store here to Y prime and we look 84 00:08:41,154 --> 00:08:45,349 at what we got when we did this load of Y. 85 00:08:45,349 --> 00:08:52,440 We loaded eleven and we're going to store eleven to Y prime, at this point. 86 00:08:52,440 --> 00:09:01,180 Then, we do a load of X which has one in it now, and we do a store to X prime. 87 00:09:03,540 --> 00:09:10,399 It's going to have one. So, to sum up, you have eleven and one. 88 00:09:10,399 --> 00:09:13,420 Okay. That's this one. 89 00:09:13,420 --> 00:09:18,676 That's a valid output. And what's interesting to see here is, 90 00:09:18,676 --> 00:09:23,818 even with the stitch memory model, we're going to end up with different 91 00:09:23,818 --> 00:09:27,412 possible outcomes. There's not only one possible outcome. 92 00:09:27,412 --> 00:09:30,941 There's actually multiple possible valid outcomes here. 93 00:09:30,941 --> 00:09:38,270 Okay. So, let's look at another. Let's use the, the same columns at the 94 00:09:38,270 --> 00:09:46,729 top here and let's say we have these two split so that the first store happens 95 00:09:46,729 --> 00:09:51,876 here and the second, the second store from thread one happens there. 96 00:09:51,876 --> 00:09:57,316 Still sequentially consistent because they're being ordered inside of the 97 00:09:57,316 --> 00:10:02,096 respective threads, and it's a global ordering that everyone sees. 98 00:10:02,096 --> 00:10:08,420 Okay. So, let's, let's go do that one. So we do store word of one. 99 00:10:10,040 --> 00:10:15,322 Then, we execute in thread two load word of R1, 100 00:10:15,322 --> 00:10:18,423 Y. Store word of R1, Y prime. 101 00:10:18,423 --> 00:10:22,443 Okay. So, let's go to the store here. 102 00:10:22,443 --> 00:10:25,774 The store's going to, oh well, 103 00:10:25,774 --> 00:10:34,732 actually, we'll finish writing it first. Load word of R2 from X store word R2, X 104 00:10:34,732 --> 00:10:39,900 prime. And then finally, we, we finish off here 105 00:10:39,900 --> 00:10:43,920 writing the store word of one to X. 106 00:10:46,560 --> 00:10:48,603 107 00:10:48,603 --> 00:10:59,119 Okay. Oops, it is wrong. 108 00:10:59,119 --> 00:11:08,891 Yep. [SOUND] That's what I do. 109 00:11:08,891 --> 00:11:14,230 Okay. Store word's going to happen here first. 110 00:11:14,230 --> 00:11:20,094 And, remember we started out with zero and ten, and zero in, in X and Y. 111 00:11:20,094 --> 00:11:25,612 So store word of one is going to happen. It's going to update X to be one. 112 00:11:25,612 --> 00:11:29,493 We do this load and it's going to be at ten now. 113 00:11:29,493 --> 00:11:38,449 And we store ten into Y prime. We do a load word here of X, 114 00:11:38,449 --> 00:11:42,740 or we're basically going to store that back into X prime. 115 00:11:42,740 --> 00:11:45,848 So, what was X? We look back at this column. 116 00:11:45,848 --> 00:11:48,437 Most up to date value was one. Okay. 117 00:11:48,437 --> 00:11:51,618 So we come over here, and we say, that's one. 118 00:11:51,618 --> 00:11:55,540 And then finally, we do a store word of eleven into Y. 119 00:11:57,160 --> 00:12:02,233 Okay? Or Y to Y, 120 00:12:02,233 --> 00:12:07,384 so it gets two, eleven into Y. But what's the output here? 121 00:12:07,384 --> 00:12:12,722 Well, the output is one and ten. And this value, this store, 122 00:12:12,722 --> 00:12:18,248 didn't do anything. I mean, it updated the value of Y but it 123 00:12:18,248 --> 00:12:23,680 didn't effect Y prime or X prime in any way, shape, or form. 124 00:12:25,660 --> 00:12:29,949 Okay. So that ends up being one in ten is a 125 00:12:29,949 --> 00:12:39,381 valid sequentially consistent output. Let's try a non-sequentially consistent 126 00:12:39,381 --> 00:12:56,236 execution order and see what happens. [SOUND] Let's say, I don't know, which is 127 00:12:56,236 --> 00:13:01,955 a good one to execute. there's a couple different ways to get 128 00:13:01,955 --> 00:13:05,501 this. Let's execute this instruction here 129 00:13:05,501 --> 00:13:09,230 first. So, to store to Y. 130 00:13:09,230 --> 00:13:13,440 And we're going to reorder that with a stored X. 131 00:13:14,520 --> 00:13:20,887 And, we're going to execute thread two in between those two operations. 132 00:13:20,887 --> 00:13:21,160 So, 133 00:13:31,540 --> 00:13:34,140 134 00:13:36,920 --> 00:13:46,788 we use store word of eleven to Y. So, we've out of, let's say, T1 is an out 135 00:13:46,788 --> 00:13:52,712 of order processor here, for instance. Then, we're going to execute all of 136 00:13:52,712 --> 00:13:58,020 thread two, and we'll execute all of thread two in order. 137 00:13:58,020 --> 00:14:18,072 [SOUND] And then finally, we're going to come back and thread one here and do the 138 00:14:18,072 --> 00:14:27,838 store of let's say, one to X. Okay. 139 00:14:27,838 --> 00:14:32,462 So, X and Y start out with zero and eleven, or excuse me, zero and ten, 140 00:14:32,462 --> 00:14:37,263 rather. Store of eleven of to Y, okay that's 141 00:14:37,263 --> 00:14:45,120 going to update. This could be eleven. 142 00:14:45,120 --> 00:14:52,960 We're going to load Y now, so we're going to load this eleven into 143 00:14:52,960 --> 00:14:56,240 R1. And then, we're going to store up to Y 144 00:14:56,240 --> 00:15:01,093 prime eleven. Then we're going to load, let's say, X 145 00:15:01,093 --> 00:15:04,268 here. Well, X is just the initial value, so 146 00:15:04,268 --> 00:15:09,535 we're going to get zero into R2, then we're going to store our zero into X 147 00:15:09,535 --> 00:15:13,330 prime there. And then finally, here, we're going to do 148 00:15:13,330 --> 00:15:17,470 a store of one to X. So, what's our output? 149 00:15:17,470 --> 00:15:20,262 Well, our output is zero and eleven. 150 00:15:20,262 --> 00:15:25,344 And, but we had a non-sequentially consistent, this is a non-sequentially 151 00:15:25,344 --> 00:15:29,902 consistent output here. So, there's no way we can work through 152 00:15:29,902 --> 00:15:36,030 all the different possible combinations. But the only way you're possibly going to 153 00:15:36,030 --> 00:15:42,083 get a zero for X prime and eleven for Y prime is if you execute non-sequentially 154 00:15:42,083 --> 00:15:47,164 consistent execution sequence. And this was a, because we flipped the 155 00:15:47,164 --> 00:15:55,733 ordering of these two instructions, that made this not sequentially 156 00:15:55,733 --> 00:16:03,118 consistent. So, we have a big X over it. 157 00:16:03,118 --> 00:16:06,334 We could work through all the other possible combinations here. 158 00:16:06,334 --> 00:16:09,755 But, there's actually another, interestingly enough, there's actually 159 00:16:09,755 --> 00:16:13,533 another way, another interweaving that gets you this same non sequentially 160 00:16:13,533 --> 00:16:17,567 consistently output from a different sequentially consistent ordering of these 161 00:16:17,567 --> 00:16:20,017 instructions. Or, or a different non-sequentially 162 00:16:20,017 --> 00:16:30,980 consistent ordering of the instructions. So that, that can't be true. 163 00:16:30,980 --> 00:16:40,568 Okay. So, what this is really trying to say is that, in our processors we've 164 00:16:40,568 --> 00:16:46,740 looked at up to this point, we've had some arcs. 165 00:16:47,980 --> 00:16:56,948 And example of an arc here is that if you do a load into R1 and then use the value 166 00:16:56,948 --> 00:17:02,174 of R1 to do the store, sorry these should be order flipped for 167 00:17:02,174 --> 00:17:04,103 MIPs compatibility. Yeah. 168 00:17:04,103 --> 00:17:07,560 I flipped them there, I missed here. Okay. 169 00:17:11,140 --> 00:17:15,891 for this store here to happen, it needs to wait for the value of R1. 170 00:17:15,891 --> 00:17:18,941 So, just simple read after write dependence. 171 00:17:18,941 --> 00:17:23,100 Likewise, here you have a read after write dependence. 172 00:17:23,100 --> 00:17:27,284 And it's sequential consistency you can have other dependencies. 173 00:17:27,284 --> 00:17:31,827 So for instance, if you do a load in the store and their to the same address, 174 00:17:31,827 --> 00:17:36,549 there needs to be order between those. Which we've already talked about in this 175 00:17:36,549 --> 00:17:39,119 class. What we've not talked about is additional 176 00:17:39,119 --> 00:17:43,304 sequential consistency requirements. So, if you want to have additional 177 00:17:43,304 --> 00:17:48,145 sequential consistency requirements, what it looks like is every memory operation 178 00:17:48,145 --> 00:17:52,812 is dependent on the prior memory operations in its own thread. 179 00:17:52,812 --> 00:17:56,380 So, we're going to draw that as a red arc here. 180 00:17:57,440 --> 00:18:01,142 And you can sort of see it here, but also, this is dependent on this, and this 181 00:18:01,142 --> 00:18:04,602 is dependent on that, and that is dependent on that, but sort of through 182 00:18:04,602 --> 00:18:06,892 transitivity they, they don't draw all the arcs. 183 00:18:06,892 --> 00:18:10,883 So, some, something to think about. Okay. So, we've got a question that comes 184 00:18:10,883 --> 00:18:12,815 up here. I think we already asked this question 185 00:18:12,815 --> 00:18:16,780 but, does a system of caches and out-of-order 186 00:18:16,780 --> 00:18:21,045 processors provide a sequentially consistent unit of memory? 187 00:18:21,045 --> 00:18:30,866 [SOUND] Answer's probably not. It's pretty hard to do. 188 00:18:30,866 --> 00:18:37,344 You could potentially try to build a processor which is out of order and, and 189 00:18:37,344 --> 00:18:40,916 do this. But, we talked about breaking and 190 00:18:40,916 --> 00:18:45,669 reordering all of these instructions in an out of order processor. 191 00:18:45,669 --> 00:18:50,495 So, we're going to try to think about what is the current, correct model if you 192 00:18:50,495 --> 00:18:54,160 cannot go for full sequential consistency. 193 00:18:54,160 --> 00:18:59,410 You're asking me, is that the Java compiler to guarantee this ordering. 194 00:18:59,410 --> 00:19:04,915 Okay. So, what we're actually talking about here is the hardware breaking the 195 00:19:04,915 --> 00:19:09,535 ordering And just reorder these randomly that the compiler can possibly never be 196 00:19:09,535 --> 00:19:16,556 able to try to control. That's, the, the programmer is at fault 197 00:19:16,556 --> 00:19:18,891 there. The programmer should not be assuming 198 00:19:18,891 --> 00:19:21,225 that. [LAUGH] The programmer assumes, assumes 199 00:19:21,225 --> 00:19:23,401 some, basically, programmers try to assume 200 00:19:23,401 --> 00:19:27,646 something like sequential consistency. And what we're saying is that none of the 201 00:19:27,646 --> 00:19:30,458 hardware out there implements sequential consistency. 202 00:19:30,458 --> 00:19:34,013 So the programmer can't assume that, it was a bad assumption on the 203 00:19:34,013 --> 00:19:38,053 programmer's perspective. so the compiler, for instance, the reason 204 00:19:38,053 --> 00:19:42,050 why the programmer assumes that is because it's kind of a rational thing to 205 00:19:42,050 --> 00:19:44,294 do. It's like the, the reasonable rational 206 00:19:44,294 --> 00:19:46,812 thing to do. But, it's really hard to implement 207 00:19:46,812 --> 00:19:51,356 something fast under those constraints. So, instead what people do is they try to 208 00:19:51,356 --> 00:19:55,626 come up with something weaker than that which still gives the programmer some 209 00:19:55,626 --> 00:19:58,965 simblence of programmability. And that's what we are going to be 210 00:19:58,965 --> 00:20:03,344 talking about the rest of today's lecture and next lecture what are those 211 00:20:03,344 --> 00:20:06,520 simblences of rationality that we can give the programmer. 212 00:20:06,520 --> 00:20:09,708 But, this is one program and this is another program. 213 00:20:09,708 --> 00:20:14,122 Or this is one thread and this is another thread in the same program. 214 00:20:14,122 --> 00:20:18,658 But the compiler cannot guarantee this ordering because the out of order 215 00:20:18,658 --> 00:20:22,467 processor will just move things around, disregards the compiler. 216 00:20:22,467 --> 00:20:27,429 We already talked about the out of load processors moving loads past storers and 217 00:20:27,429 --> 00:20:32,037 storers past loads in the hardware. So, and what we're saying here is if you 218 00:20:32,037 --> 00:20:35,758 want to actually implement full sequential consistency, all those 219 00:20:35,758 --> 00:20:40,247 optimizations that the hardware out of order processor want to do are invalid. 220 00:20:40,247 --> 00:20:49,840 We don't want to really do that. okay. So, that's sequential consistency.