So, today we're going to continue our adventure in computer architecture and talk more about parallel computer architecture.

Last time we talked about coherence, memory coherence and cache coherence systems, and we differentiated that from memory consistency models: a consistency model is a model of how memory is supposed to work, versus the underlying coherence algorithms that try to keep memory consistent and implement the consistency model.

We left off last time talking about MESI, also known as the Illinois protocol, and we walked through all of the different arcs through here. If you recall, what we did was split the shared state from the MSI protocol into two states, shared and exclusive. The insight here is that it's very common for programs to read a memory address, which pulls it into your cache, and then go modify that same memory address. For instance, if you want to increment a number, you do a load, which brings the value into your register set but also into your cache; you increment the number, and then you do a store back to the exact same location. That's pretty common in imperative programming languages. More declarative languages like Scheme may at times copy everything, but in imperative languages it's pretty common to actually change state in place (a short sketch of this pattern appears below). Because of that, you can bring the line right into this exclusive state, and then when you go to modify it you don't have to broadcast on the bus. Otherwise you would have to talk to everybody: you would effectively have to send an intent-to-write message across the bus and wait for that address to be snooped, or seen, by all the other entities on the bus.

Note that I say entities on the bus. We've been talking primarily about processors so far, but there can be other entities on the bus that want to snoop it. Examples sometimes include coherent IO devices.
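Coming back to the read-modify-write pattern mentioned above, here is a minimal sketch in C of why the exclusive state pays off; the function and variable names are made up for this example, and the interesting part is the cache-state commentary, not the code itself.

```c
/* Increment a counter in place, the common imperative pattern. */
void increment(int *counter)
{
    int tmp = *counter;  /* load: a read miss with no other sharers
                          * installs the line in Exclusive (E), not
                          * Shared (S), under MESI                    */
    tmp = tmp + 1;
    *counter = tmp;      /* store: the line is already in E, so it can
                          * be upgraded to Modified (M) silently; no
                          * intent-to-write has to be broadcast, unlike
                          * plain MSI where the read leaves the line in S */
}
```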
Now, coherent IO isn't very popular right now, but I think it will become much more popular as soon as we start to have GPUs, Graphics Processing Units or general-purpose GPUs, sitting effectively very close to our processor on the same bus and wanting to take part in the processor's coherence traffic. A GPU like that is going to want to read and write the same memory addresses that the processor is reading and writing, and take part in the cache coherence protocol. At a minimum, your IO devices usually need to tell the processor when they're doing a memory transaction that the processor should know about. So typically, when you're moving data from an IO device to main memory, that transfer is going to have to go across the bus, and everyone is going to have to invalidate their caches: they all have to snoop that memory traffic from the IO device.

So we had talked about MESI as an enhancement to MSI. We left off last time about to talk about two more enhancements that are pretty common. One has been used widely in AMD Opterons; I think AMD still uses this, or something similar to it, is my understanding. The idea is that you add an extra state here, called ownership, or the owned state. Effectively, this looks just like our MESI protocol from before, but now, instead of keeping data in the modified state, when, let's say, another processor needs to go access that data, you don't have to send all of that data back to main memory, write the line out, and fetch it back from main memory. Instead, you can do a direct cache-to-cache transfer. This is basically an optimization: you don't have to write the data back to main memory, and in fact you can allow main memory to be stale, and just transfer the data across the bus from the one cache to the cache which needs it.

So in this example here, we're going to look at this edge here. Another processor wants to read the data. We see that other processor's intent to read for a particular cache line, and our processor currently has that line in the modified state.
And we're actually going to provide the data out of our cache, not write it back to main memory, and transition the line in our cache to this owned state. The other processors can now take the data in, in the shared state, so they will have a read-only copy. Note this is only for read-only access; we'll talk about what happens if another processor wants to write the data in a second.

So we have the line in the owned state, and what we're trying to do here is have this processor track that the data needs to be written back to main memory at some point. That's the whole purpose of this state: we've basically designated one processor which owns the data, which owns the modified state. The processors which take it read-only get it in the shared state, and if they need to invalidate the line they don't need to contact anybody; because they hold a shared, read-only copy, they don't need to make any bus transactions. If you think about it, suppose one core reads this dirty state from the other core, and then at some point the line just gets invalidated in that second core. If main memory is not up to date, you lose the changes. So by one processor keeping the line in the owned state, it keeps track that, when the line eventually does get evicted or invalidated out of that processor's cache, it needs to be written out to main memory to keep memory up to date.

Now, there are a couple of other arcs here. You can transition from the owned state back to the modified state if the processor which has the line in the owned state wants to do a write. It can't write while it's in the owned state, because other processors may have shared copies of it. So when P1 wants to do a write here, it needs to invalidate everyone else's copies across the bus: it sends an intent-to-write for that line, everyone else snoops that traffic and transitions to the invalid state, and then this processor can transition to the modified state and actually modify the data (these transitions are sketched in code below). Okay.
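Here is a rough sketch of the owned-state transitions just walked through, written as a per-line state machine in C. The state and event names are invented for this example, and a real controller has many more cases (evictions, write-back queues, write misses, and so on); this only covers the handful of arcs discussed above.

```c
typedef enum { INVALID, SHARED, EXCLUSIVE, OWNED, MODIFIED } moesi_state_t;

/* React to a snooped bus request for a line we hold.  Returns 1 if this
 * cache must source the data itself (cache-to-cache transfer) rather
 * than letting main memory answer.                                      */
int snoop_request(moesi_state_t *st, int remote_intent_to_write)
{
    switch (*st) {
    case MODIFIED:
    case OWNED:
        /* We hold the only up-to-date copy, so we supply the data.  On a
         * remote read, M demotes to O (memory is stale; we now own the
         * write-back responsibility).  On a remote write we give up the
         * line entirely.                                                 */
        *st = remote_intent_to_write ? INVALID : OWNED;
        return 1;
    case EXCLUSIVE:
    case SHARED:
        /* Clean copies: drop them on a remote write, otherwise stay (or
         * become) shared.  No data needs to be supplied by this cache.   */
        *st = remote_intent_to_write ? INVALID : SHARED;
        return 0;
    default:
        return 0;   /* INVALID: nothing to do */
    }
}

/* Our own processor wants to write the line. */
void local_write(moesi_state_t *st, void (*broadcast_intent_to_write)(void))
{
    if (*st == SHARED || *st == OWNED)
        broadcast_intent_to_write();  /* others snoop this and invalidate */
    /* From EXCLUSIVE the upgrade is silent; from INVALID a real
     * controller would also have to fetch the data (a write miss).      */
    *st = MODIFIED;
}
```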
So we've got this arc here, which we sort of already talked about: if you're in the owned state, anyone else can get read-only, shared copies of the line. They can't get an exclusive copy, because that would violate this notion: they would then be able to upgrade to modified without telling anybody, and we don't want that. But they can get shared, read-only copies of the data. And then there's this arc here from owned to invalid, for when some other processor wants to write the data. Processor 1, P1 here, will see the intent to write from another processor, it will snoop that traffic, and at that point it will transition to the invalid state. Note that on this intent to write we may need to provide the data across the bus while we're in the owned state, because if we're the only cache that has the data, and the other processor is going straight into the modified state via a write miss, we're going to need to provide the data.

Okay, so, questions about MOESI so far? It's basically an extra optimization, because we don't have to write back to memory; we can transfer the data around. One cache can have a cache line in the owned state, and later some other cache can have the exact same cache line in the owned state, and it can bounce around without ever having to go out to main memory. This decreases our bandwidth out to the main memory system.

Okay. Then we're going to talk about MESIF, which is actually used in the Core i7, in the most up-to-date Intel processors. It looks very similar to MOESI, except we're going to see an extra little letter in this one bubble here. Effectively, what's going on is we add an extra state called the forward state. This is similar to the optimization we saw in MOESI, except the forwarding cache can't keep the data in a dirty, writable form; main memory stays up to date. What happens in this protocol is that the first cache which takes a read miss on a line of widely shared data is elected and gets the data in this forward state. Then, if other caches want read-only copies, they bring it in shared, and instead of those requests having to go out to main memory, the cache that has the line in the forward state is going to provide that data across the bus.
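Here is a minimal sketch of how a read miss might get serviced under this forward-state scheme, using the simple fall-back discussed next (if no cache holds the line in F, main memory answers). Only the read-shared cases are modeled; lines held modified or exclusive elsewhere are out of scope, and the structure and names are invented for illustration, not Intel's documented implementation.

```c
#define NCACHES 4   /* hypothetical small snooping system */

typedef enum { INV, SHR, EXC, FWD, MOD } mesif_state_t;

/* Returns the index of the cache that supplied the data for another
 * processor's read miss, or -1 if it had to come from main memory.   */
int service_read_miss(mesif_state_t state[NCACHES], int requester)
{
    for (int c = 0; c < NCACHES; c++) {
        if (c != requester && state[c] == FWD) {
            state[c] = SHR;           /* old forwarder keeps a shared copy */
            state[requester] = FWD;   /* newest reader takes over F        */
            return c;                 /* data moved cache-to-cache         */
        }
    }
    /* No forwarder (e.g. it evicted the line): memory, which is never
     * stale for shared lines in MESIF, supplies the data, and this
     * reader is elected as the forwarder for later read misses.        */
    state[requester] = FWD;
    return -1;
}
```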
So this is going to effectively decrease our bandwidth to main memory, by providing the data out of another processor's cache instead. Now, this is a little bit of a simplification. There is a question here: if you're in the forward state and you invalidate the data, who has it? Does anyone provide the data? There are roughly two choices. One choice is that no one has it in the forward state, so when a snoop request for the line comes along, it just has to go out to main memory. That's the easy case. The other case is that you could try to build a protocol where, when one cache invalidates its forward copy, it chooses another cache to take over. But probably the simplest thing to do is: when the forwarding core invalidates the data, for whatever reason, you just go back out to main memory, because there's always an up-to-date copy in main memory, since the caches are only keeping read-only copies.

Yeah, you're right, you're probably going to enter the exclusive state. That's a good question. I've read two different versions of this in different books, so I'm not quite sure; Intel doesn't really document what they do here. But you're probably right: if you're the only reader, you probably enter straight into the exclusive state. Then what happens is that where you would otherwise transition from E to S, you instead transition from E to F, and you end up in the F state. So the first cache that downgrades is always going to end up in the F state. But like I said, I've seen other references where people implement something similar with some election that figures out who the forwarding node is; probably the easiest thing to do is just to downgrade from E to F.

For the rest of the course, we're going to look at how to scale beyond these broadcast and invalidate protocols that have to snoop on a bus. So, some of the problems of building these
snooping systems: they really affect how you design your processor. First of all, you're going to have to add more bandwidth into your cache, or at least more bandwidth into your tag array. One choice is to dual-port your tags. Another choice is to steal cycles for snoops. What I mean by stealing cycles is that if there is a bus transaction happening and you need to check it against your tags, you block the main processor associated with that cache from accessing the cache that cycle; you generate a stall signal to the cache, or to the main pipe. One of the things that gets a little tricky here, and this will affect your design, is that if you have a multilevel cache, you usually want to put your L2 tag array on the bus and snoop against the L2 tags. But if a snoop hits there and you figure out that you have to invalidate something, you're going to have to invalidate down the entire cache hierarchy, all the way down to the level-one cache. So this can actually affect the throughput of your level-one cache, and it's also annoying to do, because it effectively has to reach down and touch the tag array of your L1 cache. And as I mentioned briefly last time, if you're thinking about something like an exclusive cache, where the L2 tags don't cover what's in the L1, you're going to have to check both tag arrays for every snoop transaction, and that can be pretty painful to do. Or you have to copy the L1 tags, but that's effectively the same thing as just having an inclusive cache, maybe with a little less data storage.

Okay, so what limits our performance? Why can't we just build 1,000 processors on a big bus? Well, it's the same idea as if you had 1,000 people in this room all trying to shout at each other at the same time. At some point you run out of bandwidth, and more importantly you need some way to coordinate them. Also, because you're required to serialize, the occupancy on the bus goes up. If you have one bus with two people talking on it, and they each talk 10% of the time, then you have a 20% utilized bus.
Well, all of a sudden, if you have ten people on this bus, you have a 100% utilized bus, and if you have 1,000 people, you have an oversubscribed bus. So you have to worry about both bandwidth and occupancy, because we do need to make these different bus transactions atomic; it's not quite just a bandwidth problem. What I mean by that is, you could make the bus wider to increase the bandwidth, but it's not going to solve our problems, because there's also an occupancy challenge: you effectively need atomic transactions to happen across the bus in order to keep the cache coherence protocol correct.

Okay, so before we move off this topic and on to the interconnection networks we're going to talk about today, I want to talk about one of the challenges that comes up in simple cache coherence systems, and that's false sharing. Caches like to track information at a particular block size. We've talked about caches which have 64-byte lines, or 64-byte block sizes, and they can be bigger or smaller than that. Now, one thing that's pretty unpleasant in these coherence protocols is this: let's say you take a piece of data which is shared, needs to be coherent between two different processors, and gets communicated relatively often, and you put some other piece of critical data right next to it, on the same cache line. All of a sudden, because they're packed into one cache line and we only track coherence information on a per-cache-line basis, whenever that first piece of data, let's say a four-byte integer holding a lock or something like that, gets bounced around between caches, you're going to bounce around the other data with it, even though that other four-byte integer is not shared at all. So this can hurt your common-case performance for non-shared data, purely because of the true sharing of the data packed next to it. And this is not something that typically happens in a uniprocessor cache system: there you bring the whole line in and you simply get spatial locality out of it. And if you
pump a line out, you can get conflict misses, which are sort of equivalent to this, but it's a little bit of a different idea: there the conflicting data is never in the same line. With false sharing, we do see this. Now, false sharing is interesting because people have come up with a whole host of techniques to avoid it. So, anyone have an idea, one really good technique to avoid false sharing? What we can do, and this is pretty common, is that either the programmer or the compiler detects that this is happening and pads the information out: waste memory around highly contended pieces of data, and co-locate them with nothing that is shared.

One of the better examples of why you really have to care about this is something like your stack. If you were to have, let's say, a lock on your stack, there's a lot of data there which you need to use often, and it's all local; stacks are local to their threads. But if you have some sort of variable that you pass to someone else, which is a struct, and inside that struct is a lock or something like that, all of a sudden you're going to be bouncing around a line which is part of your stack, and other people are going to be invalidating your stack. One way to solve this is that when you declare a lock, the compiler can sometimes recognize it, because in some languages you can designate memory addresses as locks with special keywords. When you do that, it will say: don't put this next to anything else, or maybe only co-locate this data with other locks, because those may have bad sharing performance anyway. Really, what you want here is not to have a false sharing case at all.

Now, the analog of false sharing is actually true sharing. There are cases where you have multiple pieces of data that are each genuinely, and widely, shared. An example of this is an array of locks, where different processors will be grabbing these locks more or less randomly. You can use techniques similar to the false-sharing ones here: you probably don't want all of those locks to be on the same cache line.
Because the locks are basically going to be bouncing around, and everyone is going to be contending for that one cache line, trying to get it into the modified, M, state in their own cache. So what you can think about doing is applying the same technique and putting each of those locks on a separate cache line, as sketched below. Okay, so let's switch gears here.
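As referenced just above, here is a minimal sketch of the padding trick in C, assuming a 64-byte cache line and POSIX threads; `CACHE_LINE`, `padded_lock_t`, and `lock_table` are names made up for this example.

```c
#include <stdalign.h>
#include <pthread.h>

#define CACHE_LINE 64   /* assumed line size; query the real hardware value */

/* Give each highly contended lock its own cache line, so the line that
 * bounces between caches whenever the lock is taken never drags any
 * innocent, otherwise-private data along with it.                       */
typedef struct {
    alignas(CACHE_LINE) pthread_mutex_t mutex;
} padded_lock_t;

_Static_assert(sizeof(padded_lock_t) % CACHE_LINE == 0,
               "each lock occupies a whole number of cache lines");

/* The same idea applied to an array of locks: processors grabbing
 * different locks now contend for different cache lines.  Each mutex
 * still needs pthread_mutex_init() before use.                         */
padded_lock_t lock_table[16];
```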