We move on to our final topic of ELE 475, which is directory-based cache coherence. It's a little bit of a warm-up here. We remember the three Cs of caches: we had compulsory misses, capacity misses, and conflict misses. Well, we're going to think about adding a new miss type here, a coherence miss. A coherence miss is some other cache, or some other entity, reaching down into our cache and invalidating something there. So this is strictly different from compulsory, capacity, and conflict. If anything it looks closest to a compulsory miss, because it's effectively like a first miss, except some other entity bumped the data out of your cache. But it's not any of those either. So there's this communication coming from other cores, that is, in a snooping protocol or a symmetric shared-memory multiprocessor, that other traffic comes in and will actually bump things out of your cache. You need to worry about this. Now we're going to take these coherence misses and put them into two categories. True sharing, which we talked about briefly at the end of a couple of lectures ago. We're going to categorize this as two different categories of misses.
True sharing misses are the ones where, if you were to have a cache whose block size or cache line size were exactly one byte, the minimal size you can have on your machine, you would still have that miss. So that's a true miss. A true miss happens when you're actually sharing data. If one processor writes some data, and another processor wants to go read that data and needs to pull it into its cache, that's a true sharing miss. You need to do that communication. Now, to contrast that, we talk about false sharing, or false sharing misses. A false sharing miss is one that goes away if you were to reduce the sharing size, the block size, down to say one byte or one word. If you run the same program, and the miss occurs with the larger block size but not when the block size is one, then that is a false sharing miss. The block size was too big, and you had two pieces of data in the same cache line that are effectively causing sharing, even though there was no true sharing of the data going on. It just so happens they're packed into the same cache line.
Now, it's a little bit more than that. We're also going to say that false sharing can happen when data gets moved around or gets invalidated but is not being communicated; it may be shared later in the program, but that exact miss was not because of data being communicated. So it's a little bit broader than what we said last time. It does still happen because there are two pieces of data packed into the same line, but what I'm trying to get at here is that you could have data bouncing around between different caches, where the same instruction sequence, the same load and store sequence, would not cause the miss if you had a very small cache line size, but does cause it with a large cache line size. So let's look at an example and try to categorize these different misses. Let's start off here. The initial conditions are that X1 and X2, which are two words of data, are packed into the same cache block, the same cache line. P1 and P2 have both read the data, and it's readable in both caches at the beginning of time. Okay, so all of a sudden we do a write to X1. What do we have to do? Well, we're going to have to invalidate X1 in P2. And this is a true miss, because the data was in both caches. We need to pull it out of one of them; we need to actually invalidate it.
Because this is actual real data. Okay. So the next thing that happens is P2 goes and executes a read of X2. Well, what you may notice here is that at the beginning of time X2 was in the cache of P2, but it got bumped out here. P1 never wanted to write X2, so this is a false sharing miss. The line got pulled exclusive into P1's cache, and this read is going to pull it out of that cache. So we call this a false miss, because X1 was irrelevant to P2 for this miss. Okay, so now we see another write to X1. Well, P2 didn't actually touch X1, so likewise this is completely false sharing. Now we see a write to X2. Well, we didn't see any communication of that data going on here, so this is also a false sharing miss. Now, finally, we have something that is real here. We're going to read X2. When we wrote X2 there and we read X2 here, we are actually communicating data. So this is true sharing. And that's okay, but we want to try to minimize these false sharing patterns. This is just a warm-up, to motivate us into directory-based coherence a little bit. Okay, so let's motivate this a little bit more, and let's look at something like an online transaction processing workload. This is a database workload, a multiprocessor database workload.
It's using threads. And what we're going to see here is that we run the same workload on a four-processor system with four different cache sizes. This data is taken from a paper referenced in your book. What you'll notice is that as you increase the cache size, our false sharing and our true sharing misses don't really change. You still need to communicate data, and you're still going to get false sharing. Making the cache bigger didn't change the block size, so you're still going to get the same false sharing patterns. But as you increase the cache size, the instruction misses, the capacity misses, the conflict misses, and the compulsory misses, which we call cold misses here, go down, because the cache is getting bigger. So for the non-shared cache lines, the number of memory cycles, the amount of time you spend on memory misses, is going down. But the rest of this is not changing. Hmm. Well, okay, this is interesting, so the second question that comes up is: what happens if we increase the number of cores in our system? This is a relatively small system here. Let's plot the number of cores from one to eight, with the same workload, and look what happens.
Something else is invariant here. This is for a fixed cache size; we're going to plot the number of processors down here now. The number of memory cycles per instruction for instruction misses, capacity misses, conflict misses, and compulsory misses doesn't change; it stays the same, because it's basically uniprocessor behavior. But as you add more cores, you get both more true sharing and more false sharing. Hmm, well, this is a little scary, because our performance is basically going down as we add more cores. And this is only up to eight; what happens if we're way out here at 100 cores? What do we think is going to happen? Well, our performance is probably going to be dominated by the true sharing and the false sharing and these misses. Huh. Well, we're going to start thinking about that one, figure out how to solve it, and think about the scalability of coherence systems.