So let's think about the snoopy protocols we've talked about, our bus-based protocols, and their performance and asymptotic performance requirements. What are the challenges of a snooping protocol? As we discussed before, as you add more processors to the system, you have more entities shouting on one shared medium, and you need to hear all of the shouting. You can't just ignore some shout, because you need to snoop it against your local cache. Whenever another core takes a cache miss, you need to snoop that against your cache and make sure you don't have a copy of that line, or, if you do, that you invalidate it, or perhaps write back some data, or make whatever other adjustment the invalidation-based coherence protocol requires.

So what's annoying about this: if you look at the amount of bandwidth you require on your bus, every cache miss needs to go across that bus, everyone needs to look at it, and everyone needs a snoop port with enough bandwidth to check all of those transactions against their cache. So if we want to keep up with the same amount of cache-miss bandwidth per core as we add more cores, the bus is going to have to grow as order N, where N is the number of processors. You can see this because, say, every core has the same cache-miss rate and you want to sustain it, so you just multiply by N; each core contributes that much traffic. Well, that's fine when N is eight, but if N goes to a thousand or a million, you're going to have some serious problems: that's a very, very big bus. And it's not just raw bandwidth; you also need to arbitrate for the bus somewhere, and you need atomic transactions going across that bus, so even if you have a very high-bandwidth bus, you may not have enough bus cycles to arbitrate and complete all of those transactions.
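To make that order-N growth concrete, here is a rough back-of-the-envelope sketch. The miss rate and line size are made-up numbers, not figures from the lecture; the point is only how the totals scale with the core count N.

```python
# Back-of-the-envelope sketch (made-up numbers) of why a snooped bus must
# scale with the core count N: every core's misses appear on the shared
# bus, and every other cache must snoop all of them against its tags.

line_bytes = 64                  # assumed cache line size
misses_per_sec_per_core = 10e6   # assumed per-core miss rate

def bus_bandwidth(n_cores):
    # Aggregate traffic the shared bus must carry: grows as O(N)
    return n_cores * misses_per_sec_per_core * line_bytes

def snoop_bandwidth_per_cache(n_cores):
    # Each cache must check every other core's misses
    return (n_cores - 1) * misses_per_sec_per_core * line_bytes

for n in (8, 1024, 1_000_000):
    print(f"N={n:>9}: bus {bus_bandwidth(n)/1e9:14.1f} GB/s, "
          f"snoop port {snoop_bandwidth_per_cache(n)/1e9:14.1f} GB/s")
```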
A solution to this is something we're going to call directory cache coherence, or directory protocols. The key idea in a directory protocol is that instead of broadcasting your invalidations to every other core, or every cache, in the system, you go talk to a location that we're going to call the directory. This directory is going to keep track of which caches have that data. What's nice about this is that when you take a cache miss, you can go ask the directory who has this cache line. And if only one other core has the cache line, say readable, and you're trying to take it into exclusive access because you want to write it, you only need to invalidate that one location instead of sending messages to all N processors in your system. So we've cut down what was a broadcast system into a point-to-point system. The overhead we have to pay now is that we need to track, in the directory, all of the caches which could have a particular cache line. We'll go through a much more detailed example of that later, but that's the overall key idea: this turns what was a broadcast into point-to-point communication, and we can use point-to-point interconnects for it.

Another good point for scalability is that you can actually have different directories; you don't have to have one big monolithic directory. Instead, you can segment the address space somehow, and depending on the address you're accessing, you go to a different directory. By spreading requests across these different directories, you can actually increase the bandwidth to your directories.

Okay, so let's see how this fits into a block diagram. We have CPUs trying to communicate with other CPUs via shared memory, and they check their own cache first. If the data is not in the cache, before, they would have had to cross a bus and everyone would have to look at that traffic. Instead, with directory cache coherence, the cache sends a message to the directory controller associated with that address. The basic directory controller here keeps track, for every single line in memory, of the list of other caches which could potentially have that piece of data, and we're going to call that the sharer list.
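To make the sharer list concrete, here is a minimal sketch of a directory entry and the invalidation fan-out on a write miss. This is not any real machine's protocol; it assumes a simple three-state entry and an explicit set of sharer IDs, and it ignores writebacks from an exclusive owner.

```python
# Minimal directory sketch: per memory line, the directory records a
# state plus the set of caches that may hold the line (the sharer list).
from dataclasses import dataclass, field

@dataclass
class DirEntry:
    state: str = "Uncached"                          # "Uncached", "Shared", "Exclusive"
    sharers: set[int] = field(default_factory=set)   # IDs of caches holding the line

class Directory:
    def __init__(self):
        self.entries = {}                            # line address -> DirEntry

    def entry(self, addr):
        return self.entries.setdefault(addr, DirEntry())

    def read_miss(self, addr, requester):
        e = self.entry(addr)
        # (A real protocol would first make an Exclusive owner write back.)
        e.state = "Shared"
        e.sharers.add(requester)
        return e.sharers

    def write_miss(self, addr, requester):
        e = self.entry(addr)
        # Point-to-point invalidations: only the recorded sharers get a
        # message, instead of a broadcast to all N caches.
        targets = e.sharers - {requester}
        e.state, e.sharers = "Exclusive", {requester}
        return targets                               # caches that must be invalidated

d = Directory()
d.read_miss(0x40, requester=0)
d.read_miss(0x40, requester=1)
print(d.write_miss(0x40, requester=2))               # -> {0, 1}: two invalidates, not N-1
```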
Now, if we look at this, it might still be a uniform communication network. Let's say you have some omega network in here: anyone can talk to anyone else, and the latency through it is fixed. So this is still a uniform memory access system; we didn't have to go non-uniform here. No cache is necessarily closer to or farther away from any piece of memory in a system like this. So this is kind of our naive directory cache coherence protocol. But what's still nice here is that we don't have to broadcast. With, say, an omega network, or a mesh network, or something else on the inside, we'd like to send from this cache directly to this directory controller. If no other cache has the line, readable or writable or anything like that, the directory can just respond back with the data from memory. If not, instead of broadcasting an invalidate to all other cores, the directory can just say: oh, this cache and this cache have copies; I need to send two messages, one to each of those caches, to invalidate them, wait for the responses, and then reply back with the data. So in the common case we can decrease the bandwidth we use across our interconnection network.

Now I'm going to show a slightly different picture, which is pretty similar to the previous one. What you'll notice is that the memory and the directory are now connected to an individual CPU. Why do we do this? Well, if you're building one of these scalable systems, some sort of supercomputer, it might be a good property that as you add more CPUs to the system, you also add more RAM, and maybe more directory storage. Another positive of a design like this is that each CPU is now actually close to its own memory bank, and we can try to take advantage of that.
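Once memory and directory storage are distributed per node like this, each physical address needs a "home" node whose directory tracks it. Here is a small sketch of one common way to pick the home: interleaving cache lines round-robin across nodes. The constants are assumptions for illustration, not values from the lecture.

```python
# Hedged sketch: one common way to segment the address space across
# per-node directories is to interleave by cache-line address.

LINE_BYTES = 64      # assumed cache line size
N_NODES = 6          # assumed node count (matches the six CPUs in the figure)

def home_node(addr: int, n_nodes: int = N_NODES) -> int:
    # Drop the offset within the line, then spread lines round-robin
    # across the nodes' memories and directories.
    return (addr // LINE_BYTES) % n_nodes

# A miss on this address is sent to this node's directory controller:
print(home_node(0x12345))    # -> some node index in [0, N_NODES)
```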
So one question comes up: how can we take advantage of that? Anyone have any thoughts? Okay, so for shared data, we don't know where it's going to be accessed; it could be accessed by all six CPUs and all six caches here. But it's very common that the stack for your program is only going to be accessed locally, and the instruction memory for your program is only going to be accessed locally. So you can potentially get performance benefits by putting the instructions and the stack, and maybe even some portion of the heap, close to this core, because then you can access them really quickly, and only shared data has to go across the interconnect.

In fact, that has a fancy name. Systems where some data is close and some data is far away are called Non-Uniform Memory Access, or NUMA, systems, and you might see this even in desktop processors, which are actually moving towards NUMA designs. I believe the AMD chips today are already NUMA systems, even on a single chip with multiple dies or something like that; there are actually two NUMA nodes inside of them, one for one memory controller and one for the other memory controller. So if you go into something like Linux and look around the proc filesystem, you can actually find the NUMA information, and it will tell you the configuration of the different memories. The OS can take advantage of this: it can put, for instance, the stack and the instruction memory for a program that's being run by a particular core close to that core, and then maybe place other data by some other policy.

Now, I want to make a point here: just because the latency to memory is different does not mean that your system is a directory-based cache-coherent NUMA system. You can still have non-directory-based systems where some memory is close and some memory is far away. You could still have, basically, a bus or some other interconnection network in there which is effectively a snooping protocol, but where some data is close and some data is far away. If you see this in the literature, people talking about directory-based cache-coherent NUMA systems will usually call them CC-NUMA, or cache-coherent NUMA, systems. That usually means a cache-coherent non-uniform memory access architecture, and it usually implies a directory-based protocol, though there may be other protocols that people use out there as well.
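As a quick aside on inspecting this yourself: on recent Linux kernels the per-node topology the lecture alludes to is typically exposed under /sys/devices/system/node rather than a dedicated /proc subdirectory; exact files vary by kernel version, so treat this as a sketch.

```python
# Hedged sketch: list NUMA nodes, their CPUs, and their memory sizes as
# exposed by a typical Linux kernel under /sys/devices/system/node.
import glob, os

for node in sorted(glob.glob("/sys/devices/system/node/node[0-9]*")):
    name = os.path.basename(node)
    try:
        with open(os.path.join(node, "cpulist")) as f:
            cpus = f.read().strip()
        mem = ""
        with open(os.path.join(node, "meminfo")) as f:
            for line in f:
                if "MemTotal" in line:
                    mem = " ".join(line.split()[-2:])   # e.g. "16318312 kB"
                    break
    except OSError:
        continue                                        # not a NUMA-aware kernel/node
    print(f"{name}: CPUs {cpus}, MemTotal {mem}")
```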
Okay, so I want to go back one slide, because I want to finish off by talking about one topology which is interesting. The difference between these two slides is that we went from a single CPU here to multiple CPUs, so this is a multi-core chip now. Where this gets interesting is that you might have a directory-based cache coherence system connecting multiple chips, but then inside of a chip you may have something like a bus-based snooping protocol. So we actually mix and match these two things. The way we go about doing this is that if cores inside of this one chip want to get data from each other, they can just effectively snoop on each other; but beyond the chip, your cache controller, or maybe the L3 cache for this particular chip, is going to respond to messages coming from the directories, like invalidation requests, and do something about them. So there's basically a transducer there between a directory-based cache coherence protocol and a bus-based snoopy protocol. And this is pretty common these days, especially given that you have a fair number of multi-core chips showing up and being used in these larger directory-based cache coherence systems. We'll talk about one of them at the end, actually: the SGI UV, or UV 1000, system, which uses off-the-shelf Intel parts, modern-day Core i7-class parts, mixed together with a NUMA, directory-based coherence system to connect all the chips together. So there's a transducer from the external snoop bus protocol to the directory-based coherence protocol.
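To illustrate the transducer idea, here is a purely conceptual sketch, not SGI's actual design: an invalidation arriving from the external directory protocol is turned into an on-chip snoop of every core's private cache.

```python
# Conceptual sketch of a two-level scheme: directory protocol between
# chips, snooping within a chip. The "transducer" turns an external
# directory invalidation into an on-chip snoop broadcast.

class Chip:
    def __init__(self, n_cores):
        # Each core's private cache modeled as the set of line addresses it holds
        self.private_caches = [set() for _ in range(n_cores)]

    def handle_directory_invalidate(self, addr):
        # Inside the chip we fall back on snooping: every core checks the
        # address against its own cache, as it would on a shared bus.
        for cache in self.private_caches:
            cache.discard(addr)       # invalidate the line if present
        return "inval_ack"            # acknowledgement back to the home directory

chip = Chip(n_cores=4)
chip.private_caches[2].add(0x80)
print(chip.handle_directory_invalidate(0x80))   # -> "inval_ack"
```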