Okay, so: multiple logical communication channels. All these different message types we just talked about — the ten or so drawn here — can't all go on the same network at the same time if you have one interconnection network, or one logical network. You can end up with cases where responses get queued behind requests that will create more traffic. In these protocols we have multiple phases of communication: we send messages from point A to point B, then from B to C, then from C back to B, and then B back to A, or something like that. So if you try to overlay this onto an interconnection network and you do head-of-line processing — you process the message at the head of the queue first — it's very common to have a case where a response sneaks in which is dependent on some other transaction happening somewhere else in the system, or a new request comes in that sits in front of a response, and all of a sudden you're going to have a deadlock.

So one way people go about solving this is to look at all the different message types you have and figure out which types can coexist on the network at the same time. A good example here: if you look at a relatively basic cache coherence protocol, there are maybe twelve different message types. You can try to shrink that down to maybe three or four different classes of traffic, such that those classes, when intermixed, will never introduce a deadlock. If you wanted to be really safe, you'd just have twelve separate networks — twelve logically different networks — but that's expensive to build, so people try to come up with equivalence classes of traffic and have only that number of networks. Either way, you're going to have to segregate these flows into different logical or physical channels to make sure you don't have deadlock happening.
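To make the class-splitting idea concrete, here's a minimal sketch in C. It assumes a hypothetical protocol whose message types collapse into three classes (requests, forwarded requests, responses); the type names and the three-way split are invented for illustration, not the exact protocol from the lecture.

```c
/* Hypothetical illustration: collapsing many coherence message types
 * into a few deadlock-free traffic classes, each carried on its own
 * virtual (or physical) network. */
enum msg_type {
    MSG_READ_REQ, MSG_WRITE_REQ, MSG_UPGRADE_REQ,      /* cache -> directory */
    MSG_FWD_READ, MSG_FWD_INVALIDATE,                  /* directory -> owner/sharers */
    MSG_DATA_REPLY, MSG_ACK, MSG_NACK, MSG_WRITEBACK,  /* terminating replies */
};

enum traffic_class { CLASS_REQUEST, CLASS_FORWARD, CLASS_RESPONSE };

/* Responses get their own class so a response can never be queued
 * behind a request that is itself waiting on that response. */
enum traffic_class classify(enum msg_type t)
{
    switch (t) {
    case MSG_READ_REQ: case MSG_WRITE_REQ: case MSG_UPGRADE_REQ:
        return CLASS_REQUEST;
    case MSG_FWD_READ: case MSG_FWD_INVALIDATE:
        return CLASS_FORWARD;
    default:
        return CLASS_RESPONSE;   /* data, acks, nacks, writebacks */
    }
}
```

The key property is that the response class can always drain without waiting on the other classes, which breaks the head-of-line cycles described above.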
[COUGH] Another thing I wanted to point out here: we just distributed our memory, so we need to start worrying about the memory ordering point — for a given memory address, where do you go, and in what order do memory operations happen? Just like in a bus-based protocol you have some sort of arbiter which determines who wins the bus, in a directory-based system the directory for a given address is typically the ultimate arbiter. So if two messages are coming in from different cores that both want write access to a particular address, one of them is going to get to the directory first, and whichever gets to the directory first wins, or ultimately will win. Sometimes this can be unfair, so in some of these systems you try to have something which prevents the core that happens to be close to the directory from always winning access to that contended line, we'll say — you may have some sort of back-off protocol coming into effect there. But this is what effectively guarantees our atomicity: the directory guarantees that a particular line can only be transitioning from one state to another at a given time.

So we have the directory as the ordering point, and whichever message gets to the home directory first, let's say, wins. Subsequent requests for that line are going to lose. So what do you do on a lose? Well, the directory is probably going to want to send a negative acknowledgement, or NACK — or send a retry; it's the same thing here. It's going to say: I can't handle this right now, this line is already being transitioned, someone else won the transitioning of this line, go retry that memory transaction again in the future. Now, this gets pretty complicated back at the cache controller, because it's going to get a retry, or a NACK, back, and it could potentially have a message coming in for that exact same cache line. It needs to give the directory's message priority over the pipeline — the main processor — and order its own request after it, because the directory is the ultimate ordering point for the memory location.
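As a minimal sketch of that win-or-NACK behavior, here's some illustrative C; the busy bit, field names, and functions are hypothetical, not from any particular machine.

```c
/* Hypothetical directory entry acting as the ordering point for one
 * line: the first request to arrive wins and starts a transition; any
 * request arriving mid-transition is NACKed and must retry later. */
struct dir_entry {
    int busy;          /* nonzero while a transition is in flight */
    int state;         /* stable state, e.g. INVALID / SHARED / MODIFIED */
    unsigned sharers;  /* sharer bit vector (small machine assumed) */
};

enum dir_reply { DIR_PROCEED, DIR_NACK_RETRY };

enum dir_reply directory_accept(struct dir_entry *e)
{
    if (e->busy)
        return DIR_NACK_RETRY;  /* someone else won; retry in the future */
    e->busy = 1;                /* this requester wins the ordering race */
    return DIR_PROCEED;         /* ...carry out the state transition... */
}

void directory_complete(struct dir_entry *e, int new_state)
{
    e->state = new_state;
    e->busy = 0;                /* line may transition again */
}
```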
Finally, you have to worry about forward progress guarantees. What I mean by this is that it's pretty easy to build a memory system where your cache controller pulls the data into your cache, but before you're able to do the load or store to that actual address, the line gets bumped out of your cache. All of a sudden you have a cache line just bouncing between two caches, and that's a livelock scenario. So in directory-based coherence systems — and this also happens with bus-based protocols, though there it's usually less likely — you have to have some sort of forward progress guarantee. The reason it's less likely on a bus is that in a directory-based system the communication time is usually longer, so the window of vulnerability is longer. A forward progress guarantee means that once the line gets into your cache, you need to do at least one memory operation to it — to guarantee you've won — before you relinquish it back to the directory controller. So if you're doing a load, you actually read the cache line you brought in shared, performing that one load, before you're allowed to cough it up — you don't respond with a reply to, let's say, an invalidation request from the directory controller until you've done your one memory transaction. Likewise, if you bring in a line to modify it, do that one store before you cough the data back up, or release the data back to the memory controller. That's really important, to make sure you have some forward progress in the system and don't end up with a livelock.
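Here's a minimal sketch of how a cache controller might enforce that guarantee — deferring the invalidation reply until the first local access has completed. The state fields and handler names are invented for illustration.

```c
/* Hypothetical forward-progress logic: once a line arrives, perform at
 * least one load or store to it before honoring an invalidation (or
 * other relinquish) request from the directory. */
struct cache_line_state {
    int valid;
    int accessed_once;      /* set after the first load/store completes */
    int pending_invalidate; /* directory has asked for the line back */
};

void on_local_access(struct cache_line_state *l)
{
    l->accessed_once = 1;   /* the one guaranteed operation has happened */
    if (l->pending_invalidate) {
        l->valid = 0;       /* now it is safe to cough the line back up */
        l->pending_invalidate = 0;
        /* ...send invalidation ack to the directory... */
    }
}

void on_invalidate_request(struct cache_line_state *l)
{
    if (!l->accessed_once) {
        l->pending_invalidate = 1;  /* defer: avoids livelock ping-pong */
        return;
    }
    l->valid = 0;                   /* already made progress; release now */
    /* ...send invalidation ack to the directory... */
}
```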
Okay, so now we're going to get into the more future-looking stuff here. We've been talking about what's called a full-map directory, where there's a bit per processor core. If you have 1,000 cores in the system, that's a very large bit map, and it's pretty uncommon that a thousand cores in the system will all be reading one particular cache line. So that can be wasteful, and a full-map directory grows at order N, which is not particularly good. So people have looked into different protocols here; I just want to touch on this, since it's largely an area of research and future directions. One idea is a limited-pointer-based directory. Instead of having a bit mask as the sharer list — right, this here is the sharer list, sort of a bit mask of the sharers — you have base-two-encoded core numbers, up to some point. This is why it's called a limited directory, or limited-pointer directory: you can't have all the nodes in the system on the list, because there's a limited number of entries here, but you can name any of them, because you have a base-two encoding of the actual core number. And if the sharer set gets bigger than the entire list — say you have one, two, three, four, five entries, and all of a sudden a sixth sharer wants to take a copy of the line — there's usually an overflow bit here which says there are more sharers than pointers. When that bit is set and the line has to be invalidated — on a transition to modified, say — you're just going to send a broadcast message, an invalidate, to every single cache in the system. But usually this can be a good trade-off, because it's pretty uncommon to have extremely widely shared lines. So it's an interesting trade-off: storage space versus sending more messages in the system.

Likewise, there's an interesting protocol here called LimitLESS. Same idea — it's a limited directory with an overflow bit — but if it overflows, you start to keep the directory, or rather the sharer list, in software, in a structure in main memory. This requires you to have very fast traps, such that when an overflow happens while you're servicing a cache line, you interrupt the main processor and the main processor provides the rest of the sharer list. And there's a bunch of stuff in between, and future research still being done actively in this space.
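To make the storage trade-off concrete, here's a sketch contrasting the two entry formats; the sizes — 1,024 cores, five pointers — are made-up numbers for the example.

```c
#include <stdint.h>

#define NCORES 1024

/* Full-map: one presence bit per core, so storage grows O(N) per line. */
struct full_map_entry {
    uint64_t presence[NCORES / 64];  /* 1024 bits = 128 bytes per line */
};

/* Limited-pointer: a handful of base-2-encoded core IDs plus an
 * overflow bit.  10 bits name any of 1024 cores, so five pointers and
 * the overflow flag take only a few bytes per line. */
struct limited_ptr_entry {
    uint16_t ptr[5];       /* up to 5 sharers, each a 10-bit core ID */
    uint8_t  count;        /* how many pointers are in use */
    uint8_t  overflow;     /* set once a 6th sharer shows up */
};

/* Once overflowed, the exact sharer set is lost: an invalidation must
 * be broadcast to every cache (or, LimitLESS-style, trap to software
 * and keep the rest of the list in main memory). */
int must_broadcast(const struct limited_ptr_entry *e)
{
    return e->overflow;
}
```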
Beyond simple directory coherence, we have some pretty cool on-chip coherence. The reason this is actually being studied a lot right now is that people built these massively parallel directory-based machines in the past; they got some use and were very good for some applications, but now we're starting to put more and more cores onto a given chip. So we're starting to see on-chip coherence, and people are figuring out how to leverage the fast on-chip communication alongside directories to make more interesting coherence protocols. There's also something called COMA systems, or Cache Only Memory Architectures, where instead of having the data in main memory, you don't have main memory at all and the directories move around. That's beyond the scope of this course — go look at the KSR-1 if you want to read about that kind of research — and scaling the sharer list is an active area as well.

Briefly, I wanted to talk about the most up-to-date versions of these things. We have the SGI UV 1000, which is a descendant of the Origin and Origin 2000 machines from SGI. Lots of cores here — 2,560 cores that are all kept cache coherent. They use a directory-based coherence protocol, it's very non-uniform, and it's all connected together by a multi-chassis 2D torus. This is one chassis, and there can actually be up to eight of these. Princeton has one of these in the HPCRC center that I think is four chassis, which is about half the maximum size.

[COUGH] An on-chip example here is the TILE64Pro, which has 64 cores. Each of the cores is itself a directory home, and it runs a directory-based protocol. And remember I was talking about dividing communication traffic into different flows — well, there are three different memory networks here, so we had to come up with three different classes of traffic that would not deadlock among themselves, if you will. There are also four memory controllers connected into the interconnect system, and because of this the communication latency differs: a core here talking to this memory controller is very fast, whereas that core there takes longer — but maybe that core is close to this other memory controller. So there's non-uniformity here.
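As a toy illustration of that non-uniformity, here's a sketch that picks the nearest of four memory controllers on an 8x8 mesh by hop count; the controller coordinates are invented, not TILE64Pro's actual floorplan.

```c
#include <stdlib.h>

struct xy { int x, y; };

/* Hypothetical positions of four memory controllers on an 8x8 mesh. */
static const struct xy mem_ctrl[4] = {
    {0, 3}, {3, 0}, {7, 4}, {4, 7}
};

/* On a mesh, latency tracks Manhattan distance (hop count). */
int hops(struct xy a, struct xy b)
{
    return abs(a.x - b.x) + abs(a.y - b.y);
}

/* Each core has one "close" controller and several far ones. */
int nearest_controller(struct xy core)
{
    int best = 0;
    for (int i = 1; i < 4; i++)
        if (hops(core, mem_ctrl[i]) < hops(core, mem_ctrl[best]))
            best = i;
    return best;   /* the fast controller for this core */
}
```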
Okay, so this is our last slide of the term here: Beyond ELE 475. If you want to go on and do more, well, start reading some papers from the different computer architecture conferences. The proceedings of the International Symposium on Computer Architecture, ISCA, are a good place to start — that's probably the top major architecture conference in the field. The International Symposium on Microarchitecture, MICRO, is the top microarchitecture conference, so look there if you're trying to look inside a processor at some of the smaller microarchitectural details. ASPLOS, Architectural Support for Programming Languages and Operating Systems, covers the area between software and hardware. And HPCA looks at — or used to look at — big, high-performance computer systems, but now a lot of general computer architecture work ends up in that conference as well. Beyond reading papers, go do research: build chips, build some test [INAUDIBLE] FPGA, learn more about parallel computer architecture. And indeed, you can come back in the Fall and take ELE 580A, which is going to be a graduate-level primary-sources paper-reading course — a more traditional graduate course where you'll learn about all the different parallel computing systems. It's called Parallel Computation because it covers both parallel computer architecture and the programming models that hook together with it — parallel programming together with the architectures, because they go very much hand in hand. So let's stop here for today, and stop here for the course.