1 00:00:03,420 --> 00:00:07,476 Okay, so this brings us to the end of interconnection networks. 2 00:00:07,476 --> 00:00:10,980 And what I wanted to talk about was Deadlock. 3 00:00:10,980 --> 00:00:17,889 So let's assume that you have three nodes labeled one, two, and three here. 4 00:00:17,889 --> 00:00:24,343 It's on a 1D, or excuse me, yeah, oh 1D tourist, that is unidirectional. 5 00:00:24,343 --> 00:00:28,980 So links only flow in the clockwise direction here. 6 00:00:30,360 --> 00:00:36,510 Now, all of a sudden, you have node one wants to send to node three, node two 7 00:00:36,510 --> 00:00:41,840 wants to send to node one, and node three wants to send to, to two. 8 00:00:43,120 --> 00:00:46,201 And we're going to look at a wormhole routing network. 9 00:00:46,201 --> 00:00:50,937 So the wormhole routing network you start to send the header words and [INAUDIBLE] 10 00:00:50,937 --> 00:00:55,160 has not, potentially not even been injected into the network at this point. 11 00:00:55,160 --> 00:00:58,013 Let's say they all start at exactly the same time. 12 00:00:58,013 --> 00:01:01,000 So one is going to start sending to three. 13 00:01:01,000 --> 00:01:05,389 Two's going to start sending to one. Three's going to start sending to two. 14 00:01:05,389 --> 00:01:10,295 And if you'll notice, is each set of these has one overlapping link with the 15 00:01:10,295 --> 00:01:13,651 previous one. So there's going to be some contention on 16 00:01:13,651 --> 00:01:16,426 that link. So very quickly, you can see that. 17 00:01:16,426 --> 00:01:21,331 If we were to all, if we were to launch these three messages exactly at same 18 00:01:21,331 --> 00:01:26,843 time, and the message length was long. None of them would actually get to the 19 00:01:26,843 --> 00:01:31,331 receivers. One would use this link, or one sending 20 00:01:31,331 --> 00:01:36,021 to three would use this link, two sending to one would use this link, three sending 21 00:01:36,021 --> 00:01:40,541 to two would use this wraparound link. But then all of a sudden, they would try 22 00:01:40,541 --> 00:01:43,744 to acquire the next link but it's already been reserved. 23 00:01:43,744 --> 00:01:48,149 But they can't give up the link because they've already acquired the link, so all 24 00:01:48,149 --> 00:01:54,690 of a sudden you have a deadlock. So these numbers get pretty tricky. 25 00:01:54,690 --> 00:02:00,384 and one way that we go about trying to see if our routing protocol is deadlock 26 00:02:00,384 --> 00:02:06,217 free or actually can deadlock is we start to do something called Waits-for-analysis 27 00:02:06,217 --> 00:02:12,174 or waits-for and Holds analysis. And waits for, and Waits-for and Holds 28 00:02:12,174 --> 00:02:16,629 analysis, the basic idea is, so this is looking at 29 00:02:16,629 --> 00:02:21,993 our example that we saw before where we have three routers. 30 00:02:21,993 --> 00:02:29,448 What's happening here is we're injecting into A and this A is acquiring this link 31 00:02:29,448 --> 00:02:35,176 or this [INAUDIBLE] here, and then it needs to acquire the next 32 00:02:35,176 --> 00:02:39,960 one. So let's draw this as a diagram. 33 00:02:39,960 --> 00:02:45,395 So A holds TQ2, so A holds TQ2. 34 00:02:45,395 --> 00:02:52,100 This one here, this inbound Q here. 35 00:02:53,220 --> 00:02:57,294 And when we draw Waits-for and Hold analysis diagrams. 36 00:02:57,294 --> 00:03:03,254 When we are waiting on something we draw a edge from the thing that is the, the 37 00:03:03,254 --> 00:03:06,800 actor to the resource that we're waiting on, 38 00:03:06,800 --> 00:03:10,874 and the holds actually goes in the opposite direction. 39 00:03:10,874 --> 00:03:16,080 So if a holds TQ2 here we actually draw the edge the other direction. 40 00:03:18,191 --> 00:03:26,905 all of a sudden, A is holding TQ2 two, and it's waiting to acquire TQ3, this one 41 00:03:26,905 --> 00:03:33,232 here. because it needs to transit this link to deliver it to TQ3. 42 00:03:34,374 --> 00:03:45,200 But at the same time, B has acquired TQ3, so it's holding that and it's waiting for 43 00:03:45,200 --> 00:03:54,030 TQ1. But C currently holding TQ1 and its 44 00:03:54,030 --> 00:04:00,652 waiting on TQ2. This is how we do deadlock analysis from 45 00:04:00,652 --> 00:04:06,547 a proof perspective. We actually write these dependency graphs or Waits-for and 46 00:04:06,547 --> 00:04:12,025 Hods graphs and if you find a cycle in your protocol, your protocol can have a 47 00:04:12,025 --> 00:04:15,354 deadlock. It need a couple of some way to color 48 00:04:15,354 --> 00:04:20,555 link your, color, color edge in this graph or you need to really design your 49 00:04:20,555 --> 00:04:23,332 protocol. But this is the sort of canonical way 50 00:04:23,332 --> 00:04:27,978 that you can do this and if you start to look at different routing protocols some 51 00:04:27,978 --> 00:04:32,624 of them could possibly deadlock and some of'em can, you can prove statically to 52 00:04:32,624 --> 00:04:37,043 all possible weights for of whole graphs you can come up with will never have 53 00:04:37,043 --> 00:04:40,855 cycles. So I, I think I mentioned last time, if 54 00:04:40,855 --> 00:04:44,945 you look at dimension order routing with one more routing. 55 00:04:44,945 --> 00:04:50,494 All possible inputs and outputs on that you never actually going to have a 56 00:04:50,494 --> 00:04:53,910 deadlock, you never call for cycle in the graph. 57 00:04:53,910 --> 00:04:59,673 And the reason for this is the only ever quire X routers and then Y routers, will 58 00:04:59,673 --> 00:05:03,870 say if you have the dimension that you acquire X and quire Y. 59 00:05:03,870 --> 00:05:09,349 But you never going into have effectively something that is holding Y trying to 60 00:05:09,349 --> 00:05:14,258 acquire something on the X axis. So, you never going to have a, a cycle in 61 00:05:14,258 --> 00:05:16,820 that Waits-for and Holds whole analysis graph. 62 00:05:18,820 --> 00:05:27,600 I just briefly wanted to some up here that Deadlock is actually not an enemy. 63 00:05:27,600 --> 00:05:32,308 You can try to use deadlock, to your benefit or, or just live with 64 00:05:32,308 --> 00:05:34,945 deadlock. Just because your protocol has deadlock 65 00:05:34,945 --> 00:05:38,440 does not mean you can't try to recover from it. 66 00:05:38,440 --> 00:05:42,842 So what I mean by this is deadlock avoidance can be very expensive. 67 00:05:42,842 --> 00:05:47,769 So trying to draw all possible graphs here, you should think about, you know, 68 00:05:47,769 --> 00:05:52,763 where your deadlocks happen in your systems, and plan them out and make sure 69 00:05:52,763 --> 00:05:57,617 they don't happen often. But some of the things you need to do, to 70 00:05:57,617 --> 00:06:00,306 try and cut edges here, might be too restrictive. 71 00:06:00,306 --> 00:06:04,900 So for instance, dimensional routing may only route in the X dimension and then in 72 00:06:04,900 --> 00:06:09,326 the Y dimension, could be very expensive relative to an adaptive routing scheme, 73 00:06:09,326 --> 00:06:13,100 or you try to route around congestion in your network. 74 00:06:13,100 --> 00:06:17,602 So this deadlock avoidance, you know, you'll, you'll design the protocol to 75 00:06:17,602 --> 00:06:20,377 never deadlock, this could be quite expensive. 76 00:06:20,377 --> 00:06:25,311 So an alternative, and this is something you should not necessarily be afraid of, 77 00:06:25,311 --> 00:06:30,121 but you have to be careful of this is that you find the deadlock and then you 78 00:06:30,121 --> 00:06:34,530 try to recover from it. So what I mean by this is on-chip network 79 00:06:34,530 --> 00:06:38,850 or a, a, multi-chip network you can actually have a deadlock occur, 80 00:06:38,850 --> 00:06:43,538 notice that a deadlock occurred. And either try to roll back the state and 81 00:06:43,538 --> 00:06:47,130 somehow jitter the state so the deadlock won't happen a second time. 82 00:06:47,130 --> 00:06:51,633 Or many times these deadlocks happen because your dependent on one particular 83 00:06:51,633 --> 00:06:55,444 buffer in the system and two people trying to acquire that buffer. 84 00:06:55,444 --> 00:06:59,948 So you can try to virtualize that buffer. Effectively saving the state of that 85 00:06:59,948 --> 00:07:02,546 buffer in memory and adding more, more states. 86 00:07:02,546 --> 00:07:06,952 So you, a lot of these protocols, if you were to add more FIFO entries into the 87 00:07:06,952 --> 00:07:09,297 system, it would actually break the deadlock. 88 00:07:09,297 --> 00:07:11,802 So, sort of going back into this picture here, 89 00:07:11,802 --> 00:07:15,799 if all of a sudden if I added another, let's say, FIFO here that the inbound 90 00:07:15,799 --> 00:07:19,743 traffic went into and then this other traffic tried to bypass it, it would 91 00:07:19,743 --> 00:07:23,900 actually cut one of these edges, and all of a sudden you would not be having a 92 00:07:23,900 --> 00:07:26,879 deadlock. So you can think about trying to have a 93 00:07:26,879 --> 00:07:30,191 deadlock recovery mechanism that's an example of it. 94 00:07:30,191 --> 00:07:35,690 So a good example this is actually in the raw micro processor from an IT where we 95 00:07:35,690 --> 00:07:39,252 I, you can actually have that much on the on-chip network. 96 00:07:39,252 --> 00:07:44,064 Its dimensional routed, but you can still have message depended deadlocks. 97 00:07:44,064 --> 00:07:48,188 So what I mean by that is while the network itself is not going to deadlock, 98 00:07:48,188 --> 00:07:53,063 the traffic flows you have on top of it that from since your memory coherence 99 00:07:53,063 --> 00:07:57,040 protocol on top of that could be potentially deadlock. 100 00:07:57,040 --> 00:08:00,815 So how do you recover from that? Well you can actually have a counter, 101 00:08:00,815 --> 00:08:05,411 a timer which goes off when you determine that the network has not moved for a 1000 102 00:08:05,411 --> 00:08:07,928 cycles. So all the sudden you're running along 103 00:08:07,928 --> 00:08:12,086 and no traffic moves on your network. Well it's a pretty good indicator that 104 00:08:12,086 --> 00:08:15,478 you have a deadlock condition, because something should be flowing. 105 00:08:15,478 --> 00:08:18,160 There should be some forward progress guarantees. 106 00:08:18,160 --> 00:08:21,165 [COUGH] And if you notice that, an interrupt will go off. 107 00:08:21,165 --> 00:08:25,567 So this timer will go off saying nothing has moved on my network for 1,000 cycles, 108 00:08:25,567 --> 00:08:29,217 this is probably a deadlock. So that, you could take an interrupt and 109 00:08:29,217 --> 00:08:33,350 then software can go look at all this data in the network and introduce more 110 00:08:33,350 --> 00:08:38,239 buffering into the network. So, effectively through software try to 111 00:08:38,239 --> 00:08:43,210 virtualize a particular FIFO entry and that can break the deadlock. 112 00:08:43,210 --> 00:08:46,165 So there are ways to recover from deadlocks. 113 00:08:46,165 --> 00:08:51,674 If your network is for instance used for message passing network, you can just use 114 00:08:51,674 --> 00:08:55,771 memory to go do that. So actually on the raw processor, our 115 00:08:55,771 --> 00:09:00,742 memory coherence protocol, we used deadlock avoidance and then our message 116 00:09:00,742 --> 00:09:05,310 passing protocol, our message passing networks use deadlock recovery. 117 00:09:05,310 --> 00:09:09,475 later on in, in Tylera actually we have something depending on which memory 118 00:09:09,475 --> 00:09:13,427 network we have mixtures of these two. And you might say that sounds scary, 119 00:09:13,427 --> 00:09:17,538 but if you walk through the proofs and are very careful about it and you are 120 00:09:17,538 --> 00:09:21,704 guaranteed that you don't need anymore resources to go break the deadlock you 121 00:09:21,704 --> 00:09:24,320 may be okay. But yeah it's, its just some say it's 122 00:09:24,320 --> 00:09:27,258 playing with fire. You've got to be a little careful of 123 00:09:27,258 --> 00:09:31,370 these, these sorts of solutions when trying to play with deadlock recovery. 124 00:09:31,370 --> 00:09:34,701 But if you're guaranteed that you're not going to need any more resources you can 125 00:09:34,701 --> 00:09:37,835 just resolve it right then. And oh, 126 00:09:37,835 --> 00:09:42,648 so the reason you would want to, to use the lock recovery is it does not restrict 127 00:09:42,648 --> 00:09:46,224 your protocol. And by doing that, you can have the 128 00:09:46,224 --> 00:09:48,648 common case of your protocol be very, very fast. 129 00:09:48,648 --> 00:09:51,948 And, you could make it so that the deadlock almost never happens. 130 00:09:51,948 --> 00:09:54,217 Or never, in practical concerns ever happens. 131 00:09:54,217 --> 00:09:58,085 Then, when the deadlock does happen oh, you take a little bit of performance 132 00:09:58,085 --> 00:10:01,643 bump, or a performance hit there. Because you have to virtualize it in 133 00:10:01,643 --> 00:10:04,273 software. But the probability of it happening is so 134 00:10:04,273 --> 00:10:04,480 low.