1
00:00:03,420 --> 00:00:07,476
Okay, so this brings us to the end of 
interconnection networks. 

2
00:00:07,476 --> 00:00:10,980
And what I wanted to talk about was 
Deadlock. 

3
00:00:10,980 --> 00:00:17,889
So let's assume that you have three nodes 
labeled one, two, and three here. 

4
00:00:17,889 --> 00:00:24,343
It's on a 1D, or excuse me, yeah, oh 1D 
tourist, that is unidirectional. 

5
00:00:24,343 --> 00:00:28,980
So links only flow in the clockwise 
direction here. 

6
00:00:30,360 --> 00:00:36,510
Now, all of a sudden, you have node one 
wants to send to node three, node two 

7
00:00:36,510 --> 00:00:41,840
wants to send to node one, and node three 
wants to send to, to two. 

8
00:00:43,120 --> 00:00:46,201
And we're going to look at a wormhole 
routing network. 

9
00:00:46,201 --> 00:00:50,937
So the wormhole routing network you start 
to send the header words and [INAUDIBLE] 

10
00:00:50,937 --> 00:00:55,160
has not, potentially not even been 
injected into the network at this point. 

11
00:00:55,160 --> 00:00:58,013
Let's say they all start at exactly the 
same time. 

12
00:00:58,013 --> 00:01:01,000
So one is going to start sending to 
three. 

13
00:01:01,000 --> 00:01:05,389
Two's going to start sending to one. 
Three's going to start sending to two. 

14
00:01:05,389 --> 00:01:10,295
And if you'll notice, is each set of 
these has one overlapping link with the 

15
00:01:10,295 --> 00:01:13,651
previous one. 
So there's going to be some contention on 

16
00:01:13,651 --> 00:01:16,426
that link. 
So very quickly, you can see that. 

17
00:01:16,426 --> 00:01:21,331
If we were to all, if we were to launch 
these three messages exactly at same 

18
00:01:21,331 --> 00:01:26,843
time, and the message length was long. 
None of them would actually get to the 

19
00:01:26,843 --> 00:01:31,331
receivers. 
One would use this link, or one sending 

20
00:01:31,331 --> 00:01:36,021
to three would use this link, two sending 
to one would use this link, three sending 

21
00:01:36,021 --> 00:01:40,541
to two would use this wraparound link. 
But then all of a sudden, they would try 

22
00:01:40,541 --> 00:01:43,744
to acquire the next link but it's already 
been reserved. 

23
00:01:43,744 --> 00:01:48,149
But they can't give up the link because 
they've already acquired the link, so all 

24
00:01:48,149 --> 00:01:54,690
of a sudden you have a deadlock. 
So these numbers get pretty tricky. 

25
00:01:54,690 --> 00:02:00,384
and one way that we go about trying to 
see if our routing protocol is deadlock 

26
00:02:00,384 --> 00:02:06,217
free or actually can deadlock is we start 
to do something called Waits-for-analysis 

27
00:02:06,217 --> 00:02:12,174
or waits-for and Holds analysis. 
And waits for, and Waits-for and Holds 

28
00:02:12,174 --> 00:02:16,629
analysis, 
the basic idea is, so this is looking at 

29
00:02:16,629 --> 00:02:21,993
our example that we saw before where we 
have three routers. 

30
00:02:21,993 --> 00:02:29,448
What's happening here is we're injecting 
into A and this A is acquiring this link 

31
00:02:29,448 --> 00:02:35,176
or this [INAUDIBLE] here, 
and then it needs to acquire the next 

32
00:02:35,176 --> 00:02:39,960
one. 
So let's draw this as a diagram. 

33
00:02:39,960 --> 00:02:45,395
So A holds TQ2, 
so A holds TQ2. 

34
00:02:45,395 --> 00:02:52,100
This one here, 
this inbound Q here. 

35
00:02:53,220 --> 00:02:57,294
And when we draw Waits-for and Hold 
analysis diagrams. 

36
00:02:57,294 --> 00:03:03,254
When we are waiting on something we draw 
a edge from the thing that is the, the 

37
00:03:03,254 --> 00:03:06,800
actor to the resource that we're waiting 
on, 

38
00:03:06,800 --> 00:03:10,874
and the holds actually goes in the 
opposite direction. 

39
00:03:10,874 --> 00:03:16,080
So if a holds TQ2 here we actually draw 
the edge the other direction. 

40
00:03:18,191 --> 00:03:26,905
all of a sudden, A is holding TQ2 two, 
and it's waiting to acquire TQ3, this one 

41
00:03:26,905 --> 00:03:33,232
here. because it needs to transit this 
link to deliver it to TQ3. 

42
00:03:34,374 --> 00:03:45,200
But at the same time, B has acquired TQ3, 
so it's holding that and it's waiting for 

43
00:03:45,200 --> 00:03:54,030
TQ1. 
But C currently holding TQ1 and its 

44
00:03:54,030 --> 00:04:00,652
waiting on TQ2. 
This is how we do deadlock analysis from 

45
00:04:00,652 --> 00:04:06,547
a proof perspective. We actually write 
these dependency graphs or Waits-for and 

46
00:04:06,547 --> 00:04:12,025
Hods graphs and if you find a cycle in 
your protocol, your protocol can have a 

47
00:04:12,025 --> 00:04:15,354
deadlock. 
It need a couple of some way to color 

48
00:04:15,354 --> 00:04:20,555
link your, color, color edge in this 
graph or you need to really design your 

49
00:04:20,555 --> 00:04:23,332
protocol. 
But this is the sort of canonical way 

50
00:04:23,332 --> 00:04:27,978
that you can do this and if you start to 
look at different routing protocols some 

51
00:04:27,978 --> 00:04:32,624
of them could possibly deadlock and some 
of'em can, you can prove statically to 

52
00:04:32,624 --> 00:04:37,043
all possible weights for of whole graphs 
you can come up with will never have 

53
00:04:37,043 --> 00:04:40,855
cycles. 
So I, I think I mentioned last time, if 

54
00:04:40,855 --> 00:04:44,945
you look at dimension order routing with 
one more routing. 

55
00:04:44,945 --> 00:04:50,494
All possible inputs and outputs on that 
you never actually going to have a 

56
00:04:50,494 --> 00:04:53,910
deadlock, 
you never call for cycle in the graph. 

57
00:04:53,910 --> 00:04:59,673
And the reason for this is the only ever 
quire X routers and then Y routers, will 

58
00:04:59,673 --> 00:05:03,870
say if you have the dimension that you 
acquire X and quire Y. 

59
00:05:03,870 --> 00:05:09,349
But you never going into have effectively 
something that is holding Y trying to 

60
00:05:09,349 --> 00:05:14,258
acquire something on the X axis. 
So, you never going to have a, a cycle in 

61
00:05:14,258 --> 00:05:16,820
that Waits-for and Holds whole analysis 
graph. 

62
00:05:18,820 --> 00:05:27,600
I just briefly wanted to some up here 
that Deadlock is actually not an enemy. 

63
00:05:27,600 --> 00:05:32,308
You can try to use deadlock, 
to your benefit or, or just live with 

64
00:05:32,308 --> 00:05:34,945
deadlock. 
Just because your protocol has deadlock 

65
00:05:34,945 --> 00:05:38,440
does not mean you can't try to recover 
from it. 

66
00:05:38,440 --> 00:05:42,842
So what I mean by this is deadlock 
avoidance can be very expensive. 

67
00:05:42,842 --> 00:05:47,769
So trying to draw all possible graphs 
here, you should think about, you know, 

68
00:05:47,769 --> 00:05:52,763
where your deadlocks happen in your 
systems, and plan them out and make sure 

69
00:05:52,763 --> 00:05:57,617
they don't happen often. 
But some of the things you need to do, to 

70
00:05:57,617 --> 00:06:00,306
try and cut edges here, might be too 
restrictive. 

71
00:06:00,306 --> 00:06:04,900
So for instance, dimensional routing may 
only route in the X dimension and then in 

72
00:06:04,900 --> 00:06:09,326
the Y dimension, could be very expensive 
relative to an adaptive routing scheme, 

73
00:06:09,326 --> 00:06:13,100
or you try to route around congestion in 
your network. 

74
00:06:13,100 --> 00:06:17,602
So this deadlock avoidance, you know, 
you'll, you'll design the protocol to 

75
00:06:17,602 --> 00:06:20,377
never deadlock, 
this could be quite expensive. 

76
00:06:20,377 --> 00:06:25,311
So an alternative, and this is something 
you should not necessarily be afraid of, 

77
00:06:25,311 --> 00:06:30,121
but you have to be careful of this is 
that you find the deadlock and then you 

78
00:06:30,121 --> 00:06:34,530
try to recover from it. 
So what I mean by this is on-chip network 

79
00:06:34,530 --> 00:06:38,850
or a, a, multi-chip network you can 
actually have a deadlock occur, 

80
00:06:38,850 --> 00:06:43,538
notice that a deadlock occurred. 
And either try to roll back the state and 

81
00:06:43,538 --> 00:06:47,130
somehow jitter the state so the deadlock 
won't happen a second time. 

82
00:06:47,130 --> 00:06:51,633
Or many times these deadlocks happen 
because your dependent on one particular 

83
00:06:51,633 --> 00:06:55,444
buffer in the system and two people 
trying to acquire that buffer. 

84
00:06:55,444 --> 00:06:59,948
So you can try to virtualize that buffer. 
Effectively saving the state of that 

85
00:06:59,948 --> 00:07:02,546
buffer in memory and adding more, more 
states. 

86
00:07:02,546 --> 00:07:06,952
So you, a lot of these protocols, if you 
were to add more FIFO entries into the 

87
00:07:06,952 --> 00:07:09,297
system, it would actually break the 
deadlock. 

88
00:07:09,297 --> 00:07:11,802
So, sort of going back into this picture 
here, 

89
00:07:11,802 --> 00:07:15,799
if all of a sudden if I added another, 
let's say, FIFO here that the inbound 

90
00:07:15,799 --> 00:07:19,743
traffic went into and then this other 
traffic tried to bypass it, it would 

91
00:07:19,743 --> 00:07:23,900
actually cut one of these edges, and all 
of a sudden you would not be having a 

92
00:07:23,900 --> 00:07:26,879
deadlock. 
So you can think about trying to have a 

93
00:07:26,879 --> 00:07:30,191
deadlock recovery mechanism that's an 
example of it. 

94
00:07:30,191 --> 00:07:35,690
So a good example this is actually in the 
raw micro processor from an IT where we 

95
00:07:35,690 --> 00:07:39,252
I, you can actually have that much on the 
on-chip network. 

96
00:07:39,252 --> 00:07:44,064
Its dimensional routed, but you can still 
have message depended deadlocks. 

97
00:07:44,064 --> 00:07:48,188
So what I mean by that is while the 
network itself is not going to deadlock, 

98
00:07:48,188 --> 00:07:53,063
the traffic flows you have on top of it 
that from since your memory coherence 

99
00:07:53,063 --> 00:07:57,040
protocol on top of that could be 
potentially deadlock. 

100
00:07:57,040 --> 00:08:00,815
So how do you recover from that? 
Well you can actually have a counter, 

101
00:08:00,815 --> 00:08:05,411
a timer which goes off when you determine 
that the network has not moved for a 1000 

102
00:08:05,411 --> 00:08:07,928
cycles. 
So all the sudden you're running along 

103
00:08:07,928 --> 00:08:12,086
and no traffic moves on your network. 
Well it's a pretty good indicator that 

104
00:08:12,086 --> 00:08:15,478
you have a deadlock condition, 
because something should be flowing. 

105
00:08:15,478 --> 00:08:18,160
There should be some forward progress 
guarantees. 

106
00:08:18,160 --> 00:08:21,165
[COUGH] And if you notice that, an 
interrupt will go off. 

107
00:08:21,165 --> 00:08:25,567
So this timer will go off saying nothing 
has moved on my network for 1,000 cycles, 

108
00:08:25,567 --> 00:08:29,217
this is probably a deadlock. 
So that, you could take an interrupt and 

109
00:08:29,217 --> 00:08:33,350
then software can go look at all this 
data in the network and introduce more 

110
00:08:33,350 --> 00:08:38,239
buffering into the network. 
So, effectively through software try to 

111
00:08:38,239 --> 00:08:43,210
virtualize a particular FIFO entry and 
that can break the deadlock. 

112
00:08:43,210 --> 00:08:46,165
So there are ways to recover from 
deadlocks. 

113
00:08:46,165 --> 00:08:51,674
If your network is for instance used for 
message passing network, you can just use 

114
00:08:51,674 --> 00:08:55,771
memory to go do that. 
So actually on the raw processor, our 

115
00:08:55,771 --> 00:09:00,742
memory coherence protocol, we used 
deadlock avoidance and then our message 

116
00:09:00,742 --> 00:09:05,310
passing protocol, our message passing 
networks use deadlock recovery. 

117
00:09:05,310 --> 00:09:09,475
later on in, in Tylera actually we have 
something depending on which memory 

118
00:09:09,475 --> 00:09:13,427
network we have mixtures of these two. 
And you might say that sounds scary, 

119
00:09:13,427 --> 00:09:17,538
but if you walk through the proofs and 
are very careful about it and you are 

120
00:09:17,538 --> 00:09:21,704
guaranteed that you don't need anymore 
resources to go break the deadlock you 

121
00:09:21,704 --> 00:09:24,320
may be okay. 
But yeah it's, its just some say it's 

122
00:09:24,320 --> 00:09:27,258
playing with fire. 
You've got to be a little careful of 

123
00:09:27,258 --> 00:09:31,370
these, these sorts of solutions when 
trying to play with deadlock recovery. 

124
00:09:31,370 --> 00:09:34,701
But if you're guaranteed that you're not 
going to need any more resources you can 

125
00:09:34,701 --> 00:09:37,835
just resolve it right then. 
And oh, 

126
00:09:37,835 --> 00:09:42,648
so the reason you would want to, to use 
the lock recovery is it does not restrict 

127
00:09:42,648 --> 00:09:46,224
your protocol. 
And by doing that, you can have the 

128
00:09:46,224 --> 00:09:48,648
common case of your protocol be very, 
very fast. 

129
00:09:48,648 --> 00:09:51,948
And, you could make it so that the 
deadlock almost never happens. 

130
00:09:51,948 --> 00:09:54,217
Or never, in practical concerns ever 
happens. 

131
00:09:54,217 --> 00:09:58,085
Then, when the deadlock does happen oh, 
you take a little bit of performance 

132
00:09:58,085 --> 00:10:01,643
bump, or a performance hit there. 
Because you have to virtualize it in 

133
00:10:01,643 --> 00:10:04,273
software. 
But the probability of it happening is so 

134
00:10:04,273 --> 00:10:04,480
low.