1
00:00:00,800 --> 00:00:04,770
So, now that we understand that the 
optimal solution to the sequence element 

2
00:00:04,770 --> 00:00:08,636
problem has to be only one of three 
candidates, we're going to be easily able 

3
00:00:08,636 --> 00:00:12,554
to formulate a recurrence, identify the 
relevant sub-problems and derive an 

4
00:00:12,554 --> 00:00:15,480
efficient dynamic programming algorithm 
for the problem. 

5
00:00:17,900 --> 00:00:22,111
So, here is the culmination of our work. 
On the previous video, we thought about 

6
00:00:22,111 --> 00:00:26,322
an optimal alignment of some pair of 
strings X and Y, and we notice that there 

7
00:00:26,322 --> 00:00:29,291
are three cases for the contents of the 
final position. 

8
00:00:29,291 --> 00:00:33,503
Either there's no gaps or there's a gap 
on top or there's a gap on the bottom. 

9
00:00:33,503 --> 00:00:36,688
in case one, where there's no gaps, XM 
and YM get matched. 

10
00:00:36,688 --> 00:00:41,061
And we proved that the induced alignment 
which is of the smaller strings X prime 

11
00:00:41,061 --> 00:00:43,599
and Y prime has to be optimal in its own 
right. 

12
00:00:43,599 --> 00:00:47,810
In the second case where the character 
little X sub M gets matched with a gap 

13
00:00:47,810 --> 00:00:50,240
induced alignment this time of X prime 
and Y. 

14
00:00:50,240 --> 00:00:54,006
Has to be optimal in its own right, and 
the third case where little y sub n gets 

15
00:00:54,006 --> 00:00:57,020
matched with the gap, the induced 
alignment now of x and y prime. 

16
00:00:57,020 --> 00:01:01,160
Must be optimal. 
So one way to think about this kind of 

17
00:01:01,160 --> 00:01:05,658
assertion is it says that the optimal 
solution to a problem, to a sequence of a 

18
00:01:05,658 --> 00:01:10,338
lot of problem depends only on the 
solutions to three different smaller 

19
00:01:10,338 --> 00:01:15,140
sub-problems, one involving x, x prime 
and y prime with characters peeled off of 

20
00:01:15,140 --> 00:01:18,544
both of the strings. 
One involving x prime and y and one 

21
00:01:18,544 --> 00:01:22,495
involving x and y prime. 
But in all of the cases, all that we care 

22
00:01:22,495 --> 00:01:27,418
about are sub-problems in which a single 
character was peeled off from the right 

23
00:01:27,418 --> 00:01:31,080
from one or both of the strings that we 
started with. 

24
00:01:31,080 --> 00:01:34,308
The situation is very similar to in our 
previous two examples. 

25
00:01:34,308 --> 00:01:38,109
We have independent sets on line graphs 
and the nap sack problem and the 

26
00:01:38,109 --> 00:01:42,326
independent set problem whenever we, we 
only cared about sub problems obtained by 

27
00:01:42,326 --> 00:01:45,763
plucking off either one or two vertices 
from the given line graph. 

28
00:01:45,763 --> 00:01:49,303
So all we ever cared about were prefixes 
of the original line graph. 

29
00:01:49,303 --> 00:01:53,156
In the nap sack problem we got sub 
problems by plucking off the last item 

30
00:01:53,156 --> 00:01:56,905
and perhaps also reducing the nap sack 
capacity by some interval amount. 

31
00:01:56,905 --> 00:02:01,174
So there were two dimensions in the nap 
sack problem for which sub problems could 

32
00:02:01,174 --> 00:02:05,080
decrease in size, then number of items in 
the residual nap sack capacity. 

33
00:02:05,080 --> 00:02:07,811
So we use two parameters to keep track of 
the sub problems. 

34
00:02:07,811 --> 00:02:11,562
And what we cared about were all possible 
prefixes of the items and all possible 

35
00:02:11,562 --> 00:02:14,711
residual integral capacities, at most the 
original knapsack capacity. 

36
00:02:14,711 --> 00:02:16,887
So what's up in the sequence alignment 
problem? 

37
00:02:16,887 --> 00:02:20,406
Well here, sub problems get smaller by 
plucking a character off of the first 

38
00:02:20,406 --> 00:02:23,787
string and or the second string. 
So again there are two ways in which the 

39
00:02:23,787 --> 00:02:27,167
sub problem can get smaller, either the 
first string or the second string. 

40
00:02:27,167 --> 00:02:30,454
So we'll again use two different 
parameters, one to figure out how much 

41
00:02:30,454 --> 00:02:33,974
we've plucked off of the first string, 
the second one to figure out how much 

42
00:02:33,974 --> 00:02:36,132
we've plucked off of the second string. 
Right. 

43
00:02:36,132 --> 00:02:39,780
But all we care about. 
The only relevant sub problems involved. 

44
00:02:39,780 --> 00:02:45,580
Prefixes of the two original input 
strings X and Y. 

45
00:02:45,580 --> 00:02:51,803
That is, the only sub problems that we 
care about have the form x I y j, where x 

46
00:02:51,803 --> 00:02:58,185
I denotes the first I letters of capital 
x, some prefix of x, and y j denotes some 

47
00:02:58,185 --> 00:03:05,888
prefix of y, the first j letters of y. 
So lets now move from the sub-problems 

48
00:03:05,888 --> 00:03:09,399
we're going to use in our dynamic 
programming algorithm to the recurrence 

49
00:03:09,399 --> 00:03:12,465
that we're going to use. 
And the recurrence really all it does is 

50
00:03:12,465 --> 00:03:16,223
compile our understanding of the optimal 
solution and how it depends on the 

51
00:03:16,223 --> 00:03:20,080
solution of the smaller sub-problems into 
an easy to use mathematical formula. 

52
00:03:23,380 --> 00:03:28,149
So I'll use the notation P sub i j for 
the value of the optimal solution to the 

53
00:03:28,149 --> 00:03:32,680
corresponding sub problem, the one 
involving the prefix X i and the prefix Y 

54
00:03:32,680 --> 00:03:41,295
j So for a given set of positive values 
for i and j, what is Pij? 

55
00:03:41,295 --> 00:03:49,205
Well, there are three possibilities. 
Case one is where the final position of 

56
00:03:49,205 --> 00:03:54,207
the optimal alignment doesn't have any 
gaps, so it matches the final character 

57
00:03:54,207 --> 00:03:59,337
of X sub i, that is little x sub i and 
the final character of the prefix capital 

58
00:03:59,337 --> 00:04:02,223
Y sub j, that is the character little y 
sub j. 

59
00:04:02,223 --> 00:04:07,289
It matches them together and just reuses 
an optimal alignment for the smaller 

60
00:04:07,289 --> 00:04:15,885
strings, Xi-1 and Yi-1 Case two is where 
the last letter of the first string, that 

61
00:04:15,885 --> 00:04:21,499
is little x of i gets matched with a gap. 
And that case the penalty of the 

62
00:04:21,499 --> 00:04:27,046
corresponding alignment is the penalty of 
a gap plus whatever the optimal alignment 

63
00:04:27,046 --> 00:04:31,999
is of the first i minus one letter of 
capital X plus the first J letter of 

64
00:04:31,999 --> 00:04:37,376
Capital Y. 
The symmetrically case three we pay for a 

65
00:04:37,376 --> 00:04:41,461
gap and then we pay whatever the optimal 
alignment is of all of the first I 

66
00:04:41,461 --> 00:04:44,792
letters of capital X with the first j 
menace one letters of Y. 

67
00:04:44,792 --> 00:04:49,199
This is the case where the last letter of 
the second string gets matched with the 

68
00:04:49,199 --> 00:04:52,900
gap in the final position of the optimal 
alignment. 

69
00:04:52,900 --> 00:04:57,226
So we know that the optimal solution has 
to look like one of these three things, 

70
00:04:57,226 --> 00:05:01,390
we don't know which, so in the recurrence 
we'll just in effect do brute force 

71
00:05:01,390 --> 00:05:05,067
search among the three outcomes. 
We just remember, we just choose the 

72
00:05:05,067 --> 00:05:09,850
minimum of the three possibilities. 
So that's the formal recurrence. 

73
00:05:09,850 --> 00:05:14,020
It's correctness really just follows 
immediately from the work we already did, 

74
00:05:14,020 --> 00:05:16,906
understand what the optimal solution has 
to look like. 

75
00:05:16,906 --> 00:05:21,076
So, before we state the algorithm where 
we systematically solve all of the sub 

76
00:05:21,076 --> 00:05:24,978
problems using this magical formula. 
Let's just make sure we get the edge 

77
00:05:24,978 --> 00:05:29,280
cases, the base cases where i or j equals 
zero correctly sorted out. 

78
00:05:29,280 --> 00:05:37,066
So specifically what is the value of p 
I,0 and also it turns out p of zero, i 

79
00:05:37,066 --> 00:05:41,820
where i here is just some non-negative 
integer. 

80
00:05:52,100 --> 00:05:54,462
Alright. 
So the answer to this question is the 

81
00:05:54,462 --> 00:05:57,441
second one, is B. 
And I hope if you could keep the notation 

82
00:05:57,441 --> 00:05:59,598
straight then the answer was fairly 
clear. 

83
00:05:59,598 --> 00:06:01,755
So let's remember what, what does PIJ 
mean? 

84
00:06:01,755 --> 00:06:05,864
That's the total penalty of an optimal 
alignment between the first i letters of 

85
00:06:05,864 --> 00:06:08,483
X, and the first j letters of Y. 
So consider Pi zero. 

86
00:06:08,483 --> 00:06:12,489
So now we're asking about aligning the 
first zero letters of X with the first 

87
00:06:12,489 --> 00:06:14,954
zero letters of Y. 
That is with the empty string. 

88
00:06:14,954 --> 00:06:19,166
Well the optimal way to match any string 
to the empty string is you're just going 

89
00:06:19,166 --> 00:06:22,402
to insert gaps into the empty string to 
equalize their lengths. 

90
00:06:22,402 --> 00:06:25,535
And so if your string has length i, you 
need to insert i gaps. 

91
00:06:25,535 --> 00:06:29,040
What's the? 
Penalty of that alignment is just i times 

92
00:06:29,040 --> 00:06:33,520
the penalty for a single gap, and that's 
the answer here in B. 

93
00:06:33,520 --> 00:06:37,541
So we're ready now to give the algorithm, 
and as with all these dynamic programming 

94
00:06:37,541 --> 00:06:41,418
algorithms once you know the sub-problems 
and once you know the recurrence that 

95
00:06:41,418 --> 00:06:43,986
relates their solutions there's really 
nothing to do. 

96
00:06:43,986 --> 00:06:47,910
All you do is systematically answer solve 
all of the sub-problems moving from 

97
00:06:47,910 --> 00:06:52,115
smallest to largest. 
So we're going to use an array A to keep 

98
00:06:52,115 --> 00:06:54,646
track of the solutions of all of these 
sub-problems. 

99
00:06:54,646 --> 00:06:58,297
A is going to have two dimensions. 
The reason for two dimensions is we have 

100
00:06:58,297 --> 00:07:01,948
two independent parameters which are 
keeping track of the sub-problem size. 

101
00:07:01,948 --> 00:07:05,550
One for how many letters of X we're 
dealing with, and the second dimension 

102
00:07:05,550 --> 00:07:07,983
for how many letters of Y that we're 
dealing with. 

103
00:07:07,983 --> 00:07:11,975
That's analogous to the knapsack problem, 
where we also had two dimensions to keep 

104
00:07:11,975 --> 00:07:14,165
track of. 
The number of items in play, and the 

105
00:07:14,165 --> 00:07:19,185
residual knapsack capacity. 
We just figured out what the base case 

106
00:07:19,185 --> 00:07:22,689
is, so we just solved those in a 
pre-processing step. 

107
00:07:22,689 --> 00:07:27,596
So if one of the two indices is zero, 
then the optimal solution value is just 

108
00:07:27,596 --> 00:07:32,856
the gap penalty times the non-zero index. 
And, now we just go to our double four 

109
00:07:32,856 --> 00:07:35,268
loops. 
It's double four loops because we have 

110
00:07:35,268 --> 00:07:38,805
two indices into out array. 
And whenever we get into a sub problem, 

111
00:07:38,805 --> 00:07:43,414
we just evaluate the recurrence invoking 
of these solution to the already computed 

112
00:07:43,414 --> 00:07:49,533
smaller sub problems. 
So one sanity check you should always 

113
00:07:49,533 --> 00:07:53,518
apply when you're writing out the code 
for a dynamic programming algorithm: when 

114
00:07:53,518 --> 00:07:57,205
you look at the right hand side of your 
recurrence, when you look at these 

115
00:07:57,205 --> 00:08:01,241
purportedly already solved subproblems 
whose solutions you're using to solve the 

116
00:08:01,241 --> 00:08:04,629
current subproblem, make sure you have 
actually already solved those 

117
00:08:04,629 --> 00:08:07,269
subproblems. 
So in this case we're good to go because 

118
00:08:07,269 --> 00:08:11,355
the indices of the subproblems are only 
less than the entry that we're filling in 

119
00:08:11,355 --> 00:08:13,646
right now. 
So indeed all three of the relevant 

120
00:08:13,646 --> 00:08:18,828
subproblems, A-I - 1 j - 1 A-I - 1 j, and 
A-I j - 1 they've already been computed 

121
00:08:18,828 --> 00:08:21,120
in earlier iterations of our double four 
loop. 

122
00:08:21,120 --> 00:08:25,980
So they're just hanging out, waiting to 
get looked up in constant time. 

123
00:08:25,980 --> 00:08:29,912
And as usual once you've actually figured 
out the key ingredients for the dynamic 

124
00:08:29,912 --> 00:08:33,221
programming solution, namely the 
sub-problems and the recurrence, it's 

125
00:08:33,221 --> 00:08:36,961
pretty much self evident why the things 
going to work and it's also self evident 

126
00:08:36,961 --> 00:08:39,900
exactly what its running time is going to 
be. 

127
00:08:39,900 --> 00:08:44,896
So why is the algorithm correct? 
That is, why does it terminate with every 

128
00:08:44,896 --> 00:08:49,618
entry Aij equal to the true optimal 
penalty Pij of the corresponding 

129
00:08:49,618 --> 00:08:53,589
sub-problem. 
Well, this just follows because our 

130
00:08:53,589 --> 00:08:57,650
recurrent is correct, that's where all 
the hard work was, and then we're just 

131
00:08:57,650 --> 00:09:00,161
systematically solving all of the 
sub-problems. 

132
00:09:00,161 --> 00:09:03,100
So, formally, if you like, it would be 
proof by induction. 

133
00:09:05,120 --> 00:09:07,942
So, the running time is completely 
trivial to evaluate. 

134
00:09:07,942 --> 00:09:11,863
In each iteration of this double four 
loop, we do a constant amount of work. 

135
00:09:11,863 --> 00:09:15,784
We just need to look up three things in 
constant time and make a couple of 

136
00:09:15,784 --> 00:09:17,980
comparisons. 
How many four loops are there? 

137
00:09:17,980 --> 00:09:21,744
Well, M iterations of the outer four 
loop, N iterations of the inner four 

138
00:09:21,744 --> 00:09:23,835
loop. 
So we suffer the product, M times N. 

139
00:09:23,835 --> 00:09:28,121
That is, the running time is proportional 
to the product of the lengths of the two 

140
00:09:28,121 --> 00:09:33,661
strings. 
So depending on the application, you may 

141
00:09:33,661 --> 00:09:37,654
be content to just have an algorithm 
compute for you the nw score, the total 

142
00:09:37,654 --> 00:09:41,699
penalty of an optimal alignment or 
perhaps you're actually interested in the 

143
00:09:41,699 --> 00:09:44,956
alignment itself. 
And just as we discussed with independent 

144
00:09:44,956 --> 00:09:48,739
sets of line graphs, by tracing back 
through the filled in table, you can 

145
00:09:48,739 --> 00:09:52,836
indeed reconstruct an optimal solution. 
So let me just give you the high level 

146
00:09:52,836 --> 00:09:55,989
idea of how it works. 
It's going to to follow the same template 

147
00:09:55,989 --> 00:10:00,244
and all you think through the details of 
why this really works in the privacy of 

148
00:10:00,244 --> 00:10:04,258
your own home. 
So assume that you've already run the 

149
00:10:04,258 --> 00:10:07,868
algorithm on the previous slide. 
That you've filled in all the entries of 

150
00:10:07,868 --> 00:10:10,390
this two d array capital A. 
Now we want to trace back. 

151
00:10:10,390 --> 00:10:13,551
So where are we going to start tracing 
back this filled in table? 

152
00:10:13,551 --> 00:10:16,956
Well, we are going to start with a 
problem that we actually care about, 

153
00:10:16,956 --> 00:10:20,847
namely the largest problem A of M comma 
N, that's what we want the alignment for. 

154
00:10:20,847 --> 00:10:24,593
Now we know the softball alignment looks, 
has one of three candidates, we know 

155
00:10:24,593 --> 00:10:28,533
there's three possible situations for the 
contents of the final position of that 

156
00:10:28,533 --> 00:10:30,916
alignment. 
More over, when we filled in this entry 

157
00:10:30,916 --> 00:10:34,807
of the table, we explicitly compared the 
three possibilities to figure out which 

158
00:10:34,807 --> 00:10:37,580
one was the best. 
So you know, perhaps on the forward pass 

159
00:10:37,580 --> 00:10:41,276
we actually cached the result of the 
comparison, or in the worst case we can 

160
00:10:41,276 --> 00:10:44,390
just go back and re-compute, and figure 
out which of those three. 

161
00:10:44,390 --> 00:10:48,138
Pieces was used to fill in this entry, 
and depending on which one of the three 

162
00:10:48,138 --> 00:10:51,502
candidates won, that tells us, what 
should be, the contents of the final 

163
00:10:51,502 --> 00:10:55,106
position of the optimal alignment. 
If case one was used to fill in this 

164
00:10:55,106 --> 00:10:57,749
entry, we should match, little x sub n 
and little y sub n. 

165
00:10:57,749 --> 00:11:01,257
If case two was used to fill in this 
entry, we should match little x sub n 

166
00:11:01,257 --> 00:11:03,756
with the gap. 
If case three was used, to fill in this 

167
00:11:03,756 --> 00:11:06,159
entry, we should match little y sub n 
with the gap. 

168
00:11:06,159 --> 00:11:08,370
If there was a tie, we get to pick any of 
them. 

169
00:11:08,370 --> 00:11:11,147
Arbitrarily, all of them will lead to 
optimal alignments. 

170
00:11:11,147 --> 00:11:14,520
Then of course, after figuring out what 
to do in this final position. 

171
00:11:14,520 --> 00:11:17,694
We have an induced sub problem involving 
x prime and or y prime. 

172
00:11:17,694 --> 00:11:20,423
That tells us a, a previous entry of the 
table to go to. 

173
00:11:20,423 --> 00:11:23,895
And we just repeat the process. 
We, again, figure out which of the three 

174
00:11:23,895 --> 00:11:27,615
cases was used to fill in this entry. 
That tells us how to fill in the next 

175
00:11:27,615 --> 00:11:31,483
right most position of the alignment. 
And we just keep going until we fall off 

176
00:11:31,483 --> 00:11:39,330
the table. 
So what do you do when you fall off the 

177
00:11:39,330 --> 00:11:41,717
table? 
Well, once one of the indices I or J gets 

178
00:11:41,717 --> 00:11:44,154
all the way down to zero, now you have no 
choice. 

179
00:11:44,154 --> 00:11:48,267
So now one of the strings is empty and 
the other one has some number of symbol. 

180
00:11:48,267 --> 00:11:52,380
So you should just insert the appropriate 
number of gaps to equalize the lengths. 

181
00:11:54,260 --> 00:11:58,625
One thing that's pretty neat is that this 
trace back procedure is efficient. 

182
00:11:58,625 --> 00:12:02,358
In fact, it's way more efficient, in 
general, than the forward pass. 

183
00:12:02,358 --> 00:12:06,839
For the forward pass, you have to fill in 
every single one of the m * n entries. 

184
00:12:06,839 --> 00:12:10,170
But in this trace back procedure, each 
time you track back. 

185
00:12:10,170 --> 00:12:13,214
One of the two indices, at least, will 
get decremented. 

186
00:12:13,214 --> 00:12:17,465
So that says, you're going to complete 
this trace back in O of m + n time with 

187
00:12:17,465 --> 00:12:20,280
an optimal alignment of the original two 
strings.