1
00:00:00,000 --> 00:00:03,893
At the beginning of the course, we talked 
about the sequence alignment problem, 

2
00:00:03,893 --> 00:00:07,188
a problem which is fundamental to modern 
computational genomics. 

3
00:00:07,188 --> 00:00:10,832
And we talked about the need for an 
efficient algorithm for solving that 

4
00:00:10,832 --> 00:00:13,528
problem, for finding the best alignment 
of two strings. 

5
00:00:13,528 --> 00:00:17,371
I'm pleased to report that at this point 
we're well prepared to give such an 

6
00:00:17,371 --> 00:00:19,817
algorithm. 
Indeed, such an efficient solution will 

7
00:00:19,817 --> 00:00:23,511
readily fall out of the dynamic 
programming recipe that we now have quite 

8
00:00:23,511 --> 00:00:28,742
a bit of practice with. 
So let me briefly jog your memory about 

9
00:00:28,742 --> 00:00:32,617
the sequence alignment problem. 
So the goal here is to compute a 

10
00:00:32,617 --> 00:00:37,280
similarity measure between strings, 
a similarity measure defined as the total 

11
00:00:37,280 --> 00:00:42,440
penalty of the best alignment, also known 
as the Needleman-Wunsch score. 

12
00:00:42,440 --> 00:00:46,660
So for example. 
if you've given as input the strings A G, 

13
00:00:46,660 --> 00:00:51,860
G, G C T at A G, G C A, a natural 
candidate alignment would be to stack 

14
00:00:51,860 --> 00:00:57,965
them on top of the other, inserting a gap 
in the shortest string after the two Gs, 

15
00:00:57,965 --> 00:01:04,906
that some sense represent the missing G. 
This is a pretty good alignment that 

16
00:01:04,906 --> 00:01:09,119
suffers from merely two flaws. 
So first of all, we did resort to adding 

17
00:01:09,119 --> 00:01:13,211
a gap in the second string. 
Second of all, there is a mismatch in the 

18
00:01:13,211 --> 00:01:16,680
final column. 
The A and the T get mismatched. 

19
00:01:16,680 --> 00:01:22,070
In general we evaluate the alignment by 
summing up the penalties of all the flaws 

20
00:01:22,070 --> 00:01:26,080
and the sum penalty per gap and the sum 
penalty per mismatch. 

21
00:01:27,680 --> 00:01:31,876
So, a bit more precisely, as input in 
this computational problem, we're given 

22
00:01:31,876 --> 00:01:34,618
two strings. 
I'm going to call them capital X and 

23
00:01:34,618 --> 00:01:37,527
capital Y. 
I'm going to use little x and little y to 

24
00:01:37,527 --> 00:01:40,325
denote the individual characters of these 
strings. 

25
00:01:40,325 --> 00:01:44,689
Let's say the first string capital X has 
linked M and the second string Y has 

26
00:01:44,689 --> 00:01:47,431
linked N. 
In addition to the two input strings we 

27
00:01:47,431 --> 00:01:51,571
assume we're given as input the values 
for the various types of penalties. 

28
00:01:51,571 --> 00:01:55,376
So that we know exactly how much it costs 
each time we insert a gap. 

29
00:01:55,376 --> 00:01:59,796
And for each possible mismatch, we need 
to know exactly what's the cost of that 

30
00:01:59,796 --> 00:02:03,899
mismatch. 
In principle you could be given a penalty 

31
00:02:03,899 --> 00:02:08,444
for matching a letter with itself, but 
typically that's going to be a penalty of 

32
00:02:08,444 --> 00:02:14,354
zero. 
The space of feasible solutions are just 

33
00:02:14,354 --> 00:02:19,007
the ways of inserting gaps into the two 
strings so that the results have equal 

34
00:02:19,007 --> 00:02:22,563
length. 
I should emphasize that you're allowed to 

35
00:02:22,563 --> 00:02:26,432
insert gaps into both of the strings. 
In the example, we only inserted into one 

36
00:02:26,432 --> 00:02:29,308
of the two strings, 
but in general, you might have an input 

37
00:02:29,308 --> 00:02:32,235
where one string is seven characters 
longer than the other, 

38
00:02:32,235 --> 00:02:36,352
and it might turn out that in the optimal 
alignment, the best thing to do is insert 

39
00:02:36,352 --> 00:02:38,832
three gaps at various places in the 
longer string, 

40
00:02:38,832 --> 00:02:41,560
and ten gaps at various places into the 
shorter string. 

41
00:02:41,560 --> 00:02:45,708
And the goal, of course is just to 
compute amongst all of the exponentially 

42
00:02:45,708 --> 00:02:50,078
many alignments, the one that minimizes 
the total penalty, where total penalty is 

43
00:02:50,078 --> 00:02:54,171
just the sum of individual penalties for 
the inserted gaps and the various 

44
00:02:54,171 --> 00:02:58,440
mismatches. 
So let's not be unduly intimidated by how 

45
00:02:58,440 --> 00:03:01,840
fundamental this problem is, 
and let's just apply the dynamic recipe, 

46
00:03:01,840 --> 00:03:04,640
the programing recipe that we have been 
using all along. 

47
00:03:04,640 --> 00:03:08,340
Now remember, the really key insight in 
any dynamic programing solution is 

48
00:03:08,340 --> 00:03:11,190
figuring out what's the right collection 
of sub problems. 

49
00:03:11,190 --> 00:03:15,240
And if you're feeling like your up to the 
black belt level in dynamic programing, 

50
00:03:15,240 --> 00:03:19,240
you might just want to try to guess what 
are the right collection of sub problems 

51
00:03:19,240 --> 00:03:22,390
for sequence alignment? 
But I don't expect you to be able to do 

52
00:03:22,390 --> 00:03:25,240
that at this point. 
And so, as usual we're going to derive 

53
00:03:25,240 --> 00:03:29,316
the correct collection of sub-problems. 
And we're going to do it by reasoning 

54
00:03:29,316 --> 00:03:31,727
about the structure of an optimal 
solution, 

55
00:03:31,727 --> 00:03:36,213
narrowing it down to a small number of 
candidates composed in various ways from 

56
00:03:36,213 --> 00:03:40,939
solutions to smaller sub-problems. 
Once we've figured out the small number 

57
00:03:40,939 --> 00:03:44,463
of possibilities for what the optimal 
solution could look like in terms of 

58
00:03:44,463 --> 00:03:48,174
solutions to smaller sub problems, we'll 
be able to drive a recurrence which in 

59
00:03:48,174 --> 00:03:51,698
effect just does brute force search 
through the small number of candidates. 

60
00:03:51,698 --> 00:03:54,047
And from the recurrence we'll be able to 
back out. 

61
00:03:54,047 --> 00:03:57,618
We'll be able to reverse engineer what 
are the various sub problems that we 

62
00:03:57,618 --> 00:04:00,080
actually care about and that we have to 
solve. 

63
00:04:00,080 --> 00:04:03,641
So let's do a thought experiment. 
What does the optimal solution have to 

64
00:04:03,641 --> 00:04:06,149
look like? 
And again, remember, this is exactly what 

65
00:04:06,149 --> 00:04:09,761
it is that we're trying to compute but 
that's not going to stop us from 

66
00:04:09,761 --> 00:04:12,520
reasoning about it. 
If someone handed to us on a silver 

67
00:04:12,520 --> 00:04:16,060
platter the optimal solution, what would 
it have to look like? 

68
00:04:16,060 --> 00:04:20,522
So consider any old pair of strings, 
capital X and capital Y, and an optimal 

69
00:04:20,522 --> 00:04:24,091
alignment of them. 
Let's visualize this optimal alignment as 

70
00:04:24,091 --> 00:04:26,650
follows. 
Let's write down the string X plus 

71
00:04:26,650 --> 00:04:31,112
whatever gaps get inserted into it on 
top, and right beneath it we'll write 

72
00:04:31,112 --> 00:04:34,563
down the string Y, with whatever gaps are 
inserted into it. 

73
00:04:34,563 --> 00:04:37,300
These two things have exactly the same 
length. 

74
00:04:39,180 --> 00:04:43,436
So to figure out the various cases of the 
structure for this optimal solution, 

75
00:04:43,436 --> 00:04:46,874
let's reason by analogy with the problems 
we've already solved. 

76
00:04:46,874 --> 00:04:50,858
So back when we were looking at 
independent sets of line graphs, our case 

77
00:04:50,858 --> 00:04:54,951
analysis was well, either the final 
vertex, the rightmost vertex of the path, 

78
00:04:54,951 --> 00:04:59,199
is in the optimal solution or it's not. 
In the knapsack problem, we said well 

79
00:04:59,199 --> 00:05:02,637
either the last item is in the optimal 
solution or it's not. 

80
00:05:02,637 --> 00:05:07,048
So we always looked at sort of the last 
part of the optimal solution, in some 

81
00:05:07,048 --> 00:05:10,943
sense the rightmost position. 
And happily, staring at this alignment, 

82
00:05:10,943 --> 00:05:15,240
we see we can once again focus just on 
the action in the right most, in the 

83
00:05:15,240 --> 00:05:19,082
final position. 
So now I have a question for you. 

84
00:05:19,082 --> 00:05:22,052
So in the independent set problem, there 
were two cases. 

85
00:05:22,052 --> 00:05:25,454
The last vertex was either in the optimal 
solution or it's not. 

86
00:05:25,454 --> 00:05:29,721
In the knapsack problem, there were also 
two cases, the final item was either in 

87
00:05:29,721 --> 00:05:33,285
the optimal solution or it's not. 
So my question for you is, in the 

88
00:05:33,285 --> 00:05:37,713
sequence alignment problem: when we focus 
on what's going on in the final position 

89
00:05:37,713 --> 00:05:41,440
of the optimal alignment, how many 
relevant cases do we have to study? 

90
00:05:48,500 --> 00:05:53,334
So the answer I'm looking for is B, three 
relevant possiblities for the contents of 

91
00:05:53,334 --> 00:05:56,771
the final position. 
Let me explain my reasoning, let's start 

92
00:05:56,771 --> 00:05:59,275
with the upper parts of the final 
position. 

93
00:05:59,275 --> 00:06:04,284
observe that if that's a character of the 
string capital X, it can only be the very 

94
00:06:04,284 --> 00:06:06,906
last character. 
It can only be little X sub N. 

95
00:06:06,906 --> 00:06:09,527
That's because that's where this string 
ends. 

96
00:06:09,527 --> 00:06:14,187
Now we don't know that little X sub N is 
in the final position, there might be a 

97
00:06:14,187 --> 00:06:16,551
gap. 
Similarly, in the bottom part of this 

98
00:06:16,551 --> 00:06:18,912
final position, there's two 
possibilities. 

99
00:06:18,912 --> 00:06:21,908
There's a gap, 
or, if it's a character of y, it has to 

100
00:06:21,908 --> 00:06:26,055
be the final character, little y, sub n. 
So that one seems to suggest four 

101
00:06:26,055 --> 00:06:28,244
possibilities, 
two options for the top, 

102
00:06:28,244 --> 00:06:32,045
two options for the bottom. 
But the hint of talking about relevant 

103
00:06:32,045 --> 00:06:36,365
possibilities is that it's totally 
pointless to have two, a gap in both the 

104
00:06:36,365 --> 00:06:37,700
top and the bottom. 
Why? 

105
00:06:37,700 --> 00:06:41,845
Well the penalty for gaps is 
non-negative, so if we just deleted both 

106
00:06:41,845 --> 00:06:45,389
of those gaps we'd get an even better 
alignment of X and Y. 

107
00:06:45,389 --> 00:06:50,015
And in studying an optimal solution, we 
can therefore assume we never have two 

108
00:06:50,015 --> 00:06:54,180
gaps in a common position. 
So that leaves exactly three cases. 

109
00:06:54,180 --> 00:06:58,730
It could be there's no gaps at all, that 
in fact this alignment matches the 

110
00:06:58,730 --> 00:07:01,400
character, little x of m, with little y 
sub n. 

111
00:07:03,820 --> 00:07:08,620
Or it could match the final character of 
capital X, with a gap. 

112
00:07:10,340 --> 00:07:14,640
Or it could match the final character of 
capital Y with a gap. 

113
00:07:16,220 --> 00:07:20,596
So the hope behind this case analysis is 
that we're going to be able to boil down 

114
00:07:20,596 --> 00:07:24,474
the possibilities for the optimal 
solution to merely three candidates, 

115
00:07:24,474 --> 00:07:28,574
one candidate for each of the three 
possibilities for the contents of the 

116
00:07:28,574 --> 00:07:31,731
final position. 
That would be analogous to what we did in 

117
00:07:31,731 --> 00:07:35,886
both the independent set and knapsack 
problems, where we boiled the optimal 

118
00:07:35,886 --> 00:07:39,709
solution down to just one of two 
candidates, corresponding to whether 

119
00:07:39,709 --> 00:07:43,698
either the final vertex or the final 
item, as the case may be, was in the 

120
00:07:43,698 --> 00:07:47,637
optimal solution. 
Another way of thinking about this is 

121
00:07:47,637 --> 00:07:51,681
we'd like to make precise the idea that 
if we just knew what was going on in the 

122
00:07:51,681 --> 00:07:55,526
final position, if only a little birdy 
would tell us which of the three cases 

123
00:07:55,526 --> 00:07:59,520
that we're in, then we'd be done by just 
solving the some smaller sub-problem 

124
00:07:59,520 --> 00:08:03,267
recursively. 
So let's now state for each of the three 

125
00:08:03,267 --> 00:08:05,815
possible scenarios for the final 
position, 

126
00:08:05,815 --> 00:08:10,546
what is the corresponding candidate for 
the optimal solution, the way in which, 

127
00:08:10,546 --> 00:08:14,914
it must necessary be composed with an 
optimal solution to a smaller sub 

128
00:08:14,914 --> 00:08:18,877
problem. 
So who are going to be the protagonists 

129
00:08:18,877 --> 00:08:23,059
in our smaller sub-problem? 
Well, the smaller sub-problem's going to 

130
00:08:23,059 --> 00:08:26,680
involve everything except the stuff in 
the final position. 

131
00:08:26,680 --> 00:08:31,299
So it's going to involve the string's X 
and Y, possibly with one character 

132
00:08:31,299 --> 00:08:34,483
remaining. 
So let's let x prime be x, with its final 

133
00:08:34,483 --> 00:08:38,229
character peeled off. 
Y prime's going to be y, with its final 

134
00:08:38,229 --> 00:08:44,023
character peeled off. 
So let me just remind you of how I 

135
00:08:44,023 --> 00:08:47,614
numbered the three cases. 
So case one is when the final position 

136
00:08:47,614 --> 00:08:52,159
contains the final characters of both of 
the two strings, that is, when there's no 

137
00:08:52,159 --> 00:08:55,421
gaps. 
Case two is when x, little x of n gets 

138
00:08:55,421 --> 00:09:00,562
matched with the gap and case three is 
when little y of n gets matched with the 

139
00:09:00,562 --> 00:09:03,846
gap. 
Alright, so let's suppose that case one 

140
00:09:03,846 --> 00:09:06,415
holds. 
This means that the contents of the final 

141
00:09:06,415 --> 00:09:10,428
position, includes both of the characters 
little x sub m and little y sub n. 

142
00:09:10,428 --> 00:09:14,227
So now what we're going to do is we want 
to look at a smaller sub problem. 

143
00:09:14,227 --> 00:09:18,348
And we want to look at the sub problem 
induced by the contents of all of the 

144
00:09:18,348 --> 00:09:21,344
rest of the positions. 
We're going to call that the induced 

145
00:09:21,344 --> 00:09:25,229
alignment. 
Since we started with an alignment, two 

146
00:09:25,229 --> 00:09:29,461
things that had to equal length, and we 
peeled off the final position of both, we 

147
00:09:29,461 --> 00:09:33,478
have another thing that has equal link so 
we're justified in calling it an 

148
00:09:33,478 --> 00:09:35,674
alignment. 
Now what is it an alignment of? 

149
00:09:35,674 --> 00:09:39,906
Well if we're in case one, that means 
what's missing from the induced alignment 

150
00:09:39,906 --> 00:09:43,067
are the final characters. 
little X of M and little Y's of N, 

151
00:09:43,067 --> 00:09:47,138
which means the induced alignment is a 
bona fide alignment of X prime and Y 

152
00:09:47,138 --> 00:09:51,966
prime. 
And certainly, what we're hoping is true, 

153
00:09:51,966 --> 00:09:57,245
is that the induced alignment is in fact, 
an optimal alignment of these smaller 

154
00:09:57,245 --> 00:10:01,816
strings x prime and y prime. 
This would say that when we're in case 

155
00:10:01,816 --> 00:10:06,183
one, the optimal solution to the original 
problem is built up in a simple way from 

156
00:10:06,183 --> 00:10:09,440
an optimal solution to a smaller sub 
problem. 

157
00:10:09,440 --> 00:10:13,406
We're of course hoping that something 
analogous happens in cases two and three. 

158
00:10:13,406 --> 00:10:17,373
The only change is going to be that the 
protagonists of the sub-problem will be a 

159
00:10:17,373 --> 00:10:20,386
little bit different. 
In case two, the thing which is missing 

160
00:10:20,386 --> 00:10:23,148
from the induced alignment is the final 
character of x. 

161
00:10:23,148 --> 00:10:26,110
So, it's going to be the induced 
alignment of x prime and y. 

162
00:10:26,110 --> 00:10:29,977
Similarly, in case three, the induced 
alignment is going to be an alignment of 

163
00:10:29,977 --> 00:10:32,852
x and y prime. 
So, this is an insertion, this is a 

164
00:10:32,852 --> 00:10:36,607
claim, it's not completely obvious, 
though the proof isn't hard, as I will 

165
00:10:36,607 --> 00:10:39,996
show you on the next slide. 
But assuming for the moment that this 

166
00:10:39,996 --> 00:10:42,812
assertion is true, it fulfills the hope 
we had earlier. 

167
00:10:42,812 --> 00:10:47,088
It says that indeed, the optimal solution 
can only be one of three candidates, that 

168
00:10:47,088 --> 00:10:50,894
one for each of the possibilities for the 
contents of the final position. 

169
00:10:50,894 --> 00:10:55,014
Alternatively it says, that if we only 
knew which of the three cases we were in, 

170
00:10:55,014 --> 00:10:58,820
we'd be done, we can recurse, we could 
look up a solution to a smaller sub 

171
00:10:58,820 --> 00:11:02,783
problem and we could extend it in an easy 
way to a optimal solution for the 

172
00:11:02,783 --> 00:11:05,908
original problem. 
So lets now move onto the proof of this 

173
00:11:05,908 --> 00:11:08,706
assertion. 
Why is it true that an optimal solution 

174
00:11:08,706 --> 00:11:13,070
must be built up from an optimal solution 
to the relevant smaller sub-problem? 

175
00:11:13,070 --> 00:11:17,379
Well all of the cases are pretty much the 
same argument so I'm just going to do 

176
00:11:17,379 --> 00:11:20,065
case one, the other cases are basically 
the same. 

177
00:11:20,065 --> 00:11:24,359
I invite you to fill in the details. 
So it's going to be the same type of 

178
00:11:24,359 --> 00:11:28,113
simple proof by contradiction that we 
used earlier, when reasoning about the 

179
00:11:28,113 --> 00:11:31,867
structure of optimal solutions for the 
independent set in knapsack problems. 

180
00:11:31,867 --> 00:11:35,967
We're going to assume the contrary, we're 
going to assume that the induced solution 

181
00:11:35,967 --> 00:11:40,018
to the smaller subproblem is not optimal, 
and from the fact that there is a better 

182
00:11:40,018 --> 00:11:43,476
solution for the subproblem, we will 
extract a better solution for the 

183
00:11:43,476 --> 00:11:47,279
original problem, contradicting the 
purported optimality of the solution that 

184
00:11:47,279 --> 00:11:50,271
we started with. 
So when we're dealing with case one, the 

185
00:11:50,271 --> 00:11:54,618
induced alignment is of the strings X 
prime and Y prime, X and Y with the final 

186
00:11:54,618 --> 00:11:57,700
character peeled off. 
And so for the contradiction, let's 

187
00:11:57,700 --> 00:12:01,827
assume that this induced alignment, it 
has some penalty, say capital P. Let's 

188
00:12:01,827 --> 00:12:05,625
assume it's not actually an optimal 
alignment of X prime and Y prime. 

189
00:12:05,625 --> 00:12:09,752
That is, suppose if we started from 
scratch we'd come up with some superior 

190
00:12:09,752 --> 00:12:13,934
alignment of X prime and Y prime, with 
total penalty P star, strictly smaller 

191
00:12:13,934 --> 00:12:18,139
than P. 
But if that were the case, it would be a 

192
00:12:18,139 --> 00:12:23,116
simple matter to lift this purportedly 
better alignment of x prime and y prime 

193
00:12:23,116 --> 00:12:26,139
to an alignment of the original strings x 
and y. 

194
00:12:26,139 --> 00:12:30,738
Namely we just reuse the exact same 
alignment of x prime and y prime, and 

195
00:12:30,738 --> 00:12:34,140
then in the final position, we just match 
x m with y n. 

196
00:12:39,540 --> 00:12:43,998
So what is the total penalty of this 
extended alignment of all of x and y? 

197
00:12:43,998 --> 00:12:48,637
Well, it's just the penalty incurred in 
everything but the final position, and 

198
00:12:48,637 --> 00:12:53,336
that's just the old penalty p star, plus 
the new penalty incurred in the final 

199
00:12:53,336 --> 00:12:56,409
position. 
And that's just the penalty corresponding 

200
00:12:56,409 --> 00:12:59,660
to the match of the characters x m and y 
n. 

201
00:12:59,660 --> 00:13:04,982
P star being less than p, 
of course p star plus alpha x m y n is 

202
00:13:04,982 --> 00:13:10,671
less than p plus alpha x m y n. 
But this second term is simply the total 

203
00:13:10,671 --> 00:13:14,112
penalty incurred by our original 
alignment of X and Y, right? 

204
00:13:14,112 --> 00:13:18,527
That alignment incurred penalty capital 
P, just in the induced alignment of X 

205
00:13:18,527 --> 00:13:23,172
prime Y prime, and it's total penalty was 
just that plus the penalty in the final 

206
00:13:23,172 --> 00:13:28,502
position, which is this alpha xn, yn. 
But that furnishes the contradiction that 

207
00:13:28,502 --> 00:13:32,682
we suppose that we started with an 
optimal alignment of X and Y, 

208
00:13:32,682 --> 00:13:36,926
yet here is a better one. 
So with that contradiction, it completes 

209
00:13:36,926 --> 00:13:39,800
the proof of the optimal substructure 
claim.