At the beginning of the course, we talked about the sequence alignment problem, a problem which is fundamental to modern computational genomics. And we talked about the need for an efficient algorithm for solving that problem, for finding the best alignment of two strings. I'm pleased to report that at this point we're well prepared to give such an algorithm. Indeed, an efficient solution will readily fall out of the dynamic programming recipe that we now have quite a bit of practice with.

So let me briefly jog your memory about the sequence alignment problem. The goal here is to compute a similarity measure between strings, defined as the total penalty of the best alignment, also known as the Needleman-Wunsch score. For example, if you're given as input the strings AGGGCT and AGGCA, a natural candidate alignment would be to stack them one on top of the other, inserting a gap in the shorter string after its two Gs, in some sense representing the missing third G. This is a pretty good alignment that suffers from merely two flaws. First of all, we did resort to inserting a gap in the second string. Second of all, there is a mismatch in the final column: the T and the A get mismatched. In general, we evaluate an alignment by summing up the penalties of all of its flaws: there's some penalty per gap and some penalty per mismatch.

A bit more precisely, as input in this computational problem we're given two strings. I'm going to call them capital X and capital Y, and I'm going to use little x and little y to denote the individual characters of these strings. Let's say the first string, capital X, has length m, and the second string, capital Y, has length n. In addition to the two input strings, we assume we're given as input the values of the various penalties, so that we know exactly how much it costs each time we insert a gap, and, for each possible mismatch, exactly what that mismatch costs. In principle you could be given a penalty for matching a letter with itself, but typically that's going to be a penalty of zero.

The space of feasible solutions is just the ways of inserting gaps into the two strings so that the results have equal length. I should emphasize that you're allowed to insert gaps into both of the strings. In the example, we only inserted a gap into one of the two strings, but in general you might have an input where one string is seven characters longer than the other, and it might turn out that in the optimal alignment the best thing to do is to insert three gaps at various places in the longer string and ten gaps at various places in the shorter string. And the goal, of course, is just to compute, amongst all of the exponentially many alignments, the one that minimizes the total penalty, where the total penalty is just the sum of the individual penalties for the inserted gaps and the various mismatches.

So let's not be unduly intimidated by how fundamental this problem is, and let's just apply the dynamic programming recipe that we have been using all along. Now remember, the really key insight in any dynamic programming solution is figuring out the right collection of subproblems. And if you're feeling like you're up to the black-belt level in dynamic programming, you might just want to try to guess the right collection of subproblems for sequence alignment. But I don't expect you to be able to do that at this point, and so, as usual, we're going to derive the correct collection of subproblems.
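Before we do, here's a minimal sketch in Python that makes the objective concrete by scoring a given candidate alignment. The specific gap and mismatch penalty values below are illustrative assumptions on my part; the problem statement leaves them as inputs.

    # Score a candidate alignment: two equal-length strings where '-'
    # marks an inserted gap. Penalty values are assumed for illustration.
    GAP_PENALTY = 1       # assumed cost per inserted gap
    MISMATCH_PENALTY = 2  # assumed cost per mismatched pair of characters

    def alignment_penalty(top: str, bottom: str) -> int:
        assert len(top) == len(bottom), "an alignment has rows of equal length"
        total = 0
        for a, b in zip(top, bottom):
            if a == '-' or b == '-':
                total += GAP_PENALTY       # a gap in either row
            elif a != b:
                total += MISMATCH_PENALTY  # two different characters matched
        return total

    # The example alignment of AGGGCT and AGGCA: one gap plus one mismatch.
    print(alignment_penalty("AGGGCT", "AGG-CA"))  # prints 3

Under these assumed penalties, the example alignment costs 1 + 2 = 3: one gap and one mismatch.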
We're going to derive them by reasoning about the structure of an optimal solution, narrowing it down to a small number of candidates, each composed in some way from solutions to smaller subproblems. Once we've figured out the small number of possibilities for what the optimal solution could look like in terms of solutions to smaller subproblems, we'll be able to derive a recurrence which, in effect, just does brute-force search through that small number of candidates. And from the recurrence we'll be able to reverse engineer the various subproblems that we actually care about and have to solve.

So let's do a thought experiment: what does the optimal solution have to look like? And again, remember, this is exactly the thing we're trying to compute, but that's not going to stop us from reasoning about it. If someone handed us the optimal solution on a silver platter, what would it have to look like? So consider any old pair of strings, capital X and capital Y, and an optimal alignment of them. Let's visualize this optimal alignment as follows: write down the string X, plus whatever gaps get inserted into it, on top, and right beneath it write down the string Y, with whatever gaps are inserted into it. These two rows have exactly the same length.

To figure out the various cases for the structure of this optimal solution, let's reason by analogy with the problems we've already solved. Back when we were looking at independent sets of path graphs, our case analysis was: either the final vertex, the rightmost vertex of the path, is in the optimal solution or it's not. In the knapsack problem, we said: either the last item is in the optimal solution or it's not. So we always looked at the last part of the optimal solution, in some sense the rightmost position. And happily, staring at this alignment, we see we can once again focus just on the action in the final position.

So now I have a question for you. In the independent set problem, there were two cases: the last vertex was either in the optimal solution or it was not. In the knapsack problem, there were also two cases: the final item was either in the optimal solution or it was not. So my question for you is, in the sequence alignment problem, when we focus on what's going on in the final position of the optimal alignment, how many relevant cases do we have to study?

The answer I'm looking for is B: three relevant possibilities for the contents of the final position. Let me explain my reasoning, starting with the upper part of the final position. Observe that if it contains a character of the string capital X, that can only be the very last character, little x sub m, because that's where the string ends. Now, we don't know that little x sub m is in the final position; there might be a gap there instead. Similarly, in the bottom part of the final position there are two possibilities: either there's a gap or, if it's a character of Y, it has to be the final character, little y sub n. So this seems to suggest four possibilities: two options for the top, two options for the bottom. But the point of talking about relevant possibilities is that it's totally pointless to have a gap in both the top and the bottom. Why? Well, the penalty for gaps is non-negative, so if we just deleted both of those gaps, we'd get an alignment of X and Y that's at least as good.
So in studying an optimal solution, we can therefore assume it never has two gaps in a common position. That leaves exactly three cases. It could be that there are no gaps at all in the final position, that is, the alignment matches the character little x sub m with little y sub n. Or it could match the final character of capital X with a gap. Or it could match the final character of capital Y with a gap.

The hope behind this case analysis is that we're going to be able to boil down the possibilities for the optimal solution to merely three candidates, one candidate for each of the three possibilities for the contents of the final position. That would be analogous to what we did in both the independent set and knapsack problems, where we boiled the optimal solution down to just one of two candidates, corresponding to whether the final vertex or the final item, as the case may be, was in the optimal solution. Another way of thinking about this: we'd like to make precise the idea that if we just knew what was going on in the final position, if only a little birdie would tell us which of the three cases we're in, then we'd be done, just by solving some smaller subproblem recursively.

So let's now state, for each of the three possible scenarios for the final position, the corresponding candidate for the optimal solution, the way in which it must necessarily be composed from an optimal solution to a smaller subproblem. Who are going to be the protagonists of our smaller subproblem? Well, the smaller subproblem is going to involve everything except the stuff in the final position, so it's going to involve the strings X and Y, possibly with the final character removed. Let's let X prime be X with its final character peeled off, and Y prime be Y with its final character peeled off. And let me just remind you of how I numbered the three cases: case one is when the final position contains the final characters of both of the two strings, that is, when there are no gaps; case two is when little x sub m gets matched with a gap; and case three is when little y sub n gets matched with a gap.

All right, so let's suppose that case one holds. This means that the contents of the final position include both of the characters little x sub m and little y sub n. Now what we're going to do is look at a smaller subproblem, the subproblem induced by the contents of all of the rest of the positions. We're going to call that the induced alignment. Since we started with an alignment, two rows of equal length, and we peeled off the final position of both, we have another pair of rows of equal length, so we're justified in calling it an alignment. Now, what is it an alignment of? Well, if we're in case one, what's missing from the induced alignment are the final characters, little x sub m and little y sub n, which means the induced alignment is a bona fide alignment of X prime and Y prime. And what we're hoping is true is that the induced alignment is, in fact, an optimal alignment of these smaller strings X prime and Y prime. This would say that when we're in case one, the optimal solution to the original problem is built up in a simple way from an optimal solution to a smaller subproblem. We're of course hoping that something analogous happens in cases two and three; the only change is going to be that the protagonists of the subproblem will be a little bit different.
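Before turning to those cases, here is a tiny sketch, under the same assumptions as before, making the induced alignment from case one concrete: peeling off the final column of an alignment leaves another alignment.

    def induced_alignment(top: str, bottom: str):
        # Drop the final column of an alignment; the remaining rows
        # still have equal length, so the result is again an alignment.
        return top[:-1], bottom[:-1]

    # Case one on the running example: peeling the final column of
    # AGGGCT / AGG-CA leaves an alignment of AGGGC and AGGC.
    print(induced_alignment("AGGGCT", "AGG-CA"))  # ('AGGGC', 'AGG-C')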
In case two, the thing which is missing from the induced alignment is the final character of X, so it's going to be an alignment of X prime and Y. Similarly, in case three, the induced alignment is going to be an alignment of X and Y prime. Now, this is an assertion, a claim; it's not completely obvious, though the proof isn't hard, as I will show you on the next slide. But assuming for the moment that this assertion is true, it fulfills the hope we had earlier. It says that, indeed, the optimal solution can only be one of three candidates, one for each of the possibilities for the contents of the final position. Alternatively, it says that if we only knew which of the three cases we were in, we'd be done: we could recurse, we could look up a solution to a smaller subproblem, and we could extend it in an easy way to an optimal solution for the original problem.

So let's now move on to the proof of this assertion. Why is it true that an optimal solution must be built up from an optimal solution to the relevant smaller subproblem? Well, all of the cases are pretty much the same argument, so I'm just going to do case one; the other cases are basically the same, and I invite you to fill in the details. It's going to be the same type of simple proof by contradiction that we used earlier, when reasoning about the structure of optimal solutions for the independent set and knapsack problems. We're going to assume the contrary, that the induced solution to the smaller subproblem is not optimal, and from the fact that there is a better solution for the subproblem, we will extract a better solution for the original problem, contradicting the purported optimality of the solution that we started with.

So, when we're dealing with case one, the induced alignment is of the strings X prime and Y prime, that is, X and Y with their final characters peeled off. Say this induced alignment has some total penalty, capital P, and suppose for contradiction that it's not actually an optimal alignment of X prime and Y prime. That is, suppose that if we started from scratch, we'd come up with some superior alignment of X prime and Y prime, with total penalty P star strictly smaller than P. But if that were the case, it would be a simple matter to lift this purportedly better alignment of X prime and Y prime to an alignment of the original strings X and Y: namely, we just reuse the exact same alignment of X prime and Y prime, and then in the final position we match little x sub m with little y sub n.

So what is the total penalty of this extended alignment of all of X and Y? Well, it's just the penalty incurred in everything but the final position, which is the old penalty P star, plus the new penalty incurred in the final position, which is just the penalty alpha(x_m, y_n) for matching the characters little x sub m and little y sub n. Since P star is less than P, of course P star plus alpha(x_m, y_n) is less than P plus alpha(x_m, y_n). But this second quantity is simply the total penalty incurred by our original alignment of X and Y: that alignment incurred penalty capital P in the induced alignment of X prime and Y prime, and its total penalty was just that plus the penalty in the final position, which is this same alpha(x_m, y_n). And that furnishes the contradiction: we supposed we started with an optimal alignment of X and Y, yet here is a better one. That contradiction completes the proof of the optimal substructure claim.
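Putting the three cases together, here is a minimal hedged sketch of the recurrence that this optimal substructure suggests, anticipating the recurrence we'll develop properly in what follows. The uniform gap penalty and the mismatch function alpha are assumed inputs, as in the earlier sketch; memoization stands in for the table of subproblems.

    from functools import lru_cache

    def nw_score(X: str, Y: str, gap: int, alpha) -> int:
        # P(i, j): minimum total penalty of an alignment of the prefixes
        # X[:i] and Y[:j]. The three candidates mirror cases one to three.
        @lru_cache(maxsize=None)
        def P(i: int, j: int) -> int:
            if i == 0:
                return j * gap  # only gaps remain opposite Y's prefix
            if j == 0:
                return i * gap  # only gaps remain opposite X's prefix
            return min(
                alpha(X[i - 1], Y[j - 1]) + P(i - 1, j - 1),  # case one
                gap + P(i - 1, j),                            # case two
                gap + P(i, j - 1),                            # case three
            )
        return P(len(X), len(Y))

    # Running example with the same assumed penalties as before:
    print(nw_score("AGGGCT", "AGGCA", gap=1,
                   alpha=lambda a, b: 0 if a == b else 2))  # prints 3

With memoization, this touches each pair of prefixes of X and Y at most once, which is a preview of the collection of subproblems we'll read off from the recurrence.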