At the beginning of the course, we talked about the sequence alignment problem, a problem which is fundamental to modern computational genomics. And we talked about the need for an efficient algorithm for solving that problem, for finding the best alignment of two strings. I'm pleased to report that at this point we're well prepared to give such an algorithm. Indeed, an efficient solution will readily fall out of the dynamic programming recipe that we now have quite a bit of practice with.

So let me briefly jog your memory about the sequence alignment problem. The goal here is to compute a similarity measure between strings, defined as the total penalty of the best alignment, also known as the Needleman-Wunsch score. For example, if you're given as input the strings AGGGCT and AGGCA, a natural candidate alignment would be to stack them one on top of the other, inserting a gap in the shorter string after its two Gs, in some sense representing the missing third G. This is a pretty good alignment that suffers from merely two flaws. First of all, we did resort to inserting a gap in the second string. Second of all, there is a mismatch in the final column: the T and the A get mismatched. In general, we evaluate an alignment by summing up the penalties of all of its flaws: there's some penalty per gap and some penalty per mismatch.

A bit more precisely, as input in this computational problem we're given two strings. I'm going to call them capital X and capital Y, and I'm going to use little x and little y to denote the individual characters of these strings. Let's say the first string, capital X, has length m, and the second string, capital Y, has length n. In addition to the two input strings, we assume we're given as input the values of the various penalties, so that we know exactly how much it costs each time we insert a gap, and, for each possible mismatch, exactly what that mismatch costs. In principle you could be given a penalty for matching a letter with itself, but typically that's going to be a penalty of zero.

The space of feasible solutions is just the ways of inserting gaps into the two strings so that the results have equal length. I should emphasize that you're allowed to insert gaps into both of the strings. In the example, we only inserted a gap into one of the two strings, but in general you might have an input where one string is seven characters longer than the other, and it might turn out that in the optimal alignment the best thing to do is to insert three gaps at various places in the longer string and ten gaps at various places in the shorter string. And the goal, of course, is just to compute, amongst all of the exponentially many alignments, the one that minimizes the total penalty, where the total penalty is just the sum of the individual penalties for the inserted gaps and the various mismatches.

So let's not be unduly intimidated by how fundamental this problem is, and let's just apply the dynamic programming recipe that we have been using all along. Now remember, the really key insight in any dynamic programming solution is figuring out the right collection of subproblems. And if you're feeling like you're up to the black-belt level in dynamic programming, you might just want to try to guess the right collection of subproblems for sequence alignment. But I don't expect you to be able to do that at this point, and so, as usual, we're going to derive the correct collection of subproblems.
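Before we do, here's a minimal sketch in Python that makes the objective concrete by scoring a given candidate alignment. The specific gap and mismatch penalty values below are illustrative assumptions on my part; the problem statement leaves them as inputs.

    # Score a candidate alignment: two equal-length strings where '-'
    # marks an inserted gap. Penalty values are assumed for illustration.
    GAP_PENALTY = 1       # assumed cost per inserted gap
    MISMATCH_PENALTY = 2  # assumed cost per mismatched pair of characters

    def alignment_penalty(top: str, bottom: str) -> int:
        assert len(top) == len(bottom), "an alignment has rows of equal length"
        total = 0
        for a, b in zip(top, bottom):
            if a == '-' or b == '-':
                total += GAP_PENALTY       # a gap in either row
            elif a != b:
                total += MISMATCH_PENALTY  # two different characters matched
        return total

    # The example alignment of AGGGCT and AGGCA: one gap plus one mismatch.
    print(alignment_penalty("AGGGCT", "AGG-CA"))  # prints 3

Under these assumed penalties, the example alignment costs 1 + 2 = 3: one gap and one mismatch.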
We're going to derive them by reasoning about the structure of an optimal solution, narrowing it down to a small number of candidates, each composed in some way from solutions to smaller subproblems. Once we've figured out the small number of possibilities for what the optimal solution could look like in terms of solutions to smaller subproblems, we'll be able to derive a recurrence which, in effect, just does brute-force search through that small number of candidates. And from the recurrence we'll be able to reverse engineer the various subproblems that we actually care about and have to solve.

So let's do a thought experiment: what does the optimal solution have to look like? And again, remember, this is exactly the thing we're trying to compute, but that's not going to stop us from reasoning about it. If someone handed us the optimal solution on a silver platter, what would it have to look like? So consider any old pair of strings, capital X and capital Y, and an optimal alignment of them. Let's visualize this optimal alignment as follows: write down the string X, plus whatever gaps get inserted into it, on top, and right beneath it write down the string Y, with whatever gaps are inserted into it. These two rows have exactly the same length.

To figure out the various cases for the structure of this optimal solution, let's reason by analogy with the problems we've already solved. Back when we were looking at independent sets of path graphs, our case analysis was: either the final vertex, the rightmost vertex of the path, is in the optimal solution or it's not. In the knapsack problem, we said: either the last item is in the optimal solution or it's not. So we always looked at the last part of the optimal solution, in some sense the rightmost position. And happily, staring at this alignment, we see we can once again focus just on the action in the final position.

So now I have a question for you. In the independent set problem, there were two cases: the last vertex was either in the optimal solution or it was not. In the knapsack problem, there were also two cases: the final item was either in the optimal solution or it was not. So my question for you is, in the sequence alignment problem, when we focus on what's going on in the final position of the optimal alignment, how many relevant cases do we have to study?

The answer I'm looking for is B: three relevant possibilities for the contents of the final position. Let me explain my reasoning, starting with the upper part of the final position. Observe that if it contains a character of the string capital X, that can only be the very last character, little x sub m, because that's where the string ends. Now, we don't know that little x sub m is in the final position; there might be a gap there instead. Similarly, in the bottom part of the final position there are two possibilities: either there's a gap or, if it's a character of Y, it has to be the final character, little y sub n. So this seems to suggest four possibilities: two options for the top, two options for the bottom. But the point of talking about relevant possibilities is that it's totally pointless to have a gap in both the top and the bottom. Why? Well, the penalty for gaps is non-negative, so if we just deleted both of those gaps, we'd get an alignment of X and Y that's at least as good.
So in studying an optimal solution, we can therefore assume it never has two gaps in a common position. That leaves exactly three cases. It could be that there are no gaps at all in the final position, that is, the alignment matches the character little x sub m with little y sub n. Or it could match the final character of capital X with a gap. Or it could match the final character of capital Y with a gap.

The hope behind this case analysis is that we're going to be able to boil down the possibilities for the optimal solution to merely three candidates, one candidate for each of the three possibilities for the contents of the final position. That would be analogous to what we did in both the independent set and knapsack problems, where we boiled the optimal solution down to just one of two candidates, corresponding to whether the final vertex or the final item, as the case may be, was in the optimal solution. Another way of thinking about this: we'd like to make precise the idea that if we just knew what was going on in the final position, if only a little birdie would tell us which of the three cases we're in, then we'd be done, just by solving some smaller subproblem recursively.

So let's now state, for each of the three possible scenarios for the final position, the corresponding candidate for the optimal solution, the way in which it must necessarily be composed from an optimal solution to a smaller subproblem. Who are going to be the protagonists of our smaller subproblem? Well, the smaller subproblem is going to involve everything except the stuff in the final position, so it's going to involve the strings X and Y, possibly with the final character removed. Let's let X prime be X with its final character peeled off, and Y prime be Y with its final character peeled off. And let me just remind you of how I numbered the three cases: case one is when the final position contains the final characters of both of the two strings, that is, when there are no gaps; case two is when little x sub m gets matched with a gap; and case three is when little y sub n gets matched with a gap.

All right, so let's suppose that case one holds. This means that the contents of the final position include both of the characters little x sub m and little y sub n. Now what we're going to do is look at a smaller subproblem, the subproblem induced by the contents of all of the rest of the positions. We're going to call that the induced alignment. Since we started with an alignment, two rows of equal length, and we peeled off the final position of both, we have another pair of rows of equal length, so we're justified in calling it an alignment. Now, what is it an alignment of? Well, if we're in case one, what's missing from the induced alignment are the final characters, little x sub m and little y sub n, which means the induced alignment is a bona fide alignment of X prime and Y prime. And what we're hoping is true is that the induced alignment is, in fact, an optimal alignment of these smaller strings X prime and Y prime. This would say that when we're in case one, the optimal solution to the original problem is built up in a simple way from an optimal solution to a smaller subproblem. We're of course hoping that something analogous happens in cases two and three; the only change is going to be that the protagonists of the subproblem will be a little bit different.
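Before turning to those cases, here is a tiny sketch, under the same assumptions as before, making the induced alignment from case one concrete: peeling off the final column of an alignment leaves another alignment.

    def induced_alignment(top: str, bottom: str):
        # Drop the final column of an alignment; the remaining rows
        # still have equal length, so the result is again an alignment.
        return top[:-1], bottom[:-1]

    # Case one on the running example: peeling the final column of
    # AGGGCT / AGG-CA leaves an alignment of AGGGC and AGGC.
    print(induced_alignment("AGGGCT", "AGG-CA"))  # ('AGGGC', 'AGG-C')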
In case two, the thing which is missing from the induced alignment is the final character of X, so it's going to be an alignment of X prime and Y. Similarly, in case three, the induced alignment is going to be an alignment of X and Y prime. Now, this is an assertion, a claim; it's not completely obvious, though the proof isn't hard, as I will show you on the next slide. But assuming for the moment that this assertion is true, it fulfills the hope we had earlier. It says that, indeed, the optimal solution can only be one of three candidates, one for each of the possibilities for the contents of the final position. Alternatively, it says that if we only knew which of the three cases we were in, we'd be done: we could recurse, we could look up a solution to a smaller subproblem, and we could extend it in an easy way to an optimal solution for the original problem.

So let's now move on to the proof of this assertion. Why is it true that an optimal solution must be built up from an optimal solution to the relevant smaller subproblem? Well, all of the cases are pretty much the same argument, so I'm just going to do case one; the other cases are basically the same, and I invite you to fill in the details. It's going to be the same type of simple proof by contradiction that we used earlier, when reasoning about the structure of optimal solutions for the independent set and knapsack problems. We're going to assume the contrary, that the induced solution to the smaller subproblem is not optimal, and from the fact that there is a better solution for the subproblem, we will extract a better solution for the original problem, contradicting the purported optimality of the solution that we started with.

So, when we're dealing with case one, the induced alignment is of the strings X prime and Y prime, that is, X and Y with their final characters peeled off. Say this induced alignment has some total penalty, capital P, and suppose for contradiction that it's not actually an optimal alignment of X prime and Y prime. That is, suppose that if we started from scratch, we'd come up with some superior alignment of X prime and Y prime, with total penalty P star strictly smaller than P. But if that were the case, it would be a simple matter to lift this purportedly better alignment of X prime and Y prime to an alignment of the original strings X and Y: namely, we just reuse the exact same alignment of X prime and Y prime, and then in the final position we match little x sub m with little y sub n.

So what is the total penalty of this extended alignment of all of X and Y? Well, it's just the penalty incurred in everything but the final position, which is the old penalty P star, plus the new penalty incurred in the final position, which is just the penalty alpha(x_m, y_n) for matching the characters little x sub m and little y sub n. Since P star is less than P, of course P star plus alpha(x_m, y_n) is less than P plus alpha(x_m, y_n). But this second quantity is simply the total penalty incurred by our original alignment of X and Y: that alignment incurred penalty capital P in the induced alignment of X prime and Y prime, and its total penalty was just that plus the penalty in the final position, which is this same alpha(x_m, y_n). And that furnishes the contradiction: we supposed we started with an optimal alignment of X and Y, yet here is a better one. That contradiction completes the proof of the optimal substructure claim.
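Putting the three cases together, here is a minimal hedged sketch of the recurrence that this optimal substructure suggests, anticipating the recurrence we'll develop properly in what follows. The uniform gap penalty and the mismatch function alpha are assumed inputs, as in the earlier sketch; memoization stands in for the table of subproblems.

    from functools import lru_cache

    def nw_score(X: str, Y: str, gap: int, alpha) -> int:
        # P(i, j): minimum total penalty of an alignment of the prefixes
        # X[:i] and Y[:j]. The three candidates mirror cases one to three.
        @lru_cache(maxsize=None)
        def P(i: int, j: int) -> int:
            if i == 0:
                return j * gap  # only gaps remain opposite Y's prefix
            if j == 0:
                return i * gap  # only gaps remain opposite X's prefix
            return min(
                alpha(X[i - 1], Y[j - 1]) + P(i - 1, j - 1),  # case one
                gap + P(i - 1, j),                            # case two
                gap + P(i, j - 1),                            # case three
            )
        return P(len(X), len(Y))

    # Running example with the same assumed penalties as before:
    print(nw_score("AGGGCT", "AGGCA", gap=1,
                   alpha=lambda a, b: 0 if a == b else 2))  # prints 3

With memoization, this touches each pair of prefixes of X and Y at most once, which is a preview of the collection of subproblems we'll read off from the recurrence.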