1 00:00:00,000 --> 00:00:03,893 At the beginning of the course, we talked about the sequence alignment problem, 2 00:00:03,893 --> 00:00:07,188 a problem which is fundamental to modern computational genomics. 3 00:00:07,188 --> 00:00:10,832 And we talked about the need for an efficient algorithm for solving that 4 00:00:10,832 --> 00:00:13,528 problem, for finding the best alignment of two strings. 5 00:00:13,528 --> 00:00:17,371 I'm pleased to report that at this point we're well prepared to give such an 6 00:00:17,371 --> 00:00:19,817 algorithm. Indeed, such an efficient solution will 7 00:00:19,817 --> 00:00:23,511 readily fall out of the dynamic programming recipe that we now have quite 8 00:00:23,511 --> 00:00:28,742 a bit of practice with. So let me briefly jog your memory about 9 00:00:28,742 --> 00:00:32,617 the sequence alignment problem. So the goal here is to compute a 10 00:00:32,617 --> 00:00:37,280 similarity measure between strings, a similarity measure defined as the total 11 00:00:37,280 --> 00:00:42,440 penalty of the best alignment, also known as the Needleman-Wunsch score. 12 00:00:42,440 --> 00:00:46,660 So for example. if you've given as input the strings A G, 13 00:00:46,660 --> 00:00:51,860 G, G C T at A G, G C A, a natural candidate alignment would be to stack 14 00:00:51,860 --> 00:00:57,965 them on top of the other, inserting a gap in the shortest string after the two Gs, 15 00:00:57,965 --> 00:01:04,906 that some sense represent the missing G. This is a pretty good alignment that 16 00:01:04,906 --> 00:01:09,119 suffers from merely two flaws. So first of all, we did resort to adding 17 00:01:09,119 --> 00:01:13,211 a gap in the second string. Second of all, there is a mismatch in the 18 00:01:13,211 --> 00:01:16,680 final column. The A and the T get mismatched. 19 00:01:16,680 --> 00:01:22,070 In general we evaluate the alignment by summing up the penalties of all the flaws 20 00:01:22,070 --> 00:01:26,080 and the sum penalty per gap and the sum penalty per mismatch. 21 00:01:27,680 --> 00:01:31,876 So, a bit more precisely, as input in this computational problem, we're given 22 00:01:31,876 --> 00:01:34,618 two strings. I'm going to call them capital X and 23 00:01:34,618 --> 00:01:37,527 capital Y. I'm going to use little x and little y to 24 00:01:37,527 --> 00:01:40,325 denote the individual characters of these strings. 25 00:01:40,325 --> 00:01:44,689 Let's say the first string capital X has linked M and the second string Y has 26 00:01:44,689 --> 00:01:47,431 linked N. In addition to the two input strings we 27 00:01:47,431 --> 00:01:51,571 assume we're given as input the values for the various types of penalties. 28 00:01:51,571 --> 00:01:55,376 So that we know exactly how much it costs each time we insert a gap. 29 00:01:55,376 --> 00:01:59,796 And for each possible mismatch, we need to know exactly what's the cost of that 30 00:01:59,796 --> 00:02:03,899 mismatch. In principle you could be given a penalty 31 00:02:03,899 --> 00:02:08,444 for matching a letter with itself, but typically that's going to be a penalty of 32 00:02:08,444 --> 00:02:14,354 zero. The space of feasible solutions are just 33 00:02:14,354 --> 00:02:19,007 the ways of inserting gaps into the two strings so that the results have equal 34 00:02:19,007 --> 00:02:22,563 length. I should emphasize that you're allowed to 35 00:02:22,563 --> 00:02:26,432 insert gaps into both of the strings. In the example, we only inserted into one 36 00:02:26,432 --> 00:02:29,308 of the two strings, but in general, you might have an input 37 00:02:29,308 --> 00:02:32,235 where one string is seven characters longer than the other, 38 00:02:32,235 --> 00:02:36,352 and it might turn out that in the optimal alignment, the best thing to do is insert 39 00:02:36,352 --> 00:02:38,832 three gaps at various places in the longer string, 40 00:02:38,832 --> 00:02:41,560 and ten gaps at various places into the shorter string. 41 00:02:41,560 --> 00:02:45,708 And the goal, of course is just to compute amongst all of the exponentially 42 00:02:45,708 --> 00:02:50,078 many alignments, the one that minimizes the total penalty, where total penalty is 43 00:02:50,078 --> 00:02:54,171 just the sum of individual penalties for the inserted gaps and the various 44 00:02:54,171 --> 00:02:58,440 mismatches. So let's not be unduly intimidated by how 45 00:02:58,440 --> 00:03:01,840 fundamental this problem is, and let's just apply the dynamic recipe, 46 00:03:01,840 --> 00:03:04,640 the programing recipe that we have been using all along. 47 00:03:04,640 --> 00:03:08,340 Now remember, the really key insight in any dynamic programing solution is 48 00:03:08,340 --> 00:03:11,190 figuring out what's the right collection of sub problems. 49 00:03:11,190 --> 00:03:15,240 And if you're feeling like your up to the black belt level in dynamic programing, 50 00:03:15,240 --> 00:03:19,240 you might just want to try to guess what are the right collection of sub problems 51 00:03:19,240 --> 00:03:22,390 for sequence alignment? But I don't expect you to be able to do 52 00:03:22,390 --> 00:03:25,240 that at this point. And so, as usual we're going to derive 53 00:03:25,240 --> 00:03:29,316 the correct collection of sub-problems. And we're going to do it by reasoning 54 00:03:29,316 --> 00:03:31,727 about the structure of an optimal solution, 55 00:03:31,727 --> 00:03:36,213 narrowing it down to a small number of candidates composed in various ways from 56 00:03:36,213 --> 00:03:40,939 solutions to smaller sub-problems. Once we've figured out the small number 57 00:03:40,939 --> 00:03:44,463 of possibilities for what the optimal solution could look like in terms of 58 00:03:44,463 --> 00:03:48,174 solutions to smaller sub problems, we'll be able to drive a recurrence which in 59 00:03:48,174 --> 00:03:51,698 effect just does brute force search through the small number of candidates. 60 00:03:51,698 --> 00:03:54,047 And from the recurrence we'll be able to back out. 61 00:03:54,047 --> 00:03:57,618 We'll be able to reverse engineer what are the various sub problems that we 62 00:03:57,618 --> 00:04:00,080 actually care about and that we have to solve. 63 00:04:00,080 --> 00:04:03,641 So let's do a thought experiment. What does the optimal solution have to 64 00:04:03,641 --> 00:04:06,149 look like? And again, remember, this is exactly what 65 00:04:06,149 --> 00:04:09,761 it is that we're trying to compute but that's not going to stop us from 66 00:04:09,761 --> 00:04:12,520 reasoning about it. If someone handed to us on a silver 67 00:04:12,520 --> 00:04:16,060 platter the optimal solution, what would it have to look like? 68 00:04:16,060 --> 00:04:20,522 So consider any old pair of strings, capital X and capital Y, and an optimal 69 00:04:20,522 --> 00:04:24,091 alignment of them. Let's visualize this optimal alignment as 70 00:04:24,091 --> 00:04:26,650 follows. Let's write down the string X plus 71 00:04:26,650 --> 00:04:31,112 whatever gaps get inserted into it on top, and right beneath it we'll write 72 00:04:31,112 --> 00:04:34,563 down the string Y, with whatever gaps are inserted into it. 73 00:04:34,563 --> 00:04:37,300 These two things have exactly the same length. 74 00:04:39,180 --> 00:04:43,436 So to figure out the various cases of the structure for this optimal solution, 75 00:04:43,436 --> 00:04:46,874 let's reason by analogy with the problems we've already solved. 76 00:04:46,874 --> 00:04:50,858 So back when we were looking at independent sets of line graphs, our case 77 00:04:50,858 --> 00:04:54,951 analysis was well, either the final vertex, the rightmost vertex of the path, 78 00:04:54,951 --> 00:04:59,199 is in the optimal solution or it's not. In the knapsack problem, we said well 79 00:04:59,199 --> 00:05:02,637 either the last item is in the optimal solution or it's not. 80 00:05:02,637 --> 00:05:07,048 So we always looked at sort of the last part of the optimal solution, in some 81 00:05:07,048 --> 00:05:10,943 sense the rightmost position. And happily, staring at this alignment, 82 00:05:10,943 --> 00:05:15,240 we see we can once again focus just on the action in the right most, in the 83 00:05:15,240 --> 00:05:19,082 final position. So now I have a question for you. 84 00:05:19,082 --> 00:05:22,052 So in the independent set problem, there were two cases. 85 00:05:22,052 --> 00:05:25,454 The last vertex was either in the optimal solution or it's not. 86 00:05:25,454 --> 00:05:29,721 In the knapsack problem, there were also two cases, the final item was either in 87 00:05:29,721 --> 00:05:33,285 the optimal solution or it's not. So my question for you is, in the 88 00:05:33,285 --> 00:05:37,713 sequence alignment problem: when we focus on what's going on in the final position 89 00:05:37,713 --> 00:05:41,440 of the optimal alignment, how many relevant cases do we have to study? 90 00:05:48,500 --> 00:05:53,334 So the answer I'm looking for is B, three relevant possiblities for the contents of 91 00:05:53,334 --> 00:05:56,771 the final position. Let me explain my reasoning, let's start 92 00:05:56,771 --> 00:05:59,275 with the upper parts of the final position. 93 00:05:59,275 --> 00:06:04,284 observe that if that's a character of the string capital X, it can only be the very 94 00:06:04,284 --> 00:06:06,906 last character. It can only be little X sub N. 95 00:06:06,906 --> 00:06:09,527 That's because that's where this string ends. 96 00:06:09,527 --> 00:06:14,187 Now we don't know that little X sub N is in the final position, there might be a 97 00:06:14,187 --> 00:06:16,551 gap. Similarly, in the bottom part of this 98 00:06:16,551 --> 00:06:18,912 final position, there's two possibilities. 99 00:06:18,912 --> 00:06:21,908 There's a gap, or, if it's a character of y, it has to 100 00:06:21,908 --> 00:06:26,055 be the final character, little y, sub n. So that one seems to suggest four 101 00:06:26,055 --> 00:06:28,244 possibilities, two options for the top, 102 00:06:28,244 --> 00:06:32,045 two options for the bottom. But the hint of talking about relevant 103 00:06:32,045 --> 00:06:36,365 possibilities is that it's totally pointless to have two, a gap in both the 104 00:06:36,365 --> 00:06:37,700 top and the bottom. Why? 105 00:06:37,700 --> 00:06:41,845 Well the penalty for gaps is non-negative, so if we just deleted both 106 00:06:41,845 --> 00:06:45,389 of those gaps we'd get an even better alignment of X and Y. 107 00:06:45,389 --> 00:06:50,015 And in studying an optimal solution, we can therefore assume we never have two 108 00:06:50,015 --> 00:06:54,180 gaps in a common position. So that leaves exactly three cases. 109 00:06:54,180 --> 00:06:58,730 It could be there's no gaps at all, that in fact this alignment matches the 110 00:06:58,730 --> 00:07:01,400 character, little x of m, with little y sub n. 111 00:07:03,820 --> 00:07:08,620 Or it could match the final character of capital X, with a gap. 112 00:07:10,340 --> 00:07:14,640 Or it could match the final character of capital Y with a gap. 113 00:07:16,220 --> 00:07:20,596 So the hope behind this case analysis is that we're going to be able to boil down 114 00:07:20,596 --> 00:07:24,474 the possibilities for the optimal solution to merely three candidates, 115 00:07:24,474 --> 00:07:28,574 one candidate for each of the three possibilities for the contents of the 116 00:07:28,574 --> 00:07:31,731 final position. That would be analogous to what we did in 117 00:07:31,731 --> 00:07:35,886 both the independent set and knapsack problems, where we boiled the optimal 118 00:07:35,886 --> 00:07:39,709 solution down to just one of two candidates, corresponding to whether 119 00:07:39,709 --> 00:07:43,698 either the final vertex or the final item, as the case may be, was in the 120 00:07:43,698 --> 00:07:47,637 optimal solution. Another way of thinking about this is 121 00:07:47,637 --> 00:07:51,681 we'd like to make precise the idea that if we just knew what was going on in the 122 00:07:51,681 --> 00:07:55,526 final position, if only a little birdy would tell us which of the three cases 123 00:07:55,526 --> 00:07:59,520 that we're in, then we'd be done by just solving the some smaller sub-problem 124 00:07:59,520 --> 00:08:03,267 recursively. So let's now state for each of the three 125 00:08:03,267 --> 00:08:05,815 possible scenarios for the final position, 126 00:08:05,815 --> 00:08:10,546 what is the corresponding candidate for the optimal solution, the way in which, 127 00:08:10,546 --> 00:08:14,914 it must necessary be composed with an optimal solution to a smaller sub 128 00:08:14,914 --> 00:08:18,877 problem. So who are going to be the protagonists 129 00:08:18,877 --> 00:08:23,059 in our smaller sub-problem? Well, the smaller sub-problem's going to 130 00:08:23,059 --> 00:08:26,680 involve everything except the stuff in the final position. 131 00:08:26,680 --> 00:08:31,299 So it's going to involve the string's X and Y, possibly with one character 132 00:08:31,299 --> 00:08:34,483 remaining. So let's let x prime be x, with its final 133 00:08:34,483 --> 00:08:38,229 character peeled off. Y prime's going to be y, with its final 134 00:08:38,229 --> 00:08:44,023 character peeled off. So let me just remind you of how I 135 00:08:44,023 --> 00:08:47,614 numbered the three cases. So case one is when the final position 136 00:08:47,614 --> 00:08:52,159 contains the final characters of both of the two strings, that is, when there's no 137 00:08:52,159 --> 00:08:55,421 gaps. Case two is when x, little x of n gets 138 00:08:55,421 --> 00:09:00,562 matched with the gap and case three is when little y of n gets matched with the 139 00:09:00,562 --> 00:09:03,846 gap. Alright, so let's suppose that case one 140 00:09:03,846 --> 00:09:06,415 holds. This means that the contents of the final 141 00:09:06,415 --> 00:09:10,428 position, includes both of the characters little x sub m and little y sub n. 142 00:09:10,428 --> 00:09:14,227 So now what we're going to do is we want to look at a smaller sub problem. 143 00:09:14,227 --> 00:09:18,348 And we want to look at the sub problem induced by the contents of all of the 144 00:09:18,348 --> 00:09:21,344 rest of the positions. We're going to call that the induced 145 00:09:21,344 --> 00:09:25,229 alignment. Since we started with an alignment, two 146 00:09:25,229 --> 00:09:29,461 things that had to equal length, and we peeled off the final position of both, we 147 00:09:29,461 --> 00:09:33,478 have another thing that has equal link so we're justified in calling it an 148 00:09:33,478 --> 00:09:35,674 alignment. Now what is it an alignment of? 149 00:09:35,674 --> 00:09:39,906 Well if we're in case one, that means what's missing from the induced alignment 150 00:09:39,906 --> 00:09:43,067 are the final characters. little X of M and little Y's of N, 151 00:09:43,067 --> 00:09:47,138 which means the induced alignment is a bona fide alignment of X prime and Y 152 00:09:47,138 --> 00:09:51,966 prime. And certainly, what we're hoping is true, 153 00:09:51,966 --> 00:09:57,245 is that the induced alignment is in fact, an optimal alignment of these smaller 154 00:09:57,245 --> 00:10:01,816 strings x prime and y prime. This would say that when we're in case 155 00:10:01,816 --> 00:10:06,183 one, the optimal solution to the original problem is built up in a simple way from 156 00:10:06,183 --> 00:10:09,440 an optimal solution to a smaller sub problem. 157 00:10:09,440 --> 00:10:13,406 We're of course hoping that something analogous happens in cases two and three. 158 00:10:13,406 --> 00:10:17,373 The only change is going to be that the protagonists of the sub-problem will be a 159 00:10:17,373 --> 00:10:20,386 little bit different. In case two, the thing which is missing 160 00:10:20,386 --> 00:10:23,148 from the induced alignment is the final character of x. 161 00:10:23,148 --> 00:10:26,110 So, it's going to be the induced alignment of x prime and y. 162 00:10:26,110 --> 00:10:29,977 Similarly, in case three, the induced alignment is going to be an alignment of 163 00:10:29,977 --> 00:10:32,852 x and y prime. So, this is an insertion, this is a 164 00:10:32,852 --> 00:10:36,607 claim, it's not completely obvious, though the proof isn't hard, as I will 165 00:10:36,607 --> 00:10:39,996 show you on the next slide. But assuming for the moment that this 166 00:10:39,996 --> 00:10:42,812 assertion is true, it fulfills the hope we had earlier. 167 00:10:42,812 --> 00:10:47,088 It says that indeed, the optimal solution can only be one of three candidates, that 168 00:10:47,088 --> 00:10:50,894 one for each of the possibilities for the contents of the final position. 169 00:10:50,894 --> 00:10:55,014 Alternatively it says, that if we only knew which of the three cases we were in, 170 00:10:55,014 --> 00:10:58,820 we'd be done, we can recurse, we could look up a solution to a smaller sub 171 00:10:58,820 --> 00:11:02,783 problem and we could extend it in an easy way to a optimal solution for the 172 00:11:02,783 --> 00:11:05,908 original problem. So lets now move onto the proof of this 173 00:11:05,908 --> 00:11:08,706 assertion. Why is it true that an optimal solution 174 00:11:08,706 --> 00:11:13,070 must be built up from an optimal solution to the relevant smaller sub-problem? 175 00:11:13,070 --> 00:11:17,379 Well all of the cases are pretty much the same argument so I'm just going to do 176 00:11:17,379 --> 00:11:20,065 case one, the other cases are basically the same. 177 00:11:20,065 --> 00:11:24,359 I invite you to fill in the details. So it's going to be the same type of 178 00:11:24,359 --> 00:11:28,113 simple proof by contradiction that we used earlier, when reasoning about the 179 00:11:28,113 --> 00:11:31,867 structure of optimal solutions for the independent set in knapsack problems. 180 00:11:31,867 --> 00:11:35,967 We're going to assume the contrary, we're going to assume that the induced solution 181 00:11:35,967 --> 00:11:40,018 to the smaller subproblem is not optimal, and from the fact that there is a better 182 00:11:40,018 --> 00:11:43,476 solution for the subproblem, we will extract a better solution for the 183 00:11:43,476 --> 00:11:47,279 original problem, contradicting the purported optimality of the solution that 184 00:11:47,279 --> 00:11:50,271 we started with. So when we're dealing with case one, the 185 00:11:50,271 --> 00:11:54,618 induced alignment is of the strings X prime and Y prime, X and Y with the final 186 00:11:54,618 --> 00:11:57,700 character peeled off. And so for the contradiction, let's 187 00:11:57,700 --> 00:12:01,827 assume that this induced alignment, it has some penalty, say capital P. Let's 188 00:12:01,827 --> 00:12:05,625 assume it's not actually an optimal alignment of X prime and Y prime. 189 00:12:05,625 --> 00:12:09,752 That is, suppose if we started from scratch we'd come up with some superior 190 00:12:09,752 --> 00:12:13,934 alignment of X prime and Y prime, with total penalty P star, strictly smaller 191 00:12:13,934 --> 00:12:18,139 than P. But if that were the case, it would be a 192 00:12:18,139 --> 00:12:23,116 simple matter to lift this purportedly better alignment of x prime and y prime 193 00:12:23,116 --> 00:12:26,139 to an alignment of the original strings x and y. 194 00:12:26,139 --> 00:12:30,738 Namely we just reuse the exact same alignment of x prime and y prime, and 195 00:12:30,738 --> 00:12:34,140 then in the final position, we just match x m with y n. 196 00:12:39,540 --> 00:12:43,998 So what is the total penalty of this extended alignment of all of x and y? 197 00:12:43,998 --> 00:12:48,637 Well, it's just the penalty incurred in everything but the final position, and 198 00:12:48,637 --> 00:12:53,336 that's just the old penalty p star, plus the new penalty incurred in the final 199 00:12:53,336 --> 00:12:56,409 position. And that's just the penalty corresponding 200 00:12:56,409 --> 00:12:59,660 to the match of the characters x m and y n. 201 00:12:59,660 --> 00:13:04,982 P star being less than p, of course p star plus alpha x m y n is 202 00:13:04,982 --> 00:13:10,671 less than p plus alpha x m y n. But this second term is simply the total 203 00:13:10,671 --> 00:13:14,112 penalty incurred by our original alignment of X and Y, right? 204 00:13:14,112 --> 00:13:18,527 That alignment incurred penalty capital P, just in the induced alignment of X 205 00:13:18,527 --> 00:13:23,172 prime Y prime, and it's total penalty was just that plus the penalty in the final 206 00:13:23,172 --> 00:13:28,502 position, which is this alpha xn, yn. But that furnishes the contradiction that 207 00:13:28,502 --> 00:13:32,682 we suppose that we started with an optimal alignment of X and Y, 208 00:13:32,682 --> 00:13:36,926 yet here is a better one. So with that contradiction, it completes 209 00:13:36,926 --> 00:13:39,800 the proof of the optimal substructure claim.