1 00:00:00,800 --> 00:00:04,770 So, now that we understand that the optimal solution to the sequence element 2 00:00:04,770 --> 00:00:08,636 problem has to be only one of three candidates, we're going to be easily able 3 00:00:08,636 --> 00:00:12,554 to formulate a recurrence, identify the relevant sub-problems and derive an 4 00:00:12,554 --> 00:00:15,480 efficient dynamic programming algorithm for the problem. 5 00:00:17,900 --> 00:00:22,111 So, here is the culmination of our work. On the previous video, we thought about 6 00:00:22,111 --> 00:00:26,322 an optimal alignment of some pair of strings X and Y, and we notice that there 7 00:00:26,322 --> 00:00:29,291 are three cases for the contents of the final position. 8 00:00:29,291 --> 00:00:33,503 Either there's no gaps or there's a gap on top or there's a gap on the bottom. 9 00:00:33,503 --> 00:00:36,688 in case one, where there's no gaps, XM and YM get matched. 10 00:00:36,688 --> 00:00:41,061 And we proved that the induced alignment which is of the smaller strings X prime 11 00:00:41,061 --> 00:00:43,599 and Y prime has to be optimal in its own right. 12 00:00:43,599 --> 00:00:47,810 In the second case where the character little X sub M gets matched with a gap 13 00:00:47,810 --> 00:00:50,240 induced alignment this time of X prime and Y. 14 00:00:50,240 --> 00:00:54,006 Has to be optimal in its own right, and the third case where little y sub n gets 15 00:00:54,006 --> 00:00:57,020 matched with the gap, the induced alignment now of x and y prime. 16 00:00:57,020 --> 00:01:01,160 Must be optimal. So one way to think about this kind of 17 00:01:01,160 --> 00:01:05,658 assertion is it says that the optimal solution to a problem, to a sequence of a 18 00:01:05,658 --> 00:01:10,338 lot of problem depends only on the solutions to three different smaller 19 00:01:10,338 --> 00:01:15,140 sub-problems, one involving x, x prime and y prime with characters peeled off of 20 00:01:15,140 --> 00:01:18,544 both of the strings. One involving x prime and y and one 21 00:01:18,544 --> 00:01:22,495 involving x and y prime. But in all of the cases, all that we care 22 00:01:22,495 --> 00:01:27,418 about are sub-problems in which a single character was peeled off from the right 23 00:01:27,418 --> 00:01:31,080 from one or both of the strings that we started with. 24 00:01:31,080 --> 00:01:34,308 The situation is very similar to in our previous two examples. 25 00:01:34,308 --> 00:01:38,109 We have independent sets on line graphs and the nap sack problem and the 26 00:01:38,109 --> 00:01:42,326 independent set problem whenever we, we only cared about sub problems obtained by 27 00:01:42,326 --> 00:01:45,763 plucking off either one or two vertices from the given line graph. 28 00:01:45,763 --> 00:01:49,303 So all we ever cared about were prefixes of the original line graph. 29 00:01:49,303 --> 00:01:53,156 In the nap sack problem we got sub problems by plucking off the last item 30 00:01:53,156 --> 00:01:56,905 and perhaps also reducing the nap sack capacity by some interval amount. 31 00:01:56,905 --> 00:02:01,174 So there were two dimensions in the nap sack problem for which sub problems could 32 00:02:01,174 --> 00:02:05,080 decrease in size, then number of items in the residual nap sack capacity. 33 00:02:05,080 --> 00:02:07,811 So we use two parameters to keep track of the sub problems. 34 00:02:07,811 --> 00:02:11,562 And what we cared about were all possible prefixes of the items and all possible 35 00:02:11,562 --> 00:02:14,711 residual integral capacities, at most the original knapsack capacity. 36 00:02:14,711 --> 00:02:16,887 So what's up in the sequence alignment problem? 37 00:02:16,887 --> 00:02:20,406 Well here, sub problems get smaller by plucking a character off of the first 38 00:02:20,406 --> 00:02:23,787 string and or the second string. So again there are two ways in which the 39 00:02:23,787 --> 00:02:27,167 sub problem can get smaller, either the first string or the second string. 40 00:02:27,167 --> 00:02:30,454 So we'll again use two different parameters, one to figure out how much 41 00:02:30,454 --> 00:02:33,974 we've plucked off of the first string, the second one to figure out how much 42 00:02:33,974 --> 00:02:36,132 we've plucked off of the second string. Right. 43 00:02:36,132 --> 00:02:39,780 But all we care about. The only relevant sub problems involved. 44 00:02:39,780 --> 00:02:45,580 Prefixes of the two original input strings X and Y. 45 00:02:45,580 --> 00:02:51,803 That is, the only sub problems that we care about have the form x I y j, where x 46 00:02:51,803 --> 00:02:58,185 I denotes the first I letters of capital x, some prefix of x, and y j denotes some 47 00:02:58,185 --> 00:03:05,888 prefix of y, the first j letters of y. So lets now move from the sub-problems 48 00:03:05,888 --> 00:03:09,399 we're going to use in our dynamic programming algorithm to the recurrence 49 00:03:09,399 --> 00:03:12,465 that we're going to use. And the recurrence really all it does is 50 00:03:12,465 --> 00:03:16,223 compile our understanding of the optimal solution and how it depends on the 51 00:03:16,223 --> 00:03:20,080 solution of the smaller sub-problems into an easy to use mathematical formula. 52 00:03:23,380 --> 00:03:28,149 So I'll use the notation P sub i j for the value of the optimal solution to the 53 00:03:28,149 --> 00:03:32,680 corresponding sub problem, the one involving the prefix X i and the prefix Y 54 00:03:32,680 --> 00:03:41,295 j So for a given set of positive values for i and j, what is Pij? 55 00:03:41,295 --> 00:03:49,205 Well, there are three possibilities. Case one is where the final position of 56 00:03:49,205 --> 00:03:54,207 the optimal alignment doesn't have any gaps, so it matches the final character 57 00:03:54,207 --> 00:03:59,337 of X sub i, that is little x sub i and the final character of the prefix capital 58 00:03:59,337 --> 00:04:02,223 Y sub j, that is the character little y sub j. 59 00:04:02,223 --> 00:04:07,289 It matches them together and just reuses an optimal alignment for the smaller 60 00:04:07,289 --> 00:04:15,885 strings, Xi-1 and Yi-1 Case two is where the last letter of the first string, that 61 00:04:15,885 --> 00:04:21,499 is little x of i gets matched with a gap. And that case the penalty of the 62 00:04:21,499 --> 00:04:27,046 corresponding alignment is the penalty of a gap plus whatever the optimal alignment 63 00:04:27,046 --> 00:04:31,999 is of the first i minus one letter of capital X plus the first J letter of 64 00:04:31,999 --> 00:04:37,376 Capital Y. The symmetrically case three we pay for a 65 00:04:37,376 --> 00:04:41,461 gap and then we pay whatever the optimal alignment is of all of the first I 66 00:04:41,461 --> 00:04:44,792 letters of capital X with the first j menace one letters of Y. 67 00:04:44,792 --> 00:04:49,199 This is the case where the last letter of the second string gets matched with the 68 00:04:49,199 --> 00:04:52,900 gap in the final position of the optimal alignment. 69 00:04:52,900 --> 00:04:57,226 So we know that the optimal solution has to look like one of these three things, 70 00:04:57,226 --> 00:05:01,390 we don't know which, so in the recurrence we'll just in effect do brute force 71 00:05:01,390 --> 00:05:05,067 search among the three outcomes. We just remember, we just choose the 72 00:05:05,067 --> 00:05:09,850 minimum of the three possibilities. So that's the formal recurrence. 73 00:05:09,850 --> 00:05:14,020 It's correctness really just follows immediately from the work we already did, 74 00:05:14,020 --> 00:05:16,906 understand what the optimal solution has to look like. 75 00:05:16,906 --> 00:05:21,076 So, before we state the algorithm where we systematically solve all of the sub 76 00:05:21,076 --> 00:05:24,978 problems using this magical formula. Let's just make sure we get the edge 77 00:05:24,978 --> 00:05:29,280 cases, the base cases where i or j equals zero correctly sorted out. 78 00:05:29,280 --> 00:05:37,066 So specifically what is the value of p I,0 and also it turns out p of zero, i 79 00:05:37,066 --> 00:05:41,820 where i here is just some non-negative integer. 80 00:05:52,100 --> 00:05:54,462 Alright. So the answer to this question is the 81 00:05:54,462 --> 00:05:57,441 second one, is B. And I hope if you could keep the notation 82 00:05:57,441 --> 00:05:59,598 straight then the answer was fairly clear. 83 00:05:59,598 --> 00:06:01,755 So let's remember what, what does PIJ mean? 84 00:06:01,755 --> 00:06:05,864 That's the total penalty of an optimal alignment between the first i letters of 85 00:06:05,864 --> 00:06:08,483 X, and the first j letters of Y. So consider Pi zero. 86 00:06:08,483 --> 00:06:12,489 So now we're asking about aligning the first zero letters of X with the first 87 00:06:12,489 --> 00:06:14,954 zero letters of Y. That is with the empty string. 88 00:06:14,954 --> 00:06:19,166 Well the optimal way to match any string to the empty string is you're just going 89 00:06:19,166 --> 00:06:22,402 to insert gaps into the empty string to equalize their lengths. 90 00:06:22,402 --> 00:06:25,535 And so if your string has length i, you need to insert i gaps. 91 00:06:25,535 --> 00:06:29,040 What's the? Penalty of that alignment is just i times 92 00:06:29,040 --> 00:06:33,520 the penalty for a single gap, and that's the answer here in B. 93 00:06:33,520 --> 00:06:37,541 So we're ready now to give the algorithm, and as with all these dynamic programming 94 00:06:37,541 --> 00:06:41,418 algorithms once you know the sub-problems and once you know the recurrence that 95 00:06:41,418 --> 00:06:43,986 relates their solutions there's really nothing to do. 96 00:06:43,986 --> 00:06:47,910 All you do is systematically answer solve all of the sub-problems moving from 97 00:06:47,910 --> 00:06:52,115 smallest to largest. So we're going to use an array A to keep 98 00:06:52,115 --> 00:06:54,646 track of the solutions of all of these sub-problems. 99 00:06:54,646 --> 00:06:58,297 A is going to have two dimensions. The reason for two dimensions is we have 100 00:06:58,297 --> 00:07:01,948 two independent parameters which are keeping track of the sub-problem size. 101 00:07:01,948 --> 00:07:05,550 One for how many letters of X we're dealing with, and the second dimension 102 00:07:05,550 --> 00:07:07,983 for how many letters of Y that we're dealing with. 103 00:07:07,983 --> 00:07:11,975 That's analogous to the knapsack problem, where we also had two dimensions to keep 104 00:07:11,975 --> 00:07:14,165 track of. The number of items in play, and the 105 00:07:14,165 --> 00:07:19,185 residual knapsack capacity. We just figured out what the base case 106 00:07:19,185 --> 00:07:22,689 is, so we just solved those in a pre-processing step. 107 00:07:22,689 --> 00:07:27,596 So if one of the two indices is zero, then the optimal solution value is just 108 00:07:27,596 --> 00:07:32,856 the gap penalty times the non-zero index. And, now we just go to our double four 109 00:07:32,856 --> 00:07:35,268 loops. It's double four loops because we have 110 00:07:35,268 --> 00:07:38,805 two indices into out array. And whenever we get into a sub problem, 111 00:07:38,805 --> 00:07:43,414 we just evaluate the recurrence invoking of these solution to the already computed 112 00:07:43,414 --> 00:07:49,533 smaller sub problems. So one sanity check you should always 113 00:07:49,533 --> 00:07:53,518 apply when you're writing out the code for a dynamic programming algorithm: when 114 00:07:53,518 --> 00:07:57,205 you look at the right hand side of your recurrence, when you look at these 115 00:07:57,205 --> 00:08:01,241 purportedly already solved subproblems whose solutions you're using to solve the 116 00:08:01,241 --> 00:08:04,629 current subproblem, make sure you have actually already solved those 117 00:08:04,629 --> 00:08:07,269 subproblems. So in this case we're good to go because 118 00:08:07,269 --> 00:08:11,355 the indices of the subproblems are only less than the entry that we're filling in 119 00:08:11,355 --> 00:08:13,646 right now. So indeed all three of the relevant 120 00:08:13,646 --> 00:08:18,828 subproblems, A-I - 1 j - 1 A-I - 1 j, and A-I j - 1 they've already been computed 121 00:08:18,828 --> 00:08:21,120 in earlier iterations of our double four loop. 122 00:08:21,120 --> 00:08:25,980 So they're just hanging out, waiting to get looked up in constant time. 123 00:08:25,980 --> 00:08:29,912 And as usual once you've actually figured out the key ingredients for the dynamic 124 00:08:29,912 --> 00:08:33,221 programming solution, namely the sub-problems and the recurrence, it's 125 00:08:33,221 --> 00:08:36,961 pretty much self evident why the things going to work and it's also self evident 126 00:08:36,961 --> 00:08:39,900 exactly what its running time is going to be. 127 00:08:39,900 --> 00:08:44,896 So why is the algorithm correct? That is, why does it terminate with every 128 00:08:44,896 --> 00:08:49,618 entry Aij equal to the true optimal penalty Pij of the corresponding 129 00:08:49,618 --> 00:08:53,589 sub-problem. Well, this just follows because our 130 00:08:53,589 --> 00:08:57,650 recurrent is correct, that's where all the hard work was, and then we're just 131 00:08:57,650 --> 00:09:00,161 systematically solving all of the sub-problems. 132 00:09:00,161 --> 00:09:03,100 So, formally, if you like, it would be proof by induction. 133 00:09:05,120 --> 00:09:07,942 So, the running time is completely trivial to evaluate. 134 00:09:07,942 --> 00:09:11,863 In each iteration of this double four loop, we do a constant amount of work. 135 00:09:11,863 --> 00:09:15,784 We just need to look up three things in constant time and make a couple of 136 00:09:15,784 --> 00:09:17,980 comparisons. How many four loops are there? 137 00:09:17,980 --> 00:09:21,744 Well, M iterations of the outer four loop, N iterations of the inner four 138 00:09:21,744 --> 00:09:23,835 loop. So we suffer the product, M times N. 139 00:09:23,835 --> 00:09:28,121 That is, the running time is proportional to the product of the lengths of the two 140 00:09:28,121 --> 00:09:33,661 strings. So depending on the application, you may 141 00:09:33,661 --> 00:09:37,654 be content to just have an algorithm compute for you the nw score, the total 142 00:09:37,654 --> 00:09:41,699 penalty of an optimal alignment or perhaps you're actually interested in the 143 00:09:41,699 --> 00:09:44,956 alignment itself. And just as we discussed with independent 144 00:09:44,956 --> 00:09:48,739 sets of line graphs, by tracing back through the filled in table, you can 145 00:09:48,739 --> 00:09:52,836 indeed reconstruct an optimal solution. So let me just give you the high level 146 00:09:52,836 --> 00:09:55,989 idea of how it works. It's going to to follow the same template 147 00:09:55,989 --> 00:10:00,244 and all you think through the details of why this really works in the privacy of 148 00:10:00,244 --> 00:10:04,258 your own home. So assume that you've already run the 149 00:10:04,258 --> 00:10:07,868 algorithm on the previous slide. That you've filled in all the entries of 150 00:10:07,868 --> 00:10:10,390 this two d array capital A. Now we want to trace back. 151 00:10:10,390 --> 00:10:13,551 So where are we going to start tracing back this filled in table? 152 00:10:13,551 --> 00:10:16,956 Well, we are going to start with a problem that we actually care about, 153 00:10:16,956 --> 00:10:20,847 namely the largest problem A of M comma N, that's what we want the alignment for. 154 00:10:20,847 --> 00:10:24,593 Now we know the softball alignment looks, has one of three candidates, we know 155 00:10:24,593 --> 00:10:28,533 there's three possible situations for the contents of the final position of that 156 00:10:28,533 --> 00:10:30,916 alignment. More over, when we filled in this entry 157 00:10:30,916 --> 00:10:34,807 of the table, we explicitly compared the three possibilities to figure out which 158 00:10:34,807 --> 00:10:37,580 one was the best. So you know, perhaps on the forward pass 159 00:10:37,580 --> 00:10:41,276 we actually cached the result of the comparison, or in the worst case we can 160 00:10:41,276 --> 00:10:44,390 just go back and re-compute, and figure out which of those three. 161 00:10:44,390 --> 00:10:48,138 Pieces was used to fill in this entry, and depending on which one of the three 162 00:10:48,138 --> 00:10:51,502 candidates won, that tells us, what should be, the contents of the final 163 00:10:51,502 --> 00:10:55,106 position of the optimal alignment. If case one was used to fill in this 164 00:10:55,106 --> 00:10:57,749 entry, we should match, little x sub n and little y sub n. 165 00:10:57,749 --> 00:11:01,257 If case two was used to fill in this entry, we should match little x sub n 166 00:11:01,257 --> 00:11:03,756 with the gap. If case three was used, to fill in this 167 00:11:03,756 --> 00:11:06,159 entry, we should match little y sub n with the gap. 168 00:11:06,159 --> 00:11:08,370 If there was a tie, we get to pick any of them. 169 00:11:08,370 --> 00:11:11,147 Arbitrarily, all of them will lead to optimal alignments. 170 00:11:11,147 --> 00:11:14,520 Then of course, after figuring out what to do in this final position. 171 00:11:14,520 --> 00:11:17,694 We have an induced sub problem involving x prime and or y prime. 172 00:11:17,694 --> 00:11:20,423 That tells us a, a previous entry of the table to go to. 173 00:11:20,423 --> 00:11:23,895 And we just repeat the process. We, again, figure out which of the three 174 00:11:23,895 --> 00:11:27,615 cases was used to fill in this entry. That tells us how to fill in the next 175 00:11:27,615 --> 00:11:31,483 right most position of the alignment. And we just keep going until we fall off 176 00:11:31,483 --> 00:11:39,330 the table. So what do you do when you fall off the 177 00:11:39,330 --> 00:11:41,717 table? Well, once one of the indices I or J gets 178 00:11:41,717 --> 00:11:44,154 all the way down to zero, now you have no choice. 179 00:11:44,154 --> 00:11:48,267 So now one of the strings is empty and the other one has some number of symbol. 180 00:11:48,267 --> 00:11:52,380 So you should just insert the appropriate number of gaps to equalize the lengths. 181 00:11:54,260 --> 00:11:58,625 One thing that's pretty neat is that this trace back procedure is efficient. 182 00:11:58,625 --> 00:12:02,358 In fact, it's way more efficient, in general, than the forward pass. 183 00:12:02,358 --> 00:12:06,839 For the forward pass, you have to fill in every single one of the m * n entries. 184 00:12:06,839 --> 00:12:10,170 But in this trace back procedure, each time you track back. 185 00:12:10,170 --> 00:12:13,214 One of the two indices, at least, will get decremented. 186 00:12:13,214 --> 00:12:17,465 So that says, you're going to complete this trace back in O of m + n time with 187 00:12:17,465 --> 00:12:20,280 an optimal alignment of the original two strings.