1 00:00:03,180 --> 00:00:07,691 In this video we'll cover a second problem to whet your appetite for things 2 00:00:07,691 --> 00:00:11,040 to come, namely the problem of sequence alignment. 3 00:00:11,040 --> 00:00:14,158 So this is a fundamental problem in computational genomics. 4 00:00:14,158 --> 00:00:18,229 If you take a class on the subject it's very likely to occupy the very first 5 00:00:18,229 --> 00:00:21,136 couple of lectures. So in this problem you're given two 6 00:00:21,136 --> 00:00:25,365 strings over an alphabet and no prizes for guessing which is the alphabet we're 7 00:00:25,365 --> 00:00:29,241 most likely to care about. Typically, these strings represent 8 00:00:29,241 --> 00:00:34,685 portions of one or more genomes. And just as a toy running example you can 9 00:00:34,685 --> 00:00:39,099 just imagine that the two strings were given are A, G, G, G, C, T and A, 10 00:00:39,099 --> 00:00:42,410 G, G, C, A. Know that the two input strings do not 11 00:00:42,410 --> 00:00:44,985 necessarily need to be of the same length. 12 00:00:44,985 --> 00:00:49,874 And informally speaking, the goal of this sequence alignment problem is to figure 13 00:00:49,874 --> 00:00:52,260 out how similar the two input strings are. 14 00:00:52,260 --> 00:00:56,237 Obviously, I haven't told you what I mean by two strings being similar. 15 00:00:56,237 --> 00:00:59,759 That's something we'll develop over the next couple of slides. 16 00:00:59,759 --> 00:01:04,303 Why might you want to solve this problem? Well, there's actually a lot of reasons. 17 00:01:04,303 --> 00:01:06,690 Let me just give you two of many examples. 18 00:01:06,690 --> 00:01:10,809 What will be the conjecture or the function of regions of a genome that you 19 00:01:10,809 --> 00:01:13,139 don't understand, lets say the human genome, 20 00:01:13,139 --> 00:01:17,042 from similar regions that exist in genomes that you do understand or at 21 00:01:17,042 --> 00:01:19,481 least understand better, say the mouse genome. 22 00:01:19,481 --> 00:01:23,600 If you see a string that has a known function in the well understood genome 23 00:01:23,600 --> 00:01:27,502 and you see something similar in the poorly understood genome, you might 24 00:01:27,502 --> 00:01:30,050 conjecture it has the same or similar function. 25 00:01:30,050 --> 00:01:34,439 A totally different reason you might want to compare the genomes of two different 26 00:01:34,439 --> 00:01:38,720 species, is to figure out whether one evolved directly from the other and when. 27 00:01:38,720 --> 00:01:43,305 A second totally different reason you might want to compare the genomes of two 28 00:01:43,305 --> 00:01:47,296 different species is to understand their evolutionary relationship. 29 00:01:47,296 --> 00:01:51,763 So for example, maybe you have three species A, B, and C, and you're wondering 30 00:01:51,763 --> 00:01:55,932 whether B evolved from A and then C evolved from B, or whether B and C 31 00:01:55,932 --> 00:02:00,160 evolved independently from a common ancestor, A. And you might then take 32 00:02:00,160 --> 00:02:04,270 genome similarity as a measure of proximity in the evolutionary tree. 33 00:02:04,270 --> 00:02:08,988 So having motivated the informal version of the problem, let's work toward making 34 00:02:08,988 --> 00:02:12,191 it more formal. In particular, I owe you a discussion of 35 00:02:12,191 --> 00:02:18,652 what I mean by two strings being similar. So to develop intuition for this, let's 36 00:02:18,652 --> 00:02:22,806 revisit the two strings that we introduced on the previous slide A, G, G, 37 00:02:22,806 --> 00:02:23,892 G, C, T, and A, G, G, C, A. 38 00:02:23,892 --> 00:02:26,385 Now, if we just sort of eyeball these two 39 00:02:26,385 --> 00:02:29,644 strings, I mean clearly they're not the same string. 40 00:02:29,644 --> 00:02:34,117 But, we somehow feel like they're more similar than they are different. 41 00:02:34,117 --> 00:02:39,416 So, where does that intuition come from? Well, one way to make it more precise is 42 00:02:39,416 --> 00:02:44,889 to notice that these two strings can be nicely aligned in the following sense. 43 00:02:44,889 --> 00:02:47,766 Lets write down the longer string, A, G, G, G, C, T. 44 00:02:47,766 --> 00:02:53,028 And, I'm going to write the shorter string under it, and I'll insert a gap, a 45 00:02:53,028 --> 00:02:56,607 space to make the two strings have the same length. 46 00:02:56,607 --> 00:03:02,010 I'm going to put the space where there seems to be quote unquote a missing G. 47 00:03:02,010 --> 00:03:06,032 And then, what sense is this a nice alignment, well, it's clearly not 48 00:03:06,032 --> 00:03:08,941 perfect. We don't' get a character, by character 49 00:03:08,941 --> 00:03:12,531 match of the two strings, but there's only two minor flaws. 50 00:03:12,531 --> 00:03:17,234 So on the one hand, we did have to insert a gap and we do have to suffer one 51 00:03:17,234 --> 00:03:21,443 mismatch in the final column. So this institution motivates defining 52 00:03:21,443 --> 00:03:25,713 similarity between two strings with respect to their highest quality 53 00:03:25,713 --> 00:03:30,181 alignment, their nicest alignment. So we're getting closer to a formal 54 00:03:30,181 --> 00:03:33,375 problem statement, but it's still somewhat underdetermined. 55 00:03:33,375 --> 00:03:37,889 Specifically, we need to make precise why we might compare, why we might prefer one 56 00:03:37,889 --> 00:03:41,302 alignment over another. For example, is it better to have three 57 00:03:41,302 --> 00:03:45,266 gaps and no mismatches or is it better to have one gap and one mismatch? 58 00:03:45,266 --> 00:03:49,395 So if in this video, we're effectively going to punt on this question. We're 59 00:03:49,395 --> 00:03:53,414 going to assume this problem's already been solved experimentally, that it's 60 00:03:53,414 --> 00:03:57,818 known and provided this part of the input which is more costly, gaps and various 61 00:03:57,818 --> 00:04:04,107 types of mismatches. So here, then, is the formal problem 62 00:04:04,107 --> 00:04:07,366 statement. So, in addition to the two strings over 63 00:04:07,366 --> 00:04:09,761 A, C, G, T, we are provided as part of the 64 00:04:09,761 --> 00:04:14,882 input, a non-negative number indicating the cost we incurred in alignment for 65 00:04:14,882 --> 00:04:19,139 each gap that we insert. Similarly, for each possible mismatch of 66 00:04:19,139 --> 00:04:22,664 two characters, like, for example, mismatching an A and 67 00:04:22,664 --> 00:04:25,124 T. We're given as part of the input a 68 00:04:25,124 --> 00:04:29,315 corresponding penalty. Given this input, the responsibility of a 69 00:04:29,315 --> 00:04:34,768 sequence alignment algorithm is to output the alignment that minimizes the sum of 70 00:04:34,768 --> 00:04:38,614 the penalties. Another way to think of this output, the 71 00:04:38,614 --> 00:04:42,815 minimum penalty allignment is, we're trying to find in affect the minimum cost 72 00:04:42,815 --> 00:04:46,694 explanation for how one of these strings would've turned into the other. 73 00:04:46,694 --> 00:04:50,680 So we can think of a gap as sort of undoing a deletion that occurred some 74 00:04:50,680 --> 00:04:55,320 time in the past and we can think of a mismatch as representing a mutation. 75 00:04:55,320 --> 00:05:00,104 So this minimum possible total penalty, that is these values of this optimal 76 00:05:00,104 --> 00:05:05,015 alignment is famous and fundamental enough to have its own name namely the 77 00:05:05,015 --> 00:05:08,659 Needleman-Wunsch score. So this quantity is named after the two 78 00:05:08,659 --> 00:05:12,671 authors that proposed efficient algorithm for computing of the optimal alignment. 79 00:05:12,671 --> 00:05:17,021 that appeared way back in 1970, in the Journal of Molecular Biology. 80 00:05:17,021 --> 00:05:20,694 And now, at last, we have a formal definition of what it means for two 81 00:05:20,694 --> 00:05:24,083 strings to be similar. It means they have a small NW score, a 82 00:05:24,083 --> 00:05:26,569 score close to 0. So for example, if you have, if you have 83 00:05:26,569 --> 00:05:29,394 a database with a whole bunch of genome fragments, 84 00:05:29,394 --> 00:05:33,914 according to this, you're going to define the most similar fragments to be those 85 00:05:33,914 --> 00:05:42,161 with the smallest NW score. So, to bring the discussion back squarely 86 00:05:42,161 --> 00:05:46,965 into the land of algorithms, let me point out that this definition of genome sum, 87 00:05:46,965 --> 00:05:51,829 similarity is intrinsically algorithmic. This definition would be totally useless, 88 00:05:51,829 --> 00:05:56,333 unless there existed in efficient algorithm that given two strings and its 89 00:05:56,333 --> 00:06:00,176 penalties computes the best alignment between those two strings. 90 00:06:00,176 --> 00:06:04,079 If you couldn't compute the score, you would never use it as a measure of 91 00:06:04,079 --> 00:06:06,790 similarity. So this observation puts us under a lot 92 00:06:06,790 --> 00:06:10,686 of pressure to devise an efficient algorithm for finding the best alignment. 93 00:06:10,686 --> 00:06:13,660 So how are we going to do that? Well, we can always fall back to 94 00:06:13,660 --> 00:06:17,863 brute-force search, where we iterate over all of the conceivable alignments of the 95 00:06:17,863 --> 00:06:21,913 two strings, compute the total penalty of each of those alignments, and remember 96 00:06:21,913 --> 00:06:25,867 the best one. Clearly, correctness is not going to be 97 00:06:25,867 --> 00:06:29,594 an issue for brute-force search. It's correct essentially by definition. 98 00:06:29,594 --> 00:06:33,111 The issue is how long does it take? So let's ask a simpler question. 99 00:06:33,111 --> 00:06:36,418 Let's just think about, how many different alignments there are? 100 00:06:36,418 --> 00:06:40,303 How many possibilities do we have to try? So if [INAUDIBLE] let's imagine, I gave 101 00:06:40,303 --> 00:06:43,925 you two strings of length 500, which is a knot of a reasonable length. 102 00:06:43,925 --> 00:06:47,495 Which of the following english phrases best describes the number of 103 00:06:47,495 --> 00:06:50,487 possibilities, the number of alignments given to strings 104 00:06:50,487 --> 00:07:04,154 with 500 characters each? So I realize this is sort of a cheeky 105 00:07:04,154 --> 00:07:07,313 question, but I hope you can gather that what I was 106 00:07:07,313 --> 00:07:09,667 looking for was part D. So you know? 107 00:07:09,667 --> 00:07:12,641 So, how big are each of these quantities, anyways? 108 00:07:12,641 --> 00:07:17,349 Well, in a, in a typical version of this class, you might have about 50,000 109 00:07:17,349 --> 00:07:21,066 students enrolled or so. So that's somewhere between 10^44 and 110 00:07:21,066 --> 00:07:24,226 10^5.5. The number of people on earth is roughly 111 00:07:24,226 --> 00:07:26,952 7,000.000.000. So that's somewhere between 10^9 and 112 00:07:26,952 --> 00:07:29,824 10^10/10. The most common estimate I see for the 113 00:07:29,824 --> 00:07:32,990 number of atoms in the known universe is 10^80. 114 00:07:32,990 --> 00:07:37,431 And believe it or not, the number of possible alignments of two strings of 115 00:07:37,431 --> 00:07:41,813 length 500 is even bigger than that. So I'll leave it for you to convince 116 00:07:41,813 --> 00:07:46,555 yourself that the number of possibilities is at least two raised to the 500. 117 00:07:46,555 --> 00:07:51,417 the real number is actually noticeably bigger than that. and because 10 is at 118 00:07:51,417 --> 00:07:57,780 most 2^4, we can lower bound this number by 10^125 quite a bit bigger than the 119 00:07:57,780 --> 00:08:01,388 number of atoms in the universe. And the point of course, is just that 120 00:08:01,388 --> 00:08:05,258 it's utterly absurd to envision implementing brute-force search even at a 121 00:08:05,258 --> 00:08:08,866 scale of a few hundred characters. And you know, forgetting about these sort 122 00:08:08,866 --> 00:08:12,997 of astronomical, if you will, comparisons even if you had string lengths much 123 00:08:12,997 --> 00:08:16,971 smaller, say in the you know, a dozen or two, you'd never ever run brute-force or 124 00:08:16,971 --> 00:08:21,206 this is not going to work. And of course, notice this is not the kind of problem 125 00:08:21,206 --> 00:08:25,651 that's [INAUDIBLE] This just doesn't go away if you wait a little while for 126 00:08:25,651 --> 00:08:28,788 Moore's law to help you. This is a fundamental limitation. It 127 00:08:28,788 --> 00:08:31,231 says, you are never going to compute alignments 128 00:08:31,231 --> 00:08:35,500 of the strings that you care about, unless you have a fast, 129 00:08:35,500 --> 00:08:37,800 clever algorithm. . 130 00:08:37,800 --> 00:08:42,299 I'm happy to report that you will indeed learn such a fast and clever algorithm 131 00:08:42,299 --> 00:08:46,011 later on in this course. Even better, it's just going to be a 132 00:08:46,011 --> 00:08:50,005 straightforward instantiation of a much more general algorithm design paradigm. 133 00:08:50,005 --> 00:08:51,580 That of dynamic programming.