In this video we'll cover a second problem to whet your appetite for things to come, namely the problem of sequence alignment. So this is a fundamental problem in computational genomics. If you take a class on the subject it's very likely to occupy the very first couple of lectures. So in this problem you're given two strings over an alphabet and no prizes for guessing which is the alphabet we're most likely to care about. Typically, these strings represent portions of one or more genomes. And just as a toy running example you can just imagine that the two strings were given are A, G, G, G, C, T and A, G, G, C, A. Know that the two input strings do not necessarily need to be of the same length. And informally speaking, the goal of this sequence alignment problem is to figure out how similar the two input strings are. Obviously, I haven't told you what I mean by two strings being similar. That's something we'll develop over the next couple of slides. Why might you want to solve this problem? Well, there's actually a lot of reasons. Let me just give you two of many examples. What will be the conjecture or the function of regions of a genome that you don't understand, lets say the human genome, from similar regions that exist in genomes that you do understand or at least understand better, say the mouse genome. If you see a string that has a known function in the well understood genome and you see something similar in the poorly understood genome, you might conjecture it has the same or similar function. A totally different reason you might want to compare the genomes of two different species, is to figure out whether one evolved directly from the other and when. A second totally different reason you might want to compare the genomes of two different species is to understand their evolutionary relationship. So for example, maybe you have three species A, B, and C, and you're wondering whether B evolved from A and then C evolved from B, or whether B and C evolved independently from a common ancestor, A. And you might then take genome similarity as a measure of proximity in the evolutionary tree. So having motivated the informal version of the problem, let's work toward making it more formal. In particular, I owe you a discussion of what I mean by two strings being similar. So to develop intuition for this, let's revisit the two strings that we introduced on the previous slide A, G, G, G, C, T, and A, G, G, C, A. Now, if we just sort of eyeball these two strings, I mean clearly they're not the same string. But, we somehow feel like they're more similar than they are different. So, where does that intuition come from? Well, one way to make it more precise is to notice that these two strings can be nicely aligned in the following sense. Lets write down the longer string, A, G, G, G, C, T. And, I'm going to write the shorter string under it, and I'll insert a gap, a space to make the two strings have the same length. I'm going to put the space where there seems to be quote unquote a missing G. And then, what sense is this a nice alignment, well, it's clearly not perfect. We don't' get a character, by character match of the two strings, but there's only two minor flaws. So on the one hand, we did have to insert a gap and we do have to suffer one mismatch in the final column. So this institution motivates defining similarity between two strings with respect to their highest quality alignment, their nicest alignment. So we're getting closer to a formal problem statement, but it's still somewhat underdetermined. Specifically, we need to make precise why we might compare, why we might prefer one alignment over another. For example, is it better to have three gaps and no mismatches or is it better to have one gap and one mismatch? So if in this video, we're effectively going to punt on this question. We're going to assume this problem's already been solved experimentally, that it's known and provided this part of the input which is more costly, gaps and various types of mismatches. So here, then, is the formal problem statement. So, in addition to the two strings over A, C, G, T, we are provided as part of the input, a non-negative number indicating the cost we incurred in alignment for each gap that we insert. Similarly, for each possible mismatch of two characters, like, for example, mismatching an A and T. We're given as part of the input a corresponding penalty. Given this input, the responsibility of a sequence alignment algorithm is to output the alignment that minimizes the sum of the penalties. Another way to think of this output, the minimum penalty allignment is, we're trying to find in affect the minimum cost explanation for how one of these strings would've turned into the other. So we can think of a gap as sort of undoing a deletion that occurred some time in the past and we can think of a mismatch as representing a mutation. So this minimum possible total penalty, that is these values of this optimal alignment is famous and fundamental enough to have its own name namely the Needleman-Wunsch score. So this quantity is named after the two authors that proposed efficient algorithm for computing of the optimal alignment. that appeared way back in 1970, in the Journal of Molecular Biology. And now, at last, we have a formal definition of what it means for two strings to be similar. It means they have a small NW score, a score close to 0. So for example, if you have, if you have a database with a whole bunch of genome fragments, according to this, you're going to define the most similar fragments to be those with the smallest NW score. So, to bring the discussion back squarely into the land of algorithms, let me point out that this definition of genome sum, similarity is intrinsically algorithmic. This definition would be totally useless, unless there existed in efficient algorithm that given two strings and its penalties computes the best alignment between those two strings. If you couldn't compute the score, you would never use it as a measure of similarity. So this observation puts us under a lot of pressure to devise an efficient algorithm for finding the best alignment. So how are we going to do that? Well, we can always fall back to brute-force search, where we iterate over all of the conceivable alignments of the two strings, compute the total penalty of each of those alignments, and remember the best one. Clearly, correctness is not going to be an issue for brute-force search. It's correct essentially by definition. The issue is how long does it take? So let's ask a simpler question. Let's just think about, how many different alignments there are? How many possibilities do we have to try? So if [INAUDIBLE] let's imagine, I gave you two strings of length 500, which is a knot of a reasonable length. Which of the following english phrases best describes the number of possibilities, the number of alignments given to strings with 500 characters each? So I realize this is sort of a cheeky question, but I hope you can gather that what I was looking for was part D. So you know? So, how big are each of these quantities, anyways? Well, in a, in a typical version of this class, you might have about 50,000 students enrolled or so. So that's somewhere between 10^44 and 10^5.5. The number of people on earth is roughly 7,000.000.000. So that's somewhere between 10^9 and 10^10/10. The most common estimate I see for the number of atoms in the known universe is 10^80. And believe it or not, the number of possible alignments of two strings of length 500 is even bigger than that. So I'll leave it for you to convince yourself that the number of possibilities is at least two raised to the 500. the real number is actually noticeably bigger than that. and because 10 is at most 2^4, we can lower bound this number by 10^125 quite a bit bigger than the number of atoms in the universe. And the point of course, is just that it's utterly absurd to envision implementing brute-force search even at a scale of a few hundred characters. And you know, forgetting about these sort of astronomical, if you will, comparisons even if you had string lengths much smaller, say in the you know, a dozen or two, you'd never ever run brute-force or this is not going to work. And of course, notice this is not the kind of problem that's [INAUDIBLE] This just doesn't go away if you wait a little while for Moore's law to help you. This is a fundamental limitation. It says, you are never going to compute alignments of the strings that you care about, unless you have a fast, clever algorithm. . I'm happy to report that you will indeed learn such a fast and clever algorithm later on in this course. Even better, it's just going to be a straightforward instantiation of a much more general algorithm design paradigm. That of dynamic programming.