1 00:00:00,000 --> 00:00:06,911 .finally Finally we're going to look at suffix arrays and string processing using 2 00:00:06,911 --> 00:00:12,618 this data structure that has really played a very important role in, string 3 00:00:12,618 --> 00:00:17,127 processing applications that would not otherwise be possible. 4 00:00:17,338 --> 00:00:22,974 To get a feeling for the idea, we're going to look at a really old idea that you're 5 00:00:22,974 --> 00:00:26,920 actually familiar with called keyword in context search. 6 00:00:26,920 --> 00:00:33,510 And the idea is that you're given a text, a huge X and what you want to do is 7 00:00:33,510 --> 00:00:37,711 pre-process it to enable fast substring search. 8 00:00:37,711 --> 00:00:44,630 That is, you want a client to be able to give a query string and then you want to 9 00:00:44,630 --> 00:00:48,946 give all occurrences of that query string in context. 10 00:00:48,946 --> 00:00:54,499 So, if you look for the word, search, it will give all the occurrences of where the 11 00:00:54,499 --> 00:00:58,630 word search occurs in context. Or better thing is another one. 12 00:00:58,833 --> 00:01:03,912 And that's a, a very common operation. One, you're certainly familiar with it 13 00:01:04,115 --> 00:01:09,330 from web searching, in your browser. And there's many other applications. 14 00:01:09,560 --> 00:01:16,632 This is a pretty old idea that dates back to the late 50's, early 60's people have 15 00:01:16,632 --> 00:01:21,628 always wanted to do this. And there's an easy way to look at it 16 00:01:21,628 --> 00:01:27,740 called suffix sorting. The idea is you take your input string, 17 00:01:27,960 --> 00:01:33,533 and then, form the suffixes. remember when I talked about at the beginning with 18 00:01:33,533 --> 00:01:39,840 JAVA's string data type, you can get this done in linear time because each suffix is 19 00:01:39,840 --> 00:01:43,360 basically a pointer back into the input string. 20 00:01:43,638 --> 00:01:52,361 So the suffix, remember, the I suffix just start the character I and take the rest of 21 00:01:52,361 --> 00:01:57,187 the string. So, what this does is, it gives, sort 22 00:01:57,187 --> 00:02:05,180 keys, that, contain, kind of the, the, Pieces of the string itself in context and 23 00:02:05,180 --> 00:02:08,440 all we do is just, run a sort on that suffix. 24 00:02:08,440 --> 00:02:14,552 And what that sort does is, it brings if you do a search, it brings the things that 25 00:02:14,552 --> 00:02:20,962 you're searching for, close together. And, you can use, once you've done that 26 00:02:20,962 --> 00:02:27,088 suffix sort, you can use a binary search to find all occurrences of the string out 27 00:02:27,088 --> 00:02:32,068 of there. That's the, basic idea of a keyword and 28 00:02:32,068 --> 00:02:36,574 context searching. You suffix sort the text, then do binary 29 00:02:36,574 --> 00:02:39,945 search to find the query that you're looking for. 30 00:02:39,945 --> 00:02:44,348 And then you can, scan for wherever the binary search ends up. 31 00:02:44,554 --> 00:02:50,608 And so, this is all the places where, the word search occurs, in the text of Tale of 32 00:02:50,608 --> 00:02:55,011 Two Cities, And then, you can use this index, to, 33 00:02:55,218 --> 00:02:57,970 print out the context of whatever's needed. 34 00:02:57,970 --> 00:03:04,463 It's a fine and effective way, for solving this important practical problem. 35 00:03:04,463 --> 00:03:11,477 And that's interesting, but I want to talk about another really important problem 36 00:03:11,477 --> 00:03:17,711 that, it has, hugely important applications. This is the longest repeated 37 00:03:17,711 --> 00:03:22,412 substring problem. So you're given a string of characters, 38 00:03:22,412 --> 00:03:28,547 find the longest repeated substring. in this case, this is genomic data, and, 39 00:03:28,547 --> 00:03:35,206 there's the example of the longest, repeated substring. And now, in scientific 40 00:03:35,206 --> 00:03:42,981 data, the long repeated substring is often something that scientists are looking for. 41 00:03:42,981 --> 00:03:49,749 And so, and these strings are, are huge. it might be billions of these, characters. 42 00:03:49,749 --> 00:03:56,792 And so, it's really important, not only to know that you can find long substrings 43 00:03:56,792 --> 00:04:01,687 efficiently, but also, that you can do it for, huge, huge strings. 44 00:04:01,687 --> 00:04:07,219 Another example is, cryptanalysis. Where, a long repeat in a file that's 45 00:04:07,219 --> 00:04:12,971 supposed to be, encrypted random file indicates that somebody did something 46 00:04:12,971 --> 00:04:16,438 wrong. It might be the key to, breaking the code. 47 00:04:16,438 --> 00:04:22,264 Another example is data compression. When you've got a file that's got a lot of 48 00:04:22,264 --> 00:04:28,459 repeated stuff in it, you might, want to do this operation, to take advantage of, 49 00:04:28,681 --> 00:04:33,032 of these long repeats and not keep multiple copies of them. 50 00:04:33,253 --> 00:04:39,896 Here's another example, this for, studying or visualizing repetitions in music. 51 00:04:40,140 --> 00:04:47,787 In this example, every time there's a, a repeat of the notes then, there's, a, an 52 00:04:47,787 --> 00:04:53,294 arch drawn to, visualize the repeat. And it's, the arch is thick if, the number 53 00:04:53,294 --> 00:04:57,772 of repeats is long. And it's high if the repeats are far away. 54 00:04:57,992 --> 00:05:03,940 And this, tells, is an interesting way to analyze, repetitions, in music. 55 00:05:03,940 --> 00:05:06,903 So, how are we going to solve this problem? 56 00:05:06,903 --> 00:05:12,125 Very simple to state problem. Given a sequence of N characters, find the 57 00:05:12,125 --> 00:05:17,136 longest repeated substring. As with many problems, there's, an easy, 58 00:05:17,347 --> 00:05:22,358 brute force algorithm, that, at least gives us an idea of what, what the 59 00:05:22,358 --> 00:05:26,662 computation is like. But it's not going to be useful for huge 60 00:05:26,662 --> 00:05:30,120 strings. And that's you try all possibilities, all 61 00:05:30,120 --> 00:05:35,907 pairs of indices I and J and then just compute the longest common prefix for each 62 00:05:35,907 --> 00:05:40,014 pair. The problem with that is that if N is the 63 00:05:40,014 --> 00:05:45,110 length of the string, you've got, N squared over two pairs. 64 00:05:45,333 --> 00:05:51,595 And for every one of those pairs, you might have to, match them up to length D. 65 00:05:51,595 --> 00:05:57,335 It's definitely quadratic time, more than quadratic time in the length of the 66 00:05:57,335 --> 00:06:00,765 string. And can't be using that for, you're not 67 00:06:00,765 --> 00:06:06,580 going to be able to use that for strings that are billions of characters long. 68 00:06:06,834 --> 00:06:14,305 Another way to look at one way to solve the longest repeated substring, is to use 69 00:06:14,305 --> 00:06:18,210 a suffix sort. In fact, just we talked about before. 70 00:06:18,210 --> 00:06:22,710 So, I would take out our input string. Reform the suffixes. 71 00:06:22,968 --> 00:06:30,026 And then sort the suffixes and that brings the long repeated substrings together. 72 00:06:30,026 --> 00:06:37,087 So, that's a quite elegant and efficient solution to this problem. just build the 73 00:06:37,087 --> 00:06:41,674 suffix array that's a linear time process. Do the sort. 74 00:06:41,915 --> 00:06:48,193 We, we just saw we can do that, with n log n character, character compares. 75 00:06:48,434 --> 00:06:52,700 And then go through and find the repeated the substrings. 76 00:06:52,942 --> 00:07:01,272 So, it's just one passthrough to check for adjacent substrings to see which one's the 77 00:07:01,272 --> 00:07:05,640 longest. And this is very easy to code up. 78 00:07:05,855 --> 00:07:10,881 Here's the Java code for computing the longest repeated substring. 79 00:07:11,097 --> 00:07:15,836 We get the length of our string out. We build our suffix array. 80 00:07:15,836 --> 00:07:21,293 Remember that's linear time and space because of Java string implementation 81 00:07:21,293 --> 00:07:24,524 allows us to do substring and constant time. 82 00:07:24,739 --> 00:07:30,483 Then we go ahead and sort the suffixes, and then find the least common prefix 83 00:07:30,483 --> 00:07:34,002 between the, adjacent suffixes in sorted order. 84 00:07:34,217 --> 00:07:39,215 Just keep track of the max so that's longest, repeated substring. 85 00:07:39,418 --> 00:07:44,638 And if so, for example, this sentence about such a funny, sporty, gamy, jesty, 86 00:07:45,112 --> 00:07:50,739 joky, hoky-pokie, lad, is the longest repeated substring in the text of 87 00:07:50,942 --> 00:07:55,349 Melville's Moby Dick. And now we run that one to, prove that 88 00:07:55,349 --> 00:08:00,298 we've got a linear-rhythmic algorithm because there's a lot of characters in 89 00:08:00,298 --> 00:08:03,687 that. And you're not going to find this, without 90 00:08:03,687 --> 00:08:08,401 a good algorithm. And, and this is just a humorous approach 91 00:08:08,401 --> 00:08:14,473 to what we've, talked about today. If you have, five scientists that are 92 00:08:14,473 --> 00:08:21,759 looking for a long substring, in a genome, that, they might encounter, this problem, 93 00:08:22,001 --> 00:08:29,125 with a billion nucleotides, and there are plenty of scientists that are not aware 94 00:08:29,125 --> 00:08:35,278 of, important algorithms are the ones, like the ones that we are talking about. 95 00:08:35,520 --> 00:08:42,151 And the one that uses the good algorithm is definitely more likely to find a cure 96 00:08:42,151 --> 00:08:46,108 for cancer. But there is a flaw even in this, that 97 00:08:46,108 --> 00:08:52,650 computer scientists discovered as they got into ever more complex algorithms for 98 00:08:52,650 --> 00:08:58,434 problems like this, and that is if the, the longest repeated substring is long, 99 00:08:58,638 --> 00:09:04,098 there's a problem. So this is just some, experiments, for, 100 00:09:04,374 --> 00:09:09,534 finding the longest, repeated substring in various files. 101 00:09:09,811 --> 00:09:14,233 From just a program to, the text of Moby Dick. 102 00:09:14,510 --> 00:09:18,564 Or a chromosome with 7.1 million characters. 103 00:09:18,841 --> 00:09:26,673 And, using, the, brute force method, you get stuck, and too slow, very soon. 104 00:09:26,949 --> 00:09:34,656 But using, a suffix sort, you can get these jobs done, even for huge files like 105 00:09:34,656 --> 00:09:40,086 the first ten million digits of pi. By the way, if there was a really long 106 00:09:40,086 --> 00:09:45,661 repeated substring in the first in the digits of pi, that would be news. 107 00:09:45,878 --> 00:09:50,440 Maybe it would help us, understand more, about this number. 108 00:09:50,682 --> 00:09:54,891 So lots of people, write programs of this sort. 109 00:09:55,134 --> 00:10:01,284 But the big problem is if you have a really long repeated substring, then 110 00:10:01,284 --> 00:10:07,972 suffix sort is not going to work. Our fast algorithm, doesn't complete. 111 00:10:08,294 --> 00:10:15,369 So what's going on? In the explanation is really simple. 112 00:10:16,012 --> 00:10:21,480 If, for example, you have two copies of something. 113 00:10:21,700 --> 00:10:27,233 When you form the suffix array, what happens is that if you have the longest 114 00:10:27,233 --> 00:10:31,737 repeat, but you also have every version of that repeat appears. 115 00:10:31,737 --> 00:10:37,041 And you have to look for those when checking, for the longest repeated 116 00:10:37,331 --> 00:10:40,455 substring. So, if D is the length longest match, 117 00:10:40,455 --> 00:10:46,558 then, just to check for the longest common prefix, you got to do, at least one + two 118 00:10:46,558 --> 00:10:53,532 + three + four up to D character compares. Which means it's going to be quadratic 119 00:10:53,532 --> 00:10:59,682 just for checking for the longest common prefix of adjacent, elements in this 120 00:10:59,682 --> 00:11:03,410 algorithm. It's also a problem for the sort. 121 00:11:03,410 --> 00:11:08,050 So, if you have very long repeats, we still don't have an algorithm. 122 00:11:08,050 --> 00:11:14,028 So, the question is, that was confronting, computer scientists, in the late 80's, 123 00:11:14,028 --> 00:11:19,798 early 90's is how can we get this done? What's the best algorithm for this 124 00:11:19,798 --> 00:11:23,766 problem? And there was a great algorithm called the 125 00:11:24,233 --> 00:11:28,590 Manber-Myers algorithm, That I'll talk about in just a minute. 126 00:11:28,590 --> 00:11:32,168 That got the job done in linear rhythmic time. 127 00:11:32,168 --> 00:11:37,848 And actually, there's a clever old algorithm that uses the data structure 128 00:11:37,848 --> 00:11:42,050 called, suffix trees. But it was really the Manber-Myers 129 00:11:42,961 --> 00:11:47,444 Algorithm that, got people, excited about this area. 130 00:11:48,280 --> 00:11:54,565 And so, I want to describe that, briefly. Actually, these two computer scientists, 131 00:11:54,565 --> 00:11:59,974 one of them went on the become chief scientist of an internet company. 132 00:11:59,974 --> 00:12:06,079 The other one went on to become chief scientist of a company that won the race 133 00:12:06,079 --> 00:12:10,638 in sequencing the genome. In both cases, good algorithms for 134 00:12:10,638 --> 00:12:13,960 processing long strings are very important. 135 00:12:13,960 --> 00:12:19,524 And this is a great algorithm. Now, it's a little too complex to give all 136 00:12:19,524 --> 00:12:25,320 the details in a lecture, but I can give a pretty good idea of how it works. 137 00:12:25,572 --> 00:12:30,625 So the overview is that, it's kind of like an MSD, sort. 138 00:12:30,877 --> 00:12:37,951 But, what it manages to do is double the number of characters that it looks at in 139 00:12:37,951 --> 00:12:44,450 each passthrough of the array. And, the idea is that, you, since you're 140 00:12:44,450 --> 00:12:48,309 doubling the number of characters that you look at each time. 141 00:12:48,492 --> 00:12:53,024 It, it finishes in, log in time that's the size of the, suffix array. 142 00:12:53,208 --> 00:12:57,066 If, if you double until you get n, it's log n. 143 00:12:57,250 --> 00:13:02,516 And it turns out, what's really ingenious about the algorithm is that, you can do 144 00:13:02,516 --> 00:13:07,634 it, do each phase in linear time. So, this is an example of how it works. 145 00:13:07,634 --> 00:13:12,773 So, it's, we going to do a suffix sort. And then I'll, I'll talk about the least 146 00:13:12,773 --> 00:13:17,894 common prefix as well in a minute. So, sorting the first character, while we 147 00:13:17,894 --> 00:13:21,960 just use key, key index counting or something like that. 148 00:13:22,169 --> 00:13:28,718 So we know how to do that, in linear time. And then the idea has given that sorted on 149 00:13:28,718 --> 00:13:32,898 the first character. We can sort it, easily sort on the first 150 00:13:32,898 --> 00:13:35,755 two. Well again, we could use, key index 151 00:13:35,755 --> 00:13:38,890 counting. So now, what we can do is, double the 152 00:13:38,890 --> 00:13:42,373 number of characters that we, involve each time. 153 00:13:42,373 --> 00:13:47,808 So, the next phase of the Manber-Myers algorithm, now gets it sorted on four 154 00:13:47,808 --> 00:13:51,570 characters. And, of course, as we, the more characters 155 00:13:51,570 --> 00:13:56,904 we have sorted on at the leading part, the smaller the [INAUDIBLE] files in the 156 00:13:56,904 --> 00:14:01,556 trading, in the trailing part. So, that's, certainly a feature. 157 00:14:01,556 --> 00:14:07,589 And then, in this case, after we get to, eight characters, and all our sub files 158 00:14:07,589 --> 00:14:10,860 are of size one, and we're done with the sort. 159 00:14:11,520 --> 00:14:20,876 Now, the key to the algorithm is, the idea that, once it's going, it uses what's 160 00:14:20,876 --> 00:14:27,500 called an index sort, Which essentially means that it can do 161 00:14:27,500 --> 00:14:32,157 compares in constant time in the middle of the algorithm. 162 00:14:32,157 --> 00:14:36,488 Now, lets just take a look at, at, at how that works. 163 00:14:36,733 --> 00:14:42,616 So lets look at, trying to compare string zero with string, nine. 164 00:14:42,861 --> 00:14:49,480 And we know that we've already got the thing sorted on, four characters and we 165 00:14:49,480 --> 00:14:55,603 want to sort it on eight So, zero and nine are equal on the first four characters. 166 00:14:55,603 --> 00:14:59,759 They're in the same subfile. So, now we're going to get it sorted on 167 00:14:59,759 --> 00:15:03,111 the other. But the thing is if we add four to our 168 00:15:03,111 --> 00:15:06,463 given indices. So, we're at zero, and we add four. 169 00:15:06,463 --> 00:15:12,611 That gets us to four, and we're at nine. We add four that gets us to thirteen. we 170 00:15:12,611 --> 00:15:18,907 already know that the thing is sorted on those characters. 171 00:15:19,233 --> 00:15:25,713 And, we know that, thirteen appears before four in this sorted list. 172 00:15:25,713 --> 00:15:30,692 It's already sorted. And we can keep track of that by, using an 173 00:15:30,692 --> 00:15:37,127 inverse array which says, for every index, where it appears in the sorted order. 174 00:15:37,127 --> 00:15:43,256 So, this says that thirteen appears at position six, and four appears at position 175 00:15:43,256 --> 00:15:46,550 seven. That is, thirteen appears before four. 176 00:15:46,550 --> 00:15:52,550 So, when we're trying to compare, the, this, string here, that's the zeroth 177 00:15:52,550 --> 00:15:57,275 suffix with the ninth suffix, We can go in and again, add four to get 178 00:15:57,275 --> 00:16:03,042 four, add four to get thirteen go into the index array and see which one is less and 179 00:16:03,042 --> 00:16:06,307 the one that appears first is going to be less. 180 00:16:06,307 --> 00:16:11,101 So, that's a constant time, comparison. Thirteen is less than or equal to four 181 00:16:11,101 --> 00:16:16,312 because the inverse of thirteen is less than the inverse four, which means that 182 00:16:16,312 --> 00:16:21,037 suffixes of nine, if you take eight characters out of nine and eight 183 00:16:21,037 --> 00:16:26,640 characters out of zero, it's less. It's a really, simple and of course, 184 00:16:26,902 --> 00:16:32,565 Easy to implement operation. So, maintaining the inverse, I can get 185 00:16:32,565 --> 00:16:38,664 constant time string compare. So, what does that imply, for the whole 186 00:16:38,664 --> 00:16:41,080 suffixsort??uh Well, 187 00:16:41,286 --> 00:16:47,289 With a constant time compare, then, I can get, at a minimum, I can get an, an analog 188 00:16:47,289 --> 00:16:50,463 N sort, just by using quicksort or mergesort. 189 00:16:50,670 --> 00:16:56,258 And then, I get much faster than quadratic performance no matter what the strings 190 00:16:56,258 --> 00:16:59,501 are. That's a big key to the success of this 191 00:16:59,501 --> 00:17:03,561 algorithm. And actually, if, if you use a version of 192 00:17:03,561 --> 00:17:09,505 three-way quicksort, it's been proven that it even gets down to linear time for the 193 00:17:09,505 --> 00:17:12,001 sort, No matter what the input is. 194 00:17:12,001 --> 00:17:18,454 That's one thing, and then the other thing is when you need to do, the longest common 195 00:17:18,454 --> 00:17:24,524 prefix to look for the, longest match after the array is sorted you can also do 196 00:17:24,524 --> 00:17:28,750 this constant time, string compare and get the job done. 197 00:17:29,040 --> 00:17:37,274 So this is really an in-, ingenious algorithm that, can get, suffix sorting 198 00:17:37,274 --> 00:17:42,989 done very fast. even in the presence of, a huge repeat. 199 00:17:43,280 --> 00:17:48,448 And I think really underscores, the importance of careful algorithmic 200 00:17:48,448 --> 00:17:52,343 thinking, in addressing new computational challenges. 201 00:17:52,343 --> 00:17:58,036 So, string sorting, just to summarize, is a really, interesting area of inquiry. 202 00:17:58,036 --> 00:18:03,580 For one thing, we can develop linear time sorts for many, many applications. 203 00:18:03,781 --> 00:18:09,557 We, we thought, we're doing well when we had, n log n sorts, but actually we can do 204 00:18:09,557 --> 00:18:14,595 much better for many applications. And in fact, we can even get to sublinear 205 00:18:14,595 --> 00:18:17,886 where we don't even have to examine all the data. 206 00:18:17,886 --> 00:18:22,923 In today's world, when we have huge amounts of data, to be able to, get it 207 00:18:22,923 --> 00:18:28,028 sorted without even looking, at it at all is really quite a miracle. 208 00:18:28,230 --> 00:18:33,823 Three-way string quicksort,,you you can't do better than that in terms of the 209 00:18:33,823 --> 00:18:37,602 numbers of characters that you have to, examine. 210 00:18:37,829 --> 00:18:42,893 And it's a lot of deep mathematical analysis, behind that result. 211 00:18:43,120 --> 00:18:49,091 But I think the other lesson to learn is that algorithms like three-way string 212 00:18:49,091 --> 00:18:55,591 quicksort and Manber-Myers, show that we really have a lot to learn still in string 213 00:18:55,591 --> 00:18:58,690 processing. And they're not really random. 214 00:18:58,690 --> 00:19:04,281 In fact, a lot of times we're looking for non-randomness and we might want to use 215 00:19:04,281 --> 00:19:08,979 specialized algorithms. So, at string processes or introduction to 216 00:19:08,979 --> 00:19:13,430 string processing, We're going to look at many other string 217 00:19:13,430 --> 00:19:16,620 processing examples in the coming lectures.