1
00:00:00,000 --> 00:00:06,911
.finally Finally we're going to look at
suffix arrays and string processing using

2
00:00:06,911 --> 00:00:12,618
this data structure that has really played
a very important role in, string

3
00:00:12,618 --> 00:00:17,127
processing applications that would not
otherwise be possible.

4
00:00:17,338 --> 00:00:22,974
To get a feeling for the idea, we're going
to look at a really old idea that you're

5
00:00:22,974 --> 00:00:26,920
actually familiar with called keyword in
context search.

6
00:00:26,920 --> 00:00:33,510
And the idea is that you're given a text,
a huge X and what you want to do is

7
00:00:33,510 --> 00:00:37,711
pre-process it to enable fast substring
search.

8
00:00:37,711 --> 00:00:44,630
That is, you want a client to be able to
give a query string and then you want to

9
00:00:44,630 --> 00:00:48,946
give all occurrences of that query string
in context.

10
00:00:48,946 --> 00:00:54,499
So, if you look for the word, search, it
will give all the occurrences of where the

11
00:00:54,499 --> 00:00:58,630
word search occurs in context.
Or better thing is another one.

12
00:00:58,833 --> 00:01:03,912
And that's a, a very common operation.
One, you're certainly familiar with it

13
00:01:04,115 --> 00:01:09,330
from web searching, in your browser.
And there's many other applications.

14
00:01:09,560 --> 00:01:16,632
This is a pretty old idea that dates back
to the late 50's, early 60's people have

15
00:01:16,632 --> 00:01:21,628
always wanted to do this.
And there's an easy way to look at it

16
00:01:21,628 --> 00:01:27,740
called suffix sorting.
The idea is you take your input string,

17
00:01:27,960 --> 00:01:33,533
and then, form the suffixes. remember when
I talked about at the beginning with

18
00:01:33,533 --> 00:01:39,840
JAVA's string data type, you can get this
done in linear time because each suffix is

19
00:01:39,840 --> 00:01:43,360
basically a pointer back into the input
string.

20
00:01:43,638 --> 00:01:52,361
So the suffix, remember, the I suffix just
start the character I and take the rest of

21
00:01:52,361 --> 00:01:57,187
the string.
So, what this does is, it gives, sort

22
00:01:57,187 --> 00:02:05,180
keys, that, contain, kind of the, the,
Pieces of the string itself in context and

23
00:02:05,180 --> 00:02:08,440
all we do is just, run a sort on that
suffix.

24
00:02:08,440 --> 00:02:14,552
And what that sort does is, it brings if
you do a search, it brings the things that

25
00:02:14,552 --> 00:02:20,962
you're searching for, close together.
And, you can use, once you've done that

26
00:02:20,962 --> 00:02:27,088
suffix sort, you can use a binary search
to find all occurrences of the string out

27
00:02:27,088 --> 00:02:32,068
of there.
That's the, basic idea of a keyword and

28
00:02:32,068 --> 00:02:36,574
context searching.
You suffix sort the text, then do binary

29
00:02:36,574 --> 00:02:39,945
search to find the query that you're
looking for.

30
00:02:39,945 --> 00:02:44,348
And then you can, scan for wherever the
binary search ends up.

31
00:02:44,554 --> 00:02:50,608
And so, this is all the places where, the
word search occurs, in the text of Tale of

32
00:02:50,608 --> 00:02:55,011
Two Cities,
And then, you can use this index, to,

33
00:02:55,218 --> 00:02:57,970
print out the context of whatever's
needed.

34
00:02:57,970 --> 00:03:04,463
It's a fine and effective way, for solving
this important practical problem.

35
00:03:04,463 --> 00:03:11,477
And that's interesting, but I want to talk
about another really important problem

36
00:03:11,477 --> 00:03:17,711
that, it has, hugely important
applications. This is the longest repeated

37
00:03:17,711 --> 00:03:22,412
substring problem.
So you're given a string of characters,

38
00:03:22,412 --> 00:03:28,547
find the longest repeated substring. in
this case, this is genomic data, and,

39
00:03:28,547 --> 00:03:35,206
there's the example of the longest,
repeated substring. And now, in scientific

40
00:03:35,206 --> 00:03:42,981
data, the long repeated substring is often
something that scientists are looking for.

41
00:03:42,981 --> 00:03:49,749
And so, and these strings are, are huge.
it might be billions of these, characters.

42
00:03:49,749 --> 00:03:56,792
And so, it's really important, not only to
know that you can find long substrings

43
00:03:56,792 --> 00:04:01,687
efficiently, but also, that you can do it
for, huge, huge strings.

44
00:04:01,687 --> 00:04:07,219
Another example is, cryptanalysis.
Where, a long repeat in a file that's

45
00:04:07,219 --> 00:04:12,971
supposed to be, encrypted random file
indicates that somebody did something

46
00:04:12,971 --> 00:04:16,438
wrong.
It might be the key to, breaking the code.

47
00:04:16,438 --> 00:04:22,264
Another example is data compression.
When you've got a file that's got a lot of

48
00:04:22,264 --> 00:04:28,459
repeated stuff in it, you might, want to
do this operation, to take advantage of,

49
00:04:28,681 --> 00:04:33,032
of these long repeats and not keep
multiple copies of them.

50
00:04:33,253 --> 00:04:39,896
Here's another example, this for, studying
or visualizing repetitions in music.

51
00:04:40,140 --> 00:04:47,787
In this example, every time there's a, a
repeat of the notes then, there's, a, an

52
00:04:47,787 --> 00:04:53,294
arch drawn to, visualize the repeat.
And it's, the arch is thick if, the number

53
00:04:53,294 --> 00:04:57,772
of repeats is long.
And it's high if the repeats are far away.

54
00:04:57,992 --> 00:05:03,940
And this, tells, is an interesting way to
analyze, repetitions, in music.

55
00:05:03,940 --> 00:05:06,903
So, how are we going to solve this
problem?

56
00:05:06,903 --> 00:05:12,125
Very simple to state problem.
Given a sequence of N characters, find the

57
00:05:12,125 --> 00:05:17,136
longest repeated substring.
As with many problems, there's, an easy,

58
00:05:17,347 --> 00:05:22,358
brute force algorithm, that, at least
gives us an idea of what, what the

59
00:05:22,358 --> 00:05:26,662
computation is like.
But it's not going to be useful for huge

60
00:05:26,662 --> 00:05:30,120
strings.
And that's you try all possibilities, all

61
00:05:30,120 --> 00:05:35,907
pairs of indices I and J and then just
compute the longest common prefix for each

62
00:05:35,907 --> 00:05:40,014
pair.
The problem with that is that if N is the

63
00:05:40,014 --> 00:05:45,110
length of the string, you've got, N
squared over two pairs.

64
00:05:45,333 --> 00:05:51,595
And for every one of those pairs, you
might have to, match them up to length D.

65
00:05:51,595 --> 00:05:57,335
It's definitely quadratic time, more than
quadratic time in the length of the

66
00:05:57,335 --> 00:06:00,765
string.
And can't be using that for, you're not

67
00:06:00,765 --> 00:06:06,580
going to be able to use that for strings
that are billions of characters long.

68
00:06:06,834 --> 00:06:14,305
Another way to look at one way to solve
the longest repeated substring, is to use

69
00:06:14,305 --> 00:06:18,210
a suffix sort. In fact, just we talked
about before.

70
00:06:18,210 --> 00:06:22,710
So, I would take out our input string.
Reform the suffixes.

71
00:06:22,968 --> 00:06:30,026
And then sort the suffixes and that brings
the long repeated substrings together.

72
00:06:30,026 --> 00:06:37,087
So, that's a quite elegant and efficient
solution to this problem. just build the

73
00:06:37,087 --> 00:06:41,674
suffix array that's a linear time process.
Do the sort.

74
00:06:41,915 --> 00:06:48,193
We, we just saw we can do that, with n log
n character, character compares.

75
00:06:48,434 --> 00:06:52,700
And then go through and find the repeated
the substrings.

76
00:06:52,942 --> 00:07:01,272
So, it's just one passthrough to check for
adjacent substrings to see which one's the

77
00:07:01,272 --> 00:07:05,640
longest.
And this is very easy to code up.

78
00:07:05,855 --> 00:07:10,881
Here's the Java code for computing the
longest repeated substring.

79
00:07:11,097 --> 00:07:15,836
We get the length of our string out.
We build our suffix array.

80
00:07:15,836 --> 00:07:21,293
Remember that's linear time and space
because of Java string implementation

81
00:07:21,293 --> 00:07:24,524
allows us to do substring and constant
time.

82
00:07:24,739 --> 00:07:30,483
Then we go ahead and sort the suffixes,
and then find the least common prefix

83
00:07:30,483 --> 00:07:34,002
between the, adjacent suffixes in sorted
order.

84
00:07:34,217 --> 00:07:39,215
Just keep track of the max so that's
longest, repeated substring.

85
00:07:39,418 --> 00:07:44,638
And if so, for example, this sentence
about such a funny, sporty, gamy, jesty,

86
00:07:45,112 --> 00:07:50,739
joky, hoky-pokie, lad, is the longest
repeated substring in the text of

87
00:07:50,942 --> 00:07:55,349
Melville's Moby Dick.
And now we run that one to, prove that

88
00:07:55,349 --> 00:08:00,298
we've got a linear-rhythmic algorithm
because there's a lot of characters in

89
00:08:00,298 --> 00:08:03,687
that.
And you're not going to find this, without

90
00:08:03,687 --> 00:08:08,401
a good algorithm.
And, and this is just a humorous approach

91
00:08:08,401 --> 00:08:14,473
to what we've, talked about today.
If you have, five scientists that are

92
00:08:14,473 --> 00:08:21,759
looking for a long substring, in a genome,
that, they might encounter, this problem,

93
00:08:22,001 --> 00:08:29,125
with a billion nucleotides, and there are
plenty of scientists that are not aware

94
00:08:29,125 --> 00:08:35,278
of, important algorithms are the ones,
like the ones that we are talking about.

95
00:08:35,520 --> 00:08:42,151
And the one that uses the good algorithm
is definitely more likely to find a cure

96
00:08:42,151 --> 00:08:46,108
for cancer.
But there is a flaw even in this, that

97
00:08:46,108 --> 00:08:52,650
computer scientists discovered as they got
into ever more complex algorithms for

98
00:08:52,650 --> 00:08:58,434
problems like this, and that is if the,
the longest repeated substring is long,

99
00:08:58,638 --> 00:09:04,098
there's a problem.
So this is just some, experiments, for,

100
00:09:04,374 --> 00:09:09,534
finding the longest, repeated substring in
various files.

101
00:09:09,811 --> 00:09:14,233
From just a program to, the text of Moby
Dick.

102
00:09:14,510 --> 00:09:18,564
Or a chromosome with 7.1 million
characters.

103
00:09:18,841 --> 00:09:26,673
And, using, the, brute force method, you
get stuck, and too slow, very soon.

104
00:09:26,949 --> 00:09:34,656
But using, a suffix sort, you can get
these jobs done, even for huge files like

105
00:09:34,656 --> 00:09:40,086
the first ten million digits of pi.
By the way, if there was a really long

106
00:09:40,086 --> 00:09:45,661
repeated substring in the first in the
digits of pi, that would be news.

107
00:09:45,878 --> 00:09:50,440
Maybe it would help us, understand more,
about this number.

108
00:09:50,682 --> 00:09:54,891
So lots of people, write programs of this
sort.

109
00:09:55,134 --> 00:10:01,284
But the big problem is if you have a
really long repeated substring, then

110
00:10:01,284 --> 00:10:07,972
suffix sort is not going to work.
Our fast algorithm, doesn't complete.

111
00:10:08,294 --> 00:10:15,369
So what's going on?
In the explanation is really simple.

112
00:10:16,012 --> 00:10:21,480
If, for example, you have two copies of
something.

113
00:10:21,700 --> 00:10:27,233
When you form the suffix array, what
happens is that if you have the longest

114
00:10:27,233 --> 00:10:31,737
repeat, but you also have every version of
that repeat appears.

115
00:10:31,737 --> 00:10:37,041
And you have to look for those when
checking, for the longest repeated

116
00:10:37,331 --> 00:10:40,455
substring.
So, if D is the length longest match,

117
00:10:40,455 --> 00:10:46,558
then, just to check for the longest common
prefix, you got to do, at least one + two

118
00:10:46,558 --> 00:10:53,532
+ three + four up to D character compares.
Which means it's going to be quadratic

119
00:10:53,532 --> 00:10:59,682
just for checking for the longest common
prefix of adjacent, elements in this

120
00:10:59,682 --> 00:11:03,410
algorithm.
It's also a problem for the sort.

121
00:11:03,410 --> 00:11:08,050
So, if you have very long repeats, we
still don't have an algorithm.

122
00:11:08,050 --> 00:11:14,028
So, the question is, that was confronting,
computer scientists, in the late 80's,

123
00:11:14,028 --> 00:11:19,798
early 90's is how can we get this done?
What's the best algorithm for this

124
00:11:19,798 --> 00:11:23,766
problem?
And there was a great algorithm called the

125
00:11:24,233 --> 00:11:28,590
Manber-Myers algorithm,
That I'll talk about in just a minute.

126
00:11:28,590 --> 00:11:32,168
That got the job done in linear rhythmic
time.

127
00:11:32,168 --> 00:11:37,848
And actually, there's a clever old
algorithm that uses the data structure

128
00:11:37,848 --> 00:11:42,050
called, suffix trees.
But it was really the Manber-Myers

129
00:11:42,961 --> 00:11:47,444
Algorithm that, got people, excited about
this area.

130
00:11:48,280 --> 00:11:54,565
And so, I want to describe that, briefly.
Actually, these two computer scientists,

131
00:11:54,565 --> 00:11:59,974
one of them went on the become chief
scientist of an internet company.

132
00:11:59,974 --> 00:12:06,079
The other one went on to become chief
scientist of a company that won the race

133
00:12:06,079 --> 00:12:10,638
in sequencing the genome.
In both cases, good algorithms for

134
00:12:10,638 --> 00:12:13,960
processing long strings are very
important.

135
00:12:13,960 --> 00:12:19,524
And this is a great algorithm.
Now, it's a little too complex to give all

136
00:12:19,524 --> 00:12:25,320
the details in a lecture, but I can give a
pretty good idea of how it works.

137
00:12:25,572 --> 00:12:30,625
So the overview is that, it's kind of like
an MSD, sort.

138
00:12:30,877 --> 00:12:37,951
But, what it manages to do is double the
number of characters that it looks at in

139
00:12:37,951 --> 00:12:44,450
each passthrough of the array.
And, the idea is that, you, since you're

140
00:12:44,450 --> 00:12:48,309
doubling the number of characters that you
look at each time.

141
00:12:48,492 --> 00:12:53,024
It, it finishes in, log in time that's the
size of the, suffix array.

142
00:12:53,208 --> 00:12:57,066
If, if you double until you get n, it's
log n.

143
00:12:57,250 --> 00:13:02,516
And it turns out, what's really ingenious
about the algorithm is that, you can do

144
00:13:02,516 --> 00:13:07,634
it, do each phase in linear time.
So, this is an example of how it works.

145
00:13:07,634 --> 00:13:12,773
So, it's, we going to do a suffix sort.
And then I'll, I'll talk about the least

146
00:13:12,773 --> 00:13:17,894
common prefix as well in a minute.
So, sorting the first character, while we

147
00:13:17,894 --> 00:13:21,960
just use key, key index counting or
something like that.

148
00:13:22,169 --> 00:13:28,718
So we know how to do that, in linear time.
And then the idea has given that sorted on

149
00:13:28,718 --> 00:13:32,898
the first character.
We can sort it, easily sort on the first

150
00:13:32,898 --> 00:13:35,755
two.
Well again, we could use, key index

151
00:13:35,755 --> 00:13:38,890
counting.
So now, what we can do is, double the

152
00:13:38,890 --> 00:13:42,373
number of characters that we, involve each
time.

153
00:13:42,373 --> 00:13:47,808
So, the next phase of the Manber-Myers
algorithm, now gets it sorted on four

154
00:13:47,808 --> 00:13:51,570
characters.
And, of course, as we, the more characters

155
00:13:51,570 --> 00:13:56,904
we have sorted on at the leading part, the
smaller the [INAUDIBLE] files in the

156
00:13:56,904 --> 00:14:01,556
trading, in the trailing part. So, that's,
certainly a feature.

157
00:14:01,556 --> 00:14:07,589
And then, in this case, after we get to,
eight characters, and all our sub files

158
00:14:07,589 --> 00:14:10,860
are of size one, and we're done with the
sort.

159
00:14:11,520 --> 00:14:20,876
Now, the key to the algorithm is, the idea
that, once it's going, it uses what's

160
00:14:20,876 --> 00:14:27,500
called an index sort,
Which essentially means that it can do

161
00:14:27,500 --> 00:14:32,157
compares in constant time in the middle of
the algorithm.

162
00:14:32,157 --> 00:14:36,488
Now, lets just take a look at, at, at how
that works.

163
00:14:36,733 --> 00:14:42,616
So lets look at, trying to compare string
zero with string, nine.

164
00:14:42,861 --> 00:14:49,480
And we know that we've already got the
thing sorted on, four characters and we

165
00:14:49,480 --> 00:14:55,603
want to sort it on eight So, zero and nine
are equal on the first four characters.

166
00:14:55,603 --> 00:14:59,759
They're in the same subfile.
So, now we're going to get it sorted on

167
00:14:59,759 --> 00:15:03,111
the other.
But the thing is if we add four to our

168
00:15:03,111 --> 00:15:06,463
given indices. So, we're at zero, and we
add four.

169
00:15:06,463 --> 00:15:12,611
That gets us to four, and we're at nine.
We add four that gets us to thirteen. we

170
00:15:12,611 --> 00:15:18,907
already know that the thing is sorted on
those characters.

171
00:15:19,233 --> 00:15:25,713
And, we know that, thirteen appears before
four in this sorted list.

172
00:15:25,713 --> 00:15:30,692
It's already sorted.
And we can keep track of that by, using an

173
00:15:30,692 --> 00:15:37,127
inverse array which says, for every index,
where it appears in the sorted order.

174
00:15:37,127 --> 00:15:43,256
So, this says that thirteen appears at
position six, and four appears at position

175
00:15:43,256 --> 00:15:46,550
seven.
That is, thirteen appears before four.

176
00:15:46,550 --> 00:15:52,550
So, when we're trying to compare, the,
this, string here, that's the zeroth

177
00:15:52,550 --> 00:15:57,275
suffix with the ninth suffix,
We can go in and again, add four to get

178
00:15:57,275 --> 00:16:03,042
four, add four to get thirteen go into the
index array and see which one is less and

179
00:16:03,042 --> 00:16:06,307
the one that appears first is going to be
less.

180
00:16:06,307 --> 00:16:11,101
So, that's a constant time, comparison.
Thirteen is less than or equal to four

181
00:16:11,101 --> 00:16:16,312
because the inverse of thirteen is less
than the inverse four, which means that

182
00:16:16,312 --> 00:16:21,037
suffixes of nine, if you take eight
characters out of nine and eight

183
00:16:21,037 --> 00:16:26,640
characters out of zero, it's less.
It's a really, simple and of course,

184
00:16:26,902 --> 00:16:32,565
Easy to implement operation.
So, maintaining the inverse, I can get

185
00:16:32,565 --> 00:16:38,664
constant time string compare.
So, what does that imply, for the whole

186
00:16:38,664 --> 00:16:41,080
suffixsort??uh
Well,

187
00:16:41,286 --> 00:16:47,289
With a constant time compare, then, I can
get, at a minimum, I can get an, an analog

188
00:16:47,289 --> 00:16:50,463
N sort, just by using quicksort or
mergesort.

189
00:16:50,670 --> 00:16:56,258
And then, I get much faster than quadratic
performance no matter what the strings

190
00:16:56,258 --> 00:16:59,501
are.
That's a big key to the success of this

191
00:16:59,501 --> 00:17:03,561
algorithm.
And actually, if, if you use a version of

192
00:17:03,561 --> 00:17:09,505
three-way quicksort, it's been proven that
it even gets down to linear time for the

193
00:17:09,505 --> 00:17:12,001
sort,
No matter what the input is.

194
00:17:12,001 --> 00:17:18,454
That's one thing, and then the other thing
is when you need to do, the longest common

195
00:17:18,454 --> 00:17:24,524
prefix to look for the, longest match
after the array is sorted you can also do

196
00:17:24,524 --> 00:17:28,750
this constant time, string compare and get
the job done.

197
00:17:29,040 --> 00:17:37,274
So this is really an in-, ingenious
algorithm that, can get, suffix sorting

198
00:17:37,274 --> 00:17:42,989
done very fast. even in the presence of, a
huge repeat.

199
00:17:43,280 --> 00:17:48,448
And I think really underscores, the
importance of careful algorithmic

200
00:17:48,448 --> 00:17:52,343
thinking, in addressing new computational
challenges.

201
00:17:52,343 --> 00:17:58,036
So, string sorting, just to summarize, is
a really, interesting area of inquiry.

202
00:17:58,036 --> 00:18:03,580
For one thing, we can develop linear time
sorts for many, many applications.

203
00:18:03,781 --> 00:18:09,557
We, we thought, we're doing well when we
had, n log n sorts, but actually we can do

204
00:18:09,557 --> 00:18:14,595
much better for many applications.
And in fact, we can even get to sublinear

205
00:18:14,595 --> 00:18:17,886
where we don't even have to examine all
the data.

206
00:18:17,886 --> 00:18:22,923
In today's world, when we have huge
amounts of data, to be able to, get it

207
00:18:22,923 --> 00:18:28,028
sorted without even looking, at it at all
is really quite a miracle.

208
00:18:28,230 --> 00:18:33,823
Three-way string quicksort,,you you can't
do better than that in terms of the

209
00:18:33,823 --> 00:18:37,602
numbers of characters that you have to,
examine.

210
00:18:37,829 --> 00:18:42,893
And it's a lot of deep mathematical
analysis, behind that result.

211
00:18:43,120 --> 00:18:49,091
But I think the other lesson to learn is
that algorithms like three-way string

212
00:18:49,091 --> 00:18:55,591
quicksort and Manber-Myers, show that we
really have a lot to learn still in string

213
00:18:55,591 --> 00:18:58,690
processing.
And they're not really random.

214
00:18:58,690 --> 00:19:04,281
In fact, a lot of times we're looking for
non-randomness and we might want to use

215
00:19:04,281 --> 00:19:08,979
specialized algorithms.
So, at string processes or introduction to

216
00:19:08,979 --> 00:19:13,430
string processing,
We're going to look at many other string

217
00:19:13,430 --> 00:19:16,620
processing examples in the coming
lectures.