1
00:00:04,186 --> 00:00:08,249
Today we're gonna talk about String
Processing and we're gonna start by

2
00:00:08,249 --> 00:00:13,030
talking about String Sorting.
We'll take a look at some classic methods

3
00:00:13,216 --> 00:00:17,562
but first we need to talk a little bit
about just what strings are.

4
00:00:17,748 --> 00:00:22,783
And actually, that's really It's dependent
on the programming language that you're

5
00:00:22,783 --> 00:00:25,311
using.
It's different programming languages

6
00:00:25,311 --> 00:00:29,903
nowadays really have completely different
implementations of strings.

7
00:00:29,903 --> 00:00:35,160
So to get started, we need to take a look
at efficient implementations of basic

8
00:00:35,160 --> 00:00:39,286
operations on strings.
We're going to tailor our algorithms to

9
00:00:39,286 --> 00:00:43,878
particular Java implementation,
And that can be made to work in other

10
00:00:43,878 --> 00:00:47,139
situations,
But certainly, it's starting point, you

11
00:00:47,139 --> 00:00:50,033
have to be specific.
So what is a string?

12
00:00:50,033 --> 00:00:55,426
A string is just a, sequence of
characters, and it's actually a very

13
00:00:55,426 --> 00:01:01,567
important fundamental attraction that's
been with us from the beginning of the

14
00:01:01,366 --> 00:01:06,368
information processing.
So pretty much everything we communicate

15
00:01:06,368 --> 00:01:09,964
is a string, email or our programs are
strings.

16
00:01:09,964 --> 00:01:15,904
But even, another important area of
significance that has arisen recently is

17
00:01:15,904 --> 00:01:21,844
in Computational Biology, where our
understanding of the way that life works

18
00:01:21,844 --> 00:01:26,692
depends on genomic sequences, and is
essentially based on string processing.

19
00:01:26,692 --> 00:01:32,363
We'll see examples of that later on.
This is a quote that's talks about that

20
00:01:32,363 --> 00:01:35,507
issue.
The" digital information that underlies

21
00:01:35,507 --> 00:01:41,306
biochemistry, cell biology and development
can be represented by a simple string of

22
00:01:41,306 --> 00:01:45,637
G's, A's, T's, and C's.
This string is the root data structure of

23
00:01:45,637 --> 00:01:49,690
an organisms biology.
So we're not talking just about data

24
00:01:49,690 --> 00:01:54,860
structures for information processing but
for life models of life itself.

25
00:01:54,860 --> 00:02:01,211
Now back to computers.
The Strings are made up of characters.

26
00:02:01,211 --> 00:02:07,124
What's a character?
Well the kind of classical representation

27
00:02:07,124 --> 00:02:11,890
of what a character is, is so called 7-bit
ASCII Code.

28
00:02:12,134 --> 00:02:19,074
Where you have actually the underlying
data type is 8-bit so you coud have up to

29
00:02:19,074 --> 00:02:23,891
256 characters.
But for many, many years programmers used

30
00:02:23,891 --> 00:02:30,831
only 128 of those characters that would
include all the upper case and lower case

31
00:02:30,831 --> 00:02:37,934
letters and numbers and some punctuation.
So that's, 7-bit ASCII that's the standard

32
00:02:37,934 --> 00:02:42,670
for the C programming language, that's a
very widely used language.

33
00:02:42,920 --> 00:02:46,670
Nowadays, people use what's called
Unicode.

34
00:02:46,670 --> 00:02:54,303
Where a character is a 16-bit integer.
And that's to allow for many, many more

35
00:02:54,303 --> 00:02:58,914
characters,
Two to the sixteenth, 65,536 instead of

36
00:02:58,914 --> 00:03:03,176
only 256.
And that allows for encoding characters

37
00:03:03,176 --> 00:03:10,310
from, many different of the world's lang-,
languages, and mathematical characters.

38
00:03:10,310 --> 00:03:18,133
So it's a much more, general and generous,
representation of what is a character.

39
00:03:18,409 --> 00:03:25,404
But it's important to, be specific.
So, in Java, the standard is that a char

40
00:03:25,404 --> 00:03:30,190
is a sixteen bit unsigned integer.
Now, not all.

41
00:03:30,190 --> 00:03:35,039
Programming systems and applications have
moved up to the Unicode standard so

42
00:03:35,039 --> 00:03:40,323
sometimes you'll find programs looking for
Unicode encoding and not finding it,

43
00:03:40,323 --> 00:03:45,669
And this t-shirt is a joke that has
something to do that to do with that.

44
00:03:45,856 --> 00:03:50,891
It's supposed to be a heart which is a
valid Unicode character in some worlds,

45
00:03:50,891 --> 00:03:55,443
but not on that t-shirt.
So we'll come back to what is a character.

46
00:03:55,443 --> 00:04:00,487
We're gonna use a simpler version in
between those two, or characters in the

47
00:04:00,487 --> 00:04:03,936
8-bit integer.
Now let's talk about what a string is in

48
00:04:03,936 --> 00:04:06,554
Java.
There's a built-in string data type.

49
00:04:06,554 --> 00:04:11,726
It's not quite built-in, but many of the
features for processing string are built

50
00:04:11,726 --> 00:04:15,557
into the Java language.
So it's okay to think of it as being

51
00:04:15,557 --> 00:04:19,069
built-in.
And so in Java, a string is a sequence of

52
00:04:19,069 --> 00:04:23,858
characters and it's immutable.
Once you create a string, you can't change

53
00:04:23,858 --> 00:04:27,190
it.
And, the primary operations that you can

54
00:04:27,190 --> 00:04:32,608
perform efficiently with a string are to,
number one find it's length,

55
00:04:32,608 --> 00:04:39,204
So you can get the length of the string.
And that's just the number of characters

56
00:04:39,204 --> 00:04:43,130
in it.
You can index into a string and get the IT

57
00:04:43,130 --> 00:04:48,792
character, that's the charAt method.
And, you can eps, extract a substring of a

58
00:04:48,792 --> 00:04:54,641
string, to create a new string, that's a
continuous subsequence of the characters

59
00:04:54,641 --> 00:04:58,790
in the string.
And. one of the, big, features of Java's

60
00:04:58,790 --> 00:05:03,687
implementation is that you can get that
operation done in constant time,

61
00:05:03,687 --> 00:05:09,128
And then you can also do what's called
concatenation, and that's add a character

62
00:05:09,128 --> 00:05:14,229
to the end of another string,
That one can't be done in constant time in

63
00:05:14,229 --> 00:05:20,125
the standard Java string data type.
So, this is what the implementation looks

64
00:05:20,125 --> 00:05:26,034
like for the string datat type in Java.
The private instance variables are an

65
00:05:26,034 --> 00:05:30,921
array of characters,
An offset that's an index into the first

66
00:05:30,921 --> 00:05:37,550
character of a string in the array the
length and also to make it more efficient

67
00:05:37,550 --> 00:05:44,565
to search using string keys java keeps a
private variable which is the hashcode for

68
00:05:44,565 --> 00:05:48,265
that string.
So, once the string is built and the

69
00:05:48,265 --> 00:05:55,357
hashcode computed, then when it's time to
use the hashcode and the hashing algorithm

70
00:05:55,588 --> 00:06:01,793
it's immediately available.
So, the length method simply returns that

71
00:06:01,793 --> 00:06:07,057
length.
To get the IT character of the string, we

72
00:06:07,057 --> 00:06:10,829
add I to the offset, and get that
character.

73
00:06:11,093 --> 00:06:18,180
And, to, create a string given an offset
length and a character array, we just

74
00:06:18,180 --> 00:06:22,942
reset those, values.
And then the key thing is the substring,

75
00:06:23,161 --> 00:06:27,044
method.
Since all it involves is, a pointer into

76
00:06:27,044 --> 00:06:31,878
the immutable string.
And, a length, the index of the first

77
00:06:31,878 --> 00:06:37,372
character we can build a string in,
constant time just by copying the

78
00:06:37,372 --> 00:06:43,166
reference to the character array.
So that implementation we give good

79
00:06:43,166 --> 00:06:47,810
feeling of why, you know, substring method
is constant in string.

80
00:06:48,024 --> 00:06:52,538
So this is the performance.
It's a sequence of characters, is

81
00:06:52,538 --> 00:06:55,905
immutable.
And the underlying implementation is

82
00:06:55,905 --> 00:07:00,920
immutable instance variables that give the
array, offset and length.

83
00:07:01,102 --> 00:07:06,704
And so it means that we can get length out
of constant time, charAt in constant time

84
00:07:06,704 --> 00:07:11,453
just by adding in the offset indexing.
Substring in constant time just by

85
00:07:11,453 --> 00:07:14,498
essentially cons, copying those instance
variables.

86
00:07:14,498 --> 00:07:19,490
But to concatenate, to make a new string
that results from adding one character to

87
00:07:19,490 --> 00:07:24,118
a string, we have to create a whole new
string; and make a copy of it, because the

88
00:07:24,118 --> 00:07:28,745
string itself is immutable So it takes
time proportional to the number of

89
00:07:28,745 --> 00:07:32,460
characters in the string, and it involves
making a new string.

90
00:07:33,772 --> 00:07:38,200
You can imagine string implementations,
And they exist in various programming

91
00:07:38,200 --> 00:07:41,765
languages, where, these performance
guarantees are different.

92
00:07:41,765 --> 00:07:46,077
And actually, Java has different
implementations for applications where you

93
00:07:46,077 --> 00:07:48,780
might want different performance,
guarantees.

94
00:07:49,012 --> 00:07:55,913
At the if you work out the memory usage
for a string of length and it's 40 + 2N

95
00:07:55,913 --> 00:07:59,868
bytes.
You might consider using a char array, but

96
00:07:59,868 --> 00:08:06,614
then you, you lose a lot of convenience.
So, for being able to produce substring

97
00:08:06,847 --> 00:08:12,120
and sublink, and also the language
features that supports strings.

98
00:08:12,120 --> 00:08:18,710
So here's an implementation of a different
implementation of sequence of characters

99
00:08:18,938 --> 00:08:23,938
in Java that is mutable.
So it's, the idea is that you can use this

100
00:08:23,938 --> 00:08:28,559
data type when you're building up a string
a piece at a time.

101
00:08:28,559 --> 00:08:33,256
Like maybe reading characters off standard
input or something.

102
00:08:33,256 --> 00:08:39,090
The underlying implementation in this case
is a resizing array of characters.

103
00:08:39,090 --> 00:08:43,311
So when it fills up and doubles, as we've
done, many times before.

104
00:08:43,492 --> 00:08:46,870
And it keeps the, the length that's an
instance variable.

105
00:08:46,870 --> 00:08:50,670
So with string builder, you can get the
length in constant time.

106
00:08:50,670 --> 00:08:55,736
You can get characters in constant time
just by doubling, And you can concatenate,

107
00:08:55,736 --> 00:08:58,571
add a new character in amortized constant
time.

108
00:08:58,571 --> 00:09:02,853
Most of the time, it's constant.
Every once in a while, you might have to

109
00:09:02,853 --> 00:09:05,989
double.
But you pay for that double by, the number

110
00:09:05,989 --> 00:09:11,481
of operations, that you did.
The thing you lose though, is that it

111
00:09:11,481 --> 00:09:17,411
takes linear time to extract a sub-string,
because to extract a sub-string, you have

112
00:09:17,411 --> 00:09:22,770
to make a new char array that can be
re-sizing and so forth, and can be

113
00:09:22,770 --> 00:09:28,745
immutable to concat So that's two
different implementations of sequence of

114
00:09:28,745 --> 00:09:33,673
characters in Java, where these two
different, importantly different

115
00:09:33,673 --> 00:09:38,380
performance characteristics, so we have to
keep in mind be mindful of that in

116
00:09:38,380 --> 00:09:43,749
applications, and again in other
programming languages, something like the

117
00:09:43,749 --> 00:09:49,266
StringBuilder is more like the standard,
and you just have to know what the

118
00:09:49,266 --> 00:09:53,742
implementation is.
There's another one called StringBuffer as

119
00:09:53,742 --> 00:10:01,345
well in Java that we will skip for now.
So here's a typical example that might

120
00:10:01,345 --> 00:10:08,478
have a simple computation like how do we
efficiently reverse a string.

121
00:10:08,783 --> 00:10:19,210
So you could use a string or you could use
a StringBuilder with string you get to, a

122
00:10:19,210 --> 00:10:26,965
declare it almost like a built-in type and
simply initialize with the null string,

123
00:10:27,229 --> 00:10:34,191
And then to rev-, rev-, compute the
reverse string, we go backwards through

124
00:10:34,191 --> 00:10:43,105
the original string and concatenate the,
characters starting at the back to create

125
00:10:43,105 --> 00:10:47,678
our reverse string.
Or with the StringBuilder, you use, the

126
00:10:48,178 --> 00:10:52,322
StringBuilder data type,
And so it create an object, and that uses

127
00:10:52,322 --> 00:10:56,250
the doubling array,
And you use the append operation.

128
00:10:56,250 --> 00:11:00,965
So, what do you think?
Which one of these is, is gonna be, most

129
00:11:00,965 --> 00:11:06,928
efficient for, a long string?
The answer is, that it's StringBuilder

130
00:11:07,196 --> 00:11:13,305
because the, using the built-in string
every time you do a concatenation, you

131
00:11:13,305 --> 00:11:18,516
have to make a copy of the whole string,
so if the string is of length N, that's

132
00:11:18,516 --> 00:11:24,235
gonna take, one + two + three all the way
up to N, which sums to, N square, about

133
00:11:24,235 --> 00:11:30,017
N^2/2 so it takes quadratic time to do a,
for this algorithm to run, for a long

134
00:11:30,017 --> 00:11:33,131
string.
And that's gonna preclude using it for,

135
00:11:33,322 --> 00:11:38,787
huge strings. as we've seen so many times,
can't be using a quadratic time algorithm

136
00:11:38,787 --> 00:11:42,917
for, lot of data.
On the other hand, with StringBuilder it's

137
00:11:42,917 --> 00:11:47,680
linear time because the append operations
are amortized in linear.

138
00:11:47,919 --> 00:11:52,622
So that's a simple example.
Here's another example.

139
00:11:52,622 --> 00:11:59,318
A computation that we're gonna look at
later on at the end of the lecture is how

140
00:11:59,318 --> 00:12:04,869
do we form an array of suffixes?
So that is, we have an input string in the

141
00:12:04,869 --> 00:12:10,001
suffixes of the string or the strings that
you get by starting each position.

142
00:12:10,001 --> 00:12:15,333
So the first suffix is the whole string,
the next one starts at position one, the

143
00:12:15,333 --> 00:12:19,465
next one starts at position two and so
forth, each one less.

144
00:12:19,465 --> 00:12:24,931
And so we have algorithms that gain
efficiency by forming an array of suffixes

145
00:12:24,931 --> 00:12:29,330
of a given, given string.
And so how do we create that thing in the

146
00:12:29,330 --> 00:12:32,900
first place?
Again, you can do it with string or you

147
00:12:32,900 --> 00:12:37,518
can do it with StringBuilder. so, let's
look at it with string.

148
00:12:37,731 --> 00:12:41,851
We get the length, that's gonna be the
length of the array.

149
00:12:42,064 --> 00:12:47,889
And what we do is for all values of I, we
set suffixes of I to the sub string of X.

150
00:12:47,889 --> 00:12:53,715
S you get by starting at I and going all
the way to N and that's our suffix array

151
00:12:53,928 --> 00:12:59,753
and this is the corresponding code for
StringBuilder. But, now in this case the

152
00:12:59,753 --> 00:13:05,617
standard method is gonna be linear.
Whereas the StringBuilder, because there's

153
00:13:05,617 --> 00:13:11,124
only one string and the sub-strings are a
few pointers into that string.

154
00:13:11,124 --> 00:13:17,292
Whereas for a StringBuilder, we have to
make a new string for each suffix and

155
00:13:17,292 --> 00:13:23,534
there's a quadratic number of characters
in the, in all of those strings and so it

156
00:13:23,534 --> 00:13:29,482
takes quadratic time, so you can't use
StringBuilder to build a suffix array for

157
00:13:29,482 --> 00:13:32,965
a huge string.
So again, those are typical examples of

158
00:13:32,965 --> 00:13:38,192
string processing where it really matters
which string implementations that you're

159
00:13:38,192 --> 00:13:43,103
using and if you're not using, if you're
using Java, these trade-offs are clear.

160
00:13:43,103 --> 00:13:48,015
If you're using some other programming
language, you better make sure you know

161
00:13:48,015 --> 00:13:52,990
how strings are implemented before you
even get started with string processing.

162
00:13:53,211 --> 00:13:57,411
So here's a simple computation that we'll
be using.

163
00:13:57,632 --> 00:14:04,043
Suppose that we have two strings and what
we're interested in knowing is the length

164
00:14:04,043 --> 00:14:10,558
of the longest common prefix.
So here's some a static method that will

165
00:14:10,558 --> 00:14:15,020
implement this function.
Takes two strings as argument.

166
00:14:15,020 --> 00:14:21,287
We're gonna need to go as far as the
length of the shortest of the two strings,

167
00:14:21,287 --> 00:14:25,651
So that's N.
And then we just go ahead and start at the

168
00:14:25,651 --> 00:14:31,204
beginning and compare as long as the
strings are equal we increment I.

169
00:14:31,204 --> 00:14:36,996
And if we get to a point where they're
non-equal, that's when we return I.

170
00:14:36,996 --> 00:14:41,280
And that's the length of the longest
common pre-fixes.

171
00:14:41,280 --> 00:14:47,310
In this case, they're not equal at four,
that means they match at four characters.

172
00:14:47,310 --> 00:14:51,661
And if we get to the end of one of them
then that's a prefix,

173
00:14:51,661 --> 00:14:56,012
So we just return N.
So that's just a little bit of warm-up

174
00:14:56,012 --> 00:15:02,207
code, and the amount of time that takes is
proportional to the length of the longest

175
00:15:02,207 --> 00:15:06,565
common prefix.
Although if the prefix is short, like if

176
00:15:06,565 --> 00:15:12,114
the two strings have a different first
character, then it's a sublinear doesn't

177
00:15:12,114 --> 00:15:16,550
have to look at all the data, just has to
look at the amount that matches.

178
00:15:16,749 --> 00:15:21,735
So, the idea of a sub-linear time
algorithm for a string processing is a

179
00:15:21,735 --> 00:15:27,186
really important one that, you know, we're
going to be taking advantage of, as we

180
00:15:27,186 --> 00:15:32,239
move into more complicated algorithms.
So, for example, you can compare two

181
00:15:32,239 --> 00:15:37,424
strings without looking at them all.
It depends, you just have to find, the

182
00:15:37,424 --> 00:15:42,876
first place that they differ, so you don't
look at all the data at sub-linear time.

183
00:15:42,876 --> 00:15:47,530
And we're going to see sorting algorithms
that take advantage of that.

184
00:15:49,151 --> 00:15:55,218
Now we're not gonna really do it, in the
code that we show in lecture or even in

185
00:15:55,218 --> 00:16:01,354
the book but it's actually fairly easy to
take many of the algorithms that we're

186
00:16:01,354 --> 00:16:06,927
gonna look at, and make some so that they
work for general alphabets and for

187
00:16:06,927 --> 00:16:13,135
different applications you know, it might
be entirely appropriate to customize the

188
00:16:13,135 --> 00:16:18,637
code to a particular alphabet.
So, like if the, the thing are the things

189
00:16:18,637 --> 00:16:22,940
that are being processed are numbers or
positive integers,

190
00:16:22,940 --> 00:16:28,730
Or things like account numbers, maybe only
ten decimal characters can occur.

191
00:16:28,730 --> 00:16:34,212
So, we might as well work with strings
made from those well-defined ten

192
00:16:34,212 --> 00:16:38,073
characters.
In DNA, there's only four characters, so

193
00:16:38,073 --> 00:16:42,860
we might as well know that we're working
with four characters.

194
00:16:42,860 --> 00:16:49,097
And so it's the we'll often talk of the
radix, which is the number of possible

195
00:16:49,097 --> 00:16:56,406
different character values in the string.
Now we're always gonna use what's called

196
00:16:56,406 --> 00:17:01,779
an Extended ASCII, where just to fix
ideas, where the radix is 256.

197
00:17:02,034 --> 00:17:07,119
And the number of.
Bits, therefore to represent a character

198
00:17:07,119 --> 00:17:10,669
is the log base two of that, so 8-bits,
256.

199
00:17:10,669 --> 00:17:17,347
And when we talk about performance of
algorithms, we'll use R and log, R just to

200
00:17:17,347 --> 00:17:23,855
make sure that it's clear that if we're
working with a smaller alphabet or a

201
00:17:23,855 --> 00:17:30,617
larger alphabet we can still use the
algorithms but the performance is going to

202
00:17:30,617 --> 00:17:35,350
depend on the radix.
So that's an introduction to String