Today we're gonna talk about String
Processing and we're gonna start by
talking about String Sorting.
We'll take a look at some classic methods
but first we need to talk a little bit
about just what strings are.
And actually, that's really It's dependent
on the programming language that you're
using.
It's different programming languages
nowadays really have completely different
implementations of strings.
So to get started, we need to take a look
at efficient implementations of basic
operations on strings.
We're going to tailor our algorithms to
particular Java implementation,
And that can be made to work in other
situations,
But certainly, it's starting point, you
have to be specific.
So what is a string?
A string is just a, sequence of
characters, and it's actually a very
important fundamental attraction that's
been with us from the beginning of the
information processing.
So pretty much everything we communicate
is a string, email or our programs are
strings.
But even, another important area of
significance that has arisen recently is
in Computational Biology, where our
understanding of the way that life works
depends on genomic sequences, and is
essentially based on string processing.
We'll see examples of that later on.
This is a quote that's talks about that
issue.
The" digital information that underlies
biochemistry, cell biology and development
can be represented by a simple string of
G's, A's, T's, and C's.
This string is the root data structure of
an organisms biology.
So we're not talking just about data
structures for information processing but
for life models of life itself.
Now back to computers.
The Strings are made up of characters.
What's a character?
Well the kind of classical representation
of what a character is, is so called 7-bit
ASCII Code.
Where you have actually the underlying
data type is 8-bit so you coud have up to
256 characters.
But for many, many years programmers used
only 128 of those characters that would
include all the upper case and lower case
letters and numbers and some punctuation.
So that's, 7-bit ASCII that's the standard
for the C programming language, that's a
very widely used language.
Nowadays, people use what's called
Unicode.
Where a character is a 16-bit integer.
And that's to allow for many, many more
characters,
Two to the sixteenth, 65,536 instead of
only 256.
And that allows for encoding characters
from, many different of the world's lang-,
languages, and mathematical characters.
So it's a much more, general and generous,
representation of what is a character.
But it's important to, be specific.
So, in Java, the standard is that a char
is a sixteen bit unsigned integer.
Now, not all.
Programming systems and applications have
moved up to the Unicode standard so
sometimes you'll find programs looking for
Unicode encoding and not finding it,
And this t-shirt is a joke that has
something to do that to do with that.
It's supposed to be a heart which is a
valid Unicode character in some worlds,
but not on that t-shirt.
So we'll come back to what is a character.
We're gonna use a simpler version in
between those two, or characters in the
8-bit integer.
Now let's talk about what a string is in
Java.
There's a built-in string data type.
It's not quite built-in, but many of the
features for processing string are built
into the Java language.
So it's okay to think of it as being
built-in.
And so in Java, a string is a sequence of
characters and it's immutable.
Once you create a string, you can't change
it.
And, the primary operations that you can
perform efficiently with a string are to,
number one find it's length,
So you can get the length of the string.
And that's just the number of characters
in it.
You can index into a string and get the IT
character, that's the charAt method.
And, you can eps, extract a substring of a
string, to create a new string, that's a
continuous subsequence of the characters
in the string.
And. one of the, big, features of Java's
implementation is that you can get that
operation done in constant time,
And then you can also do what's called
concatenation, and that's add a character
to the end of another string,
That one can't be done in constant time in
the standard Java string data type.
So, this is what the implementation looks
like for the string datat type in Java.
The private instance variables are an
array of characters,
An offset that's an index into the first
character of a string in the array the
length and also to make it more efficient
to search using string keys java keeps a
private variable which is the hashcode for
that string.
So, once the string is built and the
hashcode computed, then when it's time to
use the hashcode and the hashing algorithm
it's immediately available.
So, the length method simply returns that
length.
To get the IT character of the string, we
add I to the offset, and get that
character.
And, to, create a string given an offset
length and a character array, we just
reset those, values.
And then the key thing is the substring,
method.
Since all it involves is, a pointer into
the immutable string.
And, a length, the index of the first
character we can build a string in,
constant time just by copying the
reference to the character array.
So that implementation we give good
feeling of why, you know, substring method
is constant in string.
So this is the performance.
It's a sequence of characters, is
immutable.
And the underlying implementation is
immutable instance variables that give the
array, offset and length.
And so it means that we can get length out
of constant time, charAt in constant time
just by adding in the offset indexing.
Substring in constant time just by
essentially cons, copying those instance
variables.
But to concatenate, to make a new string
that results from adding one character to
a string, we have to create a whole new
string; and make a copy of it, because the
string itself is immutable So it takes
time proportional to the number of
characters in the string, and it involves
making a new string.
You can imagine string implementations,
And they exist in various programming
languages, where, these performance
guarantees are different.
And actually, Java has different
implementations for applications where you
might want different performance,
guarantees.
At the if you work out the memory usage
for a string of length and it's 40 + 2N
bytes.
You might consider using a char array, but
then you, you lose a lot of convenience.
So, for being able to produce substring
and sublink, and also the language
features that supports strings.
So here's an implementation of a different
implementation of sequence of characters
in Java that is mutable.
So it's, the idea is that you can use this
data type when you're building up a string
a piece at a time.
Like maybe reading characters off standard
input or something.
The underlying implementation in this case
is a resizing array of characters.
So when it fills up and doubles, as we've
done, many times before.
And it keeps the, the length that's an
instance variable.
So with string builder, you can get the
length in constant time.
You can get characters in constant time
just by doubling, And you can concatenate,
add a new character in amortized constant
time.
Most of the time, it's constant.
Every once in a while, you might have to
double.
But you pay for that double by, the number
of operations, that you did.
The thing you lose though, is that it
takes linear time to extract a sub-string,
because to extract a sub-string, you have
to make a new char array that can be
re-sizing and so forth, and can be
immutable to concat So that's two
different implementations of sequence of
characters in Java, where these two
different, importantly different
performance characteristics, so we have to
keep in mind be mindful of that in
applications, and again in other
programming languages, something like the
StringBuilder is more like the standard,
and you just have to know what the
implementation is.
There's another one called StringBuffer as
well in Java that we will skip for now.
So here's a typical example that might
have a simple computation like how do we
efficiently reverse a string.
So you could use a string or you could use
a StringBuilder with string you get to, a
declare it almost like a built-in type and
simply initialize with the null string,
And then to rev-, rev-, compute the
reverse string, we go backwards through
the original string and concatenate the,
characters starting at the back to create
our reverse string.
Or with the StringBuilder, you use, the
StringBuilder data type,
And so it create an object, and that uses
the doubling array,
And you use the append operation.
So, what do you think?
Which one of these is, is gonna be, most
efficient for, a long string?
The answer is, that it's StringBuilder
because the, using the built-in string
every time you do a concatenation, you
have to make a copy of the whole string,
so if the string is of length N, that's
gonna take, one + two + three all the way
up to N, which sums to, N square, about
N^2/2 so it takes quadratic time to do a,
for this algorithm to run, for a long
string.
And that's gonna preclude using it for,
huge strings. as we've seen so many times,
can't be using a quadratic time algorithm
for, lot of data.
On the other hand, with StringBuilder it's
linear time because the append operations
are amortized in linear.
So that's a simple example.
Here's another example.
A computation that we're gonna look at
later on at the end of the lecture is how
do we form an array of suffixes?
So that is, we have an input string in the
suffixes of the string or the strings that
you get by starting each position.
So the first suffix is the whole string,
the next one starts at position one, the
next one starts at position two and so
forth, each one less.
And so we have algorithms that gain
efficiency by forming an array of suffixes
of a given, given string.
And so how do we create that thing in the
first place?
Again, you can do it with string or you
can do it with StringBuilder. so, let's
look at it with string.
We get the length, that's gonna be the
length of the array.
And what we do is for all values of I, we
set suffixes of I to the sub string of X.
S you get by starting at I and going all
the way to N and that's our suffix array
and this is the corresponding code for
StringBuilder. But, now in this case the
standard method is gonna be linear.
Whereas the StringBuilder, because there's
only one string and the sub-strings are a
few pointers into that string.
Whereas for a StringBuilder, we have to
make a new string for each suffix and
there's a quadratic number of characters
in the, in all of those strings and so it
takes quadratic time, so you can't use
StringBuilder to build a suffix array for
a huge string.
So again, those are typical examples of
string processing where it really matters
which string implementations that you're
using and if you're not using, if you're
using Java, these trade-offs are clear.
If you're using some other programming
language, you better make sure you know
how strings are implemented before you
even get started with string processing.
So here's a simple computation that we'll
be using.
Suppose that we have two strings and what
we're interested in knowing is the length
of the longest common prefix.
So here's some a static method that will
implement this function.
Takes two strings as argument.
We're gonna need to go as far as the
length of the shortest of the two strings,
So that's N.
And then we just go ahead and start at the
beginning and compare as long as the
strings are equal we increment I.
And if we get to a point where they're
non-equal, that's when we return I.
And that's the length of the longest
common pre-fixes.
In this case, they're not equal at four,
that means they match at four characters.
And if we get to the end of one of them
then that's a prefix,
So we just return N.
So that's just a little bit of warm-up
code, and the amount of time that takes is
proportional to the length of the longest
common prefix.
Although if the prefix is short, like if
the two strings have a different first
character, then it's a sublinear doesn't
have to look at all the data, just has to
look at the amount that matches.
So, the idea of a sub-linear time
algorithm for a string processing is a
really important one that, you know, we're
going to be taking advantage of, as we
move into more complicated algorithms.
So, for example, you can compare two
strings without looking at them all.
It depends, you just have to find, the
first place that they differ, so you don't
look at all the data at sub-linear time.
And we're going to see sorting algorithms
that take advantage of that.
Now we're not gonna really do it, in the
code that we show in lecture or even in
the book but it's actually fairly easy to
take many of the algorithms that we're
gonna look at, and make some so that they
work for general alphabets and for
different applications you know, it might
be entirely appropriate to customize the
code to a particular alphabet.
So, like if the, the thing are the things
that are being processed are numbers or
positive integers,
Or things like account numbers, maybe only
ten decimal characters can occur.
So, we might as well work with strings
made from those well-defined ten
characters.
In DNA, there's only four characters, so
we might as well know that we're working
with four characters.
And so it's the we'll often talk of the
radix, which is the number of possible
different character values in the string.
Now we're always gonna use what's called
an Extended ASCII, where just to fix
ideas, where the radix is 256.
And the number of.
Bits, therefore to represent a character
is the log base two of that, so 8-bits,
256.
And when we talk about performance of
algorithms, we'll use R and log, R just to
make sure that it's clear that if we're
working with a smaller alphabet or a
larger alphabet we can still use the
algorithms but the performance is going to
depend on the radix.
So that's an introduction to String