Today we're gonna talk about String Processing and we're gonna start by talking about String Sorting. We'll take a look at some classic methods but first we need to talk a little bit about just what strings are. And actually, that's really It's dependent on the programming language that you're using. It's different programming languages nowadays really have completely different implementations of strings. So to get started, we need to take a look at efficient implementations of basic operations on strings. We're going to tailor our algorithms to particular Java implementation, And that can be made to work in other situations, But certainly, it's starting point, you have to be specific. So what is a string? A string is just a, sequence of characters, and it's actually a very important fundamental attraction that's been with us from the beginning of the information processing. So pretty much everything we communicate is a string, email or our programs are strings. But even, another important area of significance that has arisen recently is in Computational Biology, where our understanding of the way that life works depends on genomic sequences, and is essentially based on string processing. We'll see examples of that later on. This is a quote that's talks about that issue. The" digital information that underlies biochemistry, cell biology and development can be represented by a simple string of G's, A's, T's, and C's. This string is the root data structure of an organisms biology. So we're not talking just about data structures for information processing but for life models of life itself. Now back to computers. The Strings are made up of characters. What's a character? Well the kind of classical representation of what a character is, is so called 7-bit ASCII Code. Where you have actually the underlying data type is 8-bit so you coud have up to 256 characters. But for many, many years programmers used only 128 of those characters that would include all the upper case and lower case letters and numbers and some punctuation. So that's, 7-bit ASCII that's the standard for the C programming language, that's a very widely used language. Nowadays, people use what's called Unicode. Where a character is a 16-bit integer. And that's to allow for many, many more characters, Two to the sixteenth, 65,536 instead of only 256. And that allows for encoding characters from, many different of the world's lang-, languages, and mathematical characters. So it's a much more, general and generous, representation of what is a character. But it's important to, be specific. So, in Java, the standard is that a char is a sixteen bit unsigned integer. Now, not all. Programming systems and applications have moved up to the Unicode standard so sometimes you'll find programs looking for Unicode encoding and not finding it, And this t-shirt is a joke that has something to do that to do with that. It's supposed to be a heart which is a valid Unicode character in some worlds, but not on that t-shirt. So we'll come back to what is a character. We're gonna use a simpler version in between those two, or characters in the 8-bit integer. Now let's talk about what a string is in Java. There's a built-in string data type. It's not quite built-in, but many of the features for processing string are built into the Java language. So it's okay to think of it as being built-in. And so in Java, a string is a sequence of characters and it's immutable. Once you create a string, you can't change it. And, the primary operations that you can perform efficiently with a string are to, number one find it's length, So you can get the length of the string. And that's just the number of characters in it. You can index into a string and get the IT character, that's the charAt method. And, you can eps, extract a substring of a string, to create a new string, that's a continuous subsequence of the characters in the string. And. one of the, big, features of Java's implementation is that you can get that operation done in constant time, And then you can also do what's called concatenation, and that's add a character to the end of another string, That one can't be done in constant time in the standard Java string data type. So, this is what the implementation looks like for the string datat type in Java. The private instance variables are an array of characters, An offset that's an index into the first character of a string in the array the length and also to make it more efficient to search using string keys java keeps a private variable which is the hashcode for that string. So, once the string is built and the hashcode computed, then when it's time to use the hashcode and the hashing algorithm it's immediately available. So, the length method simply returns that length. To get the IT character of the string, we add I to the offset, and get that character. And, to, create a string given an offset length and a character array, we just reset those, values. And then the key thing is the substring, method. Since all it involves is, a pointer into the immutable string. And, a length, the index of the first character we can build a string in, constant time just by copying the reference to the character array. So that implementation we give good feeling of why, you know, substring method is constant in string. So this is the performance. It's a sequence of characters, is immutable. And the underlying implementation is immutable instance variables that give the array, offset and length. And so it means that we can get length out of constant time, charAt in constant time just by adding in the offset indexing. Substring in constant time just by essentially cons, copying those instance variables. But to concatenate, to make a new string that results from adding one character to a string, we have to create a whole new string; and make a copy of it, because the string itself is immutable So it takes time proportional to the number of characters in the string, and it involves making a new string. You can imagine string implementations, And they exist in various programming languages, where, these performance guarantees are different. And actually, Java has different implementations for applications where you might want different performance, guarantees. At the if you work out the memory usage for a string of length and it's 40 + 2N bytes. You might consider using a char array, but then you, you lose a lot of convenience. So, for being able to produce substring and sublink, and also the language features that supports strings. So here's an implementation of a different implementation of sequence of characters in Java that is mutable. So it's, the idea is that you can use this data type when you're building up a string a piece at a time. Like maybe reading characters off standard input or something. The underlying implementation in this case is a resizing array of characters. So when it fills up and doubles, as we've done, many times before. And it keeps the, the length that's an instance variable. So with string builder, you can get the length in constant time. You can get characters in constant time just by doubling, And you can concatenate, add a new character in amortized constant time. Most of the time, it's constant. Every once in a while, you might have to double. But you pay for that double by, the number of operations, that you did. The thing you lose though, is that it takes linear time to extract a sub-string, because to extract a sub-string, you have to make a new char array that can be re-sizing and so forth, and can be immutable to concat So that's two different implementations of sequence of characters in Java, where these two different, importantly different performance characteristics, so we have to keep in mind be mindful of that in applications, and again in other programming languages, something like the StringBuilder is more like the standard, and you just have to know what the implementation is. There's another one called StringBuffer as well in Java that we will skip for now. So here's a typical example that might have a simple computation like how do we efficiently reverse a string. So you could use a string or you could use a StringBuilder with string you get to, a declare it almost like a built-in type and simply initialize with the null string, And then to rev-, rev-, compute the reverse string, we go backwards through the original string and concatenate the, characters starting at the back to create our reverse string. Or with the StringBuilder, you use, the StringBuilder data type, And so it create an object, and that uses the doubling array, And you use the append operation. So, what do you think? Which one of these is, is gonna be, most efficient for, a long string? The answer is, that it's StringBuilder because the, using the built-in string every time you do a concatenation, you have to make a copy of the whole string, so if the string is of length N, that's gonna take, one + two + three all the way up to N, which sums to, N square, about N^2/2 so it takes quadratic time to do a, for this algorithm to run, for a long string. And that's gonna preclude using it for, huge strings. as we've seen so many times, can't be using a quadratic time algorithm for, lot of data. On the other hand, with StringBuilder it's linear time because the append operations are amortized in linear. So that's a simple example. Here's another example. A computation that we're gonna look at later on at the end of the lecture is how do we form an array of suffixes? So that is, we have an input string in the suffixes of the string or the strings that you get by starting each position. So the first suffix is the whole string, the next one starts at position one, the next one starts at position two and so forth, each one less. And so we have algorithms that gain efficiency by forming an array of suffixes of a given, given string. And so how do we create that thing in the first place? Again, you can do it with string or you can do it with StringBuilder. so, let's look at it with string. We get the length, that's gonna be the length of the array. And what we do is for all values of I, we set suffixes of I to the sub string of X. S you get by starting at I and going all the way to N and that's our suffix array and this is the corresponding code for StringBuilder. But, now in this case the standard method is gonna be linear. Whereas the StringBuilder, because there's only one string and the sub-strings are a few pointers into that string. Whereas for a StringBuilder, we have to make a new string for each suffix and there's a quadratic number of characters in the, in all of those strings and so it takes quadratic time, so you can't use StringBuilder to build a suffix array for a huge string. So again, those are typical examples of string processing where it really matters which string implementations that you're using and if you're not using, if you're using Java, these trade-offs are clear. If you're using some other programming language, you better make sure you know how strings are implemented before you even get started with string processing. So here's a simple computation that we'll be using. Suppose that we have two strings and what we're interested in knowing is the length of the longest common prefix. So here's some a static method that will implement this function. Takes two strings as argument. We're gonna need to go as far as the length of the shortest of the two strings, So that's N. And then we just go ahead and start at the beginning and compare as long as the strings are equal we increment I. And if we get to a point where they're non-equal, that's when we return I. And that's the length of the longest common pre-fixes. In this case, they're not equal at four, that means they match at four characters. And if we get to the end of one of them then that's a prefix, So we just return N. So that's just a little bit of warm-up code, and the amount of time that takes is proportional to the length of the longest common prefix. Although if the prefix is short, like if the two strings have a different first character, then it's a sublinear doesn't have to look at all the data, just has to look at the amount that matches. So, the idea of a sub-linear time algorithm for a string processing is a really important one that, you know, we're going to be taking advantage of, as we move into more complicated algorithms. So, for example, you can compare two strings without looking at them all. It depends, you just have to find, the first place that they differ, so you don't look at all the data at sub-linear time. And we're going to see sorting algorithms that take advantage of that. Now we're not gonna really do it, in the code that we show in lecture or even in the book but it's actually fairly easy to take many of the algorithms that we're gonna look at, and make some so that they work for general alphabets and for different applications you know, it might be entirely appropriate to customize the code to a particular alphabet. So, like if the, the thing are the things that are being processed are numbers or positive integers, Or things like account numbers, maybe only ten decimal characters can occur. So, we might as well work with strings made from those well-defined ten characters. In DNA, there's only four characters, so we might as well know that we're working with four characters. And so it's the we'll often talk of the radix, which is the number of possible different character values in the string. Now we're always gonna use what's called an Extended ASCII, where just to fix ideas, where the radix is 256. And the number of. Bits, therefore to represent a character is the log base two of that, so 8-bits, 256. And when we talk about performance of algorithms, we'll use R and log, R just to make sure that it's clear that if we're working with a smaller alphabet or a larger alphabet we can still use the algorithms but the performance is going to depend on the radix. So that's an introduction to String