1 00:00:03,910 --> 00:00:08,130 we can also get strings sorted by moving from left to right. 2 00:00:08,130 --> 00:00:13,516 That's called MSD string sorter. Most-significant-digit-first string sort. 3 00:00:13,738 --> 00:00:19,346 And this is going to kind of, kind of related to what we did with quick sort, 4 00:00:19,346 --> 00:00:23,109 and maybe viewed as a generalization of quick sort. 5 00:00:23,331 --> 00:00:30,046 So the idea is that we start with the first character, to use key index counting 6 00:00:30,046 --> 00:00:36,761 on the first character of the array and that will partition it into the strings 7 00:00:36,761 --> 00:00:39,270 that start with each character. So. 8 00:00:39,472 --> 00:00:44,592 If we use the first character, the left-most character in a string as, as the 9 00:00:44,592 --> 00:00:50,251 first character in this case, then we get all the strings that start with A followed 10 00:00:50,251 --> 00:00:53,619 by all the strings that start with B and so forth. 11 00:00:53,619 --> 00:00:57,594 And then recursively use the same method for each subfile. 12 00:00:57,594 --> 00:01:02,310 Essentially, we have a subarray for, each of the characters. 13 00:01:02,310 --> 00:01:08,306 And, that cumulate that we built with key index counting gives us the subarrays and 14 00:01:08,306 --> 00:01:12,820 actually completely delineates the subarrays that we need to sort. 15 00:01:13,033 --> 00:01:19,233 Remember we had built up this array that said that there's six keys less than d and 16 00:01:19,233 --> 00:01:24,577 eight keys less than e and so forth. So that tells us exactly precisely the 17 00:01:24,791 --> 00:01:29,707 boundaries of the subarray that contain the keys to start with d, 18 00:01:29,707 --> 00:01:36,121 And then we can just use the same method recursively to sort each of the subarrays 19 00:01:36,121 --> 00:01:39,430 one for each character. Mm. 20 00:01:40,453 --> 00:01:44,549 Now, this is a little bit complicated trace, 21 00:01:44,805 --> 00:01:53,166 But if you look at it more carefully after the lecture you'll see that it's pretty 22 00:01:53,166 --> 00:02:01,014 simple, setup for what we need to do. So, here's our input and we sort on the 23 00:02:01,014 --> 00:02:05,017 first character. If we sort on the first character, 24 00:02:05,209 --> 00:02:10,243 In this example, we have lots of s's. So the subfile for the s's, it's already 25 00:02:10,243 --> 00:02:14,767 sorted on the first character. This d is the digit that we're currently 26 00:02:14,767 --> 00:02:19,483 working on. So then, we sort that on, it's, the rest of it on its first 27 00:02:19,483 --> 00:02:22,733 character. So it's the second character and all the 28 00:02:22,733 --> 00:02:27,194 words that start with s. And then there's a lot of them that start 29 00:02:27,194 --> 00:02:30,253 with e, So we move on to the third character for 30 00:02:30,253 --> 00:02:33,375 those, And then there's some that start with a 31 00:02:33,375 --> 00:02:35,160 and so forth. So, recursively, 32 00:02:35,509 --> 00:02:45,623 Every time that we move on one character, we have to keep going until one thing we 33 00:02:45,623 --> 00:02:51,701 have to do, is if there's two keys that are equal, we have to examine every 34 00:02:51,701 --> 00:02:56,464 character in the key. We never find out that the two keys are 35 00:02:56,464 --> 00:03:01,392 equal until the end. And if we reach the end of a string, we 36 00:03:01,392 --> 00:03:05,744 can just assume that goes before any character value. 37 00:03:05,744 --> 00:03:12,150 And with those two things, then you, you can trace through the recursion to see how 38 00:03:12,150 --> 00:03:16,693 MSD string sort of works. So and again, in this case, it sorts by 39 00:03:16,693 --> 00:03:21,307 the first character, A and B have only size once, so you don't have to do 40 00:03:21,307 --> 00:03:24,640 anything. So then most of this slide is showing what 41 00:03:24,640 --> 00:03:30,024 happens in the keys that start with S, and then at the end, there's the two keys that 42 00:03:30,024 --> 00:03:35,215 start with T, and they have both h's and both e's, we have to go through the whole 43 00:03:35,215 --> 00:03:38,909 thing. So that's an example of MSD string sort. 44 00:03:38,909 --> 00:03:45,527 Now, if strings are variable length like they were in that example, we just treat 45 00:03:45,527 --> 00:03:52,475 them as if they had an extra character at the end that's smaller than any character. 46 00:03:52,475 --> 00:03:59,259 So they'll appear before they're supposed to appear alphabetically, before strings 47 00:03:59,259 --> 00:04:03,809 that have the same starting characters that are longer. 48 00:04:03,809 --> 00:04:09,103 And you can do that by overloading, you know, implementing charAt to just return 49 00:04:09,103 --> 00:04:14,822 minus one, if, if we're past the character that we're looking at. 50 00:04:14,822 --> 00:04:18,760 So, that's the easy part of the implementation. 51 00:04:18,760 --> 00:04:25,267 One thing to point out, the in the C programming language, the representation 52 00:04:25,267 --> 00:04:32,373 of strings puts an extra character that's zero at the end and those string has the 53 00:04:32,373 --> 00:04:37,510 zero character. So, actually not to do anything at all. 54 00:04:37,854 --> 00:04:46,160 But when that change the code for MSD string sort is also really an extremely 55 00:04:46,160 --> 00:04:51,442 simple modification or extension to key index counting. 56 00:04:51,731 --> 00:04:58,358 So we've got our sort and it takes its input array and then, output buffer, 57 00:04:58,358 --> 00:05:06,233 And then we have to take the deliminators of the subfile we're going to sort, low 58 00:05:06,233 --> 00:05:11,612 and high just like we've done for other recursive sorts. 59 00:05:11,900 --> 00:05:22,510 We go ahead and we do the key index counting, again, to take care of, we have 60 00:05:22,510 --> 00:05:27,879 to take care of the fact that we have an extra character that kind of the end of 61 00:05:27,879 --> 00:05:32,717 string character. And also, pull out the d character of our 62 00:05:32,717 --> 00:05:38,615 string just like we did for LSD sort. And we just go through the whole thing and 63 00:05:38,615 --> 00:05:44,050 sort according to the given character. And then, what we do next is just do a 64 00:05:44,050 --> 00:05:49,617 recursive call for every entry in the counter array where we, we just sort the 65 00:05:49,617 --> 00:05:55,472 part of the array from count of r to lo plus count of r or lo plus count of r plus 66 00:05:55,472 --> 00:05:59,575 one. Really just one line of curve for each 67 00:05:59,812 --> 00:06:03,442 subarray. And then we move to the right one 68 00:06:03,442 --> 00:06:07,703 character. So, again, this is very little code to get 69 00:06:07,703 --> 00:06:13,780 a very useful and powerful sorting method, That's MSD string sorting. 70 00:06:13,983 --> 00:06:18,258 One thing to notice is that, with the, extra output buffer, 71 00:06:18,462 --> 00:06:21,584 We can use that even with the recursive calls, 72 00:06:21,584 --> 00:06:25,859 But not the counter array, We have to keep the counter array around. 73 00:06:26,063 --> 00:06:30,338 So it's part, it's gotta be local to the recursive procedure, 74 00:06:30,541 --> 00:06:36,174 Cuz when we call for the next character, we're going to need, a new counter array 75 00:06:36,174 --> 00:06:42,078 for that character but have to save the old one to do the next subarray for, the 76 00:06:42,078 --> 00:06:45,875 calling, method. Well that turns out to be important 77 00:06:45,875 --> 00:06:50,695 because there's a potential for big problems with MSD sort when we start 78 00:06:50,695 --> 00:06:55,258 looking at the analysis. And the thing to notice is that if you 79 00:06:55,258 --> 00:07:00,648 have a tiny array, say an array of size two, it doesn't matter how small your 80 00:07:00,648 --> 00:07:06,251 array is, you need to have a counter array for each potential character in the 81 00:07:06,251 --> 00:07:10,100 alphabet. So for ASCII, just to, to initialize the 82 00:07:10,100 --> 00:07:16,788 counter and to set it to zero, just create it, it's going to be a hundred times 83 00:07:16,788 --> 00:07:23,921 slower, then just sorting the thing or just copying it back, even if you're using 84 00:07:23,921 --> 00:07:30,877 Unicode, it's 32,000 times slower. And what's worse is it's a recursive program, 85 00:07:30,877 --> 00:07:34,890 so there's going to be lots of small subarrays. 86 00:07:35,211 --> 00:07:40,250 Sorry. This feature or characteristic of MSD 87 00:07:40,250 --> 00:07:48,828 string sort actually [INAUDIBLE], we're having a lot of applications that were 88 00:07:48,828 --> 00:07:56,119 using it when people switched from ASCII to Unicode a while back. 89 00:07:56,441 --> 00:08:02,010 All of a sudden programs that were really efficient sorts, 90 00:08:02,207 --> 00:08:07,330 All of a sudden, became hundreds of times slower, with the switch to Unicode, 91 00:08:07,527 --> 00:08:13,175 Because these big arrays in the recursive, these big arrays in the recursive 92 00:08:13,175 --> 00:08:16,460 procedure. It was a, a serious performance problem, 93 00:08:16,460 --> 00:08:22,240 So we definitely have to watch out for that, with, MSD string sort. 94 00:08:22,240 --> 00:08:28,444 Now there is a good solution to avoid this danger and it's the same solution we've 95 00:08:28,444 --> 00:08:32,256 used before. If you've got a small subarray and the 96 00:08:32,256 --> 00:08:36,367 sort is going to be slower just cut off the insertion sort. 97 00:08:36,367 --> 00:08:42,226 The other thing you can do is and save some more time, is to just have insertion 98 00:08:42,226 --> 00:08:46,321 sort start at the character that we're currently working on. 99 00:08:46,525 --> 00:08:52,598 Cuz we know things are, equal, to the left and we're just looking for the right. 100 00:08:52,803 --> 00:08:58,467 And, that's easy to implement just by changing the implementation of the compare 101 00:08:58,467 --> 00:09:02,425 function, to take into account which character we're at. 102 00:09:02,630 --> 00:09:06,929 Notice it's quicker, And, and it happens to be quicker in Java 103 00:09:06,929 --> 00:09:12,280 to just pull out the substrings and use compare to, than to go in and get the 104 00:09:12,280 --> 00:09:15,160 charAt that's just a feature of the implementation. 105 00:09:15,606 --> 00:09:22,158 So switching to, cutting off the insertion sort for small subarrays is definitely a 106 00:09:22,158 --> 00:09:27,964 good idea for MSD string sort. And so, what about the performance of this 107 00:09:27,964 --> 00:09:31,985 method? Well the key characteristic of, of MSD 108 00:09:31,985 --> 00:09:37,568 string sorting, it examines just the characters that it needs to examine in 109 00:09:37,568 --> 00:09:42,631 order to get the C key sorted. So, that means that its performance is 110 00:09:42,631 --> 00:09:48,666 going to be really dependent on the data. Now, in the, worst case for the algorithm, 111 00:09:48,889 --> 00:09:54,631 it has to examine all the data, in this case all the characters in all the 112 00:09:54,631 --> 00:10:00,521 strings, and that's, for example, when they're all equal, it's going to look at 113 00:10:00,521 --> 00:10:05,219 all the characters. If you have some duplicate keys, it might 114 00:10:05,219 --> 00:10:11,258 have to examine all the characters in those duplicate keys, but there's plenty 115 00:10:11,258 --> 00:10:16,180 of other strings that it doesn't examine all the characters. so, 116 00:10:16,400 --> 00:10:21,682 That, depending on, and end of the keys are random in some way, if they, or 117 00:10:21,682 --> 00:10:27,332 they're approximated by random keys. Then it's not going to examine very many 118 00:10:27,332 --> 00:10:32,248 characters at all, Actually and this is a typical case, say, 119 00:10:32,248 --> 00:10:38,558 for account numbers or some situation library call numbers or some situation 120 00:10:38,558 --> 00:10:40,980 like that, In a case like that. 121 00:10:40,980 --> 00:10:45,531 Msd string sort will examine only a small fraction of the data, 122 00:10:45,747 --> 00:10:51,527 A small constant fraction of the data, And it's always going to be sublinear. 123 00:10:51,527 --> 00:10:57,739 Now, it's also possible that sorts that do comparisons can be sublinear, but MSD 124 00:10:57,739 --> 00:11:02,652 string sort is, you, you know, good and that it really only examines the 125 00:11:02,652 --> 00:11:07,420 characters that need to be examined in order to get the sort done. 126 00:11:07,420 --> 00:11:11,653 So this gives us another line on, on the table, 127 00:11:11,897 --> 00:11:19,062 And the key here is that if the data is approximated by random log base r of N is 128 00:11:19,062 --> 00:11:24,680 going to be pretty well approximated by a constant in the real world. 129 00:11:24,876 --> 00:11:31,596 So this is going to be a fast method. The one danger is that, you have to worry 130 00:11:31,596 --> 00:11:38,674 out [inaudible], worry about using too much extra space, with those big counter 131 00:11:38,674 --> 00:11:42,173 arrays. But that's a important and useful, 132 00:11:42,411 --> 00:11:48,666 practical sorting method. So now let's look at, MSD string sort 133 00:11:48,666 --> 00:11:55,354 versus Quicksort for strings. There, it, they are similar in many ways. 134 00:11:55,354 --> 00:12:00,957 They're both recursive methods that, partition a file up. 135 00:12:00,957 --> 00:12:08,022 So one of the problems with MSD string sort is that, It tends to, when it's doing 136 00:12:08,022 --> 00:12:13,229 the counting, it's kind of making random accesses to memory when it, particularly 137 00:12:13,229 --> 00:12:18,891 when it's distributing the keys out. So on modern systems that have caches not 138 00:12:18,891 --> 00:12:24,554 everything is in the fastest memory at, at the same time and programs that move from 139 00:12:24,554 --> 00:12:29,760 one piece of data to another right next are, are going to be much more efficient. 140 00:12:29,760 --> 00:12:32,950 And quicksorts like that, an MSD quicksort is not. 141 00:12:34,190 --> 00:12:39,382 Another disadvantage of MSD string sort is that there's actually quite a few 142 00:12:39,382 --> 00:12:43,697 instructions in, in our loop. With the indexing and accounting. 143 00:12:43,697 --> 00:12:47,878 And the plus and [inaudible] in the accumulating and so forth, 144 00:12:47,878 --> 00:12:54,906 Whereas Quicksort remember is fast, because it has only a few instructions in 145 00:12:54,906 --> 00:12:59,568 the inner loop. And amnesty strings are used as extra 146 00:12:59,568 --> 00:13:03,221 space, Whereas quick sort, there's two things, 147 00:13:03,221 --> 00:13:08,501 one is, it's not linear in these applications, where MSD string sort is, 148 00:13:08,501 --> 00:13:14,233 and kind of the reason that it's not linear, is that if you've got keys that 149 00:13:14,233 --> 00:13:20,266 have a lot of the same characters at the beginning, when it's doing the compares, 150 00:13:20,266 --> 00:13:24,490 it has to rescan or recompare, lots of those characters. 151 00:13:24,490 --> 00:13:29,920 So, what were going to look at next, is try to get this, achieve this goal, of 152 00:13:29,920 --> 00:13:34,220 combining the advantages of, of these two sorting algorithms.