1 00:00:03,500 --> 00:00:08,584 The first string sorting algorithm that we're going to look at is actually the 2 00:00:08,584 --> 00:00:13,734 basis for several more complicated algorithms is called key-indexed counting 3 00:00:13,734 --> 00:00:19,287 and it's very useful on a particular special situation. but let's take a quick 4 00:00:19,287 --> 00:00:25,486 review of where we left off with sorting. so we considered a number of sorting 5 00:00:25,486 --> 00:00:32,117 algorithms starting with insertion sort, and then merge sort, quick sort and 6 00:00:32,584 --> 00:00:39,497 heapsort. And we got to the point where we could find an algorithm that's heapsort 7 00:00:39,964 --> 00:00:47,436 that guarantees to sort NL items and time proportional to N log N without using 8 00:00:47,436 --> 00:00:55,002 extra space, unfortunately not stable. And all these algorithms were useful, or are 9 00:00:55,002 --> 00:01:02,052 useful for any type of generic key as long as it implements the comparative 10 00:01:02,052 --> 00:01:08,721 operation. and not only that, we prove that, any algorithm that just uses 11 00:01:08,721 --> 00:01:15,960 compares has to use number of compares proportional to N log base 2N. So, in a 12 00:01:15,960 --> 00:01:23,397 very important sense, merge sort, or heapsort for example, or optimal. you 13 00:01:23,397 --> 00:01:30,835 can't use syntatically fewer compares, for, either one and, and, with heapsort, 14 00:01:30,835 --> 00:01:38,272 you can't use less, extra space. So why do we consider, other sorting algorithms? 15 00:01:38,538 --> 00:01:44,382 there was a lower bound, what, what, why are we thinking about this? And the 16 00:01:44,382 --> 00:01:50,241 question is, can we do better and obviously we're here because, the answer 17 00:01:50,241 --> 00:01:56,071 is that we can do better and if we don't depend on compares. The lower bound, the 18 00:01:56,071 --> 00:02:02,438 one assumption made by the lower bound is that we use compares, but we don't always 19 00:02:02,438 --> 00:02:08,727 need to use compares. And so let's look at an example. Key-indexed counting is a fine 20 00:02:08,727 --> 00:02:14,638 example of that. and it's representative of fairly common situation where, in 21 00:02:14,638 --> 00:02:21,280 sorting application, where it happens to be that the keys that we're using to sort 22 00:02:21,280 --> 00:02:26,666 are small integers. So, in this, this case, this is supposed to mimic an 23 00:02:26,666 --> 00:02:32,968 application where there's students and they're assigned to sections. There's not 24 00:02:32,968 --> 00:02:40,181 too many sections and we want to get the thing sorted. so we want to distribute the 25 00:02:40,181 --> 00:02:46,635 students by section and so we want to sort according to the section number and that's 26 00:02:46,635 --> 00:02:53,001 a small integer. And the implication of knowing that the key is a small integer 27 00:02:53,001 --> 00:02:59,543 is, that we can use the key as an array index and, and by knowing that the key is 28 00:02:59,543 --> 00:03:05,839 an array index, we can arrange for fast sort. so lots of applications for that 29 00:03:05,839 --> 00:03:12,708 when we have maybe a phone numbers, we can sort by area code or if you have a string 30 00:03:12,708 --> 00:03:19,495 you just want to sort by the first letter, you can do it that way. And actually, that 31 00:03:19,495 --> 00:03:27,196 idea leads to efficient sorting algorithm actually two different ways. now don't 32 00:03:27,196 --> 00:03:35,162 forget that we're sorting according to a sort key, but usually we're sorting bigger 33 00:03:35,162 --> 00:03:41,536 generic items that have other information associated with if you were just sorting 34 00:03:41,536 --> 00:03:47,475 the small integers, you could just count how many 1's there are, how many 2's there 35 00:03:47,475 --> 00:03:52,980 are and like that, and then in one path and if there's three 1's, just output 36 00:03:52,980 --> 00:03:58,267 three 1's and so forth. but the complication is that we have to carry the 37 00:03:58,267 --> 00:04:04,718 associated information along so we have to work a bit harder than that. so, here's 38 00:04:04,718 --> 00:04:10,771 the code for this method called key-indexed counting, and, let's look at a 39 00:04:10,771 --> 00:04:16,621 demo. So here's the key-indexed counting demo. Now, to make this a little less 40 00:04:16,621 --> 00:04:21,663 confusing in not so many numbers, we're going to use lower case a for zero, b for 41 00:04:21,663 --> 00:04:27,318 one, c for two and like that. So it's the a - first letter of the alphabet, and 42 00:04:27,318 --> 00:04:31,815 however you want to think of it. so, and we're already going to look at six. 43 00:04:31,815 --> 00:04:37,130 So we're supposing that we're sorting this array that has six different small 44 00:04:37,130 --> 00:04:42,445 integers and we're using lower case letters to represent the integers so that 45 00:04:42,445 --> 00:04:48,684 we can easily distinguish between the keys and this. So now, let's look at, the 46 00:04:48,684 --> 00:04:54,626 processing for this. So the first thing that we do is we go through and we count 47 00:04:54,626 --> 00:05:01,089 the frequency of occurrence of each letter and, so the way that we do that is we keep 48 00:05:01,089 --> 00:05:09,323 an array. now their arrays actually got to be one bigger than number of different 49 00:05:09,323 --> 00:05:14,672 key, keys that we have and the number of different small integers that we have. So 50 00:05:14,672 --> 00:05:19,729 in this case array of size seven. And just this, make the code a little 51 00:05:19,729 --> 00:05:24,920 cleaner, we keep the number of a's in count of one, the number of b's in count 52 00:05:24,920 --> 00:05:30,517 of two and so forth. and s o if we're, once we've set up, that's what we want to 53 00:05:30,517 --> 00:05:36,180 do, then its trivial to go ahead and count the frequencies. we'd simply go through, 54 00:05:36,180 --> 00:05:43,508 for i from zero to N, we'd go through our input. and when we, a of i, when we access 55 00:05:43,508 --> 00:05:49,036 a value in our input, it's a small integer. So it's zero, one, two, three, 56 00:05:49,036 --> 00:05:55,261 four, or five and we simply add one to that integer, and use it to index into 57 00:05:55,261 --> 00:06:01,557 that array. So when we see an a, that's zero, then we're, incrementing count of 58 00:06:01,557 --> 00:06:08,106 one, and we see a b, that's one, we're incrementing count of two and so forth. so 59 00:06:08,106 --> 00:06:14,150 in this case, we increment count corresponding to b and then a and c and 60 00:06:14,150 --> 00:06:18,869 like that. And everytime we encounter a new key, we simply increment one of these 61 00:06:18,869 --> 00:06:24,147 counters. In one pass through we get an array that gives us the number of a's, 62 00:06:24,147 --> 00:06:29,044 b's, c's, d's, e's and f's. That's the first path of key-indexed counting, count 63 00:06:29,044 --> 00:06:35,911 the frequencies of each letter, using the key as an index. now the next step is, is 64 00:06:36,697 --> 00:06:45,438 called computing cumulus, and that's a really easy thing as well. all we do is we 65 00:06:45,733 --> 00:06:54,573 go through the count array and simply at each step we add the current one to the 66 00:06:54,868 --> 00:07:03,690 sum computed so far. So if we look before, we had two a's and three b's, so that 67 00:07:03,690 --> 00:07:09,974 means there's five letters less than c. that's the a's and the b's, and they're 68 00:07:09,974 --> 00:07:16,552 six letters less than d and eight letters less than e and so forth. And that's just 69 00:07:16,552 --> 00:07:22,576 obtained by, we start with two, add three to it, get five, add one to it to get 70 00:07:22,576 --> 00:07:26,301 five. And with that one passed through the count 71 00:07:26,301 --> 00:07:32,642 array, then we can find out for example, there are six keys less than d and eight 72 00:07:32,642 --> 00:07:39,214 keys less than e. And those cumulus tell us where the d's go in the output. There's 73 00:07:39,214 --> 00:07:45,540 six keys less than d, and eight keys less than e, so the d's have to go in a6 and 74 00:07:45,540 --> 00:07:51,709 a7. So this is an array of indices that is going to tell us how to distribute the 75 00:07:51,709 --> 00:07:58,778 keys in the output. So that's the next step, is access the cumulus using the key 76 00:07:58,778 --> 00:08:05,634 as the index to move items. So let's take a look at. So now, remember when we see an 77 00:08:05,634 --> 00:08:11,603 a, we're just going to count that as zero, so we're going to go to count zero and 78 00:08:11,603 --> 00:08:15,106 that will access this entry in the count ar ray. 79 00:08:15,111 --> 00:08:20,931 So, we go through the whole array to be sorted and we move each key exactly to 80 00:08:20,931 --> 00:08:26,901 where it has to go and we'll do that one at, at time now. So when I is zero we're 81 00:08:26,901 --> 00:08:31,445 looking at the d, the count array corresponding to d has six, so it says, 82 00:08:31,445 --> 00:08:36,052 just put d in there and increment that. It means if you get another d, it's going to 83 00:08:36,052 --> 00:08:40,017 go into seven. And these things, the way we pre-computed 84 00:08:40,017 --> 00:08:45,161 them, are not going to run into one another. So now, a, we go, that goes in 85 00:08:45,161 --> 00:08:51,463 zero, and we increment the count array corresponding to a. next thing is c, and 86 00:08:51,463 --> 00:08:58,356 so that's going to says to put it in five and then increment, the count array 87 00:08:58,356 --> 00:09:03,127 corresponding to c, and f, it says put it in nine. 88 00:09:03,131 --> 00:09:08,319 Next is b, we put it in two. Sorry another f, we put in ten. 89 00:09:08,321 --> 00:09:14,774 Next is b that we put in wo. So you can see the keys from the input are getting 90 00:09:14,774 --> 00:09:20,985 distributed in the output according to the counts and the cumulus that we've 91 00:09:20,985 --> 00:09:27,437 pre-computed. so now we get the other d which goes into seven, we get the, another 92 00:09:27,437 --> 00:09:33,487 b which goes into three, and then increment the four for where the next one 93 00:09:33,487 --> 00:09:38,999 goes, f goes into eleven. The last b goes into four, the e goes into 94 00:09:38,999 --> 00:09:45,771 a, and the second a goes into one. So that's move items again, simply by 95 00:09:45,771 --> 00:09:53,979 using the key as index into the count array. and then the last step is to just 96 00:09:54,272 --> 00:10:01,503 copy the sorted array back into the original input. that's a demo of 97 00:10:01,503 --> 00:10:07,958 key-indexed counting. Quick summary of key-indexed counting, we make one pass 98 00:10:07,958 --> 00:10:13,870 through the array to count frequencies of each letter using the key as an index, 99 00:10:13,870 --> 00:10:20,078 then we go through that count array to, compute cumulus just by adding each new 100 00:10:20,078 --> 00:10:26,360 one into the running sum. then we use those cumulates, and access that using key 101 00:10:26,360 --> 00:10:32,346 as index to actually move items over and get them in sorted order, and then move 102 00:10:32,346 --> 00:10:38,646 back into the original array. what's the running time of this algorithm? Well, the 103 00:10:38,646 --> 00:10:46,217 analysis is actually quite simple, because it's just a couple of loops through the 104 00:10:46,217 --> 00:10:52,257 array that we sorted, and through the count array. and the key, key fact to 105 00:10:52,257 --> 00:10:59,022 note, that it takes time proportional to N plus R and space proportional to N + R. 106 00:10:59,022 --> 00:11:05,626 Now R, remember is our array that excess the number of different character values. 107 00:11:05,868 --> 00:11:16,043 so, so for asking, maybe that's the 205th. and, and for generic data maybe it's four. 108 00:11:16,045 --> 00:11:23,026 And N, we're assuming we're sorting huge files. So really, this is linear time in 109 00:11:23,026 --> 00:11:30,036 many, many practical situations. there's also the question of, is it stable? Yeah, 110 00:11:30,036 --> 00:11:36,521 it's actually stable. because when we do the move, we move things with equal keys 111 00:11:36,521 --> 00:11:42,772 in the order that we see them, we keep them in the order that we see them. That's 112 00:11:42,772 --> 00:11:49,023 just the way the method works. So we have for this special situation, we have a 113 00:11:49,023 --> 00:11:55,899 linear time stable sorting method which beats the N log N and it's useful in many 114 00:11:55,899 --> 00:11:57,540 practical situations.