1 00:00:03,580 --> 00:00:09,248 Today we're going to talk about tries, which is a data structure for searching 2 00:00:09,248 --> 00:00:14,988 with string keys that is remarkable effective in enabling us to deal with huge 3 00:00:14,988 --> 00:00:18,549 amounts of data that we have to process nowadays. 4 00:00:18,767 --> 00:00:25,016 Just as a motivation, when we left off talking about searching algorithms, or the 5 00:00:25,016 --> 00:00:28,868 symbol table implementations here's where we were. 6 00:00:29,086 --> 00:00:34,972 The best algorithms that we examined were balanced red, black, binary search trees, 7 00:00:35,190 --> 00:00:40,144 in hash tables. And With respect to the basic search, 8 00:00:40,344 --> 00:00:45,087 insert and delete operations. Red black BSTs we're able to guarantee 9 00:00:45,087 --> 00:00:50,297 that we can get those done in logorithmic time, actually, pretty efficiently. 10 00:00:50,497 --> 00:00:55,574 And for hashing, we'd get'em done in constant time under certain assumptions. 11 00:00:55,574 --> 00:00:58,647 That we could have a uniform, hash function. 12 00:00:58,847 --> 00:01:04,258 And also, for hashing, what we need to compute a hash, code for binary search 13 00:01:04,258 --> 00:01:09,468 trees, we use, a compare to function. And another difference is, binary search 14 00:01:09,468 --> 00:01:14,792 trees, we can support more operations. Than we can support with hashing ordered 15 00:01:14,792 --> 00:01:18,357 operations. For example, getting the rank of a key in 16 00:01:18,357 --> 00:01:22,441 the tree, and other things. And those are terrific algorithms. 17 00:01:22,441 --> 00:01:27,237 And we looked at plenty of go, good applications of those algorithms. 18 00:01:27,237 --> 00:01:30,673 But still there's the question could we do better? 19 00:01:30,867 --> 00:01:33,720 And the answer is that yes, we can do better. 20 00:01:33,939 --> 00:01:38,615 As with string sorting, we can avoid examining the entire key. 21 00:01:38,835 --> 00:01:45,046 Which is going to give us algorithms that can compete and even beat these classic 22 00:01:45,046 --> 00:01:49,260 algorithms. So to get started we are going to 23 00:01:49,260 --> 00:01:56,337 articulate in API for symbol tables that specialize for the case when the keys are 24 00:01:56,337 --> 00:02:00,804 strings. So simply, we just add the modifier string 25 00:02:00,804 --> 00:02:07,980 to our standard symbol table EPI. And we can take a generic value just put 26 00:02:07,980 --> 00:02:13,913 the keys or strings. And then we take all the methods that we 27 00:02:13,913 --> 00:02:19,099 articulated before. And for key type, instead of generic, we 28 00:02:19,099 --> 00:02:23,070 use string. And this is going to allow us to develop 29 00:02:23,070 --> 00:02:28,418 efficient algorithms that take advantage of properties of strings. 30 00:02:28,661 --> 00:02:34,495 Our goal is to get, some algorithms that are even faster than hashing. 31 00:02:34,738 --> 00:02:41,221 And maybe, more flexible than BSTs. And we're going to be able to achieve both 32 00:02:41,221 --> 00:02:47,347 those goals. So, this again is the summary of the 33 00:02:47,608 --> 00:02:54,818 running times for when, in the case that the keys are strings for red black BSTs 34 00:02:55,079 --> 00:03:00,378 and for hashing. And let's take a look at those so if L is 35 00:03:00,378 --> 00:03:05,644 the length of the string. And n is the number of strings and then 36 00:03:05,644 --> 00:03:11,092 also the rate x is r is going to be a factor later on but its not going to be 37 00:03:11,092 --> 00:03:15,128 for these two. Then for hashing for every operation we 38 00:03:15,128 --> 00:03:20,778 have to compute a hash function, which basically means looking at every character 39 00:03:20,778 --> 00:03:25,608 in the string. Actually in Java string hash functions are 40 00:03:25,608 --> 00:03:31,201 cached, so sometimes there's efficiencies cuz you only have to comp, and they're 41 00:03:31,201 --> 00:03:35,225 mutable, so you only have to compute the hash function once. 42 00:03:35,430 --> 00:03:41,023 And then this is roughly the amount of space that takes into account the table 43 00:03:41,023 --> 00:03:46,753 and the string size for red black BSTs, the analysis is a bit more complicated. 44 00:03:46,958 --> 00:03:50,710 For a search hit, you have to look at the entire string. 45 00:03:50,710 --> 00:03:55,404 Given a entire key to check that every character's the same and then the 46 00:03:55,404 --> 00:04:00,654 comparison's depending on the nature of the keys for a typical case though its 47 00:04:00,654 --> 00:04:06,152 something like log squared N. we had log N comparison's but usually you have to look 48 00:04:06,152 --> 00:04:11,896 at something like a log N characters in the key in order to get down the tree. 49 00:04:12,081 --> 00:04:17,085 For a search miss you might be able to find out before you get to the end of the 50 00:04:17,085 --> 00:04:22,150 tree and sing a for an insert so the running times are something like this. 51 00:04:22,376 --> 00:04:28,933 Then here's some experimental evidence that, that we, we can use to test out 52 00:04:28,933 --> 00:04:35,039 these analytic hypothesis. And we'll just look at two examples one is 53 00:04:35,039 --> 00:04:41,747 the entire text of Melville's Moby Dick which we've used before and that's about a 54 00:04:41,747 --> 00:04:45,226 million characters. And maybe 200,000 strings. 55 00:04:45,226 --> 00:04:49,319 And add to this 200,000 strings, about 32,000 are distinct. 56 00:04:49,530 --> 00:04:53,482 And then this, There's another file of actors from the 57 00:04:53,482 --> 00:04:59,127 Internet movie database that's much bigger maybe tourism magnitude bigger. 58 00:04:59,339 --> 00:05:03,361 And it's got maybe about a million different words. 59 00:05:03,361 --> 00:05:09,909 So we'll want our algorithms to do better than, or at least as well as red black 60 00:05:09,909 --> 00:05:15,826 BSTs in hashing on these data sets. I think, you know, our test is to dedupe 61 00:05:15,826 --> 00:05:18,891 these data sets. So that's our challenge. 62 00:05:18,891 --> 00:05:24,737 Efficient performance for string keys and try to come up with algorithms that can 63 00:05:24,737 --> 00:05:30,945 compete with the best classic algorithms. Now in order to do that we're going to 64 00:05:30,945 --> 00:05:35,976 look at R way tries. That's a particular data structure, that 65 00:05:35,976 --> 00:05:39,704 was, invented actually really quite a while ago. 66 00:05:39,928 --> 00:05:43,507 This data structure, dates back, to the 60s. 67 00:05:43,731 --> 00:05:50,218 And, the first thing to know about it is that, the name is based a little bit, on a 68 00:05:50,218 --> 00:05:53,797 pun. It actually is the middle letters of the 69 00:05:53,797 --> 00:05:59,987 word retrieval, but we don't pronounce it tree because then we couldn't distinguish 70 00:05:59,987 --> 00:06:03,790 it from binary search trees, so we pronounce it try. 71 00:06:03,993 --> 00:06:09,569 And this is a confusing fact of life that we've been living with for many decades 72 00:06:09,569 --> 00:06:14,397 now with this data structure. And let's look at what a try is and we'll 73 00:06:14,397 --> 00:06:17,864 look at some, examples before we get to the code. 74 00:06:17,864 --> 00:06:22,012 So, for now, we're going to think of a try as some nodes. 75 00:06:22,012 --> 00:06:26,568 A tree structure with nodes where we store characters in the nodes, 76 00:06:26,568 --> 00:06:30,401 Not keys. And the other thing is that each node has 77 00:06:30,401 --> 00:06:35,628 R children, where R is the rate, it's the number of possible characters. 78 00:06:35,843 --> 00:06:41,070 And there's g-, the possibility of a child for each possible character. 79 00:06:41,268 --> 00:06:46,305 Now in the standard implementation, this is going to involve on all links. 80 00:06:46,504 --> 00:06:50,348 We have to have a placeholder for each possible character. 81 00:06:50,348 --> 00:06:55,650 So there's a lot of null links and tries. And we'll get back to that in a minute. 82 00:06:55,650 --> 00:07:00,754 But to get the concept, we'll use this graphical representation, where we have 83 00:07:00,754 --> 00:07:04,399 characters and nodes and we don't show any null length. 84 00:07:04,399 --> 00:07:09,767 And the other thing is, remember a symbol table is associating a key with a value, 85 00:07:09,767 --> 00:07:13,678 so this try is built for this set of key value pairs. 86 00:07:13,678 --> 00:07:18,290 And what we do is we. Store values in nodes that correspond to 87 00:07:18,290 --> 00:07:23,117 the last characters in keys. And we'll look, that's what these numbers 88 00:07:23,117 --> 00:07:26,386 here are. And we'll look at a little more in detail, 89 00:07:26,386 --> 00:07:30,680 on how this try is built in just a second. So that's the basic idea. 90 00:07:30,680 --> 00:07:35,230 We're going to have string keys. We're going to associate values, and we're 91 00:07:35,230 --> 00:07:40,357 going to use this tree-like, tree data structure but we're not going to draw the 92 00:07:40,357 --> 00:07:45,019 imply null links for now. So, for example, in this trie so the root 93 00:07:45,019 --> 00:07:50,641 re-, represents all strings. And, coming out of the root is one length 94 00:07:50,641 --> 00:07:55,307 for each possible letter. In particular in this try, the middle link 95 00:07:55,307 --> 00:08:00,600 is the link to the, to a sub try that contains all the keys that start with S. 96 00:08:00,600 --> 00:08:06,334 Then we go further down. Each time we go by a node, we pick off a 97 00:08:06,334 --> 00:08:13,413 letter so that this length, for example, goes to the try for all keys that start 98 00:08:13,413 --> 00:08:17,547 with s followed by h followed by e. And so forth. 99 00:08:17,547 --> 00:08:23,689 So now all, out of all the keys that start with SHE, actually one of them, the one 100 00:08:23,689 --> 00:08:29,062 that just has the letters in then ends, has a value associated with it. 101 00:08:29,062 --> 00:08:35,050 So when the node corresponding to the last letter, the E, we put the value zero. 102 00:08:35,271 --> 00:08:38,445 And so that's, how the try is going to look. 103 00:08:38,445 --> 00:08:44,424 So just given that definition, then let's look at, how to do search, in a try. 104 00:08:44,424 --> 00:08:50,181 All we do is use the characters and the key to guide our search down the data 105 00:08:50,181 --> 00:08:54,116 structure. So, let's say we're looking for the key 106 00:08:54,116 --> 00:08:57,584 shells. So we start at the root, and we go down 107 00:08:57,584 --> 00:09:03,448 the S link, since we started with an S. Now our second letter is H, so we look to 108 00:09:03,448 --> 00:09:09,460 see if there's a non-null link that, corresponding to H, and in this case there 109 00:09:09,460 --> 00:09:12,651 is. Now our next letter is E, so we look for 110 00:09:12,651 --> 00:09:13,320 an E. Now. 111 00:09:13,320 --> 00:09:18,995 No need to examine the value here because well we haven't looked at all the letters 112 00:09:18,995 --> 00:09:22,914 in our key yet. So now we look for L now we look for 113 00:09:22,914 --> 00:09:28,860 another L and then finally we find the S and when we find the S we look to see if 114 00:09:28,860 --> 00:09:33,049 there's a value there. In this case there is a value and so 115 00:09:33,049 --> 00:09:37,847 that's what we return the value associated with the last key character. 116 00:09:37,847 --> 00:09:41,842 So that's a typical search hit in a tri. Another way is... 117 00:09:41,842 --> 00:09:45,139 So, that node was down at the bottom but if you had. 118 00:09:45,139 --> 00:09:48,629 We're doing a search for SHE, you follow the same path. 119 00:09:48,629 --> 00:09:53,994 But now when you get to the E, that's the last character of that key and so we just 120 00:09:53,994 --> 00:09:57,808 know how to return that, value associated with that node. 121 00:09:57,808 --> 00:10:01,945 That is, the search doesn't have to go all the way to the bottom. 122 00:10:01,945 --> 00:10:04,724 It might terminate at an intermediate node. 123 00:10:04,724 --> 00:10:09,896 And we just return the value associated with the node corresponding to the last 124 00:10:09,896 --> 00:10:13,063 character in the key. What about a search miss? 125 00:10:13,063 --> 00:10:16,151 Well, there's two cases. It's for a search miss. 126 00:10:16,360 --> 00:10:22,904 So one of them is we've followed down the tree picking off the letters in the key 127 00:10:22,904 --> 00:10:28,264 one letter at a time as, as before. And then when we get to the end of the key 128 00:10:28,473 --> 00:10:34,251 we take a look to see if there is a value. In this case there's no value associated 129 00:10:34,251 --> 00:10:39,681 with the last key character that mean there's no key in the data structure 130 00:10:39,681 --> 00:10:43,650 that's been associated with a value. So we just return no. 131 00:10:43,650 --> 00:10:48,746 Or, the other thing that can happen is, we can, reach a null link. 132 00:10:48,746 --> 00:10:54,729 So, our key now is S, starts with S, and then H, and then E, and then L, and now 133 00:10:54,729 --> 00:11:00,860 our next letter is T, so we look to see if there's a non null link coming from this 134 00:11:00,860 --> 00:11:05,587 node associated with T. And in this case, there's no such link, so 135 00:11:05,587 --> 00:11:10,167 we return null. That's evidence that, that string, there's 136 00:11:10,167 --> 00:11:14,377 no key associated with that string in our data structure. 137 00:11:14,377 --> 00:11:16,442 So that's a search miss. Iii. 138 00:11:16,442 --> 00:11:21,070 Now, what about insertion? Well insertion is also pretty simple. 139 00:11:21,280 --> 00:11:25,690 So we follow two, two rules. One is again, we go down links 140 00:11:25,690 --> 00:11:31,360 corresponding each character in the key. If we come out on a link, we create a new 141 00:11:31,360 --> 00:11:35,490 node that's associated with the character that we're on. 142 00:11:35,700 --> 00:11:41,300 And just keep doing that until we get to the last character of the key, in which 143 00:11:41,300 --> 00:11:45,136 case, we put the value. So if in this try, we're supposed to 144 00:11:45,136 --> 00:11:47,879 associate the value seven with [inaudible]. 145 00:11:48,070 --> 00:11:53,046 We follow, our path from the root to s, 'cause our first letter is s, and then 146 00:11:53,046 --> 00:11:57,958 h.'Cause our next letter is h. And then a O, we had an, [inaudible] kinda 147 00:11:57,958 --> 00:12:02,040 ran into a null link. So we have to put, a node there, with the 148 00:12:02,040 --> 00:12:04,719 character, associated with the character O. 149 00:12:04,719 --> 00:12:10,014 So in later searches for this key, we'll be able to find it by passing through that 150 00:12:10,014 --> 00:12:12,821 node. And now we move to the next letter, and 151 00:12:12,821 --> 00:12:16,840 there's no node again. In the next letter, there's still no node. 152 00:12:16,840 --> 00:12:20,125 Iii. When we get to the end, then that's the 153 00:12:20,125 --> 00:12:23,910 last character in the key, and we put our value seven. 154 00:12:23,910 --> 00:12:30,224 And now we've modified the data structure, adding the nodes that are necessary for a 155 00:12:30,224 --> 00:12:36,030 later search to go through, character by character and find the value associated 156 00:12:36,030 --> 00:12:39,514 with that key. So that's an insertion into a try. 157 00:12:39,732 --> 00:12:46,119 Just to cement all these ideas let's do a demo of how that tree was constructed from 158 00:12:46,119 --> 00:12:50,045 scratch. So we have a null try. 159 00:12:50,045 --> 00:12:58,215 So we're going to start by putting the, associating the value zero with the string 160 00:12:58,215 --> 00:13:02,662 SHE. So we start at the root node, which, and 161 00:13:02,662 --> 00:13:10,625 I'll try, it just has that one node. Create a new node for S, create a new node 162 00:13:10,625 --> 00:13:15,590 for H, create a new node for E. Associate zero. 163 00:13:15,590 --> 00:13:21,004 So the key is the sequence of characters from the root to the value, and the value 164 00:13:21,004 --> 00:13:24,371 is in the node corresponding to the last character. 165 00:13:24,371 --> 00:13:29,257 This try represents the symbol table that contains just on string SHE, and 166 00:13:29,257 --> 00:13:33,351 associated with zero. Every other string, if you search for any 167 00:13:33,351 --> 00:13:38,963 other string in this try you'll either end in a node that doesn't have a value or 168 00:13:38,963 --> 00:13:44,050 you'll end at a null link. All right, so now, let's say we put in the 169 00:13:44,050 --> 00:13:50,381 key S, E, L, L, S and you can kinda tell where it's going, we made room for it in 170 00:13:50,381 --> 00:13:54,410 the try. So we go for S, and now the second letter, 171 00:13:54,410 --> 00:13:58,767 E, corresponds to a null link, so we create a new node. 172 00:13:58,767 --> 00:14:05,838 And similarly we go through and create new nodes for each of the remaining characters 173 00:14:05,838 --> 00:14:10,360 in the key and then associate the value one at the end. 174 00:14:10,643 --> 00:14:16,510 So now we have a try that has two keys in it, SHE and SELLS. 175 00:14:16,510 --> 00:14:21,648 Now our next is to insert SEA. So now we have S, we already have a node 176 00:14:21,648 --> 00:14:26,126 corresponding to E. And so now we just have to create one new 177 00:14:26,126 --> 00:14:30,531 node, the one corresponding to A. And put our value two there. 178 00:14:30,531 --> 00:14:36,330 And now, we're going to put SHELLS in. So we already have the S, already have the 179 00:14:36,330 --> 00:14:42,129 H, already have the E, so now we have to add nodes for the last three letters in 180 00:14:42,129 --> 00:14:45,800 that string. And associate the value three with it. 181 00:14:45,800 --> 00:14:49,865 Now we're going to put finally a key that doesn't start with S. 182 00:14:49,865 --> 00:14:54,833 So that means we create a new node corresponding to the first letter of that 183 00:14:54,833 --> 00:14:59,480 string, I, N to the other letter two. And then associate the value four. 184 00:14:59,480 --> 00:15:02,224 And here's another one that doesn't start with s. 185 00:15:02,224 --> 00:15:06,761 So we create new nodes corresponding to each one of its characters, and associate 186 00:15:06,761 --> 00:15:10,178 the value five with the last one. Now here's SEA. 187 00:15:10,178 --> 00:15:18,860 So this is, remember, our symbol table API is associative, which means that if we get 188 00:15:18,860 --> 00:15:25,450 a new value for a key that's already in the table, we overwrite. 189 00:15:25,450 --> 00:15:32,079 The old value with the new value. That's the way, the convention that we've 190 00:15:32,079 --> 00:15:38,349 chosen for symbol tables. So that is easily done with tries as well. 191 00:15:38,349 --> 00:15:44,620 Now here's one more key, S, H. And now we have to add a new node because 192 00:15:44,620 --> 00:15:50,174 there's no node that's a child of H that corresponds letter O. 193 00:15:50,174 --> 00:15:57,341 So we have to create new nodes for O, R, and E, and associate the value seven with 194 00:15:57,341 --> 00:16:00,522 it. So that's a tri that has precisely eight 195 00:16:00,522 --> 00:16:03,648 keys. If you look for any one of those eight 196 00:16:03,648 --> 00:16:08,549 keys you'll get the value. If you look for any other string you'll 197 00:16:08,549 --> 00:16:14,658 either come to a node that has a null value or go out on a null link and so the 198 00:16:14,658 --> 00:16:19,560 search would be unsuccessful. That's is a demo of tri construction. 199 00:16:20,520 --> 00:16:27,464 Now that you have a basic idea of how, what tris are and how they work, let's 200 00:16:27,464 --> 00:16:32,299 take a look at what's needed to implement them in Java. 201 00:16:32,299 --> 00:16:39,771 The key idea in the tri representation for implementation in Java or any computer 202 00:16:39,771 --> 00:16:48,297 language is that. Instead of representing a node as a character in the node, what we 203 00:16:48,297 --> 00:16:55,521 do is represent the links as an indexed array with one entry for each possible 204 00:16:55,521 --> 00:16:59,636 character. And the idea is the characters are 205 00:16:59,636 --> 00:17:03,253 actually. Implicitly defined by link indeces. 206 00:17:03,253 --> 00:17:08,600 So just for example, we have this small try that we built over here. 207 00:17:08,600 --> 00:17:12,510 In this case, the root has only one node below it. 208 00:17:12,510 --> 00:17:18,656 That's for all the keys that start with S. The way that's represented in real 209 00:17:18,656 --> 00:17:22,007 representation, and this is for efficiency. 210 00:17:22,007 --> 00:17:25,918 And it's the way that tries get their efficiency. 211 00:17:25,918 --> 00:17:30,547 Is we have one link corresponding to each possible letter. 212 00:17:30,547 --> 00:17:36,953 And the only one that is un-null is S. So the character S is defined implicitly 213 00:17:36,953 --> 00:17:43,535 by the fact that, that link that [inaudible] that, that corresponds to S is 214 00:17:43,535 --> 00:17:48,472 not null. Over here there's from e to a there's two 215 00:17:48,472 --> 00:17:54,067 links coming out of E. And the only way that we represent A is by 216 00:17:54,067 --> 00:18:00,238 having the first link be not null. If the link corresponding to a letter is 217 00:18:00,238 --> 00:18:07,150 not null, then it corresponds to having that letter in the node that it points to. 218 00:18:07,378 --> 00:18:10,952 So in this case we have E connected to A and L. 219 00:18:10,952 --> 00:18:16,960 So we have the entries corresponding to A and L filled with non null values. 220 00:18:16,960 --> 00:18:23,022 So you can see immediately the correspondence between this tri the way 221 00:18:23,022 --> 00:18:27,774 we've been drawing it and the job representation of it. 222 00:18:27,774 --> 00:18:33,836 Where each node contains 200, or r links if there's r different letters. 223 00:18:34,082 --> 00:18:39,326 And letters are implicitly represented by non [inaudible] links. 224 00:18:39,326 --> 00:18:44,580 Now, you just go in the node. So for example, I, in this try, S, E, A, 225 00:18:44,580 --> 00:18:48,724 but which is easy to follow down through the try. 226 00:18:48,724 --> 00:18:53,788 We're looking for an S. And S is the twentieth letter in the 227 00:18:53,788 --> 00:18:54,957 alphabet. Or so. 228 00:18:54,957 --> 00:19:00,103 We index into the twentieth position and that takes us right to S. 229 00:19:00,103 --> 00:19:03,845 And E's the fifth letter, we take the fifth link. 230 00:19:03,845 --> 00:19:07,665 Then A's the first letter, we take the first link. 231 00:19:07,665 --> 00:19:12,421 So we can use the character as index into the array of links. 232 00:19:12,421 --> 00:19:18,190 To quickly travel down the tree. Then when we get to the last character, we 233 00:19:18,190 --> 00:19:22,790 can check the value at that node that's stored in the node. 234 00:19:23,030 --> 00:19:28,888 One slight complication in the implementation that we encountered before 235 00:19:28,888 --> 00:19:34,826 with hashing algorithms and other things. We're going to need arrays of nodes. 236 00:19:34,826 --> 00:19:40,363 That's what our lengths are. So we can't have any generics with nodes, 237 00:19:40,363 --> 00:19:46,060 even though I would like to. So we have to declare the value to be a 238 00:19:46,060 --> 00:19:52,319 type object, and then we'll have to cast it back to whatever type it should be, 239 00:19:52,319 --> 00:19:55,930 when we return it. And we'll see that in code. 240 00:19:56,152 --> 00:20:01,258 Other than that little complication. This is quite straightforward 241 00:20:01,258 --> 00:20:05,328 representation of tries. And it leads to a very easy 242 00:20:05,328 --> 00:20:09,325 implementation. So the keys and the characters are not 243 00:20:09,325 --> 00:20:13,691 explicitly stored. They're, they're implicitly, because of 244 00:20:13,691 --> 00:20:17,761 the links. And that's a, a very important point to 245 00:20:17,761 --> 00:20:21,980 think about when it comes to, implementing using tries. 246 00:20:22,880 --> 00:20:29,599 Given that representation, this code for implementing try, simple table in Java 247 00:20:29,848 --> 00:20:34,946 almost writes itself. So it's a [inaudible] implementation in 248 00:20:34,946 --> 00:20:40,929 this slide has the implementation of put using the same recursive scheme that was 249 00:20:40,929 --> 00:20:46,692 used many other times in building trees the what are the instance variables? 250 00:20:46,692 --> 00:20:53,113 Well we declare artery 256 as usual in our string implementations where working with 251 00:20:53,113 --> 00:20:59,388 extended asking and then we have one instance variable that matters and that is 252 00:20:59,607 --> 00:21:04,976 the root of the [inaudible]. So tries begin with, start, all tries 253 00:21:04,976 --> 00:21:10,775 start with a null node. So first thing we do is create a new node 254 00:21:11,031 --> 00:21:17,939 and put a reference to that node in root. So our empty trie consists of a node 255 00:21:18,195 --> 00:21:23,654 that's got a. Remember, when you create a new node we 256 00:21:23,910 --> 00:21:30,280 Build an array of R links to nodes. And at the beginning those'll all be empty 257 00:21:30,280 --> 00:21:32,100 links. Or all null links. 258 00:21:32,100 --> 00:21:37,448 So the root points to a node that has 256 null lengths. 259 00:21:37,448 --> 00:21:42,171 Now to put. Or to associate a key with the value in a 260 00:21:42,171 --> 00:21:46,953 trie. We use this instance method that calls 261 00:21:47,177 --> 00:21:51,584 recursive method. Again, the same way that we've done for 262 00:21:51,584 --> 00:21:57,935 other tree construction schemes before. So we take the recursive method, takes a 263 00:21:57,935 --> 00:22:04,135 node as argument and returns a node. So it takes a reference to a trie and 264 00:22:04,135 --> 00:22:08,170 returns to, a reference to the trie that it constructs. 265 00:22:08,435 --> 00:22:16,411 After a associating the key with a value. Just to get started suppose that our first 266 00:22:16,411 --> 00:22:22,260 key has just one character. So in that case, we would call. 267 00:22:22,260 --> 00:22:27,789 Another recursive put for the root, with our one character. 268 00:22:28,061 --> 00:22:35,313 And so our one character key. And call the recursive routine. 269 00:22:35,584 --> 00:22:39,845 Now our node is not null. It's the root node. 270 00:22:40,117 --> 00:22:46,915 So and our key is length one. And we, called it starting at zero. 271 00:22:46,915 --> 00:22:55,260 So the first thing that we do is pick out the [inaudible] character of our, Yeah key 272 00:22:55,260 --> 00:23:01,432 so zero of character which is our one character and I guess its an index 273 00:23:01,432 --> 00:23:07,603 whatever letter might happen to B. Say its B then C would be two that's sort 274 00:23:07,603 --> 00:23:14,100 of thing and then all we do is recursively are follow that link in ser. 275 00:23:14,313 --> 00:23:18,793 Our, associate our key value in the try pointed to by that link. 276 00:23:19,006 --> 00:23:25,264 And then take the link that we get and put it back into that link out of the root 277 00:23:25,264 --> 00:23:29,054 node. So the next call there's nothing in the 278 00:23:29,264 --> 00:23:35,010 word dot next it's a, it's null so the next call we get null so we create a new 279 00:23:35,010 --> 00:23:38,164 node. And then that new node this point we've 280 00:23:38,164 --> 00:23:41,808 called with D plus one moving to the next character. 281 00:23:41,808 --> 00:23:46,364 So now we have D equals one which is equal to our key dot length. 282 00:23:46,364 --> 00:23:51,270 So we associate the value in that node with our node and we return it. 283 00:23:51,482 --> 00:23:57,578 So that return one level up will set the length of the new node in the appropriate 284 00:23:57,578 --> 00:24:04,672 entry corresponding to the root. Again once you've learned our normal 285 00:24:04,672 --> 00:24:12,566 recursive way of structuring building tree structures this code is very natural. 286 00:24:12,832 --> 00:24:20,726 So for a longer key I'm again thinking recursively we find the if we're suppose 287 00:24:20,726 --> 00:24:28,797 to insert the key associate the key with a value, and we're working on the 288 00:24:28,797 --> 00:24:34,343 [inaudible] character. We pick out the, The character at the deep 289 00:24:34,343 --> 00:24:37,377 position. We use that to index and to be in link 290 00:24:37,377 --> 00:24:43,343 array of the current node and then that's the link that we set when we do a put of 291 00:24:43,343 --> 00:24:47,136 the new node. So when we start with a longer string and 292 00:24:47,136 --> 00:24:53,205 a perfectly empty tree we create new nodes all the way down and then put their links 293 00:24:53,205 --> 00:24:57,550 to their parents all the way up. In this recursive structure. 294 00:24:57,756 --> 00:25:01,412 And it's a very simple and natural recursive code. 295 00:25:01,412 --> 00:25:05,687 So that's the put. Now that takes a little study but not so 296 00:25:05,687 --> 00:25:11,067 much once you're use to our standard recursive way of producing tree 297 00:25:11,067 --> 00:25:13,687 structures. And get is even simpler. 298 00:25:13,894 --> 00:25:19,842 So contains and gets. So our scenario's standard convention is 299 00:25:20,116 --> 00:25:28,336 to return null if we have a key that's not there or that contains and the get 300 00:25:28,336 --> 00:25:36,786 function is a again recursive. Procedure that will return in reference to 301 00:25:36,786 --> 00:25:40,322 the node, and if that's null, we return null. 302 00:25:40,322 --> 00:25:46,984 Otherwise, we can get the value out of the node return by the recursive procedure. 303 00:25:46,984 --> 00:25:52,411 And then remember we had the omega value in nodes of type objects. 304 00:25:52,411 --> 00:25:56,277 So we have to cast that back to the type value. 305 00:25:56,277 --> 00:26:04,024 And the recursive get is very simple. To find the node that contains the value 306 00:26:04,024 --> 00:26:09,540 associated with a given key. And we're working on the deep character. 307 00:26:09,540 --> 00:26:14,326 And we're currently on, node x. We just, return null. 308 00:26:14,326 --> 00:26:18,626 Effects happens to be null. That means it's not there. 309 00:26:18,869 --> 00:26:24,710 If we're, at the last character in the key, we return our current node. 310 00:26:24,946 --> 00:26:28,650 Otherwise, we get that [inaudible] character. 311 00:26:28,650 --> 00:26:34,010 And we use that to index into the next array of the current, node. 312 00:26:34,246 --> 00:26:41,103 And recursively called the get routine, for, that node moving, one down the tree. 313 00:26:41,103 --> 00:26:47,802 And incrementing our key pointer by one. Extremely simple, recursive code to do the 314 00:26:47,802 --> 00:26:51,599 tri implementation. So what about the performance? 315 00:26:51,808 --> 00:26:58,218 Well in a cert chip we simply go down the tried examining all L characters just 316 00:26:58,218 --> 00:27:01,911 using each one to index into an x-array at some note. 317 00:27:02,120 --> 00:27:06,510 And then go down to the end to, to see if there's a value there. 318 00:27:06,719 --> 00:27:11,875 Search miss we might have to go all the way down to the last character. 319 00:27:11,875 --> 00:27:17,450 But we could also just have a miss match on the first character and find a null 320 00:27:17,450 --> 00:27:19,680 link right away. And actually. 321 00:27:19,680 --> 00:27:24,528 Typically, in a try. In typical applications, we only examine a 322 00:27:24,528 --> 00:27:28,544 few characters. So right away, you can see, by thinking 323 00:27:28,544 --> 00:27:33,090 about a search miss, that try algorithms can be sublinear. 324 00:27:33,090 --> 00:27:39,378 We can find out that a key's not in the tree, by only examining a few characters. 325 00:27:39,378 --> 00:27:44,032 If we don't have any. Strings in our try that begin with the 326 00:27:44,032 --> 00:27:49,057 same few characters as our search key, then, it's not there. 327 00:27:49,292 --> 00:27:55,260 That's a typical case. Now the downside of tri performance in 328 00:27:55,542 --> 00:28:00,160 lots of applications is the amount of space used. 329 00:28:00,433 --> 00:28:05,804 There is the problem that every node's got R lengths in it. 330 00:28:05,804 --> 00:28:12,997 And particularly, down at the bottom, the leaf nodes that correspond to the last 331 00:28:12,997 --> 00:28:20,371 characters and keys and, the prefix in the key in the try that's going to be null 332 00:28:20,371 --> 00:28:24,337 lengths. Now it is possible in a huge try with lots 333 00:28:24,337 --> 00:28:30,510 of strings that are short and sharing common prefixes to actually much less 334 00:28:30,510 --> 00:28:37,360 space then this but be ardenal links at each leaf is a real problem in some 335 00:28:37,360 --> 00:28:42,480 applications so we're going to take a look on how to deal with that. 336 00:28:42,783 --> 00:28:51,479 So the bottom line is for symbol tables with relatively small numbers of string 337 00:28:51,479 --> 00:28:58,672 keys where the amount of space used by the [UNKNOWN] is not a problem. 338 00:28:58,920 --> 00:29:05,697 Then we get very fast, search hit. And even faster search miss but we burn up 339 00:29:05,697 --> 00:29:09,334 some space. That's the bottom line about try 340 00:29:09,334 --> 00:29:13,384 performance. And, just a typical application. 341 00:29:13,384 --> 00:29:19,996 Maybe you get a job interview question about, what data structure you use for 342 00:29:19,996 --> 00:29:28,234 spell checking And then, so, and this is just an example, to show, how effective 343 00:29:28,234 --> 00:29:34,377 tries can be for such an application. Where all the words are three letters. 344 00:29:34,613 --> 00:29:41,385 And the solution is, build a 26 way try. So spell checking, usually, there'll be 345 00:29:41,385 --> 00:29:47,842 pre-processing to strip out everything but the lowercase letters that make up the 346 00:29:47,842 --> 00:29:51,858 word. So you can build a 26 way try that, will 347 00:29:51,858 --> 00:29:56,820 immediately tell you whether you have a misspelled word or not. 348 00:29:56,820 --> 00:30:04,540 Another interesting thing about prize is that, deletion is, very easy, what do we 349 00:30:04,540 --> 00:30:11,646 do to delete a key value pair, well you find, the node corresponding to key and, 350 00:30:12,172 --> 00:30:16,208 set the value there corresponding to null, . 351 00:30:16,208 --> 00:30:23,489 So that's, easily done, so we want to delete shelves we cross out the, value 352 00:30:23,489 --> 00:30:30,174 there, and now, there's two cases. One case is this one where that node, now 353 00:30:30,174 --> 00:30:34,250 we set its' value to null and its got all null links. 354 00:30:34,250 --> 00:30:40,865 Now there's no reason for its existence. So it wouldn't have been created except 355 00:30:40,865 --> 00:30:46,940 for the fact that we inserted shells. So if the node has got null links, just 356 00:30:46,940 --> 00:30:50,239 remove it. And then, just delete it. 357 00:30:50,239 --> 00:30:57,322 And then when you go back up the tree, which you do because you got down there. 358 00:30:57,575 --> 00:31:05,416 Every node that you touch you check if it's got a no value and all no links, just 359 00:31:05,416 --> 00:31:10,024 delete it. And you keep going until you find one that 360 00:31:10,024 --> 00:31:17,377 has either a value or a downhill link and then that's when you can stop the leading 361 00:31:17,377 --> 00:31:21,491 node. So it's pretty easy to delete a key value 362 00:31:21,491 --> 00:31:26,481 pair in an R-way tri. And that's unusual for other search 363 00:31:26,481 --> 00:31:33,484 structures that we saw in the deletion was a significant challenge to implement 364 00:31:33,484 --> 00:31:37,350 often. So our challenge is to find a way to use, 365 00:31:37,594 --> 00:31:41,912 less memory. Be, particularly because, nowadays, many 366 00:31:41,912 --> 00:31:47,533 strings are built with Unicode. And the number of null links in a tri 367 00:31:47,533 --> 00:31:52,746 would be truly huge. Again we talked about this, with rated 368 00:31:52,746 --> 00:31:56,820 sorting. A lot a programs broke when this. 369 00:31:56,820 --> 00:32:03,190 Switch went from ASCII to Unicode. In any program that use this reposession, 370 00:32:03,190 --> 00:32:10,540 representation for Tries, certainly broke. So that's an introduction to Tries with 371 00:32:10,540 --> 00:32:11,520 R-way Tries.