In this next sequence of videos we're going to study a slightly more intricate application of the dynamic programming paradigm, namely to the problem of computing optimal search trees: search trees that minimize the average search time with respect to a given set of probabilities over the keys.

I'm going to assume in these videos that you remember the basics of the search tree data structure, so if you need to jog your memory, you might want to review the relevant video from part one. A search tree contains objects, and each object has a key drawn from some totally ordered set. The search tree property states that at each node of a search tree, say one containing an object with key value x, it had better be the case that everything in the left subtree of that node has keys less than x, and everything in the right subtree of that node has keys bigger than x. That has to be true simultaneously at every single node of the search tree.

The motivation for the search tree property is that searching in a search tree just involves following your nose, just like binary search in a sorted array. So if you're looking for, say, an object with key 17, you know whether to go left or right from the root just based on the root's key value.
If the root has key value 12, you know that 17, if it exists, has to be in the right subtree, so you recursively search the right subtree. If the root has key value 23, you know that 17 has to be in the left subtree, so that's where you recursively search.

Something we originally discussed in the context of balanced binary search trees, like red-black trees, and that I'm going to reiterate now, is that for a given set of keys there are many, many valid search trees containing those keys. Just to remind you how this works, suppose there were only three keys in the world: x, y, and z, with x less than y less than z. One obvious search tree would be the balanced one. That would put the middle element y at the root, with left child x and right child z. But there are also the two chain search trees containing these three keys, one with the smallest element x at the root, the other with the largest element z at the root.

So, given the multiplicity of solutions, all of the different search trees one could use to store objects with a bunch of keys, an obvious question is: which one is the best? What's the best search tree to use out of all of the possibilities? I don't blame you if you've got a sense of deja vu.
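As a quick illustration of the follow-your-nose search just described, here is a minimal sketch in Python; the `Node` class and its field names are my own assumptions, not part of the lecture:

```python
class Node:
    """A binary search tree node (hypothetical minimal representation)."""
    def __init__(self, key, left=None, right=None):
        self.key = key
        self.left = left
        self.right = right

def search(node, key):
    """Follow the search tree property: go left for smaller keys,
    right for larger keys, just like binary search in a sorted array."""
    if node is None:
        return None            # key is not in the tree
    if key == node.key:
        return node
    elif key < node.key:
        return search(node.left, key)
    else:
        return search(node.right, key)

# The balanced tree on three keys x < y < z, instantiated here as 10 < 20 < 30:
root = Node(20, left=Node(10), right=Node(30))
```

Searching for 30 descends right from the root exactly as the lecture describes; searching for a key not in the tree, like 17, runs off the bottom and returns `None`.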
We did already ask and answer this question in one sense when we discussed red-black trees. There we argued that the best thing to have is a balanced binary search tree, which keeps the height as small as possible, and therefore the worst-case search time, which is proportional to the height, as small as possible, namely logarithmic in the number of objects in the search tree.

But now let's make the same kind of assumption that we made when we discussed Huffman codes. That is, let's assume that we actually have accurate statistics about how frequently each item in the tree is going to be searched for. Maybe we know that item x is going to be searched for 80% of the time, whereas y and z will only be searched for 10% each. Could we then improve upon the perfectly balanced search tree solution?

Let me make this question more concrete by asking you to compare two candidate solutions. On the one hand, the balanced tree, which has y at the root and x and z as children. On the other hand, the chain, which has x as the root, y as its right child, and then z as the right child of y. So what is the average search time in each of these two search trees, with respect to the search frequencies I told you: 80% for x, 10% for y, and 10% for z?
When I say the search time for a node, I just mean how many nodes you look at en route to discovering what you're looking for, including that last node itself. So the search time for something at the root, for example, would be a search time of just one, because you only look at the root to find it.

All right, so the correct answer to the quiz is the fourth option: 1.9 and 1.3. To see why, let's just compute the average search time in each of the two proposed search trees. In the first one, with y at the root, 80% of the time we suffer a search time of two: whenever we look for x, we have to look at the root y and then at x. So we pay two 80% of the time, and that contributes 1.6. 10% of the time we get lucky and find y at the root, which contributes 0.1. 10% of the time we search for z, which contributes another 0.2, for a total of 1.9.

By contrast, think about the chain that has x at the root. Here 80% of the time we get lucky and only have to pay one for every search for x, so that contributes only 0.8 to the total. It is true that our worst-case search time has actually gone up: when we search for z, we suffer a search time of three, which never happened in the balanced case. But we pay that three only 10% of the time, so it contributes 0.3.
The remaining 10% of the time we suffer a search time of two to look for y. That gives us a total of 1.3.

And the moral of the story, of course, is that this example exposes an interesting algorithmic opportunity. The obvious, quote-unquote, solution of a perfectly balanced search tree need not be the best search tree when the frequency of access is non-uniform. You might want to have unbalanced trees, like this chain, that keep extremely frequently searched items closer to the root, to give them smaller search times. So the obvious question is then: given a bunch of items and known frequencies of access, what is the optimal search tree? Which search tree minimizes the average search time?

That brings us to the formal problem definition. We're given n objects that we've got to store in a search tree, and we're told the frequency of access of each. To keep things simple and the notation straightforward, let's just say the items are named 1, 2, all the way up to n, and that this is also the order of their respective keys. So p1 is the frequency of searching for the item with the smallest key, and so on.

You might wonder where these frequencies come from. How would you know exactly how frequently every possible key will be searched for? It's going to depend on the application.
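To double-check the quiz arithmetic above, here is a small sketch of my own (not from the lecture) that computes the average search time from each item's frequency and its search time in the two candidate trees:

```python
def average_search_time(freqs, search_times):
    """Weighted average: sum over items of frequency times search time."""
    return sum(freqs[i] * search_times[i] for i in freqs)

freqs = {'x': 0.8, 'y': 0.1, 'z': 0.1}

# Balanced tree: y at the root (search time 1), x and z as children (2 each).
balanced = {'x': 2, 'y': 1, 'z': 2}
# Chain: x at the root (1), y its right child (2), z below y (3).
chain = {'x': 1, 'y': 2, 'z': 3}

print(round(average_search_time(freqs, balanced), 6))  # -> 1.9
print(round(average_search_time(freqs, chain), 6))     # -> 1.3
```

The rounding is only to suppress floating-point noise; the two totals match the quiz answer of 1.9 for the balanced tree and 1.3 for the chain.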
There will certainly be applications where you're not going to have these kinds of statistics, and that's where you'd probably want to turn to a general-purpose balanced binary search tree solution, something like a red-black tree, which guarantees that every search is going to be reasonably fast.

But it's not hard to think of applications where you are going to be able to learn pretty accurate statistics about how frequently different things are searched for. One example might be something like a spell checker. You would implement that by storing all of the legally spelled words in a search tree, and as you scan a document, every time you hit a word, you look it up in the search tree to see whether it's correctly or incorrectly spelled. You can imagine that after scanning through a number of documents, you would have pretty good estimates of how frequently things get searched for, and then you could use those estimates to build a highly optimized binary search tree for all future documents. If you're in some other kind of application where you're concerned about these frequencies changing over time, for example if they're subject to trends in the industry, you could imagine rebuilding the search tree every day or every week or whatever, based on the latest statistics that you've got.
In any case, if you're lucky enough to have such statistics, what you're going to want to do is build a search tree which, on the one hand, is valid, meaning it satisfies the search tree property, and on the other hand, makes the average search time as small as possible.

So let me go ahead and write down a formula for the average search time. It's the one you would expect it to be. I'll also introduce some notation, namely capital C of T, which will denote the average search time of a proposed search tree T. For these lectures, we're going to focus on the case where all searches are successful: the only things that ever get searched for are things actually in the tree. But everything we'll talk about in these lectures, including the algorithm, is easily extended to accommodate the case where you also have unsuccessful searches, along with statistics about how frequent the various unsuccessful searches are. If there are only successful searches, then we average only over the n items stored in the tree. So we sum over each of the items i, weight it by the probability or frequency of its access, p_i, and multiply by the search time required in the tree T to find the item i: C(T) = sum over items i of p_i times (search time for i in T).
And as we discussed in the quiz, the search time for a given key i in a given tree T is just the number of nodes you have to visit until you get to the one containing i. If you think about it, that's exactly the depth of i's node in the tree plus one. For example, if you're lucky enough that the key is at the root, the depth of the root is zero, and we're counting that as a search time of one. So it's depth plus one.

One minor point: it's going to be convenient for me not to insist that the p_i's sum to one. Of course, if the p_i's were probabilities, they would sum to one, but I'm going to go ahead and allow them to be arbitrary positive numbers. For the same reason, I'm going to sometimes call capital C of T the weighted search time rather than the average search time, because I won't necessarily be assuming that the p_i's sum to one. With that said, go ahead and think of probabilities as the canonical special case in your mind as we go through these lectures.

For example, in the case where these are probabilities, where the p_i's sum to one, we could always, as a reference point, use a red-black tree as our search tree. But, as we've seen, when these p_i's are not uniform, you can generally do better. And that's the point of this computational problem.
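The definition of C(T), frequency times depth plus one summed over all nodes, can be computed in a single recursive pass over a tree. Here is a sketch of my own (not from the lecture); the tuple-based tree encoding is a hypothetical choice just for illustration:

```python
# A tree is represented as (key, left_subtree, right_subtree),
# with None standing for an empty subtree.

def weighted_search_time(tree, p, depth=0):
    """C(T) = sum over items i of p[i] * (depth of i in T + 1).
    p maps keys to frequencies; they need not sum to one."""
    if tree is None:
        return 0.0
    key, left, right = tree
    return (p[key] * (depth + 1)                      # this node's contribution
            + weighted_search_time(left, p, depth + 1)
            + weighted_search_time(right, p, depth + 1))

p = {'x': 0.8, 'y': 0.1, 'z': 0.1}
balanced = ('y', ('x', None, None), ('z', None, None))
chain = ('x', None, ('y', None, ('z', None, None)))

print(round(weighted_search_time(balanced, p), 6))  # -> 1.9
print(round(weighted_search_time(chain, p), 6))     # -> 1.3
```

Run on the two trees from the quiz, this recovers 1.9 and 1.3 directly from the tree structure, since each node's search time is exactly its depth plus one.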
Exploit the non-uniformities in the given probabilities to come up with the best, possibly unbalanced, search tree.

I'm sure many of you will have noticed some of the similarities between this computational problem of optimal binary search trees and one that we already solved back in the greedy algorithms section, namely Huffman codes, which, amongst all prefix-free binary codes, minimize the average encoding length. So let's be precise about the similarities and differences between the two problems, and in particular why we can't just reuse the algorithm we already saw, off the shelf, to solve the optimal BST problem.

What is, of course, super similar in the two cases is the format of the output. In both problems, the responsibility of the algorithm is to output a binary tree, and the goal is to minimize the average depth, more or less, where the average is with respect to provided frequencies over a bunch of objects that we care about: characters from an alphabet in the case of Huffman codes, and a bunch of objects with keys from some totally ordered set in the binary search tree case. It is true that in the optimal BST case we're not really averaging depths, we're averaging depths plus one, but if you think about it, that's exactly the same thing.
More important is to understand the differences between the problem solved by Huffman codes and the computational problem that we have to solve here. In the Huffman code case, we had to output a binary code, and the key constraint was that it be prefix-free. In the language of trees, what that meant is that the symbols we're encoding had to correspond to the leaves of the tree; symbols could not correspond to internal nodes of the tree that we output. Now, in the optimal binary search tree problem, we do not have this prefix-free constraint, so we're going to have a label, that is, an object, at every single node of our tree, whether it's a leaf or not.

But we have a different, seemingly quite a bit harder, constraint to deal with, namely the search tree property. Remember, back in the Huffman code case, we didn't even have an ordering on the symbols of the alphabet. There wasn't a sense in which one of them was less than another; it wouldn't even have made sense to talk about the search tree property in that context. Here, by contrast, we're given these keys and there's this total ordering on them.
And we'd better satisfy the search tree property in the tree that we output. That is, at every single node of the tree we output, it had better be the case that all keys in the left subtree are less than the key at that node, and all keys in the right subtree are bigger than the key at that node. That's a constraint that we have no choice but to satisfy.

This constraint is harder in the sense that no greedy algorithm, Huffman's algorithm or otherwise, solves the optimal binary search tree problem. Rather, we're going to have to turn to the more sophisticated tool of dynamic programming to design an efficient algorithm for computing optimal binary search trees. That's the solution we'll start developing in the next video.