1 00:00:00,000 --> 00:00:04,019 In this sequence of videos, we'll discuss our last but not least data structure 2 00:00:04,019 --> 00:00:08,011 namely the Balanced Binary Search Tree. Like our discussion of other data 3 00:00:08,011 --> 00:00:12,036 structures we'll begin with the what. That is we'll take the client's perspective and 4 00:00:12,036 --> 00:00:16,076 we'll ask what operations are supported by this data structure, what can you actually 5 00:00:16,076 --> 00:00:21,011 use it for? Then we'll move on to the how and the why. We'll peer under the hood of 6 00:00:21,011 --> 00:00:24,082 the data structure and look at how it's actually implemented and then 7 00:00:24,082 --> 00:00:28,080 understanding the implementation to understand why the operations have the 8 00:00:28,080 --> 00:00:33,052 running times that they do. So what is a Balanced Binary Search Tree good for? 9 00:00:33,052 --> 00:00:38,025 Well, I recommend thinking about it as a dynamic version of a sorted array. That 10 00:00:38,025 --> 00:00:43,015 is, if you have data store in a Balanced Binary Search Tree, you can do pretty much 11 00:00:43,015 --> 00:00:48,005 anything on the data that you could if it was just the static sorted array. But in 12 00:00:48,005 --> 00:00:52,240 addition, the data structure can accommodate insertions and deletions. You 13 00:00:52,240 --> 00:00:56,645 can accommodate a dynamic set of data that you're storing overtime. So to motivate 14 00:00:56,645 --> 00:01:01,294 the operations that a Balanced Binary Search Tree supports, let's just start 15 00:01:01,294 --> 00:01:05,282 with the sorted array and look at some of the things you can easily do with data 16 00:01:05,282 --> 00:01:09,311 that happens to be stored in such a way. So let's think about an array that has 17 00:01:09,311 --> 00:01:14,045 numerical data although, generally as we've said, in data structures is usually 18 00:01:14,045 --> 00:01:18,098 associated other data that's what you actually care about and the numbers are 19 00:01:18,098 --> 00:01:23,074 just some unique identifier for each of the records. So these might be an employee 20 00:01:23,074 --> 00:01:27,897 ID number, social security numbers, packet ID numbers and network contacts, etcetera. 21 00:01:27,897 --> 00:01:32,646 So what are some things that are easy to do given that your data is stored as a 22 00:01:32,646 --> 00:01:36,806 sorted array, most a bunch of things? First of all, you can search and recall 23 00:01:36,806 --> 00:01:40,857 that searching in a sorted array is generally done using binary search so this 24 00:01:40,857 --> 00:01:44,893 is how we used to look up phone numbers when we have physical phone books. You'd 25 00:01:44,893 --> 00:01:49,082 start in the middle of the phone book, if the name you were looking for was less 26 00:01:49,082 --> 00:01:53,115 than the midpoint, you recurse on the left hand side, otherwise you'd recurse 27 00:01:53,115 --> 00:01:57,202 on the right hand side. As we discussed back in the Master Method Lectures long ago, 28 00:01:57,202 --> 00:02:01,210 this is going to run in logarithmic time. Roughly speaking, every time you recurse, 29 00:02:01,210 --> 00:02:05,710 you've thrown out half of the array so you're guaranteed to terminate within a 30 00:02:05,710 --> 00:02:10,018 logarithmic number of iterations so binary search is logarithmic search time. 31 00:02:10,018 --> 00:02:13,494 Something else we discussed in previous lectures is the selection problem. So 32 00:02:13,494 --> 00:02:16,933 previously, we discussed this in much harder context of unsorted arrays. 33 00:02:16,933 --> 00:02:20,706 Remember, the selection problem in addition to array you're given in order 34 00:02:20,706 --> 00:02:24,715 statistic. So, if your order statistic that your target is seventeen, that means 35 00:02:24,715 --> 00:02:28,848 you're looking for the seventeenth smallest number that's stored in the 36 00:02:28,848 --> 00:02:33,516 array. So in previous lectures, we worked very hard to get a linear time algorithm 37 00:02:33,516 --> 00:02:37,816 for this problem in unsorted arrays. Now, in a sorted array, you want to know the 38 00:02:37,816 --> 00:02:41,800 seventeenth smallest element in the array. Pretty easy problem, just return whatever 39 00:02:41,800 --> 00:02:46,505 element happens to be in the seventeenth position of the array since the array is sorted, 40 00:02:46,505 --> 00:02:50,662 that's where it is so no problem. It's already sorted constant time, you can 41 00:02:50,662 --> 00:02:54,833 solve the selection problem. Of course, two special cases of the selection problem 42 00:02:54,833 --> 00:02:59,648 are finding the minimum element of the array. That's just if the order statistic 43 00:02:59,648 --> 00:03:04,208 problem with i = 1and the maximum element, that's just i = n. So this just 44 00:03:04,208 --> 00:03:08,782 corresponds to returning the element that's in the first position and the last 45 00:03:08,782 --> 00:03:14,269 position of the array respectively. Well let's do some more brainstorming. What 46 00:03:14,269 --> 00:03:18,526 other operations could we implement on a sorted array? Well here's a couple more. 47 00:03:18,526 --> 00:03:23,291 So there are operations called the Predecessor and Successor operations. And 48 00:03:23,291 --> 00:03:27,480 so the way these work is, you start with one element. So, say you start with a 49 00:03:27,480 --> 00:03:32,133 pointer to the 23, and you want to know where in this array is the next smallest 50 00:03:32,133 --> 00:03:36,285 element. That's the predecessor query and the successor operation returns the next 51 00:03:36,285 --> 00:03:40,445 largest element in the array. So the predecessor of the 23 is the seventeen, 52 00:03:40,450 --> 00:03:45,142 the successor of the 23 would be the 30. And again in a sorted array, these are 53 00:03:45,142 --> 00:03:48,845 trivial, right? You just know that predecessors just one position back in the 54 00:03:48,845 --> 00:03:53,151 array, the successor is one position forward. So given a pointer to the 23, you 55 00:03:53,151 --> 00:03:59,104 can return to 17 or the 30 in constant time. What else? Well, how about 56 00:03:59,104 --> 00:04:04,605 the rank operation? So we haven't discussed this operation in the past. So 57 00:04:04,605 --> 00:04:09,704 what rank is, this has for how many key stored in the data structure are less than 58 00:04:09,704 --> 00:04:14,352 or equal to a given key. So for example, the rank of 23 would be equal to 6. 59 00:04:14,352 --> 00:04:19,555 Because 6 of the 8 elements in the array are less than or equal to 23. And if you think about it, 60 00:04:19,555 --> 00:04:24,073 implementing the rank operation is really no harder than implementing search. All 61 00:04:24,073 --> 00:04:28,448 you do is search for the given key and wherever it is search terminates in the 62 00:04:28,448 --> 00:04:32,416 array. You just look at the position in the array and boom, that's the rank of 63 00:04:32,416 --> 00:04:36,037 that element. So for example, if you do a binary search for 23 and then when you 64 00:04:36,037 --> 00:04:40,650 terminates, you discover it is, they're in position number six then you know the rank 65 00:04:40,650 --> 00:04:45,165 is six. If you do an unsuccessful search, say you search for 21, well then you get 66 00:04:45,165 --> 00:04:49,627 stuck in between the 17 and the 23, and at that point you can conclude that 67 00:04:49,627 --> 00:04:54,542 the rank of 21 in this array is five. Let me just wrap up the list with the final 68 00:04:54,542 --> 00:04:59,516 operation which is trivial to implement in the sorted array. Namely, you can output 69 00:04:59,516 --> 00:05:04,684 or print say the stored keys in sorted order let's say from smallest to largest. 70 00:05:04,684 --> 00:05:10,301 And naturally, all you do here is a single scan from left to right through the array, 71 00:05:10,301 --> 00:05:15,724 outputting whatever element you see next. The time required is constant per element 72 00:05:15,724 --> 00:05:20,508 or linear overall. So that's a quite impressive list of supported operations. 73 00:05:20,508 --> 00:05:25,077 Could you really be so greedy as to want still more from our data structure? Well 74 00:05:25,077 --> 00:05:29,583 yeah, certainly. We definitely want more than just what we have on the slide. The 75 00:05:29,583 --> 00:05:34,231 reason being, these are operations that operate on a static data set which is not 76 00:05:34,231 --> 00:05:38,398 changing overtime. But the world in general is dynamic. For example, if you 77 00:05:38,398 --> 00:05:42,930 are running a company and keeping track of the employees, sometimes you get new 78 00:05:42,930 --> 00:05:47,151 employees, sometimes employees leave. That is one of the data structure that 79 00:05:47,151 --> 00:05:51,354 not only supports these kinds of operations but also, insertions and 80 00:05:51,354 --> 00:05:56,305 deletions. Now of course it's not that it's impossible to implement insert or 81 00:05:56,305 --> 00:06:00,865 delete in a sorted array, it's just that they're going to run way too slow. In 82 00:06:00,865 --> 00:06:04,957 general, you have to copy over a linear amount of stuff on an insertion or 83 00:06:04,957 --> 00:06:10,216 deletion if you want to maintain the sorted array property. So this linear time 84 00:06:10,216 --> 00:06:15,523 performance when insertion and deletion is unacceptable unless you barely ever do 85 00:06:15,523 --> 00:06:19,698 those operations. So, the raison d'etre of the Balanced Binary Search Tree 86 00:06:19,698 --> 00:06:23,715 is to implement this exact same set of operations just as rich as that's 87 00:06:23,715 --> 00:06:28,454 supported by a sorted array but in addition, insertions and deletions. Now, a 88 00:06:28,454 --> 00:06:33,575 few of these operations won't be quite as fast or we have to give up a little bit 89 00:06:33,575 --> 00:06:37,099 instead of constant time, the one in logarithmic time and we still got 90 00:06:37,099 --> 00:06:42,077 logarithmic time for all of these operations, linear time for outputting the 91 00:06:42,077 --> 00:06:47,066 elements in sort of order plus, we'll be able to insert and delete in logarithmic 92 00:06:47,066 --> 00:06:53,015 time so let me just spell that out in a little more detail. So, a Balanced Binary 93 00:06:53,015 --> 00:06:57,863 Search Tree will act like a sorted array plus, it will have fast, meaning 94 00:06:57,863 --> 00:07:03,186 logarithmic time inserts and deletes. So let's go ahead and spell out all of those 95 00:07:03,186 --> 00:07:09,040 operations. So search is going to run in O(log n) time, just like before. Select runs 96 00:07:09,040 --> 00:07:13,265 in constant time in a sorted array and here it's going to take logarithmic, so 97 00:07:13,265 --> 00:07:17,253 we'll give up a little bit on the selection problem but we'll still be able 98 00:07:17,253 --> 00:07:21,373 to do it quite quickly. Even on the special cases of finding the minimum or 99 00:07:21,373 --> 00:07:25,644 finding the maximum in our, in our data structure, we're going to need logarithmic 100 00:07:25,644 --> 00:07:30,042 time in general. Same thing for finding predecessors and successors they're not, 101 00:07:30,042 --> 00:07:35,029 they're no longer constant time, they go with logarithmic. Rank took as logarithmic 102 00:07:35,029 --> 00:07:40,009 time and the, even the sorted array version and that will remain logarithmic 103 00:07:40,009 --> 00:07:45,094 here. As we'll see, we lose essentially nothing over the sorted array, if we want 104 00:07:45,094 --> 00:07:49,464 to output the key values in sorted order say from smallest to largest. And 105 00:07:49,464 --> 00:07:53,965 crucially, we have two more fast operations compared to the sorted array of 106 00:07:53,965 --> 00:07:59,039 data structure. We can insert stuff so if you hire a new employee, you can insert 107 00:07:59,039 --> 00:08:03,129 them into your data structure. If an employee decides to leave, you can remove 108 00:08:03,129 --> 00:08:07,979 them from the data structure. You do not have to spend linear time like you did for 109 00:08:07,979 --> 00:08:13,099 sort of array, you only have to spend the logarithmic time whereas always n is the 110 00:08:13,099 --> 00:08:17,969 number of keys being stored in the data structure. So the key takeaway here is 111 00:08:17,969 --> 00:08:22,143 that, if you have data and it has keys which come from a totally ordered set 112 00:08:22,143 --> 00:08:26,300 like, say numeric keys, then a Balanced Binary Search Tree supports a very rich 113 00:08:26,300 --> 00:08:29,969 collection of operations. So if you anticipate doing a lot of different 114 00:08:29,969 --> 00:08:34,524 processing using the ordering information of all of these keys, then you really 115 00:08:34,524 --> 00:08:38,849 might want to consider a Balanced Binary Search Tree to maintain them. Well then, 116 00:08:38,849 --> 00:08:42,826 keep in mind though is that we have seen a couple of other data structures which 117 00:08:42,826 --> 00:08:47,403 don't do quite as much as balanced binary search trees but what they do, they do 118 00:08:47,403 --> 00:08:51,638 better. We already, we just discussed in the last slide of the sorted array. So, if 119 00:08:51,638 --> 00:08:55,739 you have a static data set, you don't need inserts and deletes. Well then by all 120 00:08:55,739 --> 00:08:59,561 means, don't bother with Balanced Binary Search Tree that use a sorted array 121 00:08:59,561 --> 00:09:03,606 because it will do everything super fast. But, we also sought through dynamic data 122 00:09:03,606 --> 00:09:07,945 structures which don't do as much but do it, but what they do, they do very well. 123 00:09:07,945 --> 00:09:12,105 So, we saw a heap, so what the heap is good for is it's just as dynamic as a 124 00:09:12,105 --> 00:09:16,642 search tree. It allows insertions and deletions both in logarithmic time. And in 125 00:09:16,642 --> 00:09:21,094 addition, it keeps track of the minimum element or the maximum element. Remember 126 00:09:21,094 --> 00:09:25,452 in a heap, you can choose whether you want to keep track of the minimum or keep track 127 00:09:25,452 --> 00:09:29,824 of the maximum but unlike in a search tree, a heap does not simultaneously keep 128 00:09:29,824 --> 00:09:34,706 track of the minimum and the maximum. So if you just need those three operations, 129 00:09:34,706 --> 00:09:39,313 insertions, deletions and remembering the smallest, and this would be the case for 130 00:09:39,313 --> 00:09:43,370 example in a priority queue or scheduling application as discussed in the heap 131 00:09:43,370 --> 00:09:48,080 videos. Then, a Binary Search Tree is over kill. You might want to consider a heap 132 00:09:48,080 --> 00:09:52,046 instead. In fact, the benefits of a heap don't show up in the big O notation here 133 00:09:52,046 --> 00:09:56,697 both have logarithmic operation time but the constant factors both in space and 134 00:09:56,697 --> 00:10:00,941 time are going to be faster with a heap then with a Balanced Binary Search Tree. 135 00:10:00,941 --> 00:10:05,867 The other dynamic data structure that we discussed is a hash table. And what hash 136 00:10:05,867 --> 00:10:10,266 tables are really, really good at is handling insertions and searches, that is 137 00:10:10,266 --> 00:10:14,242 look ups. Some, sometimes, depending on the implementation also handle deletions 138 00:10:14,242 --> 00:10:18,132 really well also. So, if you don't actually need to remember things like 139 00:10:18,132 --> 00:10:22,243 minima, maxima or remember ordering information on the keys, you just have to 140 00:10:22,243 --> 00:10:27,445 remember what's there and what's not. Then the data structure of choice is definitely 141 00:10:27,445 --> 00:10:31,469 the hash table, not the balance binary search tree. Again, the Balance Binary Search 142 00:10:31,469 --> 00:10:35,239 Tree would be fine and we'd give you logarithmic look up time but it's kind of 143 00:10:35,239 --> 00:10:39,926 over kill for the problem. All you need is fast look ups. A hash table recall will 144 00:10:39,926 --> 00:10:44,623 give you constant time look ups. So that will be a noticeable win over the Balanced 145 00:10:44,623 --> 00:10:49,032 Binary Search Tree. But if you want a very rich set of operations for processing your 146 00:10:49,032 --> 00:10:54,010 data. Then, the Balanced Binary Search Tree could be the optimal data structure 147 00:10:54,010 --> 00:10:55,003 for your needs.