In this video we'll begin our discussion of hash tables; we'll focus first on the supported operations, and on some of the canonical applications. So hash tables are insanely useful. If you want to be a serious programmer or a computer scientist, you really have no choice but to learn about hash tables. I'm sure many of you have used them in your own programs in the past, in fact. Now, on the one hand, what's funny is they don't actually do that many things, in terms of the number of supported operations, but what they do, they do really, really well. So what is a hash table? Well, conceptually, ignoring all of the aspects of the implementation, you may want to think of a hash table as an array. One thing that arrays do super well is support immediate random access. If you're wondering what's in position number 17 of some array, boom, with a couple of machine instructions you can find out. Want to change the contents of position number 23 in some array? Done, in constant time. So let's think about an application in which you want to remember your friends' phone numbers. If you're lucky, your friends' parents were all unusually unimaginative people, and all of your friends' names are integers, let's say between 1 and 10,000. If this is the case, then you can just maintain an array of length 10,000. And to store the phone number of, say, your best friend, 173, you can just use position 173 of this modestly sized array. So this array-based solution would work great, even if your friends change over time, you gain some here, you lose some there, as long as all your friends' names happen to be integers between 1 and 10,000.
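Here's a minimal sketch of that array-based idea in Python, with made-up names and numbers just for illustration:

    # Direct-address table: friends' "names" are integers in 1..10,000,
    # so an array of length 10,001 gives constant-time store and lookup.
    phone_book = [None] * 10_001          # index 0 unused, for simplicity

    def store(name, number):
        phone_book[name] = number         # O(1) write by index

    def lookup(name):
        return phone_book[name]           # O(1) read by index

    store(173, "555-0173")                # your best friend, 173
    print(lookup(173))                    # -> 555-0173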
Now, of course, your friends have more interesting names: Alice, Bob, Carol, whatever. And last names as well. So in principle you could have an array with one position for every conceivable name you might encounter, of at most 30 letters, say. But of course this array would be way too big. It would have something like 26 raised to the 30th power positions, and you could never implement it. So what you'd really want is an array of reasonable size, say, ballpark, the number of friends that you'd ever have, so in the thousands or something, where its positions are indexed not by integers between 1 and 10,000, but rather by your friends' names. And what you'd like is random access to this array based on your friend's name. So you just look up the quote-unquote Alice position of this array and, boom, there would be Alice's phone number, in constant time. And this, on a conceptual level, is basically what a hash table can do for you. There's a lot of magic happening under the hood of a hash table, and that's something we'll discuss to some extent in other videos. You have to have a mapping between the keys that you care about, like your friends' names, and numerical positions of some array. That's done by what's called a hash function. But properly implemented, this is the kind of functionality that a hash table gives you: it's like an array with its positions indexed by the keys that you're storing. So you can think of the purpose of a hash table as maintaining a possibly evolving set of stuff, where of course the set of things that you're maintaining will vary with the application; it can be any number of things. If you're running an e-commerce website, maybe you're keeping track of transactions. Maybe you're keeping track of people, like, for example, your friends and various data about them. Maybe you're keeping track of IP addresses, for example if you want to know who the unique visitors to your website were. And so on. So, a little more formally, here are the basic operations. You need to be able to insert stuff into a hash table. In many, but not all, applications, you need to be able to delete stuff as well. And typically the most important operation is lookup. And for all three of these operations, you do it in a key-based way, where as usual a key is just a unique identifier for the record that you're concerned with. So, for example, for employees you might be using Social Security numbers; for transactions you might have a transaction ID number; and IP addresses could act as their own key. Sometimes all you're doing is keeping track of the keys themselves. So, for example, with IP addresses, maybe you just want to remember a list of IP addresses; you don't actually have any associated data. But in many applications, along with the key comes a bunch of other stuff. So along with the employee's Social Security number, you've got to remember a bunch of other data about that employee. But when you do the insert, when you do the delete, when you do the lookup, you do it based
on this key. And then, for example, on lookup, you feed the key into the hash table and the hash table will spit back out all of the data associated with that key. We sometimes hear people refer to data structures that support these operations as a dictionary. So the main thing the hash table is meant to support is lookup, in the spirit of a dictionary. I find that terminology a little misleading, actually. Most dictionaries that you'll find are in alphabetical order, so they'll support something like binary search. And I want to emphasize that something a hash table does not do is maintain an ordering on the elements that it stores. So if you're storing stuff and you do want order-based operations, you want to find the minimum or the maximum, or something like that, a hash table's probably not the right data structure. You want something more; you want to look at a heap or a search tree. But for applications in which all you have to do is basically look stuff up, you've got to know what's there and what's not, then there should be a light bulb that goes off in your head, and you can say: let me consider a hash table, that's probably the perfect data structure for this application.
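In Python, for example, the built-in dict is exactly this kind of key-based structure; here's a minimal sketch of the three operations, with made-up data:

    # Python's built-in dict is a hash table: each key maps to associated data.
    phone_book = {}

    phone_book["Alice"] = "555-0101"      # insert (key -> associated data)
    phone_book["Bob"] = "555-0202"

    print(phone_book.get("Alice"))        # lookup by key: 555-0101
    print(phone_book.get("Carol"))        # key not present: None

    del phone_book["Bob"]                 # delete by key

    # If all you're keeping track of is the keys themselves (say, IP addresses),
    # a set plays the same role with no associated data.
    seen_ips = {"203.0.113.7"}
    print("203.0.113.7" in seen_ips)      # constant-time lookup: True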
Now, looking at this menu of supported operations, you may be left kind of unimpressed. All right, so a hash table, in some sense, doesn't do that many things; but again, what it does, it does really, really well. So, to a first order, what hash tables give you is the following amazing guarantee: all of these operations run in constant time. And again, this is in the spirit of thinking of a hash table as just like an array, where its positions are conveniently indexed by your keys. Just like an array supports random access in constant time, you can see whether there's anything in an array position and what it is, similarly a hash table will let you look up, based on the key, in constant time. So what's the fine print? Well, there are basically two caveats. The first is that hash tables are easy to implement badly, and if you implement them badly you will not get this guarantee. So this guarantee is for properly implemented hash tables. Now, of course, if you're just using a hash table from a well-known library, it's probably a pretty good assumption that it's properly implemented. You'd hope. But in the event that you're forced to come up with your own hash table and your own hash function, and, unlike many of the other data structures we'll talk about, some of you probably will have to do that at some point in your career, then you'll get this guarantee only if you implement it well. And we'll talk about exactly what that means in other videos. The second caveat is that, unlike most of the problems that we've solved in this course, hash tables don't enjoy worst-case guarantees. You cannot say for a given hash table that for every possible data set you're going to get constant time. What's true is that for non-pathological data, you will get constant-time operations in a properly implemented hash table. We'll talk about both of these issues a bit more in other videos, but for now the high-order bits are: hash tables, constant-time performance, subject to a couple of caveats. So now that I've covered the operations that hash tables support and the recommended way to think about them, let's turn our attention to some applications. All of these applications are going to be, in some sense, kind of trivial uses of hash tables, but they're also all really practical; these come up all the time. So the first application we'll discuss, which again is a canonical one, is removing duplicates from a bunch of stuff, also known as the deduplication problem. In the deduplication problem, the input is essentially a stream of objects, where when I say a stream I have two different things in mind as canonical examples. First, you can imagine you have a huge file, so you have a log of everything that happened on some website you're running, or all of the transactions that were made in a store on some day, and you do a pass through this huge file; you're just in the middle of some outer for loop, going line by line through this massive file. The other example of a stream that I have in mind is where you're getting new data over time. Here, you might imagine that you're running software deployed on an Internet router, and data packets are coming through this router at an extremely fast rate, and you might be looking at, say, the IP address of the sender of each data packet that's going through your router. So that would be another example of a stream of objects. And now, what do you have to do? What you have to do is ignore the duplicates.
So, remember just the distinct objects that you see in this stream. And I hope you find it easy to imagine why you might want to do this task in various applications. For example, if you're running a website, you might want to keep track of the distinct visitors that you ever saw in a given day or a given week. If you're doing something like a web crawl, you might want to identify duplicate documents and only remember them once. For example, it would be annoying if, in search results, both the top link and the second link led to identical pages at different URLs; search engines obviously want to avoid that, so you want to detect duplicate web pages and only report unique ones. And the solution using a hash table is laughably simple. Every time a new object arrives in the stream, you look it up. If it's there, then it's a duplicate and you ignore it. If it's not there, then this is a new object and you remember it. QED, that's it. And then after the stream completes, so for example after you finish reading some huge file, if you just want to report all of the unique objects, hash tables generally support a linear scan through them, and you can just report all of the distinct objects when the stream finishes.
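Here's what that solution might look like as a Python sketch, using a set (a hash table of keys only) and treating the stream as any iterable; the sample log entries are made up:

    def distinct(stream):
        """Yield each object the first time it appears in the stream."""
        seen = set()                      # hash table of objects seen so far
        for obj in stream:
            if obj in seen:               # constant-time lookup
                continue                  # duplicate: ignore it
            seen.add(obj)                 # new object: remember it
            yield obj

    log = ["1.2.3.4", "5.6.7.8", "1.2.3.4", "9.9.9.9", "5.6.7.8"]
    print(list(distinct(log)))            # ['1.2.3.4', '5.6.7.8', '9.9.9.9']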
So let's move on to a second application, slightly less trivial maybe, but still quite easy; and this is the subject of Programming Project number five. This is a problem called the two-sum problem. You're given as input an array of n integers, in no particular order. You're also given a target sum, which I'll call t. And what you want to know is: are there two integers from amongst the n you were given that sum to t? Now, the most obvious and naive way to solve this problem is just to go over all n-choose-2 pairs of integers in the input and check each one separately. That's clearly a quadratic-time algorithm. But now, of course, we need to ask: can we do better? And yes, we can. First of all, let's see what you'd do if you couldn't use any data structures. If you were clever, but you didn't use any data structures like a hash table, here would be a reasonable improvement over the naive solution. The first step of this better solution is to sort A up front, for example using merge sort or heapsort, something that runs in n log n time. You may be asking about the motivation for sorting. Well, one thing is just that whenever you're trying to do better than n squared, you might think that sorting your data somehow helps, and you can do it almost for free, in n log n time. Now, why would sorting the array up front help us? Well, the clever insight is that for each entry of the array A, say the first entry, we now know what we're looking for to achieve the given target. If the target sum we're trying to hit is 100 and the first entry in the sorted array is 43, then we know we're looking for a 57 somewhere else in this now-sorted array. And we know that searching a sorted array is pretty easy: that's just binary search, which takes logarithmic time. So for each of the n array entries, we can look for a complementary entry; namely, for each x we can look for t - x using binary search, and each binary search takes log n time. The sorting up front speeds up this entire batch of n searches; that's why it's a win. So, in the second step, because we do a linear number of binary searches, it's just n, the number of searches, times log n, the time per search. So this is just another Theta(n log n) term. All right, so that's pretty cool. I don't think you could come up with this n log n solution without having some basic facility with algorithms, and it's already a really nice improvement over the naive n squared.
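A sketch of this O(n log n) approach in Python, using the standard bisect module for the binary search; the sample numbers are made up:

    from bisect import bisect_left

    def two_sum_sorted(a, t):
        """Return True if two entries of a (at distinct positions) sum to t."""
        a = sorted(a)                             # n log n sort up front
        for i, x in enumerate(a):
            j = bisect_left(a, t - x)             # binary search for t - x
            if j < len(a) and a[j] == t - x and j != i:
                return True
            # t - x may equal x itself; a second copy still counts as a pair
            if j == i and j + 1 < len(a) and a[j + 1] == t - x:
                return True
        return False

    print(two_sum_sorted([43, 1, 57, 12], 100))   # True (43 + 57 = 100)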
But we can do even better. There's no reason we're stuck with n log n for the two-sum problem. Obviously, because the array is unsorted, we have to look at all the integers, so we're not going to do better than linear time. But we can do linear time via a hash table. So a good question you might ask at this point is: what's the clue about this problem, about this task, that suggests we want to use a hash table? Well, hash tables are going to dramatically speed up any application where the bulk of the work is just repeated lookups. And if we examine this n log n solution, once we have the idea of doing a search for t - x for each value of x, we realize that the only thing we needed the sorted array for was to support lookups. That's all binary search is doing here, just looking stuff up. So we say, aha, all of the work in step two is from repeated lookups; we're paying a relatively exorbitant logarithmic amount of time per lookup, whereas hash tables can do them in constant time. So, repeated lookups, ding ding ding, let's use a hash table; and indeed, that's what gives us linear time for this problem. So from the amazing guarantee of hash tables, we get the following amazing solution for the two-sum problem, although again this is subject to the same fine print: you'd better use a properly implemented hash table, and you'd better not have pathological data. So rather than sorting, you just insert everything in the array into a hash table. Insertions take constant time, so this is going to be linear time rather than the n log n we were paying before. Once all the stuff is in the hash table, we just do the same thing as in the n log n solution: for each x in the array, we look for its matching element, t - x, in the hash table, using the constant-time lookup operation supported by the hash table. And of course, if for some x you do find the matching element t - x, then you can just report x and t - x; that proves there is indeed a pair of integers with target sum t. If for every single element of the input array A you fail to find the matching element t - x in the hash table, then for sure there is no pair of integers in the input that sums to t. So this solves the problem correctly. Moreover, with constant-time insertion the first step takes O(n) time, and with constant-time lookup the second step is also linear time, again subject to the caveats that we discussed on the previous slide.
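And here is the same idea as a Python sketch, with the binary searches replaced by hash-table lookups (again, the sample numbers are made up):

    def two_sum_hash(a, t):
        """Return a pair of entries of a summing to t, or None, in linear expected time."""
        counts = {}                         # hash table: value -> number of occurrences
        for x in a:                         # constant-time inserts
            counts[x] = counts.get(x, 0) + 1
        for x in a:                         # constant-time lookups
            y = t - x
            if y in counts and (y != x or counts[x] > 1):
                return (x, y)               # found a pair with the target sum
        return None

    print(two_sum_hash([43, 1, 57, 12], 100))   # (43, 57)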
So it's kind of amazing how many different applications of computer science boil down, in their essence, to repeated lookup operations. Therefore, having a super fast lookup operation, like that supported by a hash table, permits these applications to scale to fantastic sizes. It's really amazing, and it drives a lot of modern technology. So let me just mention a couple of examples; again, if you look around or do some research on the web, you'll quickly find many more. Originally, what prompted researchers to think hard about data structures that support super fast lookups was back when people were first building compilers. So this is a long time ago, in the fifties or so. These repeated lookups to figure out what has and has not been defined before were emerging as a bottleneck in compilers, back in the early days of programming languages. And that was one of the early applications of hash tables: to support super fast lookups to speed up compile time, to keep track of function and variable names and things like that. Hash table technology is also super useful for software on routers in the Internet. So, for example, you might want to block network traffic from certain sources. Maybe you suspect that a certain IP address has been taken over by spammers, and so any traffic coming from that IP address you just want to ignore. And you don't even want to let it get to the end host, to the computer on someone's desktop or to someone's mobile device; rather, inside the Internet, you want to just drop packets that are coming from certain senders. So what does that problem boil down to? Well, you might have a blacklist of IP addresses that you're refusing traffic from, and then the task faced by the router is really the lookup problem. A data packet comes in at some insanely fast data rate, and you immediately just look up: is this source in the blacklist or not? If it is in the blacklist, then you drop the packet; if it's not, then you let it go through.
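As a toy Python sketch of that router logic (the addresses and packet contents here are made up):

    # Blacklisted sources, stored in a hash table (a Python set) for constant-time lookups.
    blacklist = {"198.51.100.23", "203.0.113.99"}

    def handle_packet(source_ip, packet):
        if source_ip in blacklist:        # constant-time membership test
            return None                   # drop the packet
        return packet                     # otherwise let it through

    print(handle_packet("203.0.113.99", b"payload"))   # None (dropped)
    print(handle_packet("192.0.2.1", b"payload"))      # b'payload'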
So a very different application is for speeding up search algorithms. And when I say a search algorithm, what I'm thinking about here is something like a chess-playing program, so something that does game-tree exploration. We've already talked a fair amount about graph search in this class, but in our discussion of breadth-first and depth-first search, we were thinking about graphs that you could basically write down. You could store them in the main memory of your machine or, in the worst case, on some big cluster; so maybe graphs about the size of the web graph, or possibly smaller. But in the context of something like a chess-playing program, the graph that you're interested in is way, way, way bigger than the web graph. So what's the graph we care about for a chess-playing program? Well, the nodes of the graph correspond to all possible configurations of chess pieces on a chessboard, every chessboard that you might ever encounter in a game of chess. That's a massive, massive number of configurations, and you're never going to be able to write down these vertices. The edges in this graph take you from one configuration to another, and they correspond to legal moves. So if you can move a bishop from one place to another place, and you get from one configuration to another configuration, there's an edge in the graph corresponding to that move. Now, you can't write down this graph, so you can't implement breadth-first or depth-first search exactly as we discussed it before. But you'd still like to do graph exploration, right? You'd like to have your computer program reason about at least the short-term ramifications of your possible next move, and that corresponds to searching through this graph. Now, remember that in graph search a really important property was that you don't do redundant work; you don't want to re-explore things you've already explored. That would be silly and might lead to loops and so on. And since you can't write down the graph, just remembering where you've been is suddenly a non-trivial problem. But what is remembering where you've been, fundamentally? Fundamentally, that's a lookup operation, so that is exactly what hash tables are for. To be a little more concrete, one way you'd use a hash table in, say, a chess-playing program: you'd take the initial configuration, you'd imagine trying all possible moves from this configuration, then all moves from your opponent, and then all your moves in response. And as you were doing this reasoning, you would always remember every chessboard configuration you'd ever looked at before, and you'd stick it in the hash table. And before you go exploring some configuration, you'd look it up in your hash table to see if you've already explored it in the past; and if you have, you don't bother, you've already seen it. So chess-playing programs operate by exploring systematically as many configurations as they have time for. Obviously, within a budget of three minutes or whatever, you don't want to waste any work exploring any given configuration more than once. How do you remember where you've been? Everything you've explored, you stick in a hash table; before you explore a configuration, you look it up in the hash table and see if you've already done it.
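Here's a schematic Python sketch of that bookkeeping; the game itself (positions, legal moves) is a made-up stand-in, since the point is just the visited-set lookup:

    def explore(position, legal_moves, apply_move, depth, visited):
        """Explore the game tree, never revisiting a configuration already seen."""
        if position in visited or depth == 0:
            return                                # already explored, or out of budget
        visited.add(position)                     # constant-time insert: remember it
        for move in legal_moves(position):
            explore(apply_move(position, move), legal_moves, apply_move,
                    depth - 1, visited)

    # Toy stand-in "game": positions are integers, and a move adds 1 or 2.
    visited = set()
    explore(0, lambda p: [1, 2], lambda p, m: p + m, depth=4, visited=visited)
    print(sorted(visited))                        # [0, 1, 2, 3, 4]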
So these, of course, are just scratching the surface. I just wanted to highlight a couple of fairly different-looking applications to convince you that hash tables come up all the time. And the reason they come up all the time is that the need for fast lookups comes up all the time. It's kind of amazing how much technology is being driven just by repeated fast lookups. So, as homework, I encourage you to think about your own life, or about technology out there in the world, and come up with some guesses about where hash tables are probably making something run blazingly fast. I think it won't take you more than a few minutes to come up with some good examples.