1 00:00:00,460 --> 00:00:01,450 This video talks about indexes. 2 00:00:02,410 --> 00:00:05,260 It's actually a relatively short video about a very important topic. 3 00:00:05,920 --> 00:00:07,140 Before I get started, however, let 4 00:00:07,290 --> 00:00:08,710 me mention that indexes are also 5 00:00:09,050 --> 00:00:10,900 sometimes prefer two as indices, those are equivalent. 6 00:00:11,460 --> 00:00:13,040 I personally prefer using the term indexes. 7 00:00:14,350 --> 00:00:15,660 The reason indexes are so important 8 00:00:16,020 --> 00:00:16,880 is that they are the primary 9 00:00:17,390 --> 00:00:19,420 way of getting improved performance out of a database. 10 00:00:20,320 --> 00:00:21,730 Indexes are a persistent data structure. 11 00:00:22,100 --> 00:00:23,700 They're stored together with the database itself. 12 00:00:24,740 --> 00:00:25,890 Now, there are many very interesting 13 00:00:26,400 --> 00:00:27,710 implementation issues in indexes, 14 00:00:28,780 --> 00:00:29,920 but in this video and in 15 00:00:30,030 --> 00:00:31,110 this course in general, we're focusing 16 00:00:31,680 --> 00:00:33,120 on the perspective of the user and application. 17 00:00:33,860 --> 00:00:35,160 So, we'll talk about how applications 18 00:00:35,700 --> 00:00:37,140 will use indexes to speed up performance. 19 00:00:38,590 --> 00:00:39,570 So, let's suppose we have a 20 00:00:39,610 --> 00:00:41,490 very simple table T that 21 00:00:41,630 --> 00:00:42,780 has three columns, but we're 22 00:00:42,890 --> 00:00:43,850 going to focus on columns A 23 00:00:43,990 --> 00:00:44,960 and columns B. And we can 24 00:00:45,120 --> 00:00:45,970 see that Column A is a 25 00:00:46,000 --> 00:00:47,640 string valued column with animal 26 00:00:47,850 --> 00:00:49,290 names, and Column B is an integer column. 27 00:00:50,200 --> 00:00:52,980 Now, we're gonna be asking queries that involve conditions over these two columns. 28 00:00:54,340 --> 00:00:55,260 In order to speed up those 29 00:00:55,460 --> 00:00:56,570 queries, if we're concerned about 30 00:00:56,780 --> 00:00:58,440 evaluating conditions on column A, 31 00:00:59,030 --> 00:01:01,320 then we can build an index on that column. 32 00:01:01,660 --> 00:01:02,850 So we call that an index on 33 00:01:03,010 --> 00:01:04,980 column T.A. What that 34 00:01:05,220 --> 00:01:06,160 index allows us to do 35 00:01:06,390 --> 00:01:07,180 - and "us" in this case 36 00:01:07,410 --> 00:01:09,280 is actually the query processor - 37 00:01:09,430 --> 00:01:10,770 is ask questions, for example, 38 00:01:12,020 --> 00:01:13,760 let's ask what tuples have 39 00:01:14,130 --> 00:01:15,210 "cow" in the value of 40 00:01:15,270 --> 00:01:16,730 T.A. If we 41 00:01:17,220 --> 00:01:18,080 ask that question of our index, 42 00:01:18,780 --> 00:01:20,170 that that index will 43 00:01:20,430 --> 00:01:21,890 quickly tell us that tuple 44 00:01:22,550 --> 00:01:23,860 3 and tuple 7 have 45 00:01:24,060 --> 00:01:26,490 a value "cow", without scanning the entire table. 46 00:01:27,680 --> 00:01:28,820 We could also ask the index 47 00:01:29,390 --> 00:01:30,990 what tuples have say, value "cat". 48 00:01:31,950 --> 00:01:33,030 And if we ask the index that 49 00:01:33,360 --> 00:01:34,410 question, it will tell us 50 00:01:34,640 --> 00:01:36,260 tuple 1 and tuple 5 51 00:01:36,310 --> 00:01:38,900 and tuple 6 have the value "cat." 52 00:01:40,240 --> 00:01:41,610 If we're interested in evaluating conditions 53 00:01:42,130 --> 00:01:43,210 in column B then we can 54 00:01:43,340 --> 00:01:44,560 also build an index on 55 00:01:44,740 --> 00:01:46,430 column B. For example now 56 00:01:46,660 --> 00:01:48,270 we could ask questions like, when 57 00:01:48,610 --> 00:01:50,520 is T.B equal to the value two? 58 00:01:51,100 --> 00:01:52,120 We asked the index and the 59 00:01:52,180 --> 00:01:53,950 index will tell us 60 00:01:54,250 --> 00:01:56,460 that tuple 1 and tuple 5 have the value two. 61 00:01:57,270 --> 00:01:58,540 We could also ask, for example, 62 00:01:59,450 --> 00:02:00,970 when the value in T.B 63 00:02:01,670 --> 00:02:02,560 is less than six? 64 00:02:03,610 --> 00:02:04,690 And the index in that case 65 00:02:04,980 --> 00:02:06,020 would tell us that tuple 66 00:02:06,260 --> 00:02:07,580 1 is less than six, two...wow, 67 00:02:08,120 --> 00:02:10,890 most of them: three, five, and seven. 68 00:02:11,830 --> 00:02:13,410 We could ask an even more complicated question. 69 00:02:13,840 --> 00:02:15,690 We could ask when the 70 00:02:16,140 --> 00:02:17,640 value for T.B is say, 71 00:02:18,000 --> 00:02:20,520 greater than four and less than or equal to eight. 72 00:02:21,210 --> 00:02:22,200 Again, we ask the index, 73 00:02:22,550 --> 00:02:23,490 and in this case the index 74 00:02:23,980 --> 00:02:25,240 would tell us that it 75 00:02:25,470 --> 00:02:27,660 is tuple two and tuple 76 00:02:28,470 --> 00:02:29,460 seven in that case. 77 00:02:30,710 --> 00:02:32,110 Lastly, suppose we're interested in 78 00:02:32,280 --> 00:02:33,550 having conditions that are 79 00:02:33,680 --> 00:02:34,800 on both columns A and B. 80 00:02:35,400 --> 00:02:36,230 Then we can build an index 81 00:02:36,720 --> 00:02:38,080 that is on both columns together. 82 00:02:39,020 --> 00:02:40,380 Now we could ask questions, for 83 00:02:40,560 --> 00:02:42,410 example, like, when is T.A 84 00:02:42,890 --> 00:02:45,520 equal to cat and T.B, 85 00:02:46,580 --> 00:02:47,660 say, greater than five? 86 00:02:48,490 --> 00:02:49,950 Do we have any of those? 87 00:02:51,170 --> 00:02:52,470 Well, we have just one of them there. 88 00:02:52,630 --> 00:02:53,190 That's tuple six. 89 00:02:54,200 --> 00:02:56,610 We could also have inequalities, by the way, on the first column. 90 00:02:56,950 --> 00:02:58,000 So we might ask, when is 91 00:02:58,410 --> 00:02:59,720 T.A less than, say the 92 00:02:59,810 --> 00:03:02,260 value D, and T.B equal 93 00:03:02,960 --> 00:03:03,990 to say the value 1, and 94 00:03:04,810 --> 00:03:06,810 in that case we'll get the tuple 3 as a result. 95 00:03:07,220 --> 00:03:08,150 So I think this gives an 96 00:03:08,240 --> 00:03:09,140 idea with a simple table 97 00:03:09,850 --> 00:03:11,070 of how indexes are used to 98 00:03:11,130 --> 00:03:12,290 go directly to the tuples that 99 00:03:12,440 --> 00:03:14,810 satisfy conditions rather than scanning the entire table. 100 00:03:15,900 --> 00:03:17,540 So that's the main utility of an index. 101 00:03:17,890 --> 00:03:18,910 Again, the tables can be 102 00:03:19,230 --> 00:03:20,560 absolutely gigantic in databases 103 00:03:21,230 --> 00:03:22,360 and the difference between scanning 104 00:03:22,710 --> 00:03:23,780 an entire table to find 105 00:03:23,950 --> 00:03:25,080 that tuples that match a condition 106 00:03:25,720 --> 00:03:27,020 and locating the tuples, more or 107 00:03:27,140 --> 00:03:28,120 less immediately using an index, 108 00:03:28,670 --> 00:03:30,920 can be orders of magnitude in performance difference. 109 00:03:31,550 --> 00:03:32,810 So really it's quite important to 110 00:03:32,830 --> 00:03:33,950 take a look at the database and 111 00:03:34,120 --> 00:03:35,570 build indexes on those 112 00:03:35,790 --> 00:03:36,720 attributes that are going to 113 00:03:36,790 --> 00:03:39,910 be used frequently in conditions, especially conditions that are very selective. 114 00:03:41,640 --> 00:03:42,410 Now I mentioned that we're not 115 00:03:42,580 --> 00:03:43,850 covering the implementation of indexes 116 00:03:44,380 --> 00:03:45,520 in this video, but it is 117 00:03:45,750 --> 00:03:48,130 important to understand the basic data structures that are used. 118 00:03:48,980 --> 00:03:50,870 Specifically there are two different structures. 119 00:03:51,230 --> 00:03:52,310 One of them is balance trees 120 00:03:52,660 --> 00:03:54,090 and substantiation of that is 121 00:03:54,260 --> 00:03:56,120 typically what's called a B-tree or a B+tree. 122 00:03:56,420 --> 00:03:57,860 And the other is hash tables. 123 00:03:58,900 --> 00:04:00,680 Now balance trees indexes can 124 00:04:00,930 --> 00:04:02,160 be used to help with conditions 125 00:04:02,590 --> 00:04:03,800 of the form "attribute equals value." They 126 00:04:04,370 --> 00:04:05,390 can also be used for 127 00:04:05,580 --> 00:04:07,700 "attribute less than value," for 128 00:04:07,910 --> 00:04:09,140 "attribute between two values" 129 00:04:10,120 --> 00:04:12,110 and so on, as we have shown earlier. 130 00:04:13,340 --> 00:04:14,320 Hash tables, on the other 131 00:04:14,500 --> 00:04:16,550 hand, can only be used for equality conditions. 132 00:04:17,030 --> 00:04:18,480 So only attribute equal value. 133 00:04:18,760 --> 00:04:19,710 And if you're familiar with these 134 00:04:19,860 --> 00:04:22,190 structures then you'll know why there's the limit on hash tables. 135 00:04:23,140 --> 00:04:25,250 So balanced tree indexes are certainly more flexible. 136 00:04:25,800 --> 00:04:26,940 Now there is one small downside. 137 00:04:27,500 --> 00:04:28,280 For those of you who are 138 00:04:28,420 --> 00:04:29,850 familiar with the structures and 139 00:04:29,980 --> 00:04:31,630 with the running time, the operations 140 00:04:32,320 --> 00:04:33,360 on the balance trees tend 141 00:04:33,740 --> 00:04:35,300 to be logarithmic in their 142 00:04:35,410 --> 00:04:36,830 running time, while well designed 143 00:04:37,220 --> 00:04:38,630 hash tables can have more 144 00:04:38,950 --> 00:04:40,870 or less constant running time. 145 00:04:41,110 --> 00:04:43,110 Even in large databases, logarithmic is 146 00:04:43,330 --> 00:04:44,840 okay, although when only equality 147 00:04:45,400 --> 00:04:46,420 conditions are being used, then 148 00:04:46,570 --> 00:04:47,930 a hash table index might be preferred. 149 00:04:49,540 --> 00:04:50,260 Now, let's take a look at 150 00:04:50,330 --> 00:04:51,740 a few SQL queries and see 151 00:04:51,870 --> 00:04:53,040 how indexes might allow the 152 00:04:53,140 --> 00:04:54,920 query execution engine to speed up processing. 153 00:04:56,080 --> 00:04:57,970 We'll use our usual student and college database. 154 00:04:58,600 --> 00:04:59,500 The first one is a very 155 00:04:59,530 --> 00:05:01,160 simple query, it's looking for 156 00:05:01,260 --> 00:05:02,800 the student with a particular student ID. 157 00:05:03,480 --> 00:05:04,630 So if we have an index on 158 00:05:04,770 --> 00:05:05,710 the student ID then again the 159 00:05:05,800 --> 00:05:07,510 index will allow the 160 00:05:07,610 --> 00:05:08,790 query execution engine to go 161 00:05:09,180 --> 00:05:10,380 pretty much straight to that 162 00:05:10,640 --> 00:05:11,930 tuple, whereas without an 163 00:05:12,200 --> 00:05:14,680 index the entire student table would have to be scanned. 164 00:05:15,540 --> 00:05:16,660 Now let me mention that many 165 00:05:17,050 --> 00:05:20,370 database systems do automatically build indexes on primary keys. 166 00:05:20,880 --> 00:05:22,580 So it's likely that in an 167 00:05:22,710 --> 00:05:24,100 application the student ID would 168 00:05:24,270 --> 00:05:26,450 be declared as a primary key and there would be an index automatically. 169 00:05:27,100 --> 00:05:29,600 But it's a good thing to check if this type of query is common. 170 00:05:30,930 --> 00:05:32,310 And some systems even also build 171 00:05:32,550 --> 00:05:35,060 indexes automatically on attributes that are declared as unique. 172 00:05:35,570 --> 00:05:36,400 As a reminder from the constraint 173 00:05:36,870 --> 00:05:38,020 video, every table can have 174 00:05:38,110 --> 00:05:39,420 one primary key, and then 175 00:05:39,650 --> 00:05:41,940 any number of additional keys labeled as unique. 176 00:05:43,380 --> 00:05:45,480 Now let's take a look at a slightly more complicated example. 177 00:05:46,000 --> 00:05:47,640 Here we're looking for students whose 178 00:05:47,830 --> 00:05:49,430 name is Mary and whose 179 00:05:49,610 --> 00:05:52,100 GPA is greater then 3.9, and there may be a few of those students. 180 00:05:52,970 --> 00:05:54,310 So one possibility is that 181 00:05:54,430 --> 00:05:55,650 we have an index on the 182 00:05:55,710 --> 00:05:57,110 student name and if 183 00:05:57,210 --> 00:05:58,650 that is the case, expand the query 184 00:05:59,100 --> 00:06:00,430 processing can find quickly 185 00:06:01,260 --> 00:06:02,330 the tuples whose student name is 186 00:06:02,450 --> 00:06:03,450 Mary, and then check 187 00:06:03,720 --> 00:06:05,850 each one of those to see if the GPA is greater than 3.9. 188 00:06:06,950 --> 00:06:09,840 Alternatively we might have an index on the GPA. 189 00:06:10,890 --> 00:06:11,770 In that case, the system 190 00:06:12,170 --> 00:06:13,380 will use the index to find 191 00:06:14,050 --> 00:06:15,020 the students whose GPA is 192 00:06:15,250 --> 00:06:18,380 greater than 3.9 and then look to see if their name is Mary. 193 00:06:19,370 --> 00:06:20,340 Finally, it is possible we 194 00:06:20,450 --> 00:06:21,540 can have an index on the 195 00:06:21,630 --> 00:06:23,000 two attributes together so we 196 00:06:23,130 --> 00:06:24,310 can have S name and GPA 197 00:06:24,570 --> 00:06:25,710 together and then this 198 00:06:25,990 --> 00:06:27,050 index can be used to 199 00:06:27,620 --> 00:06:29,360 simultaneously find students that 200 00:06:29,510 --> 00:06:32,040 have the name Mary and the GPA greater than 3.9. 201 00:06:34,040 --> 00:06:35,020 Now, I should mention that because 202 00:06:35,310 --> 00:06:36,580 this is an inequality 203 00:06:37,290 --> 00:06:38,690 condition, it is important that 204 00:06:38,850 --> 00:06:39,890 the GPA is a tree 205 00:06:40,160 --> 00:06:41,330 based index in order to 206 00:06:41,460 --> 00:06:43,600 support that evaluation of 207 00:06:43,620 --> 00:06:44,990 this condition where the student 208 00:06:45,260 --> 00:06:46,630 name is an equality condition so 209 00:06:46,770 --> 00:06:49,100 that could be a hash based index or a tree based index. 210 00:06:50,420 --> 00:06:51,890 Now let's look at a query that involves a joint. 211 00:06:52,520 --> 00:06:53,850 We're joining the student and apply 212 00:06:54,270 --> 00:06:55,300 tables in order to find 213 00:06:55,610 --> 00:06:57,890 the names of the colleges that each student has applied to. 214 00:06:58,220 --> 00:06:59,980 And we're returning the student name and the college name. 215 00:07:00,990 --> 00:07:02,210 So let's suppose for starters 216 00:07:02,690 --> 00:07:04,050 that we had an index on 217 00:07:04,380 --> 00:07:06,730 the student ID attribute of the apply relation. 218 00:07:07,580 --> 00:07:08,990 If that's the case then the 219 00:07:09,180 --> 00:07:10,620 query execution engine can scan 220 00:07:11,250 --> 00:07:12,540 the student relation and, for 221 00:07:12,750 --> 00:07:14,210 each student, use that SID 222 00:07:14,680 --> 00:07:15,710 and quickly find the matching 223 00:07:16,160 --> 00:07:17,410 SIDs in the apply relation. 224 00:07:18,900 --> 00:07:20,410 Alternatively, let's suppose we 225 00:07:20,510 --> 00:07:21,900 had an index on the 226 00:07:22,190 --> 00:07:23,870 SID attribute of the student relation. 227 00:07:24,750 --> 00:07:25,900 In that case, the system could 228 00:07:26,450 --> 00:07:28,210 scan the apply relation, and, 229 00:07:28,300 --> 00:07:29,220 for each student ID and each 230 00:07:29,600 --> 00:07:30,830 apply tuple, find the matching 231 00:07:31,060 --> 00:07:31,970 student IDs in the student 232 00:07:32,370 --> 00:07:34,250 tuple using the index that we have there. 233 00:07:35,250 --> 00:07:36,190 In some cases it's actually 234 00:07:36,720 --> 00:07:37,670 possible to use the two 235 00:07:37,950 --> 00:07:40,810 indexes together and make the query run even faster. 236 00:07:41,500 --> 00:07:42,720 I'm not going to go into 237 00:07:42,850 --> 00:07:43,800 detail, but indexes often allow 238 00:07:44,170 --> 00:07:45,660 relations to be accessed in 239 00:07:45,890 --> 00:07:47,180 sorted order of the indexed attributes. 240 00:07:48,100 --> 00:07:49,160 So, suppose we can get 241 00:07:49,300 --> 00:07:50,400 the student relation in sorted 242 00:07:50,680 --> 00:07:52,470 order and the apply relation in sorted order. 243 00:07:52,950 --> 00:07:53,870 Then we can kind of do a 244 00:07:54,250 --> 00:07:56,050 merge-like operation of the 245 00:07:56,240 --> 00:07:57,490 two indexes to get the 246 00:07:57,570 --> 00:07:58,840 matching student and apply 247 00:07:59,210 --> 00:08:00,780 records, those whose SIDs are equal. 248 00:08:02,130 --> 00:08:03,600 If we had additional conditions in 249 00:08:03,690 --> 00:08:04,540 the query there might be even 250 00:08:04,770 --> 00:08:05,750 more choices of how to 251 00:08:05,910 --> 00:08:07,430 use indexes, and that gets 252 00:08:07,590 --> 00:08:08,990 into the entire area of 253 00:08:09,100 --> 00:08:11,460 what is known as query planning and query optimization. 254 00:08:11,700 --> 00:08:13,590 And this is actually one 255 00:08:13,650 --> 00:08:15,000 of the most exciting and interesting 256 00:08:15,690 --> 00:08:17,540 areas of the implementation of 257 00:08:17,640 --> 00:08:19,070 database systems and is 258 00:08:19,240 --> 00:08:20,140 what allows us to have of 259 00:08:20,280 --> 00:08:22,600 a declarative query language that's implemented efficiently. 260 00:08:24,070 --> 00:08:25,380 So, indexes seem like great things. 261 00:08:25,730 --> 00:08:27,120 We just throw some indexes onto 262 00:08:27,400 --> 00:08:29,620 our data and all of a sudden our queries run much, much faster. 263 00:08:30,490 --> 00:08:32,580 So, there must be some downsides, and of course, there are. 264 00:08:33,260 --> 00:08:35,940 Let me list three of them from sort of least severe to most severe. 265 00:08:36,310 --> 00:08:39,100 So, the first one is that indexes do take up extra space. 266 00:08:39,460 --> 00:08:42,390 As I mentioned, they are persistent data structures that resides with the data. 267 00:08:42,760 --> 00:08:44,070 I consider this sort of 268 00:08:44,290 --> 00:08:45,990 a marginal downside especially with 269 00:08:46,110 --> 00:08:47,090 the cost of disk these days 270 00:08:47,360 --> 00:08:48,580 its really not that big of 271 00:08:48,640 --> 00:08:49,900 deal to use additional 272 00:08:50,280 --> 00:08:52,360 space, even to potentially double the size of your database. 273 00:08:53,640 --> 00:08:55,560 The second downside is the 274 00:08:55,810 --> 00:08:57,060 overhead involved in index creation. 275 00:08:57,820 --> 00:08:59,150 So, when a database is 276 00:08:59,250 --> 00:09:00,450 loaded, if we're going 277 00:09:00,510 --> 00:09:02,900 to have indexes, those indexes need to be created over the data. 278 00:09:03,570 --> 00:09:05,650 Or if we add indexes later on, they need to be created. 279 00:09:06,640 --> 00:09:07,900 Index creation can actually be 280 00:09:08,470 --> 00:09:11,490 a fairly time consuming operation, so I'm going to make this as a medium downside. 281 00:09:12,450 --> 00:09:13,500 On the other hand, once the index 282 00:09:13,720 --> 00:09:15,210 is created, all the 283 00:09:15,380 --> 00:09:17,950 queries run faster, so it's usually worthwhile to do it. 284 00:09:18,860 --> 00:09:19,880 The last one is the 285 00:09:19,920 --> 00:09:22,810 most significant one and that's the issue of index maintenance. 286 00:09:23,360 --> 00:09:24,250 So the index is a 287 00:09:24,300 --> 00:09:25,530 data structure that sits to 288 00:09:25,580 --> 00:09:27,330 the side of the database and helps answer conditions. 289 00:09:28,420 --> 00:09:30,100 When the values in 290 00:09:30,260 --> 00:09:31,840 the database change, then the 291 00:09:31,920 --> 00:09:34,010 index has to be modified to reflect those changes. 292 00:09:34,860 --> 00:09:36,090 So if the database 293 00:09:36,800 --> 00:09:38,930 is modified frequently, each of 294 00:09:39,150 --> 00:09:40,340 those modifications is going to 295 00:09:40,410 --> 00:09:42,590 be significantly slower than if we didn't have indexes. 296 00:09:43,400 --> 00:09:44,250 So in fact, in a database 297 00:09:44,740 --> 00:09:46,000 that's modified a whole bunch 298 00:09:46,400 --> 00:09:47,730 and not queried all that 299 00:09:47,900 --> 00:09:49,580 often, the cost of 300 00:09:49,760 --> 00:09:51,160 index maintenance can actually 301 00:09:51,540 --> 00:09:52,990 offset the benefits of having the index. 302 00:09:53,430 --> 00:09:54,720 So it really is a cost-benefit 303 00:09:55,210 --> 00:09:56,880 trade off to decide when to build indexes. 304 00:09:58,250 --> 00:10:00,260 So given that we have this cost-benefit trade off. 305 00:10:00,730 --> 00:10:01,720 How do we figure out which indexes 306 00:10:02,230 --> 00:10:03,000 to create when we have a 307 00:10:03,030 --> 00:10:04,560 database an applications on that database. 308 00:10:05,620 --> 00:10:06,460 The benefit of an index 309 00:10:07,240 --> 00:10:08,270 first of all on how big the 310 00:10:08,340 --> 00:10:09,130 table is since the index 311 00:10:09,600 --> 00:10:11,600 helps us find specific portions of the table quickly. 312 00:10:12,720 --> 00:10:14,210 It depends on the data distributions 313 00:10:14,920 --> 00:10:16,070 again because the index helps 314 00:10:16,420 --> 00:10:18,130 us find specific data values quickly. 315 00:10:18,960 --> 00:10:21,430 And finally how often we're going to query the database. 316 00:10:22,070 --> 00:10:23,650 First of all it's how we're going to update it. 317 00:10:24,030 --> 00:10:25,230 As I mentioned, every time the 318 00:10:25,290 --> 00:10:28,190 database is updated indexes needed to be maintained and that's costly. 319 00:10:29,140 --> 00:10:30,400 Every time we query the indexes 320 00:10:30,630 --> 00:10:32,250 may help us answer our queries more quickly. 321 00:10:33,740 --> 00:10:34,990 Fortunately, over the last decade 322 00:10:35,320 --> 00:10:36,740 or so, many database system 323 00:10:37,010 --> 00:10:39,360 vendors have introduced what's called a physical design advisor. 324 00:10:40,070 --> 00:10:41,470 In this case, physical design means 325 00:10:41,710 --> 00:10:44,030 determining which indexes to build on a database. 326 00:10:44,590 --> 00:10:45,650 The input to the design 327 00:10:46,130 --> 00:10:47,040 advisor is the database 328 00:10:47,570 --> 00:10:48,600 itself and the workload. 329 00:10:49,280 --> 00:10:50,650 The workload consists of the 330 00:10:50,800 --> 00:10:51,910 sets of queries and updates 331 00:10:52,310 --> 00:10:53,290 that are expected to be performed 332 00:10:53,750 --> 00:10:55,020 on the database as well as their frequency. 333 00:10:56,260 --> 00:10:57,470 Now actually the design advisor 334 00:10:57,840 --> 00:10:58,700 doesn't usually look at the 335 00:10:58,900 --> 00:10:59,990 entire database, but rather looks 336 00:11:00,230 --> 00:11:01,180 at statistics on the database 337 00:11:01,710 --> 00:11:04,250 that describe how large the tables are and their data distributions. 338 00:11:05,570 --> 00:11:06,580 The output of the design advisor 339 00:11:07,070 --> 00:11:08,240 is a recommended set of indexes 340 00:11:08,800 --> 00:11:11,010 to build that will speed up the overall workload. 341 00:11:12,470 --> 00:11:13,940 Interestingly, physical design advisors 342 00:11:14,440 --> 00:11:15,580 rely very heavily on a 343 00:11:15,650 --> 00:11:16,590 component of database systems that already 344 00:11:16,900 --> 00:11:18,130 existed, actually one of the 345 00:11:18,450 --> 00:11:19,890 most important components of database 346 00:11:20,310 --> 00:11:21,460 systems, which is the query optimizer. 347 00:11:21,870 --> 00:11:23,520 That's the component that 348 00:11:23,670 --> 00:11:25,720 takes a query and figures out how to execute it. 349 00:11:26,170 --> 00:11:27,790 Specifically, it'll take statistics on 350 00:11:28,000 --> 00:11:29,150 the database, the query to be 351 00:11:29,250 --> 00:11:30,340 executed or the update command, 352 00:11:31,200 --> 00:11:32,320 and the set of indexes that 353 00:11:32,440 --> 00:11:33,880 currently exist, and it 354 00:11:34,000 --> 00:11:35,550 will explore the various ways 355 00:11:36,010 --> 00:11:37,040 of actually executing the query, 356 00:11:37,330 --> 00:11:39,720 which indexes will be used, which order things will be done in. 357 00:11:40,250 --> 00:11:41,290 It estimates the cost of 358 00:11:41,390 --> 00:11:42,560 each one, and it spits 359 00:11:42,870 --> 00:11:44,260 out the estimated best execution 360 00:11:44,900 --> 00:11:46,130 plan with the estimated cost. 361 00:11:47,120 --> 00:11:48,060 So now let's look at how 362 00:11:48,400 --> 00:11:50,850 this component can be used to build a design advisor. 363 00:11:51,700 --> 00:11:52,560 Let's just draw the design 364 00:11:52,990 --> 00:11:54,170 advisor around the whole thing 365 00:11:54,470 --> 00:11:56,190 here, and the input 366 00:11:56,470 --> 00:11:57,950 to the design advisor, again, are 367 00:11:58,400 --> 00:11:59,980 the statistics and the workload, 368 00:12:00,920 --> 00:12:02,650 and the output is supposed to be the indexes. 369 00:12:03,410 --> 00:12:05,110 So what the design adviser actually 370 00:12:05,680 --> 00:12:08,730 does is it experiments with different set-ups of indexes. 371 00:12:09,780 --> 00:12:11,290 For each set-up of indexes, 372 00:12:11,660 --> 00:12:13,060 it takes the workload, it issues 373 00:12:13,460 --> 00:12:15,630 the queries, and updates to the query optimizer. 374 00:12:16,230 --> 00:12:17,070 It doesn't actually run them against 375 00:12:17,090 --> 00:12:20,570 the database and see's what cost the query optimizer produces. 376 00:12:21,660 --> 00:12:22,570 It tries this with different 377 00:12:23,000 --> 00:12:24,960 configurations of indexes and 378 00:12:25,050 --> 00:12:25,930 then in the end determines 379 00:12:26,530 --> 00:12:29,510 those indexes that bring down the cost the most. 380 00:12:29,940 --> 00:12:30,910 In other words, it will give 381 00:12:31,100 --> 00:12:32,840 you back those indexes where 382 00:12:33,130 --> 00:12:34,460 the benefits of having the 383 00:12:34,710 --> 00:12:35,980 index outweigh the drawbacks 384 00:12:36,650 --> 00:12:38,340 of having that index in terms 385 00:12:38,750 --> 00:12:40,440 of the workload and using 386 00:12:40,770 --> 00:12:42,550 the costs that were estimated by the query optimizer. 387 00:12:43,770 --> 00:12:44,760 If you're using a system that 388 00:12:44,810 --> 00:12:45,770 doesn't have a design adviser, 389 00:12:46,200 --> 00:12:48,090 then you'll have to kind of go through this process yourself. 390 00:12:48,620 --> 00:12:49,370 You'll have to take a look 391 00:12:49,600 --> 00:12:50,720 at the queries and updates that 392 00:12:50,850 --> 00:12:52,190 you expect, how often you 393 00:12:52,270 --> 00:12:53,510 expect them to happen, and 394 00:12:53,620 --> 00:12:55,470 which indexes will benefit those 395 00:12:55,710 --> 00:12:57,090 queries, and hopefully won't incur 396 00:12:57,410 --> 00:12:58,880 too much overhead when there are updates. 397 00:13:00,110 --> 00:13:02,160 Just quickly, here's the SQL standard for creating indexes. 398 00:13:02,950 --> 00:13:03,940 All indexes are given names. 399 00:13:04,450 --> 00:13:06,120 We can create an index on a single attribute. 400 00:13:06,940 --> 00:13:08,960 We can create an index on several attributes together. 401 00:13:09,900 --> 00:13:10,930 We can also say that we 402 00:13:11,040 --> 00:13:12,120 want our index to enforce 403 00:13:12,640 --> 00:13:13,870 a uniqueness constraint so when 404 00:13:14,100 --> 00:13:16,610 we add the word unique it sort of adds constraint enforcement. 405 00:13:17,270 --> 00:13:18,040 It says, we're going to check 406 00:13:18,370 --> 00:13:19,450 that all values for "A" are 407 00:13:19,510 --> 00:13:20,790 unique using our index and 408 00:13:20,900 --> 00:13:22,060 will generate an error if there 409 00:13:22,360 --> 00:13:23,160 are two values that have the 410 00:13:23,380 --> 00:13:24,950 same, two tuples that have 411 00:13:25,090 --> 00:13:26,490 the same value for A, and 412 00:13:26,630 --> 00:13:28,240 finally we have a command for dropping indexes. 413 00:13:29,740 --> 00:13:31,140 In summary, indexes are really important. 414 00:13:31,700 --> 00:13:33,980 They're the primary way to get improved performance on a database. 415 00:13:34,940 --> 00:13:36,550 By building the right indexes over 416 00:13:36,820 --> 00:13:37,930 a database for its work flow 417 00:13:38,150 --> 00:13:40,240 we can get orders of magnitude performance improvement. 418 00:13:41,050 --> 00:13:41,980 Although we do have to be careful, 419 00:13:42,140 --> 00:13:43,340 because there are trade-offs in 420 00:13:43,430 --> 00:13:46,340 building indexes, especially for databases that are modified frequently. 421 00:13:47,360 --> 00:13:48,530 There are persistent data structure that 422 00:13:48,680 --> 00:13:49,940 are stored together with the database, 423 00:13:50,300 --> 00:13:51,920 and there are many interesting implementation issues, 424 00:13:52,610 --> 00:13:53,440 but in this video and course 425 00:13:53,740 --> 00:13:55,030 we're focusing specifically on the 426 00:13:55,090 --> 00:13:56,870 user and application perspective, determining 427 00:13:57,320 --> 00:13:58,630 which indexes to build and 428 00:13:58,750 --> 00:14:00,600 how they will gain performance improvement for us.