1
00:00:00,460 --> 00:00:01,450
This video talks about indexes.

2
00:00:02,410 --> 00:00:05,260
It's actually a relatively short video about a very important topic.

3
00:00:05,920 --> 00:00:07,140
Before I get started, however, let

4
00:00:07,290 --> 00:00:08,710
me mention that indexes are also

5
00:00:09,050 --> 00:00:10,900
sometimes prefer two as indices, those are equivalent.

6
00:00:11,460 --> 00:00:13,040
I personally prefer using the term indexes.

7
00:00:14,350 --> 00:00:15,660
The reason indexes are so important

8
00:00:16,020 --> 00:00:16,880
is that they are the primary

9
00:00:17,390 --> 00:00:19,420
way of getting improved performance out of a database.

10
00:00:20,320 --> 00:00:21,730
Indexes are a persistent data structure.

11
00:00:22,100 --> 00:00:23,700
They're stored together with the database itself.

12
00:00:24,740 --> 00:00:25,890
Now, there are many very interesting

13
00:00:26,400 --> 00:00:27,710
implementation issues in indexes,

14
00:00:28,780 --> 00:00:29,920
but in this video and in

15
00:00:30,030 --> 00:00:31,110
this course in general, we're focusing

16
00:00:31,680 --> 00:00:33,120
on the perspective of the user and application.

17
00:00:33,860 --> 00:00:35,160
So, we'll talk about how applications

18
00:00:35,700 --> 00:00:37,140
will use indexes to speed up performance.

19
00:00:38,590 --> 00:00:39,570
So, let's suppose we have a

20
00:00:39,610 --> 00:00:41,490
very simple table T that

21
00:00:41,630 --> 00:00:42,780
has three columns, but we're

22
00:00:42,890 --> 00:00:43,850
going to focus on columns A

23
00:00:43,990 --> 00:00:44,960
and columns B. And we can

24
00:00:45,120 --> 00:00:45,970
see that Column A is a

25
00:00:46,000 --> 00:00:47,640
string valued column with animal

26
00:00:47,850 --> 00:00:49,290
names, and Column B is an integer column.

27
00:00:50,200 --> 00:00:52,980
Now, we're gonna be asking queries that involve conditions over these two columns.

28
00:00:54,340 --> 00:00:55,260
In order to speed up those

29
00:00:55,460 --> 00:00:56,570
queries, if we're concerned about

30
00:00:56,780 --> 00:00:58,440
evaluating conditions on column A,

31
00:00:59,030 --> 00:01:01,320
then we can build an index on that column.

32
00:01:01,660 --> 00:01:02,850
So we call that an index on

33
00:01:03,010 --> 00:01:04,980
column T.A. What that

34
00:01:05,220 --> 00:01:06,160
index allows us to do

35
00:01:06,390 --> 00:01:07,180
- and "us" in this case

36
00:01:07,410 --> 00:01:09,280
is actually the query processor -

37
00:01:09,430 --> 00:01:10,770
is ask questions, for example,

38
00:01:12,020 --> 00:01:13,760
let's ask what tuples have

39
00:01:14,130 --> 00:01:15,210
"cow" in the value of

40
00:01:15,270 --> 00:01:16,730
T.A. If we

41
00:01:17,220 --> 00:01:18,080
ask that question of our index,

42
00:01:18,780 --> 00:01:20,170
that that index will

43
00:01:20,430 --> 00:01:21,890
quickly tell us that tuple

44
00:01:22,550 --> 00:01:23,860
3 and tuple 7 have

45
00:01:24,060 --> 00:01:26,490
a value "cow", without scanning the entire table.

46
00:01:27,680 --> 00:01:28,820
We could also ask the index

47
00:01:29,390 --> 00:01:30,990
what tuples have say, value "cat".

48
00:01:31,950 --> 00:01:33,030
And if we ask the index that

49
00:01:33,360 --> 00:01:34,410
question, it will tell us

50
00:01:34,640 --> 00:01:36,260
tuple 1 and tuple 5

51
00:01:36,310 --> 00:01:38,900
and tuple 6 have the value "cat."

52
00:01:40,240 --> 00:01:41,610
If we're interested in evaluating conditions

53
00:01:42,130 --> 00:01:43,210
in column B then we can

54
00:01:43,340 --> 00:01:44,560
also build an index on

55
00:01:44,740 --> 00:01:46,430
column B. For example now

56
00:01:46,660 --> 00:01:48,270
we could ask questions like, when

57
00:01:48,610 --> 00:01:50,520
is T.B equal to the value two?

58
00:01:51,100 --> 00:01:52,120
We asked the index and the

59
00:01:52,180 --> 00:01:53,950
index will tell us

60
00:01:54,250 --> 00:01:56,460
that tuple 1 and tuple 5 have the value two.

61
00:01:57,270 --> 00:01:58,540
We could also ask, for example,

62
00:01:59,450 --> 00:02:00,970
when the value in T.B

63
00:02:01,670 --> 00:02:02,560
is less than six?

64
00:02:03,610 --> 00:02:04,690
And the index in that case

65
00:02:04,980 --> 00:02:06,020
would tell us that tuple

66
00:02:06,260 --> 00:02:07,580
1 is less than six, two...wow,

67
00:02:08,120 --> 00:02:10,890
most of them: three, five, and seven.

68
00:02:11,830 --> 00:02:13,410
We could ask an even more complicated question.

69
00:02:13,840 --> 00:02:15,690
We could ask when the

70
00:02:16,140 --> 00:02:17,640
value for T.B is say,

71
00:02:18,000 --> 00:02:20,520
greater than four and less than or equal to eight.

72
00:02:21,210 --> 00:02:22,200
Again, we ask the index,

73
00:02:22,550 --> 00:02:23,490
and in this case the index

74
00:02:23,980 --> 00:02:25,240
would tell us that it

75
00:02:25,470 --> 00:02:27,660
is tuple two and tuple

76
00:02:28,470 --> 00:02:29,460
seven in that case.

77
00:02:30,710 --> 00:02:32,110
Lastly, suppose we're interested in

78
00:02:32,280 --> 00:02:33,550
having conditions that are

79
00:02:33,680 --> 00:02:34,800
on both columns A and B.

80
00:02:35,400 --> 00:02:36,230
Then we can build an index

81
00:02:36,720 --> 00:02:38,080
that is on both columns together.

82
00:02:39,020 --> 00:02:40,380
Now we could ask questions, for

83
00:02:40,560 --> 00:02:42,410
example, like, when is T.A

84
00:02:42,890 --> 00:02:45,520
equal to cat and T.B,

85
00:02:46,580 --> 00:02:47,660
say, greater than five?

86
00:02:48,490 --> 00:02:49,950
Do we have any of those?

87
00:02:51,170 --> 00:02:52,470
Well, we have just one of them there.

88
00:02:52,630 --> 00:02:53,190
That's tuple six.

89
00:02:54,200 --> 00:02:56,610
We could also have inequalities, by the way, on the first column.

90
00:02:56,950 --> 00:02:58,000
So we might ask, when is

91
00:02:58,410 --> 00:02:59,720
T.A less than, say the

92
00:02:59,810 --> 00:03:02,260
value D, and T.B equal

93
00:03:02,960 --> 00:03:03,990
to say the value 1, and

94
00:03:04,810 --> 00:03:06,810
in that case we'll get the tuple 3 as a result.

95
00:03:07,220 --> 00:03:08,150
So I think this gives an

96
00:03:08,240 --> 00:03:09,140
idea with a simple table

97
00:03:09,850 --> 00:03:11,070
of how indexes are used to

98
00:03:11,130 --> 00:03:12,290
go directly to the tuples that

99
00:03:12,440 --> 00:03:14,810
satisfy conditions rather than scanning the entire table.

100
00:03:15,900 --> 00:03:17,540
So that's the main utility of an index.

101
00:03:17,890 --> 00:03:18,910
Again, the tables can be

102
00:03:19,230 --> 00:03:20,560
absolutely gigantic in databases

103
00:03:21,230 --> 00:03:22,360
and the difference between scanning

104
00:03:22,710 --> 00:03:23,780
an entire table to find

105
00:03:23,950 --> 00:03:25,080
that tuples that match a condition

106
00:03:25,720 --> 00:03:27,020
and locating the tuples, more or

107
00:03:27,140 --> 00:03:28,120
less immediately using an index,

108
00:03:28,670 --> 00:03:30,920
can be orders of magnitude in performance difference.

109
00:03:31,550 --> 00:03:32,810
So really it's quite important to

110
00:03:32,830 --> 00:03:33,950
take a look at the database and

111
00:03:34,120 --> 00:03:35,570
build indexes on those

112
00:03:35,790 --> 00:03:36,720
attributes that are going to

113
00:03:36,790 --> 00:03:39,910
be used frequently in conditions, especially conditions that are very selective.

114
00:03:41,640 --> 00:03:42,410
Now I mentioned that we're not

115
00:03:42,580 --> 00:03:43,850
covering the implementation of indexes

116
00:03:44,380 --> 00:03:45,520
in this video, but it is

117
00:03:45,750 --> 00:03:48,130
important to understand the basic data structures that are used.

118
00:03:48,980 --> 00:03:50,870
Specifically there are two different structures.

119
00:03:51,230 --> 00:03:52,310
One of them is balance trees

120
00:03:52,660 --> 00:03:54,090
and substantiation of that is

121
00:03:54,260 --> 00:03:56,120
typically what's called a B-tree or a B+tree.

122
00:03:56,420 --> 00:03:57,860
And the other is hash tables.

123
00:03:58,900 --> 00:04:00,680
Now balance trees indexes can

124
00:04:00,930 --> 00:04:02,160
be used to help with conditions

125
00:04:02,590 --> 00:04:03,800
of the form "attribute equals value." They

126
00:04:04,370 --> 00:04:05,390
can also be used for

127
00:04:05,580 --> 00:04:07,700
"attribute less than value," for

128
00:04:07,910 --> 00:04:09,140
"attribute between two values"

129
00:04:10,120 --> 00:04:12,110
and so on, as we have shown earlier.

130
00:04:13,340 --> 00:04:14,320
Hash tables, on the other

131
00:04:14,500 --> 00:04:16,550
hand, can only be used for equality conditions.

132
00:04:17,030 --> 00:04:18,480
So only attribute equal value.

133
00:04:18,760 --> 00:04:19,710
And if you're familiar with these

134
00:04:19,860 --> 00:04:22,190
structures then you'll know why there's the limit on hash tables.

135
00:04:23,140 --> 00:04:25,250
So balanced tree indexes are certainly more flexible.

136
00:04:25,800 --> 00:04:26,940
Now there is one small downside.

137
00:04:27,500 --> 00:04:28,280
For those of you who are

138
00:04:28,420 --> 00:04:29,850
familiar with the structures and

139
00:04:29,980 --> 00:04:31,630
with the running time, the operations

140
00:04:32,320 --> 00:04:33,360
on the balance trees tend

141
00:04:33,740 --> 00:04:35,300
to be logarithmic in their

142
00:04:35,410 --> 00:04:36,830
running time, while well designed

143
00:04:37,220 --> 00:04:38,630
hash tables can have more

144
00:04:38,950 --> 00:04:40,870
or less constant running time.

145
00:04:41,110 --> 00:04:43,110
Even in large databases, logarithmic is

146
00:04:43,330 --> 00:04:44,840
okay, although when only equality

147
00:04:45,400 --> 00:04:46,420
conditions are being used, then

148
00:04:46,570 --> 00:04:47,930
a hash table index might be preferred.

149
00:04:49,540 --> 00:04:50,260
Now, let's take a look at

150
00:04:50,330 --> 00:04:51,740
a few SQL queries and see

151
00:04:51,870 --> 00:04:53,040
how indexes might allow the

152
00:04:53,140 --> 00:04:54,920
query execution engine to speed up processing.

153
00:04:56,080 --> 00:04:57,970
We'll use our usual student and college database.

154
00:04:58,600 --> 00:04:59,500
The first one is a very

155
00:04:59,530 --> 00:05:01,160
simple query, it's looking for

156
00:05:01,260 --> 00:05:02,800
the student with a particular student ID.

157
00:05:03,480 --> 00:05:04,630
So if we have an index on

158
00:05:04,770 --> 00:05:05,710
the student ID then again the

159
00:05:05,800 --> 00:05:07,510
index will allow the

160
00:05:07,610 --> 00:05:08,790
query execution engine to go

161
00:05:09,180 --> 00:05:10,380
pretty much straight to that

162
00:05:10,640 --> 00:05:11,930
tuple, whereas without an

163
00:05:12,200 --> 00:05:14,680
index the entire student table would have to be scanned.

164
00:05:15,540 --> 00:05:16,660
Now let me mention that many

165
00:05:17,050 --> 00:05:20,370
database systems do automatically build indexes on primary keys.

166
00:05:20,880 --> 00:05:22,580
So it's likely that in an

167
00:05:22,710 --> 00:05:24,100
application the student ID would

168
00:05:24,270 --> 00:05:26,450
be declared as a primary key and there would be an index automatically.

169
00:05:27,100 --> 00:05:29,600
But it's a good thing to check if this type of query is common.

170
00:05:30,930 --> 00:05:32,310
And some systems even also build

171
00:05:32,550 --> 00:05:35,060
indexes automatically on attributes that are declared as unique.

172
00:05:35,570 --> 00:05:36,400
As a reminder from the constraint

173
00:05:36,870 --> 00:05:38,020
video, every table can have

174
00:05:38,110 --> 00:05:39,420
one primary key, and then

175
00:05:39,650 --> 00:05:41,940
any number of additional keys labeled as unique.

176
00:05:43,380 --> 00:05:45,480
Now let's take a look at a slightly more complicated example.

177
00:05:46,000 --> 00:05:47,640
Here we're looking for students whose

178
00:05:47,830 --> 00:05:49,430
name is Mary and whose

179
00:05:49,610 --> 00:05:52,100
GPA is greater then 3.9, and there may be a few of those students.

180
00:05:52,970 --> 00:05:54,310
So one possibility is that

181
00:05:54,430 --> 00:05:55,650
we have an index on the

182
00:05:55,710 --> 00:05:57,110
student name and if

183
00:05:57,210 --> 00:05:58,650
that is the case, expand the query

184
00:05:59,100 --> 00:06:00,430
processing can find quickly

185
00:06:01,260 --> 00:06:02,330
the tuples whose student name is

186
00:06:02,450 --> 00:06:03,450
Mary, and then check

187
00:06:03,720 --> 00:06:05,850
each one of those to see if the GPA is greater than 3.9.

188
00:06:06,950 --> 00:06:09,840
Alternatively we might have an index on the GPA.

189
00:06:10,890 --> 00:06:11,770
In that case, the system

190
00:06:12,170 --> 00:06:13,380
will use the index to find

191
00:06:14,050 --> 00:06:15,020
the students whose GPA is

192
00:06:15,250 --> 00:06:18,380
greater than 3.9 and then look to see if their name is Mary.

193
00:06:19,370 --> 00:06:20,340
Finally, it is possible we

194
00:06:20,450 --> 00:06:21,540
can have an index on the

195
00:06:21,630 --> 00:06:23,000
two attributes together so we

196
00:06:23,130 --> 00:06:24,310
can have S name and GPA

197
00:06:24,570 --> 00:06:25,710
together and then this

198
00:06:25,990 --> 00:06:27,050
index can be used to

199
00:06:27,620 --> 00:06:29,360
simultaneously find students that

200
00:06:29,510 --> 00:06:32,040
have the name Mary and the GPA greater than 3.9.

201
00:06:34,040 --> 00:06:35,020
Now, I should mention that because

202
00:06:35,310 --> 00:06:36,580
this is an inequality

203
00:06:37,290 --> 00:06:38,690
condition, it is important that

204
00:06:38,850 --> 00:06:39,890
the GPA is a tree

205
00:06:40,160 --> 00:06:41,330
based index in order to

206
00:06:41,460 --> 00:06:43,600
support that evaluation of

207
00:06:43,620 --> 00:06:44,990
this condition where the student

208
00:06:45,260 --> 00:06:46,630
name is an equality condition so

209
00:06:46,770 --> 00:06:49,100
that could be a hash based index or a tree based index.

210
00:06:50,420 --> 00:06:51,890
Now let's look at a query that involves a joint.

211
00:06:52,520 --> 00:06:53,850
We're joining the student and apply

212
00:06:54,270 --> 00:06:55,300
tables in order to find

213
00:06:55,610 --> 00:06:57,890
the names of the colleges that each student has applied to.

214
00:06:58,220 --> 00:06:59,980
And we're returning the student name and the college name.

215
00:07:00,990 --> 00:07:02,210
So let's suppose for starters

216
00:07:02,690 --> 00:07:04,050
that we had an index on

217
00:07:04,380 --> 00:07:06,730
the student ID attribute of the apply relation.

218
00:07:07,580 --> 00:07:08,990
If that's the case then the

219
00:07:09,180 --> 00:07:10,620
query execution engine can scan

220
00:07:11,250 --> 00:07:12,540
the student relation and, for

221
00:07:12,750 --> 00:07:14,210
each student, use that SID

222
00:07:14,680 --> 00:07:15,710
and quickly find the matching

223
00:07:16,160 --> 00:07:17,410
SIDs in the apply relation.

224
00:07:18,900 --> 00:07:20,410
Alternatively, let's suppose we

225
00:07:20,510 --> 00:07:21,900
had an index on the

226
00:07:22,190 --> 00:07:23,870
SID attribute of the student relation.

227
00:07:24,750 --> 00:07:25,900
In that case, the system could

228
00:07:26,450 --> 00:07:28,210
scan the apply relation, and,

229
00:07:28,300 --> 00:07:29,220
for each student ID and each

230
00:07:29,600 --> 00:07:30,830
apply tuple, find the matching

231
00:07:31,060 --> 00:07:31,970
student IDs in the student

232
00:07:32,370 --> 00:07:34,250
tuple using the index that we have there.

233
00:07:35,250 --> 00:07:36,190
In some cases it's actually

234
00:07:36,720 --> 00:07:37,670
possible to use the two

235
00:07:37,950 --> 00:07:40,810
indexes together and make the query run even faster.

236
00:07:41,500 --> 00:07:42,720
I'm not going to go into

237
00:07:42,850 --> 00:07:43,800
detail, but indexes often allow

238
00:07:44,170 --> 00:07:45,660
relations to be accessed in

239
00:07:45,890 --> 00:07:47,180
sorted order of the indexed attributes.

240
00:07:48,100 --> 00:07:49,160
So, suppose we can get

241
00:07:49,300 --> 00:07:50,400
the student relation in sorted

242
00:07:50,680 --> 00:07:52,470
order and the apply relation in sorted order.

243
00:07:52,950 --> 00:07:53,870
Then we can kind of do a

244
00:07:54,250 --> 00:07:56,050
merge-like operation of the

245
00:07:56,240 --> 00:07:57,490
two indexes to get the

246
00:07:57,570 --> 00:07:58,840
matching student and apply

247
00:07:59,210 --> 00:08:00,780
records, those whose SIDs are equal.

248
00:08:02,130 --> 00:08:03,600
If we had additional conditions in

249
00:08:03,690 --> 00:08:04,540
the query there might be even

250
00:08:04,770 --> 00:08:05,750
more choices of how to

251
00:08:05,910 --> 00:08:07,430
use indexes, and that gets

252
00:08:07,590 --> 00:08:08,990
into the entire area of

253
00:08:09,100 --> 00:08:11,460
what is known as query planning and query optimization.

254
00:08:11,700 --> 00:08:13,590
And this is actually one

255
00:08:13,650 --> 00:08:15,000
of the most exciting and interesting

256
00:08:15,690 --> 00:08:17,540
areas of the implementation of

257
00:08:17,640 --> 00:08:19,070
database systems and is

258
00:08:19,240 --> 00:08:20,140
what allows us to have of

259
00:08:20,280 --> 00:08:22,600
a declarative query language that's implemented efficiently.

260
00:08:24,070 --> 00:08:25,380
So, indexes seem like great things.

261
00:08:25,730 --> 00:08:27,120
We just throw some indexes onto

262
00:08:27,400 --> 00:08:29,620
our data and all of a sudden our queries run much, much faster.

263
00:08:30,490 --> 00:08:32,580
So, there must be some downsides, and of course, there are.

264
00:08:33,260 --> 00:08:35,940
Let me list three of them from sort of least severe to most severe.

265
00:08:36,310 --> 00:08:39,100
So, the first one is that indexes do take up extra space.

266
00:08:39,460 --> 00:08:42,390
As I mentioned, they are persistent data structures that resides with the data.

267
00:08:42,760 --> 00:08:44,070
I consider this sort of

268
00:08:44,290 --> 00:08:45,990
a marginal downside especially with

269
00:08:46,110 --> 00:08:47,090
the cost of disk these days

270
00:08:47,360 --> 00:08:48,580
its really not that big of

271
00:08:48,640 --> 00:08:49,900
deal to use additional

272
00:08:50,280 --> 00:08:52,360
space, even to potentially double the size of your database.

273
00:08:53,640 --> 00:08:55,560
The second downside is the

274
00:08:55,810 --> 00:08:57,060
overhead involved in index creation.

275
00:08:57,820 --> 00:08:59,150
So, when a database is

276
00:08:59,250 --> 00:09:00,450
loaded, if we're going

277
00:09:00,510 --> 00:09:02,900
to have indexes, those indexes need to be created over the data.

278
00:09:03,570 --> 00:09:05,650
Or if we add indexes later on, they need to be created.

279
00:09:06,640 --> 00:09:07,900
Index creation can actually be

280
00:09:08,470 --> 00:09:11,490
a fairly time consuming operation, so I'm going to make this as a medium downside.

281
00:09:12,450 --> 00:09:13,500
On the other hand, once the index

282
00:09:13,720 --> 00:09:15,210
is created, all the

283
00:09:15,380 --> 00:09:17,950
queries run faster, so it's usually worthwhile to do it.

284
00:09:18,860 --> 00:09:19,880
The last one is the

285
00:09:19,920 --> 00:09:22,810
most significant one and that's the issue of index maintenance.

286
00:09:23,360 --> 00:09:24,250
So the index is a

287
00:09:24,300 --> 00:09:25,530
data structure that sits to

288
00:09:25,580 --> 00:09:27,330
the side of the database and helps answer conditions.

289
00:09:28,420 --> 00:09:30,100
When the values in

290
00:09:30,260 --> 00:09:31,840
the database change, then the

291
00:09:31,920 --> 00:09:34,010
index has to be modified to reflect those changes.

292
00:09:34,860 --> 00:09:36,090
So if the database

293
00:09:36,800 --> 00:09:38,930
is modified frequently, each of

294
00:09:39,150 --> 00:09:40,340
those modifications is going to

295
00:09:40,410 --> 00:09:42,590
be significantly slower than if we didn't have indexes.

296
00:09:43,400 --> 00:09:44,250
So in fact, in a database

297
00:09:44,740 --> 00:09:46,000
that's modified a whole bunch

298
00:09:46,400 --> 00:09:47,730
and not queried all that

299
00:09:47,900 --> 00:09:49,580
often, the cost of

300
00:09:49,760 --> 00:09:51,160
index maintenance can actually

301
00:09:51,540 --> 00:09:52,990
offset the benefits of having the index.

302
00:09:53,430 --> 00:09:54,720
So it really is a cost-benefit

303
00:09:55,210 --> 00:09:56,880
trade off to decide when to build indexes.

304
00:09:58,250 --> 00:10:00,260
So given that we have this cost-benefit trade off.

305
00:10:00,730 --> 00:10:01,720
How do we figure out which indexes

306
00:10:02,230 --> 00:10:03,000
to create when we have a

307
00:10:03,030 --> 00:10:04,560
database an applications on that database.

308
00:10:05,620 --> 00:10:06,460
The benefit of an index

309
00:10:07,240 --> 00:10:08,270
first of all on how big the

310
00:10:08,340 --> 00:10:09,130
table is since the index

311
00:10:09,600 --> 00:10:11,600
helps us find specific portions of the table quickly.

312
00:10:12,720 --> 00:10:14,210
It depends on the data distributions

313
00:10:14,920 --> 00:10:16,070
again because the index helps

314
00:10:16,420 --> 00:10:18,130
us find specific data values quickly.

315
00:10:18,960 --> 00:10:21,430
And finally how often we're going to query the database.

316
00:10:22,070 --> 00:10:23,650
First of all it's how we're going to update it.

317
00:10:24,030 --> 00:10:25,230
As I mentioned, every time the

318
00:10:25,290 --> 00:10:28,190
database is updated indexes needed to be maintained and that's costly.

319
00:10:29,140 --> 00:10:30,400
Every time we query the indexes

320
00:10:30,630 --> 00:10:32,250
may help us answer our queries more quickly.

321
00:10:33,740 --> 00:10:34,990
Fortunately, over the last decade

322
00:10:35,320 --> 00:10:36,740
or so, many database system

323
00:10:37,010 --> 00:10:39,360
vendors have introduced what's called a physical design advisor.

324
00:10:40,070 --> 00:10:41,470
In this case, physical design means

325
00:10:41,710 --> 00:10:44,030
determining which indexes to build on a database.

326
00:10:44,590 --> 00:10:45,650
The input to the design

327
00:10:46,130 --> 00:10:47,040
advisor is the database

328
00:10:47,570 --> 00:10:48,600
itself and the workload.

329
00:10:49,280 --> 00:10:50,650
The workload consists of the

330
00:10:50,800 --> 00:10:51,910
sets of queries and updates

331
00:10:52,310 --> 00:10:53,290
that are expected to be performed

332
00:10:53,750 --> 00:10:55,020
on the database as well as their frequency.

333
00:10:56,260 --> 00:10:57,470
Now actually the design advisor

334
00:10:57,840 --> 00:10:58,700
doesn't usually look at the

335
00:10:58,900 --> 00:10:59,990
entire database, but rather looks

336
00:11:00,230 --> 00:11:01,180
at statistics on the database

337
00:11:01,710 --> 00:11:04,250
that describe how large the tables are and their data distributions.

338
00:11:05,570 --> 00:11:06,580
The output of the design advisor

339
00:11:07,070 --> 00:11:08,240
is a recommended set of indexes

340
00:11:08,800 --> 00:11:11,010
to build that will speed up the overall workload.

341
00:11:12,470 --> 00:11:13,940
Interestingly, physical design advisors

342
00:11:14,440 --> 00:11:15,580
rely very heavily on a

343
00:11:15,650 --> 00:11:16,590
component of database systems that already

344
00:11:16,900 --> 00:11:18,130
existed, actually one of the

345
00:11:18,450 --> 00:11:19,890
most important components of database

346
00:11:20,310 --> 00:11:21,460
systems, which is the query optimizer.

347
00:11:21,870 --> 00:11:23,520
That's the component that

348
00:11:23,670 --> 00:11:25,720
takes a query and figures out how to execute it.

349
00:11:26,170 --> 00:11:27,790
Specifically, it'll take statistics on

350
00:11:28,000 --> 00:11:29,150
the database, the query to be

351
00:11:29,250 --> 00:11:30,340
executed or the update command,

352
00:11:31,200 --> 00:11:32,320
and the set of indexes that

353
00:11:32,440 --> 00:11:33,880
currently exist, and it

354
00:11:34,000 --> 00:11:35,550
will explore the various ways

355
00:11:36,010 --> 00:11:37,040
of actually executing the query,

356
00:11:37,330 --> 00:11:39,720
which indexes will be used, which order things will be done in.

357
00:11:40,250 --> 00:11:41,290
It estimates the cost of

358
00:11:41,390 --> 00:11:42,560
each one, and it spits

359
00:11:42,870 --> 00:11:44,260
out the estimated best execution

360
00:11:44,900 --> 00:11:46,130
plan with the estimated cost.

361
00:11:47,120 --> 00:11:48,060
So now let's look at how

362
00:11:48,400 --> 00:11:50,850
this component can be used to build a design advisor.

363
00:11:51,700 --> 00:11:52,560
Let's just draw the design

364
00:11:52,990 --> 00:11:54,170
advisor around the whole thing

365
00:11:54,470 --> 00:11:56,190
here, and the input

366
00:11:56,470 --> 00:11:57,950
to the design advisor, again, are

367
00:11:58,400 --> 00:11:59,980
the statistics and the workload,

368
00:12:00,920 --> 00:12:02,650
and the output is supposed to be the indexes.

369
00:12:03,410 --> 00:12:05,110
So what the design adviser actually

370
00:12:05,680 --> 00:12:08,730
does is it experiments with different set-ups of indexes.

371
00:12:09,780 --> 00:12:11,290
For each set-up of indexes,

372
00:12:11,660 --> 00:12:13,060
it takes the workload, it issues

373
00:12:13,460 --> 00:12:15,630
the queries, and updates to the query optimizer.

374
00:12:16,230 --> 00:12:17,070
It doesn't actually run them against

375
00:12:17,090 --> 00:12:20,570
the database and see's what cost the query optimizer produces.

376
00:12:21,660 --> 00:12:22,570
It tries this with different

377
00:12:23,000 --> 00:12:24,960
configurations of indexes and

378
00:12:25,050 --> 00:12:25,930
then in the end determines

379
00:12:26,530 --> 00:12:29,510
those indexes that bring down the cost the most.

380
00:12:29,940 --> 00:12:30,910
In other words, it will give

381
00:12:31,100 --> 00:12:32,840
you back those indexes where

382
00:12:33,130 --> 00:12:34,460
the benefits of having the

383
00:12:34,710 --> 00:12:35,980
index outweigh the drawbacks

384
00:12:36,650 --> 00:12:38,340
of having that index in terms

385
00:12:38,750 --> 00:12:40,440
of the workload and using

386
00:12:40,770 --> 00:12:42,550
the costs that were estimated by the query optimizer.

387
00:12:43,770 --> 00:12:44,760
If you're using a system that

388
00:12:44,810 --> 00:12:45,770
doesn't have a design adviser,

389
00:12:46,200 --> 00:12:48,090
then you'll have to kind of go through this process yourself.

390
00:12:48,620 --> 00:12:49,370
You'll have to take a look

391
00:12:49,600 --> 00:12:50,720
at the queries and updates that

392
00:12:50,850 --> 00:12:52,190
you expect, how often you

393
00:12:52,270 --> 00:12:53,510
expect them to happen, and

394
00:12:53,620 --> 00:12:55,470
which indexes will benefit those

395
00:12:55,710 --> 00:12:57,090
queries, and hopefully won't incur

396
00:12:57,410 --> 00:12:58,880
too much overhead when there are updates.

397
00:13:00,110 --> 00:13:02,160
Just quickly, here's the SQL standard for creating indexes.

398
00:13:02,950 --> 00:13:03,940
All indexes are given names.

399
00:13:04,450 --> 00:13:06,120
We can create an index on a single attribute.

400
00:13:06,940 --> 00:13:08,960
We can create an index on several attributes together.

401
00:13:09,900 --> 00:13:10,930
We can also say that we

402
00:13:11,040 --> 00:13:12,120
want our index to enforce

403
00:13:12,640 --> 00:13:13,870
a uniqueness constraint so when

404
00:13:14,100 --> 00:13:16,610
we add the word unique it sort of adds constraint enforcement.

405
00:13:17,270 --> 00:13:18,040
It says, we're going to check

406
00:13:18,370 --> 00:13:19,450
that all values for "A" are

407
00:13:19,510 --> 00:13:20,790
unique using our index and

408
00:13:20,900 --> 00:13:22,060
will generate an error if there

409
00:13:22,360 --> 00:13:23,160
are two values that have the

410
00:13:23,380 --> 00:13:24,950
same, two tuples that have

411
00:13:25,090 --> 00:13:26,490
the same value for A, and

412
00:13:26,630 --> 00:13:28,240
finally we have a command for dropping indexes.

413
00:13:29,740 --> 00:13:31,140
In summary, indexes are really important.

414
00:13:31,700 --> 00:13:33,980
They're the primary way to get improved performance on a database.

415
00:13:34,940 --> 00:13:36,550
By building the right indexes over

416
00:13:36,820 --> 00:13:37,930
a database for its work flow

417
00:13:38,150 --> 00:13:40,240
we can get orders of magnitude performance improvement.

418
00:13:41,050 --> 00:13:41,980
Although we do have to be careful,

419
00:13:42,140 --> 00:13:43,340
because there are trade-offs in

420
00:13:43,430 --> 00:13:46,340
building indexes, especially for databases that are modified frequently.

421
00:13:47,360 --> 00:13:48,530
There are persistent data structure that

422
00:13:48,680 --> 00:13:49,940
are stored together with the database,

423
00:13:50,300 --> 00:13:51,920
and there are many interesting implementation issues,

424
00:13:52,610 --> 00:13:53,440
but in this video and course

425
00:13:53,740 --> 00:13:55,030
we're focusing specifically on the

426
00:13:55,090 --> 00:13:56,870
user and application perspective, determining

427
00:13:57,320 --> 00:13:58,630
which indexes to build and

428
00:13:58,750 --> 00:14:00,600
how they will gain performance improvement for us.