This video talks about indexes.
It's actually a relatively short video about a very important topic.
Before I get started, however, let
me mention that indexes are also
sometimes prefer two as indices, those are equivalent.
I personally prefer using the term indexes.
The reason indexes are so important
is that they are the primary
way of getting improved performance out of a database.
Indexes are a persistent data structure.
They're stored together with the database itself.
Now, there are many very interesting
implementation issues in indexes,
but in this video and in
this course in general, we're focusing
on the perspective of the user and application.
So, we'll talk about how applications
will use indexes to speed up performance.
So, let's suppose we have a
very simple table T that
has three columns, but we're
going to focus on columns A
and columns B. And we can
see that Column A is a
string valued column with animal
names, and Column B is an integer column.
Now, we're gonna be asking queries that involve conditions over these two columns.
In order to speed up those
queries, if we're concerned about
evaluating conditions on column A,
then we can build an index on that column.
So we call that an index on
column T.A. What that
index allows us to do
- and "us" in this case
is actually the query processor -
is ask questions, for example,
let's ask what tuples have
"cow" in the value of
T.A. If we
ask that question of our index,
that that index will
quickly tell us that tuple
3 and tuple 7 have
a value "cow", without scanning the entire table.
We could also ask the index
what tuples have say, value "cat".
And if we ask the index that
question, it will tell us
tuple 1 and tuple 5
and tuple 6 have the value "cat."
If we're interested in evaluating conditions
in column B then we can
also build an index on
column B. For example now
we could ask questions like, when
is T.B equal to the value two?
We asked the index and the
index will tell us
that tuple 1 and tuple 5 have the value two.
We could also ask, for example,
when the value in T.B
is less than six?
And the index in that case
would tell us that tuple
1 is less than six, two...wow,
most of them: three, five, and seven.
We could ask an even more complicated question.
We could ask when the
value for T.B is say,
greater than four and less than or equal to eight.
Again, we ask the index,
and in this case the index
would tell us that it
is tuple two and tuple
seven in that case.
Lastly, suppose we're interested in
having conditions that are
on both columns A and B.
Then we can build an index
that is on both columns together.
Now we could ask questions, for
example, like, when is T.A
equal to cat and T.B,
say, greater than five?
Do we have any of those?
Well, we have just one of them there.
That's tuple six.
We could also have inequalities, by the way, on the first column.
So we might ask, when is
T.A less than, say the
value D, and T.B equal
to say the value 1, and
in that case we'll get the tuple 3 as a result.
So I think this gives an
idea with a simple table
of how indexes are used to
go directly to the tuples that
satisfy conditions rather than scanning the entire table.
So that's the main utility of an index.
Again, the tables can be
absolutely gigantic in databases
and the difference between scanning
an entire table to find
that tuples that match a condition
and locating the tuples, more or
less immediately using an index,
can be orders of magnitude in performance difference.
So really it's quite important to
take a look at the database and
build indexes on those
attributes that are going to
be used frequently in conditions, especially conditions that are very selective.
Now I mentioned that we're not
covering the implementation of indexes
in this video, but it is
important to understand the basic data structures that are used.
Specifically there are two different structures.
One of them is balance trees
and substantiation of that is
typically what's called a B-tree or a B+tree.
And the other is hash tables.
Now balance trees indexes can
be used to help with conditions
of the form "attribute equals value." They
can also be used for
"attribute less than value," for
"attribute between two values"
and so on, as we have shown earlier.
Hash tables, on the other
hand, can only be used for equality conditions.
So only attribute equal value.
And if you're familiar with these
structures then you'll know why there's the limit on hash tables.
So balanced tree indexes are certainly more flexible.
Now there is one small downside.
For those of you who are
familiar with the structures and
with the running time, the operations
on the balance trees tend
to be logarithmic in their
running time, while well designed
hash tables can have more
or less constant running time.
Even in large databases, logarithmic is
okay, although when only equality
conditions are being used, then
a hash table index might be preferred.
Now, let's take a look at
a few SQL queries and see
how indexes might allow the
query execution engine to speed up processing.
We'll use our usual student and college database.
The first one is a very
simple query, it's looking for
the student with a particular student ID.
So if we have an index on
the student ID then again the
index will allow the
query execution engine to go
pretty much straight to that
tuple, whereas without an
index the entire student table would have to be scanned.
Now let me mention that many
database systems do automatically build indexes on primary keys.
So it's likely that in an
application the student ID would
be declared as a primary key and there would be an index automatically.
But it's a good thing to check if this type of query is common.
And some systems even also build
indexes automatically on attributes that are declared as unique.
As a reminder from the constraint
video, every table can have
one primary key, and then
any number of additional keys labeled as unique.
Now let's take a look at a slightly more complicated example.
Here we're looking for students whose
name is Mary and whose
GPA is greater then 3.9, and there may be a few of those students.
So one possibility is that
we have an index on the
student name and if
that is the case, expand the query
processing can find quickly
the tuples whose student name is
Mary, and then check
each one of those to see if the GPA is greater than 3.9.
Alternatively we might have an index on the GPA.
In that case, the system
will use the index to find
the students whose GPA is
greater than 3.9 and then look to see if their name is Mary.
Finally, it is possible we
can have an index on the
two attributes together so we
can have S name and GPA
together and then this
index can be used to
simultaneously find students that
have the name Mary and the GPA greater than 3.9.
Now, I should mention that because
this is an inequality
condition, it is important that
the GPA is a tree
based index in order to
support that evaluation of
this condition where the student
name is an equality condition so
that could be a hash based index or a tree based index.
Now let's look at a query that involves a joint.
We're joining the student and apply
tables in order to find
the names of the colleges that each student has applied to.
And we're returning the student name and the college name.
So let's suppose for starters
that we had an index on
the student ID attribute of the apply relation.
If that's the case then the
query execution engine can scan
the student relation and, for
each student, use that SID
and quickly find the matching
SIDs in the apply relation.
Alternatively, let's suppose we
had an index on the
SID attribute of the student relation.
In that case, the system could
scan the apply relation, and,
for each student ID and each
apply tuple, find the matching
student IDs in the student
tuple using the index that we have there.
In some cases it's actually
possible to use the two
indexes together and make the query run even faster.
I'm not going to go into
detail, but indexes often allow
relations to be accessed in
sorted order of the indexed attributes.
So, suppose we can get
the student relation in sorted
order and the apply relation in sorted order.
Then we can kind of do a
merge-like operation of the
two indexes to get the
matching student and apply
records, those whose SIDs are equal.
If we had additional conditions in
the query there might be even
more choices of how to
use indexes, and that gets
into the entire area of
what is known as query planning and query optimization.
And this is actually one
of the most exciting and interesting
areas of the implementation of
database systems and is
what allows us to have of
a declarative query language that's implemented efficiently.
So, indexes seem like great things.
We just throw some indexes onto
our data and all of a sudden our queries run much, much faster.
So, there must be some downsides, and of course, there are.
Let me list three of them from sort of least severe to most severe.
So, the first one is that indexes do take up extra space.
As I mentioned, they are persistent data structures that resides with the data.
I consider this sort of
a marginal downside especially with
the cost of disk these days
its really not that big of
deal to use additional
space, even to potentially double the size of your database.
The second downside is the
overhead involved in index creation.
So, when a database is
loaded, if we're going
to have indexes, those indexes need to be created over the data.
Or if we add indexes later on, they need to be created.
Index creation can actually be
a fairly time consuming operation, so I'm going to make this as a medium downside.
On the other hand, once the index
is created, all the
queries run faster, so it's usually worthwhile to do it.
The last one is the
most significant one and that's the issue of index maintenance.
So the index is a
data structure that sits to
the side of the database and helps answer conditions.
When the values in
the database change, then the
index has to be modified to reflect those changes.
So if the database
is modified frequently, each of
those modifications is going to
be significantly slower than if we didn't have indexes.
So in fact, in a database
that's modified a whole bunch
and not queried all that
often, the cost of
index maintenance can actually
offset the benefits of having the index.
So it really is a cost-benefit
trade off to decide when to build indexes.
So given that we have this cost-benefit trade off.
How do we figure out which indexes
to create when we have a
database an applications on that database.
The benefit of an index
first of all on how big the
table is since the index
helps us find specific portions of the table quickly.
It depends on the data distributions
again because the index helps
us find specific data values quickly.
And finally how often we're going to query the database.
First of all it's how we're going to update it.
As I mentioned, every time the
database is updated indexes needed to be maintained and that's costly.
Every time we query the indexes
may help us answer our queries more quickly.
Fortunately, over the last decade
or so, many database system
vendors have introduced what's called a physical design advisor.
In this case, physical design means
determining which indexes to build on a database.
The input to the design
advisor is the database
itself and the workload.
The workload consists of the
sets of queries and updates
that are expected to be performed
on the database as well as their frequency.
Now actually the design advisor
doesn't usually look at the
entire database, but rather looks
at statistics on the database
that describe how large the tables are and their data distributions.
The output of the design advisor
is a recommended set of indexes
to build that will speed up the overall workload.
Interestingly, physical design advisors
rely very heavily on a
component of database systems that already
existed, actually one of the
most important components of database
systems, which is the query optimizer.
That's the component that
takes a query and figures out how to execute it.
Specifically, it'll take statistics on
the database, the query to be
executed or the update command,
and the set of indexes that
currently exist, and it
will explore the various ways
of actually executing the query,
which indexes will be used, which order things will be done in.
It estimates the cost of
each one, and it spits
out the estimated best execution
plan with the estimated cost.
So now let's look at how
this component can be used to build a design advisor.
Let's just draw the design
advisor around the whole thing
here, and the input
to the design advisor, again, are
the statistics and the workload,
and the output is supposed to be the indexes.
So what the design adviser actually
does is it experiments with different set-ups of indexes.
For each set-up of indexes,
it takes the workload, it issues
the queries, and updates to the query optimizer.
It doesn't actually run them against
the database and see's what cost the query optimizer produces.
It tries this with different
configurations of indexes and
then in the end determines
those indexes that bring down the cost the most.
In other words, it will give
you back those indexes where
the benefits of having the
index outweigh the drawbacks
of having that index in terms
of the workload and using
the costs that were estimated by the query optimizer.
If you're using a system that
doesn't have a design adviser,
then you'll have to kind of go through this process yourself.
You'll have to take a look
at the queries and updates that
you expect, how often you
expect them to happen, and
which indexes will benefit those
queries, and hopefully won't incur
too much overhead when there are updates.
Just quickly, here's the SQL standard for creating indexes.
All indexes are given names.
We can create an index on a single attribute.
We can create an index on several attributes together.
We can also say that we
want our index to enforce
a uniqueness constraint so when
we add the word unique it sort of adds constraint enforcement.
It says, we're going to check
that all values for "A" are
unique using our index and
will generate an error if there
are two values that have the
same, two tuples that have
the same value for A, and
finally we have a command for dropping indexes.
In summary, indexes are really important.
They're the primary way to get improved performance on a database.
By building the right indexes over
a database for its work flow
we can get orders of magnitude performance improvement.
Although we do have to be careful,
because there are trade-offs in
building indexes, especially for databases that are modified frequently.
There are persistent data structure that
are stored together with the database,
and there are many interesting implementation issues,
but in this video and course
we're focusing specifically on the
user and application perspective, determining
which indexes to build and
how they will gain performance improvement for us.