This video introduces online analytical processing or OLAP.
A subsequent video will have
a demo of OLAP queries in action.
Overall, database activity can
be divided into two broad classes.
One of them, the traditional one,
is known as OLTP, or online transaction processing.
The other one, the subject
of this video, came about more
recently, and it's known
as OLAP, or online analytical processing.
Online transaction processing is typically
characterized by short transactions,
both queries and updates.
Things like updating an account
balance in a bank database
or logging a page view in a web application.
Queries in OLTP data bases are generally fairly simple.
Find an account balance or find the GPA of a student.
They typically touch small portions of the data.
And updates in this environment can be frequent.
We might be making airline seat
reservations or updating a online shopping cart.
OLAP is pretty much the opposite in all respects.
In OLAP, we have long
transactions, often complex analysis
of the data or data mining type operations.
The queries as I said, can
be complex and especially they
often touch large portions of
the data rather than small portions as in OLTP.
And updates in the OLAP
environment tend to be infrequent,
in fact, sometimes in the
OLAP environment there are no updates to the data at all.
Now, these two are extremes
and really there is a spectrum
between those two extremes.
We might have a sort of, moderate
amount of update and queries
that touch a moderate portion of the data.
But the fact is that
database systems traditionally were designed for the first extreme.
And then special techniques were developed for the other extreme.
So the systems are tuned for the two extremes.
And depending on ones work load
one might choose to use different
options in a database system
just a little bit more terminology
in the OLAP world.
There's a concept called data warehousing.
It's really a software architecture.
The idea is that often in
enter prizes or other operation,
there are lots of operational sources.
So you can think of a
point of sale, for example, might
have many, many OLTP database
pieces related to an enterprise,
and data warehousing is the
process of bringing the data from
all of those distributed OLTP sources
into a single, gigantic warehouse where
the point then is to do
analyses of the data, and
that would fall under the OLAP camp.
Another term you might encounter is
decision support systems also known as DSS.
This isn't really an exact term.
It's generally used to talk
about infrastructure for again large scale data analyses.
So, if you think of
a data warehouse, where we're
bringing in a lot of
data from operational sources, and
that warehouse is tuned for
OLAP queries that would
be thought of as a decision support system.
And, of course, this system
is designed to support decisions
that are made, again, based on data analysis.
Now, let's get into some technical details of OLAP.
Frequently applications that are
doing online analytical processing
are designed based around a
star schema, so it's a
certain type of relational schema.
In a star schema, there's usually one fact table.
That will be a typically very
large table, it will be updated frequently.
Often it's actually append only,
so there are only inserts into the fact table.
And then there are maybe many dimension tables.
Those are updated infrequently and don't tend to be as large.
So examples of a fact table
might be sales transactions in
a sales database or in
a university database, maybe students
enrolling in courses or in
a web application logging the page views.
In all of these cases we can
see that the fact table can
be very large and can
be append only, so inserts only.
Examples of dimension tables
might be in a sales
database store's items and
customers in a college enrollment database.
Maybe students and courses in a web application.
Maybe web pages his users and advertisers.
So, you can see, that
these are generally smaller
tables, they're more stable, they're
not updated as frequently.
You can sort of think
of dimension tables as things
in the real world and then
fact tables as logging things that happened.
It's not always divided this way but, it's not a bad approximation.
Now, you might be wondering
why is it called a
star schema and it's called
that because we have the
fact table sort of, centrally
referencing dimension tables around it.
So, I'll draw the picture.
Let's take a particular example and let's look at the sales domain.
So, we'll have our fact
table here, which will be
the sales table and that
will log sales transactions actions.
It will include the store where
the sale was made, the item
that was sold, the customer, how
many were sold, and the price that was paid.
And then the other three tables are the dimension tables.
So those those are giving
us information about the stores
and the items and the customers.
So, I've drawn a picture of our schema here.
We have our central fact table, the sales table.
And we can see that the
sales table contains these three
columns I've abbreviated them in
the picture: the Store ID, Item
ID, and the Customer ID.
The store ID values in
this column will be foreign
key attributes to the
primary key of the store
table if you remember our constraints video.
So we can think of these as
pointers into the store
table, least specifically matching store
IDs over here.
And we'll have similarly our
item IDs will be foreign
keys to the item table.
I won't actually point to the values here.
And then our costumer IDs
over here will be pointing to the customer table.
So if you look at
this squinting, you will
see that it is kind of
a star schema with the
central fact table pointing
to the dimension tables around it, and that's where the name comes from.
Just a little more terminology.
The first three attributes here in the fact fact table.
These three are what are
known as dimension attributes.
So those are the attributes
that are foreign keys into the dimension tables.
Then the remaining attributes in
this case the quantity and the
price are called dependent attributes.
So they're I guess dependent on the
values for the dimension
attributes and typically, queries will
tend to aggregate on the dependent attributes.
We'll see examples of that in a moment.
So, now that we known
what a star schema looks like,
let's look at the type of
queries that are generally issued
over this schema, and they're called OLAP queries.
Typically a query over
a star schema will first
join some or all of the relations.
And when you're joining the sale
as the fact table with
the dimension tables, you can
almost think of it as expanding the
facts in the sales table
to include more information about the sales.
Since we have the foreign keys we'll
be adding, for example, to the information about a sale.
More about the store.
The city and state of the store.
For a sale item will
be adding the category brand and so on.
So that's the join
process and the query will
join as much as it
needs in order to do the rest of it's work.
It might then filter the data.
For example we might decide
that in our query we only
care about stores in California
or customers in California, we're only
interested in shirts and so on.
So they can filter on the
dimension attributes after joining,
or could filter on the price or quantity as well.
After filtering there's often a group by an aggregation.
So we might decide that we're
interested in figuring out our
total sales divided by customer
or by item or by state or all of those.
And then the aggregation might sum
up the sales or it might
determine the average price that's sold.
We'll be doing a number of this
type of query in our demo later on.
So if you think about executing
queries of this type, they
can be quite complex and they
can touch large portions of the database.
Sowe 're worried about performance,
and our data is large, we do have a worry.
Running this type of
query on a gigantic database
over a standard database system
can be very slow, but over
the past decade or so,
special indexing techniques have
been introduced and special query processing
techniques specifically to handle
this type of query on
star schemas on large databases.
And again, by large, just
think about the number of sales,
for example, in a large
retail chain, or a
number of web views, or
even shopping cart additions
in a large online vendor.
So, in all of those
applications, people are interested in
doing OLAP queries and they
tend to use a system that supports these special techniques.
Another component of getting good
performance in these systems
is the use of materialized views.
You might remember that materialized
views are useful when we
have a workload that
consists of lots of queries and not so many updates.
And that's exactly the type of workload we have in OLAP.
furthermore, we have many queries
that take roughly the same
structure so material wise we
use are useful in that setting as well.
Now let me switch gears
and introduce a different way
of looking at the data in
these OLAP applications with star
schemas, and it's what's known as a data cube.
Sometimes this is also called
multidimensional OLAP and
the basic idea is that
when we have data with dimensions,
we can think of those dimensions
as forming the axis of a cube.
It's kind of like an N dimensional spreadsheet.
Now we can have any
number of dimensions, but for
the examples I'm gonna give,
the best I can draw is
up to three dimensions, and that's why people call acute.
Because they know how to draw three dimensions.
But again, any number of
dimensions are possible in this view of the data.
So we have our dimensions forming the axis of our cube.
And then the cells of the
cube, again, you can think of it sort of like cells of a spreadsheet.
Are the fact of data.
Or the dependent data.
It's like in the previous example that
would be our quantity and price.
And finally we have aggregated
data on the sides, edges
and corners of corner of the cube.
Again similar to how you might aggregate columns in a spreadsheet.
So let's go ahead and
I'll do my best to draw
a picture to explain what's going on.
So here's my cube with these
three axes that I've drawn in black.
And I've drawn these dash lines
as well to sort of give
you a visual idea of the cube.
But I'm going to actually
get rid of these dash lines right
now just so we don't have too much clutter.
So for our sales example,
we're sticking with the same example,
we have 3 dimensions.
And those will label the three
the three axises of are cube
and in one dimension we will
have the stores and another
dimension we will have the
customers here, and in
another dimension we have the items.
Then we can think of
the points along these axes
as being the different elements
in each of those domains, or
the different tuples in each of those dimension tables.
So for example, in the
store domain, we'll have,
you know, store 1 store 2,
store 3 and so on.
I'm not giving them any fancy names here.
And so, each of those is
a point on that dimension and
similarly for the items will have
item 1 item 2 item 3 and so on.
And for the customers along the bottom,
we'll have customer 1 customer
number 2, customer 3 and so on.
Now here comes the tricky part, especially for drawing.
The idea is Is that
every cell in the
cube, so every combination of
item costumer in store has
a cell in the cube, so
this would be sort of a free floating cell here.
And This will have for our
schema the quantity and
the price for that
item, that customer, and that store.
So this might be the floating
thing here that's, you know, Item
I32, Costumer 4, and
Store 17, something like that.
And then floating in
there is this cell with the quantity and the price.
Now we are assuming that there's
just one quantity and price
for the combination of those three attributes.
And I'll come back to that in a moment, but let's assume that for now.
So that's what we have in
the whole central area of the cube.
So now on the faces, edges,
and corner of the cube
are going to have aggregated data.
And there does need
to be with each data cube a predefined aggregate.
So for this one let's say
that what we want as our
aggregate is the sum
of the quantity times the price
so we're going to figure
out the total amount that we're
making for different combinations
of stores, items, and customers.
So now let's consider a
cell on the face of the cube.
So again, I'm not drawing this very well.
But let's assume this is on the bottom face of the cube.
So, this is for a particular customer.
Say customer 10, in a
particular store, say store 7,
and then, since it's
on the bottom of the cube,
so we didn't go up this dimension
here, it considers all items
for customer 10 and store 7.
So this will be
the aggregate over all items
for that particular store and customer.
And we'd have similar values on the other faces of the cube.
So this face over here, for
example, would be for a
particular item and customer overall stores.
And then on the front face
of the cube, if you could imagine
that, would be for a particular
item and store over all customers.
Now let's talk about what's on the edge of the cube.
So here we have, say
for store 3, we'll
have the aggregate value over
all customers and items
in this point for store 3.
So that will be the
total sales that we
conducted at store S3.
Over here on this edge
we'd have the total for a
specific costumer and over here for specific items.
And then finally, we have
at the corner of the
cube the full aggregation.
So that's going to be in
this case the sum of
the quantity times price for every
store, customer and item.
So, I'm not a great artist,
but I hope this gives you
some understanding of how the data cube works.
So as we saw in the
cube, we have one cell
in the cube for each combination of
store ID, item ID, and customer ID.
So if those three together form a key, then it's very straight forward.
If the dimension attributes together
don't form a key then
we might be pre-aggregating already inside the data cube.
So, we might decide to already
have say the sum of quantity
times price for each combination
of store item and customer.
Another possibility and it's done
quite commonly is to add
to the fact table the attribute
date, or even the time.
And that can be used to create a key.
Typically, we won't have two
transactions at exactly the same time.
Now if we do have an
attribute here called date, one
might wonder is that a
dimension attribute or a dependent attribute.
Actually, it's pretty much
a dimension attribute because we're
gonna use it as another dimension
in our data cube, but the
difference being that we would
not have an actual dimension table listing the dates.
Now let's move on to
a couple other concepts in the
olap world called drill down and roll up.
The idea of drill
down, is that we may
be examining summary data and then we want to get more information.
Drill down into the details of that data.
And, actually, we can think
of that very specifically in
a sequel context as follows.
Let's suppose that we have
this query and sequel which
follows by the way the
description of the query I
had earlier where we'll do a
join and then a selection
and then it grouped by, and finally we have an aggregation here.
So this query specifically is
looking at our total sales broken
out by state and brand.
Maybe we'll look at that and we'll just say that's not enough detail.
I need more information.
So, to drill down what we
do is add a grouping attribute.
So if we added, for example,
category, when we add
another grouping attribute, that gets
us more data in the answer - more detail in our data.
Rollup is exactly the opposite.
Rollup says we're looking at
data and we decide we
have too much detail and we want to summarize.
And summarize is simply
a matter of removing a group by attributes.
So if we took out state, then
now we'll only see our
data summarized by brand
rather than broken out into state and brand.
And lastly, I want to add introduce some SQL constructs.
These are constructs that have been added,
fairly recently, to the SQL
standard in order to perform OLAP queries.
And we'll be seeing these in our demo.
The constructs are called with cube
and with roll up and
they're added to the group by clause.
When we add with cube to
a query with a group by
what happens is that, basically,
we're adding to the result
of our query, the faces,
edges, and corner of the cube.
Using no values for the attributes that we're not constraining.
We'll see this clearly in the demo.
With <u>rollup is similar to with<u>cube, except it's smaller.</u></u>
It actually is a portion of
the data cube, and that
makes sense when we have dimensions
that are inherently hierarchical.
And again we'll see that in the demo as well.
So, we can conclude there are
two broad types of data
base activity, online transaction processing.
Short, simple transactions touching
small portions of the data,
lots of updating and OLAP,
or online analytical processing, where
we have complex queries, long transactions,
might touch a large portion of
the data and might not update the data at all.
For online analytical processing OLAP
we saw that star schemas are frequently used.
We saw how to view the data as a data cube.
Of course, that can be in any number of dimensions.
We just use three for visualization.
There are two new
constructs in sequel with<u>cube and with rollup.</u>
And finally this type of
query can be very stressful on
a database system when we have very large databases.
So special techniques have been
introduced into systems to help perform these queries efficiently.