In this video we'll be doing a live demonstration of OLAP. We'll create a star schema similar to the one we used in our previous examples. It will be sales data with a fact table and several dimension tables and then we'll do several different types of OLAP queries. We'll show a full star-join over the star schema. We'll show some examples of drill-down and roll-up. We'll also show what's known as slicing and dicing which we haven't explained yet, we'll just do when we get to those queries. And finally we'll show the With Cube and With Rollup clauses that have been added to the sequel standard for OLAP queries. For those we'll be using MySQL, actually for the whole demo we'll be using MySQL. MySQL supports with roll-up. Neither of the other systems that we've been using, Sequel Lite or postgres supports with roll up yet. And MySQL does not yet support With Cube, but we'll see that we can simulate the behavior With Cube, using With Rollup command of MySQL. So let's move to the demo. Here we are in the demo, we have a fact table that contains information about stores, items and customers. We don't see much here except their ID values. And we have sixty tuples loaded up in our fact table. Our dimension tables are the customer table. We have four customers with a gender and an age. We have our item table: five items, a category of the item, just t-shirts and jackets, just in blue and red, and finally we have our stores. And we have six stores that are in various cities in Santa Clara, San Mateo County in California and King County in Washington. Our first query is simply the complete star join. So, we're joining all four tables using the foreign key references in the fact table, to the keys and the dimension tables. Now since these are foreign key references, you can think of this star join as simply extending the tuples in the sales table with further information about the store item and customer mentioned in each one. So, here we go. And we should expect again sixty tuples in the results, one for each tuple in the fact table, and we can see that we have the information from the fact table, but then we also have more information that we've joined in from the dimension tables. Now it's not typical to do the complete star join, usually we would have at least constrained the star join in some way. So, our next query will do selections and projections on the JOIN. We're going to limit ourselves to sales where the state of the sale is California, the category is t-shirt, the age of the customer is less than 22, and the last condition is actually over the fact table, saying the price of the purchase was less than 25. And now we'll return the city, color, customer name, and price of the item. We run the query and we see now that we have just 5 sales that meet our criteria. So, we've seen the JOIN with constraints on the JOIN, but what's even more common in OLAP applications, is to do grouping and aggregation in order to analyze the data. And we're going to start by looking at grouping it in the aggregation involving the dimension attributes in the group by, and then the dependent attribute being aggregated. This is a very common form of query. So here we'll say we want to analyze our data based on combinations of stores and customers. So, we'll group by store ID and customer ID. And then, we'll sum up to see the total sales for each combination of store and customer. So, here's our result. Now, maybe not that meaningful to us right now, just looking at the store IDs and customer IDs, but for an analyst, it might be meaningful. We'll see in a moment doing group by an aggregation on other attributes that look a little more interesting. So now I'll demonstrate the concept of drilling down. So you might remember drilling down says that we're looking at data that's aggregated in some fashion, and we decide we want more detail. We get more detail by adding more attributes to the group by, so, let's say in this case I want to break out, not only by store and customer, but also by item. So, I'll add item ID to the group by, and also to the select, and when I run this query, I see that I get more results, and I now have more detail in the analysis of the data. Now I don't have 60 tumbles in my result even though I'm grouping by all three dimension attributes, because I do have in my fact table more than one tuple for some combinations of store, item, and customer. Next, I'm gonna introduce a query that shows the concept called slicing. Slicing of course evokes the data cube. We've talked about the data cube in the previous video, and we specifically used this example to demonstrate a three-dimensional data cube. So, the idea of a slicing query is a query that analyzes a slice of the cube and it does that by constraining one of the dimensions. So what I am going to do, is add to this query a constraint that says let's only consider sales that are from the state of Washington. And when I add that we'll continue to do the group by an aggregation, but only on that slice of the cube representing Washington stores, the sales that are from Washington stores. So it made a few changes to the query. The first thing we see is that I added the store relation to the from clause in order to constrain the state of the sale, I have to join with the store table that dimension table so that I can access the value of state which is not present in the sales table. So I also had to add variables, and I add the join condition for the sales table to join with the dimension table, and then I add the constraint that the state is Washington. The last small change is to add a variable to the store ID, so that I don't get an ambiguous error. So now let's go ahead and run the query and we see that we do get a subset of what we had before. And that subset is the slice of the data cube, if you want to think of it that way, representing the sales in Washington State. Now in addition to slicing, there is also a notion of dicing, and again if you imagine the data cube, what a dice does is slice in two dimensions and it gives you a chunk of the cube. So I am going to add a condition on the item being sold. I am going to add that the item must be red. So then I'm going to be looking at the chunk of the data cube that identifies red items purchased in the state of Washington. So here's the query; I've added a third relation to the "from" clause. I added the item dimension tables so that I can constrain the color to be red. I have added a "join" condition, joining the fact table to the item dimension table. I have added the condition that the color is red. And finally, I had to add again the variable names to make sure I didn't get ambiguous attribute references. So we'll run this query. And we see that we get an even smaller portion of the data cube. Again a dice where we have constrained two dimensions. So we have seen drilling down and slicing and dicing, now let's take a look at rolling up. Rolling up says that we're looking at our data analysis, and we decide that it's too detailed. We want to have less detail, in other words, more aggregation. To do that, we take attributes out of the group by clause. So let's say this is way, way too much detail and we just want our sales broken down on item. So we take all of the attributes out of our group by clause except for item. Here we go. And then, when we run the query, we'll see much more summarized data. And here it is, broken down again, just by separate items. So far, our grouping and aggregation has been on the dimension attributes, specifically the ID attributes identifying the tuples in the dimension table. And that may be meaningful for analyses, but for this demonstration, it's more fun to look at attributes that actually mean something to us. So, let's try a few queries that group instead based on attributes in the dimension tables. We'll still be using the fact table. We'll be joining it with dimension tables. And here we're going to start by grouping on the state and the category of the sales. Again, summing up the total sales for the different groups. So, here we see that we have four combinations of state and category California or Washington jackets and t-shirts, and then we have the total sales in each of those so we can see in both states the sales for jackets account for considerably more than the sales for t-shirts, but we do also notice that in California there's a lot more sales of t-shirts than there are in Washington, and I guess that's not surprising given the weather. Now let's demonstrate drill down on this query. So let's say we want a little more information. Maybe we want a breakdown by county in addition to state. So to do that, we add county to the group by clause and the select clause, and when we run the query we see we do now have more detail. We can see, for example, that we had more jacket sales in Santa Clara County than in San Mateo County although the t-shirts were approximately equal. A little, actually, more t-shirts in San Mateo. This is a little surprising, because Santa Clara is generally warmer than San Mateo, but it's fabricated data. Now let's see, we want it drilled it out even further, and we want to break our data down by gender as well as the other attributes. In order to drill down based on gender, I first needed to add the customer table to our from clause. Prior to this, we weren't doing any analysis that involved any attributes of the customer table. And so I need to add the join condition here. And then to do the drill down, I add the gender attribute to the group by and to the select. We run the query, and what do we find? Well, we can see, for example that not too many females in San Mateo County are buying t-shirts. Most of those t-shirt sales are counted for by males. The other thing we notice is that we don't have every single combination of county, category and gender. Very specifically we look and we can see that no males in King County bought any t-shirts. So we only have in our result, those combinations of dimension values that actually appear in the fact table. Or in the join of the fact table with the dimension tables. Now let's suppose after I've gotten to this level of analysis, I've decided that what might be most interesting is the data broken down just by combination of state and gender. So that would be a roll-up operation. And remember for roll-up I take attributes out of the group by clause. So I take out the county and category and I'm just interested in state/gender combinations. By the way at this point if I wanted to, I could remove the item table because I'm no longer constraining based on items or grouping based on items. But I'll just leave it in for now, it's not going to make a difference in our result. Of course for efficiency I might just prefer to take it out. Anyway let's run the query and now we see that rolled up data. And when we look at it I guess our main conclusion here would be that in our database, the Californian males are the big spenders. So far everything we've seen has actually just used the regular SQL constructs that we already knew about. Although we were demonstrating them over a star schema and we're showing the types of queries that one tends to run over star schema's in OLAP applications. Now we're going to move to the specific constructs that have been added to SQL for OLAP. As you may remember the two constructs are called With Cube and With Rollup and they are written in the group by-clause. So our first query is the absolute basic cube query. We start with our sales fact table, we group on all three of its dimensions, and then we add With Cube. And what we're going to get is, basically, the data in the data cube that we saw in our previous video that introduced the cube. Let's go ahead and run the query. Disappointingly we get an error message that this version of MySQL doesn't yet support cube. Actually no version of MySQL supports cube at this point in time but we're hopeful that a future one will. In fact of the three open source systems that we have been using for our demos only MySQL supports the with rollup so we will continue with MySQL and in fact you can get the same result that you would get if with cube was supported, using with roll up, although it's a little bit contorted, but I'll do that now, to demonstrate what you would get if you wrote with cube. So here's the query, I'm not going to explain it, if you're particularly interested in it, you can download our script and exam it yourself. Again what I'm most interested in is looking at the results, and here they are. So this result is the result of a With Cube on the grouping and aggregation on all dimensions of our sales table, all three dimensions. So some of our tuples look very normal. This first tuple, second, and third tuple are all giving us the total price for combination of store, item, and customer, and these are exactly what we got in our original query that didn't have the With-Cube operator. But now we see some tuples that have blanks and as a reminder, blanks in this interface mean null. So this tuple is store item and null. And what this corresponds to is an element on the face of the cube. So this is the face of the cube that's not constraining the customer dimension. And what the value gives us then is the value we have on the face of the cube, which is the total sales for the combination of store one and item one and any customer, and that's seventy. And we can cross check that by seeing that the first three tuples add up to 70, because those happen to be the three customers that have made purchases at Store 1 and Item 1. And we can similarly see for Store 1 and Item 2 the total sales are 150. If we scroll down a bit we'll see cases where we have other single-null values; I'll keep going until I find one here. For example, here we have the null value for the store and that would be the face of the cube that's not constraining the store dimension. So that would be the sum for Item 1 and Customer 1 at any store, Item 1 and Customer 2 to at any store. Then we also have tuples that have two null values, here's an example. So this one is going to be corresponding to the edge of the cube. so this is the edge of the cube that is along the store dimension and is picking out Store 5 along that dimension and then giving the total sales for all items and all customers in Store 5, and we see that. And again we'll see other instances where we're missing two values, scroll down to get here. We've got one where we're missing the store in the item, so this would be on the edge along the customer dimension. So this gives us the total sales for any store item made by customer 3. And then finally, if we find it, we'll find the one that's all three null values, and that's right here. So that represents the corner of the cube. That gives us the total for all stores, all items and all customers. So, what we've done here by adding With Cube--and again, that would be the result if this query here were supported--what we're getting is the contents of the data cube, which are the inner parts of the cube, which is what we would get without adding anything to our group by and what we saw in one of our very first queries of the demo, in addition to the data on the faces, the edges, and the corner of the cube. Now let's do a quick cross check. So this data tells us that the corner of the cube, here, the one with all null values, is 33.50 total price. So, that should be exactly what we get if we sum up our prices without doing any grouping or aggregation so lets give that a try. So we merely take our sales table, we sum up the total prices and we run it and indeed we get thirty-three fifty. So now we've seen how SQL can be used to effectively create the data cube. Sometimes in OLAP applications it's convenient to query the cube directly. So people frequently create data cube say, in SQL using a materialized view, or just by creating a table that contains the result of the data cube which is what we are going to do. There are even some systems that use the data cube as their basic native data model. So let's go ahead and take the result of our cube query and put it in a table. So I've just added "create table cube as" before our query and the one the other thing I did was add a name to the result of the aggregation attribute so that we can refer to it in the data cube. So P then will then be containing the sum of the price for the items inside the data cube as well as for the faces, edges, and corners. So, let's go ahead and create that table and now let's look at querying the cube directly. So, the first thing you will notice in the next few queries is that we've replaced the sales table with the cubed table so we're gonna run our queries directly on the cube and we can join the cube with the store and item tables and the customer, if we needed it, just as we joined the sales table because it does contain the same dimension attributes. What the cube gives us is pre-aggregated data, both for the store, a customer and item combinations. As well as the data that has null values that is already aggregated for the faces edges and corner of the cube as we'll just see in a moment. So what our query is going to do is find total sales of blue items in the state of California. And it will start by showing the tupples that are used to get that total. So we'll join our cube table with the store and the item dimension tables in order to constrain the state and the color and then we will constrain the state and the color. Notice the last thing we say is that customer id is null and that's going to give us the data on the face of the cube that doesn't go along with the customer's dimension. That means it's going to be pre-aggregated for all customers and that's what we want since we don't care about the customers and just about the color and the state. So, let's first run this query and we see we get six tuples, and these tuples are all from the portion of the result of that cube query we ran that has a null value for customer ID and that is all combinations of stores of items, and if we checked our dimension tables we'd see that these stores are in California and these items are blue and these are our total sales broken down by store and item. Finally, to get the total total sales, we'll just sum up our column "p" which remember was the sum of price in the data queue. So we replace c. with sum of p and we run the query and we see that our total is 13.65. Now let me show you something that may seem non-intuitive at first, but if you think about it or maybe run the queries yourself, you'll start to understand exactly what's happening. I'm going to go back to c. in the select clause. And incidentally, I didn't mention it before but c. gives us in the select clause all the attributes from the cube table, and that is showing which tuples are being used for the aggregation. So, we had before six tuples that were being used. Now, I'm going to make one tiny change to the query. Instead of finding the customer IDs that are null, I'm going to find the customer IDs that are not null. I'm going to run that query, and I see that we have fourteen tuples now. What we've actually done is moved away from the face of the cube and into the cube, and I've said don't give me any of the summarized data, just give me the original values that are the store item and customer together. So, I've taken away the summaries, but actually this should still give me the same answer. Let's make sure that's the case. I do sum of P, and I run it, and I get 1365. So, what happened here is I used what's effectively the same data, but broken down by customer. And I added it up, that was a less efficient query, but both of those queries do give the correct result, they do give you the total sales of California stores' blue items. Again, I encourage you to download these and run them yourself to figure out exactly what's going on. And as a last demonstration of this very same query, we can actually go back to the sales table. Let's say we don't even have our data cube. So, we're gonna use our sales table instead, and it should give us the same result, but even less efficient than the the last query that we did. So, you can see here I've changed the cube to be the sales and all the C dots to F dots. I took away the constraints on being null or not null, since that's not relevant in the sales table. Otherwise, the query looks pretty much the same. So, let's run this much and see how many tuples we are operating on now. So now, we're operating on 25 tuples. So, this is the original completely broken down data. Just as a reminder, in the data cube, even in the cells, we'll be summarizing or summing for the combination of store item and customer. So, if we have three tuples that are the same store item and customer, those will be broken out in the sales table and summarized in the cube, even in the least summarized portion of the cube, if you get what I'm saying. So now, we've got our 25 tuples, and if we sum up the price of those twenty-five tuples, we should get that same result, "$13.65", and we do. So, we saw that we had six tuples in the most efficient query over the data cube. When we used just a center portion of the cube, we had fourteen or something, I actually don't remember the exact number, and then 25. All of them gave us the same answer. Obviously these numbers are very small, but if you think about huge data that has millions or tens of millions of rows, then we're talking about dramatic differences in how much data needs to be summarized or aggregated in order to get the answer. So, using the summary data from the cube can be orders of magnitude faster than going to the original fact data. Now a variation on the width cube is to give specific attributes with the cube operator itself. And what that says is to only give summaries for these dimension attributes and not the others. In other words, we'll only see null values in the result for the store ID and customer ID and we won't see any null values for the Item ID. So, in some sense, we're materializing just some of the faces, edges, and corner of the cube, not all of them. If we run this query, well, we'll get the same results that MySQL doesn't yet support it, but this is the SQL standard for the cube operator with subsets of attributes. As before, I've cooked up an equivalent query using a MySQL's with roll-up command, and I'm certainly not going to explain it, but you're welcome to download it and examine it. What I really like to look at is the result, and here it is. So, this looks a lot like our original data cube but what we will notice is that there are no null values in the item column. So, it's a subset of the result of the cube query. And we would use that when we know we're never going to be rolling up on items. So, that dimension, the summaries based on that dimension aren't going to be useful to us. Now, let's look at with roll-up, which is supported natively by the MySQL system. With roll-up, again, gives us a subset of the data. cube. But, it's a very specific one and it's based on the ordering of the attributes in the group by clause. So, let me just go ahead and run it and explain what we get. We again get null values, and those null values indicate the summary. For example, the Store 1, Item 1 and 70s. The sum of prices for all customers for Store 1 and Item 1. And we see Store 1, Item 2, as well. We again see cases with two null values, so this is the summary for Store 1, all of the items and customers, the total sales, and we also have the triple null somewhere. It's at the bottom this time, with total sales of $33.50. But what we'll see is that we don't have all combinations of null values. We have a null value for customer ID, or we have nulls for the combination of customer ID and item ID, or all three nulls. So we only have the right-hand attribute or the two most-right attributes or all of them and if we had a fourth dimension, we'd see that we have the right-most, two right-most, three right-most. This doesn't really make any particular sense for this query, but it's used when we have dimensions that are hierarchical. And I actually introduced a hierarchical dimensions in our original schema just for this purpose. So, lets turn to that query. So, here's a query that's grouping by the state, county, and city. These are three dimension attributes, they all come from the store dimension table. And they are hierarchical, meaning that we don't have every combination of state, county and city. We have cities that are in specific counties, and counties that are in specific states. So, when we look at the results of the query, we see, of course, that we, when we have San Mateo is always in the state of California. King County is always in the state of Washington. We don't have Palo Alto combined with, say, King County with state California. So, we don't have all the combinations of the three. We have a hierarchical structure on them. And it's this structure that makes sense when we use a roll-up. So, let's add with roll-up to this query and see what we get. So, here in our result we see our original tuples for a state, county, city combination. And then, we see for example, this tuple here, which is taking the state of California and the county of San Mateo and adding up all of the sales there. And we see similarly that we have all of the sales for Santa Clara County. Then we can see that we have all of the sales for each state, so this is the California sales and the Washington sales. And finally, the triple null is the corner the cube, it's all of our sales and, as usual, we get the total of 33.50. Now, what don't we see in here compared with the data cube? Well, we don't see, for example, a tuple that has California and Palo Alto and the county as null. Why is that? Well, Palo Alto is always in Santa Clara County. So rolling up the county, or saying I want California and Palo Alto sales for every county, is exactly the same as saying I want California and Palo Alto sales in Santa Clara County. We also don't see, for example, the state and county both being null. For example, if we had Seattle as a city and the state and county being null. Well, Seattle is always in King County and always in Washington, so we're not aggregating anything there, we get the exact same results as having Seattle King in Washington. So if we ran WITH CUBE on this hierarchically-structured data, we'd actually not get anymore information, we'd have a bigger result, but it wouldn't be giving us more information. It would just be less efficient for getting the same data. So that concludes our OLAP demonstration. We saw Star Schema and we saw plain SQL queries over that schema. We saw the concept of drilling down and rolling up; also slicing and dicing. We introduced a WITH CUBE extension to SQL, which is not yet implemented in MySQL, but we were able to write a query that's equivalent to WITH CUBE. We also saw putting a WITH CUBE query into a table and then querying that table directly and that can be much more efficient than running the equivalent query in SQL directly over the fact table. We also saw WITH ROLLUP, which is implemented. We didn't demonstrate putting the result of WITH ROLLUP in a table, but we could certainly do that too. All of these features are useful primarily in applications that are performing analyses over very large data sets that exhibit this dimensional type structure, but this is actually quite a common structure in analysis applications.