1 00:00:00,380 --> 00:00:02,910 In this video we'll be doing a live demonstration of OLAP. 2 00:00:03,820 --> 00:00:05,340 We'll create a star schema similar 3 00:00:05,650 --> 00:00:06,940 to the one we used in our previous examples. 4 00:00:07,640 --> 00:00:09,160 It will be sales data with 5 00:00:09,340 --> 00:00:10,630 a fact table and several 6 00:00:11,040 --> 00:00:12,540 dimension tables and then we'll 7 00:00:12,720 --> 00:00:14,380 do several different types of OLAP queries. 8 00:00:15,150 --> 00:00:17,360 We'll show a full star join over the star schema. 9 00:00:18,480 --> 00:00:20,830 We'll show some examples of drill-down and roll-up. 10 00:00:21,360 --> 00:00:22,560 We'll also show what's known as 11 00:00:22,810 --> 00:00:23,920 slicing and dicing, which we 12 00:00:24,000 --> 00:00:26,050 haven't explained yet; we'll just explain them when we get to those queries. 13 00:00:26,910 --> 00:00:28,130 And finally we'll show the With 14 00:00:28,730 --> 00:00:29,890 Cube and With Rollup 15 00:00:30,020 --> 00:00:31,020 clauses that have been added 16 00:00:31,660 --> 00:00:33,300 to the SQL standard for OLAP queries. 17 00:00:33,780 --> 00:00:35,110 For those we'll be using 18 00:00:35,390 --> 00:00:37,760 MySQL; actually, for the whole demo we'll be using MySQL. 19 00:00:38,530 --> 00:00:40,370 MySQL supports With Rollup. 20 00:00:40,890 --> 00:00:41,700 Neither of the other systems 21 00:00:42,140 --> 00:00:43,760 that we've been using, SQLite 22 00:00:44,030 --> 00:00:45,760 or Postgres, supports With Rollup yet. 23 00:00:46,560 --> 00:00:47,710 And MySQL does not yet 24 00:00:47,890 --> 00:00:49,450 support With Cube, but we'll 25 00:00:49,590 --> 00:00:50,800 see that we can simulate the 26 00:00:51,130 --> 00:00:52,350 behavior of With Cube using the 27 00:00:52,780 --> 00:00:53,970 With Rollup construct of MySQL. 28 00:00:54,660 --> 00:00:55,480 So let's move to the demo.
29 00:00:56,470 --> 00:00:57,900 Here we are in the demo, we 30 00:00:58,080 --> 00:00:59,400 have a fact table that 31 00:00:59,570 --> 00:01:01,410 contains information about stores, items and customers. 32 00:01:01,940 --> 00:01:03,870 We don't see much here except their ID values. 33 00:01:04,340 --> 00:01:05,780 And we have sixty tuples 34 00:01:06,170 --> 00:01:07,440 loaded up in our fact table. 35 00:01:08,240 --> 00:01:10,630 Our dimension tables are the customer table. 36 00:01:10,830 --> 00:01:12,890 We have four customers with a gender and an age. 37 00:01:13,830 --> 00:01:15,660 We have our item table: five 38 00:01:15,980 --> 00:01:17,170 items, a category of the 39 00:01:17,240 --> 00:01:18,270 item, just t-shirts and jackets, 40 00:01:18,790 --> 00:01:19,890 just in blue and red, and 41 00:01:20,850 --> 00:01:22,320 finally we have our stores. 42 00:01:23,270 --> 00:01:24,420 And we have six stores that 43 00:01:24,890 --> 00:01:26,770 are in various cities in Santa 44 00:01:27,020 --> 00:01:28,370 Clara and San Mateo Counties in 45 00:01:28,470 --> 00:01:30,340 California, and King County in Washington. 46 00:01:31,580 --> 00:01:34,170 Our first query is simply the complete star join. 47 00:01:34,540 --> 00:01:35,920 So, we're joining all four tables 48 00:01:36,430 --> 00:01:37,690 using the foreign key references 49 00:01:38,210 --> 00:01:40,750 in the fact table, to the keys in the dimension tables. 50 00:01:41,590 --> 00:01:42,890 Now since these are foreign key 51 00:01:43,040 --> 00:01:44,340 references, you can think of 52 00:01:44,430 --> 00:01:46,000 this star join as simply 53 00:01:46,590 --> 00:01:48,060 extending the tuples in the 54 00:01:48,250 --> 00:01:49,610 sales table with further information 55 00:01:49,860 --> 00:01:50,850 about the store, item, and customer 56 00:01:51,640 --> 00:01:52,370 mentioned in each one. 57 00:01:53,000 --> 00:01:53,820 So, here we go.
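The star join described above can be sketched in miniature. This is a hypothetical reconstruction, not the video's script: the demo runs on MySQL with sixty sales tuples, while this sketch uses SQLite via Python's sqlite3 module, and all table names, column names, and data are invented stand-ins.

```python
import sqlite3

# Assumed miniature star schema: one fact table (sales) holding foreign
# keys into three dimension tables, as in the video.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE store(store_id INTEGER PRIMARY KEY, city TEXT, county TEXT, state TEXT);
    CREATE TABLE item(item_id INTEGER PRIMARY KEY, category TEXT, color TEXT);
    CREATE TABLE customer(cust_id INTEGER PRIMARY KEY, gender TEXT, age INTEGER);
    CREATE TABLE sales(store_id INT, item_id INT, cust_id INT, price INT);
    INSERT INTO store VALUES (1,'Palo Alto','Santa Clara','CA'),(2,'Seattle','King','WA');
    INSERT INTO item VALUES (1,'t-shirt','red'),(2,'jacket','blue');
    INSERT INTO customer VALUES (1,'F',20),(2,'M',30);
    INSERT INTO sales VALUES (1,1,1,15),(1,2,2,50),(2,2,1,60),(1,1,1,20);
""")

# The complete star join: each fact tuple is extended with the matching
# dimension data through the foreign-key joins.
rows = conn.execute("""
    SELECT f.store_id, f.item_id, f.cust_id,
           s.city, s.state, i.category, i.color, c.gender, c.age, f.price
    FROM sales f
    JOIN store s ON f.store_id = s.store_id
    JOIN item i ON f.item_id = i.item_id
    JOIN customer c ON f.cust_id = c.cust_id
""").fetchall()
print(len(rows))  # one result row per fact-table row
```

Because every foreign key matches exactly one dimension tuple, the join has exactly as many rows as the fact table (four here, sixty in the demo).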
58 00:01:54,310 --> 00:01:56,520 And we should expect again sixty tuples in the results, 59 00:01:57,030 --> 00:01:58,120 one for each tuple in the 60 00:01:58,260 --> 00:01:59,500 fact table, and we can 61 00:01:59,660 --> 00:02:01,270 see that we have the information from 62 00:02:01,360 --> 00:02:02,570 the fact table, but then 63 00:02:02,740 --> 00:02:04,750 we also have more information that 64 00:02:04,850 --> 00:02:06,450 we've joined in from the dimension tables. 65 00:02:07,470 --> 00:02:08,820 Now it's not typical to do 66 00:02:08,930 --> 00:02:10,270 the complete star join, usually 67 00:02:10,600 --> 00:02:12,940 we would have at least constrained the star join in some way. 68 00:02:13,200 --> 00:02:14,930 So, our next query will do 69 00:02:15,320 --> 00:02:17,460 selections and projections on the JOIN. 70 00:02:18,940 --> 00:02:20,000 We're going to limit ourselves to 71 00:02:20,280 --> 00:02:21,190 sales where the state of the 72 00:02:21,290 --> 00:02:22,490 sale is California, the category 73 00:02:23,000 --> 00:02:24,250 is t-shirt, the age of 74 00:02:24,310 --> 00:02:25,280 the customer is less than 22, 75 00:02:25,460 --> 00:02:28,510 and the last condition is actually over the fact table, 76 00:02:28,890 --> 00:02:30,560 saying the price of the purchase was less than 25. 77 00:02:30,640 --> 00:02:32,030 And now we'll return 78 00:02:32,390 --> 00:02:35,430 the city, color, customer name, and price of the item. 79 00:02:36,240 --> 00:02:37,160 We run the query and we 80 00:02:37,380 --> 00:02:38,470 see now that we have 81 00:02:38,720 --> 00:02:41,110 just 5 sales that meet our criteria. 82 00:02:42,290 --> 00:02:43,510 So, we've seen the JOIN 83 00:02:43,690 --> 00:02:45,330 with constraints on the 84 00:02:45,470 --> 00:02:46,500 JOIN, but what's even more 85 00:02:46,710 --> 00:02:48,710 common in OLAP applications, is 86 00:02:48,830 --> 00:02:51,640 to do grouping and aggregation in order to analyze the data. 
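The constrained star join just described can be sketched as follows, again on an invented SQLite stand-in for the demo's MySQL schema (the toy customer table has no name column, so the sketch returns the customer ID instead).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE store(store_id INTEGER PRIMARY KEY, city TEXT, county TEXT, state TEXT);
    CREATE TABLE item(item_id INTEGER PRIMARY KEY, category TEXT, color TEXT);
    CREATE TABLE customer(cust_id INTEGER PRIMARY KEY, gender TEXT, age INTEGER);
    CREATE TABLE sales(store_id INT, item_id INT, cust_id INT, price INT);
    INSERT INTO store VALUES (1,'Palo Alto','Santa Clara','CA'),(2,'Seattle','King','WA');
    INSERT INTO item VALUES (1,'t-shirt','red'),(2,'jacket','blue');
    INSERT INTO customer VALUES (1,'F',20),(2,'M',30);
    INSERT INTO sales VALUES (1,1,1,15),(1,2,2,50),(2,2,1,60),(1,1,1,20);
""")

# Star join with selections (on dimension attributes and on the fact
# table's price) and a projection, in the spirit of the video's query.
rows = conn.execute("""
    SELECT s.city, i.color, f.cust_id, f.price
    FROM sales f, store s, item i, customer c
    WHERE f.store_id = s.store_id AND f.item_id = i.item_id
      AND f.cust_id = c.cust_id
      AND s.state = 'CA' AND i.category = 't-shirt'
      AND c.age < 22 AND f.price < 25
""").fetchall()
print(rows)  # only the California t-shirt sales to young customers under $25
```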
87 00:02:52,150 --> 00:02:53,190 And we're going to start by 88 00:02:53,340 --> 00:02:54,280 looking at grouping and 89 00:02:54,400 --> 00:02:55,920 aggregation involving the dimension 90 00:02:56,540 --> 00:02:57,500 attributes in the group 91 00:02:57,790 --> 00:02:59,400 by, with the 92 00:02:59,510 --> 00:03:00,930 dependent attribute being aggregated. 93 00:03:01,530 --> 00:03:02,950 This is a very common form of query. 94 00:03:03,630 --> 00:03:04,410 So here we'll say we want 95 00:03:04,730 --> 00:03:06,120 to analyze our data based on 96 00:03:06,330 --> 00:03:07,850 combinations of stores and customers. 97 00:03:08,510 --> 00:03:10,290 So, we'll group by store ID and customer ID. 98 00:03:10,390 --> 00:03:11,600 And then, we'll sum up 99 00:03:11,870 --> 00:03:13,120 to see the total sales for 100 00:03:13,360 --> 00:03:14,680 each combination of store and customer. 101 00:03:15,930 --> 00:03:16,530 So, here's our result. 102 00:03:17,340 --> 00:03:18,440 Now, maybe not that meaningful 103 00:03:19,070 --> 00:03:20,230 to us right now, just 104 00:03:20,390 --> 00:03:21,520 looking at the store IDs and 105 00:03:21,700 --> 00:03:23,590 customer IDs, but for an analyst, it might be meaningful. 106 00:03:24,310 --> 00:03:25,330 In a moment we'll be doing 107 00:03:25,580 --> 00:03:27,040 group by and aggregation on other 108 00:03:27,380 --> 00:03:28,820 attributes that look a little more interesting. 109 00:03:29,810 --> 00:03:32,030 So now I'll demonstrate the concept of drilling down. 110 00:03:32,740 --> 00:03:33,900 So you might remember drilling down 111 00:03:34,210 --> 00:03:35,420 says that we're looking at data 112 00:03:35,890 --> 00:03:37,390 that's aggregated in some fashion, 113 00:03:37,860 --> 00:03:39,380 and we decide we want more detail.
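The grouping-and-aggregation query just described can be sketched like this, on the same invented SQLite stand-in schema used above (the demo itself is in MySQL).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales(store_id INT, item_id INT, cust_id INT, price INT);
    INSERT INTO sales VALUES (1,1,1,15),(1,2,2,50),(2,2,1,60),(1,1,1,20);
""")

# Group on two dimension attributes of the fact table, summing the
# dependent attribute (price) for each store/customer combination.
rows = conn.execute("""
    SELECT store_id, cust_id, SUM(price)
    FROM sales
    GROUP BY store_id, cust_id
    ORDER BY store_id, cust_id
""").fetchall()
print(rows)  # [(1, 1, 35), (1, 2, 50), (2, 1, 60)] with this toy data
```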
114 00:03:40,320 --> 00:03:41,760 We get more detail by adding 115 00:03:42,130 --> 00:03:43,420 more attributes to the 116 00:03:43,500 --> 00:03:44,510 group by, so, let's say in 117 00:03:44,600 --> 00:03:45,830 this case I want to break 118 00:03:46,010 --> 00:03:47,370 out, not only by store and 119 00:03:47,500 --> 00:03:49,030 customer, but also by item. 120 00:03:49,680 --> 00:03:50,940 So, I'll add item ID to 121 00:03:51,100 --> 00:03:52,500 the group by, and also to 122 00:03:52,540 --> 00:03:53,790 the select, and when I 123 00:03:53,880 --> 00:03:54,930 run this query, I see 124 00:03:55,320 --> 00:03:57,380 that I get more results, and 125 00:03:57,880 --> 00:03:59,140 I now have more detail 126 00:03:59,970 --> 00:04:01,190 in the analysis of the data. 127 00:04:02,040 --> 00:04:03,410 Now I don't have 60 tuples 128 00:04:03,590 --> 00:04:04,750 in my result even though I'm 129 00:04:04,940 --> 00:04:06,420 grouping by all three dimension 130 00:04:06,940 --> 00:04:08,150 attributes, because I do 131 00:04:08,360 --> 00:04:09,430 have in my fact table 132 00:04:09,850 --> 00:04:11,280 more than one tuple for some 133 00:04:11,530 --> 00:04:13,220 combinations of store, item, and customer. 134 00:04:14,580 --> 00:04:17,360 Next, I'm gonna introduce a query that shows the concept called slicing. 135 00:04:18,530 --> 00:04:20,390 Slicing of course evokes the data cube. 136 00:04:20,990 --> 00:04:21,980 We've talked about the data cube 137 00:04:22,550 --> 00:04:23,810 in the previous video, and we 138 00:04:23,950 --> 00:04:25,670 specifically used this example to 139 00:04:25,740 --> 00:04:27,930 demonstrate a three-dimensional data cube. 140 00:04:28,450 --> 00:04:29,650 So, the idea of a 141 00:04:30,090 --> 00:04:31,070 slicing query is a query 142 00:04:31,140 --> 00:04:32,820 that analyzes a slice of 143 00:04:32,950 --> 00:04:33,940 the cube and it does 144 00:04:34,180 --> 00:04:35,790 that by constraining one of the dimensions.
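The drill-down step can be sketched on the same invented toy data: adding item ID to the group by (and select) produces finer-grained groups. Note the result is still smaller than the fact table whenever some store/item/customer combination repeats, exactly as described above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales(store_id INT, item_id INT, cust_id INT, price INT);
    INSERT INTO sales VALUES (1,1,1,15),(1,2,2,50),(2,2,1,60),(1,1,1,20);
""")

# Drill down: item_id joins the grouping attributes for more detail.
rows = conn.execute("""
    SELECT store_id, item_id, cust_id, SUM(price)
    FROM sales
    GROUP BY store_id, item_id, cust_id
    ORDER BY store_id, item_id, cust_id
""").fetchall()
# Four fact tuples but only three groups: (1,1,1) occurs twice.
print(rows)  # [(1, 1, 1, 35), (1, 2, 2, 50), (2, 2, 1, 60)]
```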
145 00:04:36,790 --> 00:04:37,530 So what I am going to 146 00:04:37,640 --> 00:04:38,910 do, is add to this query 147 00:04:39,230 --> 00:04:40,530 a constraint that says 148 00:04:40,760 --> 00:04:43,330 let's only consider sales that are from the state of Washington. 149 00:04:44,540 --> 00:04:45,430 And when I add that we'll 150 00:04:45,490 --> 00:04:46,370 continue to do the group 151 00:04:46,530 --> 00:04:47,950 by and aggregation, but only on 152 00:04:48,120 --> 00:04:48,930 that slice of the cube 153 00:04:49,260 --> 00:04:52,950 representing Washington stores, the sales that are from Washington stores. 154 00:04:53,800 --> 00:04:55,470 So I made a few changes to the query. 155 00:04:55,870 --> 00:04:56,820 The first thing we see is 156 00:04:56,940 --> 00:04:58,250 that I added the store 157 00:04:58,760 --> 00:04:59,760 relation to the from clause. 158 00:05:00,450 --> 00:05:01,900 In order to constrain the 159 00:05:01,990 --> 00:05:03,440 state of the sale, I 160 00:05:03,510 --> 00:05:04,360 have to join with the store 161 00:05:04,710 --> 00:05:06,380 table, that dimension table, so that 162 00:05:06,550 --> 00:05:07,610 I can access 163 00:05:07,620 --> 00:05:09,850 the value of state which is not present in the sales table. 164 00:05:10,570 --> 00:05:11,380 So I also had to add 165 00:05:11,480 --> 00:05:12,800 variables, and I add 166 00:05:13,010 --> 00:05:14,600 the join condition for the 167 00:05:14,790 --> 00:05:16,150 sales table to join 168 00:05:16,420 --> 00:05:19,060 with the dimension table, and then I add the constraint 169 00:05:19,490 --> 00:05:20,330 that the state is Washington. 170 00:05:21,000 --> 00:05:22,360 The last small change is to 171 00:05:22,620 --> 00:05:23,760 add a variable to the 172 00:05:23,860 --> 00:05:26,830 store ID, so that I don't get an ambiguous-reference error.
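The slicing query can be sketched as follows on the invented toy schema: the store dimension table is joined in purely so the state can be constrained, and tuple variables (aliases) disambiguate store_id, just as described above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE store(store_id INTEGER PRIMARY KEY, city TEXT, county TEXT, state TEXT);
    CREATE TABLE sales(store_id INT, item_id INT, cust_id INT, price INT);
    INSERT INTO store VALUES (1,'Palo Alto','Santa Clara','CA'),(2,'Seattle','King','WA');
    INSERT INTO sales VALUES (1,1,1,15),(1,2,2,50),(2,2,1,60),(1,1,1,20);
""")

# Slice: same group-by-and-aggregation, restricted to one value of one
# dimension (state = 'WA'); the alias f.store_id avoids ambiguity.
rows = conn.execute("""
    SELECT f.store_id, f.item_id, f.cust_id, SUM(f.price)
    FROM sales f, store s
    WHERE f.store_id = s.store_id AND s.state = 'WA'
    GROUP BY f.store_id, f.item_id, f.cust_id
""").fetchall()
print(rows)  # only the Washington slice of the cube remains
```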
173 00:05:27,500 --> 00:05:28,690 So now let's go ahead and 174 00:05:28,770 --> 00:05:29,740 run the query and we see 175 00:05:29,940 --> 00:05:31,930 that we do get a subset of what we had before. 176 00:05:32,440 --> 00:05:33,590 And that subset is the slice 177 00:05:34,230 --> 00:05:35,080 of the data cube, if you 178 00:05:35,180 --> 00:05:35,960 want to think of it that way, 179 00:05:36,490 --> 00:05:38,550 representing the sales in Washington State. 180 00:05:39,740 --> 00:05:40,980 Now in addition to slicing, there 181 00:05:41,060 --> 00:05:42,510 is also a notion of dicing, 182 00:05:43,690 --> 00:05:44,890 and again if you imagine the 183 00:05:44,970 --> 00:05:45,820 data cube, what a dice 184 00:05:46,270 --> 00:05:47,440 does is slice in two 185 00:05:47,560 --> 00:05:49,310 dimensions and it gives you a chunk of the cube. 186 00:05:50,010 --> 00:05:51,070 So I am going to add a 187 00:05:51,450 --> 00:05:53,550 condition on the item being sold. 188 00:05:53,780 --> 00:05:55,270 I am going to add that the item must be red. 189 00:05:55,890 --> 00:05:56,600 So then I'm going to be looking 190 00:05:56,980 --> 00:05:57,950 at the chunk of the data 191 00:05:58,070 --> 00:06:00,120 cube that identifies red 192 00:06:00,440 --> 00:06:01,840 items purchased in the state of Washington. 193 00:06:03,270 --> 00:06:04,630 So here's the query; I've added 194 00:06:05,110 --> 00:06:06,570 a third relation to the "from" clause. 195 00:06:06,970 --> 00:06:08,420 I added the item dimension table 196 00:06:08,880 --> 00:06:10,830 so that I can constrain the color to be red. 197 00:06:11,640 --> 00:06:13,200 I have added a "join" condition, joining 198 00:06:13,730 --> 00:06:15,650 the fact table to the item dimension table. 199 00:06:16,420 --> 00:06:18,080 I have added the condition that the color is red.
200 00:06:18,780 --> 00:06:19,710 And finally, I had to add 201 00:06:19,860 --> 00:06:21,360 again the variable names to 202 00:06:21,490 --> 00:06:23,350 make sure I didn't get ambiguous attribute references. 203 00:06:24,370 --> 00:06:25,070 So we'll run this query. 204 00:06:25,470 --> 00:06:28,170 And we see that we get an even smaller portion of the data cube. 205 00:06:28,700 --> 00:06:30,910 Again a dice where we have constrained two dimensions. 206 00:06:32,400 --> 00:06:33,810 So we have seen drilling down and 207 00:06:34,210 --> 00:06:36,810 slicing and dicing, now let's take a look at rolling up. 208 00:06:37,470 --> 00:06:38,400 Rolling up says that we're 209 00:06:38,690 --> 00:06:39,940 looking at our data 210 00:06:40,210 --> 00:06:42,080 analysis, and we decide that it's too detailed. 211 00:06:42,860 --> 00:06:45,530 We want to have less detail, in other words, more aggregation. 212 00:06:46,770 --> 00:06:49,690 To do that, we take attributes out of the group by clause. 213 00:06:50,120 --> 00:06:50,950 So let's say this is way, 214 00:06:51,320 --> 00:06:52,500 way too much detail and we 215 00:06:52,740 --> 00:06:54,810 just want our sales broken down on item. 216 00:06:55,590 --> 00:06:56,700 So we take all of 217 00:06:56,800 --> 00:06:57,760 the attributes out of our 218 00:06:57,920 --> 00:06:59,590 group by clause except for item. 219 00:07:00,570 --> 00:07:00,890 Here we go. 220 00:07:01,570 --> 00:07:02,380 And then, when we run the 221 00:07:02,440 --> 00:07:04,850 query, we'll see much more summarized data. 222 00:07:05,790 --> 00:07:08,240 And here it is, broken down again, just by separate items. 223 00:07:09,490 --> 00:07:10,600 So far, our grouping and 224 00:07:10,760 --> 00:07:12,040 aggregation has been on 225 00:07:12,310 --> 00:07:14,620 the dimension attributes, specifically the 226 00:07:14,750 --> 00:07:17,990 ID attributes identifying the tuples in the dimension table. 
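The roll-up step just described (dropping attributes from the group by to get coarser aggregation) can be sketched on the same invented toy data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales(store_id INT, item_id INT, cust_id INT, price INT);
    INSERT INTO sales VALUES (1,1,1,15),(1,2,2,50),(2,2,1,60),(1,1,1,20);
""")

# Roll up: keep only item_id in the group by, so sales are summarized
# per item across all stores and customers.
rows = conn.execute("""
    SELECT item_id, SUM(price)
    FROM sales
    GROUP BY item_id
    ORDER BY item_id
""").fetchall()
print(rows)  # [(1, 35), (2, 110)] with this toy data
```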
227 00:07:18,810 --> 00:07:19,960 And that may be meaningful for 228 00:07:20,100 --> 00:07:21,450 analyses, but for this 229 00:07:21,940 --> 00:07:23,400 demonstration, it's more fun 230 00:07:23,570 --> 00:07:24,820 to look at attributes that actually 231 00:07:25,530 --> 00:07:26,350 mean something to us. 232 00:07:26,890 --> 00:07:28,020 So, let's try a few 233 00:07:28,190 --> 00:07:29,830 queries that group instead based 234 00:07:30,150 --> 00:07:31,800 on attributes in the dimension tables. 235 00:07:32,210 --> 00:07:33,950 We'll still be using the fact table. 236 00:07:34,280 --> 00:07:35,560 We'll be joining it with dimension tables. 237 00:07:36,260 --> 00:07:37,320 And here we're going to start by 238 00:07:37,470 --> 00:07:39,990 grouping on the state and the category of the sales. 239 00:07:40,730 --> 00:07:42,140 Again, summing up the total 240 00:07:42,610 --> 00:07:44,370 sales for the different groups. 241 00:07:45,280 --> 00:07:46,290 So, here we see that 242 00:07:46,420 --> 00:07:47,920 we have four combinations of 243 00:07:48,020 --> 00:07:49,480 state and category: California or 244 00:07:49,830 --> 00:07:50,940 Washington, jackets and t-shirts, 245 00:07:51,380 --> 00:07:52,200 and then we have the total 246 00:07:52,560 --> 00:07:53,400 sales in each of those, 247 00:07:54,070 --> 00:07:55,180 so we can see in both states 248 00:07:55,680 --> 00:07:57,290 the sales for jackets account 249 00:07:57,650 --> 00:07:58,750 for considerably more than the 250 00:07:58,870 --> 00:08:00,310 sales for t-shirts, but we 251 00:08:00,500 --> 00:08:01,820 do also notice that in 252 00:08:01,970 --> 00:08:03,270 California there are a lot 253 00:08:03,440 --> 00:08:04,930 more sales of t-shirts than 254 00:08:05,080 --> 00:08:06,290 there are in Washington, and I 255 00:08:06,410 --> 00:08:07,810 guess that's not surprising given the weather. 256 00:08:09,240 --> 00:08:11,780 Now let's demonstrate drill down on this query.
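Grouping on dimension attributes rather than IDs can be sketched as follows, again on the invented SQLite stand-in: the fact table is joined with the store and item dimension tables, and the grouping keys come from those tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE store(store_id INTEGER PRIMARY KEY, city TEXT, county TEXT, state TEXT);
    CREATE TABLE item(item_id INTEGER PRIMARY KEY, category TEXT, color TEXT);
    CREATE TABLE sales(store_id INT, item_id INT, cust_id INT, price INT);
    INSERT INTO store VALUES (1,'Palo Alto','Santa Clara','CA'),(2,'Seattle','King','WA');
    INSERT INTO item VALUES (1,'t-shirt','red'),(2,'jacket','blue');
    INSERT INTO sales VALUES (1,1,1,15),(1,2,2,50),(2,2,1,60),(1,1,1,20);
""")

# Group on state and category (dimension attributes), summing the price.
rows = conn.execute("""
    SELECT s.state, i.category, SUM(f.price)
    FROM sales f, store s, item i
    WHERE f.store_id = s.store_id AND f.item_id = i.item_id
    GROUP BY s.state, i.category
    ORDER BY s.state, i.category
""").fetchall()
print(rows)  # totals per state/category combination
```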
257 00:08:12,260 --> 00:08:13,570 So let's say we want a little more information. 258 00:08:14,180 --> 00:08:17,100 Maybe we want a breakdown by county in addition to state. 259 00:08:17,920 --> 00:08:18,910 So to do that, we add 260 00:08:19,160 --> 00:08:20,160 county to the group by 261 00:08:20,320 --> 00:08:21,410 clause and the select 262 00:08:21,770 --> 00:08:23,060 clause, and when we 263 00:08:23,120 --> 00:08:24,980 run the query we see we do now have more detail. 264 00:08:25,900 --> 00:08:27,310 We can see, for example, that 265 00:08:27,710 --> 00:08:29,760 we had more jacket sales in 266 00:08:29,850 --> 00:08:30,940 Santa Clara County than in 267 00:08:31,030 --> 00:08:32,340 San Mateo County although the 268 00:08:32,420 --> 00:08:34,090 t-shirts were approximately equal. 269 00:08:34,470 --> 00:08:36,240 A little, actually, more t-shirts in San Mateo. 270 00:08:36,890 --> 00:08:38,160 This is a little surprising, because 271 00:08:38,340 --> 00:08:39,690 Santa Clara is generally warmer than 272 00:08:39,940 --> 00:08:41,920 San Mateo, but it's fabricated data. 273 00:08:43,040 --> 00:08:43,800 Now let's say we want to 274 00:08:43,900 --> 00:08:45,270 drill down even further, and 275 00:08:45,380 --> 00:08:46,730 we want to break our 276 00:08:46,850 --> 00:08:49,300 data down by gender as well as the other attributes. 277 00:08:50,860 --> 00:08:51,880 In order to drill down based 278 00:08:52,110 --> 00:08:53,470 on gender, I first needed 279 00:08:53,640 --> 00:08:55,510 to add the customer table to our from clause. 280 00:08:55,820 --> 00:08:56,760 Prior to this, we weren't 281 00:08:57,090 --> 00:08:58,360 doing any analysis that involved 282 00:08:58,760 --> 00:09:00,230 any attributes of the customer table. 283 00:09:01,650 --> 00:09:03,350 And so I need to add the join condition here.
284 00:09:03,830 --> 00:09:04,670 And then to do the drill 285 00:09:04,910 --> 00:09:06,430 down, I add the gender attribute 286 00:09:06,450 --> 00:09:08,090 to the group by and to the select. 287 00:09:08,990 --> 00:09:11,290 We run the query, and what do we find? 288 00:09:12,570 --> 00:09:13,860 Well, we can see, for 289 00:09:13,860 --> 00:09:15,010 example that not too many 290 00:09:15,350 --> 00:09:16,720 females in San Mateo County 291 00:09:17,120 --> 00:09:18,260 are buying t-shirts. Most of those 292 00:09:18,490 --> 00:09:20,150 t-shirt sales are accounted for by males. 293 00:09:20,560 --> 00:09:21,820 The other thing we notice 294 00:09:22,230 --> 00:09:23,390 is that we don't have every 295 00:09:23,820 --> 00:09:26,080 single combination of county, category and gender. 296 00:09:26,860 --> 00:09:28,010 Very specifically we look and 297 00:09:28,130 --> 00:09:29,770 we can see that no males 298 00:09:30,990 --> 00:09:32,360 in King County bought any t-shirts. 299 00:09:33,010 --> 00:09:33,920 So we only have in our 300 00:09:34,070 --> 00:09:35,870 result, those combinations of dimension 301 00:09:36,390 --> 00:09:38,300 values that actually appear in the fact table. 302 00:09:38,590 --> 00:09:41,320 Or in the join of the fact table with the dimension tables. 303 00:09:42,300 --> 00:09:43,690 Now let's suppose after I've 304 00:09:43,760 --> 00:09:44,660 gotten to this level of 305 00:09:44,770 --> 00:09:46,560 analysis, I've decided that what 306 00:09:46,620 --> 00:09:48,060 might be most interesting is the 307 00:09:48,200 --> 00:09:50,700 data broken down just by combination of state and gender. 308 00:09:51,540 --> 00:09:52,660 So that would be a roll-up operation. 309 00:09:53,640 --> 00:09:55,020 And remember for roll-up I 310 00:09:55,380 --> 00:09:57,450 take attributes out of the group by clause.
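The gender drill-down can be sketched as follows on the invented toy schema: the customer dimension table joins the from clause for the first time, and only combinations that actually occur in the (joined) fact table appear in the result, as noted above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE store(store_id INTEGER PRIMARY KEY, city TEXT, county TEXT, state TEXT);
    CREATE TABLE item(item_id INTEGER PRIMARY KEY, category TEXT, color TEXT);
    CREATE TABLE customer(cust_id INTEGER PRIMARY KEY, gender TEXT, age INTEGER);
    CREATE TABLE sales(store_id INT, item_id INT, cust_id INT, price INT);
    INSERT INTO store VALUES (1,'Palo Alto','Santa Clara','CA'),(2,'Seattle','King','WA');
    INSERT INTO item VALUES (1,'t-shirt','red'),(2,'jacket','blue');
    INSERT INTO customer VALUES (1,'F',20),(2,'M',30);
    INSERT INTO sales VALUES (1,1,1,15),(1,2,2,50),(2,2,1,60),(1,1,1,20);
""")

# Drill down by gender: join the customer dimension table and add
# gender to both the group by and the select.
rows = conn.execute("""
    SELECT s.state, i.category, c.gender, SUM(f.price)
    FROM sales f, store s, item i, customer c
    WHERE f.store_id = s.store_id AND f.item_id = i.item_id
      AND f.cust_id = c.cust_id
    GROUP BY s.state, i.category, c.gender
    ORDER BY s.state, i.category, c.gender
""").fetchall()
# Only combinations present in the joined fact data show up.
print(rows)
```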
311 00:09:58,140 --> 00:09:58,880 So I take out the county 312 00:09:59,340 --> 00:10:01,740 and category and I'm just interested in state/gender combinations. 313 00:10:02,970 --> 00:10:03,930 By the way at this point if 314 00:10:03,980 --> 00:10:04,950 I wanted to, I could remove 315 00:10:05,560 --> 00:10:06,940 the item table because I'm 316 00:10:07,040 --> 00:10:08,630 no longer constraining based on 317 00:10:09,220 --> 00:10:10,940 items or grouping based on items. 318 00:10:11,540 --> 00:10:14,080 But I'll just leave it in for now, it's not going to make a difference in our result. 319 00:10:14,570 --> 00:10:16,640 Of course for efficiency I might just prefer to take it out. 320 00:10:16,900 --> 00:10:18,110 Anyway let's run the query 321 00:10:18,700 --> 00:10:20,750 and now we see the rolled-up data. 322 00:10:21,140 --> 00:10:22,230 And when we look at 323 00:10:22,360 --> 00:10:23,670 it I guess our main conclusion here 324 00:10:23,990 --> 00:10:25,150 would be that in our database, 325 00:10:25,840 --> 00:10:27,710 the Californian males are the big spenders. 326 00:10:28,950 --> 00:10:30,340 So far everything we've seen 327 00:10:30,680 --> 00:10:31,890 has actually just used the 328 00:10:31,990 --> 00:10:34,240 regular SQL constructs that we already knew about. 329 00:10:34,750 --> 00:10:36,330 Although we were demonstrating them over 330 00:10:36,820 --> 00:10:38,360 a star schema and we're 331 00:10:38,640 --> 00:10:39,870 showing the types of queries that 332 00:10:39,990 --> 00:10:41,040 one tends to run over star 333 00:10:41,350 --> 00:10:42,430 schemas in OLAP applications. 334 00:10:43,910 --> 00:10:44,780 Now we're going to move to 335 00:10:44,960 --> 00:10:46,280 the specific constructs that have 336 00:10:46,420 --> 00:10:48,030 been added to SQL for OLAP.
337 00:10:48,930 --> 00:10:50,150 As you may remember the two 338 00:10:50,300 --> 00:10:51,820 constructs are called With Cube 339 00:10:52,480 --> 00:10:55,210 and With Rollup and they are written in the group by clause. 340 00:10:56,350 --> 00:10:58,050 So our first query is the 341 00:10:58,550 --> 00:10:59,770 absolute basic cube query. 342 00:11:00,570 --> 00:11:01,810 We start with our sales 343 00:11:02,270 --> 00:11:03,520 fact table, we group on 344 00:11:03,770 --> 00:11:04,680 all three of its dimensions, 345 00:11:05,570 --> 00:11:06,620 and then we add With Cube. 346 00:11:07,350 --> 00:11:08,180 And what we're going to get 347 00:11:08,810 --> 00:11:10,120 is, basically, the data 348 00:11:10,650 --> 00:11:11,860 in the data cube that we 349 00:11:12,030 --> 00:11:14,070 saw in our previous video that introduced the cube. 350 00:11:14,650 --> 00:11:15,890 Let's go ahead and run the query. 351 00:11:16,980 --> 00:11:18,300 Disappointingly we get an 352 00:11:18,400 --> 00:11:19,740 error message that this version 353 00:11:20,230 --> 00:11:21,920 of MySQL doesn't yet support cube. 354 00:11:22,820 --> 00:11:23,950 Actually no version of MySQL 355 00:11:24,520 --> 00:11:25,810 supports cube at this 356 00:11:26,010 --> 00:11:28,130 point in time but we're hopeful that a future one will.
357 00:11:28,810 --> 00:11:29,740 In fact of the three 358 00:11:30,210 --> 00:11:31,420 open source systems that we 359 00:11:31,640 --> 00:11:32,960 have been using for our demos only 360 00:11:33,250 --> 00:11:34,690 MySQL supports With 361 00:11:35,080 --> 00:11:36,370 Rollup, so we will continue 362 00:11:36,720 --> 00:11:38,090 with MySQL and in fact 363 00:11:38,180 --> 00:11:39,740 you can get the same result 364 00:11:40,140 --> 00:11:41,620 that you would get if With 365 00:11:41,950 --> 00:11:43,420 Cube were supported, using With 366 00:11:43,680 --> 00:11:44,350 Rollup, although it's a little bit 367 00:11:44,510 --> 00:11:45,520 contorted, but I'll do that 368 00:11:45,790 --> 00:11:48,420 now, to demonstrate what you would get if you wrote With Cube. 369 00:11:49,590 --> 00:11:50,910 So here's the query, I'm not 370 00:11:51,330 --> 00:11:52,390 going to explain it, if you're 371 00:11:52,650 --> 00:11:54,170 particularly interested in it, you 372 00:11:54,270 --> 00:11:56,330 can download our script and examine it yourself. 373 00:11:56,980 --> 00:11:58,200 Again what I'm most interested in 374 00:11:58,380 --> 00:12:00,390 is looking at the results, and here they are. 375 00:12:01,020 --> 00:12:02,590 So this result is the 376 00:12:02,680 --> 00:12:03,820 result of a With Cube 377 00:12:04,900 --> 00:12:05,780 on the grouping and aggregation 378 00:12:06,380 --> 00:12:09,000 on all dimensions of our sales table, all three dimensions. 379 00:12:10,490 --> 00:12:12,470 So some of our tuples look very normal.
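The With Cube result described here can be reproduced without either With Cube or With Rollup. The video simulates the cube with MySQL's With Rollup; since SQLite (used in this sketch via Python's sqlite3) supports neither, the sketch instead takes the UNION ALL of all 2³ groupings of the three dimensions, which by definition yields the same set of rows a With Cube would. Schema and data are invented toy stand-ins.

```python
import sqlite3
from itertools import combinations

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales(store_id INT, item_id INT, cust_id INT, price INT);
    INSERT INTO sales VALUES (1,1,1,15),(1,2,2,50),(2,2,1,60),(1,1,1,20);
""")

# Build one grouped SELECT per subset of the dimensions; dimensions not
# in the subset come back as NULL, matching the cube's faces/edges/corner.
dims = ("store_id", "item_id", "cust_id")
parts = []
for r in range(len(dims), -1, -1):
    for keep in combinations(dims, r):
        cols = ", ".join(d if d in keep else "NULL AS " + d for d in dims)
        grp = (" GROUP BY " + ", ".join(keep)) if keep else ""
        parts.append(f"SELECT {cols}, SUM(price) AS p FROM sales{grp}")
cube = conn.execute(" UNION ALL ".join(parts)).fetchall()

# The row with all three dimensions NULL is the corner: the grand total.
corner = [p for (st, it, cu, p) in cube if st is None and it is None and cu is None]
print(len(cube), corner)
```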
380 00:12:13,050 --> 00:12:14,500 This first tuple, second, and 381 00:12:14,590 --> 00:12:15,630 third tuple are all giving 382 00:12:15,970 --> 00:12:18,000 us the total price for a combination 383 00:12:18,200 --> 00:12:19,160 of store, item, and customer, 384 00:12:19,700 --> 00:12:20,920 and these are exactly what we 385 00:12:21,060 --> 00:12:22,430 got in our original query that 386 00:12:22,570 --> 00:12:24,030 didn't have the With Cube operator. 387 00:12:25,100 --> 00:12:26,280 But now we see some 388 00:12:26,760 --> 00:12:28,010 tuples that have blanks and as 389 00:12:28,130 --> 00:12:30,840 a reminder, blanks in this interface mean null. 390 00:12:31,530 --> 00:12:34,240 So this tuple is store, item, and null. 391 00:12:35,070 --> 00:12:36,890 And what this corresponds to is 392 00:12:37,290 --> 00:12:39,110 an element on the face of the cube. 393 00:12:39,530 --> 00:12:40,350 So this is the face of the 394 00:12:40,410 --> 00:12:42,540 cube that's not constraining the customer dimension. 395 00:12:43,730 --> 00:12:45,040 And what the value gives us 396 00:12:45,300 --> 00:12:46,130 then is the value we 397 00:12:46,400 --> 00:12:47,090 have on the face of the 398 00:12:47,240 --> 00:12:48,440 cube, which is the 399 00:12:48,630 --> 00:12:50,100 total sales for the 400 00:12:50,180 --> 00:12:51,410 combination of Store 1 401 00:12:51,650 --> 00:12:54,000 and Item 1 and any customer, and that's seventy. 402 00:12:54,910 --> 00:12:55,890 And we can cross check that 403 00:12:56,100 --> 00:12:56,990 by seeing that the first 404 00:12:57,250 --> 00:12:58,480 three tuples add up to 405 00:12:58,790 --> 00:13:00,060 70, because those happen to be 406 00:13:00,290 --> 00:13:01,480 the three customers that have made 407 00:13:01,800 --> 00:13:03,420 purchases at Store 1 and Item 1. 408 00:13:03,770 --> 00:13:05,340 And we can similarly see 409 00:13:05,600 --> 00:13:06,450 for Store 1 and Item 410 00:13:06,850 --> 00:13:08,400 2 the total sales are 150.
411 00:13:08,690 --> 00:13:10,380 If we scroll down 412 00:13:10,760 --> 00:13:11,910 a bit we'll see cases where 413 00:13:12,050 --> 00:13:13,830 we have other single-null values; 414 00:13:13,960 --> 00:13:15,890 I'll keep going until I find one here. 415 00:13:17,030 --> 00:13:18,390 For example, here we have 416 00:13:18,760 --> 00:13:19,810 the null value for the 417 00:13:20,060 --> 00:13:21,560 store and that would 418 00:13:21,690 --> 00:13:22,480 be the face of the cube 419 00:13:22,840 --> 00:13:24,200 that's not constraining the store dimension. 420 00:13:24,640 --> 00:13:25,340 So that would be the sum 421 00:13:25,850 --> 00:13:26,690 for Item 1 and Customer 422 00:13:27,200 --> 00:13:30,070 1 at any store, Item 1 and Customer 2 at any store. 423 00:13:31,280 --> 00:13:34,200 Then we also have tuples that have two null values, here's an example. 424 00:13:35,210 --> 00:13:38,160 So this one is going to be corresponding to the edge of the cube. 425 00:13:38,610 --> 00:13:39,570 So this is the edge of 426 00:13:39,590 --> 00:13:40,500 the cube that is along 427 00:13:41,010 --> 00:13:42,280 the store dimension and is 428 00:13:42,600 --> 00:13:43,900 picking out Store 5 along 429 00:13:44,310 --> 00:13:45,680 that dimension and then giving 430 00:13:46,010 --> 00:13:47,250 the total sales for all 431 00:13:47,590 --> 00:13:48,760 items and all customers in 432 00:13:49,040 --> 00:13:50,420 Store 5, and we see that. 433 00:13:50,810 --> 00:13:51,920 And again we'll see other 434 00:13:52,220 --> 00:13:53,560 instances where we're missing 435 00:13:53,980 --> 00:13:57,150 two values, scroll down to get here. 436 00:13:57,450 --> 00:13:58,550 We've got one where we're 437 00:13:58,700 --> 00:13:59,510 missing the store and the 438 00:13:59,660 --> 00:14:00,440 item, so this would be 439 00:14:00,960 --> 00:14:02,720 on the edge along the customer dimension.
440 00:14:03,610 --> 00:14:04,300 So this gives us the total 441 00:14:04,670 --> 00:14:07,840 sales for any store and item made by Customer 3. 442 00:14:07,980 --> 00:14:10,150 And then finally, if we 443 00:14:10,360 --> 00:14:11,460 find it, we'll find the 444 00:14:11,580 --> 00:14:12,980 one that's all three null values, 445 00:14:13,460 --> 00:14:14,060 and that's right here. 446 00:14:14,730 --> 00:14:16,470 So that represents the corner of the cube. 447 00:14:17,200 --> 00:14:18,130 That gives us the total 448 00:14:18,630 --> 00:14:20,590 for all stores, all items and all customers. 449 00:14:21,750 --> 00:14:23,260 So, what we've done 450 00:14:23,330 --> 00:14:25,960 here by adding With Cube--and again, 451 00:14:26,240 --> 00:14:27,270 that would be the result if 452 00:14:27,530 --> 00:14:28,890 this query here were supported--what 453 00:14:29,700 --> 00:14:31,890 we're getting is the contents 454 00:14:31,980 --> 00:14:33,390 of the data cube, which are the 455 00:14:33,780 --> 00:14:34,600 inner parts of the cube, 456 00:14:34,910 --> 00:14:35,860 which is what we would get without 457 00:14:36,360 --> 00:14:37,320 adding anything to our group 458 00:14:37,570 --> 00:14:38,400 by and what we saw in one 459 00:14:38,510 --> 00:14:39,570 of our very first queries of 460 00:14:39,630 --> 00:14:41,160 the demo, in addition to the 461 00:14:41,250 --> 00:14:43,380 data on the faces, the 462 00:14:43,570 --> 00:14:44,920 edges, and the corner of the cube. 463 00:14:46,080 --> 00:14:47,330 Now let's do a quick cross check. 464 00:14:47,620 --> 00:14:48,660 So this data tells us 465 00:14:48,830 --> 00:14:49,870 that the corner of the cube, 466 00:14:50,210 --> 00:14:51,040 here, the one with all null 467 00:14:51,260 --> 00:14:54,050 values, is 33.50 in total price.
468 00:14:54,730 --> 00:14:56,210 So, that should be exactly what we 469 00:14:56,330 --> 00:14:57,330 get if we sum up 470 00:14:57,530 --> 00:14:59,220 our prices without doing any 471 00:14:59,420 --> 00:15:01,310 grouping or aggregation, so let's give that a try. 472 00:15:02,450 --> 00:15:03,990 So we merely take our sales 473 00:15:04,390 --> 00:15:05,420 table, we sum up the 474 00:15:05,540 --> 00:15:06,740 total prices and we run 475 00:15:06,940 --> 00:15:08,950 it and indeed we get thirty-three fifty. 476 00:15:10,430 --> 00:15:11,390 So now we've seen how SQL 477 00:15:11,740 --> 00:15:13,990 can be used to effectively create the data cube. 478 00:15:15,090 --> 00:15:18,060 Sometimes in OLAP applications it's convenient to query the cube directly. 479 00:15:19,110 --> 00:15:20,910 So people frequently create data 480 00:15:21,110 --> 00:15:22,490 cubes, say, in SQL using 481 00:15:22,830 --> 00:15:24,400 a materialized view, or just 482 00:15:24,550 --> 00:15:25,740 by creating a table that contains 483 00:15:26,000 --> 00:15:28,280 the result of the data cube which is what we are going to do. 484 00:15:28,880 --> 00:15:29,970 There are even some systems that 485 00:15:30,220 --> 00:15:32,630 use the data cube as their basic native data model. 486 00:15:33,400 --> 00:15:34,740 So let's go ahead and take 487 00:15:34,940 --> 00:15:36,020 the result of our cube 488 00:15:36,340 --> 00:15:37,540 query and put it in a table. 489 00:15:38,650 --> 00:15:39,860 So I've just added "create 490 00:15:40,220 --> 00:15:41,700 table cube as" before our 491 00:15:41,930 --> 00:15:42,940 query, and the 492 00:15:43,030 --> 00:15:44,070 other thing I did was add a 493 00:15:44,120 --> 00:15:45,860 name for the result of 494 00:15:46,010 --> 00:15:47,910 the aggregation, so 495 00:15:48,040 --> 00:15:49,840 that we can refer to it in the data cube.
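Materializing the cube and running the corner cross check can be sketched together, on the invented toy data. As in the video, the corner row (all dimensions null) must equal a plain SUM over the fact table with no grouping at all. The cube is generated here by a UNION ALL over all groupings, since SQLite lacks With Cube and With Rollup.

```python
import sqlite3
from itertools import combinations

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales(store_id INT, item_id INT, cust_id INT, price INT);
    INSERT INTO sales VALUES (1,1,1,15),(1,2,2,50),(2,2,1,60),(1,1,1,20);
""")

# Generate the cube query (all 2^3 groupings) and materialize it as a
# table, naming the aggregate p so we can refer to it later.
dims = ("store_id", "item_id", "cust_id")
parts = []
for r in range(len(dims), -1, -1):
    for keep in combinations(dims, r):
        cols = ", ".join(d if d in keep else "NULL AS " + d for d in dims)
        grp = (" GROUP BY " + ", ".join(keep)) if keep else ""
        parts.append(f"SELECT {cols}, SUM(price) AS p FROM sales{grp}")
conn.execute("CREATE TABLE cube AS " + " UNION ALL ".join(parts))

# Cross check: the corner of the cube equals the ungrouped grand total.
corner = conn.execute("""
    SELECT p FROM cube
    WHERE store_id IS NULL AND item_id IS NULL AND cust_id IS NULL
""").fetchone()[0]
total = conn.execute("SELECT SUM(price) FROM sales").fetchone()[0]
print(corner, total)  # equal, as the video's cross check predicts
```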
496 00:15:50,500 --> 00:15:52,170 So P will then 497 00:15:52,450 --> 00:15:53,980 contain the sum of the price for 498 00:15:54,490 --> 00:15:55,930 the items inside the data 499 00:15:56,040 --> 00:15:58,360 cube, as well as for the faces, edges, and corner. 500 00:15:59,090 --> 00:15:59,930 So, let's go ahead and create 501 00:16:00,340 --> 00:16:03,050 that table, and now let's look at querying the cube directly. 502 00:16:04,430 --> 00:16:05,360 So, the first thing you'll notice in 503 00:16:05,560 --> 00:16:06,590 the next few queries is that 504 00:16:06,760 --> 00:16:08,090 we've replaced the sales table 505 00:16:08,450 --> 00:16:09,780 with the cube table, so we're 506 00:16:09,990 --> 00:16:11,400 going to run our queries directly on the 507 00:16:11,480 --> 00:16:12,450 cube, and we can 508 00:16:12,710 --> 00:16:13,910 join the cube with the 509 00:16:14,140 --> 00:16:15,360 store and item tables, and 510 00:16:15,500 --> 00:16:16,520 the customer table if we needed it, 511 00:16:16,940 --> 00:16:18,300 just as we joined the sales 512 00:16:18,710 --> 00:16:21,780 table, because it contains the same dimension attributes. 513 00:16:22,600 --> 00:16:24,040 What the cube gives us is 514 00:16:24,490 --> 00:16:26,190 pre-aggregated data, both for 515 00:16:26,360 --> 00:16:28,400 the store, customer, and item combinations, 516 00:16:29,710 --> 00:16:30,870 as well as the data 517 00:16:31,300 --> 00:16:32,960 that has null values, which is 518 00:16:33,180 --> 00:16:34,670 already aggregated for the 519 00:16:34,830 --> 00:16:37,330 faces, edges, and corner of the cube, as we'll see in a moment. 520 00:16:38,030 --> 00:16:39,100 So what our query is going 521 00:16:39,430 --> 00:16:40,670 to do is find the total 522 00:16:41,030 --> 00:16:43,460 sales of blue items in the state of California. 523 00:16:44,420 --> 00:16:45,560 And it will start by showing the 524 00:16:45,670 --> 00:16:47,870 tuples that are used to get that total.
525 00:16:48,670 --> 00:16:49,990 So we'll join our cube table 526 00:16:50,300 --> 00:16:51,510 with the store and the item 527 00:16:51,710 --> 00:16:52,740 dimension tables in order to 528 00:16:52,830 --> 00:16:54,060 get at the state and the 529 00:16:54,230 --> 00:16:56,830 color, and then we'll constrain the state and the color. 530 00:16:57,530 --> 00:16:58,780 Notice the last thing we 531 00:16:58,900 --> 00:17:00,170 say is that the customer ID 532 00:17:00,410 --> 00:17:02,090 is null, and that's going 533 00:17:02,130 --> 00:17:03,080 to give us the data on the 534 00:17:03,930 --> 00:17:05,070 face of the cube that 535 00:17:05,330 --> 00:17:06,820 doesn't go along the customer dimension. 536 00:17:07,380 --> 00:17:08,350 That means it's going to 537 00:17:08,430 --> 00:17:10,100 be pre-aggregated for all customers, 538 00:17:10,700 --> 00:17:11,650 and that's what we want, since 539 00:17:11,780 --> 00:17:12,790 we don't care about the customers, 540 00:17:13,390 --> 00:17:14,590 just about the color and the state. 541 00:17:15,510 --> 00:17:16,540 So, let's first run this query, 542 00:17:16,950 --> 00:17:18,180 and we see we get six tuples. 543 00:17:18,580 --> 00:17:19,850 These tuples are all 544 00:17:20,290 --> 00:17:21,450 from the portion of the result 545 00:17:21,860 --> 00:17:23,200 of the cube query we ran 546 00:17:23,560 --> 00:17:24,570 that has a null value for 547 00:17:24,830 --> 00:17:26,300 customer ID, and that 548 00:17:26,410 --> 00:17:27,860 is, all combinations of stores and 549 00:17:27,990 --> 00:17:29,190 items. And if we checked 550 00:17:29,450 --> 00:17:30,840 our dimension tables, we'd see 551 00:17:31,010 --> 00:17:32,080 that these stores are in California 552 00:17:32,700 --> 00:17:34,420 and these items are blue, and 553 00:17:34,560 --> 00:17:35,780 these are our total sales broken 554 00:17:36,090 --> 00:17:38,100 down by store and item.
555 00:17:38,480 --> 00:17:39,770 Finally, to get the total sales, 556 00:17:40,320 --> 00:17:41,460 we'll just sum up our column 557 00:17:41,950 --> 00:17:43,240 "p", which remember was 558 00:17:43,610 --> 00:17:44,850 the sum of price in the data cube. 559 00:17:45,870 --> 00:17:47,780 So we replace C.* with 560 00:17:48,300 --> 00:17:49,470 SUM(p), and we run 561 00:17:49,850 --> 00:17:52,070 the query, and we see that our total is 13.65. 562 00:17:54,110 --> 00:17:55,080 Now let me show you something that 563 00:17:55,190 --> 00:17:56,530 may seem non-intuitive at first, 564 00:17:56,940 --> 00:17:58,230 but if you think about it, 565 00:17:58,420 --> 00:17:59,350 or maybe run the queries yourself, 566 00:17:59,770 --> 00:18:02,750 you'll start to understand exactly what's happening. 567 00:18:03,810 --> 00:18:05,020 I'm going to go back to C.* in the select clause. 568 00:18:05,270 --> 00:18:06,310 And incidentally, I didn't mention 569 00:18:06,690 --> 00:18:08,210 it before, but C.* gives us 570 00:18:08,370 --> 00:18:09,500 in the select clause all the 571 00:18:09,750 --> 00:18:10,910 attributes from the cube 572 00:18:11,310 --> 00:18:12,220 table, and that 573 00:18:12,300 --> 00:18:14,260 shows which tuples are being used for the aggregation. 574 00:18:15,680 --> 00:18:18,130 So, we had six tuples being used before. 575 00:18:18,790 --> 00:18:20,980 Now, I'm going to make one tiny change to the query. 576 00:18:21,520 --> 00:18:22,600 Instead of finding the customer 577 00:18:23,190 --> 00:18:24,920 IDs that are null, I'm going 578 00:18:25,210 --> 00:18:26,660 to find the customer IDs that are not null. 579 00:18:27,440 --> 00:18:28,760 I'm going to run that query, and 580 00:18:28,910 --> 00:18:31,540 we see that we have fourteen tuples now.
581 00:18:32,310 --> 00:18:33,470 What we've actually done is move 582 00:18:33,970 --> 00:18:35,140 away from the face of 583 00:18:35,240 --> 00:18:36,290 the cube and into the cube, 584 00:18:36,650 --> 00:18:38,110 and I've said: don't give me 585 00:18:38,430 --> 00:18:39,770 any of the summarized data, just 586 00:18:40,250 --> 00:18:42,290 give me the original values for 587 00:18:42,420 --> 00:18:44,060 each store, item, and customer together. 588 00:18:45,120 --> 00:18:45,990 So, I've taken away the summaries, 589 00:18:46,740 --> 00:18:49,390 but this should actually still give me the same answer. 590 00:18:49,710 --> 00:18:50,830 Let's make sure that's the case. 591 00:18:51,560 --> 00:18:52,910 I do SUM(p), and 592 00:18:53,010 --> 00:18:55,110 I run it, and I get 13.65. 593 00:18:55,630 --> 00:18:57,200 So, what happened here is I 594 00:18:57,420 --> 00:18:58,950 used what's effectively the same 595 00:18:59,310 --> 00:19:01,310 data, but broken down by customer, 596 00:19:02,320 --> 00:19:03,260 and I added it up. That 597 00:19:03,400 --> 00:19:05,280 was a less efficient query, but 598 00:19:05,530 --> 00:19:06,560 both of those queries do 599 00:19:06,720 --> 00:19:07,890 give the correct result; they do 600 00:19:08,200 --> 00:19:10,440 give you the total sales of blue items 601 00:19:11,150 --> 00:19:12,160 in California stores. 602 00:19:12,850 --> 00:19:13,860 Again, I encourage you to 603 00:19:14,000 --> 00:19:15,140 download these queries and run them 604 00:19:15,400 --> 00:19:16,700 yourself to figure out exactly what's 605 00:19:16,910 --> 00:19:18,300 going on. 606 00:19:18,410 --> 00:19:19,650 And as a last demonstration of 607 00:19:19,740 --> 00:19:21,230 this very same query, we 608 00:19:21,440 --> 00:19:23,300 can actually go back to the sales table. 609 00:19:23,670 --> 00:19:25,170 Let's say we don't even have our data cube.
610 00:19:25,780 --> 00:19:26,900 So, we're going to use our sales table 611 00:19:27,270 --> 00:19:28,790 instead, and it should 612 00:19:29,020 --> 00:19:30,280 give us the same result, but 613 00:19:30,590 --> 00:19:32,770 even less efficiently than the last query that we did. 614 00:19:33,830 --> 00:19:34,760 So, you can see here I've 615 00:19:34,990 --> 00:19:36,190 changed cube to 616 00:19:36,360 --> 00:19:38,250 sales, and all the C's to F's. 617 00:19:38,970 --> 00:19:40,470 I took away the constraints 618 00:19:41,100 --> 00:19:41,730 on being null or not null, 619 00:19:41,780 --> 00:19:44,190 since those aren't relevant in the sales table. 620 00:19:44,710 --> 00:19:46,520 Otherwise, the query looks pretty much the same. 621 00:19:46,870 --> 00:19:47,900 So, let's run this much 622 00:19:48,250 --> 00:19:50,090 and see how many tuples we're operating on now. 623 00:19:50,830 --> 00:19:52,490 So now, we're operating on 25 tuples. 624 00:19:53,130 --> 00:19:55,480 This is the original, completely broken-down data. 625 00:19:56,180 --> 00:19:57,380 Just as a reminder, in the 626 00:19:57,550 --> 00:19:58,570 data cube, even in the 627 00:19:58,840 --> 00:20:00,330 interior cells, we'll be summarizing, 628 00:20:01,750 --> 00:20:02,780 or summing, for each combination 629 00:20:03,950 --> 00:20:04,980 of store, item, and customer. 630 00:20:05,730 --> 00:20:07,710 So, if we have three tuples with 631 00:20:07,850 --> 00:20:08,930 the same store, item, and customer, 632 00:20:09,320 --> 00:20:10,390 those will be broken out in 633 00:20:10,440 --> 00:20:12,090 the sales table and summarized in 634 00:20:12,280 --> 00:20:13,310 the cube, even in the 635 00:20:13,520 --> 00:20:16,150 least summarized portion of the cube, if you get what I'm saying.
636 00:20:16,970 --> 00:20:17,830 So now, we've got our 25 637 00:20:18,590 --> 00:20:19,480 tuples, and if we 638 00:20:19,540 --> 00:20:20,760 sum up the price of those 639 00:20:21,020 --> 00:20:22,340 25 tuples, we should 640 00:20:22,520 --> 00:20:25,200 get that same result, $13.65, and we do. 641 00:20:26,210 --> 00:20:27,310 So, we saw that we had 642 00:20:27,820 --> 00:20:29,260 six tuples in the 643 00:20:29,330 --> 00:20:30,760 most efficient query over the data cube. 644 00:20:31,520 --> 00:20:32,900 When we used just the center 645 00:20:33,120 --> 00:20:34,060 portion of the cube, we had 646 00:20:34,770 --> 00:20:35,630 fourteen or so, I actually 647 00:20:35,790 --> 00:20:38,020 don't remember the exact number, and then 25. 648 00:20:38,480 --> 00:20:39,920 All of them gave us the same answer. 649 00:20:41,260 --> 00:20:42,380 Obviously these numbers are very 650 00:20:42,850 --> 00:20:43,790 small, but if you think about 651 00:20:44,270 --> 00:20:45,540 huge data sets that have millions 652 00:20:46,270 --> 00:20:47,370 or tens of millions of rows, 653 00:20:48,040 --> 00:20:49,690 then we're talking about dramatic differences 654 00:20:50,600 --> 00:20:51,880 in how much data needs to 655 00:20:52,070 --> 00:20:54,770 be summarized or aggregated in order to get the answer. 656 00:20:55,430 --> 00:20:56,950 So, using the summary data from 657 00:20:57,130 --> 00:20:58,550 the cube can be orders of 658 00:20:58,630 --> 00:21:01,390 magnitude faster than going to the original fact data. 659 00:21:02,710 --> 00:21:03,970 Now, a variation on 660 00:21:04,040 --> 00:21:05,110 WITH CUBE is to give 661 00:21:05,320 --> 00:21:08,470 specific attributes to the cube operator itself. 662 00:21:09,210 --> 00:21:10,370 And what that says is to 663 00:21:10,650 --> 00:21:12,190 only give summaries for 664 00:21:12,370 --> 00:21:14,590 those dimension attributes and not the others.
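(The three equivalent aggregations just demonstrated, the face of the cube, the interior of the cube, and the raw fact table, can be sketched the same way. To stay self-contained, this sketch skips the demo's blue/California restriction and the dimension joins, and simply compares how many tuples feed the same total; all names and data are invented.)

```python
import sqlite3

# Invented miniature of the demo's setup; names and data are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (storeID INT, itemID INT, custID INT, price REAL)")
conn.executemany("INSERT INTO sales VALUES (?,?,?,?)", [
    (1, 1, 1, 1.00), (1, 1, 2, 0.50), (1, 2, 1, 2.50),
    (2, 1, 2, 4.00), (2, 1, 2, 1.00), (2, 2, 3, 3.00),
])

# Build the cube as the union of the eight GROUP BY combinations.
pieces = []
for keep in range(8):
    s = "storeID" if keep & 4 else "NULL"
    i = "itemID" if keep & 2 else "NULL"
    c = "custID" if keep & 1 else "NULL"
    pieces.append(f"SELECT {s} AS storeID, {i} AS itemID, {c} AS custID, "
                  f"SUM(price) AS p FROM sales GROUP BY {s}, {i}, {c}")
conn.execute("CREATE TABLE cube AS " + " UNION ALL ".join(pieces))

def count_and_sum(sql):
    return conn.execute(sql).fetchone()

# 1) Face of the cube: pre-aggregated over customers (custID IS NULL).
face = count_and_sum("SELECT COUNT(*), SUM(p) FROM cube WHERE custID IS NULL "
                     "AND storeID IS NOT NULL AND itemID IS NOT NULL")
# 2) Interior of the cube: one cell per store/item/customer (no NULLs).
interior = count_and_sum("SELECT COUNT(*), SUM(p) FROM cube WHERE custID IS NOT NULL "
                         "AND storeID IS NOT NULL AND itemID IS NOT NULL")
# 3) Raw fact table: the completely broken-down data.
fact = count_and_sum("SELECT COUNT(*), SUM(price) FROM sales")

# Same total each time, computed from progressively more tuples.
print(face, interior, fact)
```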
665 00:21:15,170 --> 00:21:16,330 In other words, we'll only see null 666 00:21:16,700 --> 00:21:19,260 values in the result for the store ID and customer ID, 667 00:21:19,660 --> 00:21:21,690 and we won't see any null values for the item ID. 668 00:21:22,330 --> 00:21:23,720 So, in some sense, we're 669 00:21:23,900 --> 00:21:25,240 materializing just some of 670 00:21:25,340 --> 00:21:27,180 the faces, edges, and corner 671 00:21:27,560 --> 00:21:28,840 of the cube, not all of them. 672 00:21:29,650 --> 00:21:30,710 If we run this query, 673 00:21:30,850 --> 00:21:32,150 we'll get the same response as before, that 674 00:21:32,630 --> 00:21:33,940 MySQL doesn't yet support it, 675 00:21:34,110 --> 00:21:35,490 but this is the SQL standard 676 00:21:36,390 --> 00:21:38,430 for the cube operator with subsets of attributes. 677 00:21:39,660 --> 00:21:40,920 As before, I've cooked up an 678 00:21:41,240 --> 00:21:43,030 equivalent query using MySQL's 679 00:21:43,480 --> 00:21:45,220 WITH ROLLUP, and I'm 680 00:21:45,370 --> 00:21:46,810 certainly not going to explain it, 681 00:21:46,910 --> 00:21:48,870 but you're welcome to download it and examine it. 682 00:21:49,630 --> 00:21:50,420 What I'd really like to look 683 00:21:50,660 --> 00:21:53,300 at is the result, and here it is. 684 00:21:53,380 --> 00:21:54,420 So, this looks a lot like our 685 00:21:54,570 --> 00:21:56,740 original data cube, but what we notice is 686 00:21:56,890 --> 00:21:58,330 that there are no null values 687 00:21:59,220 --> 00:22:00,690 in the item column. 688 00:22:01,150 --> 00:22:03,340 So, it's a subset of the result of the cube query. 689 00:22:03,610 --> 00:22:04,640 And we would use this when 690 00:22:04,870 --> 00:22:06,630 we know we're never going to be rolling up on items, 691 00:22:07,640 --> 00:22:09,240 so the summaries 692 00:22:09,740 --> 00:22:12,040 based on that dimension aren't going to be useful to us.
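(Cubing on a subset of the dimensions can be sketched the same way: itemID stays in every grouping, so it is never null in the result, while store and customer still roll up. Again an invented miniature, since MySQL lacks WITH CUBE.)

```python
import sqlite3

# Invented miniature fact table; names and data are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (storeID INT, itemID INT, custID INT, price REAL)")
conn.executemany("INSERT INTO sales VALUES (?,?,?,?)", [
    (1, 1, 1, 1.00), (1, 2, 1, 2.50), (2, 1, 2, 4.00), (2, 2, 3, 3.00),
])

# Cube over storeID and custID only: itemID appears in every grouping,
# so the result never contains a NULL in the item column.
pieces = []
for keep_store in (True, False):
    for keep_cust in (True, False):
        s = "storeID" if keep_store else "NULL"
        c = "custID" if keep_cust else "NULL"
        pieces.append(f"SELECT {s} AS storeID, itemID, {c} AS custID, "
                      f"SUM(price) AS p FROM sales GROUP BY {s}, itemID, {c}")
rows = conn.execute(" UNION ALL ".join(pieces)).fetchall()

# Only store and customer roll up; the item column is never NULL.
assert all(row[1] is not None for row in rows)
print(rows)
```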
693 00:22:13,180 --> 00:22:14,270 Now, let's look at WITH ROLLUP, 694 00:22:14,710 --> 00:22:16,880 which is supported natively by MySQL. 695 00:22:17,870 --> 00:22:20,700 WITH ROLLUP, again, gives us a subset of the data 696 00:22:20,810 --> 00:22:20,940 cube. 697 00:22:21,820 --> 00:22:22,390 But it's a very specific one, 698 00:22:22,460 --> 00:22:23,740 and it's based on the ordering of 699 00:22:23,890 --> 00:22:25,410 the attributes in the group by clause. 700 00:22:26,190 --> 00:22:28,680 So, let me just go ahead and run it and explain what we get. 701 00:22:29,440 --> 00:22:30,690 We again get null values, 702 00:22:31,300 --> 00:22:33,240 and those null values indicate summaries. 703 00:22:33,790 --> 00:22:36,030 For example, this tuple for Store 1, Item 1, and null: 704 00:22:36,310 --> 00:22:37,720 it's the sum of prices for 705 00:22:37,960 --> 00:22:40,100 all customers for Store 1 and Item 1. 706 00:22:40,850 --> 00:22:42,310 And we see Store 1, Item 2, as well. 707 00:22:42,630 --> 00:22:43,810 We again see cases with 708 00:22:44,220 --> 00:22:45,630 two null values, so this 709 00:22:45,720 --> 00:22:47,230 is the summary for Store 1 710 00:22:47,530 --> 00:22:48,480 over all items and customers, 711 00:22:49,000 --> 00:22:50,320 the total sales, and we 712 00:22:50,450 --> 00:22:52,300 also have the triple null somewhere. 713 00:22:53,450 --> 00:22:56,350 It's at the bottom this time, with total sales of $33.50. 714 00:22:56,540 --> 00:22:58,200 But what we'll see 715 00:22:58,410 --> 00:23:00,890 is that we don't have all combinations of null values. 716 00:23:01,750 --> 00:23:02,690 We have a null value for customer 717 00:23:03,200 --> 00:23:04,190 ID, or we have nulls 718 00:23:04,500 --> 00:23:06,150 for the combination of customer ID 719 00:23:06,540 --> 00:23:08,550 and item ID, or all three nulls.
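(WITH ROLLUP's null patterns can be simulated by aggregating only over prefixes of the GROUP BY list, one grouping per prefix, rather than all combinations. A sketch with invented names and data:)

```python
import sqlite3

# Invented miniature fact table; names and data are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (storeID INT, itemID INT, custID INT, price REAL)")
conn.executemany("INSERT INTO sales VALUES (?,?,?,?)", [
    (1, 1, 1, 1.00), (1, 2, 1, 2.50), (2, 1, 2, 4.00), (2, 2, 3, 3.00),
])

# ROLLUP aggregates over *prefixes* of the GROUP BY list:
# (storeID, itemID, custID), (storeID, itemID), (storeID,), ().
dims = ["storeID", "itemID", "custID"]
pieces = []
for k in range(len(dims), -1, -1):
    cols = ", ".join(dims[:k] + ["NULL AS " + d for d in dims[k:]])
    group = ", ".join(dims[:k]) or "NULL"
    pieces.append(f"SELECT {cols}, SUM(price) AS p FROM sales GROUP BY {group}")
rows = conn.execute(" UNION ALL ".join(pieces)).fetchall()

# In every row the NULLs form a suffix: once a dimension is NULL,
# every dimension to its right is NULL too (unlike the full cube).
for s, i, c, p in rows:
    assert not (s is None and i is not None)
    assert not (i is None and c is not None)
print(rows)
```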
720 00:23:08,800 --> 00:23:09,770 So nulls appear only in the rightmost 721 00:23:10,150 --> 00:23:11,520 attribute, or the two 722 00:23:11,790 --> 00:23:13,540 rightmost attributes, or all 723 00:23:13,790 --> 00:23:14,630 of them, and if we had 724 00:23:14,890 --> 00:23:16,090 a fourth dimension, we'd see 725 00:23:16,150 --> 00:23:18,680 the rightmost, the two rightmost, or the three rightmost. 726 00:23:19,740 --> 00:23:20,920 This doesn't really make any 727 00:23:21,240 --> 00:23:22,600 particular sense for this query, 728 00:23:23,260 --> 00:23:25,200 but it's useful when we have dimensions that are hierarchical. 729 00:23:26,510 --> 00:23:28,350 And I actually introduced a hierarchical 730 00:23:28,980 --> 00:23:31,330 dimension in our original schema just for this purpose. 731 00:23:31,740 --> 00:23:32,910 So, let's turn to that query. 732 00:23:34,250 --> 00:23:35,480 So, here's a query that's grouping 733 00:23:35,950 --> 00:23:37,440 by state, county, and city. 734 00:23:37,760 --> 00:23:39,790 These are three dimension attributes; 735 00:23:40,130 --> 00:23:42,050 they all come from the store dimension table. 736 00:23:42,490 --> 00:23:44,190 And they are hierarchical, meaning 737 00:23:44,530 --> 00:23:45,600 that we don't have every 738 00:23:45,990 --> 00:23:47,540 combination of state, county, and city. 739 00:23:48,580 --> 00:23:49,810 We have cities that are in 740 00:23:50,070 --> 00:23:51,510 specific counties, and counties 741 00:23:52,010 --> 00:23:53,170 that are in specific states. 742 00:23:54,150 --> 00:23:54,980 So, when we look at the results 743 00:23:55,230 --> 00:23:56,320 of the query, we see, of 744 00:23:56,540 --> 00:23:57,490 course, that 745 00:23:57,660 --> 00:23:59,780 San Mateo County is always in the state of California. 746 00:24:01,070 --> 00:24:02,770 King County is always in the state of Washington.
747 00:24:03,830 --> 00:24:05,530 We don't have Palo Alto combined 748 00:24:06,110 --> 00:24:07,710 with, say, King County and the state of California. 749 00:24:08,290 --> 00:24:10,390 So, we don't have all the combinations of the three. 750 00:24:11,050 --> 00:24:12,890 We have a hierarchical structure on them. 751 00:24:13,460 --> 00:24:16,330 And it's this structure that makes sense when we use a roll-up. 752 00:24:17,100 --> 00:24:20,080 So, let's add WITH ROLLUP to this query and see what we get. 753 00:24:20,750 --> 00:24:21,860 Here in our result we see 754 00:24:22,080 --> 00:24:25,310 our original tuples for each state, county, city combination. 755 00:24:26,570 --> 00:24:27,700 And then we see, for example, 756 00:24:28,090 --> 00:24:29,550 this tuple here, which is 757 00:24:29,880 --> 00:24:30,980 taking the state of California 758 00:24:31,520 --> 00:24:32,660 and the county of San Mateo 759 00:24:33,200 --> 00:24:34,760 and adding up all of the sales there. 760 00:24:35,890 --> 00:24:38,790 And we see similarly that we have all of the sales for Santa Clara County. 761 00:24:39,560 --> 00:24:40,410 Then we can see that we 762 00:24:40,530 --> 00:24:41,550 have all of the sales for 763 00:24:41,950 --> 00:24:42,920 each state, so this is 764 00:24:42,970 --> 00:24:44,570 the California sales and the Washington sales. 765 00:24:45,000 --> 00:24:46,730 And finally, the triple null is 766 00:24:47,000 --> 00:24:48,360 the corner of the cube; it's 767 00:24:48,580 --> 00:24:49,900 all of our sales, and, as 768 00:24:50,040 --> 00:24:51,460 usual, we get the total of 33.50. 769 00:24:52,330 --> 00:24:53,780 Now, what don't we 770 00:24:54,000 --> 00:24:55,610 see here compared with the data cube? 771 00:24:56,150 --> 00:24:57,430 Well, we don't see, for example, 772 00:24:57,490 --> 00:24:58,770 a tuple that has 773 00:24:59,020 --> 00:25:00,730 California and Palo Alto 774 00:25:01,050 --> 00:25:02,250 with the county as null.
775 00:25:03,020 --> 00:25:03,400 Why is that? 776 00:25:03,650 --> 00:25:06,080 Well, Palo Alto is always in Santa Clara County. 777 00:25:07,080 --> 00:25:08,500 So rolling up the county, 778 00:25:09,320 --> 00:25:10,970 or saying I want California and 779 00:25:11,080 --> 00:25:12,230 Palo Alto sales for every 780 00:25:12,380 --> 00:25:13,830 county, is exactly the 781 00:25:13,920 --> 00:25:14,820 same as saying I want 782 00:25:15,070 --> 00:25:17,210 California and Palo Alto sales in Santa Clara County. 783 00:25:18,540 --> 00:25:19,670 We also don't see, for 784 00:25:19,800 --> 00:25:21,940 example, the state and county both being null, 785 00:25:22,340 --> 00:25:23,470 say, if we had 786 00:25:23,740 --> 00:25:24,830 Seattle as the city with 787 00:25:25,020 --> 00:25:26,460 the state and county null. 788 00:25:27,120 --> 00:25:28,350 Well, Seattle is always in 789 00:25:28,530 --> 00:25:29,620 King County and always in 790 00:25:29,750 --> 00:25:30,860 Washington, so we're not aggregating 791 00:25:31,480 --> 00:25:32,360 anything there; we'd get the 792 00:25:32,610 --> 00:25:34,620 exact same result as having Seattle, King County, Washington. 793 00:25:35,940 --> 00:25:37,110 So if we ran WITH CUBE 794 00:25:37,770 --> 00:25:40,230 on this hierarchically structured data, we'd 795 00:25:40,380 --> 00:25:42,010 actually not get any more information. 796 00:25:42,570 --> 00:25:45,460 We'd have a bigger result, but it wouldn't be giving us more information; 797 00:25:46,230 --> 00:25:48,550 it would just be less efficient for getting the same data. 798 00:25:49,900 --> 00:25:51,810 So that concludes our OLAP demonstration. 799 00:25:52,910 --> 00:25:54,280 We saw a star schema, and 800 00:25:54,460 --> 00:25:56,410 we saw plain SQL queries over that schema. 801 00:25:57,440 --> 00:25:59,010 We saw the concept of drilling 802 00:25:59,370 --> 00:26:01,700 down and rolling up, and also slicing and dicing.
803 00:26:02,930 --> 00:26:04,310 We introduced a WITH CUBE 804 00:26:04,630 --> 00:26:06,110 extension to SQL, which is 805 00:26:06,220 --> 00:26:07,490 not yet implemented in MySQL, 806 00:26:08,120 --> 00:26:08,850 but we were able to write 807 00:26:09,030 --> 00:26:10,510 a query that's equivalent to WITH CUBE. 808 00:26:11,370 --> 00:26:12,780 We also saw putting a WITH 809 00:26:13,090 --> 00:26:14,300 CUBE query into a table 810 00:26:14,760 --> 00:26:15,960 and then querying that table directly 811 00:26:16,890 --> 00:26:18,140 and that can be much more efficient 812 00:26:18,980 --> 00:26:20,160 than running the equivalent query 813 00:26:21,010 --> 00:26:22,440 in SQL directly over the fact table. 814 00:26:23,390 --> 00:26:25,430 We also saw WITH ROLLUP, which is implemented. 815 00:26:26,480 --> 00:26:27,680 We didn't demonstrate putting the 816 00:26:27,750 --> 00:26:30,250 result of WITH ROLLUP in a table, but we could certainly do that too. 817 00:26:31,400 --> 00:26:32,700 All of these features are useful 818 00:26:33,120 --> 00:26:34,530 primarily in applications that 819 00:26:34,660 --> 00:26:36,420 are performing analyses over very 820 00:26:36,790 --> 00:26:38,130 large data sets that exhibit 821 00:26:38,800 --> 00:26:40,220 this dimensional type structure, but 822 00:26:40,440 --> 00:26:41,910 this is actually quite a 823 00:26:42,100 --> 00:26:43,730 common structure in analysis applications.
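(As a final recap for readers experimenting on their own: here is a sketch of the hierarchical roll-up over state, county, and city, simulated via prefix groupings since SQLite has no native ROLLUP. The miniature store and sales tables are invented; notice that because each city lies in exactly one county and state, the roll-up already produces every distinct total once, which is why a full cube would add rows but no information.)

```python
import sqlite3

# Invented miniature dimension and fact tables; names and data are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE store (storeID INT, city TEXT, county TEXT, state TEXT);
CREATE TABLE sales (storeID INT, price REAL);
INSERT INTO store VALUES
  (1,'Palo Alto','Santa Clara','CA'), (2,'San Mateo','San Mateo','CA'),
  (3,'Seattle','King','WA');
INSERT INTO sales VALUES (1, 1.00), (1, 2.50), (2, 4.00), (3, 3.00);
""")

# Simulate ROLLUP on the hierarchy state > county > city by aggregating
# over each prefix of the GROUP BY list.
dims = ["state", "county", "city"]
pieces = []
for k in range(len(dims), -1, -1):
    cols = ", ".join(dims[:k] + ["NULL AS " + d for d in dims[k:]])
    group = ", ".join(dims[:k]) or "NULL"
    pieces.append(f"SELECT {cols}, SUM(price) AS p "
                  f"FROM sales NATURAL JOIN store GROUP BY {group}")
rows = conn.execute(" UNION ALL ".join(pieces)).fetchall()

# The only row with a NULL state is the all-NULL corner: the grand total.
grand = next(p for st, co, ci, p in rows if st is None)
print(rows, grand)
```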