1 00:00:00,510 --> 00:00:04,000 This video introduces online analytical processing or OLAP. 2 00:00:04,440 --> 00:00:06,110 A subsequent video will have 3 00:00:06,410 --> 00:00:07,850 a demo of OLAP queries in action. 4 00:00:09,070 --> 00:00:11,000 Overall, database activity can 5 00:00:11,090 --> 00:00:12,640 be divided into two broad classes. 6 00:00:13,640 --> 00:00:15,150 One of them, the traditional one, 7 00:00:15,740 --> 00:00:19,070 is known as OLTP, or online transaction processing. 8 00:00:19,910 --> 00:00:20,720 The other one, the subject 9 00:00:21,140 --> 00:00:22,630 of this video, came about more 10 00:00:22,800 --> 00:00:23,960 recently, and it's known 11 00:00:24,250 --> 00:00:26,250 as OLAP, or online analytical processing. 12 00:00:28,840 --> 00:00:29,970 Online transaction processing is typically 13 00:00:30,350 --> 00:00:31,720 characterized by short transactions, 14 00:00:32,750 --> 00:00:33,500 both queries and updates. 15 00:00:34,330 --> 00:00:35,660 Things like updating an account 16 00:00:36,130 --> 00:00:37,180 balance in a bank database 17 00:00:37,670 --> 00:00:39,380 or logging a page view in a web application. 18 00:00:40,700 --> 00:00:43,770 Queries in OLTP data bases are generally fairly simple. 19 00:00:44,570 --> 00:00:47,190 Find an account balance or find the GPA of a student. 20 00:00:47,750 --> 00:00:49,580 They typically touch small portions of the data. 21 00:00:50,770 --> 00:00:52,660 And updates in this environment can be frequent. 22 00:00:53,810 --> 00:00:55,080 We might be making airline seat 23 00:00:55,390 --> 00:00:58,480 reservations or updating a online shopping cart. 24 00:00:59,950 --> 00:01:02,170 OLAP is pretty much the opposite in all respects. 25 00:01:03,420 --> 00:01:04,390 In OLAP, we have long 26 00:01:04,830 --> 00:01:06,860 transactions, often complex analysis 27 00:01:07,470 --> 00:01:09,420 of the data or data mining type operations. 28 00:01:10,780 --> 00:01:11,760 The queries as I said, can 29 00:01:11,880 --> 00:01:13,570 be complex and especially they 30 00:01:13,870 --> 00:01:15,520 often touch large portions of 31 00:01:15,580 --> 00:01:17,950 the data rather than small portions as in OLTP. 32 00:01:19,420 --> 00:01:20,600 And updates in the OLAP 33 00:01:20,860 --> 00:01:22,050 environment tend to be infrequent, 34 00:01:22,630 --> 00:01:23,900 in fact, sometimes in the 35 00:01:24,100 --> 00:01:26,530 OLAP environment there are no updates to the data at all. 36 00:01:27,540 --> 00:01:29,620 Now, these two are extremes 37 00:01:30,450 --> 00:01:31,220 and really there is a spectrum 38 00:01:31,900 --> 00:01:32,870 between those two extremes. 39 00:01:34,290 --> 00:01:35,400 We might have a sort of, moderate 40 00:01:35,850 --> 00:01:37,530 amount of update and queries 41 00:01:37,910 --> 00:01:39,460 that touch a moderate portion of the data. 42 00:01:40,190 --> 00:01:41,250 But the fact is that 43 00:01:41,540 --> 00:01:44,850 database systems traditionally were designed for the first extreme. 44 00:01:46,000 --> 00:01:49,020 And then special techniques were developed for the other extreme. 45 00:01:49,910 --> 00:01:51,810 So the systems are tuned for the two extremes. 46 00:01:52,380 --> 00:01:53,490 And depending on ones work load 47 00:01:53,680 --> 00:01:55,370 one might choose to use different 48 00:01:55,960 --> 00:01:57,000 options in a database system 49 00:01:58,340 --> 00:01:59,230 just a little bit more terminology 50 00:02:00,030 --> 00:02:00,960 in the OLAP world. 51 00:02:01,630 --> 00:02:03,110 There's a concept called data warehousing. 52 00:02:03,800 --> 00:02:05,120 It's really a software architecture. 53 00:02:06,160 --> 00:02:07,620 The idea is that often in 54 00:02:08,070 --> 00:02:09,120 enter prizes or other operation, 55 00:02:10,070 --> 00:02:12,340 there are lots of operational sources. 56 00:02:12,920 --> 00:02:13,730 So you can think of a 57 00:02:13,920 --> 00:02:15,610 point of sale, for example, might 58 00:02:15,830 --> 00:02:17,770 have many, many OLTP database 59 00:02:18,230 --> 00:02:19,540 pieces related to an enterprise, 60 00:02:20,680 --> 00:02:22,720 and data warehousing is the 61 00:02:22,840 --> 00:02:23,870 process of bringing the data from 62 00:02:24,200 --> 00:02:26,120 all of those distributed OLTP sources 63 00:02:26,880 --> 00:02:29,290 into a single, gigantic warehouse where 64 00:02:29,490 --> 00:02:30,440 the point then is to do 65 00:02:30,920 --> 00:02:32,390 analyses of the data, and 66 00:02:32,510 --> 00:02:33,980 that would fall under the OLAP camp. 67 00:02:35,100 --> 00:02:36,470 Another term you might encounter is 68 00:02:36,610 --> 00:02:38,710 decision support systems also known as DSS. 69 00:02:39,510 --> 00:02:41,090 This isn't really an exact term. 70 00:02:41,780 --> 00:02:42,850 It's generally used to talk 71 00:02:43,000 --> 00:02:45,780 about infrastructure for again large scale data analyses. 72 00:02:46,790 --> 00:02:47,780 So, if you think of 73 00:02:47,870 --> 00:02:49,310 a data warehouse, where we're 74 00:02:49,520 --> 00:02:50,280 bringing in a lot of 75 00:02:50,370 --> 00:02:51,910 data from operational sources, and 76 00:02:52,030 --> 00:02:53,120 that warehouse is tuned for 77 00:02:53,340 --> 00:02:54,760 OLAP queries that would 78 00:02:54,930 --> 00:02:57,040 be thought of as a decision support system. 79 00:02:57,670 --> 00:02:58,590 And, of course, this system 80 00:02:58,990 --> 00:03:00,130 is designed to support decisions 81 00:03:00,660 --> 00:03:02,620 that are made, again, based on data analysis. 82 00:03:04,160 --> 00:03:06,380 Now, let's get into some technical details of OLAP. 83 00:03:07,460 --> 00:03:09,180 Frequently applications that are 84 00:03:09,280 --> 00:03:10,800 doing online analytical processing 85 00:03:12,060 --> 00:03:13,490 are designed based around a 86 00:03:13,810 --> 00:03:15,250 star schema, so it's a 87 00:03:15,360 --> 00:03:16,980 certain type of relational schema. 88 00:03:18,140 --> 00:03:21,430 In a star schema, there's usually one fact table. 89 00:03:22,230 --> 00:03:23,980 That will be a typically very 90 00:03:24,230 --> 00:03:26,140 large table, it will be updated frequently. 91 00:03:26,890 --> 00:03:28,150 Often it's actually append only, 92 00:03:28,590 --> 00:03:30,950 so there are only inserts into the fact table. 93 00:03:31,940 --> 00:03:34,360 And then there are maybe many dimension tables. 94 00:03:35,140 --> 00:03:38,010 Those are updated infrequently and don't tend to be as large. 95 00:03:40,060 --> 00:03:40,910 So examples of a fact table 96 00:03:41,300 --> 00:03:43,040 might be sales transactions in 97 00:03:43,160 --> 00:03:44,480 a sales database or in 98 00:03:44,570 --> 00:03:45,940 a university database, maybe students 99 00:03:46,500 --> 00:03:48,150 enrolling in courses or in 100 00:03:48,290 --> 00:03:50,150 a web application logging the page views. 101 00:03:51,390 --> 00:03:52,430 In all of these cases we can 102 00:03:52,580 --> 00:03:53,580 see that the fact table can 103 00:03:53,800 --> 00:03:55,130 be very large and can 104 00:03:55,360 --> 00:03:57,350 be append only, so inserts only. 105 00:03:58,400 --> 00:03:59,880 Examples of dimension tables 106 00:04:00,660 --> 00:04:01,880 might be in a sales 107 00:04:02,130 --> 00:04:03,850 database store's items and 108 00:04:04,030 --> 00:04:07,390 customers in a college enrollment database. 109 00:04:07,940 --> 00:04:10,380 Maybe students and courses in a web application. 110 00:04:11,110 --> 00:04:13,550 Maybe web pages his users and advertisers. 111 00:04:14,180 --> 00:04:15,280 So, you can see, that 112 00:04:15,540 --> 00:04:17,000 these are generally smaller 113 00:04:17,520 --> 00:04:18,850 tables, they're more stable, they're 114 00:04:18,970 --> 00:04:20,370 not updated as frequently. 115 00:04:21,400 --> 00:04:22,360 You can sort of think 116 00:04:22,600 --> 00:04:24,270 of dimension tables as things 117 00:04:24,660 --> 00:04:26,480 in the real world and then 118 00:04:26,920 --> 00:04:29,010 fact tables as logging things that happened. 119 00:04:29,870 --> 00:04:32,680 It's not always divided this way but, it's not a bad approximation. 120 00:04:34,240 --> 00:04:35,140 Now, you might be wondering 121 00:04:35,500 --> 00:04:36,920 why is it called a 122 00:04:37,100 --> 00:04:38,700 star schema and it's called 123 00:04:39,050 --> 00:04:40,100 that because we have the 124 00:04:40,310 --> 00:04:41,510 fact table sort of, centrally 125 00:04:42,220 --> 00:04:44,190 referencing dimension tables around it. 126 00:04:44,320 --> 00:04:44,890 So, I'll draw the picture. 127 00:04:45,670 --> 00:04:48,250 Let's take a particular example and let's look at the sales domain. 128 00:04:49,140 --> 00:04:50,240 So, we'll have our fact 129 00:04:50,660 --> 00:04:51,690 table here, which will be 130 00:04:51,900 --> 00:04:53,610 the sales table and that 131 00:04:53,820 --> 00:04:55,360 will log sales transactions actions. 132 00:04:55,770 --> 00:04:56,860 It will include the store where 133 00:04:57,000 --> 00:04:58,250 the sale was made, the item 134 00:04:58,640 --> 00:04:59,720 that was sold, the customer, how 135 00:04:59,890 --> 00:05:01,510 many were sold, and the price that was paid. 136 00:05:02,260 --> 00:05:04,800 And then the other three tables are the dimension tables. 137 00:05:05,490 --> 00:05:07,090 So those those are giving 138 00:05:07,240 --> 00:05:08,330 us information about the stores 139 00:05:08,950 --> 00:05:09,960 and the items and the customers. 140 00:05:11,470 --> 00:05:13,200 So, I've drawn a picture of our schema here. 141 00:05:13,420 --> 00:05:15,860 We have our central fact table, the sales table. 142 00:05:16,850 --> 00:05:17,700 And we can see that the 143 00:05:17,790 --> 00:05:18,990 sales table contains these three 144 00:05:19,190 --> 00:05:20,580 columns I've abbreviated them in 145 00:05:20,650 --> 00:05:22,100 the picture: the Store ID, Item 146 00:05:22,470 --> 00:05:23,850 ID, and the Customer ID. 147 00:05:24,870 --> 00:05:27,030 The store ID values in 148 00:05:27,230 --> 00:05:28,700 this column will be foreign 149 00:05:29,170 --> 00:05:30,460 key attributes to the 150 00:05:30,540 --> 00:05:31,650 primary key of the store 151 00:05:31,890 --> 00:05:33,750 table if you remember our constraints video. 152 00:05:34,520 --> 00:05:35,550 So we can think of these as 153 00:05:35,860 --> 00:05:36,870 pointers into the store 154 00:05:37,200 --> 00:05:39,660 table, least specifically matching store 155 00:05:40,410 --> 00:05:41,300 IDs over here. 156 00:05:42,070 --> 00:05:44,230 And we'll have similarly our 157 00:05:44,880 --> 00:05:45,780 item IDs will be foreign 158 00:05:46,260 --> 00:05:47,340 keys to the item table. 159 00:05:47,770 --> 00:05:49,510 I won't actually point to the values here. 160 00:05:50,330 --> 00:05:52,960 And then our costumer IDs 161 00:05:53,310 --> 00:05:55,680 over here will be pointing to the customer table. 162 00:05:56,490 --> 00:05:57,650 So if you look at 163 00:05:57,790 --> 00:05:59,830 this squinting, you will 164 00:06:00,130 --> 00:06:01,420 see that it is kind of 165 00:06:01,480 --> 00:06:02,930 a star schema with the 166 00:06:03,260 --> 00:06:04,960 central fact table pointing 167 00:06:05,280 --> 00:06:08,250 to the dimension tables around it, and that's where the name comes from. 168 00:06:09,380 --> 00:06:10,830 Just a little more terminology. 169 00:06:11,550 --> 00:06:14,030 The first three attributes here in the fact fact table. 170 00:06:14,290 --> 00:06:15,700 These three are what are 171 00:06:15,750 --> 00:06:17,420 known as dimension attributes. 172 00:06:18,340 --> 00:06:19,630 So those are the attributes 173 00:06:20,120 --> 00:06:22,580 that are foreign keys into the dimension tables. 174 00:06:23,600 --> 00:06:25,150 Then the remaining attributes in 175 00:06:25,240 --> 00:06:26,580 this case the quantity and the 176 00:06:26,680 --> 00:06:28,870 price are called dependent attributes. 177 00:06:29,340 --> 00:06:30,810 So they're I guess dependent on the 178 00:06:30,960 --> 00:06:32,800 values for the dimension 179 00:06:33,440 --> 00:06:35,150 attributes and typically, queries will 180 00:06:35,580 --> 00:06:38,070 tend to aggregate on the dependent attributes. 181 00:06:38,420 --> 00:06:39,920 We'll see examples of that in a moment. 182 00:06:40,880 --> 00:06:41,910 So, now that we known 183 00:06:42,050 --> 00:06:43,230 what a star schema looks like, 184 00:06:43,450 --> 00:06:44,350 let's look at the type of 185 00:06:44,530 --> 00:06:45,730 queries that are generally issued 186 00:06:46,000 --> 00:06:48,010 over this schema, and they're called OLAP queries. 187 00:06:48,900 --> 00:06:50,320 Typically a query over 188 00:06:50,920 --> 00:06:52,350 a star schema will first 189 00:06:52,740 --> 00:06:54,400 join some or all of the relations. 190 00:06:55,700 --> 00:06:56,820 And when you're joining the sale 191 00:06:57,240 --> 00:06:58,210 as the fact table with 192 00:06:58,340 --> 00:06:59,880 the dimension tables, you can 193 00:06:59,920 --> 00:07:01,780 almost think of it as expanding the 194 00:07:02,050 --> 00:07:03,080 facts in the sales table 195 00:07:03,390 --> 00:07:05,410 to include more information about the sales. 196 00:07:06,100 --> 00:07:07,820 Since we have the foreign keys we'll 197 00:07:07,980 --> 00:07:10,810 be adding, for example, to the information about a sale. 198 00:07:11,150 --> 00:07:11,870 More about the store. 199 00:07:12,170 --> 00:07:13,140 The city and state of the store. 200 00:07:13,910 --> 00:07:15,410 For a sale item will 201 00:07:15,500 --> 00:07:17,050 be adding the category brand and so on. 202 00:07:17,090 --> 00:07:18,390 So that's the join 203 00:07:18,710 --> 00:07:19,630 process and the query will 204 00:07:19,770 --> 00:07:20,660 join as much as it 205 00:07:20,750 --> 00:07:23,100 needs in order to do the rest of it's work. 206 00:07:23,950 --> 00:07:25,590 It might then filter the data. 207 00:07:26,060 --> 00:07:27,000 For example we might decide 208 00:07:27,380 --> 00:07:28,280 that in our query we only 209 00:07:28,540 --> 00:07:29,760 care about stores in California 210 00:07:30,710 --> 00:07:32,290 or customers in California, we're only 211 00:07:32,520 --> 00:07:34,300 interested in shirts and so on. 212 00:07:34,540 --> 00:07:35,850 So they can filter on the 213 00:07:36,390 --> 00:07:37,840 dimension attributes after joining, 214 00:07:38,160 --> 00:07:40,210 or could filter on the price or quantity as well. 215 00:07:41,700 --> 00:07:44,620 After filtering there's often a group by an aggregation. 216 00:07:45,890 --> 00:07:47,230 So we might decide that we're 217 00:07:47,450 --> 00:07:49,510 interested in figuring out our 218 00:07:49,910 --> 00:07:51,660 total sales divided by customer 219 00:07:52,370 --> 00:07:54,520 or by item or by state or all of those. 220 00:07:55,070 --> 00:07:57,080 And then the aggregation might sum 221 00:07:57,290 --> 00:07:58,290 up the sales or it might 222 00:07:58,560 --> 00:08:00,530 determine the average price that's sold. 223 00:08:01,290 --> 00:08:02,390 We'll be doing a number of this 224 00:08:02,530 --> 00:08:04,410 type of query in our demo later on. 225 00:08:05,280 --> 00:08:06,630 So if you think about executing 226 00:08:07,000 --> 00:08:08,380 queries of this type, they 227 00:08:08,680 --> 00:08:10,190 can be quite complex and they 228 00:08:10,260 --> 00:08:11,760 can touch large portions of the database. 229 00:08:12,720 --> 00:08:13,940 Sowe 're worried about performance, 230 00:08:14,630 --> 00:08:16,780 and our data is large, we do have a worry. 231 00:08:17,330 --> 00:08:18,670 Running this type of 232 00:08:18,830 --> 00:08:20,210 query on a gigantic database 233 00:08:21,020 --> 00:08:22,560 over a standard database system 234 00:08:22,910 --> 00:08:24,720 can be very slow, but over 235 00:08:25,560 --> 00:08:26,580 the past decade or so, 236 00:08:27,230 --> 00:08:28,720 special indexing techniques have 237 00:08:28,830 --> 00:08:30,170 been introduced and special query processing 238 00:08:30,600 --> 00:08:32,060 techniques specifically to handle 239 00:08:32,830 --> 00:08:34,340 this type of query on 240 00:08:34,520 --> 00:08:35,940 star schemas on large databases. 241 00:08:36,500 --> 00:08:37,880 And again, by large, just 242 00:08:37,900 --> 00:08:39,140 think about the number of sales, 243 00:08:39,590 --> 00:08:40,640 for example, in a large 244 00:08:40,930 --> 00:08:42,720 retail chain, or a 245 00:08:43,060 --> 00:08:44,160 number of web views, or 246 00:08:44,260 --> 00:08:45,620 even shopping cart additions 247 00:08:46,210 --> 00:08:47,370 in a large online vendor. 248 00:08:48,250 --> 00:08:49,100 So, in all of those 249 00:08:49,220 --> 00:08:50,410 applications, people are interested in 250 00:08:50,540 --> 00:08:51,410 doing OLAP queries and they 251 00:08:52,050 --> 00:08:54,440 tend to use a system that supports these special techniques. 252 00:08:55,760 --> 00:08:57,260 Another component of getting good 253 00:08:57,460 --> 00:08:58,640 performance in these systems 254 00:08:59,360 --> 00:09:00,670 is the use of materialized views. 255 00:09:01,330 --> 00:09:02,900 You might remember that materialized 256 00:09:03,020 --> 00:09:04,240 views are useful when we 257 00:09:04,420 --> 00:09:06,140 have a workload that 258 00:09:06,360 --> 00:09:08,610 consists of lots of queries and not so many updates. 259 00:09:09,120 --> 00:09:11,760 And that's exactly the type of workload we have in OLAP. 260 00:09:12,580 --> 00:09:14,100 furthermore, we have many queries 261 00:09:14,570 --> 00:09:15,580 that take roughly the same 262 00:09:15,910 --> 00:09:17,430 structure so material wise we 263 00:09:17,570 --> 00:09:18,990 use are useful in that setting as well. 264 00:09:19,820 --> 00:09:20,950 Now let me switch gears 265 00:09:21,230 --> 00:09:22,590 and introduce a different way 266 00:09:22,890 --> 00:09:24,640 of looking at the data in 267 00:09:24,700 --> 00:09:26,360 these OLAP applications with star 268 00:09:26,730 --> 00:09:28,820 schemas, and it's what's known as a data cube. 269 00:09:29,440 --> 00:09:30,410 Sometimes this is also called 270 00:09:30,650 --> 00:09:32,850 multidimensional OLAP and 271 00:09:32,980 --> 00:09:34,110 the basic idea is that 272 00:09:34,270 --> 00:09:35,640 when we have data with dimensions, 273 00:09:36,470 --> 00:09:37,430 we can think of those dimensions 274 00:09:38,420 --> 00:09:39,900 as forming the axis of a cube. 275 00:09:40,550 --> 00:09:42,520 It's kind of like an N dimensional spreadsheet. 276 00:09:43,490 --> 00:09:44,810 Now we can have any 277 00:09:45,010 --> 00:09:46,380 number of dimensions, but for 278 00:09:46,680 --> 00:09:48,050 the examples I'm gonna give, 279 00:09:49,450 --> 00:09:50,430 the best I can draw is 280 00:09:50,550 --> 00:09:52,490 up to three dimensions, and that's why people call acute. 281 00:09:52,770 --> 00:09:54,150 Because they know how to draw three dimensions. 282 00:09:54,930 --> 00:09:56,140 But again, any number of 283 00:09:56,400 --> 00:09:58,340 dimensions are possible in this view of the data. 284 00:09:59,630 --> 00:10:02,740 So we have our dimensions forming the axis of our cube. 285 00:10:03,420 --> 00:10:04,480 And then the cells of the 286 00:10:04,600 --> 00:10:06,660 cube, again, you can think of it sort of like cells of a spreadsheet. 287 00:10:07,820 --> 00:10:08,960 Are the fact of data. 288 00:10:09,410 --> 00:10:10,190 Or the dependent data. 289 00:10:10,430 --> 00:10:11,690 It's like in the previous example that 290 00:10:11,820 --> 00:10:13,510 would be our quantity and price. 291 00:10:14,430 --> 00:10:15,530 And finally we have aggregated 292 00:10:16,150 --> 00:10:17,740 data on the sides, edges 293 00:10:18,130 --> 00:10:19,780 and corners of corner of the cube. 294 00:10:20,100 --> 00:10:22,930 Again similar to how you might aggregate columns in a spreadsheet. 295 00:10:23,840 --> 00:10:24,970 So let's go ahead and 296 00:10:25,210 --> 00:10:26,160 I'll do my best to draw 297 00:10:26,450 --> 00:10:27,940 a picture to explain what's going on. 298 00:10:29,010 --> 00:10:30,260 So here's my cube with these 299 00:10:30,650 --> 00:10:32,530 three axes that I've drawn in black. 300 00:10:33,080 --> 00:10:34,510 And I've drawn these dash lines 301 00:10:34,900 --> 00:10:36,020 as well to sort of give 302 00:10:36,170 --> 00:10:37,660 you a visual idea of the cube. 303 00:10:38,400 --> 00:10:39,410 But I'm going to actually 304 00:10:39,490 --> 00:10:40,720 get rid of these dash lines right 305 00:10:40,880 --> 00:10:42,220 now just so we don't have too much clutter. 306 00:10:43,310 --> 00:10:44,790 So for our sales example, 307 00:10:45,300 --> 00:10:46,270 we're sticking with the same example, 308 00:10:47,200 --> 00:10:48,320 we have 3 dimensions. 309 00:10:48,830 --> 00:10:49,750 And those will label the three 310 00:10:49,780 --> 00:10:50,850 the three axises of are cube 311 00:10:51,690 --> 00:10:53,000 and in one dimension we will 312 00:10:53,150 --> 00:10:55,190 have the stores and another 313 00:10:55,500 --> 00:10:56,520 dimension we will have the 314 00:10:56,600 --> 00:10:58,840 customers here, and in 315 00:10:58,960 --> 00:11:00,930 another dimension we have the items. 316 00:11:01,970 --> 00:11:03,010 Then we can think of 317 00:11:03,250 --> 00:11:04,520 the points along these axes 318 00:11:05,210 --> 00:11:06,900 as being the different elements 319 00:11:07,510 --> 00:11:08,700 in each of those domains, or 320 00:11:08,970 --> 00:11:11,470 the different tuples in each of those dimension tables. 321 00:11:12,460 --> 00:11:13,980 So for example, in the 322 00:11:14,220 --> 00:11:15,660 store domain, we'll have, 323 00:11:16,160 --> 00:11:18,300 you know, store 1 store 2, 324 00:11:18,790 --> 00:11:20,310 store 3 and so on. 325 00:11:20,620 --> 00:11:22,080 I'm not giving them any fancy names here. 326 00:11:22,880 --> 00:11:23,740 And so, each of those is 327 00:11:23,900 --> 00:11:25,090 a point on that dimension and 328 00:11:25,380 --> 00:11:27,100 similarly for the items will have 329 00:11:27,250 --> 00:11:29,820 item 1 item 2 item 3 and so on. 330 00:11:30,650 --> 00:11:32,140 And for the customers along the bottom, 331 00:11:33,020 --> 00:11:34,280 we'll have customer 1 customer 332 00:11:34,990 --> 00:11:37,010 number 2, customer 3 and so on. 333 00:11:38,150 --> 00:11:39,950 Now here comes the tricky part, especially for drawing. 334 00:11:40,510 --> 00:11:41,380 The idea is Is that 335 00:11:41,830 --> 00:11:43,320 every cell in the 336 00:11:43,400 --> 00:11:44,970 cube, so every combination of 337 00:11:45,180 --> 00:11:46,950 item costumer in store has 338 00:11:47,160 --> 00:11:48,280 a cell in the cube, so 339 00:11:48,370 --> 00:11:50,340 this would be sort of a free floating cell here. 340 00:11:51,050 --> 00:11:52,350 And This will have for our 341 00:11:52,770 --> 00:11:54,770 schema the quantity and 342 00:11:54,920 --> 00:11:56,490 the price for that 343 00:11:56,960 --> 00:11:59,910 item, that customer, and that store. 344 00:12:00,220 --> 00:12:01,440 So this might be the floating 345 00:12:01,750 --> 00:12:02,720 thing here that's, you know, Item 346 00:12:03,120 --> 00:12:05,060 I32, Costumer 4, and 347 00:12:05,710 --> 00:12:07,320 Store 17, something like that. 348 00:12:08,010 --> 00:12:09,520 And then floating in 349 00:12:09,620 --> 00:12:12,240 there is this cell with the quantity and the price. 350 00:12:12,950 --> 00:12:14,130 Now we are assuming that there's 351 00:12:14,350 --> 00:12:15,640 just one quantity and price 352 00:12:15,960 --> 00:12:17,500 for the combination of those three attributes. 353 00:12:17,830 --> 00:12:20,670 And I'll come back to that in a moment, but let's assume that for now. 354 00:12:21,650 --> 00:12:22,720 So that's what we have in 355 00:12:22,970 --> 00:12:24,770 the whole central area of the cube. 356 00:12:25,880 --> 00:12:27,680 So now on the faces, edges, 357 00:12:28,060 --> 00:12:29,130 and corner of the cube 358 00:12:29,670 --> 00:12:31,420 are going to have aggregated data. 359 00:12:32,230 --> 00:12:33,910 And there does need 360 00:12:34,140 --> 00:12:36,320 to be with each data cube a predefined aggregate. 361 00:12:37,460 --> 00:12:39,190 So for this one let's say 362 00:12:39,380 --> 00:12:40,610 that what we want as our 363 00:12:40,760 --> 00:12:42,130 aggregate is the sum 364 00:12:42,340 --> 00:12:43,790 of the quantity times the price 365 00:12:44,050 --> 00:12:44,850 so we're going to figure 366 00:12:44,890 --> 00:12:45,860 out the total amount that we're 367 00:12:46,010 --> 00:12:47,170 making for different combinations 368 00:12:47,820 --> 00:12:48,920 of stores, items, and customers. 369 00:12:49,810 --> 00:12:51,780 So now let's consider a 370 00:12:51,850 --> 00:12:53,300 cell on the face of the cube. 371 00:12:53,970 --> 00:12:56,110 So again, I'm not drawing this very well. 372 00:12:56,300 --> 00:12:58,560 But let's assume this is on the bottom face of the cube. 373 00:12:58,940 --> 00:13:00,730 So, this is for a particular customer. 374 00:13:01,390 --> 00:13:02,640 Say customer 10, in a 375 00:13:02,890 --> 00:13:04,230 particular store, say store 7, 376 00:13:04,530 --> 00:13:06,200 and then, since it's 377 00:13:06,440 --> 00:13:07,170 on the bottom of the cube, 378 00:13:07,490 --> 00:13:08,890 so we didn't go up this dimension 379 00:13:09,440 --> 00:13:11,160 here, it considers all items 380 00:13:11,780 --> 00:13:13,290 for customer 10 and store 7. 381 00:13:13,480 --> 00:13:14,590 So this will be 382 00:13:15,370 --> 00:13:17,570 the aggregate over all items 383 00:13:18,360 --> 00:13:19,930 for that particular store and customer. 384 00:13:21,010 --> 00:13:23,920 And we'd have similar values on the other faces of the cube. 385 00:13:24,240 --> 00:13:25,460 So this face over here, for 386 00:13:25,660 --> 00:13:26,680 example, would be for a 387 00:13:27,010 --> 00:13:29,770 particular item and customer overall stores. 388 00:13:30,700 --> 00:13:32,310 And then on the front face 389 00:13:32,640 --> 00:13:33,490 of the cube, if you could imagine 390 00:13:34,300 --> 00:13:35,250 that, would be for a particular 391 00:13:35,510 --> 00:13:36,810 item and store over all customers. 392 00:13:38,250 --> 00:13:39,920 Now let's talk about what's on the edge of the cube. 393 00:13:40,290 --> 00:13:41,530 So here we have, say 394 00:13:41,970 --> 00:13:43,440 for store 3, we'll 395 00:13:44,180 --> 00:13:45,870 have the aggregate value over 396 00:13:46,440 --> 00:13:49,060 all customers and items 397 00:13:49,590 --> 00:13:51,370 in this point for store 3. 398 00:13:51,600 --> 00:13:52,720 So that will be the 399 00:13:52,810 --> 00:13:54,080 total sales that we 400 00:13:54,330 --> 00:13:55,880 conducted at store S3. 401 00:13:56,880 --> 00:13:57,870 Over here on this edge 402 00:13:58,140 --> 00:13:59,340 we'd have the total for a 403 00:13:59,560 --> 00:14:01,640 specific costumer and over here for specific items. 404 00:14:02,360 --> 00:14:03,770 And then finally, we have 405 00:14:04,020 --> 00:14:05,970 at the corner of the 406 00:14:06,800 --> 00:14:08,050 cube the full aggregation. 407 00:14:08,830 --> 00:14:10,240 So that's going to be in 408 00:14:10,430 --> 00:14:11,750 this case the sum of 409 00:14:11,820 --> 00:14:13,540 the quantity times price for every 410 00:14:13,970 --> 00:14:15,220 store, customer and item. 411 00:14:16,350 --> 00:14:17,520 So, I'm not a great artist, 412 00:14:17,890 --> 00:14:18,850 but I hope this gives you 413 00:14:19,150 --> 00:14:21,100 some understanding of how the data cube works. 414 00:14:21,650 --> 00:14:22,880 So as we saw in the 415 00:14:23,000 --> 00:14:24,140 cube, we have one cell 416 00:14:24,420 --> 00:14:26,080 in the cube for each combination of 417 00:14:26,250 --> 00:14:28,400 store ID, item ID, and customer ID. 418 00:14:29,090 --> 00:14:31,810 So if those three together form a key, then it's very straight forward. 419 00:14:32,030 --> 00:14:33,780 If the dimension attributes together 420 00:14:34,150 --> 00:14:35,450 don't form a key then 421 00:14:35,680 --> 00:14:38,830 we might be pre-aggregating already inside the data cube. 422 00:14:39,130 --> 00:14:40,380 So, we might decide to already 423 00:14:40,630 --> 00:14:41,760 have say the sum of quantity 424 00:14:42,240 --> 00:14:43,710 times price for each combination 425 00:14:44,340 --> 00:14:45,220 of store item and customer. 426 00:14:46,740 --> 00:14:47,640 Another possibility and it's done 427 00:14:47,960 --> 00:14:49,470 quite commonly is to add 428 00:14:49,870 --> 00:14:51,190 to the fact table the attribute 429 00:14:51,600 --> 00:14:52,800 date, or even the time. 430 00:14:53,480 --> 00:14:55,020 And that can be used to create a key. 431 00:14:55,470 --> 00:14:56,500 Typically, we won't have two 432 00:14:57,000 --> 00:14:58,890 transactions at exactly the same time. 433 00:14:59,550 --> 00:15:00,460 Now if we do have an 434 00:15:00,740 --> 00:15:02,450 attribute here called date, one 435 00:15:02,660 --> 00:15:03,790 might wonder is that a 436 00:15:03,970 --> 00:15:05,990 dimension attribute or a dependent attribute. 437 00:15:07,400 --> 00:15:09,050 Actually, it's pretty much 438 00:15:09,290 --> 00:15:10,870 a dimension attribute because we're 439 00:15:11,160 --> 00:15:12,110 gonna use it as another dimension 440 00:15:12,530 --> 00:15:13,760 in our data cube, but the 441 00:15:13,830 --> 00:15:14,610 difference being that we would 442 00:15:14,870 --> 00:15:17,490 not have an actual dimension table listing the dates. 443 00:15:18,400 --> 00:15:19,550 Now let's move on to 444 00:15:19,710 --> 00:15:20,970 a couple other concepts in the 445 00:15:21,080 --> 00:15:23,610 olap world called drill down and roll up. 446 00:15:24,240 --> 00:15:25,710 The idea of drill 447 00:15:26,100 --> 00:15:26,920 down, is that we may 448 00:15:27,160 --> 00:15:30,290 be examining summary data and then we want to get more information. 449 00:15:30,770 --> 00:15:32,680 Drill down into the details of that data. 450 00:15:33,680 --> 00:15:34,430 And, actually, we can think 451 00:15:34,620 --> 00:15:36,590 of that very specifically in 452 00:15:36,690 --> 00:15:38,110 a sequel context as follows. 453 00:15:38,820 --> 00:15:39,790 Let's suppose that we have 454 00:15:40,200 --> 00:15:41,480 this query and sequel which 455 00:15:41,720 --> 00:15:43,350 follows by the way the 456 00:15:43,880 --> 00:15:44,970 description of the query I 457 00:15:45,180 --> 00:15:46,290 had earlier where we'll do a 458 00:15:46,360 --> 00:15:47,840 join and then a selection 459 00:15:48,340 --> 00:15:50,960 and then it grouped by, and finally we have an aggregation here. 460 00:15:51,740 --> 00:15:53,300 So this query specifically is 461 00:15:53,420 --> 00:15:55,120 looking at our total sales broken 462 00:15:55,540 --> 00:15:57,010 out by state and brand. 463 00:15:58,010 --> 00:16:00,690 Maybe we'll look at that and we'll just say that's not enough detail. 464 00:16:01,130 --> 00:16:01,790 I need more information. 465 00:16:02,900 --> 00:16:04,300 So, to drill down what we 466 00:16:04,440 --> 00:16:06,360 do is add a grouping attribute. 467 00:16:06,840 --> 00:16:07,830 So if we added, for example, 468 00:16:08,330 --> 00:16:09,940 category, when we add 469 00:16:10,080 --> 00:16:11,250 another grouping attribute, that gets 470 00:16:11,520 --> 00:16:13,880 us more data in the answer - more detail in our data. 471 00:16:15,280 --> 00:16:16,810 Rollup is exactly the opposite. 472 00:16:17,370 --> 00:16:18,620 Rollup says we're looking at 473 00:16:18,810 --> 00:16:20,010 data and we decide we 474 00:16:20,130 --> 00:16:22,150 have too much detail and we want to summarize. 475 00:16:22,550 --> 00:16:24,200 And summarize is simply 476 00:16:24,540 --> 00:16:26,110 a matter of removing a group by attributes. 477 00:16:26,540 --> 00:16:28,000 So if we took out state, then 478 00:16:28,210 --> 00:16:29,170 now we'll only see our 479 00:16:29,260 --> 00:16:30,520 data summarized by brand 480 00:16:31,090 --> 00:16:32,940 rather than broken out into state and brand. 481 00:16:34,170 --> 00:16:36,250 And lastly, I want to add introduce some SQL constructs. 482 00:16:37,070 --> 00:16:38,150 These are constructs that have been added, 483 00:16:38,500 --> 00:16:39,460 fairly recently, to the SQL 484 00:16:39,810 --> 00:16:42,060 standard in order to perform OLAP queries. 485 00:16:42,460 --> 00:16:43,720 And we'll be seeing these in our demo. 486 00:16:44,390 --> 00:16:45,840 The constructs are called with cube 487 00:16:46,150 --> 00:16:47,530 and with roll up and 488 00:16:47,640 --> 00:16:49,080 they're added to the group by clause. 489 00:16:50,060 --> 00:16:52,080 When we add with cube to 490 00:16:52,360 --> 00:16:53,320 a query with a group by 491 00:16:54,000 --> 00:16:55,060 what happens is that, basically, 492 00:16:55,670 --> 00:16:56,750 we're adding to the result 493 00:16:57,250 --> 00:16:58,540 of our query, the faces, 494 00:16:59,380 --> 00:17:00,850 edges, and corner of the cube. 495 00:17:01,710 --> 00:17:04,340 Using no values for the attributes that we're not constraining. 496 00:17:05,030 --> 00:17:06,060 We'll see this clearly in the demo. 497 00:17:06,890 --> 00:17:10,550 With rollup is similar to withcube, except it's smaller. 498 00:17:10,950 --> 00:17:12,040 It actually is a portion of 499 00:17:12,130 --> 00:17:13,460 the data cube, and that 500 00:17:13,620 --> 00:17:14,610 makes sense when we have dimensions 501 00:17:15,120 --> 00:17:16,240 that are inherently hierarchical. 502 00:17:17,060 --> 00:17:18,450 And again we'll see that in the demo as well. 503 00:17:19,400 --> 00:17:20,390 So, we can conclude there are 504 00:17:20,510 --> 00:17:21,790 two broad types of data 505 00:17:21,990 --> 00:17:23,960 base activity, online transaction processing. 506 00:17:24,590 --> 00:17:26,240 Short, simple transactions touching 507 00:17:26,550 --> 00:17:27,390 small portions of the data, 508 00:17:28,160 --> 00:17:29,770 lots of updating and OLAP, 509 00:17:30,120 --> 00:17:32,160 or online analytical processing, where 510 00:17:32,290 --> 00:17:33,800 we have complex queries, long transactions, 511 00:17:34,880 --> 00:17:35,890 might touch a large portion of 512 00:17:35,980 --> 00:17:37,820 the data and might not update the data at all. 513 00:17:38,530 --> 00:17:40,310 For online analytical processing OLAP 514 00:17:40,790 --> 00:17:42,910 we saw that star schemas are frequently used. 515 00:17:43,490 --> 00:17:46,180 We saw how to view the data as a data cube. 516 00:17:46,740 --> 00:17:48,380 Of course, that can be in any number of dimensions. 517 00:17:48,670 --> 00:17:49,780 We just use three for visualization. 518 00:17:50,140 --> 00:17:52,460 There are two new 519 00:17:52,680 --> 00:17:54,760 constructs in sequel withcube and with rollup. 520 00:17:55,580 --> 00:17:56,930 And finally this type of 521 00:17:57,080 --> 00:17:58,580 query can be very stressful on 522 00:17:58,810 --> 00:18:00,380 a database system when we have very large databases. 523 00:18:01,220 --> 00:18:02,530 So special techniques have been 524 00:18:02,830 --> 00:18:05,750 introduced into systems to help perform these queries efficiently.