1 00:00:00,380 --> 00:00:02,910 In this video we'll be doing a live demonstration of OLAP. 2 00:00:03,820 --> 00:00:05,340 We'll create a star schema similar 3 00:00:05,650 --> 00:00:06,940 to the one we used in our previous examples. 4 00:00:07,640 --> 00:00:09,160 It will be sales data with 5 00:00:09,340 --> 00:00:10,630 a fact table and several 6 00:00:11,040 --> 00:00:12,540 dimension tables and then we'll 7 00:00:12,720 --> 00:00:14,380 do several different types of OLAP queries. 8 00:00:15,150 --> 00:00:17,360 We'll show a full star join over the star schema. 9 00:00:18,480 --> 00:00:20,830 We'll show some examples of drill-down and roll-up. 10 00:00:21,360 --> 00:00:22,560 We'll also show what's known as 11 00:00:22,810 --> 00:00:23,920 slicing and dicing, which we 12 00:00:24,000 --> 00:00:26,050 haven't explained yet; we'll just explain them when we get to those queries. 13 00:00:26,910 --> 00:00:28,130 And finally we'll show the With 14 00:00:28,730 --> 00:00:29,890 Cube and With Rollup 15 00:00:30,020 --> 00:00:31,020 clauses that have been added 16 00:00:31,660 --> 00:00:33,300 to the SQL standard for OLAP queries. 17 00:00:33,780 --> 00:00:35,110 For those we'll be using 18 00:00:35,390 --> 00:00:37,760 MySQL; actually, for the whole demo we'll be using MySQL. 19 00:00:38,530 --> 00:00:40,370 MySQL supports With Rollup. 20 00:00:40,890 --> 00:00:41,700 Neither of the other systems 21 00:00:42,140 --> 00:00:43,760 that we've been using, SQLite 22 00:00:44,030 --> 00:00:45,760 or Postgres, supports With Rollup yet. 23 00:00:46,560 --> 00:00:47,710 And MySQL does not yet 24 00:00:47,890 --> 00:00:49,450 support With Cube, but we'll 25 00:00:49,590 --> 00:00:50,800 see that we can simulate the 26 00:00:51,130 --> 00:00:52,350 behavior of With Cube using the 27 00:00:52,780 --> 00:00:53,970 With Rollup construct of MySQL. 28 00:00:54,660 --> 00:00:55,480 So let's move to the demo.
29 00:00:56,470 --> 00:00:57,900 Here we are in the demo, we 30 00:00:58,080 --> 00:00:59,400 have a fact table that 31 00:00:59,570 --> 00:01:01,410 contains information about stores, items and customers. 32 00:01:01,940 --> 00:01:03,870 We don't see much here except their ID values. 33 00:01:04,340 --> 00:01:05,780 And we have sixty tuples 34 00:01:06,170 --> 00:01:07,440 loaded up in our fact table. 35 00:01:08,240 --> 00:01:10,630 Our dimension tables are the customer table. 36 00:01:10,830 --> 00:01:12,890 We have four customers with a gender and an age. 37 00:01:13,830 --> 00:01:15,660 We have our item table: five 38 00:01:15,980 --> 00:01:17,170 items, a category of the 39 00:01:17,240 --> 00:01:18,270 item, just t-shirts and jackets, 40 00:01:18,790 --> 00:01:19,890 just in blue and red, and 41 00:01:20,850 --> 00:01:22,320 finally we have our stores. 42 00:01:23,270 --> 00:01:24,420 And we have six stores that 43 00:01:24,890 --> 00:01:26,770 are in various cities in Santa 44 00:01:27,020 --> 00:01:28,370 Clara and San Mateo Counties in 45 00:01:28,470 --> 00:01:30,340 California, and King County in Washington. 46 00:01:31,580 --> 00:01:34,170 Our first query is simply the complete star join. 47 00:01:34,540 --> 00:01:35,920 So, we're joining all four tables 48 00:01:36,430 --> 00:01:37,690 using the foreign key references 49 00:01:38,210 --> 00:01:40,750 in the fact table, to the keys in the dimension tables. 50 00:01:41,590 --> 00:01:42,890 Now since these are foreign key 51 00:01:43,040 --> 00:01:44,340 references, you can think of 52 00:01:44,430 --> 00:01:46,000 this star join as simply 53 00:01:46,590 --> 00:01:48,060 extending the tuples in the 54 00:01:48,250 --> 00:01:49,610 sales table with further information 55 00:01:49,860 --> 00:01:50,850 about the store, item, and customer 56 00:01:51,640 --> 00:01:52,370 mentioned in each one. 57 00:01:53,000 --> 00:01:53,820 So, here we go.
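The star join described above can be sketched in miniature. This is a hypothetical reconstruction, not the video's script: the demo runs on MySQL with sixty sales tuples, while this sketch uses SQLite via Python's sqlite3 module, and all table names, column names, and data are invented stand-ins.

```python
import sqlite3

# Assumed miniature star schema: one fact table (sales) holding foreign
# keys into three dimension tables, as in the video.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE store(store_id INTEGER PRIMARY KEY, city TEXT, county TEXT, state TEXT);
    CREATE TABLE item(item_id INTEGER PRIMARY KEY, category TEXT, color TEXT);
    CREATE TABLE customer(cust_id INTEGER PRIMARY KEY, gender TEXT, age INTEGER);
    CREATE TABLE sales(store_id INT, item_id INT, cust_id INT, price INT);
    INSERT INTO store VALUES (1,'Palo Alto','Santa Clara','CA'),(2,'Seattle','King','WA');
    INSERT INTO item VALUES (1,'t-shirt','red'),(2,'jacket','blue');
    INSERT INTO customer VALUES (1,'F',20),(2,'M',30);
    INSERT INTO sales VALUES (1,1,1,15),(1,2,2,50),(2,2,1,60),(1,1,1,20);
""")

# The complete star join: each fact tuple is extended with the matching
# dimension data through the foreign-key joins.
rows = conn.execute("""
    SELECT f.store_id, f.item_id, f.cust_id,
           s.city, s.state, i.category, i.color, c.gender, c.age, f.price
    FROM sales f
    JOIN store s ON f.store_id = s.store_id
    JOIN item i ON f.item_id = i.item_id
    JOIN customer c ON f.cust_id = c.cust_id
""").fetchall()
print(len(rows))  # one result row per fact-table row
```

Because every foreign key matches exactly one dimension tuple, the join has exactly as many rows as the fact table (four here, sixty in the demo).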
58 00:01:54,310 --> 00:01:56,520 And we should expect again sixty tuples in the results, 59 00:01:57,030 --> 00:01:58,120 one for each tuple in the 60 00:01:58,260 --> 00:01:59,500 fact table, and we can 61 00:01:59,660 --> 00:02:01,270 see that we have the information from 62 00:02:01,360 --> 00:02:02,570 the fact table, but then 63 00:02:02,740 --> 00:02:04,750 we also have more information that 64 00:02:04,850 --> 00:02:06,450 we've joined in from the dimension tables. 65 00:02:07,470 --> 00:02:08,820 Now it's not typical to do 66 00:02:08,930 --> 00:02:10,270 the complete star join, usually 67 00:02:10,600 --> 00:02:12,940 we would have at least constrained the star join in some way. 68 00:02:13,200 --> 00:02:14,930 So, our next query will do 69 00:02:15,320 --> 00:02:17,460 selections and projections on the JOIN. 70 00:02:18,940 --> 00:02:20,000 We're going to limit ourselves to 71 00:02:20,280 --> 00:02:21,190 sales where the state of the 72 00:02:21,290 --> 00:02:22,490 sale is California, the category 73 00:02:23,000 --> 00:02:24,250 is t-shirt, the age of 74 00:02:24,310 --> 00:02:25,280 the customer is less than 22, 75 00:02:25,460 --> 00:02:28,510 and the last condition is actually over the fact table, 76 00:02:28,890 --> 00:02:30,560 saying the price of the purchase was less than 25. 77 00:02:30,640 --> 00:02:32,030 And now we'll return 78 00:02:32,390 --> 00:02:35,430 the city, color, customer name, and price of the item. 79 00:02:36,240 --> 00:02:37,160 We run the query and we 80 00:02:37,380 --> 00:02:38,470 see now that we have 81 00:02:38,720 --> 00:02:41,110 just 5 sales that meet our criteria. 82 00:02:42,290 --> 00:02:43,510 So, we've seen the JOIN 83 00:02:43,690 --> 00:02:45,330 with constraints on the 84 00:02:45,470 --> 00:02:46,500 JOIN, but what's even more 85 00:02:46,710 --> 00:02:48,710 common in OLAP applications, is 86 00:02:48,830 --> 00:02:51,640 to do grouping and aggregation in order to analyze the data. 
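The constrained star join just described can be sketched as follows, again on an invented SQLite stand-in for the demo's MySQL schema (the toy customer table has no name column, so the sketch returns the customer ID instead).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE store(store_id INTEGER PRIMARY KEY, city TEXT, county TEXT, state TEXT);
    CREATE TABLE item(item_id INTEGER PRIMARY KEY, category TEXT, color TEXT);
    CREATE TABLE customer(cust_id INTEGER PRIMARY KEY, gender TEXT, age INTEGER);
    CREATE TABLE sales(store_id INT, item_id INT, cust_id INT, price INT);
    INSERT INTO store VALUES (1,'Palo Alto','Santa Clara','CA'),(2,'Seattle','King','WA');
    INSERT INTO item VALUES (1,'t-shirt','red'),(2,'jacket','blue');
    INSERT INTO customer VALUES (1,'F',20),(2,'M',30);
    INSERT INTO sales VALUES (1,1,1,15),(1,2,2,50),(2,2,1,60),(1,1,1,20);
""")

# Star join with selections (on dimension attributes and on the fact
# table's price) and a projection, in the spirit of the video's query.
rows = conn.execute("""
    SELECT s.city, i.color, f.cust_id, f.price
    FROM sales f, store s, item i, customer c
    WHERE f.store_id = s.store_id AND f.item_id = i.item_id
      AND f.cust_id = c.cust_id
      AND s.state = 'CA' AND i.category = 't-shirt'
      AND c.age < 22 AND f.price < 25
""").fetchall()
print(rows)  # only the California t-shirt sales to young customers under $25
```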
87 00:02:52,150 --> 00:02:53,190 And we're going to start by 88 00:02:53,340 --> 00:02:54,280 looking at grouping and 89 00:02:54,400 --> 00:02:55,920 aggregation involving the dimension 90 00:02:56,540 --> 00:02:57,500 attributes in the group 91 00:02:57,790 --> 00:02:59,400 by, with the 92 00:02:59,510 --> 00:03:00,930 dependent attribute being aggregated. 93 00:03:01,530 --> 00:03:02,950 This is a very common form of query. 94 00:03:03,630 --> 00:03:04,410 So here we'll say we want 95 00:03:04,730 --> 00:03:06,120 to analyze our data based on 96 00:03:06,330 --> 00:03:07,850 combinations of stores and customers. 97 00:03:08,510 --> 00:03:10,290 So, we'll group by store ID and customer ID. 98 00:03:10,390 --> 00:03:11,600 And then, we'll sum up 99 00:03:11,870 --> 00:03:13,120 to see the total sales for 100 00:03:13,360 --> 00:03:14,680 each combination of store and customer. 101 00:03:15,930 --> 00:03:16,530 So, here's our result. 102 00:03:17,340 --> 00:03:18,440 Now, maybe not that meaningful 103 00:03:19,070 --> 00:03:20,230 to us right now, just 104 00:03:20,390 --> 00:03:21,520 looking at the store IDs and 105 00:03:21,700 --> 00:03:23,590 customer IDs, but for an analyst, it might be meaningful. 106 00:03:24,310 --> 00:03:25,330 In a moment we'll be doing 107 00:03:25,580 --> 00:03:27,040 group by and aggregation on other 108 00:03:27,380 --> 00:03:28,820 attributes that look a little more interesting. 109 00:03:29,810 --> 00:03:32,030 So now I'll demonstrate the concept of drilling down. 110 00:03:32,740 --> 00:03:33,900 So you might remember drilling down 111 00:03:34,210 --> 00:03:35,420 says that we're looking at data 112 00:03:35,890 --> 00:03:37,390 that's aggregated in some fashion, 113 00:03:37,860 --> 00:03:39,380 and we decide we want more detail.
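The grouping-and-aggregation query just described can be sketched like this, on the same invented SQLite stand-in schema used above (the demo itself is in MySQL).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales(store_id INT, item_id INT, cust_id INT, price INT);
    INSERT INTO sales VALUES (1,1,1,15),(1,2,2,50),(2,2,1,60),(1,1,1,20);
""")

# Group on two dimension attributes of the fact table, summing the
# dependent attribute (price) for each store/customer combination.
rows = conn.execute("""
    SELECT store_id, cust_id, SUM(price)
    FROM sales
    GROUP BY store_id, cust_id
    ORDER BY store_id, cust_id
""").fetchall()
print(rows)  # [(1, 1, 35), (1, 2, 50), (2, 1, 60)] with this toy data
```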
114 00:03:40,320 --> 00:03:41,760 We get more detail by adding 115 00:03:42,130 --> 00:03:43,420 more attributes to the 116 00:03:43,500 --> 00:03:44,510 group by, so, let's say in 117 00:03:44,600 --> 00:03:45,830 this case I want to break 118 00:03:46,010 --> 00:03:47,370 out, not only by store and 119 00:03:47,500 --> 00:03:49,030 customer, but also by item. 120 00:03:49,680 --> 00:03:50,940 So, I'll add item ID to 121 00:03:51,100 --> 00:03:52,500 the group by, and also to 122 00:03:52,540 --> 00:03:53,790 the select, and when I 123 00:03:53,880 --> 00:03:54,930 run this query, I see 124 00:03:55,320 --> 00:03:57,380 that I get more results, and 125 00:03:57,880 --> 00:03:59,140 I now have more detail 126 00:03:59,970 --> 00:04:01,190 in the analysis of the data. 127 00:04:02,040 --> 00:04:03,410 Now I don't have 60 tuples 128 00:04:03,590 --> 00:04:04,750 in my result even though I'm 129 00:04:04,940 --> 00:04:06,420 grouping by all three dimension 130 00:04:06,940 --> 00:04:08,150 attributes, because I do 131 00:04:08,360 --> 00:04:09,430 have in my fact table 132 00:04:09,850 --> 00:04:11,280 more than one tuple for some 133 00:04:11,530 --> 00:04:13,220 combinations of store, item, and customer. 134 00:04:14,580 --> 00:04:17,360 Next, I'm gonna introduce a query that shows the concept called slicing. 135 00:04:18,530 --> 00:04:20,390 Slicing of course evokes the data cube. 136 00:04:20,990 --> 00:04:21,980 We've talked about the data cube 137 00:04:22,550 --> 00:04:23,810 in the previous video, and we 138 00:04:23,950 --> 00:04:25,670 specifically used this example to 139 00:04:25,740 --> 00:04:27,930 demonstrate a three-dimensional data cube. 140 00:04:28,450 --> 00:04:29,650 So, the idea of a 141 00:04:30,090 --> 00:04:31,070 slicing query is a query 142 00:04:31,140 --> 00:04:32,820 that analyzes a slice of 143 00:04:32,950 --> 00:04:33,940 the cube and it does 144 00:04:34,180 --> 00:04:35,790 that by constraining one of the dimensions.
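The drill-down step can be sketched on the same invented toy data: adding item ID to the group by (and select) produces finer-grained groups. Note the result is still smaller than the fact table whenever some store/item/customer combination repeats, exactly as described above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales(store_id INT, item_id INT, cust_id INT, price INT);
    INSERT INTO sales VALUES (1,1,1,15),(1,2,2,50),(2,2,1,60),(1,1,1,20);
""")

# Drill down: item_id joins the grouping attributes for more detail.
rows = conn.execute("""
    SELECT store_id, item_id, cust_id, SUM(price)
    FROM sales
    GROUP BY store_id, item_id, cust_id
    ORDER BY store_id, item_id, cust_id
""").fetchall()
# Four fact tuples but only three groups: (1,1,1) occurs twice.
print(rows)  # [(1, 1, 1, 35), (1, 2, 2, 50), (2, 2, 1, 60)]
```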
145 00:04:36,790 --> 00:04:37,530 So what I am going to 146 00:04:37,640 --> 00:04:38,910 do, is add to this query 147 00:04:39,230 --> 00:04:40,530 a constraint that says 148 00:04:40,760 --> 00:04:43,330 let's only consider sales that are from the state of Washington. 149 00:04:44,540 --> 00:04:45,430 And when I add that we'll 150 00:04:45,490 --> 00:04:46,370 continue to do the group 151 00:04:46,530 --> 00:04:47,950 by and aggregation, but only on 152 00:04:48,120 --> 00:04:48,930 that slice of the cube 153 00:04:49,260 --> 00:04:52,950 representing Washington stores, the sales that are from Washington stores. 154 00:04:53,800 --> 00:04:55,470 So I made a few changes to the query. 155 00:04:55,870 --> 00:04:56,820 The first thing we see is 156 00:04:56,940 --> 00:04:58,250 that I added the store 157 00:04:58,760 --> 00:04:59,760 relation to the from clause. 158 00:05:00,450 --> 00:05:01,900 In order to constrain the 159 00:05:01,990 --> 00:05:03,440 state of the sale, I 160 00:05:03,510 --> 00:05:04,360 have to join with the store 161 00:05:04,710 --> 00:05:06,380 table, that dimension table, so that 162 00:05:06,550 --> 00:05:07,610 I can access 163 00:05:07,620 --> 00:05:09,850 the value of state which is not present in the sales table. 164 00:05:10,570 --> 00:05:11,380 So I also had to add 165 00:05:11,480 --> 00:05:12,800 variables, and I add 166 00:05:13,010 --> 00:05:14,600 the join condition for the 167 00:05:14,790 --> 00:05:16,150 sales table to join 168 00:05:16,420 --> 00:05:19,060 with the dimension table, and then I add the constraint 169 00:05:19,490 --> 00:05:20,330 that the state is Washington. 170 00:05:21,000 --> 00:05:22,360 The last small change is to 171 00:05:22,620 --> 00:05:23,760 add a variable to the 172 00:05:23,860 --> 00:05:26,830 store ID, so that I don't get an ambiguous-reference error.
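The slicing query can be sketched as follows on the invented toy schema: the store dimension table is joined in purely so the state can be constrained, and tuple variables (aliases) disambiguate store_id, just as described above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE store(store_id INTEGER PRIMARY KEY, city TEXT, county TEXT, state TEXT);
    CREATE TABLE sales(store_id INT, item_id INT, cust_id INT, price INT);
    INSERT INTO store VALUES (1,'Palo Alto','Santa Clara','CA'),(2,'Seattle','King','WA');
    INSERT INTO sales VALUES (1,1,1,15),(1,2,2,50),(2,2,1,60),(1,1,1,20);
""")

# Slice: same group-by-and-aggregation, restricted to one value of one
# dimension (state = 'WA'); the alias f.store_id avoids ambiguity.
rows = conn.execute("""
    SELECT f.store_id, f.item_id, f.cust_id, SUM(f.price)
    FROM sales f, store s
    WHERE f.store_id = s.store_id AND s.state = 'WA'
    GROUP BY f.store_id, f.item_id, f.cust_id
""").fetchall()
print(rows)  # only the Washington slice of the cube remains
```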
173 00:05:27,500 --> 00:05:28,690 So now let's go ahead and 174 00:05:28,770 --> 00:05:29,740 run the query and we see 175 00:05:29,940 --> 00:05:31,930 that we do get a subset of what we had before. 176 00:05:32,440 --> 00:05:33,590 And that subset is the slice 177 00:05:34,230 --> 00:05:35,080 of the data cube, if you 178 00:05:35,180 --> 00:05:35,960 want to think of it that way, 179 00:05:36,490 --> 00:05:38,550 representing the sales in Washington State. 180 00:05:39,740 --> 00:05:40,980 Now in addition to slicing, there 181 00:05:41,060 --> 00:05:42,510 is also a notion of dicing, 182 00:05:43,690 --> 00:05:44,890 and again if you imagine the 183 00:05:44,970 --> 00:05:45,820 data cube, what a dice 184 00:05:46,270 --> 00:05:47,440 does is slice in two 185 00:05:47,560 --> 00:05:49,310 dimensions and it gives you a chunk of the cube. 186 00:05:50,010 --> 00:05:51,070 So I am going to add a 187 00:05:51,450 --> 00:05:53,550 condition on the item being sold. 188 00:05:53,780 --> 00:05:55,270 I am going to add that the item must be red. 189 00:05:55,890 --> 00:05:56,600 So then I'm going to be looking 190 00:05:56,980 --> 00:05:57,950 at the chunk of the data 191 00:05:58,070 --> 00:06:00,120 cube that identifies red 192 00:06:00,440 --> 00:06:01,840 items purchased in the state of Washington. 193 00:06:03,270 --> 00:06:04,630 So here's the query; I've added 194 00:06:05,110 --> 00:06:06,570 a third relation to the "from" clause. 195 00:06:06,970 --> 00:06:08,420 I added the item dimension table 196 00:06:08,880 --> 00:06:10,830 so that I can constrain the color to be red. 197 00:06:11,640 --> 00:06:13,200 I have added a "join" condition, joining 198 00:06:13,730 --> 00:06:15,650 the fact table to the item dimension table. 199 00:06:16,420 --> 00:06:18,080 I have added the condition that the color is red.
200 00:06:18,780 --> 00:06:19,710 And finally, I had to add 201 00:06:19,860 --> 00:06:21,360 again the variable names to 202 00:06:21,490 --> 00:06:23,350 make sure I didn't get ambiguous attribute references. 203 00:06:24,370 --> 00:06:25,070 So we'll run this query. 204 00:06:25,470 --> 00:06:28,170 And we see that we get an even smaller portion of the data cube. 205 00:06:28,700 --> 00:06:30,910 Again a dice where we have constrained two dimensions. 206 00:06:32,400 --> 00:06:33,810 So we have seen drilling down and 207 00:06:34,210 --> 00:06:36,810 slicing and dicing, now let's take a look at rolling up. 208 00:06:37,470 --> 00:06:38,400 Rolling up says that we're 209 00:06:38,690 --> 00:06:39,940 looking at our data 210 00:06:40,210 --> 00:06:42,080 analysis, and we decide that it's too detailed. 211 00:06:42,860 --> 00:06:45,530 We want to have less detail, in other words, more aggregation. 212 00:06:46,770 --> 00:06:49,690 To do that, we take attributes out of the group by clause. 213 00:06:50,120 --> 00:06:50,950 So let's say this is way, 214 00:06:51,320 --> 00:06:52,500 way too much detail and we 215 00:06:52,740 --> 00:06:54,810 just want our sales broken down on item. 216 00:06:55,590 --> 00:06:56,700 So we take all of 217 00:06:56,800 --> 00:06:57,760 the attributes out of our 218 00:06:57,920 --> 00:06:59,590 group by clause except for item. 219 00:07:00,570 --> 00:07:00,890 Here we go. 220 00:07:01,570 --> 00:07:02,380 And then, when we run the 221 00:07:02,440 --> 00:07:04,850 query, we'll see much more summarized data. 222 00:07:05,790 --> 00:07:08,240 And here it is, broken down again, just by separate items. 223 00:07:09,490 --> 00:07:10,600 So far, our grouping and 224 00:07:10,760 --> 00:07:12,040 aggregation has been on 225 00:07:12,310 --> 00:07:14,620 the dimension attributes, specifically the 226 00:07:14,750 --> 00:07:17,990 ID attributes identifying the tuples in the dimension table. 
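The roll-up step just described (dropping attributes from the group by to get coarser aggregation) can be sketched on the same invented toy data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales(store_id INT, item_id INT, cust_id INT, price INT);
    INSERT INTO sales VALUES (1,1,1,15),(1,2,2,50),(2,2,1,60),(1,1,1,20);
""")

# Roll up: keep only item_id in the group by, so sales are summarized
# per item across all stores and customers.
rows = conn.execute("""
    SELECT item_id, SUM(price)
    FROM sales
    GROUP BY item_id
    ORDER BY item_id
""").fetchall()
print(rows)  # [(1, 35), (2, 110)] with this toy data
```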
227 00:07:18,810 --> 00:07:19,960 And that may be meaningful for 228 00:07:20,100 --> 00:07:21,450 analyses, but for this 229 00:07:21,940 --> 00:07:23,400 demonstration, it's more fun 230 00:07:23,570 --> 00:07:24,820 to look at attributes that actually 231 00:07:25,530 --> 00:07:26,350 mean something to us. 232 00:07:26,890 --> 00:07:28,020 So, let's try a few 233 00:07:28,190 --> 00:07:29,830 queries that group instead based 234 00:07:30,150 --> 00:07:31,800 on attributes in the dimension tables. 235 00:07:32,210 --> 00:07:33,950 We'll still be using the fact table. 236 00:07:34,280 --> 00:07:35,560 We'll be joining it with dimension tables. 237 00:07:36,260 --> 00:07:37,320 And here we're going to start by 238 00:07:37,470 --> 00:07:39,990 grouping on the state and the category of the sales. 239 00:07:40,730 --> 00:07:42,140 Again, summing up the total 240 00:07:42,610 --> 00:07:44,370 sales for the different groups. 241 00:07:45,280 --> 00:07:46,290 So, here we see that 242 00:07:46,420 --> 00:07:47,920 we have four combinations of 243 00:07:48,020 --> 00:07:49,480 state and category: California or 244 00:07:49,830 --> 00:07:50,940 Washington, jackets and t-shirts, 245 00:07:51,380 --> 00:07:52,200 and then we have the total 246 00:07:52,560 --> 00:07:53,400 sales in each of those, 247 00:07:54,070 --> 00:07:55,180 so we can see in both states 248 00:07:55,680 --> 00:07:57,290 the sales for jackets account 249 00:07:57,650 --> 00:07:58,750 for considerably more than the 250 00:07:58,870 --> 00:08:00,310 sales for t-shirts, but we 251 00:08:00,500 --> 00:08:01,820 do also notice that in 252 00:08:01,970 --> 00:08:03,270 California there are a lot 253 00:08:03,440 --> 00:08:04,930 more sales of t-shirts than 254 00:08:05,080 --> 00:08:06,290 there are in Washington, and I 255 00:08:06,410 --> 00:08:07,810 guess that's not surprising given the weather. 256 00:08:09,240 --> 00:08:11,780 Now let's demonstrate drill down on this query.
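Grouping on dimension attributes rather than IDs can be sketched as follows, again on the invented SQLite stand-in: the fact table is joined with the store and item dimension tables, and the grouping keys come from those tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE store(store_id INTEGER PRIMARY KEY, city TEXT, county TEXT, state TEXT);
    CREATE TABLE item(item_id INTEGER PRIMARY KEY, category TEXT, color TEXT);
    CREATE TABLE sales(store_id INT, item_id INT, cust_id INT, price INT);
    INSERT INTO store VALUES (1,'Palo Alto','Santa Clara','CA'),(2,'Seattle','King','WA');
    INSERT INTO item VALUES (1,'t-shirt','red'),(2,'jacket','blue');
    INSERT INTO sales VALUES (1,1,1,15),(1,2,2,50),(2,2,1,60),(1,1,1,20);
""")

# Group on state and category (dimension attributes), summing the price.
rows = conn.execute("""
    SELECT s.state, i.category, SUM(f.price)
    FROM sales f, store s, item i
    WHERE f.store_id = s.store_id AND f.item_id = i.item_id
    GROUP BY s.state, i.category
    ORDER BY s.state, i.category
""").fetchall()
print(rows)  # totals per state/category combination
```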
257 00:08:12,260 --> 00:08:13,570 So let's say we want a little more information. 258 00:08:14,180 --> 00:08:17,100 Maybe we want a breakdown by county in addition to state. 259 00:08:17,920 --> 00:08:18,910 So to do that, we add 260 00:08:19,160 --> 00:08:20,160 county to the group by 261 00:08:20,320 --> 00:08:21,410 clause and the select 262 00:08:21,770 --> 00:08:23,060 clause, and when we 263 00:08:23,120 --> 00:08:24,980 run the query we see we do now have more detail. 264 00:08:25,900 --> 00:08:27,310 We can see, for example, that 265 00:08:27,710 --> 00:08:29,760 we had more jacket sales in 266 00:08:29,850 --> 00:08:30,940 Santa Clara County than in 267 00:08:31,030 --> 00:08:32,340 San Mateo County although the 268 00:08:32,420 --> 00:08:34,090 t-shirts were approximately equal. 269 00:08:34,470 --> 00:08:36,240 A little, actually, more t-shirts in San Mateo. 270 00:08:36,890 --> 00:08:38,160 This is a little surprising, because 271 00:08:38,340 --> 00:08:39,690 Santa Clara is generally warmer than 272 00:08:39,940 --> 00:08:41,920 San Mateo, but it's fabricated data. 273 00:08:43,040 --> 00:08:43,800 Now let's say we want to 274 00:08:43,900 --> 00:08:45,270 drill down even further, and 275 00:08:45,380 --> 00:08:46,730 we want to break our 276 00:08:46,850 --> 00:08:49,300 data down by gender as well as the other attributes. 277 00:08:50,860 --> 00:08:51,880 In order to drill down based 278 00:08:52,110 --> 00:08:53,470 on gender, I first needed 279 00:08:53,640 --> 00:08:55,510 to add the customer table to our from clause. 280 00:08:55,820 --> 00:08:56,760 Prior to this, we weren't 281 00:08:57,090 --> 00:08:58,360 doing any analysis that involved 282 00:08:58,760 --> 00:09:00,230 any attributes of the customer table. 283 00:09:01,650 --> 00:09:03,350 And so I need to add the join condition here.
284 00:09:03,830 --> 00:09:04,670 And then to do the drill 285 00:09:04,910 --> 00:09:06,430 down, I add the gender attribute 286 00:09:06,450 --> 00:09:08,090 to the group by and to the select. 287 00:09:08,990 --> 00:09:11,290 We run the query, and what do we find? 288 00:09:12,570 --> 00:09:13,860 Well, we can see, for 289 00:09:13,860 --> 00:09:15,010 example that not too many 290 00:09:15,350 --> 00:09:16,720 females in San Mateo County 291 00:09:17,120 --> 00:09:18,260 are buying t-shirts. Most of those 292 00:09:18,490 --> 00:09:20,150 t-shirt sales are accounted for by males. 293 00:09:20,560 --> 00:09:21,820 The other thing we notice 294 00:09:22,230 --> 00:09:23,390 is that we don't have every 295 00:09:23,820 --> 00:09:26,080 single combination of county, category and gender. 296 00:09:26,860 --> 00:09:28,010 Very specifically we look and 297 00:09:28,130 --> 00:09:29,770 we can see that no males 298 00:09:30,990 --> 00:09:32,360 in King County bought any t-shirts. 299 00:09:33,010 --> 00:09:33,920 So we only have in our 300 00:09:34,070 --> 00:09:35,870 result, those combinations of dimension 301 00:09:36,390 --> 00:09:38,300 values that actually appear in the fact table. 302 00:09:38,590 --> 00:09:41,320 Or in the join of the fact table with the dimension tables. 303 00:09:42,300 --> 00:09:43,690 Now let's suppose after I've 304 00:09:43,760 --> 00:09:44,660 gotten to this level of 305 00:09:44,770 --> 00:09:46,560 analysis, I've decided that what 306 00:09:46,620 --> 00:09:48,060 might be most interesting is the 307 00:09:48,200 --> 00:09:50,700 data broken down just by combination of state and gender. 308 00:09:51,540 --> 00:09:52,660 So that would be a roll-up operation. 309 00:09:53,640 --> 00:09:55,020 And remember for roll-up I 310 00:09:55,380 --> 00:09:57,450 take attributes out of the group by clause.
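The gender drill-down can be sketched as follows on the invented toy schema: the customer dimension table joins the from clause for the first time, and only combinations that actually occur in the (joined) fact table appear in the result, as noted above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE store(store_id INTEGER PRIMARY KEY, city TEXT, county TEXT, state TEXT);
    CREATE TABLE item(item_id INTEGER PRIMARY KEY, category TEXT, color TEXT);
    CREATE TABLE customer(cust_id INTEGER PRIMARY KEY, gender TEXT, age INTEGER);
    CREATE TABLE sales(store_id INT, item_id INT, cust_id INT, price INT);
    INSERT INTO store VALUES (1,'Palo Alto','Santa Clara','CA'),(2,'Seattle','King','WA');
    INSERT INTO item VALUES (1,'t-shirt','red'),(2,'jacket','blue');
    INSERT INTO customer VALUES (1,'F',20),(2,'M',30);
    INSERT INTO sales VALUES (1,1,1,15),(1,2,2,50),(2,2,1,60),(1,1,1,20);
""")

# Drill down by gender: join the customer dimension table and add
# gender to both the group by and the select.
rows = conn.execute("""
    SELECT s.state, i.category, c.gender, SUM(f.price)
    FROM sales f, store s, item i, customer c
    WHERE f.store_id = s.store_id AND f.item_id = i.item_id
      AND f.cust_id = c.cust_id
    GROUP BY s.state, i.category, c.gender
    ORDER BY s.state, i.category, c.gender
""").fetchall()
# Only combinations present in the joined fact data show up.
print(rows)
```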
311 00:09:58,140 --> 00:09:58,880 So I take out the county 312 00:09:59,340 --> 00:10:01,740 and category and I'm just interested in state/gender combinations. 313 00:10:02,970 --> 00:10:03,930 By the way at this point if 314 00:10:03,980 --> 00:10:04,950 I wanted to, I could remove 315 00:10:05,560 --> 00:10:06,940 the item table because I'm 316 00:10:07,040 --> 00:10:08,630 no longer constraining based on 317 00:10:09,220 --> 00:10:10,940 items or grouping based on items. 318 00:10:11,540 --> 00:10:14,080 But I'll just leave it in for now, it's not going to make a difference in our result. 319 00:10:14,570 --> 00:10:16,640 Of course for efficiency I might just prefer to take it out. 320 00:10:16,900 --> 00:10:18,110 Anyway let's run the query 321 00:10:18,700 --> 00:10:20,750 and now we see the rolled-up data. 322 00:10:21,140 --> 00:10:22,230 And when we look at 323 00:10:22,360 --> 00:10:23,670 it I guess our main conclusion here 324 00:10:23,990 --> 00:10:25,150 would be that in our database, 325 00:10:25,840 --> 00:10:27,710 the Californian males are the big spenders. 326 00:10:28,950 --> 00:10:30,340 So far everything we've seen 327 00:10:30,680 --> 00:10:31,890 has actually just used the 328 00:10:31,990 --> 00:10:34,240 regular SQL constructs that we already knew about. 329 00:10:34,750 --> 00:10:36,330 Although we were demonstrating them over 330 00:10:36,820 --> 00:10:38,360 a star schema and we're 331 00:10:38,640 --> 00:10:39,870 showing the types of queries that 332 00:10:39,990 --> 00:10:41,040 one tends to run over star 333 00:10:41,350 --> 00:10:42,430 schemas in OLAP applications. 334 00:10:43,910 --> 00:10:44,780 Now we're going to move to 335 00:10:44,960 --> 00:10:46,280 the specific constructs that have 336 00:10:46,420 --> 00:10:48,030 been added to SQL for OLAP.
337 00:10:48,930 --> 00:10:50,150 As you may remember the two 338 00:10:50,300 --> 00:10:51,820 constructs are called With Cube 339 00:10:52,480 --> 00:10:55,210 and With Rollup and they are written in the group by clause. 340 00:10:56,350 --> 00:10:58,050 So our first query is the 341 00:10:58,550 --> 00:10:59,770 absolute basic cube query. 342 00:11:00,570 --> 00:11:01,810 We start with our sales 343 00:11:02,270 --> 00:11:03,520 fact table, we group on 344 00:11:03,770 --> 00:11:04,680 all three of its dimensions, 345 00:11:05,570 --> 00:11:06,620 and then we add With Cube. 346 00:11:07,350 --> 00:11:08,180 And what we're going to get 347 00:11:08,810 --> 00:11:10,120 is, basically, the data 348 00:11:10,650 --> 00:11:11,860 in the data cube that we 349 00:11:12,030 --> 00:11:14,070 saw in our previous video that introduced the cube. 350 00:11:14,650 --> 00:11:15,890 Let's go ahead and run the query. 351 00:11:16,980 --> 00:11:18,300 Disappointingly we get an 352 00:11:18,400 --> 00:11:19,740 error message that this version 353 00:11:20,230 --> 00:11:21,920 of MySQL doesn't yet support cube. 354 00:11:22,820 --> 00:11:23,950 Actually no version of MySQL 355 00:11:24,520 --> 00:11:25,810 supports cube at this 356 00:11:26,010 --> 00:11:28,130 point in time but we're hopeful that a future one will.
357 00:11:28,810 --> 00:11:29,740 In fact of the three 358 00:11:30,210 --> 00:11:31,420 open source systems that we 359 00:11:31,640 --> 00:11:32,960 have been using for our demos only 360 00:11:33,250 --> 00:11:34,690 MySQL supports With 361 00:11:35,080 --> 00:11:36,370 Rollup, so we will continue 362 00:11:36,720 --> 00:11:38,090 with MySQL and in fact 363 00:11:38,180 --> 00:11:39,740 you can get the same result 364 00:11:40,140 --> 00:11:41,620 that you would get if With 365 00:11:41,950 --> 00:11:43,420 Cube were supported, using With 366 00:11:43,680 --> 00:11:44,350 Rollup, although it's a little bit 367 00:11:44,510 --> 00:11:45,520 contorted, but I'll do that 368 00:11:45,790 --> 00:11:48,420 now, to demonstrate what you would get if you wrote With Cube. 369 00:11:49,590 --> 00:11:50,910 So here's the query, I'm not 370 00:11:51,330 --> 00:11:52,390 going to explain it, if you're 371 00:11:52,650 --> 00:11:54,170 particularly interested in it, you 372 00:11:54,270 --> 00:11:56,330 can download our script and examine it yourself. 373 00:11:56,980 --> 00:11:58,200 Again what I'm most interested in 374 00:11:58,380 --> 00:12:00,390 is looking at the results, and here they are. 375 00:12:01,020 --> 00:12:02,590 So this result is the 376 00:12:02,680 --> 00:12:03,820 result of a With Cube 377 00:12:04,900 --> 00:12:05,780 on the grouping and aggregation 378 00:12:06,380 --> 00:12:09,000 on all dimensions of our sales table, all three dimensions. 379 00:12:10,490 --> 00:12:12,470 So some of our tuples look very normal.
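The With Cube result described here can be reproduced without either With Cube or With Rollup. The video simulates the cube with MySQL's With Rollup; since SQLite (used in this sketch via Python's sqlite3) supports neither, the sketch instead takes the UNION ALL of all 2³ groupings of the three dimensions, which by definition yields the same set of rows a With Cube would. Schema and data are invented toy stand-ins.

```python
import sqlite3
from itertools import combinations

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales(store_id INT, item_id INT, cust_id INT, price INT);
    INSERT INTO sales VALUES (1,1,1,15),(1,2,2,50),(2,2,1,60),(1,1,1,20);
""")

# Build one grouped SELECT per subset of the dimensions; dimensions not
# in the subset come back as NULL, matching the cube's faces/edges/corner.
dims = ("store_id", "item_id", "cust_id")
parts = []
for r in range(len(dims), -1, -1):
    for keep in combinations(dims, r):
        cols = ", ".join(d if d in keep else "NULL AS " + d for d in dims)
        grp = (" GROUP BY " + ", ".join(keep)) if keep else ""
        parts.append(f"SELECT {cols}, SUM(price) AS p FROM sales{grp}")
cube = conn.execute(" UNION ALL ".join(parts)).fetchall()

# The row with all three dimensions NULL is the corner: the grand total.
corner = [p for (st, it, cu, p) in cube if st is None and it is None and cu is None]
print(len(cube), corner)
```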
380 00:12:13,050 --> 00:12:14,500 This first tuple, second, and 381 00:12:14,590 --> 00:12:15,630 third tuple are all giving 382 00:12:15,970 --> 00:12:18,000 us the total price for a combination 383 00:12:18,200 --> 00:12:19,160 of store, item, and customer, 384 00:12:19,700 --> 00:12:20,920 and these are exactly what we 385 00:12:21,060 --> 00:12:22,430 got in our original query that 386 00:12:22,570 --> 00:12:24,030 didn't have the With Cube operator. 387 00:12:25,100 --> 00:12:26,280 But now we see some 388 00:12:26,760 --> 00:12:28,010 tuples that have blanks and as 389 00:12:28,130 --> 00:12:30,840 a reminder, blanks in this interface mean null. 390 00:12:31,530 --> 00:12:34,240 So this tuple is store, item, and null. 391 00:12:35,070 --> 00:12:36,890 And what this corresponds to is 392 00:12:37,290 --> 00:12:39,110 an element on the face of the cube. 393 00:12:39,530 --> 00:12:40,350 So this is the face of the 394 00:12:40,410 --> 00:12:42,540 cube that's not constraining the customer dimension. 395 00:12:43,730 --> 00:12:45,040 And what the value gives us 396 00:12:45,300 --> 00:12:46,130 then is the value we 397 00:12:46,400 --> 00:12:47,090 have on the face of the 398 00:12:47,240 --> 00:12:48,440 cube, which is the 399 00:12:48,630 --> 00:12:50,100 total sales for the 400 00:12:50,180 --> 00:12:51,410 combination of Store 1 401 00:12:51,650 --> 00:12:54,000 and Item 1 and any customer, and that's seventy. 402 00:12:54,910 --> 00:12:55,890 And we can cross check that 403 00:12:56,100 --> 00:12:56,990 by seeing that the first 404 00:12:57,250 --> 00:12:58,480 three tuples add up to 405 00:12:58,790 --> 00:13:00,060 70, because those happen to be 406 00:13:00,290 --> 00:13:01,480 the three customers that have made 407 00:13:01,800 --> 00:13:03,420 purchases at Store 1 and Item 1. 408 00:13:03,770 --> 00:13:05,340 And we can similarly see 409 00:13:05,600 --> 00:13:06,450 for Store 1 and Item 410 00:13:06,850 --> 00:13:08,400 2 the total sales are 150.
411 00:13:08,690 --> 00:13:10,380 If we scroll down 412 00:13:10,760 --> 00:13:11,910 a bit we'll see cases where 413 00:13:12,050 --> 00:13:13,830 we have other single-null values; 414 00:13:13,960 --> 00:13:15,890 I'll keep going until I find one here. 415 00:13:17,030 --> 00:13:18,390 For example, here we have 416 00:13:18,760 --> 00:13:19,810 the null value for the 417 00:13:20,060 --> 00:13:21,560 store and that would 418 00:13:21,690 --> 00:13:22,480 be the face of the cube 419 00:13:22,840 --> 00:13:24,200 that's not constraining the store dimension. 420 00:13:24,640 --> 00:13:25,340 So that would be the sum 421 00:13:25,850 --> 00:13:26,690 for Item 1 and Customer 422 00:13:27,200 --> 00:13:30,070 1 at any store, Item 1 and Customer 2 at any store. 423 00:13:31,280 --> 00:13:34,200 Then we also have tuples that have two null values, here's an example. 424 00:13:35,210 --> 00:13:38,160 So this one is going to be corresponding to the edge of the cube. 425 00:13:38,610 --> 00:13:39,570 So this is the edge of 426 00:13:39,590 --> 00:13:40,500 the cube that is along 427 00:13:41,010 --> 00:13:42,280 the store dimension and is 428 00:13:42,600 --> 00:13:43,900 picking out Store 5 along 429 00:13:44,310 --> 00:13:45,680 that dimension and then giving 430 00:13:46,010 --> 00:13:47,250 the total sales for all 431 00:13:47,590 --> 00:13:48,760 items and all customers in 432 00:13:49,040 --> 00:13:50,420 Store 5, and we see that. 433 00:13:50,810 --> 00:13:51,920 And again we'll see other 434 00:13:52,220 --> 00:13:53,560 instances where we're missing 435 00:13:53,980 --> 00:13:57,150 two values, scroll down to get here. 436 00:13:57,450 --> 00:13:58,550 We've got one where we're 437 00:13:58,700 --> 00:13:59,510 missing the store and the 438 00:13:59,660 --> 00:14:00,440 item, so this would be 439 00:14:00,960 --> 00:14:02,720 on the edge along the customer dimension.
440 00:14:03,610 --> 00:14:04,300 So this gives us the total 441 00:14:04,670 --> 00:14:07,840 sales for any store and item made by Customer 3. 442 00:14:07,980 --> 00:14:10,150 And then finally, if we 443 00:14:10,360 --> 00:14:11,460 find it, we'll find the 444 00:14:11,580 --> 00:14:12,980 one that's all three null values, 445 00:14:13,460 --> 00:14:14,060 and that's right here. 446 00:14:14,730 --> 00:14:16,470 So that represents the corner of the cube. 447 00:14:17,200 --> 00:14:18,130 That gives us the total 448 00:14:18,630 --> 00:14:20,590 for all stores, all items and all customers. 449 00:14:21,750 --> 00:14:23,260 So, what we've done 450 00:14:23,330 --> 00:14:25,960 here by adding With Cube--and again, 451 00:14:26,240 --> 00:14:27,270 that would be the result if 452 00:14:27,530 --> 00:14:28,890 this query here were supported--what 453 00:14:29,700 --> 00:14:31,890 we're getting is the contents 454 00:14:31,980 --> 00:14:33,390 of the data cube, which are the 455 00:14:33,780 --> 00:14:34,600 inner parts of the cube, 456 00:14:34,910 --> 00:14:35,860 which is what we would get without 457 00:14:36,360 --> 00:14:37,320 adding anything to our group 458 00:14:37,570 --> 00:14:38,400 by and what we saw in one 459 00:14:38,510 --> 00:14:39,570 of our very first queries of 460 00:14:39,630 --> 00:14:41,160 the demo, in addition to the 461 00:14:41,250 --> 00:14:43,380 data on the faces, the 462 00:14:43,570 --> 00:14:44,920 edges, and the corner of the cube. 463 00:14:46,080 --> 00:14:47,330 Now let's do a quick cross check. 464 00:14:47,620 --> 00:14:48,660 So this data tells us 465 00:14:48,830 --> 00:14:49,870 that the corner of the cube, 466 00:14:50,210 --> 00:14:51,040 here, the one with all null 467 00:14:51,260 --> 00:14:54,050 values, is 33.50 in total price.
468 00:14:54,730 --> 00:14:56,210 So, that should be exactly what we 469 00:14:56,330 --> 00:14:57,330 get if we sum up 470 00:14:57,530 --> 00:14:59,220 our prices without doing any 471 00:14:59,420 --> 00:15:01,310 grouping or aggregation, so let's give that a try. 472 00:15:02,450 --> 00:15:03,990 So we merely take our sales 473 00:15:04,390 --> 00:15:05,420 table, we sum up the 474 00:15:05,540 --> 00:15:06,740 total prices and we run 475 00:15:06,940 --> 00:15:08,950 it and indeed we get thirty-three fifty. 476 00:15:10,430 --> 00:15:11,390 So now we've seen how SQL 477 00:15:11,740 --> 00:15:13,990 can be used to effectively create the data cube. 478 00:15:15,090 --> 00:15:18,060 Sometimes in OLAP applications it's convenient to query the cube directly. 479 00:15:19,110 --> 00:15:20,910 So people frequently create data 480 00:15:21,110 --> 00:15:22,490 cubes, say, in SQL using 481 00:15:22,830 --> 00:15:24,400 a materialized view, or just 482 00:15:24,550 --> 00:15:25,740 by creating a table that contains 483 00:15:26,000 --> 00:15:28,280 the result of the data cube which is what we are going to do. 484 00:15:28,880 --> 00:15:29,970 There are even some systems that 485 00:15:30,220 --> 00:15:32,630 use the data cube as their basic native data model. 486 00:15:33,400 --> 00:15:34,740 So let's go ahead and take 487 00:15:34,940 --> 00:15:36,020 the result of our cube 488 00:15:36,340 --> 00:15:37,540 query and put it in a table. 489 00:15:38,650 --> 00:15:39,860 So I've just added "create 490 00:15:40,220 --> 00:15:41,700 table cube as" before our 491 00:15:41,930 --> 00:15:42,940 query, and the 492 00:15:43,030 --> 00:15:44,070 other thing I did was add a 493 00:15:44,120 --> 00:15:45,860 name for the result of 494 00:15:46,010 --> 00:15:47,910 the aggregation, so 495 00:15:48,040 --> 00:15:49,840 that we can refer to it in the data cube.
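Materializing the cube and running the corner cross check can be sketched together, on the invented toy data. As in the video, the corner row (all dimensions null) must equal a plain SUM over the fact table with no grouping at all. The cube is generated here by a UNION ALL over all groupings, since SQLite lacks With Cube and With Rollup.

```python
import sqlite3
from itertools import combinations

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales(store_id INT, item_id INT, cust_id INT, price INT);
    INSERT INTO sales VALUES (1,1,1,15),(1,2,2,50),(2,2,1,60),(1,1,1,20);
""")

# Generate the cube query (all 2^3 groupings) and materialize it as a
# table, naming the aggregate p so we can refer to it later.
dims = ("store_id", "item_id", "cust_id")
parts = []
for r in range(len(dims), -1, -1):
    for keep in combinations(dims, r):
        cols = ", ".join(d if d in keep else "NULL AS " + d for d in dims)
        grp = (" GROUP BY " + ", ".join(keep)) if keep else ""
        parts.append(f"SELECT {cols}, SUM(price) AS p FROM sales{grp}")
conn.execute("CREATE TABLE cube AS " + " UNION ALL ".join(parts))

# Cross check: the corner of the cube equals the ungrouped grand total.
corner = conn.execute("""
    SELECT p FROM cube
    WHERE store_id IS NULL AND item_id IS NULL AND cust_id IS NULL
""").fetchone()[0]
total = conn.execute("SELECT SUM(price) FROM sales").fetchone()[0]
print(corner, total)  # equal, as the video's cross check predicts
```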
496 00:15:50,500 --> 00:15:52,170 So P will then 497 00:15:52,450 --> 00:15:53,980 contain the sum of the price for 498 00:15:54,490 --> 00:15:55,930 the items inside the data 499 00:15:56,040 --> 00:15:58,360 cube, as well as for the faces, edges, and corner. 500 00:15:59,090 --> 00:15:59,930 So, let's go ahead and create 501 00:16:00,340 --> 00:16:03,050 that table, and now let's look at querying the cube directly. 502 00:16:04,430 --> 00:16:05,360 So, the first thing you'll notice in 503 00:16:05,560 --> 00:16:06,590 the next few queries is that 504 00:16:06,760 --> 00:16:08,090 we've replaced the sales table 505 00:16:08,450 --> 00:16:09,780 with the cube table, so we're 506 00:16:09,990 --> 00:16:11,400 going to run our queries directly on the 507 00:16:11,480 --> 00:16:12,450 cube, and we can 508 00:16:12,710 --> 00:16:13,910 join the cube with the 509 00:16:14,140 --> 00:16:15,360 store and item tables, and 510 00:16:15,500 --> 00:16:16,520 the customer table if we needed it, 511 00:16:16,940 --> 00:16:18,300 just as we joined the sales 512 00:16:18,710 --> 00:16:21,780 table, because it contains the same dimension attributes. 513 00:16:22,600 --> 00:16:24,040 What the cube gives us is 514 00:16:24,490 --> 00:16:26,190 pre-aggregated data, both for 515 00:16:26,360 --> 00:16:28,400 the store, customer, and item combinations, 516 00:16:29,710 --> 00:16:30,870 as well as the data 517 00:16:31,300 --> 00:16:32,960 that has null values, which is 518 00:16:33,180 --> 00:16:34,670 already aggregated for the 519 00:16:34,830 --> 00:16:37,330 faces, edges, and corner of the cube, as we'll see in a moment. 520 00:16:38,030 --> 00:16:39,100 So what our query is going 521 00:16:39,430 --> 00:16:40,670 to do is find the total 522 00:16:41,030 --> 00:16:43,460 sales of blue items in the state of California. 523 00:16:44,420 --> 00:16:45,560 And it will start by showing the 524 00:16:45,670 --> 00:16:47,870 tuples that are used to get that total.
525 00:16:48,670 --> 00:16:49,990 So we'll join our cube table 526 00:16:50,300 --> 00:16:51,510 with the store and the item 527 00:16:51,710 --> 00:16:52,740 dimension tables in order to 528 00:16:52,830 --> 00:16:54,060 get at the state and the 529 00:16:54,230 --> 00:16:56,830 color, and then we'll constrain the state and the color. 530 00:16:57,530 --> 00:16:58,780 Notice the last thing we 531 00:16:58,900 --> 00:17:00,170 say is that the customer ID 532 00:17:00,410 --> 00:17:02,090 is null, and that's going 533 00:17:02,130 --> 00:17:03,080 to give us the data on the 534 00:17:03,930 --> 00:17:05,070 face of the cube that 535 00:17:05,330 --> 00:17:06,820 doesn't go along the customer dimension. 536 00:17:07,380 --> 00:17:08,350 That means it's going to 537 00:17:08,430 --> 00:17:10,100 be pre-aggregated for all customers, 538 00:17:10,700 --> 00:17:11,650 and that's what we want, since 539 00:17:11,780 --> 00:17:12,790 we don't care about the customers, 540 00:17:13,390 --> 00:17:14,590 just about the color and the state. 541 00:17:15,510 --> 00:17:16,540 So, let's first run this query, 542 00:17:16,950 --> 00:17:18,180 and we see we get six tuples. 543 00:17:18,580 --> 00:17:19,850 These tuples are all 544 00:17:20,290 --> 00:17:21,450 from the portion of the result 545 00:17:21,860 --> 00:17:23,200 of the cube query we ran 546 00:17:23,560 --> 00:17:24,570 that has a null value for 547 00:17:24,830 --> 00:17:26,300 customer ID, and that 548 00:17:26,410 --> 00:17:27,860 is, all combinations of stores and 549 00:17:27,990 --> 00:17:29,190 items. And if we checked 550 00:17:29,450 --> 00:17:30,840 our dimension tables, we'd see 551 00:17:31,010 --> 00:17:32,080 that these stores are in California 552 00:17:32,700 --> 00:17:34,420 and these items are blue, and 553 00:17:34,560 --> 00:17:35,780 these are our total sales broken 554 00:17:36,090 --> 00:17:38,100 down by store and item.
555 00:17:38,480 --> 00:17:39,770 Finally, to get the total sales, 556 00:17:40,320 --> 00:17:41,460 we'll just sum up our column 557 00:17:41,950 --> 00:17:43,240 "p", which remember was 558 00:17:43,610 --> 00:17:44,850 the sum of price in the data cube. 559 00:17:45,870 --> 00:17:47,780 So we replace C.* with 560 00:17:48,300 --> 00:17:49,470 SUM(p), and we run 561 00:17:49,850 --> 00:17:52,070 the query, and we see that our total is 13.65. 562 00:17:54,110 --> 00:17:55,080 Now let me show you something that 563 00:17:55,190 --> 00:17:56,530 may seem non-intuitive at first, 564 00:17:56,940 --> 00:17:58,230 but if you think about it, 565 00:17:58,420 --> 00:17:59,350 or maybe run the queries yourself, 566 00:17:59,770 --> 00:18:02,750 you'll start to understand exactly what's happening. 567 00:18:03,810 --> 00:18:05,020 I'm going to go back to C.* in the select clause. 568 00:18:05,270 --> 00:18:06,310 And incidentally, I didn't mention 569 00:18:06,690 --> 00:18:08,210 it before, but C.* gives us 570 00:18:08,370 --> 00:18:09,500 in the select clause all the 571 00:18:09,750 --> 00:18:10,910 attributes from the cube 572 00:18:11,310 --> 00:18:12,220 table, and that 573 00:18:12,300 --> 00:18:14,260 shows which tuples are being used for the aggregation. 574 00:18:15,680 --> 00:18:18,130 So, we had six tuples being used before. 575 00:18:18,790 --> 00:18:20,980 Now, I'm going to make one tiny change to the query. 576 00:18:21,520 --> 00:18:22,600 Instead of finding the customer 577 00:18:23,190 --> 00:18:24,920 IDs that are null, I'm going 578 00:18:25,210 --> 00:18:26,660 to find the customer IDs that are not null. 579 00:18:27,440 --> 00:18:28,760 I'm going to run that query, and 580 00:18:28,910 --> 00:18:31,540 we see that we have fourteen tuples now.
581 00:18:32,310 --> 00:18:33,470 What we've actually done is move 582 00:18:33,970 --> 00:18:35,140 away from the face of 583 00:18:35,240 --> 00:18:36,290 the cube and into the cube, 584 00:18:36,650 --> 00:18:38,110 and I've said: don't give me 585 00:18:38,430 --> 00:18:39,770 any of the summarized data, just 586 00:18:40,250 --> 00:18:42,290 give me the original values for 587 00:18:42,420 --> 00:18:44,060 each store, item, and customer together. 588 00:18:45,120 --> 00:18:45,990 So, I've taken away the summaries, 589 00:18:46,740 --> 00:18:49,390 but this should actually still give me the same answer. 590 00:18:49,710 --> 00:18:50,830 Let's make sure that's the case. 591 00:18:51,560 --> 00:18:52,910 I do SUM(p), and 592 00:18:53,010 --> 00:18:55,110 I run it, and I get 13.65. 593 00:18:55,630 --> 00:18:57,200 So, what happened here is I 594 00:18:57,420 --> 00:18:58,950 used what's effectively the same 595 00:18:59,310 --> 00:19:01,310 data, but broken down by customer, 596 00:19:02,320 --> 00:19:03,260 and I added it up. That 597 00:19:03,400 --> 00:19:05,280 was a less efficient query, but 598 00:19:05,530 --> 00:19:06,560 both of those queries do 599 00:19:06,720 --> 00:19:07,890 give the correct result; they do 600 00:19:08,200 --> 00:19:10,440 give you the total sales of blue items 601 00:19:11,150 --> 00:19:12,160 in California stores. 602 00:19:12,850 --> 00:19:13,860 Again, I encourage you to 603 00:19:14,000 --> 00:19:15,140 download these queries and run them 604 00:19:15,400 --> 00:19:16,700 yourself to figure out exactly what's 605 00:19:16,910 --> 00:19:18,300 going on. 606 00:19:18,410 --> 00:19:19,650 And as a last demonstration of 607 00:19:19,740 --> 00:19:21,230 this very same query, we 608 00:19:21,440 --> 00:19:23,300 can actually go back to the sales table. 609 00:19:23,670 --> 00:19:25,170 Let's say we don't even have our data cube.
610 00:19:25,780 --> 00:19:26,900 So, we're going to use our sales table 611 00:19:27,270 --> 00:19:28,790 instead, and it should 612 00:19:29,020 --> 00:19:30,280 give us the same result, but 613 00:19:30,590 --> 00:19:32,770 even less efficiently than the last query that we did. 614 00:19:33,830 --> 00:19:34,760 So, you can see here I've 615 00:19:34,990 --> 00:19:36,190 changed cube to 616 00:19:36,360 --> 00:19:38,250 sales, and all the C's to F's. 617 00:19:38,970 --> 00:19:40,470 I took away the constraints 618 00:19:41,100 --> 00:19:41,730 on being null or not null, 619 00:19:41,780 --> 00:19:44,190 since those aren't relevant in the sales table. 620 00:19:44,710 --> 00:19:46,520 Otherwise, the query looks pretty much the same. 621 00:19:46,870 --> 00:19:47,900 So, let's run this much 622 00:19:48,250 --> 00:19:50,090 and see how many tuples we're operating on now. 623 00:19:50,830 --> 00:19:52,490 So now, we're operating on 25 tuples. 624 00:19:53,130 --> 00:19:55,480 This is the original, completely broken-down data. 625 00:19:56,180 --> 00:19:57,380 Just as a reminder, in the 626 00:19:57,550 --> 00:19:58,570 data cube, even in the 627 00:19:58,840 --> 00:20:00,330 interior cells, we'll be summarizing, 628 00:20:01,750 --> 00:20:02,780 or summing, for each combination 629 00:20:03,950 --> 00:20:04,980 of store, item, and customer. 630 00:20:05,730 --> 00:20:07,710 So, if we have three tuples with 631 00:20:07,850 --> 00:20:08,930 the same store, item, and customer, 632 00:20:09,320 --> 00:20:10,390 those will be broken out in 633 00:20:10,440 --> 00:20:12,090 the sales table and summarized in 634 00:20:12,280 --> 00:20:13,310 the cube, even in the 635 00:20:13,520 --> 00:20:16,150 least summarized portion of the cube, if you get what I'm saying.
636 00:20:16,970 --> 00:20:17,830 So now, we've got our 25 637 00:20:18,590 --> 00:20:19,480 tuples, and if we 638 00:20:19,540 --> 00:20:20,760 sum up the price of those 639 00:20:21,020 --> 00:20:22,340 25 tuples, we should 640 00:20:22,520 --> 00:20:25,200 get that same result, $13.65, and we do. 641 00:20:26,210 --> 00:20:27,310 So, we saw that we had 642 00:20:27,820 --> 00:20:29,260 six tuples in the 643 00:20:29,330 --> 00:20:30,760 most efficient query over the data cube. 644 00:20:31,520 --> 00:20:32,900 When we used just the center 645 00:20:33,120 --> 00:20:34,060 portion of the cube, we had 646 00:20:34,770 --> 00:20:35,630 fourteen or so, I actually 647 00:20:35,790 --> 00:20:38,020 don't remember the exact number, and then 25. 648 00:20:38,480 --> 00:20:39,920 All of them gave us the same answer. 649 00:20:41,260 --> 00:20:42,380 Obviously these numbers are very 650 00:20:42,850 --> 00:20:43,790 small, but if you think about 651 00:20:44,270 --> 00:20:45,540 huge data sets that have millions 652 00:20:46,270 --> 00:20:47,370 or tens of millions of rows, 653 00:20:48,040 --> 00:20:49,690 then we're talking about dramatic differences 654 00:20:50,600 --> 00:20:51,880 in how much data needs to 655 00:20:52,070 --> 00:20:54,770 be summarized or aggregated in order to get the answer. 656 00:20:55,430 --> 00:20:56,950 So, using the summary data from 657 00:20:57,130 --> 00:20:58,550 the cube can be orders of 658 00:20:58,630 --> 00:21:01,390 magnitude faster than going to the original fact data. 659 00:21:02,710 --> 00:21:03,970 Now, a variation on 660 00:21:04,040 --> 00:21:05,110 WITH CUBE is to give 661 00:21:05,320 --> 00:21:08,470 specific attributes to the cube operator itself. 662 00:21:09,210 --> 00:21:10,370 And what that says is to 663 00:21:10,650 --> 00:21:12,190 only give summaries for 664 00:21:12,370 --> 00:21:14,590 those dimension attributes and not the others.
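(The three equivalent aggregations just demonstrated, the face of the cube, the interior of the cube, and the raw fact table, can be sketched the same way. To stay self-contained, this sketch skips the demo's blue/California restriction and the dimension joins, and simply compares how many tuples feed the same total; all names and data are invented.)

```python
import sqlite3

# Invented miniature of the demo's setup; names and data are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (storeID INT, itemID INT, custID INT, price REAL)")
conn.executemany("INSERT INTO sales VALUES (?,?,?,?)", [
    (1, 1, 1, 1.00), (1, 1, 2, 0.50), (1, 2, 1, 2.50),
    (2, 1, 2, 4.00), (2, 1, 2, 1.00), (2, 2, 3, 3.00),
])

# Build the cube as the union of the eight GROUP BY combinations.
pieces = []
for keep in range(8):
    s = "storeID" if keep & 4 else "NULL"
    i = "itemID" if keep & 2 else "NULL"
    c = "custID" if keep & 1 else "NULL"
    pieces.append(f"SELECT {s} AS storeID, {i} AS itemID, {c} AS custID, "
                  f"SUM(price) AS p FROM sales GROUP BY {s}, {i}, {c}")
conn.execute("CREATE TABLE cube AS " + " UNION ALL ".join(pieces))

def count_and_sum(sql):
    return conn.execute(sql).fetchone()

# 1) Face of the cube: pre-aggregated over customers (custID IS NULL).
face = count_and_sum("SELECT COUNT(*), SUM(p) FROM cube WHERE custID IS NULL "
                     "AND storeID IS NOT NULL AND itemID IS NOT NULL")
# 2) Interior of the cube: one cell per store/item/customer (no NULLs).
interior = count_and_sum("SELECT COUNT(*), SUM(p) FROM cube WHERE custID IS NOT NULL "
                         "AND storeID IS NOT NULL AND itemID IS NOT NULL")
# 3) Raw fact table: the completely broken-down data.
fact = count_and_sum("SELECT COUNT(*), SUM(price) FROM sales")

# Same total each time, computed from progressively more tuples.
print(face, interior, fact)
```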
665 00:21:15,170 --> 00:21:16,330 In other words, we'll only see null 666 00:21:16,700 --> 00:21:19,260 values in the result for the store ID and customer ID, 667 00:21:19,660 --> 00:21:21,690 and we won't see any null values for the item ID. 668 00:21:22,330 --> 00:21:23,720 So, in some sense, we're 669 00:21:23,900 --> 00:21:25,240 materializing just some of 670 00:21:25,340 --> 00:21:27,180 the faces, edges, and corner 671 00:21:27,560 --> 00:21:28,840 of the cube, not all of them. 672 00:21:29,650 --> 00:21:30,710 If we run this query, 673 00:21:30,850 --> 00:21:32,150 we'll get the same response as before, that 674 00:21:32,630 --> 00:21:33,940 MySQL doesn't yet support it, 675 00:21:34,110 --> 00:21:35,490 but this is the SQL standard 676 00:21:36,390 --> 00:21:38,430 for the cube operator with subsets of attributes. 677 00:21:39,660 --> 00:21:40,920 As before, I've cooked up an 678 00:21:41,240 --> 00:21:43,030 equivalent query using MySQL's 679 00:21:43,480 --> 00:21:45,220 WITH ROLLUP, and I'm 680 00:21:45,370 --> 00:21:46,810 certainly not going to explain it, 681 00:21:46,910 --> 00:21:48,870 but you're welcome to download it and examine it. 682 00:21:49,630 --> 00:21:50,420 What I'd really like to look 683 00:21:50,660 --> 00:21:53,300 at is the result, and here it is. 684 00:21:53,380 --> 00:21:54,420 So, this looks a lot like our 685 00:21:54,570 --> 00:21:56,740 original data cube, but what we notice is 686 00:21:56,890 --> 00:21:58,330 that there are no null values 687 00:21:59,220 --> 00:22:00,690 in the item column. 688 00:22:01,150 --> 00:22:03,340 So, it's a subset of the result of the cube query. 689 00:22:03,610 --> 00:22:04,640 And we would use this when 690 00:22:04,870 --> 00:22:06,630 we know we're never going to be rolling up on items, 691 00:22:07,640 --> 00:22:09,240 so the summaries 692 00:22:09,740 --> 00:22:12,040 based on that dimension aren't going to be useful to us.
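(Cubing on a subset of the dimensions can be sketched the same way: itemID stays in every grouping, so it is never null in the result, while store and customer still roll up. Again an invented miniature, since MySQL lacks WITH CUBE.)

```python
import sqlite3

# Invented miniature fact table; names and data are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (storeID INT, itemID INT, custID INT, price REAL)")
conn.executemany("INSERT INTO sales VALUES (?,?,?,?)", [
    (1, 1, 1, 1.00), (1, 2, 1, 2.50), (2, 1, 2, 4.00), (2, 2, 3, 3.00),
])

# Cube over storeID and custID only: itemID appears in every grouping,
# so the result never contains a NULL in the item column.
pieces = []
for keep_store in (True, False):
    for keep_cust in (True, False):
        s = "storeID" if keep_store else "NULL"
        c = "custID" if keep_cust else "NULL"
        pieces.append(f"SELECT {s} AS storeID, itemID, {c} AS custID, "
                      f"SUM(price) AS p FROM sales GROUP BY {s}, itemID, {c}")
rows = conn.execute(" UNION ALL ".join(pieces)).fetchall()

# Only store and customer roll up; the item column is never NULL.
assert all(row[1] is not None for row in rows)
print(rows)
```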
693 00:22:13,180 --> 00:22:14,270 Now, let's look at WITH ROLLUP, 694 00:22:14,710 --> 00:22:16,880 which is supported natively by MySQL. 695 00:22:17,870 --> 00:22:20,700 WITH ROLLUP, again, gives us a subset of the data 696 00:22:20,810 --> 00:22:20,940 cube. 697 00:22:21,820 --> 00:22:22,390 But it's a very specific one, 698 00:22:22,460 --> 00:22:23,740 and it's based on the ordering of 699 00:22:23,890 --> 00:22:25,410 the attributes in the group by clause. 700 00:22:26,190 --> 00:22:28,680 So, let me just go ahead and run it and explain what we get. 701 00:22:29,440 --> 00:22:30,690 We again get null values, 702 00:22:31,300 --> 00:22:33,240 and those null values indicate summaries. 703 00:22:33,790 --> 00:22:36,030 For example, this tuple for Store 1, Item 1, and null: 704 00:22:36,310 --> 00:22:37,720 it's the sum of prices for 705 00:22:37,960 --> 00:22:40,100 all customers for Store 1 and Item 1. 706 00:22:40,850 --> 00:22:42,310 And we see Store 1, Item 2, as well. 707 00:22:42,630 --> 00:22:43,810 We again see cases with 708 00:22:44,220 --> 00:22:45,630 two null values, so this 709 00:22:45,720 --> 00:22:47,230 is the summary for Store 1 710 00:22:47,530 --> 00:22:48,480 over all items and customers, 711 00:22:49,000 --> 00:22:50,320 the total sales, and we 712 00:22:50,450 --> 00:22:52,300 also have the triple null somewhere. 713 00:22:53,450 --> 00:22:56,350 It's at the bottom this time, with total sales of $33.50. 714 00:22:56,540 --> 00:22:58,200 But what we'll see 715 00:22:58,410 --> 00:23:00,890 is that we don't have all combinations of null values. 716 00:23:01,750 --> 00:23:02,690 We have a null value for customer 717 00:23:03,200 --> 00:23:04,190 ID, or we have nulls 718 00:23:04,500 --> 00:23:06,150 for the combination of customer ID 719 00:23:06,540 --> 00:23:08,550 and item ID, or all three nulls.
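(WITH ROLLUP's null patterns can be simulated by aggregating only over prefixes of the GROUP BY list, one grouping per prefix, rather than all combinations. A sketch with invented names and data:)

```python
import sqlite3

# Invented miniature fact table; names and data are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (storeID INT, itemID INT, custID INT, price REAL)")
conn.executemany("INSERT INTO sales VALUES (?,?,?,?)", [
    (1, 1, 1, 1.00), (1, 2, 1, 2.50), (2, 1, 2, 4.00), (2, 2, 3, 3.00),
])

# ROLLUP aggregates over *prefixes* of the GROUP BY list:
# (storeID, itemID, custID), (storeID, itemID), (storeID,), ().
dims = ["storeID", "itemID", "custID"]
pieces = []
for k in range(len(dims), -1, -1):
    cols = ", ".join(dims[:k] + ["NULL AS " + d for d in dims[k:]])
    group = ", ".join(dims[:k]) or "NULL"
    pieces.append(f"SELECT {cols}, SUM(price) AS p FROM sales GROUP BY {group}")
rows = conn.execute(" UNION ALL ".join(pieces)).fetchall()

# In every row the NULLs form a suffix: once a dimension is NULL,
# every dimension to its right is NULL too (unlike the full cube).
for s, i, c, p in rows:
    assert not (s is None and i is not None)
    assert not (i is None and c is not None)
print(rows)
```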
720 00:23:08,800 --> 00:23:09,770 So nulls appear only in the rightmost 721 00:23:10,150 --> 00:23:11,520 attribute, or the two 722 00:23:11,790 --> 00:23:13,540 rightmost attributes, or all 723 00:23:13,790 --> 00:23:14,630 of them, and if we had 724 00:23:14,890 --> 00:23:16,090 a fourth dimension, we'd see 725 00:23:16,150 --> 00:23:18,680 the rightmost, the two rightmost, or the three rightmost. 726 00:23:19,740 --> 00:23:20,920 This doesn't really make any 727 00:23:21,240 --> 00:23:22,600 particular sense for this query, 728 00:23:23,260 --> 00:23:25,200 but it's useful when we have dimensions that are hierarchical. 729 00:23:26,510 --> 00:23:28,350 And I actually introduced a hierarchical 730 00:23:28,980 --> 00:23:31,330 dimension in our original schema just for this purpose. 731 00:23:31,740 --> 00:23:32,910 So, let's turn to that query. 732 00:23:34,250 --> 00:23:35,480 So, here's a query that's grouping 733 00:23:35,950 --> 00:23:37,440 by state, county, and city. 734 00:23:37,760 --> 00:23:39,790 These are three dimension attributes; 735 00:23:40,130 --> 00:23:42,050 they all come from the store dimension table. 736 00:23:42,490 --> 00:23:44,190 And they are hierarchical, meaning 737 00:23:44,530 --> 00:23:45,600 that we don't have every 738 00:23:45,990 --> 00:23:47,540 combination of state, county, and city. 739 00:23:48,580 --> 00:23:49,810 We have cities that are in 740 00:23:50,070 --> 00:23:51,510 specific counties, and counties 741 00:23:52,010 --> 00:23:53,170 that are in specific states. 742 00:23:54,150 --> 00:23:54,980 So, when we look at the results 743 00:23:55,230 --> 00:23:56,320 of the query, we see, of 744 00:23:56,540 --> 00:23:57,490 course, that 745 00:23:57,660 --> 00:23:59,780 San Mateo County is always in the state of California. 746 00:24:01,070 --> 00:24:02,770 King County is always in the state of Washington.
747 00:24:03,830 --> 00:24:05,530 We don't have Palo Alto combined 748 00:24:06,110 --> 00:24:07,710 with, say, King County and the state of California. 749 00:24:08,290 --> 00:24:10,390 So, we don't have all the combinations of the three. 750 00:24:11,050 --> 00:24:12,890 We have a hierarchical structure on them. 751 00:24:13,460 --> 00:24:16,330 And it's this structure that makes sense when we use a roll-up. 752 00:24:17,100 --> 00:24:20,080 So, let's add WITH ROLLUP to this query and see what we get. 753 00:24:20,750 --> 00:24:21,860 Here in our result we see 754 00:24:22,080 --> 00:24:25,310 our original tuples for each state, county, city combination. 755 00:24:26,570 --> 00:24:27,700 And then we see, for example, 756 00:24:28,090 --> 00:24:29,550 this tuple here, which is 757 00:24:29,880 --> 00:24:30,980 taking the state of California 758 00:24:31,520 --> 00:24:32,660 and the county of San Mateo 759 00:24:33,200 --> 00:24:34,760 and adding up all of the sales there. 760 00:24:35,890 --> 00:24:38,790 And we see similarly that we have all of the sales for Santa Clara County. 761 00:24:39,560 --> 00:24:40,410 Then we can see that we 762 00:24:40,530 --> 00:24:41,550 have all of the sales for 763 00:24:41,950 --> 00:24:42,920 each state, so this is 764 00:24:42,970 --> 00:24:44,570 the California sales and the Washington sales. 765 00:24:45,000 --> 00:24:46,730 And finally, the triple null is 766 00:24:47,000 --> 00:24:48,360 the corner of the cube; it's 767 00:24:48,580 --> 00:24:49,900 all of our sales, and, as 768 00:24:50,040 --> 00:24:51,460 usual, we get the total of 33.50. 769 00:24:52,330 --> 00:24:53,780 Now, what don't we 770 00:24:54,000 --> 00:24:55,610 see here compared with the data cube? 771 00:24:56,150 --> 00:24:57,430 Well, we don't see, for example, 772 00:24:57,490 --> 00:24:58,770 a tuple that has 773 00:24:59,020 --> 00:25:00,730 California and Palo Alto 774 00:25:01,050 --> 00:25:02,250 with the county as null.
775 00:25:03,020 --> 00:25:03,400 Why is that? 776 00:25:03,650 --> 00:25:06,080 Well, Palo Alto is always in Santa Clara County. 777 00:25:07,080 --> 00:25:08,500 So rolling up the county, 778 00:25:09,320 --> 00:25:10,970 or saying I want California and 779 00:25:11,080 --> 00:25:12,230 Palo Alto sales for every 780 00:25:12,380 --> 00:25:13,830 county, is exactly the 781 00:25:13,920 --> 00:25:14,820 same as saying I want 782 00:25:15,070 --> 00:25:17,210 California and Palo Alto sales in Santa Clara County. 783 00:25:18,540 --> 00:25:19,670 We also don't see, for 784 00:25:19,800 --> 00:25:21,940 example, the state and county both being null, 785 00:25:22,340 --> 00:25:23,470 say, if we had 786 00:25:23,740 --> 00:25:24,830 Seattle as the city with 787 00:25:25,020 --> 00:25:26,460 the state and county null. 788 00:25:27,120 --> 00:25:28,350 Well, Seattle is always in 789 00:25:28,530 --> 00:25:29,620 King County and always in 790 00:25:29,750 --> 00:25:30,860 Washington, so we're not aggregating 791 00:25:31,480 --> 00:25:32,360 anything there; we'd get the 792 00:25:32,610 --> 00:25:34,620 exact same result as having Seattle, King County, Washington. 793 00:25:35,940 --> 00:25:37,110 So if we ran WITH CUBE 794 00:25:37,770 --> 00:25:40,230 on this hierarchically structured data, we'd 795 00:25:40,380 --> 00:25:42,010 actually not get any more information. 796 00:25:42,570 --> 00:25:45,460 We'd have a bigger result, but it wouldn't be giving us more information; 797 00:25:46,230 --> 00:25:48,550 it would just be less efficient for getting the same data. 798 00:25:49,900 --> 00:25:51,810 So that concludes our OLAP demonstration. 799 00:25:52,910 --> 00:25:54,280 We saw a star schema, and 800 00:25:54,460 --> 00:25:56,410 we saw plain SQL queries over that schema. 801 00:25:57,440 --> 00:25:59,010 We saw the concept of drilling 802 00:25:59,370 --> 00:26:01,700 down and rolling up, and also slicing and dicing.
803 00:26:02,930 --> 00:26:04,310 We introduced a WITH CUBE 804 00:26:04,630 --> 00:26:06,110 extension to SQL, which is 805 00:26:06,220 --> 00:26:07,490 not yet implemented in MySQL, 806 00:26:08,120 --> 00:26:08,850 but we were able to write 807 00:26:09,030 --> 00:26:10,510 a query that's equivalent to WITH CUBE. 808 00:26:11,370 --> 00:26:12,780 We also saw putting a WITH 809 00:26:13,090 --> 00:26:14,300 CUBE query into a table 810 00:26:14,760 --> 00:26:15,960 and then querying that table directly 811 00:26:16,890 --> 00:26:18,140 and that can be much more efficient 812 00:26:18,980 --> 00:26:20,160 than running the equivalent query 813 00:26:21,010 --> 00:26:22,440 in SQL directly over the fact table. 814 00:26:23,390 --> 00:26:25,430 We also saw WITH ROLLUP, which is implemented. 815 00:26:26,480 --> 00:26:27,680 We didn't demonstrate putting the 816 00:26:27,750 --> 00:26:30,250 result of WITH ROLLUP in a table, but we could certainly do that too. 817 00:26:31,400 --> 00:26:32,700 All of these features are useful 818 00:26:33,120 --> 00:26:34,530 primarily in applications that 819 00:26:34,660 --> 00:26:36,420 are performing analyses over very 820 00:26:36,790 --> 00:26:38,130 large data sets that exhibit 821 00:26:38,800 --> 00:26:40,220 this dimensional type structure, but 822 00:26:40,440 --> 00:26:41,910 this is actually quite a 823 00:26:42,100 --> 00:26:43,730 common structure in analysis applications.
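(As a final recap for readers experimenting on their own: here is a sketch of the hierarchical roll-up over state, county, and city, simulated via prefix groupings since SQLite has no native ROLLUP. The miniature store and sales tables are invented; notice that because each city lies in exactly one county and state, the roll-up already produces every distinct total once, which is why a full cube would add rows but no information.)

```python
import sqlite3

# Invented miniature dimension and fact tables; names and data are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE store (storeID INT, city TEXT, county TEXT, state TEXT);
CREATE TABLE sales (storeID INT, price REAL);
INSERT INTO store VALUES
  (1,'Palo Alto','Santa Clara','CA'), (2,'San Mateo','San Mateo','CA'),
  (3,'Seattle','King','WA');
INSERT INTO sales VALUES (1, 1.00), (1, 2.50), (2, 4.00), (3, 3.00);
""")

# Simulate ROLLUP on the hierarchy state > county > city by aggregating
# over each prefix of the GROUP BY list.
dims = ["state", "county", "city"]
pieces = []
for k in range(len(dims), -1, -1):
    cols = ", ".join(dims[:k] + ["NULL AS " + d for d in dims[k:]])
    group = ", ".join(dims[:k]) or "NULL"
    pieces.append(f"SELECT {cols}, SUM(price) AS p "
                  f"FROM sales NATURAL JOIN store GROUP BY {group}")
rows = conn.execute(" UNION ALL ".join(pieces)).fetchall()

# The only row with a NULL state is the all-NULL corner: the grand total.
grand = next(p for st, co, ci, p in rows if st is None)
print(rows, grand)
```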