1
00:00:00,510 --> 00:00:04,000
This video introduces online analytical processing or OLAP.

2
00:00:04,440 --> 00:00:06,110
A subsequent video will have

3
00:00:06,410 --> 00:00:07,850
a demo of OLAP queries in action.

4
00:00:09,070 --> 00:00:11,000
Overall, database activity can

5
00:00:11,090 --> 00:00:12,640
be divided into two broad classes.

6
00:00:13,640 --> 00:00:15,150
One of them, the traditional one,

7
00:00:15,740 --> 00:00:19,070
is known as OLTP, or online transaction processing.

8
00:00:19,910 --> 00:00:20,720
The other one, the subject

9
00:00:21,140 --> 00:00:22,630
of this video, came about more

10
00:00:22,800 --> 00:00:23,960
recently, and it's known

11
00:00:24,250 --> 00:00:26,250
as OLAP, or online analytical processing.

12
00:00:28,840 --> 00:00:29,970
Online transaction processing is typically

13
00:00:30,350 --> 00:00:31,720
characterized by short transactions,

14
00:00:32,750 --> 00:00:33,500
both queries and updates.

15
00:00:34,330 --> 00:00:35,660
Things like updating an account

16
00:00:36,130 --> 00:00:37,180
balance in a bank database

17
00:00:37,670 --> 00:00:39,380
or logging a page view in a web application.

18
00:00:40,700 --> 00:00:43,770
Queries in OLTP data bases are generally fairly simple.

19
00:00:44,570 --> 00:00:47,190
Find an account balance or find the GPA of a student.

20
00:00:47,750 --> 00:00:49,580
They typically touch small portions of the data.

21
00:00:50,770 --> 00:00:52,660
And updates in this environment can be frequent.

22
00:00:53,810 --> 00:00:55,080
We might be making airline seat

23
00:00:55,390 --> 00:00:58,480
reservations or updating a online shopping cart.

24
00:00:59,950 --> 00:01:02,170
OLAP is pretty much the opposite in all respects.

25
00:01:03,420 --> 00:01:04,390
In OLAP, we have long

26
00:01:04,830 --> 00:01:06,860
transactions, often complex analysis

27
00:01:07,470 --> 00:01:09,420
of the data or data mining type operations.

28
00:01:10,780 --> 00:01:11,760
The queries as I said, can

29
00:01:11,880 --> 00:01:13,570
be complex and especially they

30
00:01:13,870 --> 00:01:15,520
often touch large portions of

31
00:01:15,580 --> 00:01:17,950
the data rather than small portions as in OLTP.

32
00:01:19,420 --> 00:01:20,600
And updates in the OLAP

33
00:01:20,860 --> 00:01:22,050
environment tend to be infrequent,

34
00:01:22,630 --> 00:01:23,900
in fact, sometimes in the

35
00:01:24,100 --> 00:01:26,530
OLAP environment there are no updates to the data at all.

36
00:01:27,540 --> 00:01:29,620
Now, these two are extremes

37
00:01:30,450 --> 00:01:31,220
and really there is a spectrum

38
00:01:31,900 --> 00:01:32,870
between those two extremes.

39
00:01:34,290 --> 00:01:35,400
We might have a sort of, moderate

40
00:01:35,850 --> 00:01:37,530
amount of update and queries

41
00:01:37,910 --> 00:01:39,460
that touch a moderate portion of the data.

42
00:01:40,190 --> 00:01:41,250
But the fact is that

43
00:01:41,540 --> 00:01:44,850
database systems traditionally were designed for the first extreme.

44
00:01:46,000 --> 00:01:49,020
And then special techniques were developed for the other extreme.

45
00:01:49,910 --> 00:01:51,810
So the systems are tuned for the two extremes.

46
00:01:52,380 --> 00:01:53,490
And depending on ones work load

47
00:01:53,680 --> 00:01:55,370
one might choose to use different

48
00:01:55,960 --> 00:01:57,000
options in a database system

49
00:01:58,340 --> 00:01:59,230
just a little bit more terminology

50
00:02:00,030 --> 00:02:00,960
in the OLAP world.

51
00:02:01,630 --> 00:02:03,110
There's a concept called data warehousing.

52
00:02:03,800 --> 00:02:05,120
It's really a software architecture.

53
00:02:06,160 --> 00:02:07,620
The idea is that often in

54
00:02:08,070 --> 00:02:09,120
enter prizes or other operation,

55
00:02:10,070 --> 00:02:12,340
there are lots of operational sources.

56
00:02:12,920 --> 00:02:13,730
So you can think of a

57
00:02:13,920 --> 00:02:15,610
point of sale, for example, might

58
00:02:15,830 --> 00:02:17,770
have many, many OLTP database

59
00:02:18,230 --> 00:02:19,540
pieces related to an enterprise,

60
00:02:20,680 --> 00:02:22,720
and data warehousing is the

61
00:02:22,840 --> 00:02:23,870
process of bringing the data from

62
00:02:24,200 --> 00:02:26,120
all of those distributed OLTP sources

63
00:02:26,880 --> 00:02:29,290
into a single, gigantic warehouse where

64
00:02:29,490 --> 00:02:30,440
the point then is to do

65
00:02:30,920 --> 00:02:32,390
analyses of the data, and

66
00:02:32,510 --> 00:02:33,980
that would fall under the OLAP camp.

67
00:02:35,100 --> 00:02:36,470
Another term you might encounter is

68
00:02:36,610 --> 00:02:38,710
decision support systems also known as DSS.

69
00:02:39,510 --> 00:02:41,090
This isn't really an exact term.

70
00:02:41,780 --> 00:02:42,850
It's generally used to talk

71
00:02:43,000 --> 00:02:45,780
about infrastructure for again large scale data analyses.

72
00:02:46,790 --> 00:02:47,780
So, if you think of

73
00:02:47,870 --> 00:02:49,310
a data warehouse, where we're

74
00:02:49,520 --> 00:02:50,280
bringing in a lot of

75
00:02:50,370 --> 00:02:51,910
data from operational sources, and

76
00:02:52,030 --> 00:02:53,120
that warehouse is tuned for

77
00:02:53,340 --> 00:02:54,760
OLAP queries that would

78
00:02:54,930 --> 00:02:57,040
be thought of as a decision support system.

79
00:02:57,670 --> 00:02:58,590
And, of course, this system

80
00:02:58,990 --> 00:03:00,130
is designed to support decisions

81
00:03:00,660 --> 00:03:02,620
that are made, again, based on data analysis.

82
00:03:04,160 --> 00:03:06,380
Now, let's get into some technical details of OLAP.

83
00:03:07,460 --> 00:03:09,180
Frequently applications that are

84
00:03:09,280 --> 00:03:10,800
doing online analytical processing

85
00:03:12,060 --> 00:03:13,490
are designed based around a

86
00:03:13,810 --> 00:03:15,250
star schema, so it's a

87
00:03:15,360 --> 00:03:16,980
certain type of relational schema.

88
00:03:18,140 --> 00:03:21,430
In a star schema, there's usually one fact table.

89
00:03:22,230 --> 00:03:23,980
That will be a typically very

90
00:03:24,230 --> 00:03:26,140
large table, it will be updated frequently.

91
00:03:26,890 --> 00:03:28,150
Often it's actually append only,

92
00:03:28,590 --> 00:03:30,950
so there are only inserts into the fact table.

93
00:03:31,940 --> 00:03:34,360
And then there are maybe many dimension tables.

94
00:03:35,140 --> 00:03:38,010
Those are updated infrequently and don't tend to be as large.

95
00:03:40,060 --> 00:03:40,910
So examples of a fact table

96
00:03:41,300 --> 00:03:43,040
might be sales transactions in

97
00:03:43,160 --> 00:03:44,480
a sales database or in

98
00:03:44,570 --> 00:03:45,940
a university database, maybe students

99
00:03:46,500 --> 00:03:48,150
enrolling in courses or in

100
00:03:48,290 --> 00:03:50,150
a web application logging the page views.

101
00:03:51,390 --> 00:03:52,430
In all of these cases we can

102
00:03:52,580 --> 00:03:53,580
see that the fact table can

103
00:03:53,800 --> 00:03:55,130
be very large and can

104
00:03:55,360 --> 00:03:57,350
be append only, so inserts only.

105
00:03:58,400 --> 00:03:59,880
Examples of dimension tables

106
00:04:00,660 --> 00:04:01,880
might be in a sales

107
00:04:02,130 --> 00:04:03,850
database store's items and

108
00:04:04,030 --> 00:04:07,390
customers in a college enrollment database.

109
00:04:07,940 --> 00:04:10,380
Maybe students and courses in a web application.

110
00:04:11,110 --> 00:04:13,550
Maybe web pages his users and advertisers.

111
00:04:14,180 --> 00:04:15,280
So, you can see, that

112
00:04:15,540 --> 00:04:17,000
these are generally smaller

113
00:04:17,520 --> 00:04:18,850
tables, they're more stable, they're

114
00:04:18,970 --> 00:04:20,370
not updated as frequently.

115
00:04:21,400 --> 00:04:22,360
You can sort of think

116
00:04:22,600 --> 00:04:24,270
of dimension tables as things

117
00:04:24,660 --> 00:04:26,480
in the real world and then

118
00:04:26,920 --> 00:04:29,010
fact tables as logging things that happened.

119
00:04:29,870 --> 00:04:32,680
It's not always divided this way but, it's not a bad approximation.

120
00:04:34,240 --> 00:04:35,140
Now, you might be wondering

121
00:04:35,500 --> 00:04:36,920
why is it called a

122
00:04:37,100 --> 00:04:38,700
star schema and it's called

123
00:04:39,050 --> 00:04:40,100
that because we have the

124
00:04:40,310 --> 00:04:41,510
fact table sort of, centrally

125
00:04:42,220 --> 00:04:44,190
referencing dimension tables around it.

126
00:04:44,320 --> 00:04:44,890
So, I'll draw the picture.

127
00:04:45,670 --> 00:04:48,250
Let's take a particular example and let's look at the sales domain.

128
00:04:49,140 --> 00:04:50,240
So, we'll have our fact

129
00:04:50,660 --> 00:04:51,690
table here, which will be

130
00:04:51,900 --> 00:04:53,610
the sales table and that

131
00:04:53,820 --> 00:04:55,360
will log sales transactions actions.

132
00:04:55,770 --> 00:04:56,860
It will include the store where

133
00:04:57,000 --> 00:04:58,250
the sale was made, the item

134
00:04:58,640 --> 00:04:59,720
that was sold, the customer, how

135
00:04:59,890 --> 00:05:01,510
many were sold, and the price that was paid.

136
00:05:02,260 --> 00:05:04,800
And then the other three tables are the dimension tables.

137
00:05:05,490 --> 00:05:07,090
So those those are giving

138
00:05:07,240 --> 00:05:08,330
us information about the stores

139
00:05:08,950 --> 00:05:09,960
and the items and the customers.

140
00:05:11,470 --> 00:05:13,200
So, I've drawn a picture of our schema here.

141
00:05:13,420 --> 00:05:15,860
We have our central fact table, the sales table.

142
00:05:16,850 --> 00:05:17,700
And we can see that the

143
00:05:17,790 --> 00:05:18,990
sales table contains these three

144
00:05:19,190 --> 00:05:20,580
columns I've abbreviated them in

145
00:05:20,650 --> 00:05:22,100
the picture: the Store ID, Item

146
00:05:22,470 --> 00:05:23,850
ID, and the Customer ID.

147
00:05:24,870 --> 00:05:27,030
The store ID values in

148
00:05:27,230 --> 00:05:28,700
this column will be foreign

149
00:05:29,170 --> 00:05:30,460
key attributes to the

150
00:05:30,540 --> 00:05:31,650
primary key of the store

151
00:05:31,890 --> 00:05:33,750
table if you remember our constraints video.

152
00:05:34,520 --> 00:05:35,550
So we can think of these as

153
00:05:35,860 --> 00:05:36,870
pointers into the store

154
00:05:37,200 --> 00:05:39,660
table, least specifically matching store

155
00:05:40,410 --> 00:05:41,300
IDs over here.

156
00:05:42,070 --> 00:05:44,230
And we'll have similarly our

157
00:05:44,880 --> 00:05:45,780
item IDs will be foreign

158
00:05:46,260 --> 00:05:47,340
keys to the item table.

159
00:05:47,770 --> 00:05:49,510
I won't actually point to the values here.

160
00:05:50,330 --> 00:05:52,960
And then our costumer IDs

161
00:05:53,310 --> 00:05:55,680
over here will be pointing to the customer table.

162
00:05:56,490 --> 00:05:57,650
So if you look at

163
00:05:57,790 --> 00:05:59,830
this squinting, you will

164
00:06:00,130 --> 00:06:01,420
see that it is kind of

165
00:06:01,480 --> 00:06:02,930
a star schema with the

166
00:06:03,260 --> 00:06:04,960
central fact table pointing

167
00:06:05,280 --> 00:06:08,250
to the dimension tables around it, and that's where the name comes from.

168
00:06:09,380 --> 00:06:10,830
Just a little more terminology.

169
00:06:11,550 --> 00:06:14,030
The first three attributes here in the fact fact table.

170
00:06:14,290 --> 00:06:15,700
These three are what are

171
00:06:15,750 --> 00:06:17,420
known as dimension attributes.

172
00:06:18,340 --> 00:06:19,630
So those are the attributes

173
00:06:20,120 --> 00:06:22,580
that are foreign keys into the dimension tables.

174
00:06:23,600 --> 00:06:25,150
Then the remaining attributes in

175
00:06:25,240 --> 00:06:26,580
this case the quantity and the

176
00:06:26,680 --> 00:06:28,870
price are called dependent attributes.

177
00:06:29,340 --> 00:06:30,810
So they're I guess dependent on the

178
00:06:30,960 --> 00:06:32,800
values for the dimension

179
00:06:33,440 --> 00:06:35,150
attributes and typically, queries will

180
00:06:35,580 --> 00:06:38,070
tend to aggregate on the dependent attributes.

181
00:06:38,420 --> 00:06:39,920
We'll see examples of that in a moment.

182
00:06:40,880 --> 00:06:41,910
So, now that we known

183
00:06:42,050 --> 00:06:43,230
what a star schema looks like,

184
00:06:43,450 --> 00:06:44,350
let's look at the type of

185
00:06:44,530 --> 00:06:45,730
queries that are generally issued

186
00:06:46,000 --> 00:06:48,010
over this schema, and they're called OLAP queries.

187
00:06:48,900 --> 00:06:50,320
Typically a query over

188
00:06:50,920 --> 00:06:52,350
a star schema will first

189
00:06:52,740 --> 00:06:54,400
join some or all of the relations.

190
00:06:55,700 --> 00:06:56,820
And when you're joining the sale

191
00:06:57,240 --> 00:06:58,210
as the fact table with

192
00:06:58,340 --> 00:06:59,880
the dimension tables, you can

193
00:06:59,920 --> 00:07:01,780
almost think of it as expanding the

194
00:07:02,050 --> 00:07:03,080
facts in the sales table

195
00:07:03,390 --> 00:07:05,410
to include more information about the sales.

196
00:07:06,100 --> 00:07:07,820
Since we have the foreign keys we'll

197
00:07:07,980 --> 00:07:10,810
be adding, for example, to the information about a sale.

198
00:07:11,150 --> 00:07:11,870
More about the store.

199
00:07:12,170 --> 00:07:13,140
The city and state of the store.

200
00:07:13,910 --> 00:07:15,410
For a sale item will

201
00:07:15,500 --> 00:07:17,050
be adding the category brand and so on.

202
00:07:17,090 --> 00:07:18,390
So that's the join

203
00:07:18,710 --> 00:07:19,630
process and the query will

204
00:07:19,770 --> 00:07:20,660
join as much as it

205
00:07:20,750 --> 00:07:23,100
needs in order to do the rest of it's work.

206
00:07:23,950 --> 00:07:25,590
It might then filter the data.

207
00:07:26,060 --> 00:07:27,000
For example we might decide

208
00:07:27,380 --> 00:07:28,280
that in our query we only

209
00:07:28,540 --> 00:07:29,760
care about stores in California

210
00:07:30,710 --> 00:07:32,290
or customers in California, we're only

211
00:07:32,520 --> 00:07:34,300
interested in shirts and so on.

212
00:07:34,540 --> 00:07:35,850
So they can filter on the

213
00:07:36,390 --> 00:07:37,840
dimension attributes after joining,

214
00:07:38,160 --> 00:07:40,210
or could filter on the price or quantity as well.

215
00:07:41,700 --> 00:07:44,620
After filtering there's often a group by an aggregation.

216
00:07:45,890 --> 00:07:47,230
So we might decide that we're

217
00:07:47,450 --> 00:07:49,510
interested in figuring out our

218
00:07:49,910 --> 00:07:51,660
total sales divided by customer

219
00:07:52,370 --> 00:07:54,520
or by item or by state or all of those.

220
00:07:55,070 --> 00:07:57,080
And then the aggregation might sum

221
00:07:57,290 --> 00:07:58,290
up the sales or it might

222
00:07:58,560 --> 00:08:00,530
determine the average price that's sold.

223
00:08:01,290 --> 00:08:02,390
We'll be doing a number of this

224
00:08:02,530 --> 00:08:04,410
type of query in our demo later on.

225
00:08:05,280 --> 00:08:06,630
So if you think about executing

226
00:08:07,000 --> 00:08:08,380
queries of this type, they

227
00:08:08,680 --> 00:08:10,190
can be quite complex and they

228
00:08:10,260 --> 00:08:11,760
can touch large portions of the database.

229
00:08:12,720 --> 00:08:13,940
Sowe 're worried about performance,

230
00:08:14,630 --> 00:08:16,780
and our data is large, we do have a worry.

231
00:08:17,330 --> 00:08:18,670
Running this type of

232
00:08:18,830 --> 00:08:20,210
query on a gigantic database

233
00:08:21,020 --> 00:08:22,560
over a standard database system

234
00:08:22,910 --> 00:08:24,720
can be very slow, but over

235
00:08:25,560 --> 00:08:26,580
the past decade or so,

236
00:08:27,230 --> 00:08:28,720
special indexing techniques have

237
00:08:28,830 --> 00:08:30,170
been introduced and special query processing

238
00:08:30,600 --> 00:08:32,060
techniques specifically to handle

239
00:08:32,830 --> 00:08:34,340
this type of query on

240
00:08:34,520 --> 00:08:35,940
star schemas on large databases.

241
00:08:36,500 --> 00:08:37,880
And again, by large, just

242
00:08:37,900 --> 00:08:39,140
think about the number of sales,

243
00:08:39,590 --> 00:08:40,640
for example, in a large

244
00:08:40,930 --> 00:08:42,720
retail chain, or a

245
00:08:43,060 --> 00:08:44,160
number of web views, or

246
00:08:44,260 --> 00:08:45,620
even shopping cart additions

247
00:08:46,210 --> 00:08:47,370
in a large online vendor.

248
00:08:48,250 --> 00:08:49,100
So, in all of those

249
00:08:49,220 --> 00:08:50,410
applications, people are interested in

250
00:08:50,540 --> 00:08:51,410
doing OLAP queries and they

251
00:08:52,050 --> 00:08:54,440
tend to use a system that supports these special techniques.

252
00:08:55,760 --> 00:08:57,260
Another component of getting good

253
00:08:57,460 --> 00:08:58,640
performance in these systems

254
00:08:59,360 --> 00:09:00,670
is the use of materialized views.

255
00:09:01,330 --> 00:09:02,900
You might remember that materialized

256
00:09:03,020 --> 00:09:04,240
views are useful when we

257
00:09:04,420 --> 00:09:06,140
have a workload that

258
00:09:06,360 --> 00:09:08,610
consists of lots of queries and not so many updates.

259
00:09:09,120 --> 00:09:11,760
And that's exactly the type of workload we have in OLAP.

260
00:09:12,580 --> 00:09:14,100
furthermore, we have many queries

261
00:09:14,570 --> 00:09:15,580
that take roughly the same

262
00:09:15,910 --> 00:09:17,430
structure so material wise we

263
00:09:17,570 --> 00:09:18,990
use are useful in that setting as well.

264
00:09:19,820 --> 00:09:20,950
Now let me switch gears

265
00:09:21,230 --> 00:09:22,590
and introduce a different way

266
00:09:22,890 --> 00:09:24,640
of looking at the data in

267
00:09:24,700 --> 00:09:26,360
these OLAP applications with star

268
00:09:26,730 --> 00:09:28,820
schemas, and it's what's known as a data cube.

269
00:09:29,440 --> 00:09:30,410
Sometimes this is also called

270
00:09:30,650 --> 00:09:32,850
multidimensional OLAP and

271
00:09:32,980 --> 00:09:34,110
the basic idea is that

272
00:09:34,270 --> 00:09:35,640
when we have data with dimensions,

273
00:09:36,470 --> 00:09:37,430
we can think of those dimensions

274
00:09:38,420 --> 00:09:39,900
as forming the axis of a cube.

275
00:09:40,550 --> 00:09:42,520
It's kind of like an N dimensional spreadsheet.

276
00:09:43,490 --> 00:09:44,810
Now we can have any

277
00:09:45,010 --> 00:09:46,380
number of dimensions, but for

278
00:09:46,680 --> 00:09:48,050
the examples I'm gonna give,

279
00:09:49,450 --> 00:09:50,430
the best I can draw is

280
00:09:50,550 --> 00:09:52,490
up to three dimensions, and that's why people call acute.

281
00:09:52,770 --> 00:09:54,150
Because they know how to draw three dimensions.

282
00:09:54,930 --> 00:09:56,140
But again, any number of

283
00:09:56,400 --> 00:09:58,340
dimensions are possible in this view of the data.

284
00:09:59,630 --> 00:10:02,740
So we have our dimensions forming the axis of our cube.

285
00:10:03,420 --> 00:10:04,480
And then the cells of the

286
00:10:04,600 --> 00:10:06,660
cube, again, you can think of it sort of like cells of a spreadsheet.

287
00:10:07,820 --> 00:10:08,960
Are the fact of data.

288
00:10:09,410 --> 00:10:10,190
Or the dependent data.

289
00:10:10,430 --> 00:10:11,690
It's like in the previous example that

290
00:10:11,820 --> 00:10:13,510
would be our quantity and price.

291
00:10:14,430 --> 00:10:15,530
And finally we have aggregated

292
00:10:16,150 --> 00:10:17,740
data on the sides, edges

293
00:10:18,130 --> 00:10:19,780
and corners of corner of the cube.

294
00:10:20,100 --> 00:10:22,930
Again similar to how you might aggregate columns in a spreadsheet.

295
00:10:23,840 --> 00:10:24,970
So let's go ahead and

296
00:10:25,210 --> 00:10:26,160
I'll do my best to draw

297
00:10:26,450 --> 00:10:27,940
a picture to explain what's going on.

298
00:10:29,010 --> 00:10:30,260
So here's my cube with these

299
00:10:30,650 --> 00:10:32,530
three axes that I've drawn in black.

300
00:10:33,080 --> 00:10:34,510
And I've drawn these dash lines

301
00:10:34,900 --> 00:10:36,020
as well to sort of give

302
00:10:36,170 --> 00:10:37,660
you a visual idea of the cube.

303
00:10:38,400 --> 00:10:39,410
But I'm going to actually

304
00:10:39,490 --> 00:10:40,720
get rid of these dash lines right

305
00:10:40,880 --> 00:10:42,220
now just so we don't have too much clutter.

306
00:10:43,310 --> 00:10:44,790
So for our sales example,

307
00:10:45,300 --> 00:10:46,270
we're sticking with the same example,

308
00:10:47,200 --> 00:10:48,320
we have 3 dimensions.

309
00:10:48,830 --> 00:10:49,750
And those will label the three

310
00:10:49,780 --> 00:10:50,850
the three axises of are cube

311
00:10:51,690 --> 00:10:53,000
and in one dimension we will

312
00:10:53,150 --> 00:10:55,190
have the stores and another

313
00:10:55,500 --> 00:10:56,520
dimension we will have the

314
00:10:56,600 --> 00:10:58,840
customers here, and in

315
00:10:58,960 --> 00:11:00,930
another dimension we have the items.

316
00:11:01,970 --> 00:11:03,010
Then we can think of

317
00:11:03,250 --> 00:11:04,520
the points along these axes

318
00:11:05,210 --> 00:11:06,900
as being the different elements

319
00:11:07,510 --> 00:11:08,700
in each of those domains, or

320
00:11:08,970 --> 00:11:11,470
the different tuples in each of those dimension tables.

321
00:11:12,460 --> 00:11:13,980
So for example, in the

322
00:11:14,220 --> 00:11:15,660
store domain, we'll have,

323
00:11:16,160 --> 00:11:18,300
you know, store 1 store 2,

324
00:11:18,790 --> 00:11:20,310
store 3 and so on.

325
00:11:20,620 --> 00:11:22,080
I'm not giving them any fancy names here.

326
00:11:22,880 --> 00:11:23,740
And so, each of those is

327
00:11:23,900 --> 00:11:25,090
a point on that dimension and

328
00:11:25,380 --> 00:11:27,100
similarly for the items will have

329
00:11:27,250 --> 00:11:29,820
item 1 item 2 item 3 and so on.

330
00:11:30,650 --> 00:11:32,140
And for the customers along the bottom,

331
00:11:33,020 --> 00:11:34,280
we'll have customer 1 customer

332
00:11:34,990 --> 00:11:37,010
number 2, customer 3 and so on.

333
00:11:38,150 --> 00:11:39,950
Now here comes the tricky part, especially for drawing.

334
00:11:40,510 --> 00:11:41,380
The idea is Is that

335
00:11:41,830 --> 00:11:43,320
every cell in the

336
00:11:43,400 --> 00:11:44,970
cube, so every combination of

337
00:11:45,180 --> 00:11:46,950
item costumer in store has

338
00:11:47,160 --> 00:11:48,280
a cell in the cube, so

339
00:11:48,370 --> 00:11:50,340
this would be sort of a free floating cell here.

340
00:11:51,050 --> 00:11:52,350
And This will have for our

341
00:11:52,770 --> 00:11:54,770
schema the quantity and

342
00:11:54,920 --> 00:11:56,490
the price for that

343
00:11:56,960 --> 00:11:59,910
item, that customer, and that store.

344
00:12:00,220 --> 00:12:01,440
So this might be the floating

345
00:12:01,750 --> 00:12:02,720
thing here that's, you know, Item

346
00:12:03,120 --> 00:12:05,060
I32, Costumer 4, and

347
00:12:05,710 --> 00:12:07,320
Store 17, something like that.

348
00:12:08,010 --> 00:12:09,520
And then floating in

349
00:12:09,620 --> 00:12:12,240
there is this cell with the quantity and the price.

350
00:12:12,950 --> 00:12:14,130
Now we are assuming that there's

351
00:12:14,350 --> 00:12:15,640
just one quantity and price

352
00:12:15,960 --> 00:12:17,500
for the combination of those three attributes.

353
00:12:17,830 --> 00:12:20,670
And I'll come back to that in a moment, but let's assume that for now.

354
00:12:21,650 --> 00:12:22,720
So that's what we have in

355
00:12:22,970 --> 00:12:24,770
the whole central area of the cube.

356
00:12:25,880 --> 00:12:27,680
So now on the faces, edges,

357
00:12:28,060 --> 00:12:29,130
and corner of the cube

358
00:12:29,670 --> 00:12:31,420
are going to have aggregated data.

359
00:12:32,230 --> 00:12:33,910
And there does need

360
00:12:34,140 --> 00:12:36,320
to be with each data cube a predefined aggregate.

361
00:12:37,460 --> 00:12:39,190
So for this one let's say

362
00:12:39,380 --> 00:12:40,610
that what we want as our

363
00:12:40,760 --> 00:12:42,130
aggregate is the sum

364
00:12:42,340 --> 00:12:43,790
of the quantity times the price

365
00:12:44,050 --> 00:12:44,850
so we're going to figure

366
00:12:44,890 --> 00:12:45,860
out the total amount that we're

367
00:12:46,010 --> 00:12:47,170
making for different combinations

368
00:12:47,820 --> 00:12:48,920
of stores, items, and customers.

369
00:12:49,810 --> 00:12:51,780
So now let's consider a

370
00:12:51,850 --> 00:12:53,300
cell on the face of the cube.

371
00:12:53,970 --> 00:12:56,110
So again, I'm not drawing this very well.

372
00:12:56,300 --> 00:12:58,560
But let's assume this is on the bottom face of the cube.

373
00:12:58,940 --> 00:13:00,730
So, this is for a particular customer.

374
00:13:01,390 --> 00:13:02,640
Say customer 10, in a

375
00:13:02,890 --> 00:13:04,230
particular store, say store 7,

376
00:13:04,530 --> 00:13:06,200
and then, since it's

377
00:13:06,440 --> 00:13:07,170
on the bottom of the cube,

378
00:13:07,490 --> 00:13:08,890
so we didn't go up this dimension

379
00:13:09,440 --> 00:13:11,160
here, it considers all items

380
00:13:11,780 --> 00:13:13,290
for customer 10 and store 7.

381
00:13:13,480 --> 00:13:14,590
So this will be

382
00:13:15,370 --> 00:13:17,570
the aggregate over all items

383
00:13:18,360 --> 00:13:19,930
for that particular store and customer.

384
00:13:21,010 --> 00:13:23,920
And we'd have similar values on the other faces of the cube.

385
00:13:24,240 --> 00:13:25,460
So this face over here, for

386
00:13:25,660 --> 00:13:26,680
example, would be for a

387
00:13:27,010 --> 00:13:29,770
particular item and customer overall stores.

388
00:13:30,700 --> 00:13:32,310
And then on the front face

389
00:13:32,640 --> 00:13:33,490
of the cube, if you could imagine

390
00:13:34,300 --> 00:13:35,250
that, would be for a particular

391
00:13:35,510 --> 00:13:36,810
item and store over all customers.

392
00:13:38,250 --> 00:13:39,920
Now let's talk about what's on the edge of the cube.

393
00:13:40,290 --> 00:13:41,530
So here we have, say

394
00:13:41,970 --> 00:13:43,440
for store 3, we'll

395
00:13:44,180 --> 00:13:45,870
have the aggregate value over

396
00:13:46,440 --> 00:13:49,060
all customers and items

397
00:13:49,590 --> 00:13:51,370
in this point for store 3.

398
00:13:51,600 --> 00:13:52,720
So that will be the

399
00:13:52,810 --> 00:13:54,080
total sales that we

400
00:13:54,330 --> 00:13:55,880
conducted at store S3.

401
00:13:56,880 --> 00:13:57,870
Over here on this edge

402
00:13:58,140 --> 00:13:59,340
we'd have the total for a

403
00:13:59,560 --> 00:14:01,640
specific costumer and over here for specific items.

404
00:14:02,360 --> 00:14:03,770
And then finally, we have

405
00:14:04,020 --> 00:14:05,970
at the corner of the

406
00:14:06,800 --> 00:14:08,050
cube the full aggregation.

407
00:14:08,830 --> 00:14:10,240
So that's going to be in

408
00:14:10,430 --> 00:14:11,750
this case the sum of

409
00:14:11,820 --> 00:14:13,540
the quantity times price for every

410
00:14:13,970 --> 00:14:15,220
store, customer and item.

411
00:14:16,350 --> 00:14:17,520
So, I'm not a great artist,

412
00:14:17,890 --> 00:14:18,850
but I hope this gives you

413
00:14:19,150 --> 00:14:21,100
some understanding of how the data cube works.

414
00:14:21,650 --> 00:14:22,880
So as we saw in the

415
00:14:23,000 --> 00:14:24,140
cube, we have one cell

416
00:14:24,420 --> 00:14:26,080
in the cube for each combination of

417
00:14:26,250 --> 00:14:28,400
store ID, item ID, and customer ID.

418
00:14:29,090 --> 00:14:31,810
So if those three together form a key, then it's very straight forward.

419
00:14:32,030 --> 00:14:33,780
If the dimension attributes together

420
00:14:34,150 --> 00:14:35,450
don't form a key then

421
00:14:35,680 --> 00:14:38,830
we might be pre-aggregating already inside the data cube.

422
00:14:39,130 --> 00:14:40,380
So, we might decide to already

423
00:14:40,630 --> 00:14:41,760
have say the sum of quantity

424
00:14:42,240 --> 00:14:43,710
times price for each combination

425
00:14:44,340 --> 00:14:45,220
of store item and customer.

426
00:14:46,740 --> 00:14:47,640
Another possibility and it's done

427
00:14:47,960 --> 00:14:49,470
quite commonly is to add

428
00:14:49,870 --> 00:14:51,190
to the fact table the attribute

429
00:14:51,600 --> 00:14:52,800
date, or even the time.

430
00:14:53,480 --> 00:14:55,020
And that can be used to create a key.

431
00:14:55,470 --> 00:14:56,500
Typically, we won't have two

432
00:14:57,000 --> 00:14:58,890
transactions at exactly the same time.

433
00:14:59,550 --> 00:15:00,460
Now if we do have an

434
00:15:00,740 --> 00:15:02,450
attribute here called date, one

435
00:15:02,660 --> 00:15:03,790
might wonder is that a

436
00:15:03,970 --> 00:15:05,990
dimension attribute or a dependent attribute.

437
00:15:07,400 --> 00:15:09,050
Actually, it's pretty much

438
00:15:09,290 --> 00:15:10,870
a dimension attribute because we're

439
00:15:11,160 --> 00:15:12,110
gonna use it as another dimension

440
00:15:12,530 --> 00:15:13,760
in our data cube, but the

441
00:15:13,830 --> 00:15:14,610
difference being that we would

442
00:15:14,870 --> 00:15:17,490
not have an actual dimension table listing the dates.

443
00:15:18,400 --> 00:15:19,550
Now let's move on to

444
00:15:19,710 --> 00:15:20,970
a couple other concepts in the

445
00:15:21,080 --> 00:15:23,610
olap world called drill down and roll up.

446
00:15:24,240 --> 00:15:25,710
The idea of drill

447
00:15:26,100 --> 00:15:26,920
down, is that we may

448
00:15:27,160 --> 00:15:30,290
be examining summary data and then we want to get more information.

449
00:15:30,770 --> 00:15:32,680
Drill down into the details of that data.

450
00:15:33,680 --> 00:15:34,430
And, actually, we can think

451
00:15:34,620 --> 00:15:36,590
of that very specifically in

452
00:15:36,690 --> 00:15:38,110
a sequel context as follows.

453
00:15:38,820 --> 00:15:39,790
Let's suppose that we have

454
00:15:40,200 --> 00:15:41,480
this query and sequel which

455
00:15:41,720 --> 00:15:43,350
follows by the way the

456
00:15:43,880 --> 00:15:44,970
description of the query I

457
00:15:45,180 --> 00:15:46,290
had earlier where we'll do a

458
00:15:46,360 --> 00:15:47,840
join and then a selection

459
00:15:48,340 --> 00:15:50,960
and then it grouped by, and finally we have an aggregation here.

460
00:15:51,740 --> 00:15:53,300
So this query specifically is

461
00:15:53,420 --> 00:15:55,120
looking at our total sales broken

462
00:15:55,540 --> 00:15:57,010
out by state and brand.

463
00:15:58,010 --> 00:16:00,690
Maybe we'll look at that and we'll just say that's not enough detail.

464
00:16:01,130 --> 00:16:01,790
I need more information.

465
00:16:02,900 --> 00:16:04,300
So, to drill down what we

466
00:16:04,440 --> 00:16:06,360
do is add a grouping attribute.

467
00:16:06,840 --> 00:16:07,830
So if we added, for example,

468
00:16:08,330 --> 00:16:09,940
category, when we add

469
00:16:10,080 --> 00:16:11,250
another grouping attribute, that gets

470
00:16:11,520 --> 00:16:13,880
us more data in the answer - more detail in our data.

471
00:16:15,280 --> 00:16:16,810
Rollup is exactly the opposite.

472
00:16:17,370 --> 00:16:18,620
Rollup says we're looking at

473
00:16:18,810 --> 00:16:20,010
data and we decide we

474
00:16:20,130 --> 00:16:22,150
have too much detail and we want to summarize.

475
00:16:22,550 --> 00:16:24,200
And summarize is simply

476
00:16:24,540 --> 00:16:26,110
a matter of removing a group by attributes.

477
00:16:26,540 --> 00:16:28,000
So if we took out state, then

478
00:16:28,210 --> 00:16:29,170
now we'll only see our

479
00:16:29,260 --> 00:16:30,520
data summarized by brand

480
00:16:31,090 --> 00:16:32,940
rather than broken out into state and brand.

481
00:16:34,170 --> 00:16:36,250
And lastly, I want to add introduce some SQL constructs.

482
00:16:37,070 --> 00:16:38,150
These are constructs that have been added,

483
00:16:38,500 --> 00:16:39,460
fairly recently, to the SQL

484
00:16:39,810 --> 00:16:42,060
standard in order to perform OLAP queries.

485
00:16:42,460 --> 00:16:43,720
And we'll be seeing these in our demo.

486
00:16:44,390 --> 00:16:45,840
The constructs are called with cube

487
00:16:46,150 --> 00:16:47,530
and with roll up and

488
00:16:47,640 --> 00:16:49,080
they're added to the group by clause.

489
00:16:50,060 --> 00:16:52,080
When we add with cube to

490
00:16:52,360 --> 00:16:53,320
a query with a group by

491
00:16:54,000 --> 00:16:55,060
what happens is that, basically,

492
00:16:55,670 --> 00:16:56,750
we're adding to the result

493
00:16:57,250 --> 00:16:58,540
of our query, the faces,

494
00:16:59,380 --> 00:17:00,850
edges, and corner of the cube.

495
00:17:01,710 --> 00:17:04,340
Using no values for the attributes that we're not constraining.

496
00:17:05,030 --> 00:17:06,060
We'll see this clearly in the demo.

497
00:17:06,890 --> 00:17:10,550
With <u>rollup is similar to with<u>cube, except it's smaller.</u></u>

498
00:17:10,950 --> 00:17:12,040
It actually is a portion of

499
00:17:12,130 --> 00:17:13,460
the data cube, and that

500
00:17:13,620 --> 00:17:14,610
makes sense when we have dimensions

501
00:17:15,120 --> 00:17:16,240
that are inherently hierarchical.

502
00:17:17,060 --> 00:17:18,450
And again we'll see that in the demo as well.

503
00:17:19,400 --> 00:17:20,390
So, we can conclude there are

504
00:17:20,510 --> 00:17:21,790
two broad types of data

505
00:17:21,990 --> 00:17:23,960
base activity, online transaction processing.

506
00:17:24,590 --> 00:17:26,240
Short, simple transactions touching

507
00:17:26,550 --> 00:17:27,390
small portions of the data,

508
00:17:28,160 --> 00:17:29,770
lots of updating and OLAP,

509
00:17:30,120 --> 00:17:32,160
or online analytical processing, where

510
00:17:32,290 --> 00:17:33,800
we have complex queries, long transactions,

511
00:17:34,880 --> 00:17:35,890
might touch a large portion of

512
00:17:35,980 --> 00:17:37,820
the data and might not update the data at all.

513
00:17:38,530 --> 00:17:40,310
For online analytical processing OLAP

514
00:17:40,790 --> 00:17:42,910
we saw that star schemas are frequently used.

515
00:17:43,490 --> 00:17:46,180
We saw how to view the data as a data cube.

516
00:17:46,740 --> 00:17:48,380
Of course, that can be in any number of dimensions.

517
00:17:48,670 --> 00:17:49,780
We just use three for visualization.

518
00:17:50,140 --> 00:17:52,460
There are two new

519
00:17:52,680 --> 00:17:54,760
constructs in sequel with<u>cube and with rollup.</u>

520
00:17:55,580 --> 00:17:56,930
And finally this type of

521
00:17:57,080 --> 00:17:58,580
query can be very stressful on

522
00:17:58,810 --> 00:18:00,380
a database system when we have very large databases.

523
00:18:01,220 --> 00:18:02,530
So special techniques have been

524
00:18:02,830 --> 00:18:05,750
introduced into systems to help perform these queries efficiently.