1
00:00:00,510 --> 00:00:01,810
In this sequence of videos, we'll

2
00:00:02,060 --> 00:00:04,810
learn about designing good schemas for relational databases.

3
00:00:06,050 --> 00:00:07,340
So let's suppose we're building a

4
00:00:07,380 --> 00:00:08,660
database for an application or

5
00:00:08,740 --> 00:00:09,810
set of applications and we

6
00:00:09,930 --> 00:00:12,430
have to figure out what schema we want to store our data.

7
00:00:13,340 --> 00:00:14,510
Usually there are many different

8
00:00:14,930 --> 00:00:16,400
possible schema designs for a

9
00:00:16,710 --> 00:00:19,010
database, and databases do tend to get quite complicated.

10
00:00:20,200 --> 00:00:22,080
And some designs are much better than others.

11
00:00:22,610 --> 00:00:24,590
So how do we choose what design to use?

12
00:00:25,570 --> 00:00:26,760
Now the reality is that people

13
00:00:26,820 --> 00:00:28,400
often use higher level tools

14
00:00:28,890 --> 00:00:30,450
to design relational databases and

15
00:00:30,580 --> 00:00:31,970
don't design the schemas directly themselves.

16
00:00:32,930 --> 00:00:34,080
But some designers do go

17
00:00:34,330 --> 00:00:35,920
straight to relations, and furthermore,

18
00:00:36,440 --> 00:00:37,660
it's useful to understand why

19
00:00:38,180 --> 00:00:39,110
the relations that are produced

20
00:00:39,590 --> 00:00:41,190
by design tools are what

21
00:00:41,420 --> 00:00:42,850
they are.

22
00:00:43,000 --> 00:00:44,040
Furthermore, from an academic point of view,

23
00:00:44,310 --> 00:00:45,040
it turns out there's a very

24
00:00:45,480 --> 00:00:47,180
nice theory for relational data base design.

25
00:00:48,350 --> 00:00:49,670
So let's consider the process of

26
00:00:49,780 --> 00:00:51,360
designing the schema for our

27
00:00:51,480 --> 00:00:53,030
database about students applying to colleges.

28
00:00:54,070 --> 00:00:55,190
Specifically, for a given

29
00:00:55,350 --> 00:00:56,370
student, let's suppose we have

30
00:00:56,600 --> 00:00:58,320
their social security number and

31
00:00:58,520 --> 00:01:00,050
their name, the colleges that

32
00:01:00,220 --> 00:01:01,420
student is applying to, the

33
00:01:01,580 --> 00:01:02,880
high schools they attended and

34
00:01:02,940 --> 00:01:05,010
what city those high schools were in, and the student's hobbies.

35
00:01:06,200 --> 00:01:07,220
So if that's what we want we

36
00:01:07,330 --> 00:01:08,290
can create a single relation

37
00:01:08,730 --> 00:01:09,890
called apply, that has one

38
00:01:10,350 --> 00:01:12,110
attribute for each of those pieces of information.

39
00:01:13,290 --> 00:01:15,260
Now let's take a look at how that database would be populated.

40
00:01:16,360 --> 00:01:17,170
Let's suppose that we have

41
00:01:17,300 --> 00:01:18,550
a student, Anne, with Social Security

42
00:01:18,940 --> 00:01:20,560
number 123, she went to 2

43
00:01:20,780 --> 00:01:21,990
different high schools in Palo

44
00:01:22,270 --> 00:01:23,750
Alto, she plays tennis and

45
00:01:23,910 --> 00:01:25,110
the trumpet, and she's applying

46
00:01:25,530 --> 00:01:26,690
to Stanford, Berkeley, and MIT.

47
00:01:27,730 --> 00:01:28,660
So let's look at some of

48
00:01:28,760 --> 00:01:29,720
the tuples that we would be

49
00:01:30,090 --> 00:01:31,000
having in the apply relation

50
00:01:31,500 --> 00:01:33,050
to represent this information about Anne.

51
00:01:33,820 --> 00:01:35,200
So we'll have 1,2,3, Anne,

52
00:01:36,270 --> 00:01:37,460
her name, she's applying to

53
00:01:37,870 --> 00:01:39,400
Stanford, she went to

54
00:01:39,810 --> 00:01:41,720
Palo Alto High School, and

55
00:01:42,090 --> 00:01:45,400
that's in Palo Alto, and one of her hobbies is tennis.

56
00:01:47,150 --> 00:01:48,200
And then we also have 1

57
00:01:48,310 --> 00:01:51,030
2 3 and she applied to

58
00:01:51,340 --> 00:01:52,910
Berkeley and went to

59
00:01:53,110 --> 00:01:54,540
Palo Alto High School in

60
00:01:54,760 --> 00:01:58,000
Palo Alto and tennis there as well.

61
00:01:58,370 --> 00:01:59,690
Of course she also has

62
00:02:00,340 --> 00:02:01,770
a tuple representing the fact

63
00:02:02,200 --> 00:02:04,260
that she's applying to Berkeley and

64
00:02:04,440 --> 00:02:05,350
and we'll stick with Palo

65
00:02:05,810 --> 00:02:07,340
Alto High School, and she played the trumpet.

66
00:02:08,750 --> 00:02:09,670
And as you can see we'll

67
00:02:09,890 --> 00:02:11,230
have more tuples, we'll have

68
00:02:11,420 --> 00:02:12,540
various Stanford and Berkeleys, we'll

69
00:02:12,890 --> 00:02:13,980
have some for her other high

70
00:02:14,180 --> 00:02:15,180
schools called Gunn High School

71
00:02:15,490 --> 00:02:17,880
also in Palo Alto, and so on.

72
00:02:18,450 --> 00:02:19,460
So if we think about

73
00:02:19,750 --> 00:02:20,660
it we will need a total

74
00:02:21,210 --> 00:02:22,990
of 12 tuples to represent

75
00:02:23,800 --> 00:02:25,510
this information about Ann.

76
00:02:26,710 --> 00:02:28,160
Now do we think that's a good design?

77
00:02:29,580 --> 00:02:31,460
I'm going to argue no, it's not a good design.

78
00:02:31,980 --> 00:02:34,290
There are several types of anomalies in that design.

79
00:02:35,320 --> 00:02:36,870
First of all, we capture information

80
00:02:37,730 --> 00:02:39,770
multiple times in that

81
00:02:40,020 --> 00:02:41,600
design, and I'll give some examples of that.

82
00:02:42,130 --> 00:02:43,550
For example how many times

83
00:02:43,750 --> 00:02:45,320
do we capture the fact that

84
00:02:45,650 --> 00:02:46,890
1 2 3 the Social Security

85
00:02:47,350 --> 00:02:49,140
number is associated with a student named Ann?

86
00:02:49,820 --> 00:02:52,170
We capture that twelve times in our twelve tuples.

87
00:02:53,040 --> 00:02:53,870
How many times do we

88
00:02:54,040 --> 00:02:56,720
capture that Anne went to Palo Alto High School?

89
00:02:56,920 --> 00:02:57,730
We're going to capture that six times.

90
00:02:58,050 --> 00:02:59,550
And we're going to capture

91
00:02:59,720 --> 00:03:01,580
the fact that she plays tennis six times.

92
00:03:02,340 --> 00:03:03,140
And we're going to capture the fact

93
00:03:03,400 --> 00:03:04,590
that she went to apply to

94
00:03:04,790 --> 00:03:06,550
MIT four times, so for

95
00:03:07,000 --> 00:03:08,150
each piece of information, in fact,

96
00:03:08,480 --> 00:03:09,700
we're capturing it many, many times.

97
00:03:10,420 --> 00:03:12,510
So that doesn't seem like a good feature of the design.

98
00:03:13,720 --> 00:03:14,560
The second type is an

99
00:03:14,680 --> 00:03:17,400
update anomaly, and that's really a direct effect of redundancy.

100
00:03:18,420 --> 00:03:19,650
What update anomalies say

101
00:03:20,050 --> 00:03:21,270
is that you can update facts

102
00:03:21,630 --> 00:03:22,820
in some places but not

103
00:03:23,010 --> 00:03:24,620
all all or differently in different places.

104
00:03:26,230 --> 00:03:27,350
So let's take the fact for

105
00:03:27,510 --> 00:03:29,480
example that Ann plays the trumpet.

106
00:03:30,270 --> 00:03:31,300
I might decide to call

107
00:03:31,470 --> 00:03:33,010
that the coronet instead but I

108
00:03:33,490 --> 00:03:34,810
can go ahead and I

109
00:03:35,010 --> 00:03:36,870
can modify, say, three of

110
00:03:37,130 --> 00:03:38,310
the incidences where we captured

111
00:03:38,700 --> 00:03:39,590
the fact about her playing the

112
00:03:39,660 --> 00:03:40,820
trumpet and not the fourth

113
00:03:41,110 --> 00:03:42,230
one and then we end up

114
00:03:42,450 --> 00:03:44,010
with what's effectively an inconsistent database.

115
00:03:45,730 --> 00:03:47,070
And the third type of

116
00:03:47,090 --> 00:03:48,650
anomaly is called a deletion anomaly, and

117
00:03:49,030 --> 00:03:50,160
there's a case where we could inadvertently

118
00:03:51,190 --> 00:03:52,370
completely do a complete

119
00:03:52,630 --> 00:03:54,280
deletion of somebody in the database.

120
00:03:54,800 --> 00:03:56,430
Let's say for example that we

121
00:03:56,710 --> 00:03:58,300
decide that surfing is an

122
00:03:58,480 --> 00:03:59,850
unacceptable hobby for our

123
00:04:00,080 --> 00:04:01,520
college applicants, and we go

124
00:04:01,740 --> 00:04:03,740
ahead and we delete the tuples about surfing.

125
00:04:04,650 --> 00:04:06,030
If we have students who have

126
00:04:06,340 --> 00:04:09,720
surfing as their only hobby, then those students will be deleted completely.

127
00:04:10,640 --> 00:04:11,740
Now you may argue that's the right

128
00:04:11,940 --> 00:04:13,890
thing to do, but probably that isn't what was intended.

129
00:04:15,360 --> 00:04:18,120
So now let's take a look at a very different design for the same data.

130
00:04:18,510 --> 00:04:19,810
Here we have five different

131
00:04:20,130 --> 00:04:21,730
relations, one with the

132
00:04:21,750 --> 00:04:22,770
information about students and their

133
00:04:22,970 --> 00:04:24,030
names, one where they've applied

134
00:04:24,420 --> 00:04:25,830
to colleges, one where they

135
00:04:25,920 --> 00:04:26,850
went to high school, where their

136
00:04:27,010 --> 00:04:29,320
high schools are located and what hobbies the students has.

137
00:04:30,080 --> 00:04:31,740
In this case we have no anomalies.

138
00:04:32,460 --> 00:04:33,600
If we go back and look at

139
00:04:33,670 --> 00:04:36,430
the three different types, they don't occur in this design.

140
00:04:36,860 --> 00:04:39,010
We don't have redundant information, we

141
00:04:39,130 --> 00:04:41,590
don't have the update anomaly or the deletion anomaly.

142
00:04:42,570 --> 00:04:44,950
Furthermore, we can reconstruct all

143
00:04:45,440 --> 00:04:47,420
of the original data from our

144
00:04:47,940 --> 00:04:48,920
first design, so we haven't

145
00:04:49,280 --> 00:04:51,440
lost any information by breaking it up this way.

146
00:04:52,150 --> 00:04:53,790
So in fact this looks like a much better design.

147
00:04:54,950 --> 00:04:55,850
Now let me mention a couple

148
00:04:56,020 --> 00:04:58,090
of modifications to this design that might occur.

149
00:04:58,850 --> 00:05:00,220
Let's suppose, for example, that

150
00:05:00,470 --> 00:05:02,670
the high school name alone is not a key.

151
00:05:03,140 --> 00:05:04,110
So when we break up the

152
00:05:04,200 --> 00:05:05,050
high school name and high school

153
00:05:05,420 --> 00:05:07,790
city, we no longer can identify the high school.

154
00:05:08,520 --> 00:05:09,420
In that case, the preferred

155
00:05:09,920 --> 00:05:11,090
design would be to move

156
00:05:11,360 --> 00:05:13,080
the high school up here so

157
00:05:13,410 --> 00:05:14,910
we'll have that together with

158
00:05:15,120 --> 00:05:17,530
the high school name and then we don't need this relation here.

159
00:05:17,760 --> 00:05:19,330
And actually that's a fine design.

160
00:05:19,710 --> 00:05:20,980
It does not introduce any anomalies,

161
00:05:21,780 --> 00:05:22,720
that's just based on the

162
00:05:22,800 --> 00:05:23,550
fact that we need the name

163
00:05:23,820 --> 00:05:25,980
of the high school together with the city to identify it.

164
00:05:26,820 --> 00:05:28,560
As another example, suppose a

165
00:05:28,770 --> 00:05:29,910
student doesn't want all of

166
00:05:30,000 --> 00:05:31,200
their hobbies revealed to all

167
00:05:31,360 --> 00:05:32,910
of the colleges that they are applying to.

168
00:05:33,460 --> 00:05:36,060
For example, maybe they don't want Stanford to know about their surfing.

169
00:05:37,190 --> 00:05:38,400
If that's the case then we

170
00:05:38,590 --> 00:05:39,820
can modify the design again,

171
00:05:40,170 --> 00:05:41,050
and in that case we would

172
00:05:41,270 --> 00:05:43,710
put the hobby up here with where they're applying to college.

173
00:05:44,600 --> 00:05:45,730
And so that would include the hobbies

174
00:05:46,310 --> 00:05:47,450
that they want to reveal

175
00:05:47,880 --> 00:05:50,260
to those particular colleges, and we'll take away this one.

176
00:05:50,610 --> 00:05:51,730
So it looked like we were

177
00:05:51,870 --> 00:05:53,070
taking our nice, small relations,

178
00:05:53,870 --> 00:05:55,950
and moving back to a design that had bigger relations.

179
00:05:56,870 --> 00:05:58,200
But in this case it was very well motivated.

180
00:05:58,720 --> 00:06:00,190
We needed these attributes together to

181
00:06:00,270 --> 00:06:01,520
identify the high school and

182
00:06:01,710 --> 00:06:03,690
we want it to have our hobbies specific to the colleges.

183
00:06:04,880 --> 00:06:06,190
So what that shows is

184
00:06:06,350 --> 00:06:08,370
that the best design, for an

185
00:06:08,640 --> 00:06:09,980
application for relational databases

186
00:06:10,670 --> 00:06:11,850
depend not only on constructing

187
00:06:12,430 --> 00:06:13,560
the relations well, but also

188
00:06:14,260 --> 00:06:16,560
in what the data is representing in the real world.

189
00:06:17,610 --> 00:06:18,690
So the basic of idea of

190
00:06:18,780 --> 00:06:20,010
what we're going to do is

191
00:06:20,240 --> 00:06:22,560
design by decomposition, specifically,

192
00:06:23,060 --> 00:06:24,070
we're going to do what we

193
00:06:24,200 --> 00:06:25,140
did at the very beginning of this

194
00:06:25,310 --> 00:06:26,450
example, which is start

195
00:06:26,620 --> 00:06:27,680
by creating mega-relations that just

196
00:06:27,810 --> 00:06:29,760
contain attributes for everything

197
00:06:30,190 --> 00:06:31,220
that we want to represent in our

198
00:06:31,410 --> 00:06:32,720
database, then we're going

199
00:06:32,930 --> 00:06:34,380
to decompose those mega relations

200
00:06:34,860 --> 00:06:36,030
into smaller ones that are

201
00:06:36,150 --> 00:06:37,890
better, but still capture the same information.

202
00:06:39,070 --> 00:06:41,770
And most importantly we can do this decomposition automatically.

203
00:06:43,100 --> 00:06:44,880
So how does automatic decomposition work?

204
00:06:45,580 --> 00:06:46,320
In addition to the mega

205
00:06:46,620 --> 00:06:48,050
relations, we're going to specify

206
00:06:48,560 --> 00:06:50,320
formally, properties of the data.The

207
00:06:50,970 --> 00:06:52,320
system is going to use

208
00:06:52,650 --> 00:06:54,170
the properties to decompose the

209
00:06:54,280 --> 00:06:55,730
relations, and then it's

210
00:06:55,840 --> 00:06:57,100
going to guarantee that the final

211
00:06:57,500 --> 00:07:00,290
set of relations satisfy what's called a normal form.

212
00:07:00,570 --> 00:07:02,070
And we'll be formalizing all of this.

213
00:07:02,600 --> 00:07:03,930
But the basic idea behind normal

214
00:07:04,320 --> 00:07:05,300
forms is that they don't

215
00:07:05,600 --> 00:07:06,840
have any of those anomalies that

216
00:07:06,960 --> 00:07:08,600
I showed and they don't lose any information.

217
00:07:09,900 --> 00:07:11,860
So specifically for specification of

218
00:07:11,970 --> 00:07:12,990
properties, we're going to begin

219
00:07:13,150 --> 00:07:14,620
by looking at something called functional dependencies.

220
00:07:15,720 --> 00:07:17,190
And once we specify functional dependencies,

221
00:07:17,240 --> 00:07:18,810
the system will generate

222
00:07:19,290 --> 00:07:20,510
relations that are in

223
00:07:20,620 --> 00:07:22,420
what's called Boyse Codd normal form.

224
00:07:22,790 --> 00:07:23,690
And Boyse and Codd by the

225
00:07:23,760 --> 00:07:27,000
way were two early pioneers in relational databases in general.

226
00:07:27,400 --> 00:07:28,870
Then we're going to

227
00:07:29,080 --> 00:07:30,470
look at another type of

228
00:07:30,550 --> 00:07:32,240
specification called multi valued

229
00:07:32,620 --> 00:07:34,010
dependencies which will add to

230
00:07:34,310 --> 00:07:35,560
functional dependencies and when we

231
00:07:35,680 --> 00:07:36,930
have both functional and multi

232
00:07:37,240 --> 00:07:38,840
valued dependencies, then we

233
00:07:38,950 --> 00:07:40,180
can have what's called fourth

234
00:07:40,600 --> 00:07:42,170
normal form, and again, that

235
00:07:42,380 --> 00:07:43,480
would be relations that are generated

236
00:07:44,040 --> 00:07:46,050
by the system that satisfy the normal form.

237
00:07:47,140 --> 00:07:50,120
Boyce-Codd normal form is stricter than fourth normal form.

238
00:07:50,440 --> 00:07:51,410
Specifically if we make

239
00:07:51,640 --> 00:07:53,540
a big Venn diagram here of

240
00:07:53,750 --> 00:07:55,680
all the relational designs

241
00:07:56,220 --> 00:07:58,110
that satisfied Boyce-Codd Normal Form,

242
00:07:58,330 --> 00:07:59,280
which by the way is very

243
00:07:59,600 --> 00:08:02,340
often abbreviated BCNF, then that

244
00:08:02,890 --> 00:08:03,940
contains all of the

245
00:08:04,020 --> 00:08:06,370
relations that satisfy fourth normal form,

246
00:08:06,830 --> 00:08:07,840
normally abbreviated 4NF.

247
00:08:08,920 --> 00:08:10,250
So every relation that's in

248
00:08:10,570 --> 00:08:11,630
fourth normal form is also

249
00:08:11,950 --> 00:08:14,020
in Boyce-Codd normal form, but not vice versa.

250
00:08:15,120 --> 00:08:15,850
You might be wondering what happened

251
00:08:16,660 --> 00:08:18,480
to first, second and third, normal forms.

252
00:08:19,160 --> 00:08:20,510
So first normal form is

253
00:08:21,020 --> 00:08:21,740
pretty much just a specification

254
00:08:22,590 --> 00:08:23,930
that relations are real

255
00:08:24,210 --> 00:08:26,070
relations with atomic values in each cell.

256
00:08:27,200 --> 00:08:28,840
Second normal form is specifying

257
00:08:29,560 --> 00:08:32,450
something about the way relations are structured with respect to their keys.

258
00:08:33,530 --> 00:08:35,310
Neither of those is discussed very much anymore.

259
00:08:36,110 --> 00:08:37,450
Third normal form is a

260
00:08:37,580 --> 00:08:39,260
slight weakening of Boyce-Codd

261
00:08:39,710 --> 00:08:40,870
normal form and sometimes people

262
00:08:41,010 --> 00:08:42,440
do like to talk about third normal form.

263
00:08:42,630 --> 00:08:45,600
So you can think of third normal form as a little bit of a even bigger circle here.

264
00:08:46,800 --> 00:08:47,610
We're not going to cover third normal

265
00:08:47,960 --> 00:08:49,300
form in this video because

266
00:08:49,700 --> 00:08:50,910
Boyce-Codd normal form is the

267
00:08:50,990 --> 00:08:52,230
most common normal form used

268
00:08:52,720 --> 00:08:54,750
if we have functional dependencies only, and

269
00:08:54,890 --> 00:08:55,890
fourth normal form if we

270
00:08:56,030 --> 00:08:57,430
have functional and multivalued dependencies.

271
00:08:58,990 --> 00:09:00,290
So what's going to happen next is

272
00:09:00,450 --> 00:09:01,850
I'm going to give some examples

273
00:09:02,330 --> 00:09:03,650
to motivate these four concepts:

274
00:09:04,220 --> 00:09:05,900
functional dependencies, Boyce-Codd normal form,

275
00:09:06,120 --> 00:09:07,780
multivalued dependencies normal form,

276
00:09:08,350 --> 00:09:09,660
and then later videos will

277
00:09:09,780 --> 00:09:11,450
go into each one in much greater depth.

278
00:09:12,460 --> 00:09:13,290
So let me just give

279
00:09:13,460 --> 00:09:14,850
the general idea of functional dependencies

280
00:09:15,550 --> 00:09:16,560
and Boyce-Codd Normal Form.

281
00:09:16,850 --> 00:09:17,760
And we'll use a very

282
00:09:18,170 --> 00:09:19,770
simple for example, an abbreviated version

283
00:09:20,090 --> 00:09:21,250
of our apply relation that has

284
00:09:21,490 --> 00:09:23,370
students' social security numbers, the

285
00:09:23,430 --> 00:09:24,730
student's name and their colleges

286
00:09:25,240 --> 00:09:26,400
that the student is applying to.

287
00:09:26,580 --> 00:09:28,700
Even this small relation actually

288
00:09:28,980 --> 00:09:31,270
has redundancy and update and deletion anomalies.

289
00:09:32,310 --> 00:09:33,640
Specifically, let's say that our

290
00:09:33,820 --> 00:09:36,380
student, 123Ann, applies to 7 colleges.

291
00:09:36,960 --> 00:09:38,750
Then there will be 7 tuples and

292
00:09:38,980 --> 00:09:40,810
there will be 7 instances where we

293
00:09:40,970 --> 00:09:42,470
know that a student with the

294
00:09:42,700 --> 00:09:44,890
social security number 123 is named Ann.

295
00:09:45,430 --> 00:09:46,390
Specifically, we're going to store

296
00:09:46,850 --> 00:09:48,620
for every student the name

297
00:09:49,330 --> 00:09:50,580
and social security number pair once

298
00:09:50,910 --> 00:09:52,500
for each college that they apply to.

299
00:09:53,640 --> 00:09:54,630
So now let me explain what

300
00:09:54,840 --> 00:09:56,150
a functional dependency is and then

301
00:09:56,310 --> 00:09:57,990
we'll see how functional dependencies are

302
00:09:58,040 --> 00:09:59,370
used to recognize when we

303
00:09:59,440 --> 00:10:00,390
have a bad design like this

304
00:10:00,570 --> 00:10:02,280
one, and to see how we can fix it.

305
00:10:03,140 --> 00:10:04,300
A functional dependency, in this

306
00:10:04,470 --> 00:10:05,720
case from social security number

307
00:10:06,340 --> 00:10:07,230
to name, and we're saying

308
00:10:07,610 --> 00:10:09,370
social security number functionally determines

309
00:10:10,380 --> 00:10:11,970
the student name says that

310
00:10:12,240 --> 00:10:15,090
the same social security number always has the same name.

311
00:10:15,710 --> 00:10:18,290
In other words, every time we see 123, we're going to see Ann.

312
00:10:18,960 --> 00:10:20,560
Now it doesn't necessarily go in the other direction.

313
00:10:21,050 --> 00:10:21,830
It might not be that whenever

314
00:10:22,110 --> 00:10:23,620
we see Ann, it's 123,

315
00:10:23,710 --> 00:10:25,790
but whenever we see 123, it is Ann.

316
00:10:26,540 --> 00:10:27,620
And so what we'd like to

317
00:10:27,980 --> 00:10:30,360
do is store that relationship just one time.

318
00:10:30,900 --> 00:10:32,960
One time say that for 123, the name is Ann.

319
00:10:34,080 --> 00:10:35,520
Now what Boyce Codd Normal Form

320
00:10:35,710 --> 00:10:37,000
says is that whenever we have

321
00:10:37,330 --> 00:10:39,590
one of these functional dependencies, then

322
00:10:39,940 --> 00:10:42,510
the left hand side of that functional dependency must be a key.

323
00:10:43,330 --> 00:10:44,400
And think about what that's saying.

324
00:10:44,640 --> 00:10:46,060
Remember a key says

325
00:10:46,610 --> 00:10:49,310
that we have just one tupple with each value for that attribute.

326
00:10:50,330 --> 00:10:51,610
So if we have say

327
00:10:51,760 --> 00:10:53,320
social security number to name

328
00:10:53,550 --> 00:10:54,990
as a functional dependency and we

329
00:10:55,230 --> 00:10:56,780
satisfy Boyce-Codd Normal Form,

330
00:10:57,350 --> 00:10:58,400
then we're going to say that

331
00:10:58,600 --> 00:10:59,920
social security number has to

332
00:11:00,060 --> 00:11:00,870
be a key in our relation,

333
00:11:01,420 --> 00:11:02,390
and we'll only have one

334
00:11:02,810 --> 00:11:04,800
tupple for each social security number.

335
00:11:05,460 --> 00:11:08,070
Specifically, we can go back to our original relation.

336
00:11:08,900 --> 00:11:10,740
We have this functional dependency social

337
00:11:11,090 --> 00:11:13,350
security number here is not a key, right?

338
00:11:14,040 --> 00:11:16,790
So then we know that this is not in Boyce-Codd Normal Form.

339
00:11:17,450 --> 00:11:18,430
So we're going to use functional

340
00:11:18,970 --> 00:11:20,360
dependencies to help us

341
00:11:20,530 --> 00:11:22,170
decompose our relation so

342
00:11:22,370 --> 00:11:25,010
that the decomposed relations are in Boyce-Codd Normal Form.

343
00:11:25,650 --> 00:11:26,860
And here's what would happen in this example.

344
00:11:27,850 --> 00:11:29,100
Our functional dependency would tell

345
00:11:29,330 --> 00:11:30,610
us to pull out the social

346
00:11:31,150 --> 00:11:32,300
security number and student name

347
00:11:32,560 --> 00:11:34,240
into its own relation where

348
00:11:34,370 --> 00:11:35,680
the social security number is a

349
00:11:35,910 --> 00:11:36,900
key and then we have

350
00:11:37,150 --> 00:11:38,620
just one time for each

351
00:11:38,810 --> 00:11:39,920
social security number that students

352
00:11:40,050 --> 00:11:42,170
name, and then separately we'll

353
00:11:42,390 --> 00:11:43,680
have the information about the

354
00:11:43,790 --> 00:11:45,690
students and which colleges they applied to.

355
00:11:46,320 --> 00:11:48,240
Again, we'll completely formalize this

356
00:11:48,530 --> 00:11:49,790
whole idea, the definition of

357
00:11:49,900 --> 00:11:51,500
functional dependencies, their properties, the

358
00:11:51,590 --> 00:11:52,500
normal form, and how we

359
00:11:52,600 --> 00:11:54,450
do the decomposition in a later video.

360
00:11:55,670 --> 00:11:57,070
Now, let's similarly motivate the

361
00:11:57,310 --> 00:11:59,920
concept of multi-value dependencies, and fourth normal form.

362
00:12:00,350 --> 00:12:03,190
It is actually a little bit more complicated, but it follows the same rough outline.

363
00:12:04,350 --> 00:12:05,150
Now let's look at a different

364
00:12:05,490 --> 00:12:06,680
portion of the information

365
00:12:06,960 --> 00:12:08,680
about applying and let's suppose,

366
00:12:08,800 --> 00:12:09,670
for now, that we're just concerned

367
00:12:10,170 --> 00:12:12,020
about students, what colleges they're

368
00:12:12,210 --> 00:12:14,260
applying to, and what high schools they went to.

369
00:12:15,050 --> 00:12:17,450
We still have redundancy and update and deletion anomalies.

370
00:12:17,980 --> 00:12:18,940
For example, a student who

371
00:12:19,350 --> 00:12:20,970
applies to Stanford is going

372
00:12:21,190 --> 00:12:22,640
to have that fact captured once

373
00:12:23,050 --> 00:12:24,430
for every high school that they went to.

374
00:12:24,730 --> 00:12:26,320
A student who went to

375
00:12:26,730 --> 00:12:27,750
Palo Alto high School will have

376
00:12:28,370 --> 00:12:30,910
that fact captured once for every college they apply to.

377
00:12:31,810 --> 00:12:34,290
In addition, we get a kind of multiplicative effect here.

378
00:12:34,840 --> 00:12:35,860
Because let's say a student

379
00:12:36,100 --> 00:12:37,510
applies to C colleges

380
00:12:38,470 --> 00:12:39,700
and they went to H

381
00:12:39,960 --> 00:12:41,160
high schools; I know students don't

382
00:12:41,370 --> 00:12:42,040
go to a lot of high schools

383
00:12:42,310 --> 00:12:43,930
but let's suppose that this is one that had moved a lot.

384
00:12:44,710 --> 00:12:47,340
In that case, we're going to have C times H, tuples.

385
00:12:48,300 --> 00:12:49,350
What we'd really like to have

386
00:12:49,480 --> 00:12:50,480
is something more on the

387
00:12:50,590 --> 00:12:51,960
order of C plus H,

388
00:12:52,200 --> 00:12:54,760
because then we'd be capturing each piece of information just once.

389
00:12:55,880 --> 00:12:57,000
Now the interesting thing is that

390
00:12:57,180 --> 00:12:58,470
the badness of this particular

391
00:12:58,920 --> 00:13:00,480
design is not addressed by

392
00:13:00,600 --> 00:13:01,670
Boyce-Codd Normal Form, in fact

393
00:13:01,950 --> 00:13:03,010
this relation is in Boyce-Codd

394
00:13:03,470 --> 00:13:05,940
Normal Form, because it has no functional dependencies.

395
00:13:06,900 --> 00:13:08,060
It's not the case that every

396
00:13:08,400 --> 00:13:09,570
instance of a social security

397
00:13:10,010 --> 00:13:11,170
number is associated with a

398
00:13:11,200 --> 00:13:13,360
single college name or a single high school.

399
00:13:14,280 --> 00:13:15,200
As we will see later, if there

400
00:13:15,310 --> 00:13:16,780
are no functional dependencies, then the

401
00:13:16,850 --> 00:13:18,240
relation is automatically in Boyce-Codd

402
00:13:18,670 --> 00:13:19,930
Normal Form, but it's not

403
00:13:20,470 --> 00:13:21,510
in and fourth normal form.

404
00:13:22,520 --> 00:13:25,520
So fourth normal form is associated with what are called multi-value dependencies.

405
00:13:26,750 --> 00:13:27,970
When we specify a multi-value

406
00:13:28,340 --> 00:13:29,560
dependency as we've done here

407
00:13:29,900 --> 00:13:31,720
with the double arrow, what this

408
00:13:31,800 --> 00:13:32,880
is saying is that if we

409
00:13:33,050 --> 00:13:34,510
take a particular social security

410
00:13:35,040 --> 00:13:36,230
number in the relation,

411
00:13:36,760 --> 00:13:38,200
we will have every combination

412
00:13:39,170 --> 00:13:40,290
of college names that are

413
00:13:40,650 --> 00:13:41,750
associated with that social security

414
00:13:42,230 --> 00:13:43,570
number with every high

415
00:13:43,840 --> 00:13:46,010
school that's associated with that social security number.

416
00:13:46,620 --> 00:13:48,180
We'll actually see that when

417
00:13:48,370 --> 00:13:49,500
we have this multi-value dependency,

418
00:13:50,500 --> 00:13:52,280
we automatically have this one, too.

419
00:13:53,040 --> 00:13:53,920
I know it seems a bit complicated,

420
00:13:54,250 --> 00:13:55,210
and we will formalize it completely,

421
00:13:55,650 --> 00:13:56,920
but for now now just think

422
00:13:57,060 --> 00:13:59,260
about the English statement that multi-valued dependency

423
00:13:59,800 --> 00:14:00,420
is saying that we are going to

424
00:14:00,530 --> 00:14:01,730
have every combination of those

425
00:14:02,380 --> 00:14:03,640
two attributes and values

426
00:14:04,050 --> 00:14:05,910
in those attributes for a given social security number.

427
00:14:06,010 --> 00:14:07,610
In other words, those values

428
00:14:08,070 --> 00:14:09,390
are really independent of each other.

429
00:14:10,360 --> 00:14:11,390
So if we have that situation,

430
00:14:12,300 --> 00:14:13,270
then what we should really do

431
00:14:13,550 --> 00:14:15,010
is store each college name

432
00:14:15,210 --> 00:14:16,140
and each high school for each

433
00:14:16,350 --> 00:14:17,900
social security number one time,

434
00:14:18,660 --> 00:14:21,290
and that's what fourth normal form will do for us.

435
00:14:21,720 --> 00:14:23,920
Fourth normal form, similarly to

436
00:14:24,030 --> 00:14:25,570
Boyce-Codd normal form, says if

437
00:14:25,790 --> 00:14:28,800
we have a dependency, then the left hand side must be a key.

438
00:14:29,250 --> 00:14:30,740
In this case, it's a multi-value dependency

439
00:14:31,210 --> 00:14:32,000
we're looking at, so it's really

440
00:14:32,290 --> 00:14:33,970
saying something different but the

441
00:14:34,140 --> 00:14:35,370
basic idea is the same

442
00:14:35,800 --> 00:14:36,780
which is that we want

443
00:14:37,160 --> 00:14:38,970
only one tuple that has

444
00:14:39,320 --> 00:14:40,960
each value that's appears on

445
00:14:41,110 --> 00:14:41,960
the left hand side of a

446
00:14:42,220 --> 00:14:43,760
multi-value dependency So let's

447
00:14:43,970 --> 00:14:44,880
see what would happen, in this

448
00:14:45,060 --> 00:14:46,260
example, if we use

449
00:14:46,570 --> 00:14:48,690
our multi-value dependencies to decompose

450
00:14:49,240 --> 00:14:51,750
the relation, based on the idea of fourth Normal Form.

451
00:14:52,260 --> 00:14:53,590
Well it is the intuitive thing that happens.

452
00:14:54,230 --> 00:14:56,060
We separate the information about

453
00:14:56,440 --> 00:14:57,490
the college names that a student

454
00:14:57,720 --> 00:14:59,110
applies to from the information

455
00:14:59,430 --> 00:15:00,700
about the high schools themselves, and

456
00:15:00,910 --> 00:15:01,770
then we'll see that that

457
00:15:01,870 --> 00:15:02,990
we only store each fact

458
00:15:03,430 --> 00:15:04,640
once and we do get

459
00:15:04,870 --> 00:15:06,120
the behavior of having C

460
00:15:06,820 --> 00:15:09,240
plus H tuples instead of having C times H tuples.

461
00:15:10,090 --> 00:15:11,590
Like with functional dependencies and Boyce-Codd

462
00:15:11,960 --> 00:15:12,860
Normal Form we'll be completely

463
00:15:13,560 --> 00:15:14,560
formalizing all of this

464
00:15:14,770 --> 00:15:16,550
reasoning and the definitions in later videos.

465
00:15:17,830 --> 00:15:20,190
To summarize, we're going to do relational design by decomposition.

466
00:15:21,240 --> 00:15:22,450
We're going to start by specifying mega

467
00:15:22,770 --> 00:15:24,020
relations that contain all the

468
00:15:24,320 --> 00:15:25,360
information that we want to capture,

469
00:15:26,010 --> 00:15:27,500
as well as specifying properties of

470
00:15:27,620 --> 00:15:30,160
the data usually reflecting the real world in some fashion.

471
00:15:31,210 --> 00:15:32,610
The system can automatically decompose

472
00:15:33,390 --> 00:15:34,860
the mega-relations into smaller relations

473
00:15:35,370 --> 00:15:36,390
based on the properties we specify,

474
00:15:37,420 --> 00:15:38,630
and guarantee that the final

475
00:15:38,810 --> 00:15:40,070
set of relations have certain

476
00:15:40,280 --> 00:15:42,140
good properties captured in a normal form.

477
00:15:42,860 --> 00:15:45,760
They will have no anomalies, and they'll be guaranteed not to lose information.

478
00:15:47,260 --> 00:15:48,830
We'll start by specifying properties as

479
00:15:48,940 --> 00:15:50,400
functional dependencies and from

480
00:15:50,540 --> 00:15:51,760
there the system will guarantee Boyce-Codd

481
00:15:52,310 --> 00:15:53,690
Normal Form and then we'll

482
00:15:54,340 --> 00:15:55,520
add to that properties specified as

483
00:15:56,090 --> 00:15:57,500
multi-value dependencies, and from

484
00:15:57,690 --> 00:15:58,940
there the system will guarantee fourth

485
00:15:59,260 --> 00:16:00,730
normal form, which is even

486
00:16:00,950 --> 00:16:02,400
stronger than Boyce-Codd Normal Form and

487
00:16:02,620 --> 00:16:04,960
is generally thought to be good relational design.