1 00:00:00,510 --> 00:00:01,810 In this sequence of videos, we'll 2 00:00:02,060 --> 00:00:04,810 learn about designing good schemas for relational databases. 3 00:00:06,050 --> 00:00:07,340 So let's suppose we're building a 4 00:00:07,380 --> 00:00:08,660 database for an application or 5 00:00:08,740 --> 00:00:09,810 set of applications and we 6 00:00:09,930 --> 00:00:12,430 have to figure out what schema we want to store our data. 7 00:00:13,340 --> 00:00:14,510 Usually there are many different 8 00:00:14,930 --> 00:00:16,400 possible schema designs for a 9 00:00:16,710 --> 00:00:19,010 database, and databases do tend to get quite complicated. 10 00:00:20,200 --> 00:00:22,080 And some designs are much better than others. 11 00:00:22,610 --> 00:00:24,590 So how do we choose what design to use? 12 00:00:25,570 --> 00:00:26,760 Now the reality is that people 13 00:00:26,820 --> 00:00:28,400 often use higher level tools 14 00:00:28,890 --> 00:00:30,450 to design relational databases and 15 00:00:30,580 --> 00:00:31,970 don't design the schemas directly themselves. 16 00:00:32,930 --> 00:00:34,080 But some designers do go 17 00:00:34,330 --> 00:00:35,920 straight to relations, and furthermore, 18 00:00:36,440 --> 00:00:37,660 it's useful to understand why 19 00:00:38,180 --> 00:00:39,110 the relations that are produced 20 00:00:39,590 --> 00:00:41,190 by design tools are what 21 00:00:41,420 --> 00:00:42,850 they are. 22 00:00:43,000 --> 00:00:44,040 Furthermore, from an academic point of view, 23 00:00:44,310 --> 00:00:45,040 it turns out there's a very 24 00:00:45,480 --> 00:00:47,180 nice theory for relational data base design. 25 00:00:48,350 --> 00:00:49,670 So let's consider the process of 26 00:00:49,780 --> 00:00:51,360 designing the schema for our 27 00:00:51,480 --> 00:00:53,030 database about students applying to colleges. 28 00:00:54,070 --> 00:00:55,190 Specifically, for a given 29 00:00:55,350 --> 00:00:56,370 student, let's suppose we have 30 00:00:56,600 --> 00:00:58,320 their social security number and 31 00:00:58,520 --> 00:01:00,050 their name, the colleges that 32 00:01:00,220 --> 00:01:01,420 student is applying to, the 33 00:01:01,580 --> 00:01:02,880 high schools they attended and 34 00:01:02,940 --> 00:01:05,010 what city those high schools were in, and the student's hobbies. 35 00:01:06,200 --> 00:01:07,220 So if that's what we want we 36 00:01:07,330 --> 00:01:08,290 can create a single relation 37 00:01:08,730 --> 00:01:09,890 called apply, that has one 38 00:01:10,350 --> 00:01:12,110 attribute for each of those pieces of information. 39 00:01:13,290 --> 00:01:15,260 Now let's take a look at how that database would be populated. 40 00:01:16,360 --> 00:01:17,170 Let's suppose that we have 41 00:01:17,300 --> 00:01:18,550 a student, Anne, with Social Security 42 00:01:18,940 --> 00:01:20,560 number 123, she went to 2 43 00:01:20,780 --> 00:01:21,990 different high schools in Palo 44 00:01:22,270 --> 00:01:23,750 Alto, she plays tennis and 45 00:01:23,910 --> 00:01:25,110 the trumpet, and she's applying 46 00:01:25,530 --> 00:01:26,690 to Stanford, Berkeley, and MIT. 47 00:01:27,730 --> 00:01:28,660 So let's look at some of 48 00:01:28,760 --> 00:01:29,720 the tuples that we would be 49 00:01:30,090 --> 00:01:31,000 having in the apply relation 50 00:01:31,500 --> 00:01:33,050 to represent this information about Anne. 51 00:01:33,820 --> 00:01:35,200 So we'll have 1,2,3, Anne, 52 00:01:36,270 --> 00:01:37,460 her name, she's applying to 53 00:01:37,870 --> 00:01:39,400 Stanford, she went to 54 00:01:39,810 --> 00:01:41,720 Palo Alto High School, and 55 00:01:42,090 --> 00:01:45,400 that's in Palo Alto, and one of her hobbies is tennis. 56 00:01:47,150 --> 00:01:48,200 And then we also have 1 57 00:01:48,310 --> 00:01:51,030 2 3 and she applied to 58 00:01:51,340 --> 00:01:52,910 Berkeley and went to 59 00:01:53,110 --> 00:01:54,540 Palo Alto High School in 60 00:01:54,760 --> 00:01:58,000 Palo Alto and tennis there as well. 61 00:01:58,370 --> 00:01:59,690 Of course she also has 62 00:02:00,340 --> 00:02:01,770 a tuple representing the fact 63 00:02:02,200 --> 00:02:04,260 that she's applying to Berkeley and 64 00:02:04,440 --> 00:02:05,350 and we'll stick with Palo 65 00:02:05,810 --> 00:02:07,340 Alto High School, and she played the trumpet. 66 00:02:08,750 --> 00:02:09,670 And as you can see we'll 67 00:02:09,890 --> 00:02:11,230 have more tuples, we'll have 68 00:02:11,420 --> 00:02:12,540 various Stanford and Berkeleys, we'll 69 00:02:12,890 --> 00:02:13,980 have some for her other high 70 00:02:14,180 --> 00:02:15,180 schools called Gunn High School 71 00:02:15,490 --> 00:02:17,880 also in Palo Alto, and so on. 72 00:02:18,450 --> 00:02:19,460 So if we think about 73 00:02:19,750 --> 00:02:20,660 it we will need a total 74 00:02:21,210 --> 00:02:22,990 of 12 tuples to represent 75 00:02:23,800 --> 00:02:25,510 this information about Ann. 76 00:02:26,710 --> 00:02:28,160 Now do we think that's a good design? 77 00:02:29,580 --> 00:02:31,460 I'm going to argue no, it's not a good design. 78 00:02:31,980 --> 00:02:34,290 There are several types of anomalies in that design. 79 00:02:35,320 --> 00:02:36,870 First of all, we capture information 80 00:02:37,730 --> 00:02:39,770 multiple times in that 81 00:02:40,020 --> 00:02:41,600 design, and I'll give some examples of that. 82 00:02:42,130 --> 00:02:43,550 For example how many times 83 00:02:43,750 --> 00:02:45,320 do we capture the fact that 84 00:02:45,650 --> 00:02:46,890 1 2 3 the Social Security 85 00:02:47,350 --> 00:02:49,140 number is associated with a student named Ann? 86 00:02:49,820 --> 00:02:52,170 We capture that twelve times in our twelve tuples. 87 00:02:53,040 --> 00:02:53,870 How many times do we 88 00:02:54,040 --> 00:02:56,720 capture that Anne went to Palo Alto High School? 89 00:02:56,920 --> 00:02:57,730 We're going to capture that six times. 90 00:02:58,050 --> 00:02:59,550 And we're going to capture 91 00:02:59,720 --> 00:03:01,580 the fact that she plays tennis six times. 92 00:03:02,340 --> 00:03:03,140 And we're going to capture the fact 93 00:03:03,400 --> 00:03:04,590 that she went to apply to 94 00:03:04,790 --> 00:03:06,550 MIT four times, so for 95 00:03:07,000 --> 00:03:08,150 each piece of information, in fact, 96 00:03:08,480 --> 00:03:09,700 we're capturing it many, many times. 97 00:03:10,420 --> 00:03:12,510 So that doesn't seem like a good feature of the design. 98 00:03:13,720 --> 00:03:14,560 The second type is an 99 00:03:14,680 --> 00:03:17,400 update anomaly, and that's really a direct effect of redundancy. 100 00:03:18,420 --> 00:03:19,650 What update anomalies say 101 00:03:20,050 --> 00:03:21,270 is that you can update facts 102 00:03:21,630 --> 00:03:22,820 in some places but not 103 00:03:23,010 --> 00:03:24,620 all all or differently in different places. 104 00:03:26,230 --> 00:03:27,350 So let's take the fact for 105 00:03:27,510 --> 00:03:29,480 example that Ann plays the trumpet. 106 00:03:30,270 --> 00:03:31,300 I might decide to call 107 00:03:31,470 --> 00:03:33,010 that the coronet instead but I 108 00:03:33,490 --> 00:03:34,810 can go ahead and I 109 00:03:35,010 --> 00:03:36,870 can modify, say, three of 110 00:03:37,130 --> 00:03:38,310 the incidences where we captured 111 00:03:38,700 --> 00:03:39,590 the fact about her playing the 112 00:03:39,660 --> 00:03:40,820 trumpet and not the fourth 113 00:03:41,110 --> 00:03:42,230 one and then we end up 114 00:03:42,450 --> 00:03:44,010 with what's effectively an inconsistent database. 115 00:03:45,730 --> 00:03:47,070 And the third type of 116 00:03:47,090 --> 00:03:48,650 anomaly is called a deletion anomaly, and 117 00:03:49,030 --> 00:03:50,160 there's a case where we could inadvertently 118 00:03:51,190 --> 00:03:52,370 completely do a complete 119 00:03:52,630 --> 00:03:54,280 deletion of somebody in the database. 120 00:03:54,800 --> 00:03:56,430 Let's say for example that we 121 00:03:56,710 --> 00:03:58,300 decide that surfing is an 122 00:03:58,480 --> 00:03:59,850 unacceptable hobby for our 123 00:04:00,080 --> 00:04:01,520 college applicants, and we go 124 00:04:01,740 --> 00:04:03,740 ahead and we delete the tuples about surfing. 125 00:04:04,650 --> 00:04:06,030 If we have students who have 126 00:04:06,340 --> 00:04:09,720 surfing as their only hobby, then those students will be deleted completely. 127 00:04:10,640 --> 00:04:11,740 Now you may argue that's the right 128 00:04:11,940 --> 00:04:13,890 thing to do, but probably that isn't what was intended. 129 00:04:15,360 --> 00:04:18,120 So now let's take a look at a very different design for the same data. 130 00:04:18,510 --> 00:04:19,810 Here we have five different 131 00:04:20,130 --> 00:04:21,730 relations, one with the 132 00:04:21,750 --> 00:04:22,770 information about students and their 133 00:04:22,970 --> 00:04:24,030 names, one where they've applied 134 00:04:24,420 --> 00:04:25,830 to colleges, one where they 135 00:04:25,920 --> 00:04:26,850 went to high school, where their 136 00:04:27,010 --> 00:04:29,320 high schools are located and what hobbies the students has. 137 00:04:30,080 --> 00:04:31,740 In this case we have no anomalies. 138 00:04:32,460 --> 00:04:33,600 If we go back and look at 139 00:04:33,670 --> 00:04:36,430 the three different types, they don't occur in this design. 140 00:04:36,860 --> 00:04:39,010 We don't have redundant information, we 141 00:04:39,130 --> 00:04:41,590 don't have the update anomaly or the deletion anomaly. 142 00:04:42,570 --> 00:04:44,950 Furthermore, we can reconstruct all 143 00:04:45,440 --> 00:04:47,420 of the original data from our 144 00:04:47,940 --> 00:04:48,920 first design, so we haven't 145 00:04:49,280 --> 00:04:51,440 lost any information by breaking it up this way. 146 00:04:52,150 --> 00:04:53,790 So in fact this looks like a much better design. 147 00:04:54,950 --> 00:04:55,850 Now let me mention a couple 148 00:04:56,020 --> 00:04:58,090 of modifications to this design that might occur. 149 00:04:58,850 --> 00:05:00,220 Let's suppose, for example, that 150 00:05:00,470 --> 00:05:02,670 the high school name alone is not a key. 151 00:05:03,140 --> 00:05:04,110 So when we break up the 152 00:05:04,200 --> 00:05:05,050 high school name and high school 153 00:05:05,420 --> 00:05:07,790 city, we no longer can identify the high school. 154 00:05:08,520 --> 00:05:09,420 In that case, the preferred 155 00:05:09,920 --> 00:05:11,090 design would be to move 156 00:05:11,360 --> 00:05:13,080 the high school up here so 157 00:05:13,410 --> 00:05:14,910 we'll have that together with 158 00:05:15,120 --> 00:05:17,530 the high school name and then we don't need this relation here. 159 00:05:17,760 --> 00:05:19,330 And actually that's a fine design. 160 00:05:19,710 --> 00:05:20,980 It does not introduce any anomalies, 161 00:05:21,780 --> 00:05:22,720 that's just based on the 162 00:05:22,800 --> 00:05:23,550 fact that we need the name 163 00:05:23,820 --> 00:05:25,980 of the high school together with the city to identify it. 164 00:05:26,820 --> 00:05:28,560 As another example, suppose a 165 00:05:28,770 --> 00:05:29,910 student doesn't want all of 166 00:05:30,000 --> 00:05:31,200 their hobbies revealed to all 167 00:05:31,360 --> 00:05:32,910 of the colleges that they are applying to. 168 00:05:33,460 --> 00:05:36,060 For example, maybe they don't want Stanford to know about their surfing. 169 00:05:37,190 --> 00:05:38,400 If that's the case then we 170 00:05:38,590 --> 00:05:39,820 can modify the design again, 171 00:05:40,170 --> 00:05:41,050 and in that case we would 172 00:05:41,270 --> 00:05:43,710 put the hobby up here with where they're applying to college. 173 00:05:44,600 --> 00:05:45,730 And so that would include the hobbies 174 00:05:46,310 --> 00:05:47,450 that they want to reveal 175 00:05:47,880 --> 00:05:50,260 to those particular colleges, and we'll take away this one. 176 00:05:50,610 --> 00:05:51,730 So it looked like we were 177 00:05:51,870 --> 00:05:53,070 taking our nice, small relations, 178 00:05:53,870 --> 00:05:55,950 and moving back to a design that had bigger relations. 179 00:05:56,870 --> 00:05:58,200 But in this case it was very well motivated. 180 00:05:58,720 --> 00:06:00,190 We needed these attributes together to 181 00:06:00,270 --> 00:06:01,520 identify the high school and 182 00:06:01,710 --> 00:06:03,690 we want it to have our hobbies specific to the colleges. 183 00:06:04,880 --> 00:06:06,190 So what that shows is 184 00:06:06,350 --> 00:06:08,370 that the best design, for an 185 00:06:08,640 --> 00:06:09,980 application for relational databases 186 00:06:10,670 --> 00:06:11,850 depend not only on constructing 187 00:06:12,430 --> 00:06:13,560 the relations well, but also 188 00:06:14,260 --> 00:06:16,560 in what the data is representing in the real world. 189 00:06:17,610 --> 00:06:18,690 So the basic of idea of 190 00:06:18,780 --> 00:06:20,010 what we're going to do is 191 00:06:20,240 --> 00:06:22,560 design by decomposition, specifically, 192 00:06:23,060 --> 00:06:24,070 we're going to do what we 193 00:06:24,200 --> 00:06:25,140 did at the very beginning of this 194 00:06:25,310 --> 00:06:26,450 example, which is start 195 00:06:26,620 --> 00:06:27,680 by creating mega-relations that just 196 00:06:27,810 --> 00:06:29,760 contain attributes for everything 197 00:06:30,190 --> 00:06:31,220 that we want to represent in our 198 00:06:31,410 --> 00:06:32,720 database, then we're going 199 00:06:32,930 --> 00:06:34,380 to decompose those mega relations 200 00:06:34,860 --> 00:06:36,030 into smaller ones that are 201 00:06:36,150 --> 00:06:37,890 better, but still capture the same information. 202 00:06:39,070 --> 00:06:41,770 And most importantly we can do this decomposition automatically. 203 00:06:43,100 --> 00:06:44,880 So how does automatic decomposition work? 204 00:06:45,580 --> 00:06:46,320 In addition to the mega 205 00:06:46,620 --> 00:06:48,050 relations, we're going to specify 206 00:06:48,560 --> 00:06:50,320 formally, properties of the data.The 207 00:06:50,970 --> 00:06:52,320 system is going to use 208 00:06:52,650 --> 00:06:54,170 the properties to decompose the 209 00:06:54,280 --> 00:06:55,730 relations, and then it's 210 00:06:55,840 --> 00:06:57,100 going to guarantee that the final 211 00:06:57,500 --> 00:07:00,290 set of relations satisfy what's called a normal form. 212 00:07:00,570 --> 00:07:02,070 And we'll be formalizing all of this. 213 00:07:02,600 --> 00:07:03,930 But the basic idea behind normal 214 00:07:04,320 --> 00:07:05,300 forms is that they don't 215 00:07:05,600 --> 00:07:06,840 have any of those anomalies that 216 00:07:06,960 --> 00:07:08,600 I showed and they don't lose any information. 217 00:07:09,900 --> 00:07:11,860 So specifically for specification of 218 00:07:11,970 --> 00:07:12,990 properties, we're going to begin 219 00:07:13,150 --> 00:07:14,620 by looking at something called functional dependencies. 220 00:07:15,720 --> 00:07:17,190 And once we specify functional dependencies, 221 00:07:17,240 --> 00:07:18,810 the system will generate 222 00:07:19,290 --> 00:07:20,510 relations that are in 223 00:07:20,620 --> 00:07:22,420 what's called Boyse Codd normal form. 224 00:07:22,790 --> 00:07:23,690 And Boyse and Codd by the 225 00:07:23,760 --> 00:07:27,000 way were two early pioneers in relational databases in general. 226 00:07:27,400 --> 00:07:28,870 Then we're going to 227 00:07:29,080 --> 00:07:30,470 look at another type of 228 00:07:30,550 --> 00:07:32,240 specification called multi valued 229 00:07:32,620 --> 00:07:34,010 dependencies which will add to 230 00:07:34,310 --> 00:07:35,560 functional dependencies and when we 231 00:07:35,680 --> 00:07:36,930 have both functional and multi 232 00:07:37,240 --> 00:07:38,840 valued dependencies, then we 233 00:07:38,950 --> 00:07:40,180 can have what's called fourth 234 00:07:40,600 --> 00:07:42,170 normal form, and again, that 235 00:07:42,380 --> 00:07:43,480 would be relations that are generated 236 00:07:44,040 --> 00:07:46,050 by the system that satisfy the normal form. 237 00:07:47,140 --> 00:07:50,120 Boyce-Codd normal form is stricter than fourth normal form. 238 00:07:50,440 --> 00:07:51,410 Specifically if we make 239 00:07:51,640 --> 00:07:53,540 a big Venn diagram here of 240 00:07:53,750 --> 00:07:55,680 all the relational designs 241 00:07:56,220 --> 00:07:58,110 that satisfied Boyce-Codd Normal Form, 242 00:07:58,330 --> 00:07:59,280 which by the way is very 243 00:07:59,600 --> 00:08:02,340 often abbreviated BCNF, then that 244 00:08:02,890 --> 00:08:03,940 contains all of the 245 00:08:04,020 --> 00:08:06,370 relations that satisfy fourth normal form, 246 00:08:06,830 --> 00:08:07,840 normally abbreviated 4NF. 247 00:08:08,920 --> 00:08:10,250 So every relation that's in 248 00:08:10,570 --> 00:08:11,630 fourth normal form is also 249 00:08:11,950 --> 00:08:14,020 in Boyce-Codd normal form, but not vice versa. 250 00:08:15,120 --> 00:08:15,850 You might be wondering what happened 251 00:08:16,660 --> 00:08:18,480 to first, second and third, normal forms. 252 00:08:19,160 --> 00:08:20,510 So first normal form is 253 00:08:21,020 --> 00:08:21,740 pretty much just a specification 254 00:08:22,590 --> 00:08:23,930 that relations are real 255 00:08:24,210 --> 00:08:26,070 relations with atomic values in each cell. 256 00:08:27,200 --> 00:08:28,840 Second normal form is specifying 257 00:08:29,560 --> 00:08:32,450 something about the way relations are structured with respect to their keys. 258 00:08:33,530 --> 00:08:35,310 Neither of those is discussed very much anymore. 259 00:08:36,110 --> 00:08:37,450 Third normal form is a 260 00:08:37,580 --> 00:08:39,260 slight weakening of Boyce-Codd 261 00:08:39,710 --> 00:08:40,870 normal form and sometimes people 262 00:08:41,010 --> 00:08:42,440 do like to talk about third normal form. 263 00:08:42,630 --> 00:08:45,600 So you can think of third normal form as a little bit of a even bigger circle here. 264 00:08:46,800 --> 00:08:47,610 We're not going to cover third normal 265 00:08:47,960 --> 00:08:49,300 form in this video because 266 00:08:49,700 --> 00:08:50,910 Boyce-Codd normal form is the 267 00:08:50,990 --> 00:08:52,230 most common normal form used 268 00:08:52,720 --> 00:08:54,750 if we have functional dependencies only, and 269 00:08:54,890 --> 00:08:55,890 fourth normal form if we 270 00:08:56,030 --> 00:08:57,430 have functional and multivalued dependencies. 271 00:08:58,990 --> 00:09:00,290 So what's going to happen next is 272 00:09:00,450 --> 00:09:01,850 I'm going to give some examples 273 00:09:02,330 --> 00:09:03,650 to motivate these four concepts: 274 00:09:04,220 --> 00:09:05,900 functional dependencies, Boyce-Codd normal form, 275 00:09:06,120 --> 00:09:07,780 multivalued dependencies normal form, 276 00:09:08,350 --> 00:09:09,660 and then later videos will 277 00:09:09,780 --> 00:09:11,450 go into each one in much greater depth. 278 00:09:12,460 --> 00:09:13,290 So let me just give 279 00:09:13,460 --> 00:09:14,850 the general idea of functional dependencies 280 00:09:15,550 --> 00:09:16,560 and Boyce-Codd Normal Form. 281 00:09:16,850 --> 00:09:17,760 And we'll use a very 282 00:09:18,170 --> 00:09:19,770 simple for example, an abbreviated version 283 00:09:20,090 --> 00:09:21,250 of our apply relation that has 284 00:09:21,490 --> 00:09:23,370 students' social security numbers, the 285 00:09:23,430 --> 00:09:24,730 student's name and their colleges 286 00:09:25,240 --> 00:09:26,400 that the student is applying to. 287 00:09:26,580 --> 00:09:28,700 Even this small relation actually 288 00:09:28,980 --> 00:09:31,270 has redundancy and update and deletion anomalies. 289 00:09:32,310 --> 00:09:33,640 Specifically, let's say that our 290 00:09:33,820 --> 00:09:36,380 student, 123Ann, applies to 7 colleges. 291 00:09:36,960 --> 00:09:38,750 Then there will be 7 tuples and 292 00:09:38,980 --> 00:09:40,810 there will be 7 instances where we 293 00:09:40,970 --> 00:09:42,470 know that a student with the 294 00:09:42,700 --> 00:09:44,890 social security number 123 is named Ann. 295 00:09:45,430 --> 00:09:46,390 Specifically, we're going to store 296 00:09:46,850 --> 00:09:48,620 for every student the name 297 00:09:49,330 --> 00:09:50,580 and social security number pair once 298 00:09:50,910 --> 00:09:52,500 for each college that they apply to. 299 00:09:53,640 --> 00:09:54,630 So now let me explain what 300 00:09:54,840 --> 00:09:56,150 a functional dependency is and then 301 00:09:56,310 --> 00:09:57,990 we'll see how functional dependencies are 302 00:09:58,040 --> 00:09:59,370 used to recognize when we 303 00:09:59,440 --> 00:10:00,390 have a bad design like this 304 00:10:00,570 --> 00:10:02,280 one, and to see how we can fix it. 305 00:10:03,140 --> 00:10:04,300 A functional dependency, in this 306 00:10:04,470 --> 00:10:05,720 case from social security number 307 00:10:06,340 --> 00:10:07,230 to name, and we're saying 308 00:10:07,610 --> 00:10:09,370 social security number functionally determines 309 00:10:10,380 --> 00:10:11,970 the student name says that 310 00:10:12,240 --> 00:10:15,090 the same social security number always has the same name. 311 00:10:15,710 --> 00:10:18,290 In other words, every time we see 123, we're going to see Ann. 312 00:10:18,960 --> 00:10:20,560 Now it doesn't necessarily go in the other direction. 313 00:10:21,050 --> 00:10:21,830 It might not be that whenever 314 00:10:22,110 --> 00:10:23,620 we see Ann, it's 123, 315 00:10:23,710 --> 00:10:25,790 but whenever we see 123, it is Ann. 316 00:10:26,540 --> 00:10:27,620 And so what we'd like to 317 00:10:27,980 --> 00:10:30,360 do is store that relationship just one time. 318 00:10:30,900 --> 00:10:32,960 One time say that for 123, the name is Ann. 319 00:10:34,080 --> 00:10:35,520 Now what Boyce Codd Normal Form 320 00:10:35,710 --> 00:10:37,000 says is that whenever we have 321 00:10:37,330 --> 00:10:39,590 one of these functional dependencies, then 322 00:10:39,940 --> 00:10:42,510 the left hand side of that functional dependency must be a key. 323 00:10:43,330 --> 00:10:44,400 And think about what that's saying. 324 00:10:44,640 --> 00:10:46,060 Remember a key says 325 00:10:46,610 --> 00:10:49,310 that we have just one tupple with each value for that attribute. 326 00:10:50,330 --> 00:10:51,610 So if we have say 327 00:10:51,760 --> 00:10:53,320 social security number to name 328 00:10:53,550 --> 00:10:54,990 as a functional dependency and we 329 00:10:55,230 --> 00:10:56,780 satisfy Boyce-Codd Normal Form, 330 00:10:57,350 --> 00:10:58,400 then we're going to say that 331 00:10:58,600 --> 00:10:59,920 social security number has to 332 00:11:00,060 --> 00:11:00,870 be a key in our relation, 333 00:11:01,420 --> 00:11:02,390 and we'll only have one 334 00:11:02,810 --> 00:11:04,800 tupple for each social security number. 335 00:11:05,460 --> 00:11:08,070 Specifically, we can go back to our original relation. 336 00:11:08,900 --> 00:11:10,740 We have this functional dependency social 337 00:11:11,090 --> 00:11:13,350 security number here is not a key, right? 338 00:11:14,040 --> 00:11:16,790 So then we know that this is not in Boyce-Codd Normal Form. 339 00:11:17,450 --> 00:11:18,430 So we're going to use functional 340 00:11:18,970 --> 00:11:20,360 dependencies to help us 341 00:11:20,530 --> 00:11:22,170 decompose our relation so 342 00:11:22,370 --> 00:11:25,010 that the decomposed relations are in Boyce-Codd Normal Form. 343 00:11:25,650 --> 00:11:26,860 And here's what would happen in this example. 344 00:11:27,850 --> 00:11:29,100 Our functional dependency would tell 345 00:11:29,330 --> 00:11:30,610 us to pull out the social 346 00:11:31,150 --> 00:11:32,300 security number and student name 347 00:11:32,560 --> 00:11:34,240 into its own relation where 348 00:11:34,370 --> 00:11:35,680 the social security number is a 349 00:11:35,910 --> 00:11:36,900 key and then we have 350 00:11:37,150 --> 00:11:38,620 just one time for each 351 00:11:38,810 --> 00:11:39,920 social security number that students 352 00:11:40,050 --> 00:11:42,170 name, and then separately we'll 353 00:11:42,390 --> 00:11:43,680 have the information about the 354 00:11:43,790 --> 00:11:45,690 students and which colleges they applied to. 355 00:11:46,320 --> 00:11:48,240 Again, we'll completely formalize this 356 00:11:48,530 --> 00:11:49,790 whole idea, the definition of 357 00:11:49,900 --> 00:11:51,500 functional dependencies, their properties, the 358 00:11:51,590 --> 00:11:52,500 normal form, and how we 359 00:11:52,600 --> 00:11:54,450 do the decomposition in a later video. 360 00:11:55,670 --> 00:11:57,070 Now, let's similarly motivate the 361 00:11:57,310 --> 00:11:59,920 concept of multi-value dependencies, and fourth normal form. 362 00:12:00,350 --> 00:12:03,190 It is actually a little bit more complicated, but it follows the same rough outline. 363 00:12:04,350 --> 00:12:05,150 Now let's look at a different 364 00:12:05,490 --> 00:12:06,680 portion of the information 365 00:12:06,960 --> 00:12:08,680 about applying and let's suppose, 366 00:12:08,800 --> 00:12:09,670 for now, that we're just concerned 367 00:12:10,170 --> 00:12:12,020 about students, what colleges they're 368 00:12:12,210 --> 00:12:14,260 applying to, and what high schools they went to. 369 00:12:15,050 --> 00:12:17,450 We still have redundancy and update and deletion anomalies. 370 00:12:17,980 --> 00:12:18,940 For example, a student who 371 00:12:19,350 --> 00:12:20,970 applies to Stanford is going 372 00:12:21,190 --> 00:12:22,640 to have that fact captured once 373 00:12:23,050 --> 00:12:24,430 for every high school that they went to. 374 00:12:24,730 --> 00:12:26,320 A student who went to 375 00:12:26,730 --> 00:12:27,750 Palo Alto high School will have 376 00:12:28,370 --> 00:12:30,910 that fact captured once for every college they apply to. 377 00:12:31,810 --> 00:12:34,290 In addition, we get a kind of multiplicative effect here. 378 00:12:34,840 --> 00:12:35,860 Because let's say a student 379 00:12:36,100 --> 00:12:37,510 applies to C colleges 380 00:12:38,470 --> 00:12:39,700 and they went to H 381 00:12:39,960 --> 00:12:41,160 high schools; I know students don't 382 00:12:41,370 --> 00:12:42,040 go to a lot of high schools 383 00:12:42,310 --> 00:12:43,930 but let's suppose that this is one that had moved a lot. 384 00:12:44,710 --> 00:12:47,340 In that case, we're going to have C times H, tuples. 385 00:12:48,300 --> 00:12:49,350 What we'd really like to have 386 00:12:49,480 --> 00:12:50,480 is something more on the 387 00:12:50,590 --> 00:12:51,960 order of C plus H, 388 00:12:52,200 --> 00:12:54,760 because then we'd be capturing each piece of information just once. 389 00:12:55,880 --> 00:12:57,000 Now the interesting thing is that 390 00:12:57,180 --> 00:12:58,470 the badness of this particular 391 00:12:58,920 --> 00:13:00,480 design is not addressed by 392 00:13:00,600 --> 00:13:01,670 Boyce-Codd Normal Form, in fact 393 00:13:01,950 --> 00:13:03,010 this relation is in Boyce-Codd 394 00:13:03,470 --> 00:13:05,940 Normal Form, because it has no functional dependencies. 395 00:13:06,900 --> 00:13:08,060 It's not the case that every 396 00:13:08,400 --> 00:13:09,570 instance of a social security 397 00:13:10,010 --> 00:13:11,170 number is associated with a 398 00:13:11,200 --> 00:13:13,360 single college name or a single high school. 399 00:13:14,280 --> 00:13:15,200 As we will see later, if there 400 00:13:15,310 --> 00:13:16,780 are no functional dependencies, then the 401 00:13:16,850 --> 00:13:18,240 relation is automatically in Boyce-Codd 402 00:13:18,670 --> 00:13:19,930 Normal Form, but it's not 403 00:13:20,470 --> 00:13:21,510 in and fourth normal form. 404 00:13:22,520 --> 00:13:25,520 So fourth normal form is associated with what are called multi-value dependencies. 405 00:13:26,750 --> 00:13:27,970 When we specify a multi-value 406 00:13:28,340 --> 00:13:29,560 dependency as we've done here 407 00:13:29,900 --> 00:13:31,720 with the double arrow, what this 408 00:13:31,800 --> 00:13:32,880 is saying is that if we 409 00:13:33,050 --> 00:13:34,510 take a particular social security 410 00:13:35,040 --> 00:13:36,230 number in the relation, 411 00:13:36,760 --> 00:13:38,200 we will have every combination 412 00:13:39,170 --> 00:13:40,290 of college names that are 413 00:13:40,650 --> 00:13:41,750 associated with that social security 414 00:13:42,230 --> 00:13:43,570 number with every high 415 00:13:43,840 --> 00:13:46,010 school that's associated with that social security number. 416 00:13:46,620 --> 00:13:48,180 We'll actually see that when 417 00:13:48,370 --> 00:13:49,500 we have this multi-value dependency, 418 00:13:50,500 --> 00:13:52,280 we automatically have this one, too. 419 00:13:53,040 --> 00:13:53,920 I know it seems a bit complicated, 420 00:13:54,250 --> 00:13:55,210 and we will formalize it completely, 421 00:13:55,650 --> 00:13:56,920 but for now now just think 422 00:13:57,060 --> 00:13:59,260 about the English statement that multi-valued dependency 423 00:13:59,800 --> 00:14:00,420 is saying that we are going to 424 00:14:00,530 --> 00:14:01,730 have every combination of those 425 00:14:02,380 --> 00:14:03,640 two attributes and values 426 00:14:04,050 --> 00:14:05,910 in those attributes for a given social security number. 427 00:14:06,010 --> 00:14:07,610 In other words, those values 428 00:14:08,070 --> 00:14:09,390 are really independent of each other. 429 00:14:10,360 --> 00:14:11,390 So if we have that situation, 430 00:14:12,300 --> 00:14:13,270 then what we should really do 431 00:14:13,550 --> 00:14:15,010 is store each college name 432 00:14:15,210 --> 00:14:16,140 and each high school for each 433 00:14:16,350 --> 00:14:17,900 social security number one time, 434 00:14:18,660 --> 00:14:21,290 and that's what fourth normal form will do for us. 435 00:14:21,720 --> 00:14:23,920 Fourth normal form, similarly to 436 00:14:24,030 --> 00:14:25,570 Boyce-Codd normal form, says if 437 00:14:25,790 --> 00:14:28,800 we have a dependency, then the left hand side must be a key. 438 00:14:29,250 --> 00:14:30,740 In this case, it's a multi-value dependency 439 00:14:31,210 --> 00:14:32,000 we're looking at, so it's really 440 00:14:32,290 --> 00:14:33,970 saying something different but the 441 00:14:34,140 --> 00:14:35,370 basic idea is the same 442 00:14:35,800 --> 00:14:36,780 which is that we want 443 00:14:37,160 --> 00:14:38,970 only one tuple that has 444 00:14:39,320 --> 00:14:40,960 each value that's appears on 445 00:14:41,110 --> 00:14:41,960 the left hand side of a 446 00:14:42,220 --> 00:14:43,760 multi-value dependency So let's 447 00:14:43,970 --> 00:14:44,880 see what would happen, in this 448 00:14:45,060 --> 00:14:46,260 example, if we use 449 00:14:46,570 --> 00:14:48,690 our multi-value dependencies to decompose 450 00:14:49,240 --> 00:14:51,750 the relation, based on the idea of fourth Normal Form. 451 00:14:52,260 --> 00:14:53,590 Well it is the intuitive thing that happens. 452 00:14:54,230 --> 00:14:56,060 We separate the information about 453 00:14:56,440 --> 00:14:57,490 the college names that a student 454 00:14:57,720 --> 00:14:59,110 applies to from the information 455 00:14:59,430 --> 00:15:00,700 about the high schools themselves, and 456 00:15:00,910 --> 00:15:01,770 then we'll see that that 457 00:15:01,870 --> 00:15:02,990 we only store each fact 458 00:15:03,430 --> 00:15:04,640 once and we do get 459 00:15:04,870 --> 00:15:06,120 the behavior of having C 460 00:15:06,820 --> 00:15:09,240 plus H tuples instead of having C times H tuples. 461 00:15:10,090 --> 00:15:11,590 Like with functional dependencies and Boyce-Codd 462 00:15:11,960 --> 00:15:12,860 Normal Form we'll be completely 463 00:15:13,560 --> 00:15:14,560 formalizing all of this 464 00:15:14,770 --> 00:15:16,550 reasoning and the definitions in later videos. 465 00:15:17,830 --> 00:15:20,190 To summarize, we're going to do relational design by decomposition. 466 00:15:21,240 --> 00:15:22,450 We're going to start by specifying mega 467 00:15:22,770 --> 00:15:24,020 relations that contain all the 468 00:15:24,320 --> 00:15:25,360 information that we want to capture, 469 00:15:26,010 --> 00:15:27,500 as well as specifying properties of 470 00:15:27,620 --> 00:15:30,160 the data usually reflecting the real world in some fashion. 471 00:15:31,210 --> 00:15:32,610 The system can automatically decompose 472 00:15:33,390 --> 00:15:34,860 the mega-relations into smaller relations 473 00:15:35,370 --> 00:15:36,390 based on the properties we specify, 474 00:15:37,420 --> 00:15:38,630 and guarantee that the final 475 00:15:38,810 --> 00:15:40,070 set of relations have certain 476 00:15:40,280 --> 00:15:42,140 good properties captured in a normal form. 477 00:15:42,860 --> 00:15:45,760 They will have no anomalies, and they'll be guaranteed not to lose information. 478 00:15:47,260 --> 00:15:48,830 We'll start by specifying properties as 479 00:15:48,940 --> 00:15:50,400 functional dependencies and from 480 00:15:50,540 --> 00:15:51,760 there the system will guarantee Boyce-Codd 481 00:15:52,310 --> 00:15:53,690 Normal Form and then we'll 482 00:15:54,340 --> 00:15:55,520 add to that properties specified as 483 00:15:56,090 --> 00:15:57,500 multi-value dependencies, and from 484 00:15:57,690 --> 00:15:58,940 there the system will guarantee fourth 485 00:15:59,260 --> 00:16:00,730 normal form, which is even 486 00:16:00,950 --> 00:16:02,400 stronger than Boyce-Codd Normal Form and 487 00:16:02,620 --> 00:16:04,960 is generally thought to be good relational design.