In the previous video, we learned the basics of XML. In this video, we're going to learn about Document Type Definitions, also known as DTDs, and also ID and IDREF attributes. We learned that well-formed XML is XML that adheres to basic structural requirements: a single root element, matched tags with proper nesting, and unique attributes within each element. Now we're going to learn about what's known as valid XML. Valid XML has to adhere to the same basic structural requirements as well-formed XML, but it also adheres to content-specific specifications. And we're going to learn two languages for those specifications. One of them is Document Type Definitions, or DTDs, and the other, a more powerful language, is XML Schema. Specifications in XML Schema are known as XSDs, for XML Schema Definitions. So as a reminder, here's how things worked with well-formed XML documents. We sent the document to a parser, and the parser would either report that the document was not well-formed or return parsed XML. Now let's consider what happens with valid XML. Now we use a validating XML parser, and we have an additional input to the process, which is a specification, either a DTD or an XSD. So that's also fed to the parser, along with the document. The parser can again say the document is not well-formed if it doesn't meet the basic structural requirements. It could also say that the document is not valid, meaning the structure of the document doesn't match the content-specific specification. If everything is good, then once again parsed XML is returned. Now let's talk about Document Type Definitions, or DTDs. We see a DTD in the lower-left corner of the video, but we won't look at it in any detail, because we'll be doing demos of DTDs a little later on.
A DTD is a language that's kind of like a grammar, and what you can specify in that language is, for a particular document, what elements you want that document to contain: the tags of the elements, what attributes can be in the elements, and how the different types of elements can be nested. Sometimes you might want to specify the ordering of the elements, and sometimes the number of occurrences of different elements. DTDs also allow the introduction of special types of attributes, called IDs and IDREFs. And, effectively, what these allow you to do is specify pointers within a document, although these pointers are untyped. Before moving to the demo, let's talk a little bit about the positives and negatives of choosing to use a DTD or an XSD for one's XML data. After all, if you're building an application that encodes its data in XML, you'll have to decide whether you want the XML to just be well-formed, or whether you want to have specifications and require the XML to be valid with respect to those specifications. So, let's list a few positives of choosing the latter, that is, requiring a DTD or an XSD. First of all, when you write your program, you can assume that the data adheres to a specific structure. So the programs themselves are simpler, because they don't have to do a lot of error checking on the data. They'll know that before the data reaches the program, it's been run through a validator and it does satisfy a particular structure. Second of all, we talked some time ago about the cascading style sheet language and the extensible stylesheet language. These are languages that take XML and run rules on it to process it into a different form, often HTML. When you write those rules, if you know that the data has a certain structure, then those rules can be simpler. So, like the programs, they can assume a particular structure, and that makes them simpler.
Now, another use for DTDs or XSDs is as a specification language for conveying what XML needs to look like. So, as an example, if you're performing data exchange using XML, maybe a company is going to receive purchase orders in XML, the company can actually use the DTD as a specification for what the XML needs to look like when it arrives at the program that's going to operate on it. Also, for documentation, it can be useful to use one of these specifications just to document what the data itself looks like. In general, really what we have here is the benefits of typing. We're talking about strongly typed data versus loosely typed data, if you want to think of it that way. Now let's look at when we might prefer not to use a DTD. So what I'm going to describe down here is the benefits of not using a DTD. The biggest benefit is flexibility. A DTD makes your XML data have to conform to a specification. If you want more flexibility, or you want ease of change in the way the data is formatted without running into a lot of errors, then the DTD can be constraining. Another factor is that DTDs can be fairly messy. This is not going to be obvious to you until we get into the demo, but if the data is very irregular, then specifying its structure can be hard. Actually, when we see the schema language, we'll discover that XSDs can be, I would say, really messy, and they can get very large. It's possible to have a document where the specification of the structure of the document is much, much larger than the document itself, which seems not entirely intuitive, but when we get to learn about XSDs, I think you'll see how that can happen. So, overall, these are the benefits of no typing. It's really quite analogous to the situation in programming languages. The remainder of this video will teach about the DTDs themselves through a set of examples.
We'll have a separate video for learning about XML Schema and XSDs. So, here we are with our first document that we're going to look at with a document type definition. We have on the left the document itself. We have on the right the DTD, and then we have in the lower right a command-line shell that we're going to use to validate the document. So this is similar data to what we saw in the last video, but let's go through it just to see what we have. We have an outermost element called bookstore, and we have two books in our bookstore. The first book has an ISBN number, price, and edition as attributes, and then it has a subelement called title and another subelement called authors, with two authors underneath, each with a first name and a last name. The second book element is similar, except it doesn't have an edition. It also has, as we see, a remark. Now let's take a look at the DTD, and I'm just going to walk through the DTD, not too slowly, not too fast, and explain exactly what it's doing. So the start of the DTD says this is a DTD named bookstore and the root element is called bookstore, and now we have the first grammar-like construct. These constructs, in fact, are a little bit like regular expressions, if you know them. What this says is that a bookstore element has as its subelements any number of elements that are called book or magazine. The vertical bar is the "or" here: book or magazine. We don't have any magazines yet, but we'll add one. And then this star says zero or more instances. It's the Kleene closure operator, for those of you familiar with regular expressions. Now let's talk about what the book element has, so that's our next specification. The book element has a title, followed by authors, followed by an optional remark. So now we don't have an "or", we have a comma, and that says that these are going to be in that order: title, authors, and remark. And the question mark says that the remark is optional. Next we have the attributes of our book elements.
So this !ATTLIST ("bang attribute list") says we're going to describe the attributes, and we're going to have three of them: the ISBN, the price, and the edition. CDATA is the type of the attribute. It's just a string. And then #REQUIRED says that the attribute must be present, whereas #IMPLIED says it doesn't have to be present. As you may remember, we have one book that doesn't have an edition. Our magazines are simply going to have titles, and they're going to have attributes that are month and year. Again, we don't have any magazines yet. A title is going to consist of string data. So here we see our title, A First Course in Database Systems. You can think of that as the leaf data in the XML tree. And when you have a leaf that consists of text data, this is what you put in the DTD, just take my word for it: #PCDATA in parentheses. Now, our authors element still has structure. It has author subelements, and we're going to specify here that the authors element must have one or more author subelements. That's what the plus is saying here, again taken from regular expressions. Plus means one or more instances. We have the remark, which is just going to be #PCDATA, or string data. We have our author elements, which consist of a first name subelement and a last name subelement, in that order. And then finally, our first names and last names are also strings. So, this is the entire DTD, and it describes in detail the structure of our document. Now we have a command, we're using something called xmllint, that will check to see if the document meets the structure. We'll just run that command here with a couple of options, and it doesn't give us any output, which actually means that our document is correct. We'll be making some edits and seeing what happens when we run the command on a document that is not correct.
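The document and DTD walked through above are only visible on screen in the video. As a rough sketch of what they might look like, here is a single file with the DTD as an internal subset followed by the data. The structure follows the description; the exact spelling and capitalization of element and attribute names, the ISBN, price, and edition values, and the remark text are placeholders assumed for illustration.

```xml
<?xml version="1.0"?>
<!DOCTYPE Bookstore [
  <!-- Root: any number of Book or Magazine elements, in any order -->
  <!ELEMENT Bookstore (Book | Magazine)*>
  <!-- Book: Title, then Authors, then an optional Remark -->
  <!ELEMENT Book (Title, Authors, Remark?)>
  <!ATTLIST Book ISBN    CDATA #REQUIRED
                 Price   CDATA #REQUIRED
                 Edition CDATA #IMPLIED>
  <!ELEMENT Magazine (Title)>
  <!ATTLIST Magazine Month CDATA #REQUIRED
                     Year  CDATA #REQUIRED>
  <!ELEMENT Title (#PCDATA)>
  <!-- Authors: one or more Author subelements -->
  <!ELEMENT Authors (Author+)>
  <!ELEMENT Remark (#PCDATA)>
  <!ELEMENT Author (First_Name, Last_Name)>
  <!ELEMENT First_Name (#PCDATA)>
  <!ELEMENT Last_Name (#PCDATA)>
]>
<Bookstore>
  <!-- ISBN, Price, and Edition values below are made-up placeholders -->
  <Book ISBN="ISBN-0-00-000000-1" Price="85" Edition="3rd">
    <Title>A First Course in Database Systems</Title>
    <Authors>
      <Author>
        <First_Name>Jeffrey</First_Name>
        <Last_Name>Ullman</Last_Name>
      </Author>
      <Author>
        <First_Name>Jennifer</First_Name>
        <Last_Name>Widom</Last_Name>
      </Author>
    </Authors>
  </Book>
  <!-- No Edition attribute here, which #IMPLIED allows -->
  <Book ISBN="ISBN-0-00-000000-2" Price="100">
    <Title>Database Systems: The Complete Book</Title>
    <Authors>
      <Author>
        <First_Name>Hector</First_Name>
        <Last_Name>Garcia-Molina</Last_Name>
      </Author>
      <Author>
        <First_Name>Jeffrey</First_Name>
        <Last_Name>Ullman</Last_Name>
      </Author>
      <Author>
        <First_Name>Jennifer</First_Name>
        <Last_Name>Widom</Last_Name>
      </Author>
    </Authors>
    <Remark>Placeholder remark text.</Remark>
  </Book>
</Bookstore>
```

The video does not say which xmllint options were used. One plausible invocation, assuming the file is saved under the hypothetical name bookstore.xml, is `xmllint --valid --noout bookstore.xml`, which validates against the DTD named in the DOCTYPE and prints nothing when the document is valid.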
So let's make our first edit. Let's say that we decide that we want the edition attribute of our books to be #REQUIRED rather than #IMPLIED. So we'll change the DTD. We'll save the file, and now we run our command. As expected, we got an error, and the error said that one of our book elements does not have the attribute edition. Now that edition is required, every book element ought to have it. So let's add an edition to our second book. Let's say that it's the second edition. We save the file, we validate our document again, and now everything is good. Let's do an edit to the document this time, to see what happens when we change the order of the first name and the last name. So we've swapped Jeffrey Ullman to be Ullman Jeffrey. We validate our document, and now we see we got an error because the elements are not in the correct order. In this case, let's undo that change rather than change our DTD. Let's try another edit to our document. Let's add a remark to our first book. But what we'll do is leave the remark empty, so we'll add an opening tag and then directly a closing tag, and let's see if that validates. So, it did validate. In fact, when we have #PCDATA as the type of an element, it's perfectly acceptable to have an empty element. As a final change, let's add a magazine to our database. You'll have to bear with me as I type. I'm always a little bit slow. We see over here that when we have a magazine there are two required attributes, the month and the year. So, let's say the month is January, and the year, let's make that 2011, and then we have a title for our magazine. We'll go down here. Our title, let's make it National Geographic. We'll close the title tag. And then, sorry again about my typing, let's go ahead and validate the document. We saw "premature end" of something or other. We forgot our closing tag for magazine, so let's put that in. My terrible typing, and here we go. Let's validate, and we're done.
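The first and last edits in this sequence might look like the two fragments below. This is a sketch, not the text from the video: the names and capitalization are assumed, the attribute declaration belongs inside the DTD, and the magazine element belongs inside the bookstore element.

```xml
<!-- DTD edit: Edition changes from #IMPLIED to #REQUIRED,
     so every Book element must now carry an Edition attribute -->
<!ATTLIST Book ISBN    CDATA #REQUIRED
               Price   CDATA #REQUIRED
               Edition CDATA #REQUIRED>

<!-- Document edit: a magazine with both required attributes and a title.
     Omitting the closing Magazine tag makes the document not well-formed,
     which is the "premature end" error seen above. -->
<Magazine Month="January" Year="2011">
  <Title>National Geographic</Title>
</Magazine>
```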
Now we're going to learn about ID and IDREF attributes. The document on the left side contains the same data as our previous document, but completely restructured. Instead of having authors as subelements of book elements, we're going to have our authors listed separately, and then effectively point from the books to the authors of the book. We'll take a look at the data first, and then we'll look at the DTD that describes the data. Let's actually start with the authors. Our bookstore element here has two subelements that are books and three that are authors. So, looking at the authors, we have the first name and last name as subelements as usual, but we've added what we call the ident attribute. That's not a keyword; we've just called the attribute ident. And then for each of the three authors, we've given a string value to that attribute that we're going to use, effectively, for the pointers in the books. So we have our three authors. Now let's take a look at the books. Our book has the ISBN number and price. I've taken the edition out for now. And it has a new, special attribute called authors. Authors is an IDREFS attribute, and its value can refer to one or more strings that are ID attributes in other elements. So that's what we're doing here. We're referring to the two author elements here. And in our second book we're referring to the three author elements. We still have the title subelement and we still have the remark subelement. And furthermore, we have one other cute thing here, which is, instead of referring to the book by name within the remark when we're talking about the other book, we have another type of pointer. So we'll specify that the ISBN is an ID for books, and then this is an IDREF attribute that's referring to the ID of the other book. Now let's look at the DTD on the right that describes the structure of this document. This time our bookstore is going to contain zero or more books followed by zero or more authors.
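As a sketch of the restructured data described above, the document body might look like the following. The identifiers HG and JU appear later in the demo; JW, the ISBN and price values, the remark text, and the exact element and attribute names are assumptions made for illustration.

```xml
<Bookstore>
  <!-- Authors is an IDREFS attribute: space-separated ID values -->
  <Book ISBN="ISBN-0-00-000000-1" Price="85" Authors="JU JW">
    <Title>A First Course in Database Systems</Title>
  </Book>
  <Book ISBN="ISBN-0-00-000000-2" Price="100" Authors="HG JU JW">
    <Title>Database Systems: The Complete Book</Title>
    <!-- BookRef points at the other book through its ISBN, which is an ID -->
    <Remark>Often read alongside <BookRef book="ISBN-0-00-000000-1"/> in a first course.</Remark>
  </Book>
  <!-- Each author carries an ID attribute that the books point to -->
  <Author Ident="HG">
    <First_Name>Hector</First_Name>
    <Last_Name>Garcia-Molina</Last_Name>
  </Author>
  <Author Ident="JU">
    <First_Name>Jeffrey</First_Name>
    <Last_Name>Ullman</Last_Name>
  </Author>
  <Author Ident="JW">
    <First_Name>Jennifer</First_Name>
    <Last_Name>Widom</Last_Name>
  </Author>
</Bookstore>
```

Note that ID values must be XML names, so they cannot start with a digit; that is one reason a placeholder like the one above begins with letters.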
Our books contain a title and an optional remark as subelements, and now they contain three attributes: the ISBN, which is now a special type of attribute called an ID; the price, which is a string value as usual; and the authors, which is the special type called IDREFS. Let's keep going. Our title is just a string value as usual. A remark, here, is actually an interesting construct. A remark consists of #PCDATA, which is a string, or a book reference, and then zero or more instances of those. This is the type of construct that can be used to mix strings and subelements within an element. So anytime you want an element that might have some string, and then another element, and then more string value, that's how it's done: #PCDATA or the element type, zero or more. Then we have our book reference, which is actually an empty element. It's only interesting because it has an attribute. So let's go back here. We see our book reference here. It actually doesn't have any data or subelements, but it has an attribute called book, and that is an IDREF. That means it refers to an ID attribute of another element. Now we have our authors, with the first name and the last name, and our author attributes again have an ID, and we're calling it the ident. And finally, the first name and last name are string values. This may seem overwhelming, but the key points in this DTD are the ID attributes. So the ID attributes, the ISBN attribute in the book and the ident attribute in the author, are special attributes, and by the way, the values for those attributes do need to be unique. They're special in that IDREFS attributes can refer to them, and that will be checked as well. Now, I did want to point out that the book reference here says IDREF, singular. When you have a singular IDREF, then the string has to be exactly one ID value. When you have the plural IDREFS, then the string of the attribute is one or more ID values separated by spaces. So it's a little bit clunky, but it does seem to work.
Now let's go to our command line, and let's validate the document. So the document is in fact valid. That's what it means when we get nothing back. And let's make some changes, as we did before, to explore what structure is imposed and what's checked with this DTD in the presence of IDs and IDREFs. As a first change, let's change this ID, this identifier HG, to JU. That should actually cause a couple of problems. When we do that, let's validate the document and see what happens. And we do in fact get two different errors. The first error says that we have two instances of JU. As you can see here, we now have JU twice, whereas ID values do have to be unique. They have to be globally unique throughout the document. The second error that occurred when we changed HG to JU is that we effectively have a dangling pointer. We refer to HG here in this IDREFS attribute, but there's no longer an element whose ID value is HG. So that's an error as well. So let's change it back to HG, just so our document is valid again. Now let's make another change. Let's take our book reference. We can see that our book reference is referring to the other book. We're in the Complete Book here, and the remark is referring to the First Course through the ISBN number. But let's change this string instead to refer to HG. So now we're actually referring to an author rather than another book. Let's check if the document validates. In fact it does. And that shows that the pointers, when you have a DTD, are untyped. So it does check to make sure that this is an ID of another element, but we weren't able to specify in our DTD that it should be a book element, and since we're not able to specify it, of course it's not possible to check it.
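The DTD described in this walkthrough might be written along the following lines. This is a reconstruction from the spoken description, so the name spellings and the use of #REQUIRED on each attribute are assumptions; the content models and the ID, IDREF, and IDREFS attribute types follow what is said above.

```xml
<!DOCTYPE Bookstore [
  <!-- Zero or more books, followed by zero or more authors -->
  <!ELEMENT Bookstore (Book*, Author*)>
  <!ELEMENT Book (Title, Remark?)>
  <!-- ISBN is an ID (unique across the whole document);
       Authors is IDREFS (one or more ID values separated by spaces) -->
  <!ATTLIST Book ISBN    ID     #REQUIRED
                 Price   CDATA  #REQUIRED
                 Authors IDREFS #REQUIRED>
  <!ELEMENT Title (#PCDATA)>
  <!-- Mixed content: text and BookRef elements, freely interleaved -->
  <!ELEMENT Remark (#PCDATA | BookRef)*>
  <!-- Empty element that exists only to carry a pointer -->
  <!ELEMENT BookRef EMPTY>
  <!-- IDREF, singular: exactly one ID value, of any element type -->
  <!ATTLIST BookRef book IDREF #REQUIRED>
  <!ELEMENT Author (First_Name, Last_Name)>
  <!ATTLIST Author Ident ID #REQUIRED>
  <!ELEMENT First_Name (#PCDATA)>
  <!ELEMENT Last_Name (#PCDATA)>
]>
```

Nothing in the declaration of the book attribute on BookRef says which kind of element the ID must belong to, which is why pointing it at an author's ident still validates.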
We will see that in XML Schema we can have typed pointers, but it's not possible to have them in DTDs. The last change I'm going to show is to add a second book reference within our remark. As I pointed out over here, when we write #PCDATA or an element type, followed by the Kleene closure, the zero-or-more star, that means we can freely mix text and subelements. So just right in the middle here, let's put a book reference, and we can put, let's say, book equals JU, and that will be the end of our reference there. And now we see that we have text, followed by a subelement, followed by more text, and so on. That should validate fine, and in fact it does. That completes our demonstration of XML documents with DTDs.
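A remark with two book references mixed into its text might look like the fragment below. The remark wording and the ISBN value are placeholders assumed for illustration; only the second reference's value, JU, comes from the demo.

```xml
<!-- Mixed content: text, a subelement, more text, another subelement, text.
     The second BookRef points at an author's ID (JU), which still validates
     because IDREF pointers in a DTD are untyped. -->
<Remark>
  Often read alongside <BookRef book="ISBN-0-00-000000-1"/> in a first
  course, and also see <BookRef book="JU"/> for more.
</Remark>
```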