1 00:00:00,330 --> 00:00:02,890 Now let's turn to the subject of querying XML. 2 00:00:04,030 --> 00:00:04,870 First of all, let me say right 3 00:00:05,210 --> 00:00:06,870 up front that querying XML is 4 00:00:06,980 --> 00:00:09,780 not nearly as mature as querying relational databases. 5 00:00:10,380 --> 00:00:11,610 And there are a couple of reasons for that. 6 00:00:11,810 --> 00:00:13,760 First of all, it's just much, much newer. 7 00:00:14,660 --> 00:00:15,940 Second of all, it's not quite 8 00:00:16,100 --> 00:00:17,360 as clean: there's no underlying 9 00:00:18,080 --> 00:00:20,000 algebra for XML that's 10 00:00:20,200 --> 00:00:23,070 similar to the relational algebra for querying relational databases. 11 00:00:24,210 --> 00:00:25,370 Let's talk about the sequence of 12 00:00:25,470 --> 00:00:26,850 development of query languages for 13 00:00:27,210 --> 00:00:28,430 XML up until the present time. 14 00:00:29,190 --> 00:00:31,010 The first language to be developed was XPath. 15 00:00:32,740 --> 00:00:34,680 XPath consists of path 16 00:00:35,000 --> 00:00:37,150 expressions and conditions 17 00:00:38,730 --> 00:00:39,660 and that's what we'll be covering in 18 00:00:39,760 --> 00:00:41,730 this video once we finish the introductory material. 19 00:00:43,610 --> 00:00:45,300 The next thing to be developed was XSLT. 20 00:00:46,720 --> 00:00:48,480 XSLT has XPath 21 00:00:48,730 --> 00:00:50,000 as a component but it also 22 00:00:50,360 --> 00:00:52,420 has transformations, and that's 23 00:00:52,590 --> 00:00:54,320 what the T stands for, and 24 00:00:54,440 --> 00:00:56,370 it also has constructs for output formatting. 25 00:00:56,990 --> 00:00:58,820 As I've mentioned before, XSLT is 26 00:00:58,970 --> 00:01:00,380 often used to translate 27 00:01:01,050 --> 00:01:03,190 XML into HTML for rendering. 28 00:01:04,070 --> 00:01:05,640 And finally, the latest 29 00:01:06,120 --> 00:01:08,780 language and the most expressive language is XQuery.
30 00:01:09,510 --> 00:01:10,920 So that also has XPath as 31 00:01:11,040 --> 00:01:12,510 a component, plus what I 32 00:01:12,640 --> 00:01:14,540 would call a full featured query language. 33 00:01:15,100 --> 00:01:17,050 So it's most similar to SQL 34 00:01:17,870 --> 00:01:18,770 in a way, as we'll be seeing. 35 00:01:19,730 --> 00:01:20,670 The order that we're going to 36 00:01:20,750 --> 00:01:22,530 cover them in is first 37 00:01:22,830 --> 00:01:24,150 XPath and then actually second 38 00:01:24,480 --> 00:01:26,050 XQuery and finally XSLT. 39 00:01:27,220 --> 00:01:27,990 There are a couple of other 40 00:01:28,560 --> 00:01:30,550 languages, XLink and XPointer. 41 00:01:31,810 --> 00:01:32,910 Those languages are for specifying, 42 00:01:34,120 --> 00:01:35,270 as you can see, links and pointers. 43 00:01:36,000 --> 00:01:38,500 They also use the XPath language as a component. 44 00:01:38,940 --> 00:01:40,450 We won't be covering those in this video. 45 00:01:41,380 --> 00:01:43,120 Now we'll be covering XPath, XQuery, 46 00:01:43,900 --> 00:01:45,610 and XSLT in moderate detail. 47 00:01:46,230 --> 00:01:47,310 We're not going to cover every 48 00:01:47,650 --> 00:01:48,630 single construct of the languages, 49 00:01:49,400 --> 00:01:50,550 but we will be covering enough 50 00:01:51,160 --> 00:01:54,070 to write a wide variety of queries using those languages. 51 00:01:55,320 --> 00:01:56,660 To understand how XPath 52 00:01:56,940 --> 00:01:59,310 works, it's good to think of the XML as a tree. 53 00:01:59,780 --> 00:02:00,460 So I'd like you to bear with 54 00:02:00,730 --> 00:02:01,700 me for a moment while I 55 00:02:01,870 --> 00:02:03,030 write a little bit 56 00:02:03,060 --> 00:02:04,200 of a tree that would be 57 00:02:04,400 --> 00:02:05,550 the tree encoding of the 58 00:02:05,620 --> 00:02:07,200 book store data that we've been working with. 
59 00:02:07,800 --> 00:02:08,690 So we would write as our 60 00:02:08,870 --> 00:02:10,010 root the book store element, 61 00:02:10,430 --> 00:02:11,650 and then we'll have sub-elements 62 00:02:12,370 --> 00:02:14,260 that would contain the books 63 00:02:15,240 --> 00:02:16,920 that are the sub elements of our bookstore. 64 00:02:17,630 --> 00:02:18,320 We might have another book. 65 00:02:19,120 --> 00:02:20,090 We might have over here a 66 00:02:20,530 --> 00:02:22,260 magazine and within 67 00:02:23,090 --> 00:02:24,380 the books then we had, as 68 00:02:24,500 --> 00:02:26,520 you might remember some attributes and some sub elements. 69 00:02:27,090 --> 00:02:29,030 We had for example the ISBN 70 00:02:29,230 --> 00:02:30,950 number I'll write as an attribute here. 71 00:02:31,710 --> 00:02:33,830 We had a price and we 72 00:02:33,980 --> 00:02:35,190 also had of course the 73 00:02:35,360 --> 00:02:37,480 title of the book and we 74 00:02:37,630 --> 00:02:39,840 had the author, excuse me, 75 00:02:40,430 --> 00:02:42,280 over here, I'm obviously 76 00:02:42,750 --> 00:02:44,680 not going to be filling in 77 00:02:44,780 --> 00:02:46,070 the subelement structure here we 78 00:02:46,220 --> 00:02:47,570 are just going to look at one book as an example. 79 00:02:48,960 --> 00:02:50,820 The ISBN number we 80 00:02:50,990 --> 00:02:52,070 now are at the leaf of 81 00:02:52,170 --> 00:02:53,030 the tree so we could have 82 00:02:53,200 --> 00:02:55,040 a string value here to 83 00:02:55,190 --> 00:02:56,850 denote the leaf maybe, 100 84 00:02:57,130 --> 00:02:58,560 for the price, for the 85 00:02:58,790 --> 00:03:00,030 title: "A First Course 86 00:03:00,410 --> 00:03:03,960 in Database Systems", then our authors had further sub-elements. 
87 00:03:04,780 --> 00:03:06,320 We had maybe two author 88 00:03:06,810 --> 00:03:08,630 sub-elements here, I'm abbreviating 89 00:03:09,320 --> 00:03:10,680 a bit, below here, a 90 00:03:10,860 --> 00:03:12,540 first name and a last 91 00:03:12,820 --> 00:03:14,370 name, again abbreviating, so that 92 00:03:14,530 --> 00:03:16,980 might have been Jeff Ullman, and so on. 93 00:03:17,180 --> 00:03:18,130 I think you get the 94 00:03:18,300 --> 00:03:21,260 idea of how we render our XML as a tree. 95 00:03:21,990 --> 00:03:22,950 And the reason we're doing that 96 00:03:23,440 --> 00:03:25,780 is so that we can think 97 00:03:26,190 --> 00:03:27,260 of the expressions we have 98 00:03:27,450 --> 00:03:30,000 in XPath as navigations down the tree. 99 00:03:30,960 --> 00:03:32,880 Specifically, what XPath consists 100 00:03:33,380 --> 00:03:34,940 of is path expressions that 101 00:03:35,070 --> 00:03:36,540 describe navigation down and 102 00:03:36,680 --> 00:03:38,420 sometimes across and up a tree. 103 00:03:39,220 --> 00:03:40,580 And then we also have conditions 104 00:03:41,230 --> 00:03:42,400 that we evaluate to pick 105 00:03:42,670 --> 00:03:45,020 out the components of the XML that we're interested in. 106 00:03:45,980 --> 00:03:46,920 So let me just go through 107 00:03:47,310 --> 00:03:48,640 a few of the basic constructs 108 00:03:49,760 --> 00:03:51,320 that we have in XPath. 109 00:03:52,810 --> 00:03:55,560 Let me just erase a few of these things here that got in my way. 110 00:03:56,090 --> 00:03:56,090 Okay. 111 00:03:56,740 --> 00:03:57,880 I'm gonna use this little box and 112 00:03:58,200 --> 00:03:59,470 I'm gonna put the construct in and 113 00:03:59,560 --> 00:04:01,180 then sort of explain how it works. 114 00:04:01,930 --> 00:04:03,300 So the first construct is 115 00:04:03,530 --> 00:04:05,320 simply a slash, and the 116 00:04:05,470 --> 00:04:08,460 slash is for designating the root element.
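The tree picture described above can be sketched in code. This is only an illustration under assumptions: the tag names (`Bookstore`, `Book`, `Authors`, `Author`, `First_Name`, `Last_Name`), the ISBN strings, and the second book and magazine are placeholders chosen for the example, and the helper `show` is hypothetical. It uses Python's standard `xml.etree.ElementTree` module to parse a small document and print it as an indented tree, with attributes marked by `@` and string values at the leaves.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature of the bookstore data; names and values are assumptions.
BOOKSTORE_XML = """
<Bookstore>
  <Book ISBN="ISBN-0001" Price="100">
    <Title>A First Course in Database Systems</Title>
    <Authors>
      <Author><First_Name>Jeff</First_Name><Last_Name>Ullman</Last_Name></Author>
      <Author><First_Name>Jennifer</First_Name><Last_Name>Widom</Last_Name></Author>
    </Authors>
  </Book>
  <Book ISBN="ISBN-0002" Price="40">
    <Title>Placeholder Second Book</Title>
  </Book>
  <Magazine>
    <Title>Placeholder Magazine</Title>
  </Magazine>
</Bookstore>
""".strip()


def show(elem, depth=0):
    """Return the subtree rooted at elem as a list of indented lines."""
    text = (elem.text or "").strip()
    # Leaf elements carry a string value; inner elements only carry whitespace.
    label = elem.tag + (f': "{text}"' if text else "")
    lines = ["  " * depth + label]
    # Attributes are drawn one level below their element, prefixed with '@'.
    for name, value in elem.attrib.items():
        lines.append("  " * (depth + 1) + f'@{name}: "{value}"')
    # Sub-elements are drawn recursively, one level deeper.
    for child in elem:
        lines.extend(show(child, depth + 1))
    return lines


root = ET.fromstring(BOOKSTORE_XML)
tree_lines = show(root)
print("\n".join(tree_lines))
```

Reading the printed output from top to bottom corresponds to walking the tree from the root down to the leaves, which is the direction most path expressions navigate.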
117 00:04:08,970 --> 00:04:10,200 So we'll put the slash at 118 00:04:10,280 --> 00:04:11,480 the beginning of an XPath 119 00:04:11,710 --> 00:04:13,110 query to say we want to start at the root. 120 00:04:13,450 --> 00:04:15,790 A slash is also used as a separator. 121 00:04:16,680 --> 00:04:18,210 So we're going to write paths 122 00:04:18,420 --> 00:04:19,540 that are going to navigate down the 123 00:04:19,640 --> 00:04:20,280 tree and we're going to put 124 00:04:20,350 --> 00:04:23,050 a '/' between the elements of the path. 125 00:04:24,200 --> 00:04:25,330 All of this will become much clearer in the demo. 126 00:04:25,740 --> 00:04:28,600 So I'll try to go fairly quickly now so we can move to the demo itself. 127 00:04:29,750 --> 00:04:32,400 The next construct is simply writing the name of an element. 128 00:04:32,610 --> 00:04:34,880 I put 'x' here but we might for example write 'book'. 129 00:04:35,570 --> 00:04:36,760 When we write 'book' in an 130 00:04:36,970 --> 00:04:38,290 XPath expression, we're saying 131 00:04:38,490 --> 00:04:39,590 that we want to navigate, say 132 00:04:39,970 --> 00:04:40,980 we're up here at the bookstore, down 133 00:04:41,410 --> 00:04:43,740 to the book sub-element as part of our path expression. 134 00:04:45,110 --> 00:04:46,140 We can also write the special 135 00:04:46,710 --> 00:04:49,110 element symbol '*' and '*' matches anything. 136 00:04:49,910 --> 00:04:51,760 So if we write '/*' then 137 00:04:52,790 --> 00:04:55,610 we'll match any sub-element of our current element. 138 00:04:56,260 --> 00:04:57,540 When we execute XPath, there's 139 00:04:57,740 --> 00:04:59,010 sort of a notion as we're writing 140 00:04:59,400 --> 00:05:01,840 the path expressions of being at a particular place.
141 00:05:02,200 --> 00:05:03,360 So we might have navigated from 142 00:05:03,500 --> 00:05:04,520 bookstore to book and then 143 00:05:04,750 --> 00:05:06,080 we would navigate say further down 144 00:05:06,370 --> 00:05:07,500 to title or if we 145 00:05:07,590 --> 00:05:09,490 put a '*' then we navigate to any sub-element. 146 00:05:10,920 --> 00:05:12,220 If we want to match an 147 00:05:12,510 --> 00:05:14,990 attribute, we write '@' and then the attribute name. 148 00:05:15,320 --> 00:05:16,420 So for example, if we're 149 00:05:16,550 --> 00:05:17,340 at the book and we want 150 00:05:17,610 --> 00:05:18,850 to match down to 151 00:05:18,940 --> 00:05:20,140 the ISBN number, we'll write 152 00:05:20,520 --> 00:05:23,290 '@ISBN' in our query, our path expression. 153 00:05:24,700 --> 00:05:27,620 We saw the single slash for navigating one step down. 154 00:05:28,210 --> 00:05:29,500 There's also a double slash construct. 155 00:05:30,550 --> 00:05:32,540 The double slash matches any 156 00:05:32,860 --> 00:05:34,220 descendant of our current element. 157 00:05:34,630 --> 00:05:35,730 So, for example, if we're 158 00:05:35,850 --> 00:05:37,030 here at the book and we 159 00:05:37,200 --> 00:05:38,270 write double slash, we'll match 160 00:05:38,560 --> 00:05:39,830 the title, the authors, the author, 161 00:05:40,220 --> 00:05:41,040 the first name and the last 162 00:05:41,240 --> 00:05:43,920 name, every descendant, and actually we'll also match ourselves. 163 00:05:44,710 --> 00:05:46,150 So this symbol here 164 00:05:46,380 --> 00:05:49,460 means any descendant, including the element where we currently are. 165 00:05:50,340 --> 00:05:52,790 So now I've given a flavor of how we write path expressions. 166 00:05:53,280 --> 00:05:54,410 Again, we'll see lots of them in our demo. 167 00:05:55,160 --> 00:05:55,600 What about conditions?
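As a rough illustration of the path constructs just described (slash, element names, `*`, `@`, and double slash), here is a sketch using Python's standard `xml.etree.ElementTree`. Two assumptions to keep in mind: the document and its tag names are placeholders, and ElementTree implements only a limited subset of XPath. In particular it evaluates paths relative to an element, so the absolute XPath `/Bookstore/...` is written as `./...` starting from the root element, and it cannot select attributes in a path, so attribute values are read from the matched elements instead. The intended XPath expression is given in a comment above each call.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature of the bookstore data; names and values are assumptions.
BOOKSTORE_XML = """
<Bookstore>
  <Book ISBN="ISBN-0001" Price="100">
    <Title>A First Course in Database Systems</Title>
    <Authors>
      <Author><First_Name>Jeff</First_Name><Last_Name>Ullman</Last_Name></Author>
      <Author><First_Name>Jennifer</First_Name><Last_Name>Widom</Last_Name></Author>
    </Authors>
  </Book>
  <Book ISBN="ISBN-0002" Price="40">
    <Title>Placeholder Second Book</Title>
  </Book>
  <Magazine>
    <Title>Placeholder Magazine</Title>
  </Magazine>
</Bookstore>
""".strip()

root = ET.fromstring(BOOKSTORE_XML)  # root is the Bookstore element

# XPath: /Bookstore/Book/Title  (element names separated by slashes)
book_titles = [t.text for t in root.findall("./Book/Title")]

# XPath: /Bookstore/*  ('*' matches any sub-element)
child_tags = [c.tag for c in root.findall("./*")]

# XPath: /Bookstore/Book/@ISBN  (attribute step, emulated by reading the attribute)
isbns = [b.get("ISBN") for b in root.findall("./Book")]

# XPath: //Title  (double slash reaches Title elements at any depth)
all_titles = [t.text for t in root.findall(".//Title")]

print(book_titles)
print(child_tags)
print(isbns)
print(all_titles)
```

Note how `//Title` also picks up the magazine's title, while `/Bookstore/Book/Title` only follows the exact path through `Book`.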
168 00:05:56,530 --> 00:05:57,800 If we want to evaluate a 169 00:05:57,890 --> 00:05:58,910 condition at the current 170 00:05:59,290 --> 00:06:00,690 point in the path, we put 171 00:06:00,930 --> 00:06:03,530 it in a square bracket and we write the condition here. 172 00:06:04,070 --> 00:06:05,200 So, for example, if we 173 00:06:05,290 --> 00:06:06,300 wanted our price to be 174 00:06:06,480 --> 00:06:07,790 less than 50, that would 175 00:06:07,940 --> 00:06:08,890 be a condition we could put 176 00:06:09,440 --> 00:06:10,510 in square brackets if we 177 00:06:10,630 --> 00:06:12,380 were (actually, better be the attribute) 178 00:06:13,060 --> 00:06:14,510 at this point in the navigation. 179 00:06:15,960 --> 00:06:17,390 Now we shouldn't confuse putting a 180 00:06:17,460 --> 00:06:18,490 condition in a square bracket 181 00:06:19,380 --> 00:06:20,790 with putting a number in a square bracket. 182 00:06:21,690 --> 00:06:22,730 If we put a number in 183 00:06:22,860 --> 00:06:24,440 a square bracket, N, for 184 00:06:24,640 --> 00:06:25,610 example, if I write three, 185 00:06:26,300 --> 00:06:27,100 that is not a condition 186 00:06:27,530 --> 00:06:29,190 but rather it matches the Nth 187 00:06:29,810 --> 00:06:31,020 sub-element of the current element. 188 00:06:31,630 --> 00:06:32,760 For example, if we were 189 00:06:32,860 --> 00:06:34,010 here at authors and we 190 00:06:34,190 --> 00:06:35,600 put 'author[2]', 191 00:06:36,190 --> 00:06:38,870 then we would match the second author sub-element of the authors. 192 00:06:39,680 --> 00:06:40,770 There are many, many other constructs. 193 00:06:41,420 --> 00:06:42,750 This just gives the basic flavor 194 00:06:43,490 --> 00:06:45,300 of the constructs for creating path 195 00:06:45,580 --> 00:06:46,920 expressions and evaluating conditions. 196 00:06:48,220 --> 00:06:50,600 XPath also has lots of built-in functions.
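The difference between a condition in square brackets and a number in square brackets can be sketched as follows. This is an illustration under assumptions: the document and tag names are placeholders, and Python's standard `xml.etree.ElementTree` supports only part of XPath. It handles equality tests on attributes and positional brackets directly, but not comparisons such as `<`, so the "price less than 50" condition is emulated with an ordinary Python filter. The intended XPath expression is given in a comment above each line.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature of the bookstore data; names and values are assumptions.
BOOKSTORE_XML = """
<Bookstore>
  <Book ISBN="ISBN-0001" Price="100">
    <Title>A First Course in Database Systems</Title>
    <Authors>
      <Author><First_Name>Jeff</First_Name><Last_Name>Ullman</Last_Name></Author>
      <Author><First_Name>Jennifer</First_Name><Last_Name>Widom</Last_Name></Author>
    </Authors>
  </Book>
  <Book ISBN="ISBN-0002" Price="40">
    <Title>Placeholder Second Book</Title>
    <Authors>
      <Author><First_Name>Pat</First_Name><Last_Name>Placeholder</Last_Name></Author>
    </Authors>
  </Book>
</Bookstore>
""".strip()

root = ET.fromstring(BOOKSTORE_XML)

# XPath: /Bookstore/Book[@Price < 50]/Title
# ElementTree has no '<' in predicates, so the condition is applied in Python.
cheap_titles = [
    book.findtext("Title")
    for book in root.findall("./Book")
    if float(book.get("Price")) < 50
]

# XPath: /Bookstore/Book[@Price = '100']/Title
# Equality conditions on attributes are supported by ElementTree directly.
exact_titles = [t.text for t in root.findall("./Book[@Price='100']/Title")]

# XPath: /Bookstore/Book/Authors/Author[2]/Last_Name
# A number in brackets is a position, not a condition: the second Author.
second_authors = [n.text for n in root.findall("./Book/Authors/Author[2]/Last_Name")]

print(cheap_titles)
print(exact_titles)
print(second_authors)
```

The second book has only one author, so `Author[2]` matches nothing under it and only the first book contributes a result.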
197 00:06:51,030 --> 00:06:53,200 I'll just mention two of them as somewhat random examples. 198 00:06:54,280 --> 00:06:56,850 There's a function that you can use in XPath called contains. 199 00:06:57,990 --> 00:06:59,240 If you write contains and then 200 00:06:59,330 --> 00:07:01,220 you write two expressions, each of 201 00:07:01,330 --> 00:07:02,750 which has a string value - 202 00:07:02,970 --> 00:07:04,270 this is actually a predicate 203 00:07:04,870 --> 00:07:06,540 - it will return true if 204 00:07:06,800 --> 00:07:08,430 the first string contains the second string. 205 00:07:09,770 --> 00:07:12,290 As a second example of a function, there's a function called name. 206 00:07:13,190 --> 00:07:15,460 If we write name in a 207 00:07:15,650 --> 00:07:16,860 path, that returns the tag 208 00:07:17,450 --> 00:07:18,860 of the current element in the path. 209 00:07:19,670 --> 00:07:21,140 We'll see the use of functions in our demo. 210 00:07:22,490 --> 00:07:23,550 The last concept that I 211 00:07:23,610 --> 00:07:24,590 want to talk about is what's 212 00:07:24,800 --> 00:07:27,020 known as navigation axes, and 213 00:07:27,140 --> 00:07:31,310 there are 13 axes in XPath. 214 00:07:31,500 --> 00:07:32,400 And what an axis is, it's 215 00:07:32,550 --> 00:07:33,650 sort of a key word that allows 216 00:07:34,050 --> 00:07:36,200 us to navigate around the XML tree. 217 00:07:36,970 --> 00:07:39,820 So, for example, one axis is called parent. 218 00:07:40,360 --> 00:07:41,560 You might have noticed that 219 00:07:41,680 --> 00:07:42,420 when we talked about the basic 220 00:07:42,760 --> 00:07:43,880 constructs, most of them 221 00:07:44,290 --> 00:07:45,600 were about going down a tree. 222 00:07:46,240 --> 00:07:47,590 If you want to navigate up 223 00:07:47,680 --> 00:07:48,580 the tree, then you can 224 00:07:48,740 --> 00:07:51,230 use the parent axis that tells you to go up to the parent.
225 00:07:52,080 --> 00:07:54,670 There's an axis called following-sibling. 226 00:07:57,550 --> 00:07:59,990 And the colon colon - you'll see how that works when we get to the demo. 227 00:08:00,430 --> 00:08:01,880 Following-sibling says match 228 00:08:02,580 --> 00:08:03,650 actually all of the following 229 00:08:04,040 --> 00:08:05,290 siblings of the current element. 230 00:08:05,710 --> 00:08:06,480 So if we have a tree 231 00:08:06,960 --> 00:08:07,870 and we're sitting at this 232 00:08:08,050 --> 00:08:09,060 point in the tree, 233 00:08:09,800 --> 00:08:11,710 then the following-sibling axis 234 00:08:12,160 --> 00:08:13,900 will match all of the 235 00:08:14,460 --> 00:08:16,370 siblings that are after the current one in the tree. 236 00:08:17,640 --> 00:08:19,020 There's an axis called descendant. 237 00:08:21,630 --> 00:08:22,650 Descendant, as you might guess, 238 00:08:23,260 --> 00:08:25,640 matches all the descendants of the current element. 239 00:08:25,960 --> 00:08:26,890 Now it's not quite the same 240 00:08:27,380 --> 00:08:28,790 as slash, slash, because as a 241 00:08:29,150 --> 00:08:30,470 reminder, slash, slash also matches 242 00:08:30,850 --> 00:08:32,250 the current element as well as the descendants. 243 00:08:33,160 --> 00:08:34,140 Actually as it happens, there is 244 00:08:34,220 --> 00:08:36,040 a navigation axis called descendant-or-self 245 00:08:36,790 --> 00:08:38,920 that's equivalent to slash, slash. 246 00:08:39,340 --> 00:08:40,300 And by the way, there's 247 00:08:40,470 --> 00:08:43,470 also one called self that will match the current element.
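To make the axes mentioned above a little more concrete, here is a sketch of what each one selects, written as plain Python tree traversal rather than as XPath. This is an emulation under assumptions, not an XPath engine: Python's standard `xml.etree.ElementTree` does not implement the `parent::`, `following-sibling::`, `descendant::`, `descendant-or-self::` or `self::` axis syntax, so the helper functions below (all hypothetical names) just compute the same sets of elements by hand. The document and tag names are placeholders.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature of the bookstore data; names and values are assumptions.
BOOKSTORE_XML = """
<Bookstore>
  <Book ISBN="ISBN-0001" Price="100">
    <Title>A First Course in Database Systems</Title>
    <Authors>
      <Author><First_Name>Jeff</First_Name><Last_Name>Ullman</Last_Name></Author>
      <Author><First_Name>Jennifer</First_Name><Last_Name>Widom</Last_Name></Author>
    </Authors>
  </Book>
  <Book ISBN="ISBN-0002" Price="40">
    <Title>Placeholder Second Book</Title>
  </Book>
  <Magazine>
    <Title>Placeholder Magazine</Title>
  </Magazine>
</Bookstore>
""".strip()

root = ET.fromstring(BOOKSTORE_XML)

# ElementTree elements do not store a parent pointer, so build a child -> parent map.
parent_of = {child: parent for parent in root.iter() for child in parent}


def parent(elem):
    """parent axis: the element one step up, or None at the root."""
    return parent_of.get(elem)


def following_siblings(elem):
    """following-sibling axis: siblings that come after elem under the same parent."""
    up = parent_of.get(elem)
    if up is None:
        return []
    siblings = list(up)
    return siblings[siblings.index(elem) + 1:]


def descendant_or_self(elem):
    """descendant-or-self axis: elem itself, then everything below it, in document order."""
    return list(elem.iter())


def descendants(elem):
    """descendant axis: everything below elem, excluding elem itself."""
    return descendant_or_self(elem)[1:]


first_book = root.find("Book")
authors = first_book.find("Authors")

print([e.tag for e in following_siblings(first_book)])
print(parent(first_book).tag)
print([e.tag for e in descendants(authors)])
print([e.tag for e in descendant_or_self(authors)][0])
```

The only difference between the last two helpers is whether the starting element is included, which is the distinction between the descendant and descendant-or-self axes drawn above.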
248 00:08:43,930 --> 00:08:44,810 And that may not seem to 249 00:08:44,880 --> 00:08:46,360 be useful, but we'll see 250 00:08:46,620 --> 00:08:47,620 uses for that, for example, 251 00:08:47,730 --> 00:08:49,060 in conjunction with the name 252 00:08:49,820 --> 00:08:50,850 function that we talked 253 00:08:51,110 --> 00:08:53,370 about up here, that would give us the tag of the current element. 254 00:08:53,700 --> 00:08:55,890 Just a few details to wrap up. 255 00:08:57,260 --> 00:08:58,820 XPath queries technically operate on 256 00:08:59,090 --> 00:09:00,510 and return a sequence of elements. 257 00:09:00,890 --> 00:09:01,590 That's their formal semantics. 258 00:09:02,620 --> 00:09:04,190 There is a specification for how 259 00:09:04,310 --> 00:09:05,720 XML documents and XML 260 00:09:05,880 --> 00:09:07,390 streams map to sequences of 261 00:09:07,470 --> 00:09:09,200 elements and you'll see that it's quite natural. 262 00:09:10,660 --> 00:09:11,960 When we run an XPath query, 263 00:09:12,830 --> 00:09:13,720 sometimes the result can be expressed 264 00:09:14,150 --> 00:09:15,240 as XML, but not always. 265 00:09:16,020 --> 00:09:18,360 But as we'll see again, that's fairly natural as well. 266 00:09:18,640 --> 00:09:21,770 So this video has given an introduction to XPath. 267 00:09:22,420 --> 00:09:23,460 We've shown how to think of 268 00:09:23,570 --> 00:09:24,570 XML data as a tree 269 00:09:25,130 --> 00:09:26,750 and then XPath as expressions 270 00:09:27,300 --> 00:09:29,530 that navigate around the tree and also evaluate conditions. 271 00:09:30,390 --> 00:09:33,240 We've seen a few of the constructs for path expressions and conditions. 272 00:09:34,130 --> 00:09:35,180 We've seen a couple of built-in functions 273 00:09:35,590 --> 00:09:37,490 and I've introduced the concept of navigation axes. 274 00:09:38,050 --> 00:09:39,240 But the real way to 275 00:09:39,390 --> 00:09:41,850 learn and understand XPath is to run some queries.
276 00:09:42,510 --> 00:09:43,620 So I urge you to watch the 277 00:09:43,680 --> 00:09:44,940 next video which is 278 00:09:45,030 --> 00:09:46,270 a demo of XPath queries over 279 00:09:46,450 --> 00:09:48,610 our bookstore data and then try some queries yourself.