1
00:00:00,330 --> 00:00:02,890
Now let's turn to the subject of querying XML.
2
00:00:04,030 --> 00:00:04,870
First of all, let me say right
3
00:00:05,210 --> 00:00:06,870
up front that querying XML is
4
00:00:06,980 --> 00:00:09,780
not nearly as mature as querying relational databases.
5
00:00:10,380 --> 00:00:11,610
And there are a couple of reasons for that.
6
00:00:11,810 --> 00:00:13,760
First of all it's just much, much newer.
7
00:00:14,660 --> 00:00:15,940
Second of all it's not quite
8
00:00:16,100 --> 00:00:17,360
as clean, there's no underlying
9
00:00:18,080 --> 00:00:20,000
algebra for XML that's
10
00:00:20,200 --> 00:00:23,070
similar to the relational algebra for querying relational databases.
11
00:00:24,210 --> 00:00:25,370
Let's talk about the sequence of
12
00:00:25,470 --> 00:00:26,850
development of query languages for
13
00:00:27,210 --> 00:00:28,430
XML up until the present time.
14
00:00:29,190 --> 00:00:31,010
The first language to be developed was XPath.
15
00:00:32,740 --> 00:00:34,680
XPath consists of path
16
00:00:35,000 --> 00:00:37,150
expressions and conditions
17
00:00:38,730 --> 00:00:39,660
and that's what we'll be covering in
18
00:00:39,760 --> 00:00:41,730
this video once we finish the introductory material.
19
00:00:43,610 --> 00:00:45,300
The next thing to be developed was XSLT.
20
00:00:46,720 --> 00:00:48,480
XSLT has XPath
21
00:00:48,730 --> 00:00:50,000
as a component but it also
22
00:00:50,360 --> 00:00:52,420
has transformations, and that's
23
00:00:52,590 --> 00:00:54,320
what the T stands for, and
24
00:00:54,440 --> 00:00:56,370
it also has constructs for output formatting.
25
00:00:56,990 --> 00:00:58,820
As I've mentioned before, XSLT is
26
00:00:58,970 --> 00:01:00,380
often used to translate
27
00:01:01,050 --> 00:01:03,190
XML into HTML for rendering.
28
00:01:04,070 --> 00:01:05,640
And finally, the latest
29
00:01:06,120 --> 00:01:08,780
language and the most expressive language is XQuery.
30
00:01:09,510 --> 00:01:10,920
So that also has XPath as
31
00:01:11,040 --> 00:01:12,510
a component, plus what I
32
00:01:12,640 --> 00:01:14,540
would call a full featured query language.
33
00:01:15,100 --> 00:01:17,050
So it's most similar to SQL
34
00:01:17,870 --> 00:01:18,770
in a way, as we'll be seeing.
35
00:01:19,730 --> 00:01:20,670
The order that we're going to
36
00:01:20,750 --> 00:01:22,530
cover them in is first
37
00:01:22,830 --> 00:01:24,150
XPath and then actually second
38
00:01:24,480 --> 00:01:26,050
XQuery and finally XSLT.
39
00:01:27,220 --> 00:01:27,990
There are a couple of other
40
00:01:28,560 --> 00:01:30,550
languages, XLink and XPointer.
41
00:01:31,810 --> 00:01:32,910
Those languages are for specifying,
42
00:01:34,120 --> 00:01:35,270
as you can see, links and pointers.
43
00:01:36,000 --> 00:01:38,500
They also use the XPath language as a component.
44
00:01:38,940 --> 00:01:40,450
We won't be covering those in this video.
45
00:01:41,380 --> 00:01:43,120
Now we'll be covering XPath, XQuery,
46
00:01:43,900 --> 00:01:45,610
and XSLT in moderate detail.
47
00:01:46,230 --> 00:01:47,310
We're not going to cover every
48
00:01:47,650 --> 00:01:48,630
single construct of the languages,
49
00:01:49,400 --> 00:01:50,550
but we will be covering enough
50
00:01:51,160 --> 00:01:54,070
to write a wide variety of queries using those languages.
51
00:01:55,320 --> 00:01:56,660
To understand how XPath
52
00:01:56,940 --> 00:01:59,310
works, it's good to think of the XML as a tree.
53
00:01:59,780 --> 00:02:00,460
So I'd like you to bear with
54
00:02:00,730 --> 00:02:01,700
me for a moment while I
55
00:02:01,870 --> 00:02:03,030
write a little bit
56
00:02:03,060 --> 00:02:04,200
of a tree that would be
57
00:02:04,400 --> 00:02:05,550
the tree encoding of the
58
00:02:05,620 --> 00:02:07,200
bookstore data that we've been working with.
59
00:02:07,800 --> 00:02:08,690
So we would write as our
60
00:02:08,870 --> 00:02:10,010
root the bookstore element,
61
00:02:10,430 --> 00:02:11,650
and then we'll have sub-elements
62
00:02:12,370 --> 00:02:14,260
that would contain the books
63
00:02:15,240 --> 00:02:16,920
that are the sub-elements of our bookstore.
64
00:02:17,630 --> 00:02:18,320
We might have another book.
65
00:02:19,120 --> 00:02:20,090
We might have over here a
66
00:02:20,530 --> 00:02:22,260
magazine and within
67
00:02:23,090 --> 00:02:24,380
the books then we had, as
68
00:02:24,500 --> 00:02:26,520
you might remember, some attributes and some sub-elements.
69
00:02:27,090 --> 00:02:29,030
We had for example the ISBN
70
00:02:29,230 --> 00:02:30,950
number I'll write as an attribute here.
71
00:02:31,710 --> 00:02:33,830
We had a price and we
72
00:02:33,980 --> 00:02:35,190
also had of course the
73
00:02:35,360 --> 00:02:37,480
title of the book and we
74
00:02:37,630 --> 00:02:39,840
had the author, excuse me,
75
00:02:40,430 --> 00:02:42,280
over here, I'm obviously
76
00:02:42,750 --> 00:02:44,680
not going to be filling in
77
00:02:44,780 --> 00:02:46,070
the subelement structure here we
78
00:02:46,220 --> 00:02:47,570
are just going to look at one book as an example.
79
00:02:48,960 --> 00:02:50,820
The ISBN number we
80
00:02:50,990 --> 00:02:52,070
now are at the leaf of
81
00:02:52,170 --> 00:02:53,030
the tree so we could have
82
00:02:53,200 --> 00:02:55,040
a string value here to
83
00:02:55,190 --> 00:02:56,850
denote the leaf maybe, 100
84
00:02:57,130 --> 00:02:58,560
for the price, for the
85
00:02:58,790 --> 00:03:00,030
title: "A First Course
86
00:03:00,410 --> 00:03:03,960
in Database Systems", then our authors had further sub-elements.
87
00:03:04,780 --> 00:03:06,320
We had maybe two author
88
00:03:06,810 --> 00:03:08,630
sub-elements here, I'm abbreviating
89
00:03:09,320 --> 00:03:10,680
a bit, below here, a
90
00:03:10,860 --> 00:03:12,540
first name and a last
91
00:03:12,820 --> 00:03:14,370
name, again abbreviating so that
92
00:03:14,530 --> 00:03:16,980
might have been Jeff Ullman, and so on.
93
00:03:17,180 --> 00:03:18,130
I think you get the
94
00:03:18,300 --> 00:03:21,260
idea of how we render our XML as a tree.
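The tree drawn above can be sketched in Python with the standard xml.etree.ElementTree module. This is only an illustration: the tag names, attribute names, and the ISBN value are assumptions, and only the title, the price of 100, and the author Jeff Ullman come from the description above.

```python
import xml.etree.ElementTree as ET

# Assumed markup for the bookstore fragment; the tag names, attribute
# names, and the ISBN value are illustrative guesses.
BOOKSTORE_XML = """
<Bookstore>
  <Book ISBN="ISBN-1" Price="100">
    <Title>A First Course in Database Systems</Title>
    <Authors>
      <Author><First_Name>Jeff</First_Name><Last_Name>Ullman</Last_Name></Author>
    </Authors>
  </Book>
  <Magazine/>
</Bookstore>
"""


def outline(elem, depth=0):
    """Return one indented line per element: tag, attributes, leaf text."""
    attrs = "".join(f" @{name}={value}" for name, value in elem.attrib.items())
    text = (elem.text or "").strip()
    line = "  " * depth + elem.tag + attrs + (f" = {text}" if text else "")
    lines = [line]
    for child in elem:  # children come back in document order
        lines.extend(outline(child, depth + 1))
    return lines


root = ET.fromstring(BOOKSTORE_XML)
tree_lines = outline(root)
print("\n".join(tree_lines))
```

Each level of indentation in the printed outline is one step down the tree, which is the picture the path expressions rely on.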
95
00:03:21,990 --> 00:03:22,950
And the reason we're doing that
96
00:03:23,440 --> 00:03:25,780
is so that we can think
97
00:03:26,190 --> 00:03:27,260
of the expressions we have
98
00:03:27,450 --> 00:03:30,000
in XPath as navigations down the tree.
99
00:03:30,960 --> 00:03:32,880
Specifically, what XPath consists
100
00:03:33,380 --> 00:03:34,940
of is path expressions that
101
00:03:35,070 --> 00:03:36,540
describe navigation down and
102
00:03:36,680 --> 00:03:38,420
sometimes across and up a tree.
103
00:03:39,220 --> 00:03:40,580
And then we also have conditions
104
00:03:41,230 --> 00:03:42,400
that we evaluate to pick
105
00:03:42,670 --> 00:03:45,020
out the components of the XML that we're interested in.
106
00:03:45,980 --> 00:03:46,920
So let me just go through
107
00:03:47,310 --> 00:03:48,640
a few of the basic constructs
108
00:03:49,760 --> 00:03:51,320
that we have in XPath.
109
00:03:52,810 --> 00:03:55,560
Let me just erase a few of these things here that got in my way.
110
00:03:56,090 --> 00:03:56,090
Okay.
111
00:03:56,740 --> 00:03:57,880
I'm gonna use this little box and
112
00:03:58,200 --> 00:03:59,470
I'm gonna put the construct in and
113
00:03:59,560 --> 00:04:01,180
then sort of explain how it works.
114
00:04:01,930 --> 00:04:03,300
So the first construct is
115
00:04:03,530 --> 00:04:05,320
simply a slash, and the
116
00:04:05,470 --> 00:04:08,460
slash is for designating the root element.
117
00:04:08,970 --> 00:04:10,200
So we'll put the slash at
118
00:04:10,280 --> 00:04:11,480
the beginning of an XPath
119
00:04:11,710 --> 00:04:13,110
query to say we want to start at the root.
120
00:04:13,450 --> 00:04:15,790
A slash is also used as a separator.
121
00:04:16,680 --> 00:04:18,210
So we're going to write paths
122
00:04:18,420 --> 00:04:19,540
that are going to navigate down the
123
00:04:19,640 --> 00:04:20,280
tree and we're going to put
124
00:04:20,350 --> 00:04:23,050
a '/' between the elements of the path.
125
00:04:24,200 --> 00:04:25,330
All of this will become much clearer in the demo.
126
00:04:25,740 --> 00:04:28,600
So I'll try to go fairly quickly now so we can move to the demo itself.
127
00:04:29,750 --> 00:04:32,400
The next construct is simply writing the name of an element.
128
00:04:32,610 --> 00:04:34,880
I put 'x' here but we might for example write 'book'.
129
00:04:35,570 --> 00:04:36,760
When we write 'book' in an
130
00:04:36,970 --> 00:04:38,290
XPath expression, we're saying
131
00:04:38,490 --> 00:04:39,590
that we want to navigate say
132
00:04:39,970 --> 00:04:40,980
we're up here at the bookstore down
133
00:04:41,410 --> 00:04:43,740
to the book sub-element as part of our path expression.
134
00:04:45,110 --> 00:04:46,140
We can also write the special
135
00:04:46,710 --> 00:04:49,110
element symbol '*' and '*' matches anything.
136
00:04:49,910 --> 00:04:51,760
So if we write '/*' then
137
00:04:52,790 --> 00:04:55,610
we'll match any sub-element of our current element.
138
00:04:56,260 --> 00:04:57,540
When we execute XPath, there's
139
00:04:57,740 --> 00:04:59,010
sort of a notion as we're writing
140
00:04:59,400 --> 00:05:01,840
the path expressions of being at a particular place.
141
00:05:02,200 --> 00:05:03,360
So we might have navigated from
142
00:05:03,500 --> 00:05:04,520
bookstore to book and then
143
00:05:04,750 --> 00:05:06,080
we would navigate say further down
144
00:05:06,370 --> 00:05:07,500
to title or if we
145
00:05:07,590 --> 00:05:09,490
put a '*' then we navigate to any sub-element.
146
00:05:10,920 --> 00:05:12,220
If we want to match an
147
00:05:12,510 --> 00:05:14,990
attribute, we write '@' and then the attribute name.
148
00:05:15,320 --> 00:05:16,420
So for example, if we're
149
00:05:16,550 --> 00:05:17,340
at the book and we want
150
00:05:17,610 --> 00:05:18,850
to match down to
151
00:05:18,940 --> 00:05:20,140
the ISBN number, we'll write
152
00:05:20,520 --> 00:05:23,290
@ISBN in our query, our path expression.
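A minimal sketch of these path steps, using Python's standard xml.etree.ElementTree module, which implements only a small subset of XPath. The tag names, the second book, the magazine title, and the ISBN values are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Assumed markup; the second book, the magazine and all ISBN values are made up.
root = ET.fromstring("""
<Bookstore>
  <Book ISBN="ISBN-1" Price="100"><Title>A First Course in Database Systems</Title></Book>
  <Book ISBN="ISBN-2" Price="25"><Title>Hypothetical Second Book</Title></Book>
  <Magazine><Title>Hypothetical Magazine</Title></Magazine>
</Bookstore>
""")

# `root` is already the Bookstore element, so these relative paths
# correspond to XPath expressions that begin with /Bookstore/.

# /Bookstore/Book : step down to the Book sub-elements
books = root.findall("Book")

# /Bookstore/* : '*' matches any sub-element
child_tags = [child.tag for child in root.findall("*")]

# /Bookstore/Book/Title : '/' separates the steps of the path
book_titles = [title.text for title in root.findall("Book/Title")]

# /Bookstore/Book/@ISBN : ElementTree cannot return attribute nodes,
# so read the attribute off each matched Book instead
isbns = [book.get("ISBN") for book in books]

print(child_tags, book_titles, isbns)
```

A full XPath processor would evaluate the expressions in the comments directly; the Python calls are the closest equivalents this module offers.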
153
00:05:24,700 --> 00:05:27,620
We saw the single slash for navigating one step down.
154
00:05:28,210 --> 00:05:29,500
There's also a double slash construct.
155
00:05:30,550 --> 00:05:32,540
The double slash matches any
156
00:05:32,860 --> 00:05:34,220
descendant of our current element.
157
00:05:34,630 --> 00:05:35,730
So, for example, if we're
158
00:05:35,850 --> 00:05:37,030
here at the book and we
159
00:05:37,200 --> 00:05:38,270
write double slash, we'll match
160
00:05:38,560 --> 00:05:39,830
the title, the authors, the auth,
161
00:05:40,220 --> 00:05:41,040
the first name and the last
162
00:05:41,240 --> 00:05:43,920
name, every descendant, and actually we'll also match ourselves.
163
00:05:44,710 --> 00:05:46,150
So this symbol here
164
00:05:46,380 --> 00:05:49,460
means any descendant, including the element where we currently are.
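The difference between the descendants alone and the descendants plus the current element can be sketched with Python's standard xml.etree.ElementTree module. The markup below is assumed for illustration, and `findall` and `iter` stand in for what a full XPath processor would do with the double slash.

```python
import xml.etree.ElementTree as ET

# Assumed markup; tag names, the ISBN value and the magazine title are made up.
root = ET.fromstring("""
<Bookstore>
  <Book ISBN="ISBN-1" Price="100">
    <Title>A First Course in Database Systems</Title>
    <Authors>
      <Author><First_Name>Jeff</First_Name><Last_Name>Ullman</Last_Name></Author>
    </Authors>
  </Book>
  <Magazine><Title>Hypothetical Magazine</Title></Magazine>
</Bookstore>
""")
book = root.find("Book")

# .//* starting at the book: every element below it, at any depth
below_book = [e.tag for e in book.findall(".//*")]

# iter() yields the starting element itself first, then its descendants
book_and_below = [e.tag for e in book.iter()]

# .//Title starting at the root: Title elements at any depth,
# whether they sit under a Book or under a Magazine
all_titles = [t.text for t in root.findall(".//Title")]

print(below_book, book_and_below, all_titles)
```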
165
00:05:50,340 --> 00:05:52,790
So now I've given a flavor of how we write path expressions.
166
00:05:53,280 --> 00:05:54,410
Again, we'll see lots of them in our demo.
167
00:05:55,160 --> 00:05:55,600
What about conditions?
168
00:05:56,530 --> 00:05:57,800
If we want to evaluate a
169
00:05:57,890 --> 00:05:58,910
condition at the current
170
00:05:59,290 --> 00:06:00,690
point in the path, we put
171
00:06:00,930 --> 00:06:03,530
it in a square bracket and we write the condition here.
172
00:06:04,070 --> 00:06:05,200
So, for example, if we
173
00:06:05,290 --> 00:06:06,300
wanted our price to be
174
00:06:06,480 --> 00:06:07,790
less than 50, that would
175
00:06:07,940 --> 00:06:08,890
be a condition we could put
176
00:06:09,440 --> 00:06:10,510
in square brackets if we
177
00:06:10,630 --> 00:06:12,380
were (actually, better be the attribute)
178
00:06:13,060 --> 00:06:14,510
at this point in the navigation.
179
00:06:15,960 --> 00:06:17,390
Now we shouldn't confuse putting a
180
00:06:17,460 --> 00:06:18,490
condition in a square bracket
181
00:06:19,380 --> 00:06:20,790
with putting a number in a square bracket.
182
00:06:21,690 --> 00:06:22,730
If we put a number in
183
00:06:22,860 --> 00:06:24,440
a square bracket, N, for
184
00:06:24,640 --> 00:06:25,610
example, if I write three,
185
00:06:26,300 --> 00:06:27,100
that is not a condition
186
00:06:27,530 --> 00:06:29,190
but rather it matches the Nth
187
00:06:29,810 --> 00:06:31,020
sub element of the current element.
188
00:06:31,630 --> 00:06:32,760
For example, if we were
189
00:06:32,860 --> 00:06:34,010
here at authors and we
190
00:06:34,190 --> 00:06:35,600
put auth square bracket two,
191
00:06:36,190 --> 00:06:38,870
then we would match the second auth sub-element of the authors.
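A small sketch of the two uses of square brackets, again with Python's standard xml.etree.ElementTree module. The markup, the second book, and the ISBN values are assumptions. This module supports equality tests and positions in brackets but not the less-than comparison, so the price test is done in ordinary Python.

```python
import xml.etree.ElementTree as ET

# Assumed markup; the second book and all ISBN values are made up.
root = ET.fromstring("""
<Bookstore>
  <Book ISBN="ISBN-1" Price="100">
    <Title>A First Course in Database Systems</Title>
    <Authors>
      <Author><Last_Name>Ullman</Last_Name></Author>
      <Author><Last_Name>Widom</Last_Name></Author>
    </Authors>
  </Book>
  <Book ISBN="ISBN-2" Price="25">
    <Title>Hypothetical Second Book</Title>
    <Authors><Author><Last_Name>Nobody</Last_Name></Author></Authors>
  </Book>
</Bookstore>
""")

# Book[@ISBN='ISBN-2'] : a condition in square brackets filters the Book step
matched = root.findall("Book[@ISBN='ISBN-2']")

# Book[@Price < 50] : no '<' in ElementTree predicates, so filter in Python
cheap_titles = [
    book.findtext("Title")
    for book in root.findall("Book")
    if float(book.get("Price")) < 50
]

# Author[2] : a bare number is a position (counted from 1), not a condition
second_author = root.find("Book/Authors/Author[2]")

print(cheap_titles, second_author.findtext("Last_Name"))
```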
192
00:06:39,680 --> 00:06:40,770
There are many, many other constructs.
193
00:06:41,420 --> 00:06:42,750
This just gives the basic flavor
194
00:06:43,490 --> 00:06:45,300
of the constructs for creating path
195
00:06:45,580 --> 00:06:46,920
expressions and evaluating conditions.
196
00:06:48,220 --> 00:06:50,600
XPath also has lots of built in functions.
197
00:06:51,030 --> 00:06:53,200
I'll just mention two of them as somewhat random examples.
198
00:06:54,280 --> 00:06:56,850
There's a function that you can use in XPath called contains.
199
00:06:57,990 --> 00:06:59,240
If you write contains and then
200
00:06:59,330 --> 00:07:01,220
you write two expressions, each of
201
00:07:01,330 --> 00:07:02,750
which has a string value -
202
00:07:02,970 --> 00:07:04,270
this is actually a predicate
203
00:07:04,870 --> 00:07:06,540
- it will return true if
204
00:07:06,800 --> 00:07:08,430
the first string contains the second string.
205
00:07:09,770 --> 00:07:12,290
As a second example of a function, there's a function called name.
206
00:07:13,190 --> 00:07:15,460
If we write name in a
207
00:07:15,650 --> 00:07:16,860
path, that returns the tag
208
00:07:17,450 --> 00:07:18,860
of the current element in the path.
209
00:07:19,670 --> 00:07:21,140
We'll see the use of functions in our demo.
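Python's standard xml.etree.ElementTree module does not implement XPath functions, so the following is only a plain-Python analogue of what contains and name compute, not the XPath functions themselves. The `contains` helper and the markup are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Assumed markup; the second book and the magazine are made up.
root = ET.fromstring("""
<Bookstore>
  <Book><Title>A First Course in Database Systems</Title></Book>
  <Book><Title>Hypothetical Second Book</Title></Book>
  <Magazine><Title>Hypothetical Magazine</Title></Magazine>
</Bookstore>
""")


def contains(haystack, needle):
    """Plain-Python stand-in for XPath's contains(string, string)."""
    return needle in haystack


# Roughly /Bookstore/Book[contains(Title, "Database")]
database_books = [
    book.findtext("Title")
    for book in root.findall("Book")
    if contains(book.findtext("Title", default=""), "Database")
]

# Roughly what name() reports: the tag of each element we are visiting
child_names = [child.tag for child in root]

print(database_books, child_names)
```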
210
00:07:22,490 --> 00:07:23,550
The last concept that I
211
00:07:23,610 --> 00:07:24,590
want to talk about is what's
212
00:07:24,800 --> 00:07:27,020
known as navigation axes, and
213
00:07:27,140 --> 00:07:31,310
there are 13 axes in XPath.
214
00:07:31,500 --> 00:07:32,400
And what an axis is, it's
215
00:07:32,550 --> 00:07:33,650
sort of a key word that allows
216
00:07:34,050 --> 00:07:36,200
us to navigate around the XML tree.
217
00:07:36,970 --> 00:07:39,820
So, for example, one axis is called parent.
218
00:07:40,360 --> 00:07:41,560
You might have noticed that
219
00:07:41,680 --> 00:07:42,420
when we talked about the basic
220
00:07:42,760 --> 00:07:43,880
constructs, most of them
221
00:07:44,290 --> 00:07:45,600
were about going down a tree.
222
00:07:46,240 --> 00:07:47,590
If you want to navigate up
223
00:07:47,680 --> 00:07:48,580
the tree, then you can
224
00:07:48,740 --> 00:07:51,230
use the parent axis that tells you to go up to the parent.
225
00:07:52,080 --> 00:07:54,670
There's an axis called following-sibling.
226
00:07:57,550 --> 00:07:59,990
And the colon colon - you'll see how that works when we get to the demo.
227
00:08:00,430 --> 00:08:01,880
The following sibling says match
228
00:08:02,580 --> 00:08:03,650
actually all of the following
229
00:08:04,040 --> 00:08:05,290
siblings of the current element.
230
00:08:05,710 --> 00:08:06,480
So if we have a tree
231
00:08:06,960 --> 00:08:07,870
and we're sitting at this
232
00:08:08,050 --> 00:08:09,060
point in the tree,
233
00:08:09,800 --> 00:08:11,710
then the following-sibling axis
234
00:08:12,160 --> 00:08:13,900
will match all of the
235
00:08:14,460 --> 00:08:16,370
siblings that are after the current one in the tree.
236
00:08:17,640 --> 00:08:19,020
There's an axis called descendant.
237
00:08:21,630 --> 00:08:22,650
Descendant, as you might guess,
238
00:08:23,260 --> 00:08:25,640
matches all the descendants of the current element.
239
00:08:25,960 --> 00:08:26,890
Now it's not quite the same
240
00:08:27,380 --> 00:08:28,790
as slash, slash, because as a
241
00:08:29,150 --> 00:08:30,470
reminder, slash, slash also matches
242
00:08:30,850 --> 00:08:32,250
the current element as well as the descendants.
243
00:08:33,160 --> 00:08:34,140
Actually as it happens, there is
244
00:08:34,220 --> 00:08:36,040
a navigation axis called descendant-or-self
245
00:08:36,790 --> 00:08:38,920
that's equivalent to slash, slash.
246
00:08:39,340 --> 00:08:40,300
And by the way, there's
247
00:08:40,470 --> 00:08:43,470
also one called self that will match the current element.
248
00:08:43,930 --> 00:08:44,810
And that may not seem to
249
00:08:44,880 --> 00:08:46,360
be useful, but we'll see
250
00:08:46,620 --> 00:08:47,620
uses for that, for example,
251
00:08:47,730 --> 00:08:49,060
in conjunction with the name
252
00:08:49,820 --> 00:08:50,850
function that we talked
253
00:08:51,110 --> 00:08:53,370
about up here, that would give us the tag of the current element.
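Python's standard xml.etree.ElementTree module has no axis syntax, so the sketch below imitates a few axes by hand to show what each one selects. The helper names `parent_of`, `following_siblings`, and `descendants`, and the markup, are assumptions for illustration; a full XPath processor would accept expressions such as `following-sibling::*` directly.

```python
import xml.etree.ElementTree as ET

# Assumed markup; the second book, the magazine and the ISBN values are made up.
root = ET.fromstring("""
<Bookstore>
  <Book ISBN="ISBN-1" Price="100">
    <Title>A First Course in Database Systems</Title>
    <Authors><Author><Last_Name>Ullman</Last_Name></Author></Authors>
  </Book>
  <Book ISBN="ISBN-2" Price="25"><Title>Hypothetical Second Book</Title></Book>
  <Magazine><Title>Hypothetical Magazine</Title></Magazine>
</Bookstore>
""")

# Elements hold no pointer to their parent, so build a child -> parent map.
parent_of = {child: parent for parent in root.iter() for child in parent}


def following_siblings(elem):
    """Imitate following-sibling::* : the siblings after elem, in order."""
    parent = parent_of.get(elem)
    if parent is None:  # the root has no parent and therefore no siblings
        return []
    siblings = list(parent)
    return siblings[siblings.index(elem) + 1:]


def descendants(elem, include_self=False):
    """Imitate descendant::* (or descendant-or-self::* if include_self)."""
    nodes = list(elem.iter())  # iter() yields elem first, then its descendants
    return nodes if include_self else nodes[1:]


first_book = root.find("Book")
title = first_book.find("Title")

print(parent_of[title].tag)                             # parent axis
print([e.tag for e in following_siblings(first_book)])  # following-sibling axis
print([e.tag for e in descendants(first_book)])         # descendant axis
```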
254
00:08:53,700 --> 00:08:55,890
Just a few details to wrap up.
255
00:08:57,260 --> 00:08:58,820
XPath queries technically operate on
256
00:08:59,090 --> 00:09:00,510
and return a sequence of elements.
257
00:09:00,890 --> 00:09:01,590
That's their formal semantics.
258
00:09:02,620 --> 00:09:04,190
There is a specification for how
259
00:09:04,310 --> 00:09:05,720
XML documents and XML
260
00:09:05,880 --> 00:09:07,390
streams map to sequences of
261
00:09:07,470 --> 00:09:09,200
elements and you'll see that it's quite natural.
262
00:09:10,660 --> 00:09:11,960
When we run an XPath query,
263
00:09:12,830 --> 00:09:13,720
sometimes the result can be expressed
264
00:09:14,150 --> 00:09:15,240
as XML, but not always.
265
00:09:16,020 --> 00:09:18,360
But as we'll see again, that's fairly natural as well.
266
00:09:18,640 --> 00:09:21,770
So this video has given an introduction to XPath.
267
00:09:22,420 --> 00:09:23,460
We've shown how to think of
268
00:09:23,570 --> 00:09:24,570
XML data as a tree
269
00:09:25,130 --> 00:09:26,750
and then XPath as expressions
270
00:09:27,300 --> 00:09:29,530
that navigate around the tree and also evaluate conditions.
271
00:09:30,390 --> 00:09:33,240
We've seen a few of the constructs for path expressions and conditions.
272
00:09:34,130 --> 00:09:35,180
We've seen a couple of built-in functions
273
00:09:35,590 --> 00:09:37,490
and I've introduced the concept of navigation axes.
274
00:09:38,050 --> 00:09:39,240
But the real way to
275
00:09:39,390 --> 00:09:41,850
learn and understand XPath is to run some queries.
276
00:09:42,510 --> 00:09:43,620
So I urge you to watch the
277
00:09:43,680 --> 00:09:44,940
next video which is
278
00:09:45,030 --> 00:09:46,270
a demo of XPath queries over
279
00:09:46,450 --> 00:09:48,610
our bookstore data and then try some queries yourself.