Introduction to XQuery as a tool for TEI users

8 November 2010


This isn't really yet a full set of slides for the tutorial, but it is supposed to be a start. At least, it's a place to write some stuff down.

Organizational preliminaries

Schedule

What's where

A request

This is a newly developed course. So:
* Interest and novelty are determined at the sole discretion of the instructors, whose decisions are final.

Goals of the course

When you leave, you should:
  • have an overview of XQuery
  • have used XQuery a bit yourself, hands-on
  • be able to fend for yourself
You will not become an expert here. (Sorry.)
A lot of very bright people have spent more than a decade developing the technology we are talking about here. You are not going to master that technology in a day and a half.
So you really need to learn to fend for yourselves. By that I mean that when you go back home and try to do something more ambitious, you will, if you're like me, run into trouble. You should know in principle what's possible, but when things don't work ‘right’ (i.e. as you expect), you need to be able to discover, in practice and in detail, what's wrong.
So in some of the exercises I am going to try to let you flounder a bit. You need to get used to floundering; it's part of the task of learning a new language.

Outline of the course

  1. Introduction: XQuery origin, history, context
  2. Atomic values, nodes, sequences
  3. FLWOR expressions
  4. Functions, modules, collections
  5. Full-text search
  6. Miscellaneous, Q/A*, and Wrapup

Shortcuts

To fit the course into the time available, we give some topics short shrift:
  • function library
  • interface to XSD
  • static typing
  • details of complicated rules
We'll work with a few functions, and look at the documentation for a few. Once you know how to read up on one function and call it, you know how to do that for any function; you can learn new functions at home.
If you're working with TEI documents or other document-oriented XML, you don't have the kind of XSD schemas that require close attention to how XQuery and XSD interact. We will talk about the type system, because it's pervasive in XQuery, but we'll focus on a small number of the most useful built-in simple types. We won't talk about complex types or user-supplied schemas at all.
The static typing in XQuery is a work of great subtlety and ingenuity. I believe there are theoretical computer scientists who find the design choices and tradeoffs in the design a fascinating object of contemplation. If you have an interest in theoretical computer science, or even if you are just writing an implementation of XQuery, I commend the static typing in the language to your attention. Otherwise, what you have to know about static typing is what it is (in general terms), that it exists, and that occasionally it may get in your way with puzzling messages or behaviors, which you have to ignore or try to work around.

Assumptions

The slides and exercises assume some things:
  • You are familiar with XML and XML namespaces.
  • You can produce well-formed XML in an editor.
  • You either have a passing acquaintance with XPath 1.0 and XSLT (1.0 or 2.0) or do not much mind hearing occasional comparisons of features with a language you don't know.
  • You have a network connection.
  • A current Java is installed on your machine.
  • A copy of BaseX is installed on your machine.
If any of these assumptions is false, please raise your hand now.

Installing BaseX

Don't have BaseX installed?
  1. Navigate to http://www.inf.uni-konstanz.de/dbis/basex/
  2. Click Download.
  3. Select .jar, .exe, or .dmg.
  4. Install in ‘the usual way’.
We'll use BaseX for the exercises in this course.
BaseX is a fine piece of work, but please note that this use does not constitute an endorsement of BaseX over other XQuery implementations. BaseX is:
You'll need to have it installed by the end of the morning break. If you do not already have it installed, now is a good time to start the download.

Rules

  • If you cannot hear what I'm saying or see what I'm showing, speak up.
  • If you hear but do not understand, ask a question. (But N.B. I don't promise to answer right away.)
  • If you have questions about side points, hold them for break.
  • If I'm going too fast (linguistically or conceptually), let me know.
  • If I'm going too slow, and boring you, feel free to signal “Pick up the pace!” or “End the digression!”
  • During the exercises, feel free to experiment with variations on the exercise as shown in the slides. But if your experiments lead to complications, you're on your own.

Rules for Q and A

Much of this course involves questions and answers. I ask, you answer. Rules include:
  • Guess!
  • Think!
  • Try it!
We are not operating a nuclear power plant here. No catastrophe is going to occur if you guess wrong.
But if you fail to guess, a catastrophe will occur: you won't learn as much.

A word from the wise

Inevitably, the people who get the most from the class share one characteristic: They remain focused on the topic at hand.
The first trick to maintaining focus is to get enough sleep. I suggest 10 hours of sleep each night when you are studying new ideas. Before dismissing this idea, try it. You will wake up refreshed and ready to learn. Caffeine is not a substitute for sleep.
The second trick is to [remember that s]ome things in this world are just hard.
Before going any further, assure yourself that you are not stupid and that some things are just hard.

-Aaron Hillegass

Acknowledgements

Many people have provided help, examples, insights, and material shamelessly recycled here without attribution. Some have helped me personally, some have never heard of me but have helped me by making their knowledge and work available on the Web. Thanks to my colleagues in the W3C XML Query working group.
Particular thanks to Syd Bauman, Dan Greenstein, Mary Holstege, Jim Melton, Liam Quin, Don Spaeth, Priscilla Walmsley.

Session 1: Introduction

Outline of this section

  1. Introduction: XQuery origin, history, context
    • preliminaries
    • introductions around the table
    • origin of XQuery
    • how the specifications are organized
    • XQuery as a language
    • the implementation landscape
  2. Atomic values, nodes, sequences
  3. FLWOR expressions
  4. Functions, modules, collections
  5. Full-text search
  6. Miscellaneous, Q/A, and Wrapup

Introductions

Origin of XQuery (1)

Origin of XQuery

How the specifications are organized

XQuery as a language

Goals:
  • composability
  • closure
  • schema awareness
  • XPath 1.0 compatibility
  • simplicity
  • completeness
  • generality
  • concision
  • static analysis

Who was at the table?

  • database (esp. SQL) vendors
  • middleware vendors
  • computer scientists / type theorists
  • document-markup users
  • XSLT users (and XSL WG)

Implementation landscape

Databases:
  • Oracle
  • IBM DB2
  • SQL Server
XQuery over persistent stores:
  • Mark Logic
  • BaseX
  • eXist
  • XQEngine
  • ...
XQuery over transient data:
  • Saxon
  • Galax
  • ...
In-memory XQuery:
  • Nux
  • GCX
  • ...

Session 2: Atomic values, nodes, sequences

  1. Introduction: XQuery origin, history, context
  2. Atomic values, nodes, sequences
    • atomic values
      • simple literals
      • types and typed values
      • type coercions
    • simple expressions
      • (arithmetic) operators
      • function calls
      • comparisons and conditionals
    • sequences
      • range expressions
      • functions returning sequences
      • filter expressions
    • nodes
      • the XQuery and XPath data model (XDM)
      • nodes
      • path expressions
  3. FLWOR expressions
  4. Functions, modules, collections
  5. Full-text search
  6. Miscellaneous, Q/A, and Wrapup

The elements of a language

How do we describe a language effectively?

A powerful programming language is more than just a means for instructing a computer to perform tasks. The language also serves as a framework within which we organize our ideas about processes. Thus, when we describe a language, we should pay particular attention to the means that the language provides for combining simple ideas to form more complex ideas. Every powerful language has three mechanisms for accomplishing this:

primitive expressions, which represent the simplest entities with which the language is concerned,
means of combination, by which compound expressions are built from simpler ones, and
means of abstraction, by which compound objects can be named and manipulated as units.

-Hal Abelson and Gerald Jay Sussman, with Julie Sussman, Structure and interpretation of computer programs (Cambridge: MIT Press, 1985), p. 4.

The elements of XQuery

We can describe XQuery that way.
  • primitive expressions and atomic objects: literals, variable references
  • simple combinations: operators of many kinds, function calls, parenthesized expressions, constructors
  • more combinations: conditionals, typeswitches, quantified expressions, FLWOR expressions
  • abstraction: function definitions, modules
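To make the three mechanisms concrete, here is a minimal sketch. The function name local:square is made up for illustration. Literals are the primitives, an operator combines them, and a function declaration names the result so it can be reused.

```xquery
(: abstraction: give a compound expression a name :)
declare function local:square($n as xs:integer) as xs:integer {
  $n * $n
};

(: primitives (2 and 1), combined with an operator, handed to the abstraction :)
local:square(2 + 1)
```

Run as a complete query, this should return 9.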

Setup

  1. Either or
    • Sit down at a lab machine and start it up.
  2. Launch BaseX.

Brief examination of interface

  • Menu bar.
  • Tool bar.
  • XQuery window.
  • Result window.

Some atomic values

Q. Is
2
a legal XQuery expression?
A. Yes, because it's a literal value denoting an integer.
A literal is one legal form for XQuery expressions.

Some atomic values, cont'd

Q. Is
23456
a legal XQuery expression?
A. Yes, because it's also a literal value denoting an integer.

Some atomic values, cont'd

Q. Is
12345678901234567890
a legal XQuery expression?
A. Yes, because it's a literal value denoting an integer.
However, because it has more than 18 digits, conforming XQuery processors are not required to accept it.

Some atomic values, cont'd

Q. What is the value of
314
as an XQuery expression?
A. The integer 314.

Some atomic values, cont'd

Q. What is the value of
3.141592
as an XQuery expression?
A. The decimal number 3.141592.

Some atomic values, cont'd

Q. Is
.141592
a legal XQuery expression?
A. Yes.

Some atomic values, cont'd

Q. Is
3.
a legal XQuery expression?
A. Yes.

Some atomic values, cont'd

Q. Is
3.141592E3
a legal XQuery expression?
A. Yes.

Some atomic values, cont'd

Q. Is
3141592e-6
a legal XQuery expression?
A. Yes.

Some atomic values, cont'd

Q. Is
3.141592e333
a legal XQuery expression?
A. Yes. But it overflows.

Some atomic values, cont'd

Q. Is
3141592e-666
a legal XQuery expression?
A. Yes.

Some atomic values, cont'd

Q. Is
"3.141592"
a legal XQuery expression? If so, what is its value?
A. Yes; its value is the eight-character string 3, decimal-point, 1, 4, 1, 5, 9, 2.

Exercise 1

Kernighan and Ritchie write

The first program to write is the same for all languages:

Print the words

hello, world

Write a hello-world program in XQuery.
A. You have got to be kidding.
Well, OK, if you insist:
"Hello, world."

Some atomic values, cont'd

Q. Is
"Želim naručiti čaša piva, molim."
a legal XQuery expression? If so, what is its value?
A. Yes; its value is (in one sense) the string “Želim naručiti čaša piva, molim.”
In another sense, its value is very high: it means “I'll have a glass of beer, please.”

Some atomic values, cont'd

Q. Is
"Želim naručiti čaša piva."
a legal XQuery expression? If so, what is its value?
A. Yes; its value is again the string “Želim naručiti čaša piva.”

Some atomic values, cont'd

Q. Is
'hello, world'
a legal XQuery expression? If so, what is its value? And what about
"hello, world"
?
A. Yes; their values are the strings of UCS characters enclosed within the quotation marks.

Some atomic values, cont'd

Q. Is
"'Hello, world!" he shouted.'
a legal XQuery expression? If so, what is its value?

Some atomic values, cont'd

Q. Is
"""Hello, world!"" he shouted."
a legal XQuery expression? If so, what is its value?

Some atomic values, cont'd

Q. Is
""Hello, world!" he shouted."
a legal XQuery expression? If so, what is its value?

Some atomic values, cont'd

Q. Is
'""Hello, world!"" he shouted.'
a legal XQuery expression? If so, what is its value?
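For experimenting with the quoting questions, here is a small sketch of the two ways a quotation mark can appear inside a string literal: use the other kind of quote as the delimiter, or double the delimiter. The sample strings are invented for illustration.

```xquery
(
  'She said "hello"',                         (: the other quote needs no escaping :)
  "She said ""hello""",                       (: a doubled delimiter stands for one quote :)
  'It''s here',                               (: same rule for single quotes :)
  string-length(""""),                        (: a string holding one double quote: 1 :)
  "She said ""hello""" eq 'She said "hello"'  (: the same string written two ways: true :)
)
```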

Syntax summary

To summarize*:
E ⇒* Literal // E for 'Expression'
Literal = Number | StringLit
Number = DIGITS // integer
| DIGITS . DIGITS? | . DIGITS // decimal
| DIGITS (. DIGITS?)? [eE] [+-]? DIGITS // double
| . DIGITS [eE] [+-]? DIGITS // double
StringLit = ' ([^'] | '')* ' // string
| " ([^"] | "")* " // string
* How many people understand these formulae?

Digression: EBNF notation

EBNF is a notation for precise definition of syntax. It uses:
Strictly speaking, “⇒*” means that the non-terminal on the left-hand side derives, directly or indirectly, the right-hand side. Almost always, here, the derivation is indirect; otherwise I'd just use the equals sign.

EBNF, defined in EBNF

Syntax = Rule+
Rule = Identifier = Expression
Expression = Term (| Term)*
Term = (Factor Rep?)+
Factor = Identifier
| String
| Terminal-symbol
| ( Expression )
Rep = ? | + | *
Identifier = NAME // as defined in XML
String = ‘"’ CHAR* ‘"’
| “'” CHAR* “'”
Terminal-symbol = (A | B | ... | Z)+
Warning: EBNF is designed for precision, but we'll be sloppy about some details here, for pedagogical reasons.

Syntax summary, revisited

Now let's look at that summary again:
E ⇒* Literal // E for 'Expression'
Literal = Number | StringLit
Number = DIGITS // integer
| DIGITS . DIGITS? | . DIGITS // decimal
| DIGITS (. DIGITS?)? [eE] [+-]? DIGITS // double
| . DIGITS [eE] [+-]? DIGITS // double
StringLit = ' ([^'] | '')* ' // string
| " ([^"] | "")* " // string
(Yes, the previous slide did not define “[^X]”; it means “any character except X”.)
There are other kinds of expression of course, not just literals. We'll see others later.
As a (formal) language, XQuery is a set of strings, sometimes called the sentences of the language, or the words of the language.
The class of words we are mostly concerned with here are expressions.
One kind of expression is a literal expression, or just literal for short. Read the first line: An expression (an E) can be a literal.
The symbol “⇒*” is used in discussions of formal languages to mean ‘derives’. For technical reasons, the grammar does not define E as Literal directly (i.e. it does not contain the rule E = Literal), but it could without changing the language.
A literal, in turn, is either a number or a string literal. (Pedants will say “either a numeral or ...”, but that requires explanation.)
There are various kinds of numbers, with or without decimal point and in normal or scientific notation.
Strings are quoted singly or doubly.

Types of values

Q. What's the difference between
3.141592
and
"3.141592"
?
A. They have different types. One is a decimal number, the other is a string of characters.

Some atomic values, cont'd

Q. Is
3.141592 instance of xs:decimal
an XQuery expression? If so, what is its value?
A. Yes. It's an InstanceOfExpression testing whether the left-hand argument of the instance of operator is an instance of the type named by the right-hand argument.
The value of the expression is the Boolean value true.

A new type of expression

E = E instance of SequenceType // returns boolean
SequenceType ⇒* QName
N.B. There are other kinds of sequence type expressions, but we won't need them for a while yet.

Some atomic values, cont'd

Q. Is
"3.141592" instance of xs:decimal
an XQuery expression? If so, what is its value?
A. Yes. It's an InstanceOfExpression testing whether the left-hand argument of the instance of operator is an instance of the type named by the right-hand argument.
The value of the expression is the Boolean false.

Some atomic values, cont'd

Q. If the type of 3.141592 is xs:decimal, what might the type of "3.141592" be?
That is, how could you replace “xs:decimal” in the expression below to elicit the value true?
"3.141592" instance of xs:decimal
A. If it's replaced with “xs:string”, the expression becomes true.
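A quick way to check which type a literal has is to line up several instance-of tests in one query. This is only a sketch using the literal forms from the earlier slides; the comments give the results the type hierarchy leads one to expect.

```xquery
(
  314 instance of xs:integer,           (: true :)
  314 instance of xs:decimal,           (: true: xs:integer is derived from xs:decimal :)
  3.141592 instance of xs:decimal,      (: true :)
  3.141592 instance of xs:integer,      (: false :)
  3.141592E3 instance of xs:double,     (: true: an exponent makes it a double :)
  "3.141592" instance of xs:string,     (: true :)
  "3.141592" instance of xs:decimal     (: false :)
)
```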

Typed values

Exercise:
Q. What other QNames might you substitute for “xs:decimal” in an instance-of expression?
Q. What atomic datatypes are built in to XQuery?
Q. Where would you go to look for a list?
Watch this space.

Typed values

Q. Is
2010-11-08
a member of the lexical space for type xs:date?
Q. Is
2010-11-08
a legal XQuery expression? If so, what are its value and type?
[Q. How do you find out the value and type of an expression?]
Watch this space.

Typed values

Q. Is
xs:date(2010-11-08)
a legal XQuery expression? If so, what are its value and type?
Watch this space.

Typed values

Q. Is
xs:date("2010-11-08")
a legal XQuery expression? If so, what are its value and type?
Watch this space.

A new type of expression

E ⇒* FunctionCall
FunctionCall = QName ( E (, E)* )

But what about ... ?

Q. Is
xs:date("2010-11-08")
a function call returning a value of type xs:date?
A. Yes.
Q. What if you want a value of some other type?
A. Try its type name.

A special class of functions

Q. Hmm. So
xs:date("2010-11-08")
constructs a value of type xs:date.
Are there analogous constructor functions for all the atomic types of XSD 1.0?
What examples can you think of?
A. Yes, with one exception: xs:NOTATION has no constructor function.

Another way to construct typed values

Q. Is
"2010-11-08" cast as xs:date
a legal XQuery expression?
If so what is its value?
A. Yes.
Its value is the date 8 November 2010.

Syntax of cast expression

E = E cast as SequenceType // returns a value of the target type
SequenceType ⇒* QName
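A sketch comparing the two routes to a typed value: the constructor function and the cast expression. The castable as operator goes slightly beyond these slides; it is part of XQuery 1.0 and asks whether a cast would succeed without actually raising an error.

```xquery
let $a := xs:date("2010-11-08"),
    $b := "2010-11-08" cast as xs:date
return (
  $a eq $b,                             (: true: same value either way :)
  $b instance of xs:date,               (: true :)
  "2010-11-08" castable as xs:date,     (: true :)
  "2010-11-31" castable as xs:date      (: false: November has 30 days :)
)
```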

Constructor functions examples

Q. Which of the following are legal XQuery expressions? What are their values and types?
xs:date("2010-11-08")
xs:date(2010-11-08)
2010-11-08 cast as xs:date
2010-11-08 cast as xs:gYear
"2010-11-08" cast as xs:gYear
xs:date("2010-11-08") cast as xs:gYear
xs:date(2010-11-08) cast as xs:gYear
A. Some.
Note that because constructors use casting rules, integer expressions like 2010-11-08 (another way of writing 1991) are accepted when the integer can be cast to the required type, but their quoted form (a string) is accepted only when the string is in the lexical space of the target datatype.
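To see why the unquoted form behaves so differently, it helps to evaluate it on its own. A small sketch follows. The parentheses matter, because instance of and castable as bind more tightly than the minus sign.

```xquery
(
  2010-11-08,                                (: a subtraction: 1991 :)
  (2010-11-08) instance of xs:integer,       (: true :)
  xs:date("2010-11-08") cast as xs:gYear,    (: the year 2010, as an xs:gYear :)
  (2010-11-08) castable as xs:date           (: false: an integer cannot be cast to a date :)
)
```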

More constructor examples

Q. Which of the following are legal expressions?
xs:integer(2010-11-08)
xs:integer("2010-11-08")
xs:integer(1989)
xs:integer("1989")
xs:byte(65)
xs:byte(127)
xs:byte(255)
xs:byte(-128)
xs:byte(-129)
xs:float("NaN")
xs:float("-INF")
A. Most.
Note that the range restrictions of types like xs:byte are enforced.
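The range check can be observed without triggering errors by asking whether each value is castable. This sketch borrows a for expression from a later session; xs:byte covers -128 through 127.

```xquery
for $n in (65, 127, 255, -128, -129)
return concat($n, ": ", $n castable as xs:byte)
```

The expected result is five strings: true for 65, 127 and -128, false for 255 and -129.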

Type coercions

Q. Which of the following are legal XQuery expressions? What are their values and types?
xs:date("2010-11-08") cast as xs:date
xs:date("2010-11-08") cast as xs:gYear
xs:untypedAtomic("2010-11-08")
xs:untypedAtomic("2010-11-08") cast as xs:date
xs:untypedAtomic("2010-11-08") cast as xs:gYear
(xs:untypedAtomic("2010-11-08") cast as xs:date) 
  cast as xs:gYear
There are complicated rules governing casting in section 17 of the Functions and Operators spec. We will not try to cover them here.
For the most part, they ensure that cases that obviously should work in a certain way do work that way. For the large gray area, you just need to try it out and/or look it up.

Function calls

Q. Are any of
ceiling(3.141592)
floor(3.141592)
round(3.141592)
legal XQuery expressions?

Function calls

Q. Are any of
concat("Syd", "Bauman")
fn:concat("Syd", " ", "Bauman -- ", "programmer extraordinaire")
legal XQuery expressions?
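A sketch for trying out a few function calls at once. The fn prefix is predeclared, so fn:concat and concat name the same function; the extra round() calls are added here only to show how ties are handled.

```xquery
(
  ceiling(3.141592),             (: 4 :)
  floor(3.141592),               (: 3 :)
  round(3.141592),               (: 3 :)
  round(2.5), round(-2.5),       (: 3 and -2: ties round toward positive infinity :)
  concat("Syd", " ", "Bauman"),  (: "Syd Bauman" :)
  fn:concat("TEI", "-", 2010)    (: non-string arguments are converted: "TEI-2010" :)
)
```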

Function calls

Q. Are any of
current-date()
current-time()
current-dateTime()
legal XQuery expressions?
Note the time zone in the date.

Digression: date/time datatypes, timezones, and function library

Q. What is the time zone doing in the value of current-date()?
Q. How do we get rid of it?
Q. Where might you look it up?

Timezones

Q. Would
adjust-date-to-timezone(
  current-date(),
  xs:duration("PT0H")
)
do the trick?

Timezones, cont'd

Q. How about
adjust-date-to-timezone(
  current-date(),
  ()
)
?
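A sketch for trying out the time-zone questions. The values you see depend on when and where you run it and on the processor's implicit time zone. The second parameter of adjust-date-to-timezone is declared as xs:dayTimeDuration?, so this sketch uses that type rather than xs:duration.

```xquery
let $today := current-date()
return (
  $today,                                                       (: carries the implicit time zone :)
  adjust-date-to-timezone($today, xs:dayTimeDuration("PT0H")),  (: adjusted to UTC, still has a zone :)
  adjust-date-to-timezone($today, ())                           (: the time zone is removed :)
)
```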

The built-in function library

Exercise:
What functions are built into every conforming XQuery processor?
Where do you find the list?

Quick tour of function library

More on some of these later.

Variable references

Q. Is
$x
a legal XQuery expression? If so, what is its type and value?
A. Sometimes.
It's a variable reference, and it's legal if and only if the variable is bound or in-scope at the place where “$x” occurs in the larger expression. (So by itself, no, it's not valid, but it may be valid in a larger context.)
Expressions don't have types; values do. So the type associated with “$x” is really the type of the value bound to x, and it varies with that value.
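Two common ways to bring a variable into scope are a prolog declaration and a let clause (let belongs to FLWOR, covered later). A minimal sketch; the names $greeting and $x are arbitrary.

```xquery
declare variable $greeting := "hello, world";

let $x := 314
return (
  $greeting,                     (: bound in the prolog :)
  $x,                            (: bound by let: 314 :)
  $x instance of xs:integer,     (: the type is that of the bound value: true :)
  $x + 1                         (: 315 :)
)
```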

Context item

Q. Is
.
a legal XQuery expression? If so, what is its type and value?
A. Sometimes.
It's a reference to the context item, and it's legal if and only if there is a context item in scope at the place where “.” occurs in the larger expression. (So by itself, no, it's not valid, but it may be valid in a larger context.)
The type of . depends on (or: is) the type of its value, so that will vary, too.

Parenthesized expressions

Q. If E is a legal XQuery expression, is
(E)
a legal XQuery expression? If so, what is its type and value?
A. Yes.
Value and type are those of E.
Nothing mysterious here; we just have to note explicitly that XQuery uses the same rules for parenthesized expressions that most languages do.

Grammar for primary expressions

We can now define the full class of primary expressions.
E = Primary // among other things
Primary = Literal | VarRef | FunctionCall | ( E? )
| . // context-item expression
VarRef = $ QName
Next: expressions with operators!

Arithmetic expressions

Q. Are any of
2 + 3
2 - 3
2 ^ 3
2 ** 3
(2 + 1) * 4
2 + 1 * 4
(2 + 1) / 4
2 + 1 / 4
legal XQuery expressions? If so, what are their values?
Watch this space.

Arithmetic expressions

Q. Are any of
2 / 3
2 div 3
2 idiv 3
2 mod 3
2 % 3
legal XQuery expressions? If so, what are their values?
Watch this space.

Grammar for arithmetic expressions

E = [+-]* E
E = E * E
E = E div E
E = E idiv E
E = E mod E
E = E + E
E = E - E
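A sketch collecting the arithmetic forms that the grammar above admits. The symbols ^, ** and % are not XQuery operators, and / is the path operator, so division is always spelled div or idiv.

```xquery
(
  2 + 3,          (: 5 :)
  2 - 3,          (: -1 :)
  (2 + 1) * 4,    (: 12 :)
  2 + 1 * 4,      (: 6: * binds more tightly than + :)
  2 div 3,        (: a decimal a little under 0.67 :)
  7 idiv 2,       (: 3: integer division :)
  7 mod 2,        (: 1 :)
  2 + 1 div 4     (: 2.25 :)
)
```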

Comparisons

Q. Are any of
xs:date("2010-11-08") = xs:date("2011-02-20")
xs:date("2010-11-08") = xs:date("2011-02-29")
legal XQuery expressions? If so, what are their values and types?
Watch this space.

Grammar for comparison expressions

E = E ValueComp E // value comparison
| E GeneralComp E // general comparison
| E NodeComp E // node comparison
ValueComp = eq | ne
| lt | le
| gt | ge
GeneralComp = = | !=
| < | <=
| > | >=
NodeComp = is | << | >>

General and value comparators

Value comparators expect singleton values on each side: 2 lt 3 and $x eq $y (where $x and $y must each have a singleton value).
General comparators take arbitrary sequences; the comparison is true if there is some pair of (L, R) values for which it is true.
let $a := (1, 2, 3),
        $b := (1, 1, 1, 1)
    return ($a = $a, $b = $b, $a = $b,
            $a != $a, $b != $b, $a != $b,
            $a < $a, $b < $b, $a < $b,
            $a > $a, $b > $b, $a > $b)
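For contrast, a small sketch that puts value and general comparisons side by side. The existential reading of the general comparators is what makes = and != both true for the same pair of operands.

```xquery
let $a := (1, 2, 3)
return (
  2 lt 3,          (: value comparison on singletons: true :)
  $a = 2,          (: general: some item equals 2: true :)
  $a != 2,         (: general: some item differs from 2: also true :)
  not($a = 2),     (: no item equals 2: false :)
  count($a) eq 3   (: true :)
  (: $a eq 2 would raise a type error: the left side is not a singleton :)
)
```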

Conditional expression

Q. Is
if (1 = 1) then "Equal!" else "Not equal!"
a legal XQuery expression? If so, what are its value and type?
Watch this space.

Conditional expression

Q. Is
if 1 = 1 then "Equal!" else "Not equal!"
a legal XQuery expression? If so, what are its value and type?
Watch this space.

Grammar for conditional expressions

E = if ( E ) then E else E
Note that the parentheses around the condition are required.
Conditions can use Boolean combinations:
E = E and E
E = E or E
N.B. not is not an operator, but a function: not(E).
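A sketch of a conditional with a compound condition. The else branch is mandatory in XQuery 1.0, so a chain of tests is written as else if. The variable $n and the result strings are invented for illustration.

```xquery
let $n := 7
return
  if ($n mod 2 = 0 and not($n = 0)) then "even, non-zero"
  else if ($n mod 2 = 1 or $n < 0) then "odd or negative"
  else "zero"
```

With $n bound to 7 the first condition fails and the second succeeds, so the query returns "odd or negative".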

Sequences

Q. What does the expression
/child::tei:TEI/child::tei:text/child::*
denote?
Q. One thing, or several?
Q. If several, how are they organized? Do they have an order? Are they a set? A bag? A sequence?
It denotes the children of the TEI text element.
Usually, that will be several items.
They are arranged in a sequence (in document order).

Sequences and closure

Q. If values in the data model can include integers, strings, nodes, and sequences of nodes (or integers or strings or ...), then how on earth can we achieve closure in the language? How can we ensure that every expression in the model denotes a value in the model?

Sequences and closure (2)

A. We achieve closure and uniformity by the rule:
Everything is a sequence.
If anything looks like it's not a sequence (say, the integer 23), it's because it's a singleton sequence.
If you know databases, you may think that this sounds a bit like the rule used in relational databases, which says that the value of count(*) is not an integer, but a relational table with a single row, which in turn has a single column.
If so, you're wrong. It's not “a little” like it. It's exactly like it. It may seem hokey, but it makes a huge difference to the formal semantics of the language, and thus to its cleanliness and its reliability.
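The rule can be probed directly. In this sketch the occurrence indicators + and * on a type name are one of the other kinds of sequence type mentioned earlier; they mean “one or more” and “zero or more”.

```xquery
(
  count(23),                            (: 1: an item is a sequence of length one :)
  deep-equal(23, (23)),                 (: true: parentheses add nothing :)
  23 instance of xs:integer+,           (: true :)
  (1, 2, 3) instance of xs:integer+,    (: true :)
  (1, 2, 3) instance of xs:integer,     (: false: three items, not one :)
  () instance of xs:integer*            (: true: the empty sequence :)
)
```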

Sequences and closure (3)

Q. So what do
23
1, 2, 3
//tei:text/*
denote?
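They denote, respectively, a sequence of one integer, a sequence of three integers, and a sequence of zero or more element nodes.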

Literal sequences

Wait a minute, there. What was that
1, 2, 3
you stuck in there?
OK, let's do it formally.
Q. Is
1, 2, 3
a legal XQuery expression? If so, what are its type and value?
A. Yes. It's a literal sequence constructor. Any sequence of expressions, separated by commas, denotes a sequence of values.
Happy now?

Grammar for literal sequence constructors

E = E , E
N.B. Sometimes parentheses are needed: “(E, E)” instead of “E, E”.
In some contexts, literal sequence constructors are not allowed. The grammar in the spec therefore distinguishes between Expr and ExprSingle. The grammar listed on the resources page similarly distinguishes E from E1.
But I'll ignore that distinction here in the slides. All you have to do is remember that comma binds very weakly, more weakly than any other operator. So if you need a literal sequence expression in some larger context, you pretty much always need parentheses.
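A sketch of where the parentheses are needed in practice. In a function call or a let clause a bare comma is read as a separator of arguments or bindings, not as the sequence operator.

```xquery
let $s := (1, 2, 3)      (: without parentheses the commas would separate let bindings :)
return (
  count((1, 2, 3)),      (: inner parentheses: one argument holding three items: 3 :)
  count($s),             (: 3 :)
  1 + count((4, 5))      (: 3 :)
  (: count(1, 2, 3) would be an error: there is no three-argument count :)
)
```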

Range expressions

Q. Are any of
1 to 10
(1 + 1) to (3 + 3)
(100 div 100) to (100 div 10)
legal XQuery expressions? If so, what are their values? If not, why not?
Yes. They are range expressions.

Grammar for range expressions

E = E to E
Both arguments of to should have integer values.
Q. What if the second is less than the first?
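One way to find out is to count what a range produces. A minimal sketch:

```xquery
(
  count(1 to 10),         (: 10 :)
  count(10 to 1),         (: 0: a descending range is the empty sequence :)
  reverse(1 to 3),        (: 3, 2, 1: the way to count down :)
  (2 + 1) to (2 * 3)      (: 3, 4, 5, 6 :)
)
```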

Sequences and sequences

Q. Can sequences contain atomic values?
Q. Can sequences contain nodes?
Q. Can sequences contain items?
Q. Can sequences contain sequences?

Sequences and sequences (2)

Q. Is the following a legal XQuery expression?
(1, 2, (3, 4, 5), 6, 7)
If so, what are its type and value? If not, why not?
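A. Yes. Nested sequences are flattened, so its value is the seven-integer sequence 1, 2, 3, 4, 5, 6, 7.
As for the four questions on the previous slide: yes, yes, yes, and no. Repeat: no.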
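Flattening is easy to confirm by counting. A sketch; the variable name $nested is arbitrary.

```xquery
let $nested := (1, 2, (3, 4, 5), 6, 7)
return (
  count($nested),                  (: 7, not 5 :)
  deep-equal($nested, 1 to 7),     (: true :)
  count(((), (1, ()), ((2))))      (: 2: empty sequences vanish :)
)
```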

Functions returning sequences

  • fn:string-to-codepoints( $arg as xs:string? ) as xs:integer*
  • fn:tokenize($input as xs:string?, $pattern as xs:string) as xs:string*
  • fn:tokenize($input as xs:string?, $pattern as xs:string, $flags as xs:string) as xs:string*
  • fn:in-scope-prefixes( $element as element()) as xs:string*
  • fn:collection() as node()*
  • fn:collection($arg as xs:string?) as node()*

Functions taking sequences

  • fn:codepoints-to-string( $arg as xs:integer* ) as xs:string
  • fn:string-join($arg1 as xs:string*, $arg2 as xs:string) as xs:string
  • fn:deep-equal($parameter1 as item()*, $parameter2 as item()*) as xs:boolean
  • fn:count($arg as item()*) as xs:integer
  • fn:avg($arg as xs:anyAtomicType*) as xs:anyAtomicType?
  • fn:max($arg as xs:anyAtomicType*) as xs:anyAtomicType?
  • fn:min($arg as xs:anyAtomicType*) as xs:anyAtomicType?
  • fn:sum($arg as xs:anyAtomicType*) as xs:anyAtomicType
  • fn:sum($arg as xs:anyAtomicType*, $zero as xs:anyAtomicType?) as xs:anyAtomicType?
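A few of the signatures above, exercised in one query. This is only a sketch with invented sample data; note that the overall result is itself one flat sequence.

```xquery
(
  string-to-codepoints("TEI"),                        (: 84, 69, 73 :)
  codepoints-to-string((84, 69, 73)),                 (: "TEI" :)
  count(tokenize("to be, or not to be", "[\s,]+")),   (: 6 tokens :)
  string-join(("a", "b", "c"), "-"),                  (: "a-b-c" :)
  count(("a", "b", "c")),                             (: 3 :)
  sum((1, 2, 3)), avg((1, 2, 3)), max((1, 2, 3)),     (: 6, 2, 3 :)
  sum(())                                             (: 0 for the empty sequence :)
)
```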

Functions mapping sequences to sequences

  • fn:index-of($seqParam as xs:anyAtomicType*, $srchParam as xs:anyAtomicType) as xs:integer*
  • fn:distinct-values($arg as xs:anyAtomicType*) as xs:anyAtomicType*
  • fn:insert-before($target as item()*, $position as xs:integer, $inserts as item()*) as item()*
  • fn:remove($target as item()*, $position as xs:integer) as item()*
  • fn:reverse($arg as item()*) as item()*
  • fn:subsequence($sourceSeq as item()*, $startingLoc as xs:double) as item()*
  • fn:subsequence($sourceSeq as item()*, $startingLoc as xs:double, $length as xs:double) as item()*
  • fn:unordered($sourceSeq as item()*) as item()*
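The sequence-to-sequence functions can be tried the same way. A sketch on an invented four-item sequence; positions count from 1.

```xquery
let $s := (10, 20, 10, 30)
return (
  index-of($s, 10),              (: 1, 3 :)
  count(distinct-values($s)),    (: 3 :)
  insert-before($s, 2, 15),      (: 10, 15, 20, 10, 30 :)
  remove($s, 1),                 (: 20, 10, 30 :)
  reverse($s),                   (: 30, 10, 20, 10 :)
  subsequence($s, 2, 2)          (: 20, 10 :)
)
```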

Filter expressions

Any sequence can be filtered through a set of predicates.
For example: odd numbers:
(1 to 100)[. mod 2 = 1]
Numbers not divisible by 2 or 3 (throdd numbers):
(1 to 100)[. mod 2 = 1]
[. mod 3 ne 0]
The beginning of the Sieve of Eratosthenes:
(1 to 100)[. mod 2 = 1][. mod 3 ne 0]
[. mod 5 ne 0][. mod 7 ne 0][. mod 11 ne 0]
[. mod 13 ne 0][. mod 17 ne 0][. mod 19 ne 0]

Filter expressions (syntax)

FilterExpr = Primary Predicate*
Predicate = [ E ]
A generalization of XPath 1.0 predicates (just as sequences generalize nodesets).
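As in XPath 1.0, a numeric predicate selects by position, and predicates apply left to right, each to the result of the one before. A sketch showing that the order of predicates matters:

```xquery
(
  (11 to 20)[3],                               (: 13: the third item :)
  (11 to 20)[position() gt 8],                 (: 19, 20 :)
  (1 to 20)[. mod 2 = 1][3],                   (: 5: the third odd number :)
  (1 to 20)[3][. mod 2 = 1],                   (: 3: the third number, kept because it is odd :)
  ("a", "bb", "ccc")[string-length(.) gt 1]    (: "bb", "ccc" :)
)
```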

Nodes and node constructors

Q. Is
<e/>
a legal XQuery expression? If so, what does it denote?
A. An (not the) element whose name is “e”, which has no attributes and no content.

Nodes, cont'd

Q. Are
<e></e>
and
<e/>
equal to each other?
A. Yes.

Nodes, cont'd

Q. Is
<e></e> = <f/>
true or false?
A. It's true.

Nodes, cont'd

Q. Is
deep-equal(<e></e>, <f/>)
true or false?
A. It's false.

Nodes, cont'd

Q. Which of the following are true, and which false?
deep-equal(<e></e>, <e/>)
deep-equal(<e> 23 </e>, <e/>)
deep-equal(<e> </e>, <e/>)
deep-equal(<e> <!-- ? -->  </e>, <e/>)
deep-equal(<e xmlns="http://example.org/ns1"></e>, <e/>)
deep-equal(<e xmlns=""></e>, <e/>)
deep-equal(<e foo=""> </e>, <e foo=" "/>) 
A. Some are true, some are false. Try it!

Nodes, cont'd

Q. Is
<e></e> = <e/>
true or false?
A. It's true. But it doesn't mean what you think.
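What = actually compares here is the atomized (string) content of the two elements, which is why element names play no part. A sketch contrasting the different notions of sameness; the variable names are arbitrary.

```xquery
let $e := <e></e>,
    $f := <f/>
return (
  data($e) = data($f),    (: both atomize to a zero-length value: true :)
  $e = $f,                (: the same comparison, written directly: true :)
  deep-equal($e, $f),     (: false: the names differ :)
  $e is $e,               (: true: node identity :)
  $e is <e/>              (: false: each constructor builds a new node :)
)
```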

Two kinds of constructors

  • literal constructors: see above.
  • computed constructors: allow names to be calculated.
element book { 
   attribute isbn {"isbn-0060229357" }, 
   element title { "Harold and the Purple Crayon"},
   element author { 
      element first { "Crockett" }, 
      element last {"Johnson" }
   }
}

Why computed constructors?

Computed constructors allow the choice of names to be delayed until run time. A computed name is written as an expression in braces.
element { $elem-name } { 
   attribute isbn { $isbn-value }, 
   element title { $item-title },
   element author { 
      element { $firstname-gi } { "Crockett" }, 
      element { $lastname-gi } { "Johnson" }
   }
}
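A runnable sketch of a run-time name. It assumes, as the XQuery 1.0 grammar requires, that a computed name is written as an expression inside braces; the variables are bound with let purely for illustration.

```xquery
let $elem-name := "book",
    $isbn-value := "isbn-0060229357"
return
  element { $elem-name } {
    attribute isbn { $isbn-value },
    element title { "Harold and the Purple Crayon" }
  }
```

The result should be a book element with an isbn attribute and a single title child.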

Building a database

An XML tree

Quick tour of the XDM

  • items = nodes + atomic values
  • nodes = document + element + attribute + comment + processing-instruction + text + namespace
  • atomic values = values of XSD simple types
  • nodes arranged in trees, related by axes:
    • parent, child
    • ancestor, descendant
    • preceding-sibling, following-sibling
    • preceding, following
  • total* document ordering (>>, <<) on each document
    • The document node precedes all others.
    • Parents precede children.
    • Children are ordered.
    • Namespaces precede attributes, attributes precede children.

Path expressions

Path expression: sequence of '/'-separated steps.
/* simplified */
PathExpr ::= RelativePathExpr
           | ("/" RelativePathExpr?)
RelativePathExpr ::= StepExpr ('/' StepExpr)*
Step: axis, name or node test, predicates.
/* simplified */
StepExpr ::= Axis NodeTest Predicate*

Path expressions (examples)

  • child::p
  • child::*
  • child::text()
  • child::node()
  • attribute::type
  • attribute::*
  • parent::node()
  • descendant::p
  • ancestor-or-self::p
  • child::p[position() = 1]
  • child::p[1]
  • child::p[position() = last()]
  • child::p[last()]
  • child::p[attribute::type='warning']
  • attribute::type='warning'
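To try steps like these without loading a file, one can build a tiny document inline. This is a sketch: the two-paragraph TEI fragment is invented, and the tei prefix is bound to the TEI namespace in the prolog.

```xquery
declare namespace tei = "http://www.tei-c.org/ns/1.0";

let $doc := document {
  <TEI xmlns="http://www.tei-c.org/ns/1.0">
    <text><body><p type="warning">One</p><p>Two</p></body></text>
  </TEI>
}
return (
  count($doc/child::tei:TEI/child::tei:text/child::*),          (: 1: the body :)
  string($doc/descendant::tei:p[position() = last()]),          (: "Two" :)
  string($doc/descendant::tei:p[attribute::type = 'warning']),  (: "One" :)
  count($doc/descendant::tei:p/child::text())                   (: 2 :)
)
```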

Path Expressions

This is the full (unabbreviated) syntax only.
Path = (/ RelPath?) // absolute path expression
| RelPath // relative path expression
RelPath = Step (/ Step)*
Step = FilterExpr | AxisStep
AxisStep = AxisName :: NodeTest Predicate*
AxisName = child | parent
| descendant | ancestor
| descendant-or-self | ancestor-or-self
| following | preceding
| following-sibling | preceding-sibling
| self | attribute | namespace
NodeTest = KindTest | NameTest
KindTest = document-node( ElementTest? )
| element( NameAndOrType? )
| schema-element( QName )
| attribute( NameAndOrType? )
| schema-attribute( QName )
| processing-instruction( NCName | StringLit )
| comment()
| text()
| node()
ElementTest = element( NameAndOrType? )
| schema-element( QName )
NameAndOrType = * // anything
| QName // element or attribute of this name
| QName , QName // this name, this type
| * , QName // any name, this type
NameTest = QName // that name
| * // anything
| (NCName : *) // any name in that namespace
| (* : NCName) // that name in any namespace
Predicate = [ E ]

Abbreviated Path Expressions

Add to the above:
Path = (// RelPath) // absolute path; // = /descendant-or-self::node()/
RelPath = Step (// Step)*
AxisStep = NodeTest // implicit child::
| @ NodeTest // implicit attribute::
| .. // = parent::node()
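
To check that the abbreviated and full forms select the same nodes, here is a small sketch against an invented div element (the sample data is an assumption, not part of the course materials). Each comparison should come out true.

```xquery
(: Invented sample data for illustration. :)
let $div :=
  <div>
    <p type="warning">Mind the gap.</p>
    <p>Nothing to see here.</p>
    <p type="note">Last paragraph.</p>
  </div>
return (
  (: full syntax vs. abbreviated syntax: the same single node :)
  $div/child::p[attribute::type = 'warning'] is $div/p[@type = 'warning'],
  (: position() = last() vs. the shorthand [last()] :)
  $div/child::p[position() = last()] is $div/p[last()],
  (: .. abbreviates parent::node() :)
  $div/p[1]/.. is $div
)
```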

Functions for XDM nodes

Back to the function library:
  • ...
  • accessors for XDM node properties: node-name(), nilled(), string(), data(), base-uri(), document-uri()
  • functions on XDM nodes: name(), local-name(), namespace-uri(), number(), lang(), root()
  • operators on XDM nodes: is, <<, >>
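
A few of these functions and operators in action, as a sketch on an invented two-line fragment (the element names follow TEI practice, but the wrapper and the variable names are made up for the example).

```xquery
let $doc := document {
  <lg type="couplet">
    <l n="1">Rid of the world's injustice, and his pain,</l>
    <l n="2">He rests at last beneath God's veil of blue.</l>
  </lg>
}
let $first := $doc/lg/l[1],
    $second := $doc/lg/l[2]
return (
  name($first),             (: "l" :)
  string($second/@n),       (: "2" :)
  root($second) is $doc,    (: true: root() climbs to the document node :)
  $first << $second,        (: true: $first precedes $second in document order :)
  $first is $second         (: false: two distinct nodes :)
)
```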

XPath poker

Instructor starts.
  • Current player formulates a question to be answered by formulating an XPath (e.g. “find all the line breaks in the document”). Assume TEI markup.
  • Pick next player at random (using random number generator).
  • If next player answers correctly, they get a point and become the current player.
  • If next player cannot answer correctly, they gain no point and the current player must answer the question. If they cannot, they lose a point.
Winners get rewards.

Optimization

Which is going to be faster?
collection()/id('sha-son')
or
collection()//*[@xml:id = 'sha-son']
?
I would expect the first to be faster, possibly much faster.
In BaseX, on unadorned Shakespeare, the first takes 5000 ms, the second 100, a difference of a factor of 50.
Moral 1: don't make premature guesses about time and optimization.
Moral 2: optimization is very uneven and digital. Re-formulating something several different ways may make a big difference. (But quite possibly a different difference for each implementation.)

Session 3: FLWOR Expressions

  1. Introduction: XQuery origin, history, context
  2. Atomic values, nodes, sequences
  3. FLWOR expressions
    • basics: FLWOR expressions and their meaning
    • exercises with integers
    • exercises with words and word lists
    • reference lists, concordances
  4. Functions, modules, collections
  5. Full-text search
  6. Miscellaneous, Q/A, and Wrapup

A language design problem

Given that (path) expressions can denote sequences of more than one value, how do we deal individually with each value in the sequence?
Three ways:
  • implicit iteration (the / operator, predicates)
  • recursive functions (handle one, recur on rest, terminate on empty sequence)
  • explicit iteration using FLWOR expressions
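
The three approaches side by side, as a sketch on a sequence of integers. The function name local:squares is invented for the example. The predicate stands in for implicit iteration here because the / operator wants nodes on its left-hand side.

```xquery
(: Recursive version: handle the first item, recur on the rest,
   stop on the empty sequence. :)
declare function local:squares($seq as xs:integer*) as xs:integer* {
  if (empty($seq)) then ()
  else ($seq[1] * $seq[1], local:squares(subsequence($seq, 2)))
};

let $nums := 1 to 5
return (
  (: 1. implicit iteration: the predicate is tested once per item :)
  $nums[. mod 2 = 1],
  (: 2. recursion over the sequence :)
  local:squares($nums),
  (: 3. explicit iteration with a FLWOR expression :)
  for $n in $nums return $n * $n
)
```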

FLWOR expressions

for ... let ... where ... order by ... return
  • the for clause: binding variables
  • the let clause: binding dependent variables
  • joins: binding several independent variables
  • the where clause: selection
  • the order by clause: sorting
  • return clauses

The for clause

Binds one or more variables to each value in a sequence, in turn.
Are these legal XQuery expressions?
(: 1 :)    for $x in 1 to 10 
        return $x 
(: 2 :)    for $x in 1 to 10 
        return <e> { $x } </e>
What do they return?

Digression: comment syntax

In the following XQuery expressions, what the heck are (: 1 :) and (: 2 :)?
(: 1 :)    for $x in 1 to 10 
        return $x 
(: 2 :)    for $x in 1 to 10 
        return <e> { $x } </e>
What do they do / mean / return?

Multiple for-clause variables

Is this a legal expression?
   for $x in 1 to 10,
       $y in reverse(1 to 10)
return <e> { $x, $y, $x - $y } </e>
If so, what is its value? What is its type?
In what order will the triples appear?

The let clause

Binds a variable to a single value.
Is this a legal XQuery expression?
   let $x := 1 to 10
return <e> { $x } </e>
If so, what are its type and value?

For vs. Let

How many bindings does
for $x in //l
produce?
How many bindings does
let $x := //l
produce?

For vs. Let (2)

What is the value of
for $x in //l
return count($x)
? And of
let $x := //l
return count($x)
?
Does length($x) work too?

Contrasting for and let

Quiz: why do these return values that look the same?
(: 1 :)    for $i in 1 to 10
        return $i
(: 2 :)    let $i := 1 to 10
        return $i

Contrasting for and let

Quiz: why do these return values that look different?
(: 1' :)    for $i in 1 to 10
         return <e> { $i } </e>
(: 2' :)    let $i := 1 to 10
         return <e> { $i } </e>

Multiple let-clause variables

Is this a legal expression?
   let $x := 1 to 10,
       $y := reverse(1 to 10)
return <e> { $x, $y, $x - $y } </e>
If so, what is its value? What is its type?
A. No, it's not legal.
Q. Why not?
A. The subtraction operator expects singleton values on each side, not sequences.
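
If what is wanted is the pairwise difference, one way (a sketch, not the only one) is to iterate over positions, so that each subtraction sees two singletons.

```xquery
let $x := 1 to 10,
    $y := reverse(1 to 10)
return
  <e>{
    (: index both sequences by position; each operand is a single integer :)
    for $i in 1 to count($x)
    return $x[$i] - $y[$i]
  }</e>
```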

Comparing for and let

Quiz: Do these return the same value(s), or different?
(: 1 :)    for $i in 3.14159
        return ('Pi is about ', $i, '.')
(: 2 :)    let $i := 3.14159
        return ('Pi is about ', $i, '.')
Pop quiz: Are the parentheses around the return value important? useful? legal? Why / why not?

The usual use of let

Use let for values used repeatedly, and for values that depend on the current value of something else.
Is this a legal XQuery expression?
   for $x in 1 to 10
   let $x-squared := $x * $x
return <square n="{$x}"> { $x-squared } </square>
If so, what are its type and value?

Using let for dependent variables

Exercise: Zero your buffer. Then write FLWOR expressions which return, for the numbers 1 to 10,
  • the sequence of numbers
  • the sequence of their squares
  • the sequence of the difference between each square and its predecessor
  • the sequence of the difference between each difference and its predecessor
Does this suggest another way to generate a table of squares?

Sample solution

Sample solution:
let $diffs := 
    let $squares := 
        for $i in 0 to 10
        return $i * $i
    for $n in 1 to 10
    return $squares[$n + 1] - $squares[$n]
for $n in 1 to 9
return $diffs[$n + 1] - $diffs[$n]
Note the interleaving of for and let.

The order by clause

What might a clause introduced by order by do?
Q. Is the following a legal XQuery expression?
let $h := my:hailstorm(91,50)
(: $h is a hailstorm-number sequence beginning with 91:
  (91, 274, 137, 412, 206, 103, 310, 155, 
    466, 233, 700, 350, 175, 526, 263, 790, 395, 1186, 
    593, 1780, 890, 445, 1336, 668, 334, 167, 502, 251, 
    754, 377, 1132, 566, 283, 850, 425, 1276, 638, 319, 
    958, 479, 1438, 719, 2158, 1079, 3238, 1619, 4858, 
    2429, 7288, 3644, 1822 ) :)
for $i in $h
order by $i
return $i
If so, what are its type and value? If not, why not?

The order by clause

Q. Is the following a legal XQuery expression?
for $i in //l[1]
order by $i
return $i
If so, what are its type and value? If not, why not?

The return clause

The return clause is evaluated once per tuple; its argument can be any expression.
For example,
   for $i in 1 to 100
   let $s := $i * $i
return $s
can also be written
   for $i in 1 to 100
return $i * $i

FLWOR expression syntax

FLWOR = (For | Let)+ Where? OrderBy? return E1
For = "for" ForVar (, ForVar)*
Let = "let" LetVar (, LetVar)*
ForVar = VarRef TypeDecl? PositionVar? in E1
LetVar = VarRef TypeDecl? := E1
TypeDecl = as SequenceType
PositionVar = at VarRef
Where = where E1
OrderBy = stable? order by Order (, Order)*
Order = E1 OrderMod
OrderMod = (ascending | descending)?
(empty greatest | empty least)?
(collation StringLit)? // StringLit is a URI
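
A sketch that exercises most of the grammar at once: a positional variable, a type declaration, a where clause, and a stable descending sort. The word list is invented.

```xquery
for $w at $pos in ("Rid", "of", "the", "world's", "injustice")
let $len as xs:integer := string-length($w)
where $len > 2
(: stable: words of equal length keep their original relative order :)
stable order by $len descending
return <w pos="{$pos}" len="{$len}">{ $w }</w>
```

Note that $pos records the position in the input sequence, not the position after sorting.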

Joins

The where clause can sometimes be replaced by making the for or let more complex:
for $char in doc("characterlist.xml")//character[@sex='1']
for $speech in doc("roj.xml")//sp
where $speech/@who = $char/@id
return <hit>{$speech}</hit>
or
for $char in doc("characterlist.xml")//character[@sex='1']
for $speech in doc("roj.xml")//sp[@who = $char/@id]
return <hit>{$speech}</hit>
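
To try the two formulations without the files at hand, here is a self-contained sketch. The $cast and $play elements are invented stand-ins for the character list and the play, and the final deep-equal() call simply checks that both versions of the join find the same speeches.

```xquery
(: Invented stand-ins for the character list and the play. :)
let $cast :=
  <cast>
    <character id="romeo" sex="1"/>
    <character id="juliet" sex="2"/>
  </cast>
let $play :=
  <play>
    <sp who="romeo"><l>But soft, what light through yonder window breaks?</l></sp>
    <sp who="juliet"><l>Ay me!</l></sp>
    <sp who="romeo"><l>She speaks.</l></sp>
  </play>
(: join expressed with a where clause :)
let $with-where :=
  for $char in $cast/character[@sex = '1'],
      $speech in $play/sp
  where $speech/@who = $char/@id
  return $speech
(: the same join expressed with a predicate :)
let $with-predicate :=
  for $char in $cast/character[@sex = '1'],
      $speech in $play/sp[@who = $char/@id]
  return $speech
return deep-equal($with-where, $with-predicate)
```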

Session 4: Functions, modules, collections

  1. Introduction: XQuery origin, history, context
  2. Atomic values, nodes, sequences
  3. FLWOR expressions
  4. Functions, modules, collections
    • collections
    • modules, imports
    • function declarations
    • collecting topics for tomorrow
  5. Full-text search
  6. Miscellaneous, Q/A, and Wrapup

Questions for tomorrow

We have some open time tomorrow.
Questions you want covered?

Session 5: FLWOR expressions, continued

  1. Introduction: XQuery origin, history, context
  2. Atomic values, nodes, sequences
  3. FLWOR expressions
  4. Functions, modules, collections
  5. FLWOR expressions, cont'd (in place of Full-text search)
    • group work:
      • making word lists
      • word frequency lists
    • individual work:
      • Mendenhall spectra, Zipf's law
      • Word-tagged Shakespeare
      • Rhyme propagation in Wilde sonnets
      • Gorbals data
  6. Miscellaneous, Q/A, and Wrapup

FLWOR exercises: Word lists

Make a database of any one of Wilde's sonnets (pointed to from Sample data page).
  • Preparatory work: Given a string, break it into blank-delimited words.
  • Make an alphabetical list of the words in the sonnet.
  • Make a frequency list of the words in the sonnet. (I.e. for each distinct word type, count its occurrences and emit both the type and the count, yoked together.)
  • Sort by word type. Sort by frequency.
  • Examine the shortcomings of the simple tokenization on blanks: what goes wrong? Fix some of the problems.
  • Apply to Shakespeare's sonnets (the ‘unadorned’ version).

FLWOR exercises: Zipf's law

Zipf's law says (simplifying) that if the most frequent word in a corpus has n occurrences, the next most frequent will have n/2 occurrences, the third n/3, and item i in the list will have n/i.
Let's test this on our work:
  • Make a frequency list, sorted by descending frequency.
  • For each item in the list, calculate the number of occurrences predicted by Zipf's Law.
  • For each item in the list, calculate the ratio of actual occurrences to expected occurrences.

FLWOR exercises: Mendenhall

Thomas Corwin Mendenhall: did Bacon write Shakespeare?
Calculated word spectra: frequency distribution of words of various lengths.
Can we do in seconds what took him years? Calculate a Mendenhall word spectrum for the text.
  • Make a word list, sorted by descending length of word.
  • For each word length in the list, calculate the number of tokens in the text of that length.
  • (Optional) Given a sequence of number values falling in a particular range, make a bar graph of the value distribution. (It can be very primitive.)
  • How might we make this work more efficient?
  • How might we produce better graphic display of the results?

FLWOR expressions: Bibadorned Shakespeare sonnets

The WordHoard Shakespeare supplies lemmas and part of speech information for each word in the corpus.
  • Make a list of all pos values.
  • Make a word list; does this list agree with the one you made for the unadorned Shakespeare?
  • Re-run your word frequency list, Zipf, and Mendenhall code on the word-tagged corpus; modify as needed.
  • Make a list of words in the sonnets together with the sonnet number and line number where they occur.
  • Test Zipf's law on the lemmatized forms of words instead of on the inflected forms.

FLWOR expressions: restructuring / rearranging

The CELT encodings of Wilde's sonnets use the rhyme attribute to give the rhyme scheme. For example:
<div0 type="poem" lang="en">
<head>THE GRAVE OF KEATS</head>
<lg1 type="poem" met="sonnet" rhyme="abbaabba ccddee">
<lg2 type="octet" rhyme="abbaabba">
<l n="1">Rid of the world's injustice, and his pain,</l>
<l n="2">He rests at last beneath God's veil of blue.</l>
<l n="3">Taken from life when life and love were new</l>
<l n="4">The youngest of the martyrs here is lain,</l>
...
Reformat it to include the rhyme-code letter as an attribute rhyme on the l elements:
<div0 type="poem" lang="en">
<head>THE GRAVE OF KEATS</head>
<lg1 type="poem" met="sonnet" rhyme="abbaabba ccddee">
<lg2 type="octet" rhyme="abbaabba">
<l n="1" rhyme="a">Rid of the world's injustice, and his pain,</l>
<l n="2" rhyme="b">He rests at last beneath God's veil of blue.</l>
<l n="3" rhyme="b">Taken from life when life and love were new</l>
<l n="4" rhyme="a">The youngest of the martyrs here is lain,</l>
...

FLWOR expressions: Examples using Gorbals data

Open and explore the Gorbals data from 1851 and 1881. Build a BaseX database for the directory. We'll focus on the nested form.

Exploring the Gorbals data

Some questions:
  • How many households are represented in the sample?
  • Some of them have vacuous values for @rooms — how many have vacuous and how many non-vacuous values?
  • What other attributes on /gorbals-household-records/household sometimes have missing values?
  • What values appear for number of rooms? Address?
  • For individuals, what values appear for sex? Age? Occupation? Town of birth?

Historical information in the Gorbals data

For this data, a historian has posed the following tasks:
  • Retrieve a list of male heads of households' occupations and birthplaces.
  • Restrict the list to male heads of households born in Glasgow.
  • List the occupations and birthplaces given for sons of all household heads. (Extend to include stepsons.)
  • For all residents of Norfolk Street (that is, all individuals in the sample), give surname, forename, and age. Sort by surname; sort by age. Sort by forename (are there fashions in forenames?).
  • Count the Glasgow-born residents of Norfolk Street in 1881. In 1851. (Unfair question; I haven't told you how to distinguish.)
  • Find the average age of all heads of household. Does it differ between 1851 and 1881?
  • What is the average number of rooms per household? (Beware missing data!)
  • Find all Glasgow-born heads of household aged 25 to 50, in households with one or two rooms, with no servants.
  • Find all Irish-born female heads of household with three or more children under six in the household.

Calculating a table of squares with Babbage

Optional (time permitting).
Exercise:
  • Let 2 be the standard second-level difference.
  • For the first-level difference, start with 1 and let the next first-level difference be the sum of the previous first-level difference and the second-level difference.
  • For the value, start with zero and let the next value be the sum of the previous value and the first-level difference.

Sample solution

(: Imitation Babbage :)
let $table-length := 100
let $d2 := for $n in 1 to $table-length 
           return 2
(: return $d2 :)
let $d1 := for $n in 1 to ($table-length - 1)
           return 1 + sum( $d2[position() < $n])
(: return $d1 :)
let $values := for $n in 1 to ($table-length - 2)
               return 0 + sum( $d1[position() < $n])
return $values

This is correct, but a little free with Babbage's instructions. We'll come back to this problem.

Session 6: Miscellaneous topics, Q/A, wrapup

  1. Introduction: XQuery origin, history, context
  2. Atomic values, nodes, sequences
  3. FLWOR expressions
  4. Functions, modules, collections
  5. Full-text search
  6. Miscellaneous, Q/A, and Wrapup
    • collections
    • modules
    • function declarations
    • Questions and Answers and/or Group exercises
    • Wrapup

Collections

XQuery can operate on collections, not just single documents.
Syntax:
  • doc( URI )
  • collection( URI )
  • collection()

Four ways to set context

XQuery requires a context node. It can be set
  • using doc()
  • using collection()
  • passing in as external parameter
  • implementation-defined methods (e.g. user interface, BaseX File / Open dialog)

Making collections in BaseX

In GUI
  • index local file
  • index Web file
  • index directory in local file system
Using server commands:
  • add as {Name} {URI}
  • ...

Naming collections

BaseX collections have names:
for $doc at $pos in collection("Wilde")
return <doc n="{$pos}" href="{ base-uri($doc) }"/>
or
for $doc at $pos in collection()
return <doc n="{$pos}" href="{ base-uri($doc) }"/>

Modules

One query = set of modules.
One query = one main module, zero or more library modules.
Main module = prolog + query-body.
Query body = Expression.

Prolog

Query prolog sets context for evaluation of query: declarations, defaults, options.
For example:
declare namespace my = "http://blackmesatech.com/nss/exx";
declare namespace tei = "http://www.tei-c.org/ns/1.0";
Q. Where would you look for other prolog options?
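
A prolog can carry other kinds of declaration as well. A small sketch, with an invented global variable $min-length and an invented sample line (the authoritative list of declarations is in the prolog section of the XQuery specification):

```xquery
(: Setters and namespace declarations come first in the prolog. :)
declare boundary-space strip;
declare namespace tei = "http://www.tei-c.org/ns/1.0";
(: $min-length is a hypothetical global variable for this example. :)
declare variable $min-length as xs:integer := 4;

(: Query body: keep only the longer blank-delimited words of one line. :)
let $line := <tei:l n="1">Rid of the world's injustice, and his pain,</tei:l>
for $w in tokenize(normalize-space($line), " ")
where string-length($w) >= $min-length
return $w
```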

Library modules

Library module = module declaration + prolog
Module declaration:
module namespace tei = "http://www.tei-c.org/ns/1.0";
Use modules for reusing and sharing functionality.
URI gives namespace name; everything declared in the module goes into that namespace.

Imports

Main modules may import library modules.
import module 
   namespace tei = "http://www.tei-c.org/ns/1.0"
   at "http://example.org/libs/TEI.xq",
      "http://example.org/libs/TEI-extras.xqy";

Function declarations

To declare a function, you must specify its name, type, arguments (with types), and function body:
FunctionDecl = declare function QName
           ( ParamList )
           as SequenceType
           { E }

Function name

The function name is a QName.
Do not try to make it an NCName; beware the default function namespace.*

Function parameters

Quite conventional: a list of names, with types. Names are, of course, QNames.
ParamList = Param (, Param)*
Param = VarRef as SequenceType

Another word about the type system

Remember sequence types?

I didn't think so.

XQuery additions to the type system

XSD types use their normal QNames.
In addition, you can use kind tests, item(), and empty-sequence().
SequenceType = empty-sequence()
| ItemType [?*+]?
ItemType = KindTest | item() | QName
SingleType = QName ??
In case of doubt, item()* should always work.
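
One way to get a feel for sequence types is to test some values with instance of. A sketch (the values are arbitrary):

```xquery
(
  (1, 2, 3) instance of xs:integer+,     (: true: one or more integers :)
  () instance of xs:integer?,            (: true: zero or one integer :)
  () instance of empty-sequence(),       (: true :)
  (1, "a") instance of xs:integer*,      (: false: "a" is not an integer :)
  (1, "a") instance of item()*,          (: true: item()* accepts anything :)
  <l/> instance of element(l)            (: true: a kind test as item type :)
)
```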

Example

declare namespace my = "http://blackmesatech.com/nss/exx";

declare function my:hailstorm(
           $n as xs:integer, 
           $count as xs:integer
        ) as xs:integer* {
  let $newCount := $count - 1
  let $newN := if ($n mod 2 = 0) then 
                  $n idiv 2 
               else 
                   3 * $n + 1
  return if ($count > 0) then
    ($n, my:hailstorm($newN, $newCount))
  else
    $n
};

Using the hailstorm function

let $count := 50
for $n in 1 to 100
return <results n="{$n}"> { my:hailstorm($n,$count) } </results> 

A simple exercise

Consider the function string-to-codepoints(s).
It returns a sequence of integers.
I'd like to get a sequence of hex numbers in U+xxxx form.
How?

Babbage, revisited

Recall the method of finite differences (Newton) used by Babbage:
  • For i in 1 to ..., calculate v = p(i).
  • First-level difference diff-1(i) is p(i+1) - p(i)
  • Second-level difference for i is diff-1(i+1) - diff-1(i)
  • Third-level difference for i is diff-2(i+1) - diff-2(i)
  • ...
For equations of degree n, diff-n is constant (i.e. higher-level differences are zero).
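
To watch the differences settle down, here is a sketch that computes them for the cubes of 1 to 8. The function name local:diffs is invented for the example. For this cubic, the third-level differences should all be the same and the fourth-level differences should all be zero.

```xquery
(: Differences between neighbouring items of a sequence. :)
declare function local:diffs($seq as xs:integer*) as xs:integer* {
  for $i in 1 to count($seq) - 1
  return $seq[$i + 1] - $seq[$i]
};

let $cubes := for $i in 1 to 8 return $i * $i * $i
let $d1 := local:diffs($cubes),
    $d2 := local:diffs($d1),
    $d3 := local:diffs($d2),
    $d4 := local:diffs($d3)
return
  <differences>
    <d1>{ $d1 }</d1>
    <d2>{ $d2 }</d2>
    <d3>{ $d3 }</d3>
    <d4>{ $d4 }</d4>
  </differences>
```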

Function for quadratic and cubic equations

declare function my:vd1d2d3(
  $v as xs:integer,
  $d1 as xs:integer,
  $d2 as xs:integer,
  $d3 as xs:integer,
  $n as xs:integer) as xs:integer* {

  let $d2Next := $d2 + $d3,
      $d1Next := $d1 + $d2,
      $vNext := $v + $d1
  return if ($n <= 0) then 
            $v
         else 
            ($v, my:vd1d2d3($vNext, $d1Next, $d2Next, $d3, $n - 1))
         
};

Babbage, revisited

(: Imitation Babbage, revisited :)

let $table-length := 101,
    $v.init := 0,
    $d1.init := 1,
    $d2.init := 2,
    $d3.init := 0,
    $d4.init := 0
let $d4 := for $n in 1 to $table-length
           return $d4.init
let $d3 := ($d3.init, 
            for $n in 2 to $table-length 
            return $d3.init + sum($d4[position() < $n]))
(: return $d3 :)
let $d2 := ($d2.init,
            for $n in 2 to $table-length 
            return $d2.init + sum($d3[position() < $n]))
(: return $d2 :)
let $d1 := ($d1.init,
            for $n in 2 to $table-length 
            return $d1.init + sum( $d2[position() < $n]) )
(: return $d1 :) 
let $values := ($v.init,
                for $n in 2 to $table-length 
                return $v.init + sum( $d1[position() < $n]))
return $values

Testing

Try
my:vd1d2d3(0,1,2,0,101)
If that works, what does this compute?
my:vd1d2d3(0,1,6,6,10)

Notes for later attention

Original sessions 5 and 6

Session 5: Full-text search

  1. Introduction: XQuery origin, history, context
  2. Atomic values, nodes, sequences
  3. FLWOR expressions
  4. Functions, modules, collections
  5. Full-text search
    • basic concepts
    • full-text operators and functions
    • word lists revisited
  6. Miscellaneous, Q/A, and Wrapup

Requirements for full-text spec

Requirements for
  • language design
  • integration with environment
  • implementation
  • functionality and scope

Language design requirements (1)

Data model
  • operate on XDM

Language design requirements (2)

Side effects
  • no side effects

Language design requirements (3)

Score function and full-text predicates
  • full-text predicates independent of scoring
  • full-text predicates and scoring in same language (or subset/superset)

Language design requirements (4)

Score algorithm
  • score available to user
  • results sortable on score
  • score a double between 0 and 1
  • full details (corpus stats) of scoring not required
  • may partially define scoring

Language design requirements (5)

Combined score
  • scoring for combinations of full-text predicates
  • vendor-provided scoring algorithm
  • * user override of scoring algorithm
  • user influence on score components

Language design requirements (6)

Extensibility
  • vendor extensibility (must)
  • user extensibility (may)

Language design requirements (7)

First, future versions
  • framework for future versions

Language design requirements (8)

End user language
  • not a requirement

Language design requirements (9)

Searchable query
  • queries searchable

Language design requirements (10)

Universality
  • universal (Unicode)

Integration requirements

Integration:
  • usable in XPath
  • extensible in style of XPath
  • composable with self and XQuery
  • human-readable syntax available
  • XML-syntax available

Implementation requirements

Implementation
  • declarative specification, no prescribed implementation strategy

Functionality requirements

Functionality
  • single-word search
  • phrase search
  • stop-word support
  • single-character suffix
  • 0-more character suffix, prefix, infix
  • proximity searching (in words)
  • proximity with order
  • AND, OR, NOT
  • word normalization, diacritics
  • ranking, relevance

Scope requirements

Scope
  • search within arbitrary structure (XPath expression)
  • within constructed structures?
  • return arbitrary nodes
  • combination of predicates on different parts of tree
  • search within attributes
  • search across attributes and content
  • search within markup (element/attribute names)
  • distinguish markup from content
  • search across element boundaries (at least for NEAR)
  • elements form token boundaries (usually)
  • score accessible, generalizable

Setup

Build database of sample data from “XQuery and XPath Full Text 1.0 Use Cases” document (data/full-text-use-cases/sampledata.xml).

Basic concepts of full-text search

First approximation, basic notions:
  • Text is sequence of paragraphs, consisting of sentences, consisting of phrases, consisting of words (tokens).
  • Regions of text are marked up.
  • Markup boundaries (tags) imply word boundaries.
Consequences:
  • Some things are closer together than others.
  • Some words are in same paragraph or sentence, some aren't.

Complications for full-text search

Complications:
  • Desired search result set is fuzzy.
  • Paragraphs, sentences, tokens not necessarily marked up.
  • Words inflect.
  • Words may have near-synonyms.
  • Tokens not necessarily disjoint.
  • Phrases, sentences, and paragraphs don't necessarily nest.
  • Sequence not necessarily flat.
  • Tokens may sometimes cross markup boundaries.
  • Vendors compete hard on skill at recognizing tokens, sentences, paragraphs.

How do you handle that with the XDM?

Full-text search model

How do you handle that with the XDM? You don't.
  • Full-text operators operate on and return not XDM instances but instances of AllMatches model.
  • AllMatches instances describe all possible matches.
  • Each match is described with
    • required tokens (positive terms)
    • forbidden tokens (negative terms)
    • position
    • match info

Data for sample full-text queries

  • A collection of books
  • Each book with metadata and content
  • Metadata with title, author, publicationInfo, price, subjects ...
  • Content with introduction and parts
  • Parts with number and chapters, or container and components
  • Chapters with title and p[aragraph]s

Sample query

Find all book titles containing the word “usability”.
Find all book titles containing the word “test”.

Sample answers

First, just find the titles:
//title

Next, just the book titles:

/books/book/metadata/title

Next, those containing the word we want:

/books/book/metadata/title[contains(.,'usability')]
/books/book/metadata/title[contains(.,'test')]

Oops.

Sample answers

Fix case problem:

/books/book/metadata/title[contains(.,'Usability')]
/books/book/metadata/title[contains(.,'Test')]

But we don't want the personal name Usabilityguy.

/books/book/metadata/title[. contains text 'test']
/books/book/metadata/title[. contains text 'usability']
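
To see the difference between substring matching and token matching without the sample database, here is a sketch on two invented titles. It assumes a processor that supports XQuery Full Text (BaseX does); elsewhere the contains text expression will be rejected. The substring test should pick up both titles, the full-text test only the first, since “Usabilityguy” is a different token.

```xquery
(: Invented titles; the second is there only to show the difference. :)
let $titles := (
  <title>Improving Web Site Usability</title>,
  <title>Memoirs of Usabilityguy</title>
)
return
  <comparison>
    <substring>{ $titles[contains(lower-case(.), 'usability')] }</substring>
    <fulltext>{ $titles[. contains text 'usability'] }</fulltext>
  </comparison>
```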

Sample query

Find all the subject entries for books containing the phrase "usability testing".

Sample answer

/books/book/metadata/subjects/subject
  [. contains text "usability testing"]

Syntax

To summarize the syntax so far:
E = E contains text FTE FTIgnore?
FTE = Literal | { E }
The contains text operator returns a Boolean*, so it typically occurs in a predicate.
Another way to express it:
FT-Contains-Expr = SearchContext contains text FT-Selection

Sample query

Find all books with a subject containing the Chinese n-gram (phrase) “网站”

Sample answer

/books/book/metadata/subjects/subject[. contains text 
   "网站" using language "zh"]
Trick example: language support is implementation-defined. But this works:
/books/book/metadata/subjects/subject[contains(., 
   "网站")]

Syntax

In the previous example, “using language "zh"” is a full-text option; there are several:
FTE ⇒* FTE FTMatchOption*
FTMatchOption = using language StringLiteral
| using wildcards | using no wildcards
| using stemming | using no stemming
| using case sensitive | using case insensitive
| using lowercase | using uppercase
| using diacritics sensitive | using diacritics insensitive

Sample query

Find the title and author of every book with a subject entry containing the phrase "usability testing"

Sample answer

for $book in doc("http://bstore1.example.com/full-text.xml")
   /books/book
where $book/metadata/subjects/subject contains text "usability testing"
return $book/metadata/(title|author)

Sample query

Find book titles beginning either with “Improving usability” or with “Improving [some-word] usability”

Sample answer

   for $book in doc("http://bstore1.example.com/full-text.xml")
                /books/book
   let $title := $book/metadata/title 
 where $title contains text "improving" ftand "usability" 
              ordered 
              distance at most 2 words 
              at start
return $title

Sample query

I think the title was “Improving the usability of a web site through expert reviews and usability testing”, but maybe it was “Improve ...”?

Sample answer

for $book in doc("http://bstore1.example.com/full-text.xml")
   /books/book
let $exactTitle := $book/metadata/title
where $exactTitle contains text "improv.* the usability of a 
   web site through expert reviews and usability testing" 
   using wildcards entire content
return $exactTitle

Sample query

Find books containing a chapter which says “one of the best known lists of heuristics is Ten Usability Heuristics” without tripping over the different ways the title of the essay might be marked up.

Sample answer

for $book in doc("http://bstore1.example.com/full-text.xml")
   /books/book
let $chap := $book//chapter
where $chap contains text "one of the best known lists of 
   heuristics is Ten Usability Heuristics"
return $book

Sample query

Find books whose introductions contain the word “identify” (or any word beginning “identif...”).

Sample answer

for $book in doc("http://bstore1.example.com/full-text.xml")
   /books/book
let $intro := $book/content/(introduction|part/introduction)   
where $intro [./p contains text "identif.*" using wildcards]
return $book

Sample query

Find books whose short titles (given in the shortTitle attribute) contain the words “improve”, “web”, and “usability” (in any order, and allowing for inflections in “improve”)

Sample answer

for $book in doc("http://bstore1.example.com/full-text.xml")
   /books/book
where $book/metadata/title/@shortTitle contains text "improve" 
   using stemming ftand "web" ftand "usability" distance 
   at most 2 words    
return $book/metadata/title

Sample query

Find books with content containing either the word “way” or a four-letter word ending in “way”.
Exercise: find a less-implausible example of single-character wildcards!

Sample answer

   for $book in doc("http://bstore1.example.com/full-text.xml")
       /books/book
   let $cont := $book/content
 where $cont contains text ".?way" using wildcards
return $book

Exercises using full-text facilities

tbs
  • word lists, revisited
  • word frequency lists
  • proximity searches
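As a possible starting point for the proximity-search exercise, here is a minimal sketch. It assumes a processor with XQuery Full Text support (BaseX has it), and the inline paragraph is made-up sample data, not part of any course corpus. Try changing the distance and see when the answer flips.

```xquery
(: Sketch only: does the paragraph contain "grasses" and "dreams"
   with at most 8 words between them?  The <p> is hypothetical data. :)
let $p := <p>Summer grasses, all that is left of warriors dreams</p>
return $p contains text "grasses" ftand "dreams" distance at most 8 words
```

The result is a single boolean. How words are counted depends on the processor's tokenizer, so small differences between implementations are to be expected.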

Session 6: Miscellaneous, Q/A, and Wrapup

  1. Introduction: XQuery origin, history, context
  2. Atomic values, nodes, sequences
  3. FLWOR expressions
  4. Functions, modules, collections
  5. Full-text search
  6. Miscellaneous, Q/A, and Wrapup
    • Deploying an XQuery server (BaseX, MarkLogic Server, eXist, ...)
    • XQuery and RESTful Web site design
    • Questions and Answers and/or Group exercises
    • Wrapup

Making collections

In BaseX, you can conveniently index groups of documents from the GUI if (but only if) they are in the same directory and have a common filename pattern. If that doesn't work, you have to add individual files manually using server commands.
In the case of MEP, after I had made an essentially empty database, the process was:
  • add as DE /home/mep/DE/docs/DE.bbe.xml
  • add as ER /home/mep/ER/docs/ER.bbe.xml
  • add as FC /home/mep/FC/docs/FC.bbe.xml
  • add as FD /home/mep/FD/docs/FD.bbe.xml
  • add as GM /home/mep/GM/docs/GM.bbe.xml
  • add as HL /home/mep/HL/docs/HL.bbe.xml
  • add as JH /home/mep/JH/docs/JH.bbe.xml
  • add as MG /home/mep/MG/docs/MG.bbe.xml
  • add as MS /home/mep/MS/docs/MS.bbe.xml
  • add as NG /home/mep/NG/docs/NG.bbe.xml
  • add as RC /home/mep/RC/docs/RC.bbe.xml
  • add as SA /home/mep/SA/docs/SA.bbe.xml
I think the initial creation would have been create db MEP.bigbooks.
To refer to a collection in BaseX, you say things like
for $doc at $pos in collection("MEP.bigbooks")
return <doc n="{$pos}" href="{ base-uri($doc) }"/>
or, for the currently open collection,
for $doc at $pos in collection()
return <doc n="{$pos}" href="{ base-uri($doc) }"/>
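Once a collection exists, it is often handy to get a quick summary of what is in it. The following is a sketch under the assumption that the MEP.bigbooks database described above has been created; the `elements` attribute is just an illustrative name.

```xquery
(: Sketch: one summary element per document in the collection,
   sorted by document URI, with a count of the elements in each. :)
for $doc in collection("MEP.bigbooks")
order by base-uri($doc)
return <doc href="{ base-uri($doc) }" elements="{ count($doc//*) }"/>
```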

Sample queries

Queries I made while playing around. They may or may not be helpful.
for $l in //l,
    $n in (1, 2, 3, 5, 7, 11, 13)
where $l/@n = $n
return $l
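The same selection can be written without the second `for`, because the general comparison `=` is true if any item on the left equals any item on the right. A self-contained sketch, with a generated `<lg>` standing in for a real document (the data is hypothetical):

```xquery
(: Build a small stanza of 14 numbered lines, then pick the lines
   whose n attribute matches any number in the list. :)
let $doc := <lg>{ for $i in 1 to 14 return <l n="{$i}">line {$i}</l> }</lg>
return $doc//l[@n = (1, 2, 3, 5, 7, 11, 13)]
```

With the double `for`, a line would be returned once for each matching number; with the predicate, each line is returned at most once, in document order.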


for $l in //l
let $words := tokenize($l,"\s")
return <l>{
  for $w in $words
  return <w>{$w}</w>
}</l>

for $l in //l
let $words := tokenize($l,"\s")
for $w in $words
order by string-length($w) descending
return <w>{$w}</w>

(: assumes $doc is already bound, e.g. by a declare variable in the prolog :)
for $l in $doc//l
let $words := tokenize($l,'\s+')[string-length(.) > 0]
for $w in $words
order by $w
return <w>{$w}</w>

for $l in //l
let $words := tokenize($l,"\s")
for $w in $words
order by string-length($w) descending, $w
return <w>{$w}</w>

declare variable $doc := 
  <lg>
    <l>
      Summer grasses --
    </l>
    <l>
      all that's left
    </l>
    <l>
      of warriors' dreams
    </l>
  </lg>;
for $l in $doc//l
let $words := tokenize($l,'\s+')[string-length(.) > 0]
for $w in $words
order by $w
return <w>{$w}</w>


let $words := for $l in //l return tokenize($l,"\s")
for $w in $words
order by string-length($w) descending, $w
return <w>{$w}</w>

let $words := for $l in //l return tokenize($l,"\s")
for $w in distinct-values($words)
let $n := count($words[. eq $w])
order by $n descending, string-length($w) descending, $w
return <w n="{$n}">{$w}</w>

(: What about Zipf's law?  What is the frequency
distribution in this text of words of various lengths? :)

(: Test Zipf's law.
   Zipf observes that the nth most frequent word 
   in a corpus tends to occur about 1/n times 
   as often as the most frequent word.
   Trivially true for n=1 of course.
:)

let $freqlist :=
  let $words := for $l in //l return tokenize($l,"\s")
  for $w in distinct-values($words)
  let $n := count($words[. eq $w])
  order by $n descending, string-length($w) descending, $w
  return <w n="{$n}">{$w}</w>
let $unit := $freqlist[1]/@n
let $ratios := 
    for $w at $rank in $freqlist
    let $p := $unit div $rank (: predicted frequency :)
    let $r := $w/@n div $p (: ratio of actual to predicted :)
    return <w rank="{$rank}" 
             n="{$w/@n}" 
             predicted="{$p}" 
             ratio="{$r}"
           >{string($w)}</w>
return $ratios
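To see what the Zipf query computes without needing a corpus, here is a cut-down sketch with a hand-made frequency list. The three words and their counts are invented for illustration; the arithmetic is the same as above (predicted frequency is the top count divided by the rank).

```xquery
(: Hypothetical counts, already sorted by descending frequency. :)
let $freqlist := (<w n="12">the</w>, <w n="7">of</w>, <w n="4">and</w>)
let $unit := number($freqlist[1]/@n)
for $w at $rank in $freqlist
let $predicted := $unit div $rank              (: Zipf's prediction :)
let $ratio := number($w/@n) div $predicted     (: actual / predicted :)
return <w rank="{$rank}" n="{$w/@n}"
          predicted="{$predicted}" ratio="{$ratio}">{ string($w) }</w>
```

A ratio close to 1 means the word behaves as Zipf predicts; the first word always has ratio 1 by construction.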


(: Make a Mendenhall distribution.
   Check what the longest word is, or 
   guess.
   Then count tokens of each length.
:)

  let $words := for $l in //l return tokenize($l,"\s")
  for $i in 1 to 20
  let $n := count($words[string-length(.) eq $i])
  return <frequency word-length="{$i}" count="{$n}" />

(: Make a Mendenhall distribution.
   Check what the longest word is, or 
   guess.
   Then count tokens of each length.
   For comparative purposes, normalize to count per 1000
:)

  let $words := for $l in //l return tokenize($l,"\s"),
      $tokens := count($words)
  for $i in 1 to 20
  let $n := count($words[string-length(.) eq $i]),
      $normalized := (1000 * $n) div $tokens  
  return (: 
      <frequency word-length="{$i}" count="{$n}" 
        norm="{ round-half-to-even($normalized, 1)}"/> :)
    round-half-to-even($normalized,2)

(: Make a Mendenhall distribution.
   Check what the longest word is, or 
   guess.
   Then count tokens of each length.
   For comparative purposes, normalize to count per 1000
:)
declare default element namespace "http://www.tei-c.org/ns/1.0";

  let $words := for $l in //l return tokenize($l,"\s"),
      $tokens := count($words)
  for $i in 1 to 20
  let $n := count($words[string-length(.) eq $i]),
      $normalized := (1000 * $n) div $tokens  
  return (: <frequency word-length="{$i}" count="{$n}" 
         norm="{ round-half-to-even($normalized, 1)}"/> :)
    round-half-to-even($normalized,2)
 
(: corrected form:  \s+ and normalize space first 
   to strip leading and trailing ws :)

  let $words := for $l in //l return tokenize(normalize-space($l),"\s+"),
      $tokens := count($words)
  for $i in 1 to 20
  let $n := count($words[string-length(.) eq $i]),
      $normalized := (1000 * $n) div $tokens  
  return (: <frequency word-length="{$i}" count="{$n}" 
            norm="{ round-half-to-even($normalized, 1)}"/> :)
    round-half-to-even($normalized,2)
 



(: Wilde gives us
39.25 132.71 198.13 216.82 143.93 95.33 67.29 46.73 14.95 16.82
 9.35 7.48 3.74 5.61 1.87 0 0 0 0 0
:)
(: Sonnets:
29.03 163.14 183.16 227.71 153.73 101.25 62.46 39.13 22.87 10.04
 3.82 1.77 0.8 0.74 0.11 0.11 0.06 0.06 0 
bibadorned:
208.2 144.53 179.47 197.13 98.82 64.61 47.02 27.86 16.62 8.42
 4.01 1.75 0.82 0.41 0.22 0.05 0.04 0.01 0.01 0
:)
(: Shakespeare as a whole, using bibadorned
208.2 144.53 179.47 197.13 98.82 64.61 47.02 27.86 16.62 8.42
 4.01 1.75 0.82 0.41 0.22 0.05 0.04 0.01 0.01 0
:)

An organizing principle?

In the first lecture of 6.001 (i.e. SICP) (on the Web and in the Internet Archive!), Hal Abelson says that whenever you learn a language, you need to ask three questions:
  • What are the primitive elements of the language?
  • What are the means of combination?
  • What are the means of abstraction?
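One way to read these three questions against XQuery (my mapping, offered as a sketch and not something Abelson says): atomic values and nodes are the primitives, sequences and path or FLWOR expressions are the means of combination, and user-defined functions are the means of abstraction. The function name `local:word-count` is made up for the example.

```xquery
(: Abstraction: give a name to "count the words in an element". :)
declare function local:word-count($e as element()) as xs:integer {
  count(tokenize(normalize-space($e), "\s+"))
};

(: Primitive: a node.  Combination: a FLWOR over a sequence of nodes. :)
let $lines := (<l>Summer grasses --</l>, <l>all that's left</l>)
for $l in $lines
return local:word-count($l)
```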

Modified XPath Grammar

Basic Expressions

E = E1 (, E1)*
E1 is a basic or ‘single’ expression. (Not necessarily a single or basic value: its value may be a sequence.)
E1 = Primary // primary expressions
| Filter // filter expressions
| Path // path expressions
Remember that primary expressions are filter expressions, and filter expressions are path expressions. So this is redundant.

Expressions involving types

E1 = E1 instance of SequenceType // boolean
E1 = E1 castable as SingleType // boolean
E1 = E1 treat as SequenceType // modifies static type
E1 = E1 cast as SingleType // modifies dynamic type
SequenceType = empty-sequence()
| ItemType [?*+]?
ItemType = KindTest | item() | QName
SingleType = QName ?? // a QName optionally followed by a literal ?
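A few small instances of these productions may help; the values are arbitrary, and the comments give the result I would expect from each expression.

```xquery
(
  5 instance of xs:integer,          (: true :)
  (1, 2) instance of xs:integer+,    (: true: one or more integers :)
  "abc" castable as xs:integer,      (: false: not a valid integer :)
  "42" cast as xs:integer,           (: the integer 42 :)
  (1, 2)[1] treat as xs:integer      (: 1; the value is unchanged :)
)
```

Note that `instance of` and `treat as` take a SequenceType (occurrence indicators allowed), while `castable as` and `cast as` take a SingleType.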

Set operations

E1 = E1 union E1
| E1 "|" E1 // set union; the quoted | is the literal operator
| E1 intersect E1
| E1 except E1
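The set operations work on sequences of nodes and compare by node identity, not by value. A small sketch with made-up data:

```xquery
let $lg := <lg><l n="1"/><l n="2"/><l n="3"/></lg>
let $a := $lg/l[@n <= 2],    (: lines 1 and 2 :)
    $b := $lg/l[@n >= 2]     (: lines 2 and 3 :)
return (
  count($a union $b),        (: 3 :)
  count($a intersect $b),    (: 1 :)
  count($a except $b)        (: 1 :)
)
```

All three operators return their results in document order, with duplicates removed.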

Expressions returning booleans

Quantified expressions
E1 = (some | every) VarRef in E1
(, VarRef in E1)*
satisfies E1
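For instance (the word list is arbitrary sample data):

```xquery
let $words := ("Summer", "grasses", "--")
return (
  (: true: "grasses" has 7 characters :)
  some $w in $words satisfies string-length($w) gt 6,
  (: false: "--" has only 2 characters :)
  every $w in $words satisfies string-length($w) gt 2
)
```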
Comparisons

XQuery grammar

To the above, add:

Expressions

E = FLWOR | Typeswitch

Validate expressions

E1 = validate Mode? { E }
Mode = lax | strict

Extension expressions

E1 = Pragma+ { E }
Pragma = (# S? QName (S CHAR*)? #)

New primaries

Primary = ordered { E }
| unordered { E }
| DirectConstructor
| ComputedConstructor
DirectConstructor = < QName Atts />
| < QName Atts > Content </ QName >
| <!-- CHAR* -->
| <? QName? CHAR* ?>
Atts = (S QName = StringLit)* // StringLit can contain { E }
Content = (DirectConstructor
| <![CDATA[ CHAR* ]]>
| PredefinedEntityRef
| CharRef
| {{
| }}
| { E })* // evaluates E
ComputedConstructor = document { E }
| element (QName | { E }) { E? }
| attribute (QName | { E }) { E? }
| text { E? }
| comment { E? }
| processing-instruction (NCName | { E }) { E? }
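To make the difference between the two constructor styles concrete, here is a sketch that builds the same kind of element both ways. The variable names are only for illustration.

```xquery
let $name := "w", $n := 3
return (
  (: direct constructor: the element name is fixed in the query text :)
  <w n="{$n}">grasses</w>,
  (: computed constructor: the element name is calculated at run time :)
  element { $name } { attribute n { $n }, "grasses" }
)
```

Both items are `w` elements with an `n` attribute and the same text content. The computed form is the one to reach for when the element or attribute name is not known until the query runs.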

Typeswitch expressions

Typeswitch = typeswitch ( E )
Case+
default VarRef?
return E1
Case = case (VarRef as)? SequenceType return E1
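A minimal sketch of a typeswitch in use, applied to a mixed sequence of made-up items:

```xquery
for $item in (<l>Summer grasses</l>, 42, "text")
return
  typeswitch ($item)
    (: the variable is bound to the item, typed as the case type :)
    case $e as element() return concat("element ", name($e))
    (: the variable is optional when the branch does not need it :)
    case xs:integer return "an integer"
    default $other return concat("something else: ", string($other))
```

The first matching case wins, so more specific types should come before more general ones.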