Higher Order Functions for XQuery

Proposal 9 November 2008

Author:
John Snelson

Abstract

This proposal provides higher order functions in XQuery. It is based on the presentation by Michael Kay, with detailed fleshed out and modifications according to experience implementing a prototype of this functionality.

Table of Contents

1 Function Items
    1.1 Literal Function Items
    1.2 Inline Functions
    1.3 Dynamic Function Invocation
2 Type Checking
    2.1 SequenceType Matching
    2.2 Function Item Coercion
    2.3 Static Typing
3 Additional Functions
    3.1 fn:function-name
    3.2 fn:function-arity
    3.3 fn:partial-apply
4 Additional Modifications to XQuery 1.0
    4.1 Modifications to Effective Boolean Value
    4.2 Modifications to Atomization
    4.3 Content of an Element Constructor
    4.4 Modifications to fn:string()
    4.5 Modifications to fn:deep-equal()
5 Use Cases
    5.1 Sorting
    5.2 Highest
    5.3 Subsequence Before First
    5.4 Analyze String
    5.5 Checking for Cycles
    5.6 Sum
    5.7 Grouping
    5.8 Sorted Map

Appendices

A Full Grammar for Higher Order Functions
B Glossary (Non-Normative)
C Revision Log


1 Function Items

The data model is extended with a function type as a direct subtype of item(). Subtypes of this function type are partitioned into discrete types, each representing function items that accept a particular number of arguments. Function types which have the same arity can be subtypes of each other based on their argument and return types. For example, the type function(item()) as xs:integer is subsumed by the type function(item()) as item(). [Definition: A function item is a value in the XQuery data model that represents a function. Function items are a subtype of item().] [Definition: Function items can be invoked, which is the act of calling the function that the function item represents.]

                                             item()
                                               |
                                 +-------------+-------------+
                                 |             |             |
                               node()  xs:anyAtomicType  function()
                                                             |
        +----------------------+-----------------------------+-----------------
        |                      |                             |
  function() as item()*  function(item()*) as item()*  function(item()*, item()*) as item()*  ...
                               |
              +----------------+------------+----------------------
              |                             |
        function(item()*) as node()*  function(item()*) as xs:anyAtomicType* ...
              |                             |
        function(item()*) as node()+       ...
              |
             ...
        

Function items are returned using a literal function item, or an inline function. Function items are invoked using a dynamic function invocation.

PrimaryExpr ::=
  ...
  | FunctionItemExpr
  | DynamicFunctionInvocation

FunctionItemExpr ::= LiteralFunctionItem | InlineFunction
        

A function item consists of the following information:

[Definition: The combination of the variable values needed to execute a function and a method of invoking that function is often called a closure.]

Note:

For the moment we will assume that a function is not allowed to contain a reference to a mutable variable, although there can be good use cases for being able to do so.

1.1 Literal Function Items

[Definition: A literal function item creates a function item that represents a named function.] [Definition: A named function is a function defined in the static context for the query. To uniquely identify a particular named function, both its name as a QName and its arity are required.]

If the QName in the literal function item has no namespace prefix, it is considered to be in the default function namespace.

If the expanded QName and arity in a literal function item do not match the name and arity of a function signature in the static context, a static error is raised [err:XPST0017].

The result of a literal function item is a single function item with the following properties:

  • An empty set of variable values.
  • The name specified in the literal function item.
  • The arity specified in the literal function item.
  • The function from the static context that matches the name and arity given.

Certain functions in the [Functions and Operators] specification have the apparent privilege of being polymorphic. These are denoted as accepting parameters of "numeric" type, or returning "numeric" type. Here "numeric" is a pseudonym for the four primitive numeric types xs:decimal, xs:integer, xs:float, and xs:double. The functions in question are:

  • fn:abs()

  • fn:ceiling()

  • fn:floor()

  • fn:round()

  • fn:round-half-to-even()

For the purposes of literal function items, these functions are regarded as taking arguments and producing results of type xs:anyAtomicType, with a type error raised at runtime if the argument value provided is not of the correct numeric type. This way of viewing these numeric functions is semantically fully backwards compatible.

Note:

An implementation that supports static typing can choose to model the types of these functions more accurately if desired.

1.2 Inline Functions

[Definition: An inline function creates a function item that represents an anonymous function defined directly in the inline function itself.] An inline function specifies the names and SequenceTypes of the parameters to the function, the SequenceType of the result, and the body of the function.

Note:

The use of the "function(" notation for an inline function means that an unprefixed function call to a function with a local name of "function" will no longer be legal. This can be worked around by always calling such a function using a prefix.

If a function parameter is declared using a name but no type, its default type is item()*. If the result type is omitted from a function declaration, its default result type is item()*.

The parameters of a function declaration are considered to be variables whose scope is the function body. It is a static error [err:XQST0039] for a function declaration to have more than one parameter with the same name.

The static context for the function body is inherited from the location of the inline function expression, with the exception of the static type of the focus (context item, context position, and context size) which is undefined.

The variables in scope for the function body include all variables representing the function parameters, as well as all variables that are in scope for the inline function expression.

Note:

Function parameter names can mask variables that would otherwise be in scope for the function body.

The result of an inline function is a single function item with the following properties:

  • The set of variable values for any variables referenced by the inline function's body.
  • An absent name.
  • The arity of the inline function.
  • The inline function itself.

1.3 Dynamic Function Invocation

[Definition: A dynamic function invocation invokes a function item, calling the function it represents.] A dynamic function invocation consists of an expression that returns the function item and a parenthesized list of zero or more arguments.

If the function item expression does not return a sequence consisting of a single function item with the same arity as the number of specified arguments, a type error is raised [err:TBD].

A dynamic function invocation is evaluated as follows:

  1. Argument values are calculated for the function item using rules 1 and 2 for evaluation of a function call as defined in Section 3.1.5 Function CallsXQ.
  2. The set of variable values from the function item's closure are added to the dynamic context with a scope of the invocation of the function.
  3. The function from the function item is evaluated using the argument values according to rules 3 - 5 for evaluation of a function call as defined in Section 3.1.5 Function CallsXQ.

Note:

These rules are derived from the rules for function calls defined in Section 3.1.5 Function CallsXQ except for the addition of a rule to deal with the use of the variable values from the closure.

2 Type Checking

There could be many ways to specify type checking against function items, and many of them have been discussed and explored during the process of writing this proposal. In the end I have decided to stick with the principal that SequenceType matching should only be defined against sequences of items, rather than against other SequenceTypes. SequenceTypes are only ever approximations of the real runtime type, so using a SequenceType subtype relationship would necessarily introduce inaccuracy or pessimism into the otherwise accurate and optimistic type system.

2.2 Function Item Coercion

In addition, wherever SequenceType matching occurs against a function ItemType an auxilliary transformation called function item coercion is subsequently applied. [Definition: Function item coercion wraps a function item in a new inline function with signature the same as the SequenceType being matched. This effectively delays the checking of the argument and return types until the function item is invoked.]

Function item coercion is always applied after SequenceType matching, and only applied if the SequenceType is a function ItemType other than function(). For each function item, $function, in the input sequence, function item coercion returns a new function item with the following properties:

  • An empty set of variable values.
  • The name of $function.
  • The arity of $function.
  • A new function with the signature of the function ItemType. The result of calling the new function is the result of invoking $function with the arguments that were specifed at the new function's invocation.

If the result of invoking the new function item would necessarily result in a type error, that error can be raised during function coercion. It is implementation dependent whether this happens or not.

These rules have the following consequences:

  • SequenceType matching of the function item's arguments and result are delayed until that function item is invoked.
  • The function conversion rules applied to the function item's arguments and result are defined by the SequenceType it has most recently been matched against. Additional function conversion rules could apply when the wrapped function item is invoked.
  • SequenceType matching continues to only be defined on sequences of items.
  • If an implementation has static type information about a function item, that can be used to type check the function item's argument and return types during static analysis.
For instance, consider the following query:

declare function local:filter($s as item()*, $p as function(xs:string) as xs:boolean) as item()*
{
  $s[$p(.)]
};

let $f := function($a) { starts-with($a, "E") }
return
  local:filter(("Ethel", "Enid", "Gertrude"), $f)
      

The function item $f has a static type of function(item()*) as item()*. When the local:filter() function is called, the following occurs to the function item:

  1. $f is matched against the SequenceType of function(xs:string) as xs:boolean. Since $f is a single function item with an arity of 1, SequenceType matching succeeds.
  2. Function item coercion wraps $f in a new inline function ($p) with the signature function(xs:string) as xs:boolean.
  3. When $p is invoked inside the predicate, function conversion rules change the context item argument into an xs:string.
  4. $f is invoked with the xs:string, which returns a xs:boolean.
  5. $p applies function conversion rules to the result sequence from $f, which already matches its declared return type of xs:boolean.
  6. The xs:boolean is returned as the result of $p.

Note:

Although the semantics of function item coercion are specified in terms of wrapping the function items, static typing should be able to reduce the number of places where this is actually necessary.

2.3 Static Typing

A supplied function item with a signature function(A1, A2) as R obviously matches an expected function type function(a1, a2) as r if A1 :> a1, A2 :> a2, and R <: r. When these conditions are met it is unnecessary to perform function item coercion.

When these conditions are not satisfied, static typing has three options from which I have no special preference:

  1. Raise a type error.
  2. Optimistically continue with function item coercion and dynamic typing of the function item.
  3. Attempt to specify better type inference rules to allow more expressions to pass static typing.

The SequenceType matching and function item coercion rules are defined in such a way that static typing implementations can continue to apply function conversion rules at compile time based on the static type of the function item.

3 Additional Functions

4 Additional Modifications to XQuery 1.0

A few concepts and functions in XQuery 1.0 needs to be augmented to handle the addition of function items to the XQuery data model.

5 Use Cases

5.2 Highest

Find the employees earning the highest salary.

Solution without higher order functions:

Difficulty:

  1. It's likely to require two passes over the data, computing salary twice for each employee (wasteful if the computation is expensive);
  2. There's no explicit representation of the concept of "highest", and no reuse of code if we want the highest bonus instead of the highest salary.
  3. An alternative solution is a recursive function but this is too daunting for most users.

Higher-order solution:

In this case we've coded the higher-order function to use the same two-pass algorithm as the original.

However, it is easy to re-implement the same generic function to use a more efficient recursive implementation if the need arises.

5.5 Checking for Cycles

Suppose you have an parts explosion in which parts have subparts; a part can have many subparts, and a subpart can be used in many parts, but a part cannot use itself. How do we check that there are no cycles in the data?

Let's suppose this is done using ID and IDREFS:

We can do this currently with a pair of functions like this:

declare function subparts($part as element(part)) as element(part)* {
  for $s in $part/subpart/@ref return id($s)
}

declare function transitiveReferences($part as element(part), 
                                      $subpart as element(part),
                                      $route as element(part)*) as
xs:boolean {
  let $direct := subparts($part)
  return 
     exists($direct intersect $subpart) or
           (some $C in ($direct except $route) 
                 satisfies transitiveReferences($C, $subpart, ($route, $C)))
}

//part[transitiveReferences(., ., ())]

The problem here is that the difficult part of the code (looking for cycles) is a general-purpose algorithm, but it is not reusable. We can make it reusable (so it can look for cycles in other kinds of data) by parameterizing it. The parameter is the function that follows a relationship. We can rewrite it as:

declare function transitiveReferences($from as node(), 
                                      $to as node(),
                                      $route as node()*
                                      $arc as function(node()) as node()*) as
xs:boolean {
  let $direct := $arc($from)
  return 
     exists($direct intersect $to) or
           (some $C in ($direct except $route) 
                 satisfies transitiveReferences($C, $to, ($route, $C), $arc))
} 
and now we can use it for a much wider variety of tasks. For example we use the same code to analyze stylesheets: we can determine whether an XSLT template is recursive by means of the call
declare function local:find-template($t as node()) as node()
{
  for $c in $t//xsl:call-template/@name 
  return $template//xsl:template[@name = $c]
};

transitiveReferences($template, $template, (), local:find-template#1)

5.7 Grouping

Inline functions can also be useful for returning "sequences of sequences" by wrapping the inner sequences in a function item. With the flexibility of closures this can be extremely powerful - for example this function performs simple grouping, returning a sequence of function items to represent the groups:

declare function local:group(
  $seq as item()*,
  $key as function(item()) as xs:anyAtomicType
) as function() as item()**
{
  for $a in distinct-values(for $a in $seq return $key($a))
  return function() { $seq[$key(.) = $a] }
};

for $a in local:group((1, 6, 1, 2, 3, 4, 3, 2, 3), function($a) { $a mod 2 })
return <group>{ $a() }</group>

5.8 Sorted Map

The following use case is an implementation of a sorted map structure. Both map insertion and map searching use an O(log n) binary search algorithm (assuming an XQuery implementation that uses constant time random access XDM sequences).

It is interesting to note how these functions use 0 arity function items to represent the sorted map as a "sequence of sequences". It is also interesting that through the use of function item arguments, the general purpose map processing algorithm map:process() can be used to implement map:contains(), map:get() and map:put() - in this way the binary search algorithm only has to be implemented once.

declare namespace map = "http://snelson.org.uk/functions/map";

declare function map:pair($key as item(), $value as item()*)
  as function() as item()+
{
  function() { $key, $value }
};

declare function map:key($pair as function() as item()+) as item()
{
  $pair()[1]
};

declare function map:value($pair as function() as item()+) as item()*
{
  subsequence($pair(), 2)
};

declare function map:process(
  $map as (function() as item()+)*,
  $key as item(),
  $found as function(function() as item()+) as item()*,
  $notfound as item()*,
  $unused as function((function() as item()+)*) as item()*
) as item()*
{
  if(empty($map)) then $notfound
  else

  let $length := count($map)
  let $middle := $length idiv 2 + 1
  let $pair := $map[$middle]
  let $pair_key := $pair()[1]
  return
    if($pair_key eq $key) then (
      $unused(subsequence($map, 1, $middle - 1)),
      $found($pair),
      $unused(subsequence($map, $middle + 1))
    )
    else if($pair_key gt $key) then (
      map:process(subsequence($map, 1, $middle - 1), $key,
        $found, $notfound, $unused),
      $unused(subsequence($map, $middle))
    )
    else (
      $unused(subsequence($map, 1, $middle)),
      map:process(subsequence($map, $middle + 1), $key,
        $found, $notfound, $unused)
    )
};

declare function map:contains($map as (function() as item()+)*, $key as item())
  as xs:boolean
{
  map:process($map, $key, function($a) { true() }, false(),
    function($a) { () })
};

declare function map:get($map as (function() as item()+)*, $key as item())
  as item()*
{
  map:process($map, $key, map:value#1, (), function($a) { () })
};

declare function map:put(
  $map as (function() as item()+)*,
  $key as item(),
  $value as item()*
) as (function() as item()+)+
{
  let $pair := map:pair($key, $value)
  return
    map:process($map, $key, function($a) { $pair }, $pair,
      function($a) { $a })
};


let $map := map:put(map:put(map:put(map:put(map:put(map:put((),
  "a", "aardvark"),
  "z", "zebra"),
  "e", ("elephant", "eagle")),
  "o", "osterich"),
  "t", "terrapin"),
  "a", "antelope")
return (
  map:get($map, "o"),

  for $m in $map
  return concat("key: ", map:key($m), ", value: (",
    string-join(map:value($m), ", "), ")")
)

A Full Grammar for Higher Order Functions

B Glossary (Non-Normative)

closure

The combination of the variable values needed to execute a function and a method of invoking that function is often called a closure.

dynamic function invocation

A dynamic function invocation invokes a function item, calling the function it represents.

function item

A function item is a value in the XQuery data model that represents a function. Function items are a subtype of item().

function item coercion

Function item coercion wraps a function item in a new inline function with signature the same as the SequenceType being matched. This effectively delays the checking of the argument and return types until the function item is invoked.

inline function

An inline function creates a function item that represents an anonymous function defined directly in the inline function itself.

invoke

Function items can be invoked, which is the act of calling the function that the function item represents.

literal function item

A literal function item creates a function item that represents a named function.

named function

A named function is a function defined in the static context for the query. To uniquely identify a particular named function, both its name as a QName and its arity are required.

C Revision Log

  1. Removed the use of the "=>" operator for function invocation.

  2. Removed the ability to use the focus from the calling scope inside an inline function.

  3. Clarified how XQuery's built in "polymorphic" functions are referred to as literal function items.