[Galax]

Compiler
 Overview
 Parsing
 Normalization
 Rewriting
 Compilation
 Optimization
 Specialized Optimization
 Physical Layer

Specialized Optimizations
 Document Projection
 Tree Patterns

Tech. Center
 Publications
 Optimization
 Static Typing

The Galax Compiler

The purpose of this page is to document our progress in developing the Galax XQuery compiler. It includes an overview of its internal architecture, intermediate compilation stages, and some material on XQuery optimization.

Some of the main features of the compiler are:

  • A fully documented compilation scheme. Most of the compilation is formally specified as a means to provide specific correctness guarantees.
  • State of the art optimization layers, including type-based rewriting, unused code elimination, join optimization, query decorrelation techniques.
  • An hybrid physical later, that gives the ability to process XML streams as well as XML trees in memory.
  • Experimental optimization techniques for XQuery updates, XQuery scripting, and distributed XQuery.

The compiler is fully stable, and passes the majority of the W3C XQuery test suite (version 1.0.2). Some of the optimization techniques described here are at various level of maturity. As a result, some of them are disabled by default in the main Galax public release.


Overview and Architecture

The Galax compiler internally uses two intermediate languages:

  • The XQuery Core is defined as part of the set of XQuery specifications and used during the earlier compilation stages. It exposes the details of XQuery's semantics, in a way suitable for applying cleanup rewrites, and to prepare for heavier-duty optimization.
  • The Galax Algebra is an extension of classic nested-relational algebras with suitable XML operators, and is used during the later compilation stages.

The following figure shows the compiler architecture.

[Global Architecture]

The rest of this page provides details about each compilation phase, as well as pointers to relevant documents.


Parsing

Galax follows a fairly traditional parsing, with XQuery grammar's implemented using Yacc, along with special purpose lexers in Lex. Some aspects of parsing in XQuery are known to be challenging, notably due to the presence of several lexical states. The output of the parser is a classic abstract syntax tree (AST), whose node correspond to the different kinds of expressions allowed by the XQuery grammar.

The full XQuery grammar in BNF can be found in the XQuery 1.0 specification. More details about issues related to XQuery lexing can be found in the tokenizer note both published by the W3C.


Normalization

Normalization is applied to the AST produced by the parser to produce a new AST with the following properties.

  • It only uses a small subset of XQuery known as the XQuery Core. Reliance on a smaller subset facilitates the development of subsequent rewritings, and ensure their most general application.
  • Every operation is explicit, making sure a full account for XQuery's semantics is available (i.e., with no implicit operations, or syntactic sugar).

Normalization is fully specified as part of the XQuery Formal Semantics, which is part of the set of published W3C XQuery documents.

Galax implements normalization exactly as specified, but to the following important exception that iterators (FLWOR expressions, quantified expressions), are not broken down to individual for, let, conditional expressions, for each clause. This departure from the official specification is key to prepare for algebraic compilation.


Rewritings

[Architecture -- Query Rewriting]

Before applying optimization techniques, a number of actual transformation on the query are performed. Some of those query transformations have a direct impact on performances, or enable the consistent use of optimizations during later compilation stages. Examples of such transformations include: removing implicit casting of nodes to values or removing sorting by document order.


Compilation

Once the AST has been cleaned up, through rewriting, we compile the corresponding query into an algebraic form. The algebra being used in based on a traditional NRA (Nested-Relational Algebra) extended to support XML operations.


Optimization

[Architecture -- Specific Operations]

Once in algebraic form, the query plan is rewritten to obtain a more efficient evaluation strategy. This notably includes key join and query decorrelation rewritings.

Complete rules for compiling the XQuery core into that algebra are descibed in the following paper.


Specialized XML Optimizations

[Architecture -- Specific Operations]

Galax includes specific algorithms for certain classes of operations which are expensive or used very often during XQuery processing. Note that efficient support for those operations typically assumes specific knowledge about the physical representation.

XML Projection

One of the bottleneck of query processing for main-memory XQuery implementations is due to the size of tree representations for XML document (e.g., DOM or the XQuery Data Model). XML projection is physical operation that can be used to remove uncessary node in the XML data model based on the paths used in a given query. Document projection takes an XML stream as input and loads a projected document according to a set of input path expressions. Those paths are inferred from the query using a static analysis algorithm described in details in the following papers.

Tree Patterns

Another important operation which is widely used in XQuery is navigation, in the form of XPath expressions. Specific algorithms have been developed for certain fragment of XPath known as Tree Patterns. In some cases those algorithms have been proven to be optimal. Galax includes specific rewritings to detect such tree patterns when possible, hence enabling the use of such efficient algorithms.

How tree patterns operators are introduced in query plans, and limited optimality of the corresponding detections, are described in the following papers.


Physical representation, storage and indices

[Architecture -- Physical Layer]

XML is a very versatile markup language, suitable for many kinds of applications in many kinds of environment. Galax has the ambition to be a very versatile XQuery implementation that will work on a variety of physical XML representations. We are especially interested in experimenting with Galax in the following environments.

  • XML Files. Many applications just deal with ordinary XML files. Files are easy to write and maintaing using many ordinary XML tools and work well for light-weight applications.
  • XML Streams. In the context of distributed applications, such as for information integration or Web services, XML documents are accessed as streams from the network.
  • Native XML repository. Data-intensive applications dealing with large amounts of XML data often require the use of a storage manager. The use of an XML storage manager can provide important benefit to your application, including scalability, performances, concurrency control, crash recovery, and high-availability.