The Galax Compiler
The purpose of this page is to document our progress in developing the Galax XQuery compiler. It includes an overview of its internal architecture, intermediate compilation stages, and some material on XQuery optimization.
Some of the main features of the compiler are:
The compiler is fully stable, and passes the majority of the W3C XQuery test suite (version 1.0.2). Some of the optimization techniques described here are at various level of maturity. As a result, some of them are disabled by default in the main Galax public release.
The Galax compiler internally uses two intermediate languages:
The following figure shows the compiler architecture.
The rest of this page provides details about each compilation phase, as well as pointers to relevant documents.
Galax follows a fairly traditional parsing, with XQuery grammar's implemented using Yacc, along with special purpose lexers in Lex. Some aspects of parsing in XQuery are known to be challenging, notably due to the presence of several lexical states. The output of the parser is a classic abstract syntax tree (AST), whose node correspond to the different kinds of expressions allowed by the XQuery grammar.
Normalization is applied to the AST produced by the parser to produce a new AST with the following properties.
Normalization is fully specified as part of the XQuery Formal Semantics, which is part of the set of published W3C XQuery documents.
Galax implements normalization exactly as specified, but to the following important exception that iterators (FLWOR expressions, quantified expressions), are not broken down to individual for, let, conditional expressions, for each clause. This departure from the official specification is key to prepare for algebraic compilation.
Before applying optimization techniques, a number of actual transformation on the query are performed. Some of those query transformations have a direct impact on performances, or enable the consistent use of optimizations during later compilation stages. Examples of such transformations include: removing implicit casting of nodes to values or removing sorting by document order.
Once the AST has been cleaned up, through rewriting, we compile the corresponding query into an algebraic form. The algebra being used in based on a traditional NRA (Nested-Relational Algebra) extended to support XML operations.
Once in algebraic form, the query plan is rewritten to obtain a more efficient evaluation strategy. This notably includes key join and query decorrelation rewritings.
Complete rules for compiling the XQuery core into that algebra are descibed in the following paper.
Galax includes specific algorithms for certain classes of operations which are expensive or used very often during XQuery processing. Note that efficient support for those operations typically assumes specific knowledge about the physical representation.
One of the bottleneck of query processing for main-memory XQuery implementations is due to the size of tree representations for XML document (e.g., DOM or the XQuery Data Model). XML projection is physical operation that can be used to remove uncessary node in the XML data model based on the paths used in a given query. Document projection takes an XML stream as input and loads a projected document according to a set of input path expressions. Those paths are inferred from the query using a static analysis algorithm described in details in the following papers.
Another important operation which is widely used in XQuery is
navigation, in the form of XPath expressions. Specific algorithms
have been developed for certain fragment of XPath known
How tree patterns operators are introduced in query plans, and limited optimality of the corresponding detections, are described in the following papers.
XML is a very versatile markup language, suitable for many kinds of applications in many kinds of environment. Galax has the ambition to be a very versatile XQuery implementation that will work on a variety of physical XML representations. We are especially interested in experimenting with Galax in the following environments.