March 21, 2008

Fast XML Pull Parser 0.3 released

I've been doing quite a bit of work on Faxpp recently. My enthusiasm had kind of ground to a halt for a while after I realised the full complexity of implementing entities, but then I decided I just needed to knuckle down and get it finished. The fruit of my labours can now be downloaded from Sourceforge.

I think I've got a robust framework for resolving and parsing internal and external entities - and I've learnt things about XML that I'm not sure many people in the world know:

  • Parameter entities ("%entity;") can appear almost anywhere in an external subset (DTD), but their replacement value is substituted with an extra leading and trailing space if the reference isn't in a literal value.
  • Character references in entity values are expanded when the entity declaration is parsed, but general entity references are not resolved until the entity value is substituted for a reference.
  • An XML 1.0 DTD referenced by an XML 1.1 document will be parsed as though it were XML 1.1.
  • At least two thirds of the code in an XML parser is there to support functionality that 90% of XML documents never use.

I can also lay claim to actually understanding what notations are, although I don't think I'll ever find a use for them.

I'm calling this release a beta, because I know there's still a bit of work left to be done. Top of the list is implementing default attribute values, then maybe I'll get to work on shrinking the parser - since the DTD parsing code has made it much larger than I want it to be.

Posted by john at 12:18 AM | Comments (4)

March 18, 2008

XQilla in the News

Oracle officially announced the XQilla license change today. It feels like this has been a long time coming - I was involved in pushing for the original Pathan project to be open sourced in 2003 when I worked at Decisionsoft. Later when I worked for Sleepycat, I was involved in pushing for a liberally licensed release of the XQuery implementation and improvements to Pathan which became XQilla some 3 years later.

It's great to see something that I've worked on for the last 7 years start to get the exposure I always thought it deserved. XQuery has huge potential to change the way that people use their data, and it's close relationship to the web means now might be the right time, and XQilla might be in the right place.

Thanks has to go to Mike Olson who put in the lion's share of the work needed to make this happen. Hopefully his efforts will make it easier for even more Oracle code to reach it's potential by being released as open source.

Posted by john at 11:19 PM | Comments (0)

March 12, 2008

Google Summer of Code 2008

I've just finished the application process for XQilla to be a part of Google Summer of Code 2008. I'm always coming up with way too many projects to do, and there's never enough time to get around to all of them - the ideas list we've put together has some of the most interesting and self contained of those projects. Take a look if you fancy learning more about XQuery, open source or XQilla.

Posted by john at 01:39 PM | Comments (0)

February 14, 2008

Parsing JSON into XQuery

Doug Crockford stirred things up at XML 2007 with his comments on JSON and XML. He was wrong - mainly because he's only looking at structured data transfer, rather than anything else that XML is very good at like documents and semi-structured data. However he got me thinking about how easy it would be to process JSON in XQuery, which as Data Direct has shown is very good at manipulating all sorts of data formats.

I headed over to json.org to take a look at the work that had already been done on converting JSON to and from XML. Then I googled, and read around. I even had an email conversation with Dimitre Novatchev of FXSL fame, who's written a JSON parser entirely in XSLT!

All of the designs I looked at had at least one of the following problems:

  1. They could only convert a subset of JSON - things like map keys that weren't valid NCNames would cause problems.
  2. They didn't specify a 1-1 mapping - in other cases map keys were munged to be valid NCNames using a function with no inverse. This would mean that I wouldn't be able to convert back to JSON from it's XML representation.
  3. They lost the JSON type information, like whether a value was null or an empty string - which is also a way in which the mapping is not 1-1.

So given nothing fitted I chose to come up with my own mapping from JSON to XML - which is, after all, the JSON way. So here it is, in all it's simplicity:

JSONtype(JSON)toXML(JSON)
JSON N/A <json type="type(JSON)">toXML(JSON)</json>
{ "key1": value1, "key2": value2 }
object
<pair name="key1" type="type(value1)">toXML(value1)</pair>
<pair name="key2" type="type(value2)">toXML(value2)</pair>
  
[ value1, value2 ]
array
<item type="type(value1)">toXML(value1)</item>
<item type="type(value2)">toXML(value2)</item>
  
"value"
string
value
number
number
number
true / false
boolean
true / false
null
null empty

The table defines two abstract functions "type" and "toXML", which are recursively defined on the structure of the input JSON. The extension functions to parse and serialize JSON are called xqilla:parse-json() and xqilla:serialize-json(), and will available in the next release of XQilla.

xqilla:parse-json($xml as xs:string?) as element()?
xqilla:serialize-json($json-xml as element()?) as xs:string?

The translation produces a simple generic XML document - as an example, here's a query to parse a sample of JSON (swiped from wikipedia):

xqilla:parse-json('{
     "firstName": "John",
     "lastName": "Smith",
     "address": {
         "streetAddress": "21 2nd Street",
         "city": "New York",
         "state": "NY",
         "postalCode": 10021
     },
     "phoneNumbers": [
         "212 732-1234",
         "646 123-4567"
     ]
 }')

And here's its translation, the result of the query:

<json type='object'>
  <pair name='firstName' type='string'>John</pair>
  <pair name='lastName' type='string'>Smith</pair>
  <pair name='address' type='object'>
    <pair name='streetAddress' type='string'>21 2nd Street</pair>
    <pair name='city' type='string'>New York</pair>
    <pair name='state' type='string'>NY</pair>
    <pair name='postalCode' type='number'>10021</pair>
  </pair>
  <pair name='phoneNumbers' type='array'>
    <item type='string'>212 732-1234</item>
    <item type='string'>646 123-4567</item>
  </pair>
</json>

One of the nice things about the XML format for JSON is that it's easy to navigate. If I want to get the city from the JSON object above, the XQuery for it would be:

xqilla:parse-json("...")/pair[@name="address"]/pair[@name="city"]

Or if I want to get both the phone numbers I could use:

xqilla:parse-json("...")/pair[@name="phoneNumbers"]/item

Now I probably have to do a follow up post about all the cool things you can do in XQuery when you can parse JSON...

Posted by john at 05:21 PM | Comments (6)

February 12, 2008

DB XML with Python

At Oxford Geek Night 5 last week I bumped into James Gardner, apparently now known as a Pylons Guru, and a friend of mine from university.

I encouraged him to take a look at Berkeley DB XML's Python bindings, which he has done. His initial experience is written up in his latest blog post, which is worth reading if you're looking to get DB XML working from Python.

I said the same thing to James as I did to Greg Pollack when I met him at XML 2007 - why don't agile web frameworks like Pylons and Ruby on Rails support XML databases as a back end, rather than a SQL database? Surely XML databases are a much better fit for most data in web frameworks?

Posted by john at 04:23 PM | Comments (0)

January 25, 2008

Latency vs Throughput

I was helping a Berkeley DB XML user recently who complained that his query took as long as ~55s to run. It turned out he was trying to run 40 concurrent queries, so I tried out the query on his data set using our concurrency testing framework. I was confused when it reported to me that I was getting ~2 ops/s, which is a completely different ballpark.

It took me a while, but eventually I figured it out: When the user was saying that his query took ~55s what he meant was that a thread's view of how long a query takes can be as much as 55s. When I talked about getting ~2 ops/s, I was looking at the system as a whole. The problem was that he was talking about latency, and I was talking about throughput.

Typically the way that a server gets smaller latency is to implement a thread pool and work queue. Since DB XML is an embedded library and does not spawn it's own threads, it is down to the application using DB XML to manage it's threads well.

So I created a Java test program that used a thread pool to manage the queries being run through DB XML (in Java it's really easy). I created a test function that could run the query with arguments of how many threads to create, how many queries to run, and how many queries to start per second.

I ran a number of tests - in all of the test runs I kept the number of queries run as constant, so that exactly the same amount of useful work was done in all cases. I measured "workload" time, which is the time from when the query is requested until it is completed - this is a measure of latency. I also measured the total time taken for all the queries, and the number of operation per second executed - this is a measure of throughput. Here are the actual figures I got:

Queries runQueries per secondThreadsLatency (avg query time in seconds)Throughput (queries per second)
80no delay4032.151.70
80no delay1023.341.96
80no delay422.001.89
8010418.511.88
805413.871.93
80240.901.97

The first thing I noticed is that as the number of threads spawned decreased, latency decreases and throughput increases - this peaked around 4-5 threads on my system. Throughput increased because each thread has a processing overhead associated with it, so less threads means less overhead. Average latency decreased because instead of 5 queries each taking 50s (as an example), the first query takes 10s, and the last takes 50s:

Average time for 5 queries in parallel: (50s + 50s + 50s + 50s + 50s) / 5 = 50s
Average time for 5 queries in series: (10s + 20s + 30s + 40s + 50s) / 5 = 30s

Surprise number one - adding more threads can actually increase your latency!

Next I tried to add a little bit of realism to the simulation - it is very rare that 80 queries need to be executed at exactly the same time, so I added a little bit of a delay between starting each query. As the queries per second decreased throughput was more or less constant, but latency decreased dramatically. This is simple to explain - the queries spent less time in the work queue, when they aren't actually being executed:

Average time for 5 queries executed at the same time: (10s + 20s + 30s + 40s + 50s) / 5 = 30s
Average time for 5 queries executed with a 10s delay between each: (10s + 10s + 10s + 10s + 10s) / 5 = 10s

So surprise number two - a more realistic workload can result in smaller latency!

Posted by john at 10:42 AM | Comments (2)

January 17, 2008

Why DB XML Doesn't Do Path Indexes

Every so often we have a Berkeley DB XML user ask us if we implement path indexes. Or maybe they assume that we do implement them and are confused about why their index isn't working. It turns out that a lot of these users have come to us after using eXist, which for a long time has only been able to specify path indexes.

A path index is specified by describing the path to the nodes that you want to index - often with a subset of XPath. So you might want to index all "firstname" elements, but only if they are children of an "author" element - which you might write like this:

//author/firstname

The good thing about path indexes is that they can be very small targeted indexes, which only record exactly the information that you want.

The trouble with path indexes, I've always explained, is that it's really hard to work out when you can use one in a query. You miss out on lots of potential optimisation opportunities because query analysis can't prove that an index is applicable. It's not surprising really - path indexes are a subset of materialised views which is known to be a very hard problem, in the general case.

So it's not surprising to see that recent work on eXist has added a new type of index which works in exactly the same way as DB XML's do. The author of the blog post even describes the problems that I have been telling our users about, and recommends that path indexes be used sparingly if at all.

It's good to see other people coming to the same conclusion that I came to. DB XML may implement path indexes some day, but they won't be anywhere near as useful as some users think they should be.

Posted by john at 04:55 PM | Comments (0)

January 11, 2007

XQuery makes Recommendation

The XML querying language that I'm involved in implementing in Berkeley DB XML and XQilla, XQuery, is now a W3C Recommendation. Congratulations to everyone who has been involved in it's creation over the last 6 years.

Out of interest, why not compare the current specification with the very first working draft, from February 15th 2001. It's come a long way...

Posted by john at 03:57 PM | Comments (0)

December 13, 2006

Elliotte Rusty Harold thinks he might take a look at BDB XML

DB XML has been posted to Cafe con Leche.

Elliotte, please be my guest and try out the project. One of the great things about being a developer on the Berkeley DB projects is being able to work closely with our users, so it would be great to find out what you think.

Posted by john at 03:59 PM | Comments (0)

December 12, 2006

Berkeley DB XML 2.3 Released

The program I'm working on at Oracle has just had another release.

The new version includes a SAX and StAX like (push and pull) streaming XML events API, and significantly faster performance with node indexing.

Along with this release, we've also released a new library called XQilla. This is a combination of our previous XQuery library and Decisionsoft's Pathan library, which removes duplicate code and significantly moves the project forward.

Both projects implement the latest W3C Proposed Recommendation for XQuery and XPath 2.0.

Posted by john at 10:32 AM | Comments (0)