March 21, 2008

Fast XML Pull Parser 0.3 released

I've been doing quite a bit of work on Faxpp recently. My enthusiasm had ground to a halt for a while after I realised the full complexity of implementing entities, but then I decided I just needed to knuckle down and get it finished. The fruit of my labours can now be downloaded from SourceForge.

I think I've got a robust framework for resolving and parsing internal and external entities - and I've learnt things about XML that I'm not sure many people in the world know:

  • Parameter entity references ("%entity;") can appear almost anywhere in an external subset (DTD), but their replacement text is substituted with one extra leading and one extra trailing space if the reference isn't inside a literal entity value.
  • Character references in entity values are expanded when the entity declaration is parsed, but general entity references are not resolved until the entity value is substituted for a reference.
  • An XML 1.0 DTD referenced by an XML 1.1 document will be parsed as though it were XML 1.1.
  • At least two thirds of the code in an XML parser is there to support functionality that 90% of XML documents never use.
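
The first two points can be illustrated with a toy Python sketch. This is not Faxpp code and not a real parser: the function names (`declare_entity`, `substitute`, `include_parameter_entity`) are made up for the illustration, the name pattern is simplified, and the sketch does not re-parse replacement text for markup or further character references the way a real parser must.

```python
import re

# Hypothetical helpers for illustration only; none of these are Faxpp APIs.
CHAR_REF = re.compile(r"&#(?:x([0-9A-Fa-f]+)|([0-9]+));")
GENERAL_REF = re.compile(r"&([A-Za-z_][\w.\-]*);")  # simplified XML Name


def declare_entity(literal):
    """Compute the stored value when the entity declaration is parsed.

    Character references are expanded immediately; general entity
    references such as "&inner;" are left in place untouched.
    """
    def expand(match):
        hex_digits, dec_digits = match.group(1), match.group(2)
        return chr(int(hex_digits, 16) if hex_digits else int(dec_digits))
    return CHAR_REF.sub(expand, literal)


def substitute(text, entities):
    """Resolve general entity references at the point of use (recursively)."""
    def expand(match):
        return substitute(entities[match.group(1)], entities)
    return GENERAL_REF.sub(expand, text)


def include_parameter_entity(value, in_literal):
    """Pad a parameter entity's replacement text with one space on each
    side, unless the reference occurs inside a literal entity value."""
    return value if in_literal else " " + value + " "


if __name__ == "__main__":
    entities = {}
    # "outer" refers to "inner" before "inner" has been declared. That is
    # fine here, because the reference is only resolved when "outer" is used.
    entities["outer"] = declare_entity("&#65; and &inner;")
    entities["inner"] = declare_entity("B")
    print(repr(entities["outer"]))              # 'A and &inner;'
    print(repr(substitute("&outer;", entities)))  # 'A and B'
    print(repr(include_parameter_entity("(a|b)", in_literal=False)))
    print(repr(include_parameter_entity("(a|b)", in_literal=True)))
```

The stored value of `outer` already contains the literal "A" but still contains the unresolved text "&inner;", which is the asymmetry described in the second bullet.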

I can also lay claim to actually understanding what notations are, although I don't think I'll ever find a use for them.

I'm calling this release a beta, because I know there's still a bit of work left to be done. Top of the list is implementing default attribute values, then maybe I'll get to work on shrinking the parser - since the DTD parsing code has made it much larger than I want it to be.

Posted by john at March 21, 2008 12:18 AM

Comments

Hello John,
how can I use this parser with XQilla?

Is it then possible to give xqilla.parse() its input from a buffer instead of going through the Xerces document cache?

How can I use Faxpp with XQilla, and is it usable now (is there a build switch)? I have performance problems with xqilla.parsefromuri() when Xerces is used. It's extremely slow.

I read that XQilla 2.0 has optional support:
* Added optional support for using the FAXPP XML parser instead of the Xerces-C parser.

Posted by: Jan Suchy at March 28, 2008 03:18 PM

Hi Jan,

I think that the XQilla support for Faxpp isn't compiled in at the moment, and it hasn't been updated for the recent changes to Faxpp.

Faxpp is certainly a whole lot faster than Xerces-C for parsing, so I'll try to update XQilla's support for it soon.

John

Posted by: John Snelson at March 28, 2008 11:51 PM

I wonder how it compares to a SAX parser like Expat performance-wise, especially in the non-zero-terminated fragments mode.

Posted by: boris at April 24, 2008 01:31 PM

Hi Boris,

My ad-hoc benchmarks show Faxpp to be faster than Expat when the input encoding is the same as the output encoding, but slower when the encoding has to change. I haven't done much optimization on the latter case, so I expect that can be improved.

John

Posted by: John Snelson at April 24, 2008 04:21 PM