by Rick Jelliffe

Here is XML's grammar expressed as an automaton in XML syntax. It is based on Tim Bray's Lark grammar, which Tim has recently released.

I've XML-ified it to give a headstart to anyone interested in making their own XML processor: you could write an XSLT script to generate code for example. The XML version is based on a finite state machine, but with a couple of extra attributes that act on stacks. There are various Lark-specific actions included in the transitions: these provide a good guide for some of the processing required for a complete XML processor. However, of course you might prefer to annotate it with your own actions.

A big thanks to Tim for the original productions!


Martin Bryan
2006-04-26 00:20:05

What do users gain by having an XML representation of the states? Are there any available tools that can take advantage of the XML representation?

Rick Jelliffe
2006-05-05 23:13:07
You could use it with XSLT to generate code, for example. Or have it in a DOM and build a FSM/PDA visitor. You could even create an XSLT stylesheet to generate ANTLR code or even Bison/Lex code, I suppose!

In the case of Tim's original format, there is no off-the-shelf tool either, so the XML format improves things a little at the syntax level only.

Modern Pentiums have deep pipelines which make it very difficult to know whether one algorithm will necessarily perform better than another (of the same order.)

For example, is it better for a table-based FSM to have byte tables and a conditional jump based on their value or have jump tables (where the table contains the branch address)? The byte-based takes a quarter the size, but cache lines decrease the advantage to some extent, and there is an test or branch. And different processors act differently. So empirical testing is required to choose from several different candidates, and to make generating alternative parsers feasible we need to have a headstart like the XML-in-XML, I think. Now, I am not embarking on this project...but if I were then XML-in-XML or equivalent would be necessary to cut down the workload. (Also, it makes program logic clearer, as little languages tend to do.)