Where is XML Going?

by Kurt Cagle

I did my prognostication schtick at the beginning of the year, but given the resuscitation of this particular blog, I thought I would go back and look at things in a little more depth. I'm seeing a number of interesting, and in some cases disturbing, trends unfolding right now, things that will likely have a fairly significant impact upon the XML development community.


18 Comments

Michael Champion
2007-02-28 19:59:05
"I do not think that JSON is going to "replace" XML; what I do see though is perhaps the dawning realization that the XML Infoset does not in fact have to be represented in angle-bracket notation". I very strongly agree with that. 'XML' will come to mean the Infoset (or the XQuery data model, or tree views and XPath-like axes over object graphs) more than the bits on the wire format. That liberates XML tools to support JSON, various binary XML formats, HTML tag soup, etc. without insisting that everyone play by the XML syntax rules.
Kurt Cagle
2007-02-28 20:28:21
Michael,


I'm always happy when we manage to find points of commonality :-)


I think for me the realization about JSON occurred when I heard people talking about creating JSON schemas and JSON transformations, and when I watched the fairly convoluted discussions going on over at xml-dev about whether JSON was in fact "superior to" or "inferior to" XML. I also remember, a few years back, working with a language called Curl that was intended to provide an alternative client representation to XML + JavaScript, and the irony didn't escape me that the Curl notation was, in the main, XML represented as Scheme or Lisp.


I hadn't thought of the XQuery data model as another XML infoset, but of course you're right, it is. Makes me think that an XSLT2 transform could take in JSON via the unparsed-text() function (or have it passed in as a parameter), then run it through templated regexes to generate the appropriate XML to feed into a pipeline ... hmmm.
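Something like the following very rough sketch captures the idea - shown here in XQuery rather than XSLT 2.0 (the stylesheet version would use unparsed-text() and xsl:analyze-string in place of the regex functions), with the JSON passed in as a parameter. The flat-object restriction, the function name and the element names are purely illustrative:

(: Sketch only: assumes a flat JSON object of quoted string values,
   e.g. {"a":"1","b":"2"}, whose keys happen to be legal element names. :)
declare variable $json as xs:string external;
declare function local:json-to-xml($s as xs:string) as element(object) {
  <object>{
    for $pair in tokenize(replace($s, '^\s*\{|\}\s*$', ''), ',')
    let $key   := replace($pair, '^\s*"([^"]+)"\s*:.*$', '$1')
    let $value := replace($pair, '^[^:]+:\s*"(.*)"\s*$', '$1')
    return element { $key } { $value }
  }</object>
};
local:json-to-xml($json)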


Thanks for the comment, though.

mawrya
2007-02-28 20:30:10
You have mentioned eXist in numerous posts, so I finally downloaded it, created a collection of documents (business expense reports), made an XForm (a la Firefox) to feed information into the collection via PHP, and then fooled around with PHP to XQuery the collection and generate reports.


Wow!


For any application that is document-centric this is simply smooth. I've been playing with XForms for about a year and eXist was the missing piece to bring everything together. Now all I need is SVG trend graphs in those reports!


And that brings me to the future of XML... Walking around the office, I notice people use their computers in two basic ways: they are accessing the all-powerful central office application (a database-backed application) that controls the core production and financial aspects of the company, or they are creating Excel spreadsheets to try to manage data that is not easily handled by the centralized software. I have seen some spreadsheets so loaded with programming, macros and cross-references that they cause my eyebrows to recede right back past my hairline.


It's the same story everywhere I go... the users prefer the spreadsheets because they like to work with documents; they understand documents. They don't have to be connected to the server to work with a document. They can fill out a spreadsheet expense report on the airplane, for example. However, it's not long before they realize the limitations of their spreadsheets - they don't scale, it's hard to share the info inside them, and it's difficult to group multiple documents together and do analysis on them. Those are the jobs that the big corporate software does well. But, again, the corporate software system is annoying because it lacks all the benefits of documents.


The users want documents but the business applications want groups of documents. Both want to access the data in the documents, but they want to work with it in different ways. For so long it seems we have been made to choose between one or the other. For the past six years I have seen XML as the solution to this problem, but just now we are getting the tools to bring the possibilities to life: an XForm lets me, the human, manipulate the data in the XML document the way I would expect to do so, and an XML database such as eXist lets the central business application work with the data in the XML document in the way it needs to - in the context of a group of similar documents.
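As a rough illustration of the database side of that bargain (the collection path and element names here are invented, not taken from any real schema), the reporting query eXist runs over a set of expense-report documents can be as small as this:

(: Illustration only: total each stored expense report, biggest first. :)
for $report in collection('/db/expense-reports')/expense-report
let $total := sum($report/item/amount)
order by $total descending
return
  <summary employee="{$report/employee}"
           period="{$report/period}"
           total="{$total}"/>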


For example, if I have the company expense report XForm on my computer, I can be sitting on the airplane, open the form in Firefox, fill it out, and save it to my local disk as a single file. Then I can open it again when I get back to the office, make a few changes, and submit a copy to the centralized business application. I get to keep a copy on my computer, as a file, for easy reference in the future, and the head office gets all the data they want, the way they want it.


From what I understand, this is essentially what Microsoft has done with InfoPath, and I can see why. What better way is there to bridge the divide between the spreadsheet and the central database? XForms and XQuery/XML databases are the core, and I see SVG, XSLT and XSL-FO as the supporting tools used server-side to generate reports once the documents are in the XML database. It's funny, it seems that these supporting tools were created before the core pieces of the puzzle, but maybe it's like my grandma used to teach me: "when you are starting a puzzle, always look for the edge pieces first and then work inward."


Anyway, if/when XForms and XQuery start gaining more traction, I can see a lot of these "supporting" XML technologies coming along for the ride.

Kurt Cagle
2007-02-28 20:59:28
Yeah, eXist is proving to be one of my favorite "best-kept-secrets". It's actually made me realize what the XQuery paradigm CAN do, and the irony is that querying is in fact only a fairly small portion of it.


Your comments about the office users are also well taken. I haven't been as heavily involved in spreadsheets as I have with "word-processing" document formats, but I know that a significant portion of the documents I personally have dealt with are contracts. Typically the workflow of a contract looks something like this:



  1. Get sent a Word or PDF document - if I'm lucky they've put in form fields, but most of these generally are read-only.

  2. Print the document out.

  3. Fill out the form with a ball-point pen.

  4. Fax the contract or, worse, send it in a FedEx package.


Now, at home (and my home office) I do not have a land line - we just have cell phones. No land line means no fax. Moreover, I normally have to wait until I can get home to print the damn contract out, fill it out, and scan it back in with a scanner, no doubt losing a certain degree of fidelity and wasting a good hour or more of my time on mechanics; this can be a real pain since I'm typically also trying to cook and get kids to bed in the evening. I've lost a few contracts this way simply because the workflow was such a pain that I never got the contracts submitted.


Now, combine something like XHTML+XForms and an eXist XQuery database, and the generation of the contracts becomes reasonably simple while still making it easy to fill in the contract fields online. Security can easily be maintained via SSH to provide a fairly high degree of authentication (certainly more than you can get with a pen and paper and a fax machine ...), you can perform pre-validation to ensure that the information entered is complete and consistent, and the final "signed" contract can be emailed back as a secure PDF (through XSL-FO) for my files. I don't use any paper, don't need to have a fax machine or a phone line, and don't need to worry about getting contracts lost in the mail or endless rounds of corrections because someone messed up, dropped out, or was added to the contract at hand. The interfaces don't require that I pony up $500 for the latest Adobe Acrobat or MS Office, and the whole transaction can be completed in the time it takes me to fill in a few form fields.
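To give a flavour of that pre-validation step (every collection and element name below is invented for illustration), a small XQuery over the stored contracts can flag anything that is not yet ready to go out for signature:

(: Sketch: list stored contracts that are still missing required fields. :)
for $contract in collection('/db/contracts')/contract
where empty($contract/client/name/text())
   or empty($contract/effective-date/text())
return
  <incomplete ref="{$contract/@id}">{ string($contract/title) }</incomplete>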


So, yes, I have a personal motive in seeing XForms succeed - I don't feel I should have to pay for a fax machine or specialized software just to sign a contract.



piers
2007-02-28 21:44:08
@mawrya - I also experienced the "wow" factor with eXist and XQuery a while back, in a situation where a conventional database would be too big and conventional documents too small for the demands of the project, and I wholeheartedly agree with your comments.


With IBM Forms pushing it along, I think XForms should gain a great deal of traction at the enterprise level, given enough time, but it seems that, as a scaling down of the active server model, XQuery via eXist has so much to offer... to the long tail immediately, perhaps, where an enterprise solution is too grandiose. This may in turn lead to widespread adoption at the corporate level, as folks become more familiar with the idea of a database collection. Maybe if Google adopted eXist into its ever-expanding suite of software-as-a-service applications?


That would be something.

Hans Teijgeler
2007-02-28 22:40:56
Interesting overview, but I dearly missed an analysis of the state of the art in the Semantic Web domain. Maybe in a future issue?
Kurt Cagle
2007-02-28 22:55:33
I'm still trying to get a sense of where the Semantic Web itself is going. There's a lot of very interesting stuff occurring around SPARQL, RDFa has some interesting potential, and I see signs that OWL is finally penetrating Business Intelligence systems, but I'd like to hold off on covering that until I can get a chance to research it in greater detail.
len
2007-03-01 04:46:56
RE: JSON is just curly XML.


I guess we should have kept the SGML Declaration file too. :-)

Kurt Cagle
2007-03-01 10:00:03
Len,


I need to clarify my position here. I do not think that the Infoset by itself will be the only factor in determining the nature of XML, and in reviewing my own comments it occurs to me that I need to be more specific concerning three factors - the general use of HTML as an infoset, the role of JSON, and "alternative" forms.


XML by itself consists of two distinct concepts - the containment model infoset with the ability to define attributes (in the larger programmatic sense) upon containers or leaves, and the specific pointy angle bracket notation that acts as the serialization of the first.


There is also a third notion at work, though one that's not always made explicit: the notion that containment should not be inferred. XML and JSON both have explicit containment models. HTML (and SGML in general) assumes that containment can be implicit.


Unfortunately, once you start down that slope, things get slippery quickly - validating the well-formedness of XML is easy, in great part because you can ensure that a structure follows an explicit containment model without having to know an underlying semantic. HTML validation requires that you do know the schema, because only specific elements can be "improperly" terminated (such as the <li> element).


JSON lacks a number of fairly critical components to make it the full equivalent to XML - namespaces, attributes, comments and PIs, just to name a few. However, while XML cannot always be mapped to JSON without introducing some form of conventional semantic, JSON can always be mapped to XML, with the one convention that there be an explicit name for the anonymous overall container of a JSON object.
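A small, purely illustrative example of that convention (the container name "object" and the array-member name "item" below are arbitrary choices, not a proposed standard):

(: Illustration only: under the convention that the anonymous outer
   container gets an explicit name and array members get a conventional
   element name, the JSON value
       {"name": "Kurt", "langs": ["XML", "XQuery"]}
   maps cleanly onto the XML below, and back again. :)
<object>
  <name>Kurt</name>
  <langs>
    <item>XML</item>
    <item>XQuery</item>
  </langs>
</object>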


My recommendation on that front is that the W3C should recognize this fact and define more clearly what in fact constitutes the conventions for valid mapping of a JSON-like language to XML and vice versa - a compact notation, if you will. If they don't do it now then a given market leader will do it, and the W3C will have to address the issue later with considerably less control over the final product.


This is the same issue that is facing the W3C with regard to binary XML ... the realization that there are formats that will end up needing to be supported that don't fit into the convenient happenstance of TBL's original HTML notation. I think rather than spin their wheels wrt "clarifying" XML with an XML 1.1 spec that no one is taking seriously, it would be worth the W3C's efforts to understand that the map is not the territory here.

maizer
2007-03-01 23:50:45
Good article, but can you make it more readable?
Harsh S.
2007-03-03 12:04:00
>>JSON lacks a number of fairly critical components to make it the full equivalent to XML - namespaces, attributes, comments and PIs, just to name a few.
But JSON was not designed to be a general "markup" language, but a language for exchanging data structures.
Kurt Cagle
2007-03-03 19:02:57
I'm not claiming (nor, I think, would any but the most die-hard AJAX fanatic) that JSON is a replacement for XML in all circumstances - as you point out, and I agree, JSON is a way of encoding data structures. Even given this, it has some limitations - you cannot encode multiple items under the same hash key:


var obj = {a:"first",b:"second",a:"third"}


implies that obj.a = "third"


while


var obj = <root><a>first</a><b>second</b><a>third</a></root>


implies that obj.a is a nodelist of two <a/> nodes - and there is in fact no clean way to render that structure in JSON without losing document order or relying upon some additional constraint semantics.


That's why f:json(obj)=>xml:obj is generally feasible, while f:xml(obj) => json:obj is valid only for a subset of highly constrained schemas.
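To make the "highly constrained" caveat concrete, here is the same fragment again as a sketch; the obvious JSON rendering keeps the values but silently drops the ordering that the XML still carries:

(: The XML retains the fact that <b> sits between the two <a> elements;
   the natural JSON rendering, {"a": ["first", "third"], "b": "second"},
   cannot answer this question at all. :)
let $doc := <root><a>first</a><b>second</b><a>third</a></root>
return $doc/a[2]/preceding-sibling::b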


len
2007-03-06 05:30:56
So where is XML going? In search of its parent?


I enjoy watching ever more complicated explanations given to justify the centrality of a language that was awarded its central status by its claims to simplify the complications of its progenitor.


Very karmic.


I feel fortunate to have been around the track enough times to understand a few simple things:


1. No language is fit for all applications.


2. Any syntax can be made to fit all applications with enough exceptional language in the amendments to that '27 page spec' thrown so triumphantly to the floor.


3. Verbosity does matter.


4. Syntax barely matters, but without it you have to create an abstract object model that is more a ball and chain than an SGML Declaration ever was.


As the twig is bent, Kurt.


len

jeffgoz
2007-03-06 11:43:24
I am very interested in hearing about the SVG (Scalable Vector Graphics) XML standard. If you look at the potential for this market and the quality of web playback, the opportunity for SVG is really fantastic. We need an open web standard, not a closed standard like Flash or Microsoft's. I also have been playing with a neat tool called SnapKast. The SnapKast software converts a PowerPoint presentation to SVG and then delivers the content as an MPEG-4 file or podcast. I hope we can start to see better animation from this standard. The possibilities are really great, but we need more companies like SnapKast taking the standards to video applications.
Kurt Cagle
2007-03-06 12:16:09
I think your points are true of most languages. To me, the argument tends to come down to which domains XML is appropriate for, and which domains it's not. We know that it's not a terribly efficient language for representing highly indexed data, even though the serialization of that data in XML works reasonably well. It's a hideously inefficient imperative language, though as a wrapper to such languages it provides a nice metadata layer. It's not the most ideal language for expressing relational mappings by hand (I personally hate RDF+XML for its complexity), but it turns out that machine encodings of RDF as XML make a great deal of sense. It's a fairly inefficient mechanism for encoding data structures, but it fulfills the minimal requirement that you can express most relationships with it (you can only have one instance of a given hash key active at a given level in a JSON document, for instance, so you cannot express non-adjacent tag containers in JSON).


XML is, by definition, a compromise, and always has been. Like most such compromises, this means that there are almost inevitably better specialist tools for any given problem. In many cases, the benefits of those tools will almost certainly outweigh the benefits of the more generalist XML approach, but in other cases they may not.


I think a lot of people look back to SGML with fondness, as if it represented a nirvana standard and the move to XML represents some kind of fall from grace. I came to SGML fairly late in the process, and for me it was a complex, unwieldy, difficult language to work with - yes, it managed to embody a great number of concepts that have been "reinvented" under XML, but overall I think the reasons for creating XML (warts and all) from SGML remain as sound today as they did ten years ago. Personally, I suspect that if SGML were in fact such a perfect language, then you'd be seeing a significant uptick in its usage, and I've seen no such thing. I had one commentator on another blog say the same thing about LISP's superiority over XML (which I'll readily grant), but notice again that LISP remains very much a niche market.


XML is hideous for some applications, marginally acceptable for more, and "good enough for government work" for a whole lot of others. I suspect that combination of necessary and sufficient is what keeps the XML engine going.

len
2007-03-07 02:34:12
I never said it was nirvana, and my thinking isn't that fuzzy. I look at the human reasons for technologies and find the overlap often quite contradictory. Yet much is made of the separation of validation, well-formedness, death to DTDs, yadda yadda yadda, Kurt, and then I see this:


"Valid markup has become equated with two things nobody wants: impracticality and implausibility."


quoted at alistapart by an author who obviously started his career with the web and continually has to justify the costs of working with broken code:


PRECISELY. And it only took the wisdom of the crowds here a decade to catch up to 1993 on that point.


It's broken, Kurt. The people who broke it keep defending that position with a lot of self-covering justification based on technical reasons that, given the human uses, are becoming highly suspicious, because technical means did exist, were abandoned, and are being replaced with even more complex means - all of this in the name of 'simplicity'.


Sorry, fella, but at some point you have to ask yourself where we went wrong and what we should learn from that. I don't expect Bosak or Bray to ever do that because they can't, but the rest of us might want to. Given that quote, something has obviously gone very wrong. If we can't learn from that, then our blogs will only get longer because it takes a very long coat to cover asses hanging out that far.

Kurt Cagle
2007-03-07 08:10:52
So where is the solution? Go back to a pure SGML, where we end up replacing flawed but largely interoperable XML with flawed and largely non-interoperable SGML syntaxes? Recognize the validity of HTML as a "human-readable" format when most content nowadays is being generated by machines for machine consumption? Resuscitate CORBA from the dead?


It's easy to criticize the W3C - I've got enough personal reasons to fill the walls of my house in microtype - but overall I see the reason for the problems that occur on the XML well-formedness side (let's ignore validity for a moment) really coming down to questions of poor programming practices, venality and ignorance ... and no standard, no matter how well crafted, is going to be able to overcome those three factors. Yes, JSON is simple, but it's also woefully inadequate for representing most document structures. If you want to see confusion, try getting the average JavaScript monkey to learn LISP, even though it's what XML (and JavaScript, for that matter) keeps trying to be. We can ignore the declarative side completely and stay only on the imperative fence, but then we end up performing a lobotomy on the gestalt programming brain.


It's easy to criticize XML, but it's a lot harder to see what's out there that has the potential to replace it. I see hopeful signs that something IS emerging - I think that XML + JavaScript + XMLHttpRequest together make up a fairly potent mix, especially as we begin to discard a lot of the things that have been baggage - the use of the DOM as the only means of manipulating XML objects - and as we see the appearance of a more coherent XSLT and the rise of objectified XML objects. Already, DTDs and PIs are receding from usage, and entities are frankly disappearing except when absolutely necessary. The language that's appearing out of the XML duck's nest may be more of a swan (to the bewilderment of the ducks), but I personally see that as a good sign.


As the first comment in this thread pointed out, ultimately the question becomes "What is the Infoset?". Well-formedness is another way of saying "unambiguous", not necessarily a measure of syntactical correctness. HTML 4.0 is ambiguous because it makes significant assumptions about completeness and containment that, I believe, require some semantics to leak into the syntactical model, which is bad design. Ideally you want to be able to avoid instantiating the infoset in order to determine its viability.


So what, exactly, are you identifying here as flawed? If it is the syntax, then yes, frankly, I'd agree with you, as I think many others do. To me, the best way to push that is to raise the question of what exactly we mean by the Infoset (I'll likely do so shortly in a blog anyway) and try to push that envelope as much as possible. If you're arguing that the Infoset model itself is flawed, I'd have to disagree. For what it needs to do, the infoset is pretty nicely balanced between flexibility and efficiency. Perhaps I'm being dense here, but I just cannot see the benefit that scrapping the infoset would have, compared to the gain that it's already managed to bring.


Any ideas on this?

Rick Jelliffe
2007-03-09 06:27:17
On the subject of SGML, JSON, etc., I think we should recognize that a standard can be a success merely by being ready at the time it is needed and bowing out after it has allowed technology to leap to its next stable state. ISO standards recognize this by having a five-year review process; every five years, every living standard has to be confirmed as still in use, otherwise it gets taken off the list.


We can get too caught up with the permanence of standards, as if they were holy scripture. Consider SGML: it took years for Charles and everyone to gather the requirements, harmonize them, and get the standard out; then after 10 years it is superseded by something much simpler, and after 15 years people start doing the things that it originally standardized, though without the benefit of an automatic mapping to SGML. It is frustrating, and dumb or ignorant in another sense; but it is perfectly OK, because the reason for a standard is to meet a market requirement at that time, not to be the answer for all times.