XML makes you stoopid!

by Ric Johnson

I wrote an article the other day on how Google decided NOT to use XML for a project they recently open sourced. I received a LOT of very opinionated responses to that post. Unfortunately every one was from a complete MORON.

22 Comments

Guy Mac
2008-07-14 19:25:17
You encode a single digit in 3 *bits*...
Shawn
2008-07-14 20:33:06
I think it is interesting that Facebook introduced a very similar concept in Thrift for their infrastructure. Facebook and Google face a very similar problem and came up with a very similar solution (inspired by the same research).


After looking at Protocol Buffers, I am impressed by its simplicity. After about an hour of reading, I was able to understand the basics of the wire protocol and was able to write Python code to read and write Protocol Buffers.


I am not sure if Google could have created an efficient version of XML (many have tried in the past) or modified JSON to fit better. My guess is that, faced with tight performance requirements and scaling problems, they found nothing that fit what they needed, and modifying XML or JSON was too much work. Back when they created this (2001 I think), XML was fraught with interoperability problems. Since then, most of those problems were solved. But they had an immediate problem that needed to be solved and didn't have time to go through the W3C to modify XML for something they were using in-house.


Could they have used something better for Proto files? Maybe. Could they have used XML or JSON as their interface definition? I don't really know. The devil is always in the details and hindsight is 20/20. They probably didn't do everything right, but they did do a lot of things right. I find this a very promising alternative to XML. Will it replace XML? Probably not. Will it fill a void that XML and JSON don't? Probably.


M. David Peterson
2008-07-15 05:02:56
@Shawn,


>> "Back when they created this (2001 I think),"


<snip/>


>> "Could they have used XML or JSON as their interface definition?"


XML maybe. But definitely not JSON. It hadn't been invented yet. ;-)


So what this really all comes down to is that Google has an internal data serialization format that solves some particular in-house problems. <speculation>Some engineer probably said to some mid-level manager "Hey, we should open source this!". That same manager probably thought about it for a minute and thought "Hey, if we did, then the whole world would be able to speak 'our' native language, and that would mean we could improve the efficiencies of our external APIs using something we are already very good at dealing with."</speculation>


In short, this is a good thing for Google, but unless you're a company the size of Google that processes the information that Google processes, what's good for Google doesn't translate to what's good for you. If interop with *other* web service providers is important to you, you should probably stick with XML and/or JSON. If it's not, then take a look at proto buffers.


Point of contemplation: If MSFT had open sourced a proto buffers like serialization format, then instead of praise -- from people who either work for Google (and therefore are biased) or who simply have nothing better to do with their time than attempt to reinvent something that doesn't need reinventing, just because they *think* their infrastructure will instantly become more efficient as a result -- what would the headlines read?

Scot
2008-07-15 05:41:44
2) Google COULD have used XML if they wanted ( or JSON looks like a better fit)


But then they wouldn't get the advantages PB brings to the table.


3) Using an established format would NOT have affected the actual TRANSFER


What do you mean by "established format"? If you mean XML, then the transfers would be slower than other "established formats". It would be far more interesting to see PB compared to ASN.1 than XML, as PB is far more efficient than XML.


4) ...they optimized, and in so doing specifically limited expandability or at least reuse.


What are you on about? Exactly how does PB have limited expandability or reuse potential?


5) IF they spent this time creating better XML or JSON libraries, then we ALL benefit in projects that do not use this protoco


Again, what are you on about? No amount of wasted time or money on XML/JSON libraries would ever bring the features of PB to XML/JSON. More to the point, why should they bother, if XML/JSON is not the itch they want to scratch?


I welcome any REASONED argument.


Helps if you actually make reasoned arguments yourself.

Ric
2008-07-15 06:08:05
Scot,


Thank you for your input.


However, did you miss the part that XML does _NOT_ go over the wire?
I mean, I only said it 5 times. Perhaps if I repeat it?


THE TRANSFER _OVER_THE_WIRE_ does *NOT* use XML!


They could have used Excel spreadsheets as the proto files, and when they send the RPC, the amount of data and latency would not have changed.

angus
2008-07-15 07:14:39
THE TRANSFER _OVER_THE_WIRE_ does *NOT* use XML!


It seems you should also blame Python and Ruby - they should all be coded in XML and shouldn't be creating new syntax

Eric Jain
2008-07-15 10:52:00
Looks like what you're talking about now is not the serialization format, but the custom definition language? Google could no doubt have come up with an XML syntax for that (trivial), though I'm not sure what the benefit of that would be? On the other hand, if the point of contention is that they didn't use XML Schema as their definition language, keep in mind that XML Schema is an order of magnitude more complex. Google may have a lot of resources, but it's a good sign that they are not squandering them on writing tools that are way more complex than what they need, or have no direct benefit for them (apart from showing off how smart they are).


P.S.


MORON = person who disagrees?
REASONED = in agreement with the original post?
treatise = brief rant?


Ric
2008-07-15 10:59:53
@Eric:
Finally! Someone actually READ the spec and my contention. "Google could no doubt have come up with an XML syntax for that (trivial), though I'm not sure what the benefit of that would be" -
YES - I admit it may not have been as fast: the benefit would have been BETTER processing libraries for us all AND a protocol we could all use out of the box without having to understand a new format.


P.S.
MORON is a person who argues the wrong point repeatedly
REASONED means someone with logic in their arguments
TREATISE = (long) rant - you got me on this one :)

MORON
2008-07-15 12:03:45
Rather amusing exchange here! Seems like all the Morons that bothered to reply to your original post actually read Google's announcement (link in the first sentence of your post) in which the question is posed on how to encode the data to guarantee interoperability and backward compatibility and all that good stuff.


One possibility to guarantee these things is to serialize the data to xml and write that to disk or send it over the wire. However this would be rather expensive. This is how I, MORON, interpret the first 3 paragraphs in the Google post you refer to. It then follows that Google’s decision not to use XML for this is actually a good engineering decision and the whole reason for the existence of Protocol Buffers.


And along comes Ric Johnson claiming to know how to solve the above stated problem using XML. At least that is the impression one gets when reading your original blog post. You then go on to clarify in your second post that you are in fact talking about something entirely different.


Took your bait, and for that, I am a Moron!


Maybe you should take better care crafting your blog posts instead of calling people names.


Cheers

Lars
2008-07-15 12:11:55
"the benefit would have been BETTER processing libraries for us all AND a protocol we could all use out of the box without having to understand a new format."


But would these 'better' libraries be good enough for what Google needed them for? And considering that the package is now open source, what is stopping anyone from adding an XML representation as alternative input?


However, personally I don't think anyone will come up with an XML representation that is as human readable and succinct as this format.

M. David Peterson
2008-07-16 00:14:54
@Scot,


>> It would be far more interesting to see PB compared to ASN.1 than XML, as PB is far more efficient than XML.


Agreed. In fact, why don't we set aside three-letter-acronyms altogether and instead use the following statement,


Binary formats are more efficient than their text-based counterparts. The fact that the non-compiled representation of a protobuf happens to be something other than XML is not what makes them fast. It's the fact that -- when compiled -- they're a binary format, and binary formats are *MANY* magnitudes more efficient than their text-based representation. Trying to suggest that it's the markup language in and of itself that makes proto buffers so fast is like attempting to argue that adding,


using System.Xml;


to a C# code file and then referencing an 'XmlDocument' directly throughout your code, e.g. XmlDocument xDoc = new XmlDocument(); will make your compiled code run faster than it would if you used System.Xml.XmlDocument xDoc = new System.Xml.XmlDocument(); Just because it requires a greater amount of "markup" in the second usage example doesn't mean the code takes longer to compile and/or the compiled code runs slower.


Bottom line: This isn't about the chosen markup language being better than any other chosen markup language. This is about the difference between text and binary, plain and simple.
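
To make the text-versus-binary point concrete, here is a minimal Python sketch that stores the same integer both ways and compares sizes. The sample value and the fixed 32-bit layout are assumptions chosen for illustration; this is not how Protocol Buffers lays out its fields, and it measures size only, not parsing speed.

```python
import struct

value = 1234567890  # arbitrary sample value (assumption)

# Text form: one ASCII byte per decimal digit.
as_text = str(value).encode("ascii")

# Binary form: fixed-width 32-bit unsigned integer, little-endian.
as_binary = struct.pack("<I", value)

print(len(as_text), len(as_binary))

# Both forms decode back to the same number.
assert int(as_text.decode("ascii")) == value
assert struct.unpack("<I", as_binary)[0] == value
```

The text form also needs a delimiter or length prefix in practice, and the reader has to convert digits back into a number, while the fixed-width binary form can be read directly.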

marc
2008-07-16 06:48:30
95% of people are using XML because "it is the way to go because all people are promoting it so it must be good".


See for example Microsoft Office OOXML format: an awful use of XML .. which contradicts all its goals.


But it *is* valid XML, this is the important thing ... you get the "it is XML!" logo in your product.


Google is not limited by this marketing necessity... they must really make things work and work efficiently... so in this case, and I'm sorry that you don't like it: XML is not the way to go


Andre
2008-07-16 07:16:38
Let me take your summary point by point:
1) Agreed.
2) You mean the .proto description DSL? Well, no one keeps you from writing an XML->.proto compiler. :-)
3) Agreed, if you again mean the meta-format for describing messages.
4) Disagreed. By creating a meta-format with the .proto language, they have - at least in-house - increased the reuse dramatically. Also the message format itself is extendible. What extensions would you suggest on the .proto format? Where do you see a limit to expandability or reuse? Do you realize that the meta-format for JSON (as a subset of javascript) or XML (DTDs or schema) is also fixed?
5) Disagreed. They had a task which required fast encoding and decoding. No text based or general purpose format would have cut it. Are you implying that benefiting the XML/JSON community is more important than solving the particular problem at hand?


P.S. Tone down the flamebaiting if you are interested in a REASONED argument. :-)

Ric
2008-07-16 07:36:38
@Andre:
Thank you for a reasoned response. I will try to tone it down a bit.
You are right: Google's primary concern may have been SPEED rather than re-use. My point is the EFFORT of creating the new format may have been applied to make the XML/JSON libraries faster instead. It may not have matched the final total speed of their current solution, but YES, I am saying the COMMUNITY is more important than the particular problem at hand, IF that is their final real goal.
It is a difference between optimizing a tight loop as a developer vs. changing the architecture for the SYSTEM to become faster.
Andre
2008-07-16 08:17:04
Hi again. I'd like to make the distinction clearer, as it appeared a little muddy through your writings: There is the message format (the language) and the .proto format (the meta-language). So the need for speed does not extend to the meta-language, as the protoc is used only at build time.


Now you appear to have two criticisms:
1. They should have used DTDs/XML/JSON/[insert your favorite standard here] instead of inventing their own C-like-language for the .proto meta-language.


I agree with Eric that their .proto format is a very simple and readable (for a C++ programmer) representation of what they need, and I also see no way that involving JSON or XML would bring any improvement.


2. Best, IF (citing your big if) they want to benefit the XML community (which, as you argue, is more important than a problem they needed to solve?) they should have tried to create faster XML/JSON parsers.


I sure must have misunderstood you, because this sounds silly. If you accept that Google could not have used a text-based format to solve their problem, why then do you argue that this is somehow bad for XML because the time the Google folks spent solving their problem (which they needed solved anyway) could have been spent on writing better XML tools? Even if they had used XML or JSON as a basis for their meta-language, they would have no direct incentive to optimize the parsers. As the .proto meta-language is quite simple, the time spent writing the parser was probably limited anyway.

dbt
2008-07-16 10:39:59
you're missing the point. It's mathematically impossible to parse text as fast as you can parse binary. Rather than trying to shave 10% off improving a JSON library performance, they made it 20x faster by replacing it with something that fit their needs.


For an inside-the-firewall application, this thing makes a lot of sense.


And I don't understand why you're so afraid of parsing a really lightweight grammar like .proto files. Sure, they _could_ have used XML for it, but since nothing else about the spec is XML it would have just been gratuitous.

Andre
2008-07-17 02:51:59
By the way, regarding your point about long numbers, the Protocol buffers have enough different number formats to choose from. Given a developer chooses wisely, they will (except in very rare circumstances) have less overhead than a text-based representation. To encode a number n, you will need the following amount of bytes:


- If it fits in 29 bits, use the variable-length encoded integer formats: 2+[log128 n] bytes
- If it fits in 64 bytes, use a fixed 64 bit format: 10 bytes
- Else, use a string as a variable-length bitfield and do the number-coding in your own code: 2+[log128 log256 n] + [log256 n] bytes


Contrast that with plain text or base-64 encoding (generously assuming that we have a single byte as separator):
- Plain text will use one byte per digit: 1 + [log10 n] bytes
- Base64 encoding: 1 + (([log64 n] mod 3) * 3 bytes


Even with this generous assumption, the binary formats win most of the cases.
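
The variable-length case can be checked with a short Python sketch of base-128 varint encoding (7 payload bits per byte, high bit set when more bytes follow). The function names are made up for this example, and the counts below cover the value bytes only, not the field key or any separator that the estimates above include.

```python
def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    if n < 0:
        raise ValueError("this sketch handles non-negative integers only")
    out = bytearray()
    while True:
        low7 = n & 0x7F
        n >>= 7
        if n:
            out.append(low7 | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low7)         # high bit clear: last byte
            return bytes(out)


def decode_varint(data: bytes) -> int:
    """Decode a single varint from the start of data."""
    result = 0
    shift = 0
    for b in data:
        result |= (b & 0x7F) << shift
        if not b & 0x80:
            return result
        shift += 7
    raise ValueError("truncated varint")


# Compare varint length with the number of decimal digits.
for n in (1, 300, 1_000_000, 2**32):
    print(n, len(encode_varint(n)), len(str(n)))
```

For small and mid-sized values the varint is shorter than the decimal text, which is the pattern the comparison above describes.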

Andre
2008-07-17 05:47:39
Correction: I meant fit into 64 bits, not 64 bytes - fitting a 64 byte number into 64 bits is an exercise left to the reader... :-)
Kurt Cagle
2008-07-18 20:54:10
For more information about the rationale behind the use of Protocol Buffers, take a look at an interview with the person who open-sourced it:


http://news.oreilly.com/2008/07/interview-google-open-sources.html

fauigerzigerk
2008-07-19 02:42:40
I think it comes down to that question of separation of payload and metadata and the role of schemas in all of this. Stripping metadata from the payload makes it more efficient but it also makes loosely coupled data exchange more difficult and error prone.


I agree that they could have devised an optimised XML serialization format that would have had characteristics very similar to PB, but that format would lack many of XML's advantages. It would not be human readable or writable. You would not be able to process it without a schema as there would have to be translation between IDs and names and a lot of other trickery.
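
A rough Python sketch of why such a payload needs a schema: the bytes carry only field numbers and raw values, so names have to come from somewhere else. The payload is hand-built, the SCHEMA mapping is hypothetical, and only two wire types (varint and length-delimited) are handled here.

```python
def read_varint(data, pos):
    """Read one base-128 varint starting at pos; return (value, new_pos)."""
    result = 0
    shift = 0
    while True:
        b = data[pos]
        pos += 1
        result |= (b & 0x7F) << shift
        if not b & 0x80:
            return result, pos
        shift += 7


def parse_fields(data):
    """Return {field_number: raw_value} without knowing any field names."""
    fields = {}
    pos = 0
    while pos < len(data):
        key, pos = read_varint(data, pos)
        field_number, wire_type = key >> 3, key & 0x07
        if wire_type == 0:    # varint
            value, pos = read_varint(data, pos)
        elif wire_type == 2:  # length-delimited (strings, bytes, nested data)
            length, pos = read_varint(data, pos)
            value = data[pos:pos + length]
            pos += length
        else:
            raise ValueError(f"wire type {wire_type} not handled in this sketch")
        fields[field_number] = value
    return fields


# Hand-built payload: field 1 = b"Ann" (length-delimited), field 2 = 150 (varint).
payload = bytes([0x0A, 0x03]) + b"Ann" + bytes([0x10, 0x96, 0x01])

raw = parse_fields(payload)
print(raw)    # keyed by number only

# Hypothetical schema: without it, nothing says field 1 is a name.
SCHEMA = {1: "name", 2: "id"}
named = {SCHEMA[num]: val for num, val in raw.items()}
print(named)
```

Note that the parser cannot even tell whether field 1 holds text, raw bytes, or a nested message; that interpretation also lives in the schema.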


And then there is the schema language issue. I know it's taboo in the XML community to bring this up again, but XML Schema is simply the worst piece of crap ever invented and nobody wants to use it. However, Google is in a position to force the XML community to reconsider this matter, for instance by backing RELAX NG.


What about JSON? You point to the Person message format and say "Am I the ONLY one here that saw this as a JSON construct?". But you're confusing schema and content here. The fragment you show resembles JSON syntax, but its role in PB is that of a schema whereas JSON doesn't even have a schema language. I don't understand the relevance you afford to this coincidental syntactical similarity between a schema language and the JSON payload format.


By the way, the whole affair very much reminds me of the debate around binary XML and XML compression. Much of what can be said about this topic has been said there.

kay
2008-07-21 01:26:06
I find the IDL-like syntax is 100 times more readable than the equivalent XSD...


That is all.

Jimmy Zhang
2008-07-23 18:49:25
The other issue of .proto is that it is schema dependent... and leads to tight coupling