Google hates XML

by Ric Johnson

I just came across an article announcing that Google has open sourced its 'Protocol Buffers' format, and that the company decided NOT to use XML.

It is great that Google is contributing to the community and showing the world how its systems work, but I wonder if the inmates have taken over the asylum. I am a software engineer, so I enjoy the technical detail, but I have also run my own company and learned to view problems from the business end. If you let the engineers run the show, you get a very narrow viewpoint.

I think there is a severe 'Not Invented Here' syndrome inside Google.

Comments

DeWitt Clinton
2008-07-11 11:22:52
Hi Ric,


I assure you, Google does not hate XML.


On the contrary, we're one of the biggest supporters of XML-based APIs, with a huge variety of AtomPub services (http://code.google.com/apis/gdata/). Between AtomPub and our JSON-based Ajax services we're deeply committed to interop-friendly internet protocols.


Protocol Buffers are advantageous in other situations, though. Protobufs offer known endpoints the ability to serialize data with a high degree of time and space efficiency. The gain in performance and reduction in size is significant -- especially on the scale of data that Google processes! -- but we don't expose protobuf-based endpoints to external, unknown clients. For those we do indeed tend to favor XML-based protocols.


Cheers,


-DeWitt
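
To make the size difference DeWitt describes concrete, here is a minimal Python sketch that hand-encodes the Protocol Buffers wire format for the 'Person' record quoted later in this thread and compares it with an equivalent XML fragment. The field numbers and the XML shape are my own illustration; real programs would use classes generated by the protoc compiler.

def encode_varint(n):
    # Base-128 varint: 7 bits per byte, high bit set while more bytes follow
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def encode_string_field(field_number, value):
    # Wire type 2 (length-delimited): key byte, length varint, UTF-8 payload
    key = (field_number << 3) | 2
    data = value.encode("utf-8")
    return encode_varint(key) + encode_varint(len(data)) + data

# message Person { required string name = 1; required string email = 2; }
pb = encode_string_field(1, "John Doe") + encode_string_field(2, "jdoe@example.com")
xml = b'<person name="John Doe" email="jdoe@example.com"/>'
print(len(pb), len(xml))  # 28 vs 50 bytes, before any numeric or repeated fields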

Eric Jain
2008-07-11 11:24:31
Well, there is no X in 'Google', or is there? :-)


To be fair, I've run into the same issues when storing large sets of small items for fast random access. Binary serialization had a huge advantage (I don't know about 200x, but definitely 10x), perhaps because it's more compact (which means less disk access and better use of caches) and simpler to parse.


I've tested various "binary" XML formats, but didn't find any that offered a significant advantage over plain old gzip (especially when you take CPU use into consideration). Gzip is great for distributing medium-to-large documents, but less so for storing and transferring small pieces of data within a system.


I agree that the best optimization is to simplify the data model, but there is only so much you can do, and if you reach the point where engineering is removing stuff from the customer data model, then that's what I'd call having the inmates run the asylum!
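
Eric's gzip observation is easy to reproduce: gzip carries a fixed header and trailer, so a tiny document can come out larger than it went in. A quick sketch, using the same toy record as above:

import gzip

xml = b'<person name="John Doe" email="jdoe@example.com"/>'
packed = gzip.compress(xml)

# The fixed gzip framing (~18 bytes) plus block overhead usually makes a
# 50-byte document grow; compression only pays off on medium-to-large
# documents with plenty of redundancy.
print(len(xml), len(packed))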

James Bennett
2008-07-11 11:27:30
See, me, I look at the guys who do a better job working with markup than anyone else on the planet, and hear them say "XML didn't suit what we needed", and think "wow, they probably know their own needs and the limitations of available tools better than I do".


But then I don't really have my life invested in any particular markup scheme, so I don't need to be personally insulted when somebody decides XML doesn't work for them ;)

Ric
2008-07-11 11:48:08
@DeWitt: I know Google does not really hate XML, but sometimes it seems like it! I use a lot of your protocols, but mostly via JSON. I am SURE Protocol Buffers are kewl, but my point is that you guys could have made XML work. I am willing to bet someone will build something similar in less than a year that uses more structured markup. You game? By the way, the website you linked to seems to be down.
@Eric: The whole point of my post is that Google created a whole NEW format instead of improving the XML libraries. Then the 200x speed advantage would drop to 100x, then 50x, then ....
@James: I agree XML is not the best tool for EVERYTHING, but it seems they invented a new markup that depends more on context, which limits its reusability.
dbt
2008-07-11 12:00:47
Excuse me, have you ever dealt with systems that have to transmit hundreds of thousands of messages per second?


XML is good for some things, but dealing with large amounts of data at wirespeed is not one of them.

Andrew
2008-07-12 00:23:53
I think you're misunderstanding the use of this technology. It's there for low-latency message passing and isn't really designed for much else. The Google engineers don't hate XML; it's just that for the number of transactions they are talking about, XML is too bulky. Yes, you can compress the XML, but that in turn increases the latency. It doesn't replace XML, and you don't get the flexibility, validation, etc. that you get with XML; for _this_ use case, those don't apply. What they want is a small, fast format that is quickly parsable. They don't need to be able to query it, display it in different ways, etc.


I really think that what they have is a decent solution to a problem that XML isn't directly suited for. If you want the things XML provides when you are creating RPC, then use XML; if you want raw throughput and low latency, then Google has _one_ solution.


CORBA, RMI, and DBUS are other ways of doing the same thing.

John
2008-07-12 04:22:48
Sorry, but this was just the most stupid blog post I've read in a while. It feels like it was written by a 14-year-old who is pissed that somebody doesn't like XML. I'd say there are reasons not to like it, but as DeWitt pointed out, Google is far from hating XML.


If you're so good, why don't you just make XML run circles? Nobody is holding you back!


This post made me finally unsubscribe from your blog. Happy days in your angle-bracket world!

Henning
2008-07-12 07:24:10
Ric,
It seems to me that the opinion you expressed in your post is rather misguided, or at least comes from a narrow point of view. XML is great for many things; I use it on a daily basis in my work and could not imagine solving certain problems without it. However, the statement that Kenton Varda makes in the announcement of Google's open-sourced Protocol Buffers makes perfect sense when you consider the environment they are working in.

I work for a company that makes applications that process large amounts of data (hundreds of millions of records per hour), and while we use XML to communicate on the front end and to configure our applications, early tests showed that it was impractical and much too expensive to do so in our backend systems. For certain things it simply adds an extra layer of processing that is not necessary and only costs you cycles.

Even if you can "make XML run in circles" and parsing it takes only a fraction of a second longer than reading a binary format, multiply that fraction by a few hundred million records (or, in the case of Google, probably billions) and you end up with a huge chunk of time, on the order of hours or even days, depending on your data.
It always depends!
Cheers,
Hans
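
Plugging hypothetical numbers into Henning's multiplication makes the point vivid (the 10 microseconds below is my assumption, not a figure from his comment):

extra_per_record = 10e-6          # assume 10 extra microseconds of XML parsing per record
records_per_hour = 500_000_000    # "hundreds of millions of records per hour"

overhead_hours = extra_per_record * records_per_hour / 3600
print(overhead_hours)  # ~1.4 hours of parsing overhead per hour of data; you can never keep up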
Christophe
2008-07-12 19:00:23
The shortest possible Protocol Buffers message, encoding a single-digit number, is three bytes.


The shortest possible XML document, encoding a single-digit number, is by my count 29 bytes.


I don't think any super-clever XML parser implementation is going to be able to get around the overhead that a textual, markup-based representation has. It's a speed-of-light issue; you have to walk over and parse those characters.


For public APIs, XML has a lot of virtues (for example, a highly corrupt XML document is probably not going to cause sparks to fly in your application). But I wouldn't be surprised if Twitter is kind of wishing it had used Protocol Buffers for its internal message passing.
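
Christophe doesn't show his arithmetic, so here is one plausible reconstruction of both counts (an assumption on my part): the XML side reaches 29 bytes only if you include the customary declaration, and the protobuf side reaches 3 if you add a length prefix to the two encoded bytes.

# XML: 21-byte declaration + 8-byte minimal element = 29 bytes
xml = b'<?xml version="1.0"?><n>7</n>'
print(len(xml))  # 29

# Protobuf: key byte (field 1, wire type 0 = varint) + value byte
pb = bytes([0x08, 0x07])  # encodes field 1 = 7
print(len(pb))  # 2 on the wire, 3 with a one-byte length prefix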

orlando_ombzzz
2008-07-12 21:16:11
XML was not designed for efficient data interchange between applications or processes.


Full stop. Don't cry.

charles
2008-07-13 10:02:25
Now I understand! You are a genius! If you decide to actually do something, Google will weep and cry...
Dan McCreary
2008-07-13 11:05:32
It seems like Protocol Buffers is just XML using attributes instead of elements.


Protocol Buffers:
person {
name: "John Doe"
email: "jdoe@example.com"
}


XML with attributes:
<person
name="John Doe"
email="jdoe@example.com"
/>



So where is the efficiency if I am using an XML appliance with XPath in VLSI ASIC hardware?


Anonymous Coward
2008-07-13 16:28:39
If you want a challenge, I suggest you write an internet scale XMPP implementation using run-of-the-mill XML parsers as your only parsing tool. No cheating; no writing your own shallow parser to tokenize stanzas at the border.


(Internet scale here implies concurrent users measured in the tens of millions.)


Requiem
2008-07-13 16:58:32
Protobuffers and XML address different problems.


XML is geared towards representing data in situations where you do not care all that much about performance. A key idea is to have a (presumably) more self-explanatory representation of the data than pure binary serialization formats offer. By its nature this means the data will be bulky, extremely verbose, and full of redundant information.


Every data type, when expressed in valid XML, will use many times more bits to represent the same information than would be the case in a format where you take great care to only use as many bits as you have to (with some tradeoffs for performance).


This is fine. It is a design tradeoff for XML. It is why it makes sense to use XML for documents, and in messaging architectures where performance matters less than the protocol being easy to figure out by casual inspection.


Protobuffers have a completely different goal, and I think you missed that in your post. They are about saving network bandwidth, memory, AND CPU in situations where you have an abundance of data but a finite amount of money and space for hardware. Their intent is not to represent documents in a way that lets you recover the information without any knowledge of the system that wrote it. It is performance.


I thought it was rather obvious.


(How many XML parsers do you know that produce code for parsing specific schemas? Hey, perhaps they should.)


Here's a challenge for you. Implement a simple networked file system and use XML as your transport. Say the filesystem will be used by a multimedia system for streaming content aboard a plane. You have 450 seats, each equipped with its own viewing device connected to your server by Ethernet. Each user should be able to choose what he or she wants to see, and when, so you could end up serving 450 streams all at once. You can choose from the most common XML implementations that are published as open source, but you cannot write your own specialized parser that only understands your schema.


In this scenario, electricity, networking, cooling, the unit cost of the viewing device, the weight of the system, latency, and availability will all have an impact on feasibility and usability.


Show me why XML would make sense in this scenario.

Christophe
2008-07-13 19:34:48
@Dan McCreary:


I believe you are confusing the .proto specification file (which is a text file that is input to the toolset, and is not part of what is sent over the wire) with the binary data that is sent over the wire.
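
The distinction is easy to see side by side. A sketch, using 2008-era proto2 syntax close to the example in Google's announcement:

# The .proto file is plain text fed to the protoc compiler; it never
# travels on the wire:
#
#   message Person {
#     required string name  = 1;
#     required string email = 2;
#   }
#
# What is actually transmitted is compact binary. For name = "John Doe":
wire = bytes([0x0A, 0x08]) + b"John Doe"  # key (field 1, length-delimited), length 8, payload
print(wire.hex(" "))  # 0a 08 4a 6f 68 6e 20 44 6f 65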

Andre
2008-07-14 07:48:38
Ric: As many before me have pointed out, Protocol Buffers are more a marshalling solution (like ASN.1) than a document language (like JSON or XML). Different problems, different solutions. By the way, your NIH argument would probably make more sense if you compared Protocol Buffers to other marshalling solutions.


Dan McCreary: You appear to have missed that messages can include other messages, thus creating an element-like relation.

Gregory Pierce
2008-07-18 04:07:21
Ric, I'm afraid that your blog post is somewhat misguided. While Google did create new markup for this, they really did it in the same vein as JSON; in fact, I'm sure the authors were heavily influenced by it. As stated before, XML is not the best way to do things all the time. It may be the cleanest way, it may even be the most standard way, but it is rarely the most efficient way. XML tends to be fat, even when you are dealing with optimized forms of it. Some of the methods you describe, such as XML plus zip, require extra time to take the data from its zipped form to its uncompressed form so that you can then parse it; either that, or you have to write more complex code to parse it.


Personally I don't see too much wrong with their approach. I wish they had simply adopted JSON, since they aren't too far from it and it is an approach well understood by more people, but it's clear that they felt they needed to re-engineer everything from the ground up for performance, and I'm hopeful that they didn't just do a clean-room approach for grins before looking at JSON and similar solutions to the problem.

jimmy Zhang
2008-07-23 23:33:22
XML doesn't have a performance problem;
XML parsers have performance problems.
Newer parsers such as VTD-XML are dramatically better.


I wrote an article about this called "The Performance Woe of Binary XML".

M.
2008-07-25 13:00:10
I'd point the author of the article to:
http://en.wikipedia.org/wiki/Fallacies_of_Distributed_Computing


They should all be taken seriously, and this one is a case in point:


3) Bandwidth is infinite.


Well, it really, really isn't.
Whoever has tried to build a large-scale server system, even one orders of magnitude simpler than what Google runs, will soon find that out for themselves.
