Wow, this is a really interesting blog post; I'd not thought about it this way. It's interesting how the various formats, wrappers, etc. support different ways of working.
It's been a while since I did any programming around audio (or video), but this reminded me of a blast from the past (from when I was using QT and Real stuff): something called SMIL. I only played with it a little in the early days and haven't used it recently. It now looks pretty wide-ranging, and is up to V2.0, I believe.
My point here is that maybe SMIL can act as a wrapper around the media (not just audio) to provide a standard interface. That is, one would access sample.smil rather than sample.mp3 or whatever. The .smil file holds all of the required controls: start/stop times, durations and much more.
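To make the wrapping idea a bit more concrete, here is a rough sketch in Python that writes out a minimal sample.smil around an audio file. It's only my assumption of how it might look: the build_smil helper and the file names are made up for illustration, the clipBegin/clipEnd attribute names and the namespace are the SMIL 2.0 ones as far as I remember, and I haven't checked which players would actually honour them.

```python
import xml.etree.ElementTree as ET

# SMIL 2.0 language namespace (as I recall it; check against the W3C spec).
SMIL_NS = "http://www.w3.org/2001/SMIL20/Language"


def build_smil(src, clip_begin, clip_end):
    """Return a minimal SMIL document wrapping one audio clip.

    build_smil is a hypothetical helper for this sketch, not part of any library.
    """
    smil = ET.Element("smil", {"xmlns": SMIL_NS})
    body = ET.SubElement(smil, "body")
    # The start/stop points live in the wrapper, not in the media file itself.
    ET.SubElement(
        body,
        "audio",
        {"src": src, "clipBegin": clip_begin, "clipEnd": clip_end},
    )
    return ET.tostring(smil, encoding="unicode")


# Wrap sample.mp3 so that only the section from 10s to 40s is played.
document = build_smil("sample.mp3", "10s", "40s")
print(document)
```

A player would then be pointed at the saved sample.smil rather than at sample.mp3 directly, and swapping the underlying file for another format would only mean changing the src attribute.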
Not sure if this actually helps, as I'm not entirely sure what you are after, but this kind of wrapping could certainly add value and standardise things a little.
PS: I think QT and Real support SMIL; not sure about the others.