An unreliable ws-reliability spec
Here are the specific issues brought up by the posting and its predecessor, along with my responses -
Issues from http://radio.weblogs.com/0108971/2003/01/15.html -
First, let me say thank you for pointing out these issues. Many of them were
discussed while the spec was first being drafted, and we decided it would be better to defer
them until the OASIS TC was formed, which has now happened. Although the initial spec
was posted as a group of vendors collaborating on a specification, our goal all along has been
to use the initial spec as input to the formation of a WG or TC. We didn't want to go down
too many implementation detail paths, particularly when it comes to things like inherent
requirements on the underlying infrastructure. We also didn't want to go too far down a
"proprietary" path as a rogue gang of vendors, without bringing it to a broader forum such as
an OASIS TC. We look forward to ironing these types of issues out with the other WS-RM TC
members. That being said, let me see if I can address your points specifically -
- "Because WS-Reliability is unaware of and not integrated with WS-Routing, it is only useful
as a point to point mechanism. While routing from the sender to the receiver will likely be
possible, the "ReplyTo" to send the acknowledgement message to does specify a plain URL and
doesn't allow integration with a reverse path as per WS-Routing. This means that unless the
ACK message can be piggybacked on a synchronous response (the luckiest of all circumstances),
the spec requires either direct connectivity from the receiver back to the sender, which may
be impossible due to firewalls and NAT, or requires some form of acknowledgement message
dispatcher gateway at the sender's site, which requires some form of central service
deployment as well. In short: This doesn't really work for a desktop PC wishing to reliably
deliver a message to an external service from within the corporate firewall."
Good issue. We actually had many discussions and early versions of the spec that had
attempted to address multi-hop, and perhaps even WS-Routing. Multi-hop issues in general are
being discussed in other work groups like XMLP (SOAP 1.2), WS-Architecture, and WS-I. We look
forward to converging with those discussions to make sure we are in step and doing the right
thing. There is also a bigger issue with WS-Routing in particular in that it is thus far a
Another point is that the growing trend in the industry for supporting asynchronous
messaging-style web services communication for interactions within and across the extended
enterprise is going to mean that most organizations will host asynchronous listeners anyhow.
WS-Reliability is not driving the charge there; it's already happening. I agree that some
sort of routing or dispatching is still needed to get back to the desktop PC.
That's a good issue to flesh out in the TC.
- "There's quite a few problems to be solved with regards to simple sequence numbers and
resends of an unaltered, carbon-copy (2.2.2) of the original message considering the accuracy
of message timestamps, digital signatures, context coordination and techniques to avoid replay
attacks. Sending the exact same message may be entirely impossible, even if it couldn't be
delivered properly and therefore the "MUST" requirement of 2.2.2 cannot be fulfilled. Also, in
2.2.2 there's a reference to a "specified number of resend attempts" -- who specifies them? "
For these reasons, we chose the message id as the thing that determines whether a message is
a duplicate. The specified number of resend attempts is intended to be a configurable
option, but falls under the category of a requirement on the underlying infrastructure, which
is yet to be specified.
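As a rough illustration of the two ideas above (duplicate detection keyed on the message id, and a configurable resend limit), here is a minimal Python sketch. It is not taken from the spec: the names Receiver, send_reliably, make_lossy_transport, and max_resends are all hypothetical, and the "transport" is just an in-process function call standing in for a real network.

```python
class Receiver:
    """Hypothetical receiver that eliminates duplicates by message id only."""

    def __init__(self):
        self.seen_ids = set()
        self.delivered = []

    def receive(self, message_id, body):
        # A resend is recognized by its id, so the body does not have to be
        # byte-identical to the first attempt.
        if message_id not in self.seen_ids:
            self.seen_ids.add(message_id)
            self.delivered.append(body)
        # Acknowledge duplicates too, so the sender can stop resending.
        return ("ack", message_id)


def send_reliably(transport, message_id, body, max_resends=3):
    """Send once, then resend up to max_resends times. True if acknowledged."""
    for _attempt in range(1 + max_resends):
        if transport(message_id, body) == ("ack", message_id):
            return True
    return False


def make_lossy_transport(receiver, drop_first_n_acks):
    """Assumed test transport: delivers every message but loses the first N acks."""
    state = {"dropped": 0}

    def transport(message_id, body):
        reply = receiver.receive(message_id, body)
        if state["dropped"] < drop_first_n_acks:
            state["dropped"] += 1
            return None  # the acknowledgement was lost on the way back
        return reply

    return transport
```

With two lost acknowledgements and max_resends=3, the sender ends up sending the message three times, yet the receiver hands it to the application only once. Where max_resends would actually be configured is exactly the infrastructure question the spec leaves open.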
- "The spec rightfully calls for persistent storage of messages (2.2.3), but doesn't spell out
rules for when messages must be written to persistent storage in the process (it should
obviously before sending and after receiving, but before acknowledgement and forward)."
I thought that section 2.2.3 was pretty clear about it. I will make a note of that as an item
of discussion in the TC.
- "What I find also very noteworthy is that the authors say that they have yet to address
synchronization between sender and receiver and establishing a common understanding by sender
and receiver about whether the message was properly delivered (meaning that the send/ack cycle
was fully completed). I assume that once they do so, they'll throw the synchronous,
piggybacked reply on top of HTTP out of the window, because this creates an in-doubt situation
for the acknowledging party. "
That situation is currently addressed by message redelivery on the sender side and duplicate
elimination on the receiver side. We will make a note to revisit this in the TC discussions.
Now that we have formed an OASIS TC, you have a public place to have these discussions. Feel
free to post your feedback to firstname.lastname@example.org.
Issues from http://weblogs.cs.cornell.edu/AllThingsDistributed/archives/000013.html
- "The requirement that messages need to be persisted has not been thought through well enough
(as Clemens already hinted at). The operation on the sender side seems obvious, when you
recover you try to get acknowledgements for those message you think you have sent, but may
have gotten lost in the crash. However at the receiver this is less obvious. What does it mean
to have delivered the message to the application successfully? Can you be sure about the point
of the possible crash? Can you be sure never to deliver duplicate messages to the application
during recovery? Does the app also needs to handle duplicates? There are no conditions
specified for how to remove received messages from the persistent store at the receiver. "
Issues 3 and 4 in Appendix 2 are general statements that we need to further refine the semantics
of failure and recovery. Many of us in the TC have very strong experience in enterprise
messaging and are very capable of figuring this stuff out.
- "What are exactly the semantics of an acknowledgement? Does this means the message was
stored in persistent storage? Or that it was successfully delivered to the application? "
My view of it is that the message can be considered acknowledgeable once it has been safely
persisted. A failure to deliver it to the application can be handled by the notion of a
centralized fault location, or dead message queue, as noted in Appendix 2, section 3.
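To make that view concrete, here is a small Python sketch of "persist first, then acknowledge", with failed deliveries parked in a dead message queue. It is only one possible reading, not something the spec defines: PersistingReceiver, deliver_pending, and app_handler are hypothetical names, and a plain dict stands in for real durable storage.

```python
class PersistingReceiver:
    """Hypothetical receiver: ack after persisting, park failures separately."""

    def __init__(self, app_handler):
        self.store = {}          # stand-in for durable storage (assumption)
        self.dead_messages = []  # stand-in for the dead message queue
        self.app_handler = app_handler

    def receive(self, message_id, body):
        # Persist before acknowledging; a duplicate id keeps the first copy.
        self.store.setdefault(message_id, body)
        return ("ack", message_id)

    def deliver_pending(self):
        # Hand each stored message to the application. A message that the
        # application rejects goes to the dead message queue, not the floor.
        for message_id, body in list(self.store.items()):
            try:
                self.app_handler(body)
            except Exception as exc:
                self.dead_messages.append((message_id, body, str(exc)))
            # Remove only after the handler ran or the message was parked.
            del self.store[message_id]
```

In this reading the acknowledgement says only "safely stored", so the sender never learns whether the application accepted the message. That is the trade-off the dead message queue is meant to cover.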
- "What does time-to-live really mean in case of persistent storing your received messages. I
can send an ack telling the sender I received the message, then I get delayed for some reason
(maybe a crash) and when I want to deliver the message I notice that its time has expired .
According to the current spec I cannot deliver this message and have to drop it. Hence the
message transport becomes unreliable. "
This is also addressed by Appendix 2, section 3. I look forward to other alternatives, which
can be discussed in the WSRM OASIS forum.
- "The requirement to send a simple ack immediately for each message will introduce a real
mess. The scenario in which a message gets lost and a subsequent message is received, will
trigger an ack for this new message making the sender believe that it is reliably received.
However the receiver cannot deliver the message to the app until it has received the
retransmission of the missing message. This can cause unreliable behavior because you may have
to drop the message if there is a ttl field, or if the sender crashes before it could
retransmit the missing message, the sender gets stuck with the message it has received for
ever without being able to deliver. The solution here should have been to do a delayed ack or
send a negative ack, allowing the receiver to treat the new message as volatile until the
retransmission gap has been filled. "
This is recognized by section 6 in Appendix 2.
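For readers who want to picture the delayed-ack idea from the quoted comment, here is a short Python sketch of one way it could work. It is an assumption for illustration, not what the spec or Appendix 2 prescribes. The receiver holds out-of-order messages as volatile and acknowledges only the highest contiguous sequence number. InOrderReceiver and its fields are hypothetical names.

```python
class InOrderReceiver:
    """Hypothetical receiver that acknowledges only gap-free sequences."""

    def __init__(self):
        self.next_expected = 1
        self.held = {}       # out-of-order messages, treated as volatile
        self.delivered = []

    def receive(self, seq, body):
        """Store the message and return the highest acknowledgeable sequence."""
        if seq >= self.next_expected:
            self.held.setdefault(seq, body)
        # Deliver every message that has become contiguous.
        while self.next_expected in self.held:
            self.delivered.append(self.held.pop(self.next_expected))
            self.next_expected += 1
        return self.next_expected - 1
```

If message 2 is lost and message 3 arrives, receive(3, ...) still returns 1. The sender therefore knows that 2 and 3 are not yet reliably received and can retransmit instead of assuming success.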