Error-full systems emerge from single-strategy maintenance regimes

by Rick Jelliffe

When you run the same process over a few years, its particular shortcomings emerge and can dominate. For example, Joel Spolsky claimed that Microsoft had an economic criterion for fixing bugs: they only fix a bug if it costs them more (e.g. in sales) to have it than to fix it. For a monopoly in a growing market there is no loss of sales from bugs, and for a near monopoly with free alternatives, the funds you lose from an application sale may to some extent be spent on another purchase from you anyway: we spend less on Office but that gives us more to spend on Vista, for example.

I don't know if Joel was correct in his assessment, or whether Microsoft have a different strategy now. But clearly the mid-term impact of such a strategy would be a buggy code base, with entrenched workarounds, combinatorial explosions of symptoms that prevent diagnosis, and an inadequate foundation to prevent major errors. Not to mention a sudden exposure to loss of market share when the market gets saturated and stops growing: when a sucker isn't born every minute.

Sun's Java effort has been suffering similarly of late: they have a nice-looking error process based on people voting for errors as critical. Now whether Sun actually use this list to determine which bugs they fix first, or whether they use the vote to justify ignoring bugs that they are not interested in, the result is probably the same: a system with lots of known bugs.

There are lots of other single-strategy methodologies: risk-based analysis, ISO 9126 software quality analysis, weighting bugs against their depth in the call stack so that library bugs are fixed at high priority, metrics, test-driven programming, and so on. I don't know why we should have any confidence that any of them will necessarily not, over time, systematically fail to address some kinds of errors. Which will bite us.

So is a better approach to just fix bugs randomly? Pick a bug from a hat? Well, maybe....

Perhaps we should say each maintenance methodology applied singly over time will result in an accumulation of unaddressed errors in some aspect.

Part of the problem is human: people have interests and pressures and viewpoints. So democracies solve this by what Lee Teng-Hui (the Taiwanese president who secretly funded the opposition parties) called "the regular alternation of power": term limits, shifting jobs, even sabbaticals.

Part of the problem, as I see it, is with simple prioritization of bugs. Sometimes it is better to see each module as a whole, allocate quality requirements for that module, and then handle each bug according to its module priority. For example, Sun could say "we don't treat text.html as a priority module but we do treat 3D rendering as a priority". Apply this to voting, and then two votes for an HTML bug would be required to equal one vote for a 3D bug.

That is a more complex strategy, to be sure, but it is still a single strategy.
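As a rough illustration of the module-weighted voting idea, here is a minimal Python sketch. The `Bug` record, the module names and the weights are all made up for the example; nothing here reflects how Sun's bug database actually works. The only thing taken from the text is the ratio: two votes for an HTML bug equal one vote for a 3D bug.

```python
from dataclasses import dataclass

# Hypothetical per-module weights (an assumption for illustration only):
# a vote on a 3D rendering bug counts twice as much as a vote on a
# text.html bug.
MODULE_WEIGHTS = {"3d": 1.0, "text.html": 0.5}


@dataclass
class Bug:
    bug_id: str
    module: str
    votes: int


def weighted_score(bug, weights=MODULE_WEIGHTS, default_weight=1.0):
    """Scale the raw vote count by the priority weight of the bug's module."""
    return bug.votes * weights.get(bug.module, default_weight)


def rank(bugs):
    """Return bugs ordered from highest to lowest weighted score."""
    return sorted(bugs, key=weighted_score, reverse=True)
```

With these weights, an HTML bug with three votes still ranks below a 3D bug with two votes, which is exactly the kind of built-in bias that makes this a single strategy.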

A better way of doing things may be to divide the debugging/maintenance/natural enhancement effort into independent efforts. For example, have main stream process use immediate rational economic effect, risk or deadline criteria. But also have a background effort that alternates between different strategies: systematic audits for internationalization, performance, standards-compliance, transparency, integrity, resource utilization, and other quality concerns. And also have a background effort that uses weighted voting and different criteria that accepts minor Requests For Enhancement as well as bugs.

And even, for one in a hundred bug fixes, do pick a bug out of the hat, on the grounds that you don't have 100% confidence that even the multi-criteria maintenance will prevent the emergence of a nasty clump of errors in some aspect. Shake it up.
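To make the mixed regime concrete, here is a small Python sketch under stated assumptions. Bugs are plain dicts with hypothetical fields (`cost_of_leaving`, `cost_of_fixing`, `votes`, `area`), the strategies are simple scoring functions, and the selector just rotates through them, handing every hundredth fix to the hat. A real process would split effort between the main stream and the background streams far less evenly; this only shows the shape of the idea.

```python
import random


def economic(bug):
    """Main-stream criterion: net benefit of fixing (hypothetical fields)."""
    return bug["cost_of_leaving"] - bug["cost_of_fixing"]


def votes(bug):
    """Voting criterion: raw number of votes."""
    return bug["votes"]


def audit(area):
    """Build a scorer that favours bugs in the quality area under audit."""
    def score(bug):
        return 1 if bug["area"] == area else 0
    return score


def pick_next_bug(bugs, fix_number, strategies, rng=random, random_every=100):
    """Choose the next bug to fix.

    Every `random_every`-th fix is drawn out of the hat; otherwise the
    strategy is chosen by rotating through `strategies`.
    """
    if not bugs:
        return None
    if fix_number % random_every == 0:
        return rng.choice(bugs)
    strategy = strategies[fix_number % len(strategies)]
    return max(bugs, key=strategy)
```

Swapping the entries in `strategies` from month to month (a different `audit(...)` area each time, say) gives the alternation between background efforts, while the random slot covers whatever all of the criteria miss.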


Michael Champion
2006-11-01 07:53:14
"have main stream process use immediate rational economic effect, risk or deadline criteria. But also have a background effort that alternates between different strategies: systematic audits for internationalization, performance, standards-compliance, transparency, integrity, resource utilization, and other quality concerns." That's more or less how things are done at Microsoft. There are quite different processes (and sometimes different teams) for shipping and "sustained engineering". There are all sorts of oversight groups with specific charters and processes for security, localization, etc. Then there are the various customer-driven field organizations that push for fixes affecting specific customers.

All this ends up as the "pointless process paralysis" that people complain about on minimsft and invites invidious comparisons with IBM in its heyday. That's good news / bad news ... harder and less fun work than it probably was in the good ol' days of the '90s, but a lot fewer embarrassing glitches.

The idea of just randomly choosing bugs to fix is interesting. I don't think it would work for most big operations, however: the reason bugs don't get fixed is not some economic prioritization but a customer support priority: "no breaking changes unless absolutely necessary." Making some feature more conformant with a standard (or slightly better performing) at the cost of breaking thousands of applications that assume the old non-conformant behavior is not going to be appreciated by anyone except a handful of geeks.

2006-11-01 08:30:20
It sounds corny, but bug fixing *can* get old very quickly, and I definitely would advocate the sort of approach you'd take in a storefront retail business: "October is Customer Satisfaction Month! November is Performance Issue Month!" Planned out ahead of time, the month could begin with an informal discussion or specific training in a focus area. If your employees are motivated by bug-stats, in-focus bugs might count as two if closed within the prescribed time period. Shake it up indeed!

Great article, Rick!

Todd Derscheid
2006-11-01 10:45:37
Do you think companies could adopt a flipside of the Google 20% Project to this approach (instead of 20% of work time devoted to individual development of new features, 20% of work time devoted to individual bug squashing and refactoring)?

Other prioritization mistakes I see from past employers are under-valuing the pains of poorly-developed internal tools for your non-programmers versus end-user pain, and failure to draw connections between similar complaints from disparate users (e.g. complaints that interfaces aren't intuitive, alongside long training times for new employees).

Rick Jelliffe
2006-11-15 07:25:37
Michael, if you have two customers who use an API, and one realizes there is a bug and writes their application to utilize the bug, while the other writes the application thinking that they get what the documentation says, are you saying that the first user should be preferred to the second? The first user makes their bed, and they can sleep in it, to some extent; the second will be delighted at risks reducing and quality improving. Isn't what you are suggesting really that after an API has been released and used, the documentation should be corrected to reflect what the software actually does rather than what it was initially specced to do? (Like the Confucian Chinese idea of the Rectification of Names.) However, of course, I understand this is a lifecycle thing: as an API matures it solidifies (and finally fossilizes?), so bugs need to be fixed early because they can become entrenched otherwise. On the other hand, I am not sure I would want to use any product where the documentation was retroactively changed in preference to fixing genuine bugs, no matter how longstanding.