What's the worst bug you've ever written?

by Andy Lester

What's the worst bug you've ever written? Or if you're a sysadmin, what's the worst system problem you've allowed to happen?

It sounds like a starting point for war stories over beers at OSCON, but it can tell you a lot about yourself. Analyzing mistakes is a crucial skill for everyone, but especially for programmers, whose mistakes can be so devastating and yet so easily avoided in the future.

I ask this question of every programmer I interview (and I'm still looking for programmers, by the way) to get an idea of the candidate's self-awareness, and to get a feel for her background. If she can't think of one, then I know she hasn't been around very much.

For me, it was an Exchange conversion project...

I had just started at a new company and was eager to show my chops. My department was in charge of the company mail servers, and they were upgrading from one version of Exchange to another. For some reason, the conversion process would not bring over mailboxes from the old version of Exchange to the new one. Dozens of users had hundreds of mailboxes each, containing correspondence with customers.

Hotshot Andy spent a day or two using one of the Perl Win32 modules, calling Outlook OLE objects to read in data from one Exchange instance and write it to the new instance. My program would log in, suck up the mailboxes, log in on the new system, and create new mailboxes. The messages converted fine, and I had many safeguards to make sure that message counts before and after were the same. It worked beautifully.

We spent a weekend migrating over the data, as well as all Outlook clients, and Monday morning brought no complaints. We were all pleased with how well it all went. Around Thursday, well after the point of no return, the complaints hit.

It turns out that these mailboxes were organized into folders, and my program hadn't taken that into account. All user mailboxes were in the top level of the hierarchy. All organization was lost. Worse, they couldn't get back to the old instance of Exchange to see how things had been organized. We couldn't even do the grunt work of recreating the hierarchy because we weren't familiar with the data.

Then, after the candidate tells me the story of the terrible bug, I ask the crucial follow-up: "What did you learn? What did you change about yourself?" The reaction is often telling, and I can easily see how self-aware she is. If the answer is "We fixed the bug and had to do some cleanup," then I know nothing's been learned. If she comes back with an "I'll tell you one thing: I made sure that my X always...", I know that she's a self-optimizing person.

In my case...

The crucial error was making assumptions about the data. I had created my own dummy mailboxes, with my own dummy data in them, rather than using real live data. If I'd looked at live data, rather than assuming that I knew what it would look like, the hierarchy would have been immediately clear. "Always look at real data early in the project" is a long-standing maxim of mine.
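To make the gap concrete, here is a minimal sketch in Python (not the original Perl program) of why a message-count safeguard can pass while the folder structure is lost. The data shape, a list of (folder path, message id) pairs, and every function name below are assumptions invented for illustration; nothing here touches Exchange or Outlook.

```python
from collections import Counter

def total_count(messages):
    """Count messages regardless of where they are filed."""
    return len(messages)

def per_folder_counts(messages):
    """Count messages per folder path, so structure is part of the check."""
    return Counter(folder for folder, _message_id in messages)

def migration_report(source, target):
    """Compare a source and a target mailbox (both hypothetical lists of pairs)."""
    src = per_folder_counts(source)
    dst = per_folder_counts(target)
    return {
        "totals_match": total_count(source) == total_count(target),
        "folders_match": src == dst,
        "missing_folders": sorted(set(src) - set(dst)),
    }

# Hypothetical mailbox: three messages filed under two customer folders.
source = [
    ("Customers/Acme", 1),
    ("Customers/Acme", 2),
    ("Customers/Initech", 3),
]

# The same messages after a conversion that drops everything at the top level.
flattened = [("", 1), ("", 2), ("", 3)]

report = migration_report(source, flattened)
print(report)
```

In this sketch the totals agree, so a count-only safeguard reports success, while the per-folder comparison flags both customer folders as missing. A check like this is only as good as the test data behind it: with dummy mailboxes that contain no folders, it would pass too.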

Think about it over the next few days. What's your biggest programming mistake? Did you change anything? Or maybe you find you've over-compensated, and are overly cautious? How do you optimize yourself?

What's your worst bug? What did you change? How do you make sure you're improving?


2005-05-24 20:16:28
SQL Server DTS blunder!
I once wrote a DTS (data transformation) package for SQL Server that was to help us push an updated set of selected tables to client databases. When I ran it on our company's test environment, everything seemed to check out ok. However, because our test environment was composed of two databases from the same client (configured slightly differently, but still. . . ) I didn't notice the ill effects until we actually ran the export live at a client site. . .

Apparently, when I created the export, I neglected to uncheck the box that told SQL Server to copy not only the selected objects, but also all objects that depend on them (and the objects that my selected tables depended on, too). Imagine my surprise when we ran this to export data to the first client site and discovered that much of their live data had been replaced by our test data! The sinking feeling in the pit of my stomach is still indescribable to this day. Had I tried it between two different databases, I'd have caught that little blunder right away. Thankfully, they had a good backup from the previous day, and we were able to put all right in short order.

So what did I learn? 1) Never, EVER blindly accept default options. Always know what they mean, and always know what the default options will be; 2) always test your programs on as many different datasets as it takes to test all the different behaviors/nuances of your programs; and 3) always make sure your intended victim has a good set of backups before they try something new. I got lucky in my situation; I imagine others have been burned by that default setting.

Great food for thought, Andy. Thanks!