The Case Of The Mischievous Perl Script

by Chris Josephes

Parody. Based on a true story.

It was a lazy, summer Saturday morning, just like any other. From my vantage point, I could see kids riding their bikes and playing street hockey. I sat in the middle of an empty garage with my feet propped up on a table, reading a book on how to lift fingerprints from a CVS repository. Saturdays were good days for the detective business, so I knew it wouldn't be long before customers came in with their problems.

Less than an hour after opening up, Kip, Sally, and Tom walked in. It was Kip that spoke first. "We want to hire you," he said. He backed up his statement by dropping twenty-five cents into the jar sitting on the table. I put down my book and assessed my clients.

Kip was a good perl programmer; always keeping his parents happy with his ability to quickly write scripts to handle any problem. Sally was the DBA, who held the record for winning the Oracle performance competition at the county fair. Tom was the new systems administrator, who was known for carrying his pet lizard (Suse) with him everywhere he went. All three of them were good at their job.

"I have a perl script that's taking way too long," Kip said while dropping a few pages of source code on the table. "It's a looping function that impacts the database and the filesystem."

"What exactly does the script do?" I asked.

"The script grabs a lot of records through a SELECT call; each record contains information on a HTML file on the disk. For every record, I run a loop that stats the file, reads the first few lines from the file, makes an UPDATE call to the database with new information, and rests before moving on to the next record."

Tom jumped in, "The job is scheduled through cron to run every six hours: 6am, noon, 6pm, and midnight. But when I look at the process table, there's multiple instances running. The jobs are taking too long and not finishing within six hours."

Sally added her two cents, "The updates don't seem to have a serious impact against the Database. The table itself is only sixty thousand rows, and it's fully indexed." Sally brought me graphs generated from an Oracle management program.

"And the filesystem is fine, too." Tom said. "The average number of disk requests remains constant. There's no iowait or contention." To make his point, he placed a stack of SAR reports on my desk.

I looked at the data sitting on my desk and thought about it for a second.

"Gentlemen,... and Sally," I said. "This isn't a problem with code, and it's not a problem system performance. You could say that the problem lies with not understanding the very nature of the planet."

All three of them looked at me, perplexed. "But you didn't even look at my source code!" Kip exclaimed.

"I don't need to," I replied. "You already gave me everything I need to know." I reached into my baseball mitt and grabbed my iPhone. "In five minutes I can have a DNS administrator over here, and he can give you an important fact that you all missed."

WHAT DID I KNOW THAT THEY DID NOT? THE SOLUTION WILL APPEAR MONDAY, BUT FEEL FREE TO THROW IN YOUR COMMENTS.

Update: The solution has been posted.

19 Comments

FĂ©lim Whiteley
2007-10-19 08:27:22
My brain is too Friday-Frazzled but I'll take a stab at them not having localhost reverse mapped to 127.0.0.1


Chris Josephes
2007-10-19 08:57:46
Good effort, but the problem does not actually lie with a DNS configuration.
John Q Fizzle
2007-10-19 09:58:58
That a programmer, sys admin and DB administrator never thought to look at the network bottlenecks possibly present?
Chris Josephes
2007-10-19 11:07:19
"Well, what if it's the network," Sally asked aloud?


"No, it's not the network."


The voice came from outside the garage. Behind Kip, Sally, and Tom was Douglas, the old network engineer. He came prepared with a stack of MRTG graph printouts, and walked up to the desk.


"You're all working with a full duplex gigabit backplane, and none of you are throttling your individual interfaces. And the error counters aren't incrementing either, so I doubt that there's a physical wiring problem."


"I know Douglas," I said reassuringly. "The problem has nothing to do with network latency, although latency should be a key word here."

Pukku
2007-10-19 11:49:25
How long does the perl script wait between calls to the database? Even if it only waits one second between calls, 60000 records will need something on the order of (if I've done the math correctly) 16 hours to iterate through...
Chris Josephes
2007-10-19 12:45:06
60000 records will need something on the order of....


You're on the right track. Stay Tuned.

Jim
2007-10-19 18:54:55
Chris,


I don't know the answer, but I would read a whole book of these. Very entertaining and I can't wait to hear the answer.


I think you are onto something.


Keep it up.

Aristotle Pagaltzis
2007-10-19 22:09:09
Are the files and database on the same disk? What is the hit rate on the filesystem cache?
Chris Josephes
2007-10-20 14:09:28
"It would most likely have to be a disk issue," Douglas suggested. "Disk is always going to be your primary point of contention, because it depends on the physical mechanics involved in reading a spinning platter."


"Douglas, I never once criticized your network," came another voice from outside. "So don't you dare complain about my disk configurations."


We all turned around to see Velma, the system architect. She held a soccer ball, suggesting that she had just finished playing a game. "I couldn't help but overhear your problem, so I figured I'd see what's going on."


Velma had originally designed the system everyone was currently concerned about. Once it was built, she gave Tom the duty of daily administration.


"The application is incredibly slow," I said.


"It can't be the disk," Velma said. "The database server and the file server are on different hosts. She took a piece of chalk and drew two servers on the cement floor. "The application runs here, and the database server is here."


Everyone looked at the chalk diagram, hoping to gain some additional insight.


"So, it's safe to say," I said, "That there are no shared resources between the database, and the file server."


"Correct," Velma said. "But it sounds like you don't even suspect the disk performance to be a problem."


"No, " I said. "I didn't suspect disk IO in this particular case. It was something Kip said that made me suspicious."

hiren
2007-10-22 01:27:13
there is a perl module out there for benchmarking different sections of code right? the offending pieces can be found using something of that sort, and based on what the offending piece is doing, the problem can be made much more guessable no?
never
2007-10-22 03:42:29
is the first name server listed in /etc/resolv.conf down/not responding/slow? For each iteration of html records the resolver will need to lookup the A record for the database server. gethostbyname() doesn't do any caching AFAIK. Solutions: remove the faulty name server, put ip addr in /etc/hosts for the db server, put a static IP override in tnsnames.ora (oracle), make the db connection persistent in the script, change the loop so that all updates are saved in a hash/array (depending on memory) and do one transaction at the end.

2007-10-22 05:47:57
and rests???
chrisp
2007-10-22 08:40:43
The script is single threaded so it waits for the database, the disks, the network etc to give it the next bit of data to work on. Its not a single problem with one of them, everthings working fine but the laws of physics say each stage will take a non-zero amount of time to work, and this adds up to more then the 0.36 seconds per record limit. Using multiple threads so the script can get on with the next record while it waits for data is the solution.


DNS admins know the way to deal with a problem is to parallelise it so each worker (dns server or thread) deals with its own lttle bit of the problem regardless of latency in any single task.

Jose Tamayo
2007-10-22 08:49:43
Maybe the file server, which connects to the database server, isn't correctly mapped on DNS. That is, the file server has an A record on the domain.com realm, but his IP address doesn't have a PTR record that points to fileserver.domain.com. This sometimes caused delays (waiting for timeouts) when an IP address could not be resolved to a host.
ikegami
2007-10-22 11:21:32
60_000 rows, so 60_000/(6*60*60)=2.778 seconds allocated per file, yet Tim said the script rests between records. How long does it rest? One can only requests sleeps multiple of one second (using the core function sleep), and the program isn't even guaranteed to be awoken after after exactly one second.
ikegami
2007-10-22 11:42:54
doh! The solution was already posted!
Chris Josephes
2007-10-22 12:15:27
Ikegami,


Yep, solution was posted about an hour ago.


http://www.oreillynet.com/sysadmin/blog/2007/10/the_solution_to_the_case_of_th.html


You and Pukku were on the right track.

Andy Lester
2007-10-24 06:39:11
Don't you have an empty gas can in your office on which the quarter gets slapped down?
Chris Josephes
2007-10-25 05:46:50
Don't you have an empty gas can in your office on which the quarter gets slapped down?


Actually, it's an old mason jar.