# After -Ofun: Thoughts on Code Optimization, Part 2

by Geoff Broadwell

**Related link:** http://www.oreillynet.com/pub/wlg/8097

Last week I talked about how to determine *how* an application performs in different scenarios, and generally *where* the code has bottlenecks. It's time to talk about *why* the code might be slow.

There are a great many particular reasons why code could be performing slowly, so I'll start off by painting some broad strokes and go for more detail later on. Let's break these issues into some big categories:

- The application is attempting to solve a more difficult problem than the one the user actually has.
- The code does more work than is necessary to solve the problem it was designed for.
- The code is great in theory, but runs slowly on real hardware.

That first one may seem silly, but I'd warrant it's one of the biggest causes of performance complaints. The mistake could be as simple as running a complex statistical analysis program on a huge dataset, when the user only needs to know how many data points lie within a certain range. Or perhaps the application computes all possible answers to a problem, when knowing just one answer is enough.

Or it could be a more subtle issue, like requiring an exact answer when an approximation will easily do. There are many problem spaces for which the only known ways to calculate an exact answer are vastly slower than making a very good guess. Modern computer graphics would not even exist if not for the fact that, most of the time, gross approximations are just fine. There are even problems for which it is infeasible or impossible to calculate an exact answer at all; the only decision to make is how loose an approximation you can accept.

In other words, before looking at the code at all, think about what you are really trying to achieve. Does the code try to compute an exact answer to a difficult problem when all you really have to determine is "Yes, no, or maybe?", or "More than a few?", or a similar fuzzy result? Does the application compute many different things, when only a few are actually necessary?

For example, let's say you need to determine which hours of the week your web server is most heavily loaded, and how much difference there is between the heaviest and lightest loads. You could analyze all of the log files, counting hits per second, bytes sent per connection, and so on. Unfortunately, for a heavily loaded site this would require a lot of work, perhaps taking an hour to churn through all the log data. This makes a casual query turn into a huge production, and leads to silly behavior like running a whole slew of similar reports every night, just in case anyone wants to know what happened the previous day without waiting an hour or two to get an answer to a simple question.

Alternatively, if your web server rotates logs every hour, you could *just compare the log file sizes.* A small script could even chart a rough answer before the user could take his finger off the enter key. Sure, the answers this method gives will only be an approximation. But if you really only need to know whether your traffic is fairly smooth or spikes up a lot on Monday mornings, that's good enough. Better yet, what was once a production is now something that anyone could find out on a whim. Want to know whether that big breaking news story brought in heavier traffic today? You can have the answer *now*, rather than tomorrow morning when the standard reports are delivered.
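
As a rough sketch of what such a script might look like, here are a few lines of Python. The log directory, the hourly rotation, and the `access.log.YYYY-MM-DD-HH` naming scheme are all assumptions made up for illustration; adjust them to whatever your server actually does.

```python
import glob
import os
from collections import defaultdict
from datetime import datetime

# Hypothetical layout: one log file per hour, named access.log.YYYY-MM-DD-HH,
# all living in /var/log/httpd. Change the pattern to match your rotation scheme.
LOG_GLOB = "/var/log/httpd/access.log.*"

bytes_by_hour = defaultdict(int)          # (weekday name, hour) -> total log bytes

for path in glob.glob(LOG_GLOB):
    stamp = path.rsplit("access.log.", 1)[-1]        # e.g. "2005-09-12-09"
    when = datetime.strptime(stamp, "%Y-%m-%d-%H")
    # The file size stands in for traffic volume; we never parse the log itself.
    bytes_by_hour[(when.strftime("%A"), when.hour)] += os.path.getsize(path)

# Crude chart: one line per hour-of-week bucket, busiest first.
for (day, hour), size in sorted(bytes_by_hour.items(), key=lambda kv: kv[1], reverse=True):
    bar = "#" * min(60, size // (1 << 20))           # one '#' per megabyte, capped
    print(f"{day:<9} {hour:02d}:00  {bar}  ({size:,} bytes)")
```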

If you can make this kind of qualitative change in your solution, do so. You'll get more of a performance boost (and happier users) this way than through anything else I'll cover. Go ahead, spend some time on this. I'll wait.

At this point I'll assume you've either determined that your solution really is a good match to the problem, or have already gone back and simplified your solution, but still need to do better. Next up is to check whether the code does more work than needed for the chosen solution. The big culprits here are poor choice of algorithm and poor use of decent algorithms.

Let's get the "poor use of decent algorithms" case out of the way first.

Even good algorithms can perform poorly if misused, especially in

combination. Chances are pretty decent for example that grepping a

few items out of a big list and then sorting the results will be faster

than sorting the big list first and then grepping. It may not always

be obvious that this is happening; sometimes the various steps are in

widely separated code written by different teams, or even in different

applications. This is particularly a problem when modules do a mountain

of work implicitly, either to "save the user the trouble", or because

the module requires specific properties of its inputs (such as being

sorted), and would rather enforce those properties than merely detect

whether they have been violated.
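
As a toy illustration of that point, here is a hypothetical Python comparison (the data and the filter are invented): doing the "grep" first means the expensive sort only sees a few hundred survivors instead of millions of items.

```python
import random
import time

records = [random.randrange(1_000_000) for _ in range(2_000_000)]

def wanted(r):
    return r % 10_000 == 0        # the "grep": keeps only about 0.01% of the items

start = time.perf_counter()
slow = [r for r in sorted(records) if wanted(r)]     # sort everything, then filter
print(f"sort-then-filter: {time.perf_counter() - start:.2f} s")

start = time.perf_counter()
fast = sorted(r for r in records if wanted(r))       # filter first, sort the survivors
print(f"filter-then-sort: {time.perf_counter() - start:.2f} s")

assert slow == fast     # same answer either way, very different amounts of work
```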

A system administrator friend likes to tell the tale of a particular script that was the subject of many performance complaints. After a bit of research, he determined that the performance of the script was completely dominated by one command pipe. It turned out that the commands in that pipe were implicitly sorting a large data set *twice*, then explicitly sorting it *again*, and then finally grepping the result for just four or five desired items. Even the best algorithms can't prevent this from being slow. Happily, simply reordering the operations and turning off the implicit sorts turned a horrendously slow script into a decently fast one.

Once this kind of misbehavior has been rooted out, it's time to look at the algorithms used. Most problems can be solved a great many different ways, each with its own performance characteristics. For any given problem (sorting, say), each available algorithm has certain strengths and weaknesses. For example, Quicksort does not need much extra memory to operate, and usually performs quite well on random inputs. Unfortunately, it's an unstable sort, can degenerate badly in particular cases, and isn't well suited to sorting data in any form other than an in-memory random-access array. Merge sort, on the other hand, usually requires more memory and is a bit slower than Quicksort on random inputs, but it's a stable sort, won't degenerate, and works well with sequential-access data.
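
Purely for illustration, here is a bare-bones merge sort in Python. In real code you would just reach for the language's built-in stable sort, but the sketch shows where merge sort's stability and its extra memory come from:

```python
def merge_sort(items):
    """Return a new sorted list; equal elements keep their original relative order."""
    if len(items) <= 1:
        return list(items)
    mid = len(items) // 2
    left = merge_sort(items[:mid])        # the copied halves are the "extra memory"
    right = merge_sort(items[mid:])

    # Merge the two sorted halves. Preferring the left element on ties (<=)
    # is exactly what makes the sort stable.
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged

print(merge_sort([5, 2, 8, 2, 9, 1]))     # [1, 2, 2, 5, 8, 9]
```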

Unfortunately, there are also algorithms in common use that are weak from just about every angle, except that they've been around a long time or are exceptionally easy to understand. In either case, these poor algorithms still get used because they were the first thing that came to the implementor's mind.

The next step, then, is to determine whether the algorithm the program uses is the best one for the situation. For this there is no substitute for knowledge. The more algorithms you are aware of, the better the chances that you will be able to determine a more suitable choice, especially if you know the general performance features of each approach.

I'm not, of course, suggesting that you memorize your copy of Donald E. Knuth's *The Art of Computer Programming*. The detail you'll find in those volumes could easily make you an expert in the analysis of algorithms, but that's probably another case of trying to solve a much larger problem than the one you actually face. For most purposes, a book such as Robert Sedgewick's *Algorithms in C* is probably a better choice, assuming it hasn't changed its style too much since my dog-eared first edition came out. That book quickly became one of my favorites because for every algorithm it laid out in plain wording several key things: what its strengths and weaknesses were, how it actually worked, and the standard optimizations to its basic design.

In some cases, an online reference such as Wikipedia can give you an idea of your options, and in the best cases will even give you a performance overview for various algorithms; more specialized sites such as the NIST Dictionary of Algorithms and Data Structures can point you to various implementations.

To understand these resources, you'll need to understand big-O (asymptotic performance) notation. Put simply, big-O notation tells you broadly how an algorithm performs on large input sizes, generally with a simple expression in one or more variables that define a characteristic size for the problem. Typical examples are `O(n³)`, `O(n log n)`, and `O(m + n)`. By choosing suitable values for the variables, you can compare the performance of several algorithms in various situations.

There are two big gotchas, however. First, big-O notation ignores constants; `O(n)` might seem to be always better than `O(n²)`, but if the actual run times are `n` seconds and `n²` *micro*seconds, then the better choice depends on how big `n` is. This is particularly an issue for algorithms that have the same big-O performance; in that case, constants make all the difference no matter how big the input. Second, big-O notation only gives information about asymptotic (large input) performance. For small inputs, you can pretty much ignore big-O comparison, as startup overhead and other details will often completely dominate runtime performance.
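
To make the first gotcha concrete, here is a quick back-of-the-envelope check of that `n` seconds versus `n²` microseconds example; the two cost functions below are just those made-up numbers, not measurements of anything real:

```python
def linear_cost(n):
    return n * 1.0                 # the O(n) algorithm: n seconds

def quadratic_cost(n):
    return n * n * 1e-6            # the O(n^2) algorithm: n^2 microseconds, in seconds

for n in (10, 1_000, 100_000, 1_000_000, 10_000_000):
    print(f"n = {n:>10,}:  O(n) = {linear_cost(n):>14,.4f} s   "
          f"O(n^2) = {quadratic_cost(n):>14,.4f} s")

# The "worse" quadratic algorithm wins for every n below one million;
# the curves cross at n = 1,000,000, and only past that does O(n) pull ahead.
```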

Even with a good understanding of the use and limitations of the notation, you'll find that most of the time algorithms are discussed in a fairly general sense, and often without reference to special situations that may make one algorithm particularly appropriate. For example, efficient sorting on a specialized, highly parallel GPU requires a different algorithm than efficient sorting on a less parallel but more general CPU. The unfortunate truth is that choosing a good algorithm requires both general knowledge of available algorithms, and a good understanding of how these are applied to your problem space, whether it be data analysis, computer graphics, physical simulation, or what have you.

At this point, I could just suggest going off to read yet more books and articles. You'd certainly learn a lot, and may even be able to synthesize a few patterns out of the morass of data. Before you do that, there's that last big group of performance problems: things that make code that's great in theory perform horribly on real hardware. Understanding those issues will make a world of difference in how you approach all that reading, so they're my subject for next week.

*What's the worst misuse of otherwise good code that you've seen?*