After -Ofun: Thoughts on Code Optimization, Part 2

by Geoff Broadwell

Related link: http://www.oreillynet.com/pub/wlg/8097




Last week I talked
about how to determine how an application performs in different
scenarios, and generally where the code has bottlenecks. It's
time to talk about why the code might be slow.




There are a great many particular reasons why code could be performing
slowly, so I'll start off by painting some broad strokes and go for more
detail later on. Let's break these issues into some big categories:




  1. The application is attempting to solve a more difficult problem
    than the one the user actually has.

  2. The code does more work than is necessary to solve the problem
    it was designed for.

  3. The code is great in theory, but runs slowly on real hardware.




That first one may seem silly, but I'd warrant it's one of the biggest
causes of performance complaints. The mistake could be as simple as
running a complex statistical analysis program on a huge dataset, when
the user only needs to know how many data points lie within a certain
range. Or perhaps the application computes all possible answers to a
problem, when knowing just one answer is enough.
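
To make that first mistake concrete, here is a minimal Python sketch of
the range-counting case. The file name and one-value-per-line format
are assumptions made purely for illustration; the point is that a
single cheap pass answers the question the user actually asked.

    # Hypothetical example: count how many data points fall inside a range,
    # instead of running a full statistical analysis over the whole file.
    # The file name and one-value-per-line layout are assumptions.
    def count_in_range(path, low, high):
        count = 0
        with open(path) as f:
            for line in f:
                try:
                    value = float(line)
                except ValueError:
                    continue                # skip blank or malformed lines
                if low <= value <= high:
                    count += 1
        return count

    print(count_in_range("measurements.txt", 10.0, 20.0))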




Or it could be a more subtle issue, like requiring an exact answer when an
approximation will easily do. There are many problem spaces for which
the only known ways to calculate an exact answer are vastly slower than
making a very good guess. Modern computer graphics would not even exist
if not for the fact that, most of the time, gross approximations are
just fine. There are even problems for which it is infeasible or
impossible to calculate an exact answer at all; the only decision to
make is how loose an approximation you can accept.




In other words, before looking at the code at all, think about what you
are really trying to achieve. Does the code try to compute an exact answer
to a difficult problem when all you really have to determine is "Yes,
no, or maybe?", or "More than a few?", or a similar fuzzy result?
Does the application compute many different things, when only a few
are actually necessary?




For example, let's say you need to determine which hours of the week
your web server is most heavily loaded, and how much difference there is
between the heaviest and lightest loads. You could analyze all of the
log files, counting hits per second, bytes sent per connection, and
so on. Unfortunately, for a heavily loaded site this would require a
lot of work, perhaps taking an hour to churn through all the log data.
This makes a casual query turn into a huge production, and leads to
silly behavior like running a whole slew of similar reports every night,
just in case anyone wants to know what happened the previous day without
waiting an hour or two to get an answer for a simple question.



Alternatively, if your web server rotates logs every hour, you could
just compare the log file sizes. A small script could even
chart a rough answer before the user could take his finger off the
enter key. Sure, the answers this method gives will only be an
approximation. But if you really only need to know whether your
traffic is fairly smooth or spikes up a lot on Monday mornings,
that's good enough. Better yet, what was once a production is now
something that anyone could find out on a whim. Want to know whether
that big breaking news story brought in heavier traffic today? You
can have the answer now, rather than tomorrow morning when
the standard reports are delivered.
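
As a rough sketch of that shortcut, the snippet below (in Python)
compares the sizes of hourly-rotated log files and draws a crude text
chart. The log directory and naming pattern are assumptions, not
anything your server necessarily uses.

    # Use log file size as a stand-in for load; the directory and naming
    # pattern below are assumptions about an hourly-rotated access log.
    import glob
    import os

    paths = sorted(glob.glob("/var/log/httpd/access.*.log"))
    sizes = {path: os.path.getsize(path) for path in paths}

    if sizes:
        biggest = max(sizes.values()) or 1
        for path, size in sizes.items():
            bar = "#" * int(40 * size / biggest)    # crude text bar chart
            print(f"{os.path.basename(path):<30} {bar}")

It says nothing about hits per second or bytes per connection, but it
answers the "smooth or spiky?" question almost instantly.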




If you can make this kind of qualitative change in your solution,
do so. You'll get more of a performance boost (and happier users) this
way than through anything else I'll cover. Go ahead, spend some time
on this. I'll wait.




At this point I'll assume you've either determined that your solution
really is a good match to the problem, or have already gone back and
simplified your solution, but still need to do better. Next up is to
check whether the code does more work than needed for the chosen solution.
The big culprits here are poor choice of algorithm and poor use of decent
algorithms.




Let's get the "poor use of decent algorithms" case out of the way first.
Even good algorithms can perform poorly if misused, especially in
combination. Chances are pretty decent, for example, that grepping a
few items out of a big list and then sorting the results will be faster
than sorting the big list first and then grepping. It may not always
be obvious that this is happening; sometimes the various steps are in
widely separated code written by different teams, or even in different
applications. This is particularly a problem when modules do a mountain
of work implicitly, either to "save the user the trouble", or because
the module requires specific properties of its inputs (such as being
sorted), and would rather enforce those properties than merely detect
whether they have been violated.
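
Here is a small illustration of that ordering point, using Python's
built-in filtering and sorting as stand-ins for grep and sort; the data
and the predicate are invented. Both versions produce the same result,
but the second only ever sorts the handful of items that survive the
filter.

    # Illustrative only: filter-then-sort versus sort-then-filter.
    import random

    big_list = [random.randrange(1_000_000) for _ in range(1_000_000)]

    def wanted(x):
        return x % 99_991 == 0              # matches only a few items

    # Slower: sort a million items, then keep a handful of them.
    slow = [x for x in sorted(big_list) if wanted(x)]

    # Faster: keep the handful first, then sort just those.
    fast = sorted(x for x in big_list if wanted(x))

    assert slow == fast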




A system administrator friend likes to tell the tale of a particular
script that was the subject of many performance complaints. After a
bit of research, he determined that the performance of the
script was completely dominated by one command pipe. It turned out
that the commands in that pipe were implicitly sorting a large data
set twice, then explicitly sorting it again, and
then finally grepping the result for just four or five desired items.
Even the best algorithms can't prevent this from being slow. Happily,
simply reordering the operations and turning off the implicit sorts
turned a horrendously slow script into a decently fast one.




Once this kind of misbehavior has been rooted out, it's time to look
at the algorithms used. Most problems can be solved a great many different
ways, each with its own performance characteristics. For any given
problem (sorting, say), each available algorithm has certain strengths
and weaknesses. For example, Quicksort does not need much extra memory
to operate, and usually performs quite well on random inputs.
Unfortunately, it's an unstable sort, can degenerate badly in particular
cases, and isn't well suited to sorting data in any form other than an
in-memory random-access array. Merge sort, on the other hand, usually
requires more memory and is a bit slower than Quicksort on random inputs,
but it's a stable sort, won't degenerate, and works well with
sequential-access data.
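
As a quick aside on what "stable" buys you, here is a small Python
example with invented data. Python's sorted() happens to be a stable
merge-sort variant (Timsort), so a second sort preserves the order
established by the first wherever there are ties.

    # Illustration of sort stability; the hit counts are made up.
    hits = [
        ("mon", "/index.html", 120),
        ("mon", "/feed.xml",    80),
        ("tue", "/index.html", 200),
        ("tue", "/feed.xml",    95),
    ]

    # Sort by URL, then by day. Because the second sort is stable,
    # ties on day keep the URL ordering from the first pass.
    by_url = sorted(hits, key=lambda h: h[1])
    by_day_then_url = sorted(by_url, key=lambda h: h[0])

    for row in by_day_then_url:
        print(row)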




Unfortunately, there are also algorithms in common use that are weak
from just about every angle, except that they've been around a long
time or are exceptionally easy to understand. In either case, these
poor algorithms still get used because they were the first thing that
came to the implementor's mind.




The next step, then, is to determine whether the algorithm the program
uses is the best one for the situation. For this there is no substitute
for knowledge. The more algorithms you are aware of, the better the
chances that you will be able to identify a more suitable choice,
especially if you know the general performance characteristics of each approach.




I'm not, of course, suggesting that you memorize your copy of
Donald E. Knuth's The Art of Computer Programming. The detail you'll
find in those volumes could easily make you an expert in the analysis
of algorithms, but that's probably another case of trying to solve a
much larger problem than the one you actually face. For most purposes,
a book such as Robert
Sedgewick's
Algorithms in C is probably
a better choice, assuming it hasn't changed its style too much since
my dog-eared first edition came out. That book quickly became one of
my favorites because, for every algorithm, it laid out several key
things in plain language: its strengths and weaknesses, how it
actually worked, and the standard optimizations to its basic design.




In some cases, an online reference such as
Wikipedia can give you an idea of
your options, and in the
best cases
will even give you a performance overview for various algorithms; more
specialized sites such as the
NIST Dictionary of Algorithms and
Data Structures
can point you to various implementations.




To understand these resources, you'll need a working grasp of big-O
(asymptotic performance) notation. Put simply, big-O notation tells you
broadly how an algorithm performs on large inputs, generally as a
simple expression in one or more variables that define a
characteristic size for the problem. Typical examples are O(n³),
O(n log n), and O(m + n). By choosing suitable values for
the variables, you can compare the performance of several algorithms
in various situations.
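
To get a feel for what those expressions imply, it helps to plug in a
few sizes and watch how the work grows. The sketch below does that for
the three examples above, picking an arbitrary m of 100 for the last
one.

    # Rough comparison of growth rates for the example expressions.
    import math

    m = 100                                 # arbitrary value for the O(m + n) case
    for n in (10, 1_000, 1_000_000):
        print(f"n = {n:>9}:  "
              f"n^3 = {n**3:.1e}   "
              f"n log n = {n * math.log2(n):.1e}   "
              f"m + n = {m + n:.1e}")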




There are two big gotchas, however. First, big-O notation ignores
constants; O(n) might seem to be always better than
O(n²), but if the actual run times are n seconds
and n² microseconds, then the better choice depends
on how big n is. This is particularly an issue for algorithms
that have the same big-O performance; in that case, constants make
all the difference no matter how big the input. Second, big-O notation
only gives information about asymptotic (large input) performance. For
small inputs, you can pretty much ignore big-O comparison, as startup
overhead and other details will often completely dominate runtime
performance.
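
Here is a back-of-the-envelope check on that first gotcha, using the
hypothetical run times above: n seconds for the O(n) algorithm against
n² microseconds for the O(n²) one. Below a million items, the "worse"
algorithm is actually the faster choice.

    # Where do n seconds and n^2 microseconds actually cross over?
    def linear_seconds(n):
        return float(n)                     # n seconds

    def quadratic_seconds(n):
        return n * n * 1e-6                 # n^2 microseconds, in seconds

    for n in (1_000, 100_000, 1_000_000, 10_000_000):
        winner = "O(n^2)" if quadratic_seconds(n) < linear_seconds(n) else "O(n)"
        print(f"n = {n:>10}: linear {linear_seconds(n):>14,.1f}s  "
              f"quadratic {quadratic_seconds(n):>14,.1f}s  ->  {winner}")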




Even with a good understanding of the use and limitations of the notation,
you'll find that most of the time algorithms are discussed in a fairly
general sense, and often without reference to special situations that
may make one algorithm particularly appropriate. For example, efficient
sorting on a specialized, highly parallel GPU requires a different
algorithm than efficient sorting on a less parallel but more general
CPU. The unfortunate truth is that choosing a good algorithm requires
both general knowledge of available algorithms, and a good understanding
of how these are applied to your problem space, whether it be data
analysis, computer graphics, physical simulation, or what have you.




At this point, I could just suggest going off to read yet more books
and articles. You'd certainly learn a lot, and may even be able to
synthesize a few patterns out of the morass of data. Before you do
that, there's that last big group of performance problems: things that
make code that's great in theory perform horribly on real hardware.
Understanding those issues will make a world of difference in how you
approach all that reading, so they're my subject for next week.




What's the worst misuse of otherwise good code that you've seen?