I've worked with a lot of server load balancing (SLB) systems, and in doing so I've been asked to diagnose and resolve a variety of issues. I've done this for dozens of clients, ranging from small shops to high-profile Web sites. And let me tell you, if there is one common theme in these situations, it's that "It's always the load balancer." From complete site outages to packet loss (even male-pattern baldness), load balancers get blamed for almost everything. Anyone who has ever administered a load balancer will probably back me up on this point.

Load balancers are an integral part of today's Web infrastructure. They're also complex and underdocumented pieces of hardware. In this article I will explain the reasons why load balancers get the blame and what you can do about it.

The Beast

Today's Web sites are complex beasts. Every component must work together to create a site that is greater than the sum of its parts. Figure 1 represents a fairly typical Web site installation, and in it you can see how complex the average Web site has become.

Figure 1
Figure 1: Traffic flow for a load balancer

The Internet is connected to the routers, which pass traffic through a firewall to the load balancers, which distribute the traffic to the Web servers, which pass information to the application server bone, and the application server bone is connected to the database server bone. You get the picture. If one component or piece of the process fails, it can take down the entire site.

So, if the load balancer is only a small part of a bigger whole, why does it get blamed so disproportionately? What is it about load balancers that attracts critics, finger pointers, and naysayers alike? Let's take a look at some of the reasons.

The Blame Game

One reason people blame server load balancers is that they often do experience more problems with them than with other network devices. There are several reasons for this, but the primary one is that of all the network devices a site might employ, load balancers are typically the newest on the scene. Moreover, manufacturers are competing with one another at a feverish pace, quickly releasing new feature-rich versions that aren't thoroughly tested.

Another reason for the blame is that load balancers are not very well understood. Documentation quality varies greatly from vendor to vendor, and there are few third-party resources (O'Reilly and I hope to change that). A load balancer may not be malfunctioning, but if the people configuring the unit don't understand all of the features or troubleshooting techniques, they may unduly lay blame on the load balancer. The old maxim that people fear (and blame) what they don't understand certainly applies here.

Load balancers are also in the direct path of all traffic to a particular Web site. As Figure 2 below shows, if the load balancer stops working, the entire site stops working. This critical position in the infrastructure can make it appear as though the load balancer is the problem even in cases where it is not (a firewall issue, a back-end database problem, someone tripping over a cable, and so on). Unlike a broken or malfunctioning Web server, a misconfigured or malfunctioning load balancer will result in a dead-to-the-world site. This is why a firewall is often a suspect too, though to a lesser degree, since it is generally a simpler device than a load balancer.

Figure 2
Figure 2: Load Balancer implementation

Considering these points, it's easy to understand why load balancers too often take the rap for a Web site's misfortunes. The problem, however, is that blame can lead to misdiagnosis of the real culprit and delay a remedy. So, what's to be done? The rest of this article focuses on three simple and effective steps you can take to identify the real culprit, and, as we say in the trade, CYA (cover your ass).

MRTG

I've written articles and have spoken at great length extolling the virtues of MRTG (Multi Router Traffic Grapher), a free software package that allows you to graph bandwidth utilization as well as several other metrics. MRTG is invaluable because it provides historical, graphical trending for network devices, whether they be a server, a router, or any other SNMP-capable device. You can look back at what every network device you're responsible for was doing while a problem was occurring. Figure 3 shows an example of an MRTG graph that covers 36 hours' worth of traffic in 5-minute intervals.

Figure 3
Figure 3: MRTG example graph
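
If you've never set up MRTG, the configuration is just a short text file. Below is a minimal sketch of an mrtg.cfg entry that graphs traffic in and out of one load balancer interface; the hostname, interface index, community string, and work directory are placeholders you would replace with your own values.

    WorkDir: /var/www/mrtg

    # Graph traffic in and out of SNMP interface index 2 on the load
    # balancer. "2" is the ifIndex, "public" is the read community, and
    # lb1.example.com is a placeholder hostname.
    Target[lb1-if2]: 2:public@lb1.example.com
    # 12500000 bytes/sec corresponds to a 100-Mbps interface
    MaxBytes[lb1-if2]: 12500000
    Title[lb1-if2]: Traffic analysis for lb1, interface 2
    PageTop[lb1-if2]: <H1>Traffic analysis for lb1, interface 2</H1>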

MRTG not only records bandwidth in and out of an interface on a networked device; it can also graph other SNMP (Simple Network Management Protocol) metrics on a load balancer. For instance, one of the metrics you can measure with MRTG is connections per second, which produces a graph such as the one shown in Figure 4. The graph shows the number of connections per second over a 36-hour period, peaking at around 5,200 connections per second a little after 2:00 P.M.

Figure 4
Figure 4: MRTG example for connections per second

Depending on the load balancer, you can also graph metrics such as connections per second, bandwidth in and out of a VIP (virtual IP, also known as a virtual server) or real server, connections per second per port, the total number of active TCP (Transmission Control Protocol) sessions, and dozens of others. For a closer look at load balancers and MRTG, check out the MRTG site I maintain.
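
To give a sense of what a non-bandwidth metric looks like, here is a sketch of an mrtg.cfg entry for graphing connections per second. The OID below is a made-up placeholder; every vendor exposes its connection counters under its own enterprise MIB, so look up the real OID in your load balancer's SNMP documentation.

    # Connection counter from the load balancer's enterprise MIB. The
    # OID is a placeholder; substitute your vendor's actual counter.
    # In MRTG's default counter mode, the difference between two
    # 5-minute polls is converted to a per-second rate automatically.
    Target[lb1-conns]: 1.3.6.1.4.1.99999.1.2.3.0&1.3.6.1.4.1.99999.1.2.3.0:public@lb1.example.com
    MaxBytes[lb1-conns]: 100000
    Options[lb1-conns]: nopercent, growright
    YLegend[lb1-conns]: conns/sec
    ShortLegend[lb1-conns]: c/s
    Title[lb1-conns]: lb1 connections per second
    PageTop[lb1-conns]: <H1>lb1 connections per second</H1>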

Syslog

Virtually every load-balancing product has some way to write to a syslog server or, in some cases, to store syslog messages locally. Logs are an invaluable tool for diagnosing problems when they arise. You can use them to see who made configuration changes and when, whether a unit failed over to its redundant partner, whether the device logged DoS (denial of service) attack warnings, and other operational events that might point to a problem (or rule one out). Check the syslog documentation for your load balancer to learn more.
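
As an illustration of the receiving end, here is what the setup might look like on a Unix syslog host, assuming the load balancer has been pointed at that host (via its own, vendor-specific commands) and told to log to the local4 facility. The facility and file path are assumptions for the example, not a universal convention.

    # /etc/syslog.conf on the log host: write everything the load
    # balancer logs to the local4 facility into its own file.
    # (Older syslogds require tabs, not spaces, between the fields.)
    local4.*        /var/log/loadbalancer.log

    # On Linux's classic syslogd, remote reception must also be enabled
    # by starting the daemon with the -r flag; check your syslogd's man
    # page, since the mechanism varies by platform.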

Sniffing

For troubleshooting purposes, it's critical that you have a way to quickly and easily analyze traffic as it passes through the network. Common problems you can track down this way include NAT (Network Address Translation) errors, packet-filtering and routing mistakes, DoS attacks (and their sources), and much more.

To capture traffic, just about any Unix machine will do. Different Unix implementations have various sniffing programs available, such as snoop (which ships with Solaris) or tcpdump, an open source traffic analyzer released under the BSD license that runs on most Unix flavors. There are also several commercial sniffing programs available for Windows NT/2000, as well as several black-box packet analyzers.
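
For example, here are a few typical tcpdump invocations of the sort you might run on the sniffing machine. The interface name and the VIP address (192.0.2.10) are placeholders.

    # Watch HTTP traffic to or from the VIP, without DNS resolution (-n)
    tcpdump -n -i eth0 host 192.0.2.10 and port 80

    # Save a raw capture to a file for later analysis...
    tcpdump -n -i eth0 -w vip.pcap host 192.0.2.10

    # ...and read it back afterward
    tcpdump -n -r vip.pcap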

To get traffic to the sniffing machine, most switch vendors offer a feature called port mirroring, in which one port duplicates the traffic of another port for the explicit purpose of monitoring. Figure 5 shows a typical traffic-sniffing setup.

Figure 5
Figure 5: Traffic sniffing scenario

Traffic entering or leaving the selected port is copied to the mirror port as well. This allows for nonintrusive, nondisruptive traffic monitoring. Cisco calls this a SPAN (Switched Port Analyzer) port, while most other vendors simply call it a mirrored port.
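
As a rough sketch, this is what a SPAN session looks like on a Cisco IOS-based switch; the exact commands vary by platform and software version, and the interface names are placeholders (Fa0/1 carries the load balancer's traffic, Fa0/24 connects to the sniffing machine).

    ! Copy everything port Fa0/1 sends and receives to port Fa0/24,
    ! where the sniffing machine is attached.
    monitor session 1 source interface FastEthernet0/1 both
    monitor session 1 destination interface FastEthernet0/24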

Given that load balancing is a relative newcomer to the site-infrastructure scene, and because it is generally misunderstood, it's easy to see why the deck is stacked against SLBs. But laying blame prematurely on a load balancer will not only cause headaches for the vendor and for the people responsible for keeping it running, it can also divert attention from the real problem. I've seen several cases where mob mentality called for the head of the load balancer vendor, only to have the same problem appear with an entirely different vendor's product. Assumptions and premature diagnoses only lead to further problems, frustrated customers, and upset vendors. So make sure your conclusions are based on hard evidence, and not just on fear of the unknown and misunderstood.