OSCON Day 0: Scalable Internet Architectures

Robert Kaye
Aug. 02, 2005 01:40 PM

One of my favorite presentations from last year was Theo Schlossnagel's talk on Whack-a-mole, so when I saw that he was giving a full tutorial on scalability this year, I had to check it out. I wasn't disappointed this year either -- Theo presented a solid tutorial grounded in his practical experience in this field. Of course, it's impossible to condense a four-hour tutorial into a blog entry, so instead I'll summarize the three simple rules that Theo applied repeatedly throughout his presentation:

  1. Know the system you're trying to scale. It is important to carefully study the system you're trying to scale. Any time a (sub)system is glossed over and not carefully examined for potential pitfalls, some hidden gotcha that wasn't considered during planning will come back to bite you later. The sooner you understand all of the details of a system that needs to scale, the easier and cheaper it will be to address its problems. Failing to spot a potential problem early on drastically limits your options and increases the cost of fixing it. For example, a problem that is identified and solved during design is cheap to address as part of the overall process, while fixing the same problem much later, say in production, can be catastrophic: if the entire system is down and business has stopped, the cost to the company is astronomical, both in dollars and in customer perception. To avoid this, understand all of your problems fully and never ignore a potential one.
  2. Complexity has costs. A system can only absorb so much complexity before it becomes impossible to maintain or scale up. Carefully examine new technologies before you decide to embrace them in your project: if you build a complex system out of too many disparate technologies, you limit your overall scalability. For instance, a system that depends on dozens of specific software versions can end up with conflicting requirements -- if package A requires package B version 1.2.3 and package C requires package B version 2.1.1, you have to resort to fancy footwork to make the system work. And the more fancy footwork is required, the more complex and error-prone your system becomes. I really have to agree with Theo here -- I think that complexity should be treated like a scarce resource, in scalability and in many other aspects of software engineering.
  3. Use the right tool for the job. Never blindly throw pieces of technology at your problem. Traditional computer science teaches us to reuse existing technology rather than constantly invent new software, but throwing a familiar off-the-shelf tool at the problem may give a poor result, whereas some elbow grease and a custom piece of code may solve it much more efficiently. If knowing the system you're trying to scale is important, so is knowing the technology (and its limitations) that you're going to throw at your problems.

Theo kept reaching back to these points throughout his talk. Another point he stressed repeatedly was the need for a clearly established release procedure: unless the team that rolls new software out onto production servers has documented procedures to follow, mistakes will be made, and as systems grow in size, the likelihood of fatal errors increases dramatically. To spare yourself that fate, document your release procedure and use a version control system to keep track of everything you do. Again, it seems like common sense, but it is often neglected.
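
To make that concrete, here is a minimal sketch (my own, not Theo's) of what one scripted release step might look like, written in Python. The repository URL, tag name, and paths are hypothetical; the point is simply that every rollout follows the same documented, version-controlled steps and leaves an audit trail behind.

    #!/usr/bin/env python
    """Hypothetical release step: deploy an exact, tagged revision and log it."""
    import datetime
    import subprocess
    import sys

    REPO = "https://svn.example.com/myapp/tags"  # hypothetical repository
    DEPLOY_DIR = "/var/www/myapp"                # hypothetical target directory
    LOG_FILE = "/var/log/releases.log"           # simple audit trail

    def deploy(tag):
        # Export exactly the tagged revision -- never an untracked working copy.
        subprocess.check_call(["svn", "export", "--force",
                               "%s/%s" % (REPO, tag), DEPLOY_DIR])
        # Record what was deployed and when, so every rollout is traceable.
        with open(LOG_FILE, "a") as log:
            log.write("%s deployed %s\n"
                      % (datetime.datetime.now().isoformat(), tag))

    if __name__ == "__main__":
        if len(sys.argv) != 2:
            sys.exit("usage: deploy.py <release-tag>")
        deploy(sys.argv[1])

Run it as "deploy.py release-1.4" and the production directory always matches a known tag in the repository, which is exactly the traceability a documented procedure is after.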

Aside from general rules, Theo covered a number of open source solutions that can eliminate the need for expensive dedicated hardware boxes like fail-over switches and load balancers. My favorite example was using the Whack-a-mole toolkit to handle machine failures: it lets an architecture detect when a server has died and automatically reshuffle the work that the dead server was covering. Using Whack-a-mole saves money because you can skip expensive redundant/fail-over systems and run entirely on commodity hardware.
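
To give a flavor of how this can work without dedicated hardware, here is an illustrative sketch of the core idea -- health-check your peers and take over a failed peer's virtual IP. This is my own simplification, not Whack-a-mole's actual code or configuration; the hostnames, the ping-based health check, and the ifconfig alias command are all assumptions for the example.

    #!/usr/bin/env python
    """Illustrative sketch of IP takeover on peer failure (not Whack-a-mole itself)."""
    import subprocess
    import time

    # Hypothetical mapping of peers to the virtual IPs they normally serve.
    PEERS = {
        "web1.example.com": "192.0.2.10",
        "web2.example.com": "192.0.2.11",
    }
    CHECK_INTERVAL = 5  # seconds between health checks

    def alive(host):
        # Crude health check: a single ping with a short timeout.
        return subprocess.call(["ping", "-c", "1", "-W", "2", host]) == 0

    def take_over(virtual_ip):
        # Bring the failed peer's virtual IP up as an alias on this machine,
        # so traffic keeps flowing without a dedicated fail-over box.
        subprocess.check_call(["ifconfig", "eth0:1", virtual_ip, "up"])

    if __name__ == "__main__":
        while True:
            for host, vip in PEERS.items():
                if not alive(host):
                    take_over(vip)
            time.sleep(CHECK_INTERVAL)

The real toolkit is far more careful about detecting failures and handing addresses back when a peer recovers, but the sketch shows why a handful of commodity boxes plus a little software can stand in for a dedicated fail-over switch.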

Another great tool that Theo covered is the Spread toolkit, which lets multiple machines communicate in a coherent manner. Spread allows machines to create a communication channel that is shared and sequenced among all the computers that have joined it: each listener on the channel receives every message posted to it in the same order as everyone else. That ordering guarantee is what makes the toolkit usable in mission-critical, high-availability setups. My favorite application is a multi-server logging facility, where many machines write their log entries to a Spread channel and a single machine writes out a correctly interleaved log file for all of them.
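
As a rough sketch of that logging setup (my own example, not Theo's), the central log writer just joins the channel and appends each message in the order it arrives; Spread's ordering guarantee is what keeps the interleaving consistent. I'm assuming the Python spread binding here, and the daemon address, group name, and log path are made up, so treat the exact calls as illustrative rather than authoritative.

    #!/usr/bin/env python
    """Sketch of a central log writer fed by a Spread channel (illustrative only)."""
    import spread  # assumes the Python SpreadModule binding is installed

    DAEMON = "4803@localhost"          # hypothetical Spread daemon address
    GROUP = "weblogs"                  # hypothetical channel the web servers log to
    LOG_FILE = "/var/log/cluster.log"  # interleaved log for the whole cluster

    def run():
        # Connect to the local Spread daemon and join the shared log channel.
        mbox = spread.connect(DAEMON, "logwriter", 0, 1)
        mbox.join(GROUP)
        log = open(LOG_FILE, "a")
        while True:
            msg = mbox.receive()
            # Ignore membership notifications; write regular messages in the
            # order Spread delivers them, which is the same for every member.
            if hasattr(msg, "message"):
                log.write("%s\n" % msg.message)
                log.flush()

    if __name__ == "__main__":
        run()

Each web server would multicast its log lines to the same group, and because every member sees the messages in the same sequence, the file this one machine writes is a correctly interleaved log for the whole cluster.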

Theo's tutorial set the stage for people facing scalability issues -- he shared a lot of ideas and hard-earned lessons from his extensive experience. Scaling problems are generally very system-dependent, and having a general set of rules to consider has given me a framework for thinking about scalability in my own projects.

Robert Kaye is the Mayhem & Chaos Coordinator and creator of MusicBrainz, the music metadata commons.