

Ad-Hoc Clustering

by Raimo Koski

Clustering sounds like rocket science, but it can also be very simple and useful for quite common tasks. Creating an MP3 collection from music CDs is a good example of how to use multiple PCs to do the job faster.

Creating MP3s from music CDs has two major parts: first you rip WAV files from the CDs, and then you convert them to MP3 files. Both steps are time consuming, but if you have more than one PC, you can speed up the process.

Planning the Directory Structure

Because the plan is to use multiple PCs and the files should end up in one location, that directory must be available to all the participating PCs. NFS is the obvious choice for Linux. In my examples I used /data/mp3 as the root of the MP3 file tree. If you would like to burn your collection to CD-R or DVD+/-R disks, create subdirectories such as vol1 and vol2 for the entire collection. Keep all of the clustering scripts in the root mp3 directory.

-- all
|   |
|   --- album1
|   |
|   --- album2
|   |
|   ...
-- vol1
|   |
|   --- album1
|   |
|   --- album2
-- vol2

Here /data/mp3 is the MP3 collection root. The all directory holds every album subdirectory, and album1 and album2 each contain the contents of one music CD. vol1 contains the first part of the collection to burn to CD/DVD, vol2 the second, and so on.
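As a minimal sketch, the top level of this layout could be created with a small helper. The function name make_tree is hypothetical, and the sketch assumes the collection root is already on the shared (for example NFS-mounted) filesystem.

```shell
# Hypothetical helper: create the top-level collection directories.
# $1 is the collection root, for example /data/mp3.
make_tree() {
    root=$1
    # -p creates missing parents and does not fail if a directory exists
    mkdir -p "$root/all" "$root/vol1" "$root/vol2"
}
```

Calling make_tree /data/mp3 gives the all, vol1, and vol2 directories; the album subdirectories appear later, when the ripping script runs.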

Ripping WAV Files from the Music CDs

Apart from the basic utilities, install the freedbtool package. It saves a lot of typing by fetching album and track names from the freedb database. Get the tarball from Freshmeat's Freedbtool page, extract it, change to the extracted directory, run make discid, and copy discid and the freedbtool program to an appropriate bin/ directory on every PC you plan to use for ripping.

Another tool you might have to install is cdparanoia, although it is part of most major Linux distributions. cdparanoia reads the audio tracks from a music CD and writes them as WAV files. As its name suggests, it is very thorough: if the disk has scratches, cdparanoia might take hours to process one disk. Good disks go much faster, taking a couple of minutes in a fast drive. Note that the quality and speed of drives vary widely. CD or DVD recorders are usually better than read-only devices.

The ripping script is a fairly simple shell script. It takes one parameter, the name of the album. It creates a new directory and a lock file, calls cdparanoia to do the ripping, calls freedbtool to get the table of contents (the toc file), generates a renaming script from the toc file, runs the renaming script, removes the lock file, and ejects the CD.

# Quote the album name so that blanks and special characters survive
mkdir "$1"
cd "$1" || exit 1
touch lock
cdparanoia -B

# Get only the first version of toc get -n1
dos2unix toc

# Generate the renaming script
awk 'BEGIN{FS="="
           print > "" }

# add leading zero if value less than 10
     if ($1 < 10) $1="0"$1

# replace blanks with underscore
     gsub(" ","_",$2)

# escape some special characters
     print "mv track"$1".cdda.wav " $1"_"$2 ".wav"}' toc  | tr  \' _ >>

# Run the renaming script
rm -f lock
eject /dev/cdrom
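The renaming step can be sketched on its own. This is not the author's exact script: it assumes each toc line has the form number=title (which is what the FS="=" setting above suggests), and the function name gen_rename is hypothetical. It replaces every character outside letters, digits, dot, underscore, and hyphen with an underscore, which is stricter than the original.

```shell
# Hypothetical helper: read toc lines of the assumed form "N=Title" on
# standard input and print one mv command per track on standard output.
gen_rename() {
    awk '
    index($0, "=") == 0 { next }          # skip lines without a separator
    {
        sep   = index($0, "=")
        num   = sprintf("%02d", substr($0, 1, sep - 1))  # leading zero
        title = substr($0, sep + 1)
        gsub(/[^A-Za-z0-9._-]/, "_", title)              # make shell-safe
        printf "mv track%s.cdda.wav %s_%s.wav\n", num, num, title
    }'
}
```

The output could be reviewed first and then piped to sh inside the album directory.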

If you have more than one CD/DVD drive in any of your PCs, make a copy of the script and change three lines in the copy:

cdparanoia -d /dev/cdrom1 -B
 get -n1 -d "discid /dev/cdrom1"
eject /dev/cdrom1

If your CD/DVD drive has a device file with a different name, make the appropriate changes. The device filenames should be the same on every PC (the naming conventions vary between distributions and versions).
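An alternative to keeping one script copy per drive is to pass the device as a parameter. The following is only a sketch under that assumption: rip_disc is a hypothetical name, the track-name lookup is left out, and unlike the original script it ejects the disk only when cdparanoia succeeds.

```shell
# Hypothetical helper: rip from the given device, defaulting to /dev/cdrom.
rip_disc() {
    device=${1:-/dev/cdrom}
    # -d selects the drive, -B writes one WAV file per track
    cdparanoia -d "$device" -B && eject "$device"
}
```

With this, rip_disc /dev/cdrom1 handles the second drive without a second script.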

Once you have the ripping script in your MP3 collection root, ssh to the first ripping machine, mount the collection root, cd to it, cd to vol1/, insert the first music CD, and start ripping:

sh ../ Artist_Name\:Album_name

You don't have to use underscores instead of blanks, but it tends to make life easier. Repeat for each PC. When the disk trays open, replace the disks and start the script again with a new album name.

I use KDE Konsole as my X Window terminal. Each window can have multiple tabs representing multiple terminal sessions. I change sessions with Shift-left/right arrow keys. This way, I keep all of my ripping sessions in a single window.

With my fastest CD drive, it took 221 seconds to rip a good-quality, 70-minute music CD. The resulting WAV files took up 738 MB in total, so each session needed about 3.3 MB/s of network bandwidth; about 25 concurrent ripping sessions would saturate a Gigabit network. However, the human factor is often the bottleneck here. If you can change disks and type new album names in 20 seconds, you can keep up with about 11 concurrent ripping sessions. Hard disk write speed is also a very likely bottleneck.
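The arithmetic behind these figures can be repeated for your own drives. This sketch uses a hypothetical function, rip_stats, that takes the album size in MB, the rip time in seconds, and the time in seconds needed to swap a disk and type the album name.

```shell
# Hypothetical helper: per-session bandwidth, and how many sessions one
# person can keep busy if each disk swap takes a fixed number of seconds.
rip_stats() {
    awk -v mb="$1" -v rip="$2" -v swap="$3" 'BEGIN {
        printf "per-session rate: %.1f MB/s\n", mb / rip
        printf "sessions one operator can feed: %d\n", int(rip / swap)
    }'
}

rip_stats 738 221 20
```

With the numbers above this prints a rate of 3.3 MB/s and 11 sessions.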

Converting WAV Files to MP3s

You can start converting WAV files to MP3s while you are still ripping. The script I use skips any subdirectory with a lock file in it, so the ripping script must have finished at least one disk before you start. You might need to install LAME; the one I use came from Dag's RPM repository. It should be available from one of your favorite repositories, so use apt-get, yum, or whichever advanced package manager you prefer to resolve its dependencies and install it on all the PCs you intend to use for MP3 conversion.

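for i in *
do
    # Skip anything that is not a directory or is still being ripped
    if [ -d "$i" ] && [ ! -f "$i/lock" ]
    then
        cd "$i" || continue
        for j in *.wav
        do
            # Skip files that another machine has already reserved
            if [ ! -f "$j.reserved" ] && [ -f "$j" ]
            then
                touch "$j.reserved"
                echo "At $(date) $HOSTNAME starts to convert $j"
                # Delete the WAV file only if the conversion succeeded
                lame "$j" "$(basename "$j" .wav).mp3" >/dev/null 2>&1 && rm -f "$j"
                rm -f "$j.reserved"
            fi
        done
        cd ..
    fi
done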

The script visits every subdirectory that has no lock file. Inside each one it goes through the WAV files and skips any that another machine has already reserved. For the rest, it creates a .reserved file, runs lame, deletes the WAV file, and removes the reservation. Because of these lock files, you can run the script on any number of PCs at once.
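Checking for a reservation file and then creating it takes two steps, so two machines could in principle pick the same WAV file in between. One common variant, which is not what the script above does, reserves a file by creating a directory, because mkdir either creates the directory or fails in a single step. The function names reserve and release are hypothetical, and how reliably this behaves on a network filesystem depends on the NFS setup.

```shell
# Hypothetical helpers: reserve a WAV file by creating "<file>.reserved"
# as a directory. mkdir fails if it already exists, so only one caller wins.
reserve() {
    mkdir "$1.reserved" 2>/dev/null
}

# Give the reservation back once the conversion is done.
release() {
    rmdir "$1.reserved"
}
```

In a conversion loop this would read: if reserve "$j"; then lame ...; release "$j"; fi.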

To generate Ogg Vorbis files, use oggenc instead of lame. lame uses a bitrate of 128 kbps by default. Add -b bitrate to change that.
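As a sketch of how the encoder and bitrate could be made selectable, the hypothetical function encode below takes the WAV file, an output format (mp3 or ogg), and a bitrate in kbps. The defaults shown are assumptions chosen to match the 128 kbps figure above.

```shell
# Hypothetical helper: encode one WAV file as MP3 or Ogg Vorbis.
# Usage: encode file.wav [mp3|ogg] [bitrate in kbps]
encode() {
    wav=$1
    fmt=${2:-mp3}
    rate=${3:-128}
    base=${wav%.wav}            # strip the .wav suffix
    case $fmt in
        mp3) lame -b "$rate" "$wav" "$base.mp3" ;;
        ogg) oggenc -b "$rate" -o "$base.ogg" "$wav" ;;
        *)   echo "unknown format: $fmt" >&2; return 2 ;;
    esac
}
```

For example, encode 01_Intro.wav mp3 192 would produce 01_Intro.mp3 at 192 kbps.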

Once you have the script in your MP3 collection root, ssh to the first encoding machine, mount the directory, cd to it, cd to vol1/, and start the encoding script. Alternatively, you can write another script that does the same on every member of the MP3 encoding cluster.

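CLUSTERHOSTS="rk2 rk4 rk23"

# Start the encoding script on every cluster member in the background
for i in $CLUSTERHOSTS
do
  ssh $i "mkdir -p /$COLLECTIONHOST/$DATADIR > /dev/null ; \
          cd  /$COLLECTIONHOST/$DATADIR/mp3/$1 ; \
          sh  ../ ; \
          cd ; \
          umount /$COLLECTIONHOST/$DATADIR" &
done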

Note that the script's first and only parameter is the subdirectory (vol1/, vol2/, etc.) to process. Change the variables at the beginning to suitable values and fix the paths as well.

Note that I have generated keys with ssh-keygen and added them to all cluster members' $HOME/.ssh/authorized_keys files, so that ssh runs without prompting for a password.
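The fan-out pattern itself, starting the same command on every host in the background and then waiting for all of them, can be sketched as a small function. The name run_on_cluster is hypothetical, and the remote command is whatever you would otherwise type after ssh host.

```shell
# Hypothetical helper: run one command on several hosts in parallel.
# Usage: run_on_cluster 'remote command' host1 host2 ...
run_on_cluster() {
    remote_cmd=$1
    shift
    for host in "$@"
    do
        ssh "$host" "$remote_cmd" &   # one background ssh per host
    done
    wait                              # return when every host has finished
}
```

For example, run_on_cluster 'uptime' rk2 rk4 rk23 shows the load on each cluster member.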

This all combines to produce a simple clustering application done by simple shell scripts. That's not rocket science!

Because each cluster node reads a WAV file and writes the resulting MP3 file over the network, network speed is often the main limit on scalability. Both files are big relative to the processing time, so network latency is not an issue. Disk read and write speeds might be another limiting factor. Processing 738 MB of WAV files took 388 seconds on an AMD Athlon64 3000+ CPU, so the bandwidth requirement was about 1.9 MB/s. Gigabit Ethernet should scale up to about 40 similar CPUs.
