BOSC Day 2: BioPerl and XML Interoperability

by Lorrie LeJeune



Since Perl is such a great language for manipulating biological data, it stands to reason that the bioinformaterati have banded together and created their own set of Perl modules for biological applications. Collectively these modules are known as BioPerl. Volunteers from the biological community and beyond maintain the existing code base and contribute to it as the need arises. As Hilmar Lapp, one of the core BioPerl developer/coordinators, said at the meeting, "BioPerl is not a religion. The modules exist because they solve problems." Given the rate at which BioPerl has grown (the core development team expects to release version 1.0 later this year), it's clear that the modules have made bioinformatics research much easier.

One of the most popular BioPerl modules automates parsing and conversion of the major biological database formats such as GenBank, EMBL, and Fasta. As is often the case with resources that have developed over time, biological data formats are neither consistent nor interchangeable. For example, you must convert sequence data from GenBank format to Fasta in order to use BLAST (a sequence alignment tool) to compare your data with GenBank's. The conversion itself is easy if you're only comparing one sequence to another. It becomes incredibly tedious and time-consuming when you're comparing hundreds or thousands of sequences, or entire genomes, which is how most biologists approach it. Thankfully, Perl is well suited to this sort of data manipulation. Once you've converted your data and completed your BLAST analysis, you can use another BioPerl module to parse the multi-megabyte file full of results. And this is just the tip of the iceberg. BioPerl also includes modules for sequence translation, batch retrieval of records from a public database, and much more.
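To make the GenBank-to-Fasta step concrete, here is a rough sketch of what such a conversion involves. It is written in plain Python rather than Perl and does not use or imitate the BioPerl interface. The function name `genbank_to_fasta` and its `width` parameter are made up for illustration, and the parser assumes a simplified GenBank flat file: a DEFINITION line, an ACCESSION line, an ORIGIN block of numbered sequence lines, and a `//` terminator. Real records carry many more fields, which this sketch simply ignores.

```python
def genbank_to_fasta(genbank_text, width=60):
    """Convert simplified GenBank flat-file text to FASTA text.

    Illustrative only: reads DEFINITION, ACCESSION, and the ORIGIN
    sequence block of each record and skips every other field.
    """
    records = []
    accession, definition, seq_parts = "", [], []
    in_origin = False
    in_definition = False

    for line in genbank_text.splitlines():
        if line.startswith("//"):
            # "//" terminates a record; keep it only if it had sequence data.
            if seq_parts:
                records.append(
                    (accession, " ".join(definition), "".join(seq_parts))
                )
            accession, definition, seq_parts = "", [], []
            in_origin = in_definition = False
        elif in_origin:
            # Sequence lines look like "   1 gatcctccat atacaacggt";
            # keep the letters, drop the position numbers and spaces.
            seq_parts.append("".join(ch for ch in line if ch.isalpha()))
        elif line.startswith("ORIGIN"):
            in_origin = True
            in_definition = False
        elif line.startswith("ACCESSION"):
            fields = line.split()
            accession = fields[1] if len(fields) > 1 else ""
            in_definition = False
        elif line.startswith("DEFINITION"):
            definition = [line[len("DEFINITION"):].strip()]
            in_definition = True
        elif in_definition and line.startswith(" "):
            # Indented lines continue a multi-line DEFINITION.
            definition.append(line.strip())
        else:
            in_definition = False

    out = []
    for accession, definition, seq in records:
        # FASTA header: ">" followed by an identifier and a description.
        out.append(">" + (accession + " " + definition).strip())
        # Wrap the sequence at a fixed line width.
        for i in range(0, len(seq), width):
            out.append(seq[i:i + width])
    return "\n".join(out) + ("\n" if out else "")
```

Calling `genbank_to_fasta(open("records.gb").read())` on a file of such records (the filename is hypothetical) would return one Fasta entry per record. Doing this for thousands of records by hand is exactly the tedium that a format-conversion module spares you.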

BioPerl is one of the oldest of the bio projects (its siblings include Biopython and the most recent, BioRuby), and it has made great progress since it started as a loose collection of scripts. Right now BioPerl focuses on sequence handling, annotation, and analysis. In the future the core team hopes to add modules for visualization and GUIs, and to establish a script repository. As Hilmar said in his closing statements, "There's still lots more to do, so remember: if you need it, code it."

This is open source at its best.


Lunch is the perfect time to make announcements. Everyone is happy, well-fed, and, hopefully, open to new ideas. The BOSC organizers have figured this out. Between the sessions on BioPerl, they snuck in Eric Neumann from Beyond Genomics and Maciek Sasinowski from Incogen to tell us about the Interoperable Informatics Infrastructure Consortium, or I3C, an organization dedicated to exploring interoperability in the life sciences using XML technologies. The I3C will "serve as the international organization for global coordination and cooperation for the convergence of Information Technology in Life Science Research. It will promote and maintain a broad spectrum of activities focused on the development and availability of standards, solutions as well as associated technologies." A number of key organizations are involved, including Blackstone Technology Group, Incogen, Oracle, Sun Microsystems, TimeLogic, LabBook, IBM, BIO, and many others.

The I3C is taking an open source approach to development, and plans to become a global repository for information and education. Its organizers feel that XML is a key component of the current "best approach" to informatics, and the consortium and its members are developing bioinformatics protocols using ebXML (electronic business XML). Eventually they plan to expand into other areas of informatics such as metabolomics, pharmacogenomics, proteomics, and cheminformatics.

I think that the I3C is an organization that definitely bears watching.