Published on O'Reilly (http://oreilly.com/)

The Worldwide Lexicon Reloaded

by Brian McConnell

This month, my colleagues and I relaunched an experiment in inter-language communication. The Worldwide Lexicon is a simple solution to a tough problem--how to make websites accessible in many languages.

We first worked on this problem in 2001 and 2002, when we developed a system modeled after SETI@Home. Instead of tapping idle computers to do calculations, it would ask idle instant messaging users to translate short blocks of text. We got pretty far along with building the system, but were blocked by one problem: matching volunteer translators with content that they'd be willing to work on. It seemed simple enough at first, but after looking into it more, we realized we couldn't find a good way to solve it.

For example, one English/French speaking user might be a rugby fan, whereas another might prefer to write about fashion. There wasn't a simple way to automate the process of assigning jobs to translators. The task of matching translators with content they'd be willing to translate was far more complex than building a web text editor. We knew that if we got it wrong, people would drop off the system, and it would die before it ever had an opportunity to grow.

This new version of the system is much simpler, and it's based on a key insight. Any website with an audience of more than a few dozen people probably has bilingual readers. The bigger the audience, the more languages its readers will speak. Moreover, these readers are presumably interested in the content, are more knowledgeable about it (they understand its context), and are more willing to help others read it.

The new system works like this. It monitors participating websites' RSS feeds for new articles. When it sees a new item, it fetches machine translations to several languages. It then creates a wiki page for each target language. The website promotes the service to its readers and encourages them to edit and refine these translations, or to start translations to any number of other languages. The translations are output as RSS and static HTML, so the publisher can loop some or all of the translations back into his or her site. It's a simple technique, but an effective solution to a difficult problem.
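
The flow described above could be sketched roughly as follows in Python. This is an illustration only, not the project's Ruby on Rails code. The function names, the list of target languages, the in-memory `wiki` dictionary, and the `machine_translate` placeholder are all assumptions made for the example; a real deployment would call an actual machine translation service and store pages in a real wiki.

```python
import xml.etree.ElementTree as ET

# Hypothetical target languages; the article does not say which ones are used.
TARGET_LANGUAGES = ["es", "fr", "ja"]


def machine_translate(text, target_lang):
    """Placeholder for a call to a machine translation service (assumption)."""
    return f"[{target_lang} draft] {text}"


def new_items(rss_xml, seen_ids):
    """Return feed items whose guid (or link) has not been seen before."""
    root = ET.fromstring(rss_xml)
    items = []
    for item in root.iter("item"):
        item_id = item.findtext("guid") or item.findtext("link")
        if item_id is None or item_id in seen_ids:
            continue
        seen_ids.add(item_id)
        items.append({
            "id": item_id,
            "title": item.findtext("title", default=""),
            "body": item.findtext("description", default=""),
        })
    return items


def create_draft_pages(item, languages=TARGET_LANGUAGES):
    """Build one editable draft page per target language, seeded by machine translation."""
    return {
        lang: {
            "source_id": item["id"],
            "title": machine_translate(item["title"], lang),
            "body": machine_translate(item["body"], lang),
            "edited_by_reader": False,
        }
        for lang in languages
    }


def process_feed(rss_xml, seen_ids, wiki):
    """Create draft pages in `wiki` (a dict) for each unseen item; return how many were new."""
    items = new_items(rss_xml, seen_ids)
    for item in items:
        wiki[item["id"]] = create_draft_pages(item)
    return len(items)
```

Calling `process_feed` repeatedly with the same `seen_ids` set only creates pages for items that appeared since the last poll, which is the "monitor the feed for new articles" step. Reader edits and the RSS/HTML output of finished translations are left out of this sketch.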

Websites will get fast (but inaccurate) machine translations, which will improve as readers edit them. Plus, there is a built-in mechanism for prioritizing work. Readers will spend their time on the most relevant posts first. They'll skip material that's off topic or not really in need of translation. No need to build a complicated system for matching supply and demand; volunteers should do a good job of figuring this out themselves.

We have developed an implementation of this service in Ruby on Rails, and will be hosting a prototype version at www.worldwidelexicon.org. You can join our email list for updates about the project. The private beta test is just getting under way. If you'd like us to translate your site, we'd love to hear from you.

Open Specifications and Open Source

We are also publishing the specifications used to build the service along with the project source so that developers can freely copy this service and embed it into a wide variety of publishing, search, and content-management systems.

Over the holidays, I seriously considered building a business around this service. My goal with this project, though, is to make this kind of translation ubiquitous and to make cross-language communication routine.

I realized that this project is an example of where open source is a superior model for software development. Some projects work well as commercial services--for example, when a company develops a product that can be used or distributed as a single package. Human language, however, is tricky. A program or user interface that works well for English users might not work well for Thai users. I decided it would be better to publish this as a starting point or reference design, and instead of owning the system, teach other developers how to implement it in a wide range of systems.

I also view this as a worthy cause, and something that should be shared freely. The language barrier is really the last frontier left in communication. Although someday artificial intelligence research may produce a "universal translator," I remain skeptical. Human language demands that people understand its nuances. This is an activist project, because it is more about organizing people to help each other communicate than about technology.

The technology employed in this project is pretty basic, but if millions of people participate in this, information and culture that is hidden will suddenly become visible. Many problems we face in the world are at least indirectly caused by cultural barriers. Projects like this, while not a cure-all, will enable people to communicate, discover, and read things that they'd otherwise never see. Overall, I think this will be a good thing.

In the end, I decided that rather than trying to build a business out of this, I would rather build a reference design and show other people how to copy it and embed it into many services. My goal is ambitious. I'd like to see something like this as a checkbox option on most web-publishing and content-management systems within a year or two. If I tried to own this project, there's no way I'd be able to reach my goal, so I decided to give my work away. Besides, I make a decent living, more or less, designing telephone systems.

About the System and How You Can Help

The prototype service, which should work with any site that has an RSS feed, was built by Elevated Rails, a Chicago-based Ruby on Rails consultancy. They also did much of the web and database work for Radio Handi, a global communication service for groups (and my day job). They've done great work on a range of projects I've worked on. They'll be happy to share the source, and they're available for hire if you're interested in embedding this in your publishing or content-management system. We're looking for clients who'd like to fund ongoing development of the system via custom development and system integration projects.

We've published a detailed description of how the prototype works (along with issues to be aware of). We're also publishing the source code from our Ruby on Rails implementation of the spec. With this information, developers can quickly embed this in their publishing systems. The technique is useful for a wide range of applications from translating RSS feeds to localizing software (we came up with a clever hack to do this).

The only centralized aspect of the Worldwide Lexicon system is a simple statistics interface we've built so that we can maintain a global view of how many documents have been processed, how many translations have been created, the overall number of users, etc. Whenever a participating system processes a text, it'll ping a simple HTTP interface to log the event. This information will be used to make translations easy to find, and also to drive whatever third-party visualization or dictionary tools developers can conjure up. We're also hosting a simple dictionary service that captures micro-translations for words and phrases--something else that's easy to host and doesn't require a lot of infrastructure.
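
A ping of this kind might look something like the Python sketch below. The article does not give the endpoint address or its parameters, so `STATS_ENDPOINT` and the parameter names (`event`, `url`, `sl`, `tl`) are hypothetical stand-ins chosen for illustration.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical endpoint; the article does not give the real address or parameters.
STATS_ENDPOINT = "http://stats.example.org/log"


def build_ping_url(event, source_url, source_lang, target_lang, endpoint=STATS_ENDPOINT):
    """Encode one logged event as a GET URL with query parameters."""
    query = urlencode({
        "event": event,
        "url": source_url,
        "sl": source_lang,
        "tl": target_lang,
    })
    return f"{endpoint}?{query}"


def send_ping(ping_url, timeout=5):
    """Send the ping; network failures return False so logging never blocks publishing."""
    try:
        with urlopen(ping_url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        return False
```

Keeping URL construction separate from sending makes the logging call easy to test without a network, and treating a failed ping as non-fatal keeps the participating site independent of the central statistics service.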

If you're an amateur or professional publisher, we're looking for beta testers. Our prototype works with any site that has an RSS feed. We're looking for websites that publish content that is not region-specific and that have 100 or more readers. If you host publishing or content-management systems, you should consider implementing this on your system or integrating with ours. It's pretty easy to do; write or call if you'd like advice or tips.

If you're bilingual, we're looking for volunteers to test the system and give the developer community feedback about how to improve it. Our prototype is pretty simple, but it's a good starting point, and it's something that people who are experienced in building publishing systems can improve upon, both for consumer and business applications.

I'll conclude with some advice for other developers and entrepreneurs. Sometimes it takes a while to arrive at a good solution to a problem. I've been grinding away at this for many years now. The first designs for WWL were interesting but proved unwieldy. They were also several years too early. When you work in technology, it's easy to overestimate how quickly something will be adopted. RSS is a good example of this. It is common now, but it's taken a while to catch on. Don't be discouraged if it takes you longer than you'd like to solve a problem or for a product to become popular. Not everything is meant to be an overnight success, and often you'll learn a lot by taking time to let an idea mature through thought, experimentation, and trial and error.

Brian McConnell is an inventor, author, and serial telecom entrepreneur. He has founded three telecom startups since moving to California. The most recent, Open Communication Systems, designs cutting-edge telecom applications based on open standards telephony technology.


Copyright © 2009 O'Reilly Media, Inc.