ediplex - Generic Text Processor

by Jeremy Jones

I've been carrying around an interest in text processing for several years now, which began with my work with EDI. Even though I no longer work with EDI and my job doesn't revolve primarily around text processing, I still maintain an interest in text processing in general and in processing EDI specifically. I created the ediplex project on Novell Forge probably two years ago, around the time I wrote this article for DevX. Back then, ediplex was specifically an EDI processing engine, with hopes of converting EDI to other formats pretty easily.

Over time, ediplex has evolved. One goal I had for ediplex from the very beginning was the ability to easily define new EDI file formats. At its inception, it only supported X12, which is primarily a North American standard. But I had hopes of supporting EDIFACT and TRADACOMS, which are more in use in Europe.

Which leads me to today. The latest incarnation of ediplex doesn't support EDI. Not yet, anyway. What it does is allow users to create custom document definitions that describe what a document's header and footer should look like. It also allows users to create custom handlers, which the engine feeds with data for a specific document type. The latest rendition is in early alpha, but it looks like a document is being passed all the way from its input to its handler. If you're interested, you can `bzr branch http://bzr.ediplex.org/trunk/` and start poking around. (This requires the Bazaar version control client.)

The architecture for ediplex is layered, but pretty simple. The first layer is the input layer. This layer gets input from somewhere (file, socket, whatever) and passes data to the scanner, which is next. The input layer was designed to allow users to create their own custom types of input receivers as they see fit. The next layer is the scanner. While this layer can certainly be replaced and customized, that shouldn't be necessary. The scanner receives data from the input receiver, determines which document type the text belongs to, and passes it off to that document type. The next two layers are the document definition and the data handler. I combine them here because they are combined in the ediplex code. The document definition doesn't do much except describe a new document type and tell the scanner whether a certain string of text matches its definition. The handler is intended to be extremely customized. When it receives data, it gets to do whatever with it that its little heart (and its coding master) desires.
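To make the layering concrete, here is a minimal sketch in Python of how an input receiver, a scanner, and a combined document definition and handler could fit together. It is an illustration only: `DocumentDefinition`, `Scanner`, `file_input`, and the "BEGIN INVOICE" / "END INVOICE" markers are all hypothetical names invented for this example, not the actual ediplex API, and it assumes line-oriented input with headers and footers matched by prefix.

```python
class DocumentDefinition:
    """Hypothetical definition: describes a document type and owns its handler."""

    def __init__(self, name, header, footer, handler):
        self.name = name
        self.header = header    # text that opens a document of this type
        self.footer = footer    # text that closes it
        self.handler = handler  # callable that receives the document's lines

    def matches(self, line):
        # Tell the scanner whether this line starts a document of this type.
        return line.startswith(self.header)

    def is_footer(self, line):
        return line.startswith(self.footer)


def file_input(lines):
    """Hypothetical input receiver: yields lines without trailing newlines.

    Any iterable of strings works here (an open file, a list, and so on).
    """
    for line in lines:
        yield line.rstrip("\n")


class Scanner:
    """Hypothetical scanner: routes each document to the matching definition."""

    def __init__(self, definitions):
        self.definitions = list(definitions)

    def scan(self, source):
        current = None  # definition of the document being collected, if any
        buffer = []
        for line in source:
            if current is None:
                # Outside a document: look for a definition that claims this line.
                current = next(
                    (d for d in self.definitions if d.matches(line)), None
                )
                if current is not None:
                    buffer = [line]
                continue
            buffer.append(line)
            if current.is_footer(line):
                # Document complete: hand it to the handler and reset.
                current.handler(buffer)
                current, buffer = None, []


# Usage: the handler here just collects each completed document.
received = []
invoice = DocumentDefinition(
    "invoice", "BEGIN INVOICE", "END INVOICE", received.append
)
raw = ["noise\n", "BEGIN INVOICE\n", "item: widget\n", "END INVOICE\n", "trailing\n"]
Scanner([invoice]).scan(file_input(raw))
```

In this sketch, lines that no definition claims are silently skipped, and a document that never reaches its footer is never delivered. A real engine would need to decide how to handle both cases.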

So, if you're in the market for a text processing engine, check out ediplex. I don't have a license statement in the source tree, but will soon. I'm strongly leaning toward the MIT license, but am also considering GPLv2. Questions, comments, flames welcome.