Audio Linkblogging

The module expects a whole MP3 file, but I'm only giving it a short (6K) extract from the middle of the file. So I substitute the full length of the original file, as reported by the remote web server, for the length of the clip reported by the file system.

The fudge factor is just a dreadful hack that makes the reported duration right for the files I've tried so far. There's undoubtedly a more clueful way to do this, and I hope someone will show it to me.
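For constant-bitrate files, one more conventional estimate may be to divide the full file size by the bitrate. The following is only a sketch under that assumption: the function name and parameters are illustrative, the bitrate is assumed to have been read already from a frame header in the excerpt, and variable-bitrate files or large ID3 tags will throw the result off.

```python
def estimate_duration_seconds(total_bytes, bitrate_kbps):
    """Estimate playing time of a constant-bitrate MP3.

    total_bytes  -- full file length, e.g. as reported by the remote server
    bitrate_kbps -- bitrate in kilobits per second (1 kbps = 1000 bits/s)
    """
    if bitrate_kbps <= 0:
        raise ValueError("bitrate must be positive")
    # bytes -> bits, then divide by bits per second
    return (total_bytes * 8) / (bitrate_kbps * 1000.0)
```

With this arithmetic, a 960,000-byte file at 128 kbps works out to 60 seconds.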

These changes weren't strictly necessary. My audio linkblog could have used the original uglier URL syntax as well. But the simpler, the better.

The first incarnation of the audio linkblog was straightforward. I simply transformed the RSS feed for my sound-bite tag into an RSS 2.0 feed whose enclosures were the referenced clipping URLs. In order to supply the length attribute required by the enclosure tag, I made an HTTP HEAD request to the origin server where the MP3 file was hosted.
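The length lookup and the enclosure element can be sketched roughly as follows. This is not the code behind the feed described here, just a minimal illustration in current Python; the function names are hypothetical, and it assumes the origin server answers HEAD requests with a Content-Length header.

```python
import urllib.request
import xml.etree.ElementTree as ET


def remote_length(url, timeout=10):
    """Return the Content-Length reported for url by a HEAD request, or None."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        value = response.headers.get("Content-Length")
    return int(value) if value is not None else None


def make_enclosure(url, length, mime_type="audio/mpeg"):
    """Build an RSS 2.0 <enclosure> element and return it as a string."""
    element = ET.Element(
        "enclosure",
        {"url": url, "length": str(length), "type": mime_type},
    )
    return ET.tostring(element, encoding="unicode")
```

A feed generator would call remote_length() once per item and pass the result to make_enclosure(); letting ElementTree serialize the element takes care of escaping the ampersands in the URL.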

Unfortunately, neither iPodder nor the then newly released iTunes 4.9 was willing to work with the clipping URLs contained in those enclosure tags. iPodder would download the resources but couldn't figure out what to name them; iTunes wouldn't download them at all.

So I resorted to another dreadful hack, and appended a bogus &ext=.mp3 parameter to the URLs. That worked, at least for iTunes. After converting the clips in my sound-bite feed into a playlist, I was able to listen to them in iTunes and on my iPod.
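In code, the hack amounts to a single string operation. A small sketch, with a hypothetical function name and an example URL that is purely illustrative:

```python
def add_ext_hint(url, ext=".mp3"):
    """Append a dummy ext parameter so the URL ends in a file extension."""
    # Use '&' when the URL already has a query string, '?' otherwise.
    separator = "&" if "?" in url else "?"
    return url + separator + "ext=" + ext
```

For example, add_ext_hint("http://example.com/clip?beg=10&end=20") yields the same URL with &ext=.mp3 on the end.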

It was a weird experience, though. First Peter Yared would speak for a couple of minutes, then Doug Engelbart, then Kim Polese, and if they hadn't been my clips I wouldn't have known who was speaking or in what context. In theory I could record audio introductions to each clip. But that wouldn't be a sustainable procedure even for me, never mind a less motivated person. Forming the clip URL and bookmarking it already asks more effort than most people will be willing to give.

Then I realized that I already had contextual metadata, in the form of the title, extended description, and tags. Why not convert that metadata to audio and use it to introduce the clips?

I tried two different text-to-speech (TTS) solutions. First, on Win32, I used pyTTS, which wraps the Microsoft speech API for Python. The prerequisites--Microsoft's SAPI 5.1 redistributable kit, and Mark Hammond's win32all extensions--were a breeze to install, and the TTS module couldn't be easier to use. Here's all you need, for example, to convert some metadata into a WAV file:

import pyTTS
tts = pyTTS.Create()  # speech engine object backed by SAPI
text = '%s, %s, Tags, %s' % (title, descr, tagset)
tts.SpeakToWave('tmp.wav', text)

The ever-popular lame can then convert the WAV file to MP3.
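That conversion step can be scripted as well. A minimal sketch in current Python, assuming the lame executable is on the PATH and using its basic "lame input output" invocation with default encoding settings; the function names are illustrative.

```python
import subprocess


def lame_command(wav_path, mp3_path):
    """Return the argument list for a default-settings lame encode."""
    return ["lame", wav_path, mp3_path]


def wav_to_mp3(wav_path, mp3_path):
    """Run lame and raise CalledProcessError if it exits with an error."""
    subprocess.run(lame_command(wav_path, mp3_path), check=True)
```

Calling wav_to_mp3('tmp.wav', 'tmp.mp3') would encode the spoken introduction produced above.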

I wasn't wild about the TTS results, though, even after trying the extra voices available for the engine. Then I remembered AT&T's online TTS demo, so I tried that too. It's not packaged as a web service, but it was straightforward to issue a request, receive a reference to a WAV file, and then retrieve that file. And AT&T's Audrey sounded a bit better than Microsoft's MSMary!

If this turns into more than an experiment, I'll look into licensing TTS software. Chris Brooks, who runs a website that converts blogs into podcasts, uses NeoSpeech and recommends it highly.

With these ingredients, I was able to create a sound-bite podcast. So far, I've only gotten it to work in iTunes, and even there only imperfectly. Because the introductions and clips are separate enclosures, they'll play out of order if shuffling is turned on in the player. Clearly, they should be combined, though I'm not yet sure what the right way to do that will be. There's also an iTunes issue. iPodder automatically creates playlists from podcasts, but iTunes doesn't, which creates an extra (and, to me, annoying) step in the process.
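One crude way to combine an introduction with its clip is simply to append the bytes of one MP3 file to the other. The sketch below does that and nothing more. It is an assumption-laden shortcut rather than a recommended method: many players will resynchronize on the next audio frame and play the result, but embedded ID3 tags, mismatched sample rates or bitrates, and variable-bitrate headers can all cause glitches or wrong reported durations. The function name is hypothetical.

```python
import shutil


def concatenate_mp3(paths, out_path):
    """Write the raw bytes of each input file, in order, to out_path."""
    with open(out_path, "wb") as out:
        for path in paths:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)
```

Re-encoding both parts to a common format before joining them would be more robust, at the cost of an extra processing step.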

Warts notwithstanding, this exercise has given me a sense of the possibilities that will open up once we can link to discrete segments of audio, subscribe to them, and recombine them. In the realm of text these capabilities have radically improved our ability to assimilate and share information. In the audio realm the need is even greater, because it takes so much longer to listen than to read. In tandem with the social bookmarking and tagging services, the blogosphere stands ready to process all the new audio content that's coming online. Everything we need for efficient classification and recommendation is in place, with the exception of things we take for granted in the textual realm: selection, quotation, and linking. All these things are doable, and I hope someday to take them for granted in the audio and video domains too.

Jon Udell is an author, information architect, software developer, and new media innovator.
