The u8u16 library: 4 to 25 times speed increases for transcoding using SIMD

by Rick Jelliffe


Rob Cameron, who is a professor at Simon Fraser University, has released u8u16 in open source beta, a really exciting library which implements an "iconv" like transcoder (i.e. it converts data from one character set and encoding to another), and which uses the SIMD instructions that modern CPUs have.

I think I was the first person to write something on this technique, certainly on the Internet, in my blog item Using C++ Intrinsic Functions for Pipelined Text Processing a couple of years ago, but only because the idea was too obvious to people involved with DSP to write about, I gather: of course you can use instrinsic functions for text processing! My code just used C++ intrinsics as an optimization on top of C++ code. But Cameron takes it to another level: his code abstracts out the features of the most common SIMD devices so that his algorithms can be arranged to work on this abstraction and compile to a wide range of targets processors, and he can dispense with the code. He reports 4 to 25 times speed increases, depending on the data; which is very promising.

I would love to see an XML parser that combines Cameron' SIMD work with the optimizations from IBM's XML Screamer, which seem to increase the speed of Java processing by two or three fold. Cameron's work is important because it gives a working abstraction that can inform decision-making on buiding SIMD-using capabilities into Java's text processing.


Michael Day
2007-03-08 02:58:32
That's a clever technique! One question, though. Why would you want to convert text in UTF-8 to UTF-16 at all? I would have thought that when parsing an XML document in UTF-8, it would be best to parse it as-is rather than transcoding to UTF-16 first. That sounds more efficient than converting it to UTF-16 and doubling (on average) the memory requirements.

Oh right, Java, in which strings are all sorta kinda UTF-16. Given that C libraries for XML parsing like libxml2 and expat use UTF-8 as their default internal encoding the idea of converting UTF-8 to UTF-16 had me baffled for a second there. No further questions :)

Rick Jelliffe
2007-03-08 15:50:16
There will be some ratio of ASCII to non-ASCII text at which it becomes more efficient to convert to parse as UTF-16 rather than UTF-8. This would primarily effect CJK (China/Japan/Korea) characters that take three bytes in UTF-8 but only two in UTF-16, where there is a ratio above which it is more efficient to transmit as UTF-16 as well. The "doubling of memory requirements" may be true in the Americas, Australia and Pacific, Indonesia, Malaysia, Sub-saharan Africa and parts of West Europe though.

But apart from that, there is always the need to transcode data in and out from different formats.

But the promise is not just for transcoding: I am very interested in seeing how the SIMD (e.g. C++ Intrinsic Functions) can be used for actual parsing.

Having interleaved parsing with transcoding has been a long established technique in SGML. Indeed, IIRC correctly OmniMark (a better, earlier, streaming version of XSLT meets Perl still available) used two or three different parsers, optimised for different character or delimniter handling situations.