The u8u16 library: 4 to 25 times speed increases for transcoding using SIMD
by Rick Jelliffe
Rob Cameron, a professor at Simon Fraser University, has released u8u16 as an open source beta. It is a really exciting library that implements an "iconv"-like transcoder (i.e. it converts data from one character set and encoding to another) using the SIMD instructions that modern CPUs have.
I think I was the first person to write something on this technique, certainly on the Internet, in my blog item "Using C++ Intrinsic Functions for Pipelined Text Processing" a couple of years ago, but I gather that was only because the idea was too obvious to people involved with DSP to write about: of course you can use intrinsic functions for text processing! My code just used C++ intrinsics as an optimization on top of C++ code. But Cameron takes it to another level: his code abstracts out the features of the most common SIMD devices so that his algorithms can be arranged to work on this abstraction and compile to a wide range of target processors, and he can dispense with the code. He reports 4 to 25 times speed increases, depending on the data, which is very promising.
I would love to see an XML parser that combines Cameron's SIMD work with the optimizations from IBM's XML Screamer, which seem to increase the speed of Java processing two- or threefold. Cameron's work is important because it gives a working abstraction that can inform decision-making on building SIMD-using capabilities into Java's text processing.
That's a clever technique! One question, though. Why would you want to convert text in UTF-8 to UTF-16 at all? I would have thought that when parsing an XML document in UTF-8, it would be best to parse it as-is rather than transcoding to UTF-16 first. That sounds more efficient than converting it to UTF-16 and doubling (on average) the memory requirements.
There will be some ratio of ASCII to non-ASCII text at which it becomes more efficient to convert to UTF-16 and parse that rather than parse the UTF-8 directly. This would primarily affect CJK (Chinese/Japanese/Korean) characters, which take three bytes in UTF-8 but only two in UTF-16; above a certain ratio it is more efficient to transmit such text as UTF-16 as well. The "doubling of memory requirements" may be true in the Americas, Australia and the Pacific, Indonesia, Malaysia, Sub-Saharan Africa and parts of Western Europe, though.
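To make the size side of that ratio concrete, here is a minimal Python sketch. The helper names `encoded_sizes` and `mixed_sample` are made up for this illustration, and the sample text is artificial: one repeated three-byte CJK character mixed with ASCII letters. It only counts bytes; it says nothing about where the break-even point for parsing speed lies.

```python
def encoded_sizes(text):
    """Return (UTF-8 byte count, UTF-16 byte count) for text, with no BOM."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


def mixed_sample(cjk_chars, ascii_chars):
    """Build artificial text: cjk_chars copies of U+6F22, then ASCII letters."""
    # U+6F22 is a CJK ideograph: three bytes in UTF-8, two bytes in UTF-16.
    return "\u6f22" * cjk_chars + "a" * ascii_chars


if __name__ == "__main__":
    # Ten characters per sample, with a growing share of CJK characters.
    for cjk, latin in [(0, 10), (3, 7), (5, 5), (7, 3), (10, 0)]:
        u8, u16 = encoded_sizes(mixed_sample(cjk, latin))
        print(f"{cjk:2d} CJK + {latin:2d} ASCII: UTF-8 {u8:2d} bytes, UTF-16 {u16:2d} bytes")
```

With a fraction p of three-byte characters and the rest ASCII, UTF-8 needs 1 + 2p bytes per character against a flat 2 for UTF-16, so under these assumptions the two encodings are the same size at p = 0.5. Pure ASCII text doubles in UTF-16, while pure CJK text shrinks by a third.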