Turning MySQL data in latin1 to utf8 utf-8

by Derek Sivers

Related link: http://mysql.he.net/doc/refman/5.0/en/charset.html



I've just finished one of the most difficult and tedious problems I've ever solved, so I have to share the solution here in a little tutorial of how I fixed this, even though I'm sure there are better ways, this is what worked for me.

THE PROBLEM - PART 1:
My old CD Baby MySQL database from 1998 was filled with foreign characters and was in MySQL's default (latin1) encoding.
For years, customers and clients had been using our web interface to give us their names, addresses, song titles, bio, and many things in all kinds of alphabets.
I wanted everything to be in UTF-8. (The database, the website, the MySQL client, everything.)

QUICK DEFINITION : "FOREIGN CHARACTERS"
When I say "foreign characters" I mean not just Greek, Icelandic, Japanese, Chinese, Korean, and others shown at Omniglot, but also the curly-quotes, ellipsis, em-dash, and things described at alistapart.

START OF THE SOLUTION (THE EASY PART):
* - Found a few hours of downtime at 2am on a Sunday night.
* - Shut down the website.
* - Did a raw data dump (mysqldump) of the data to a regular text "dump.sql" file. (85 tables, millions of rows, an 8 gig dump)
* - Completely removed MySQL 3.2 from the system
* - Installed MySQL 5.0 (FreeBSD ports), making sure to use --with-charset=utf8 while compiling (see http://dev.mysql.com/doc/refman/5.0/en/charset-server.html)
* - Did a sed replace on the dump.sql file, changing all table types to utf8.
* - (Also changed from MyISAM to InnoDB but that's a different story, and had no problem.)
* - Changed my HTML header Content-Type to charset=utf-8 everywhere
* - Changed /etc/my.cnf to default charset utf8
* - Loaded the dump.sql file, and turned the website back on.
* - Made sure it mostly worked, and went to sleep

THE PROBLEM - PART 2:
Some foreign characters were perfect. Others were a jumble : what should have been one quotation-mark turned into a series of THREE jumbly characters. Weird. Had to be fixed. No idea where to start.

FIGURING OUT WHAT'S WRONG (THE HARD PART):
* - Unless you want to do *everything* in a web browser, you need to get a terminal that does Unicode and can display foreign characters. I used uxterm. See http://czyborra.com/unicode/terminals.html
* - I learned about using the SET NAMES utf8 query, but when I did that almost everything turned into a jumble.
* - I could send the database a set names utf8 command, and SOME would work. Or I could do set names latin1, and the rest would work. I was stumped.
* - It took about 10 hours of frowning and furiously typing, but I found out that
--- #1 : The MySQL server was using UTF8 encoding.
--- #2 : The MySQL client was using latin1 encoding.
--- #3 : Even if I got the command-line MySQL client to use utf8, the PHP client was still using latin1 encoding.
--- #4 : Most of my data must have been put into the MySQL server with latin1 encoding, which is why it worked with latin1 encoding on the client when getting it out.

Seems I had some characters in latin1, some characters in UTF-8, some in the database as HTML equivalents (分) and some characters that were just a total mystery.

A TOOLBOX FOR SLEUTHING CHARACTER ENCODING PROBLEMS:

#1 - USE MySQL CHAR_LENGTH TO FIND ROWS WITH MULTI-BYTE CHARACTERS:
SELECT name FROM clients WHERE LENGTH(name) != CHAR_LENGTH(name);

#2 - USE MySQL HEX and PHP bin2hex
SELECT name, HEX(name) FROM clients;
Get the result back into PHP, and run a bin2hex on the string, compare it to MySQL's hex of that same string

#3 - SEE IT IN BOTH ENCODINGS
$db->query("SET NAMES latin1");
$db->query("SELECT name, HEX(name) FROM clients");
(compare the string and its hex result from MySQL with the bin2hex from PHP)
$db->query("SET NAMES utf8");
$db->query("SELECT name, HEX(name) FROM clients");
(compare the string and its hex result from MySQL with the bin2hex from PHP)

For all those strings that looked perfect in LATIN1 encoding, here's how I would fix them in the database:
$db->query("SET NAMES latin1");
$db->query("SELECT id, name FROM clients");
$hex = bin2hex($x['name']);
$db->query("SET NAMES utf8");
$db->query("UPDATE clients SET name=UNHEX($hex) WHERE id=$id")

That seemed to work, for most things.
Problem is, only SOME of the database was in latin1 encoding, so I had to use a few quirky ways, but mostly my own eyes, to fix only these things, and not accidently re-encode something that was perfect.

#4 - USE A HEX/UNHEX REPLACE FOR THE UNFIXABLE CHARACTERS
Imagine, after all that fixing, you found strings like this:

Let~!@s say ^|%What a nice house you~!@ve got here, don~!@t you think?^!%.

Who knows when or how this happened, but obviously ~!@ is meant to be an apostrophe, ^|% an open-quote, and ^!% a closing-quote.

I'd use MySQL SUBSTRING to find the 3 characters that needed replacing:
SELECT SUBSTRING(quote, 353, 3) FROM table WHERE id=1;

Once narrowing it down to the exact string, add a HEX() around it:
SELECT HEX(SUBSTRING(quote, 353, 3)) FROM table WHERE id=1;
... which would give you a result like C8035EF6BB92BF2

Then use that with MySQL REGEXP to find and replace all occurences in your database!
UPDATE table SET field = REPLACE(field, UNHEX('C8035EF6BB92BF2'), "'") WHERE field REGEXP UNHEX('C8035EF6BB92BF2');

I set up some PHP arrays of all my tables, and all their text fields, to run this same query on everything in my database.
Then do it again for curly-quotes and other weirnesses.

A few times, I had no idea what a character was supposed to be (like the Icelandic and Gaelic ones) - so I had to go visit the artist's website, and find their song titles or bio information spelled correctly there.


#5 - VALIDATE UTF8
Got the is_utf8 function from PHP docs to validate all the values in the database.
Doing this found a bunch of invisible problems, which only through hours of MySQL SUBSTRING and HEX revealed that there were invisible characters with HEX values of 00-19 scattered around my text fields.
I used the same solution as above to replace them:
UPDATE table SET field = REPLACE(field, UNHEX('05'), '') WHERE field REGEXP UNHEX('05');
I looped this inside an array of all hex values under 20.


#6 - CONVERTING HTML ENTITIES
Find HTML entities hidden in the database:
SELECT field FROM table WHERE field REGEXP '&#[0-9]*;'
Use the utf8_chr function from the comments of the PHP html_entity_decode page.
Use PHP preg_match_all to find the entities inside the string, and replace them:

function myreplace($string) {
preg_match_all('/&#(\d*)/', $string, $matches);
foreach($matches[1] as $num) {
$string = str_replace("&#$num;", utf8_chr($num), $string);
}
return $string;
}

Update the database with the returned result.



After all this (about 60 hours work over the last 5 days) I think it's all done.

Phew.

LESSON LEARNED: KEEP EVERYTHING IN UTF-8, ABSOLUTELY EVERYWHERE, FROM DAY ONE. You'll be glad you did some day.

Go ahead. Show how smart you are. Show how my 60 hours could have been done in 5 minutes with the language of your choice instead of my mess, above.


8 Comments

dereksivers
2006-01-16 13:11:58
mb_send_mail
Uh-oh. There's more. Now I have to make sure all email is sent correctly, too:


http://us2.php.net/manual/en/function.mb-send-mail.php

JamesRyan
2006-01-17 12:39:40
Wow, I just finished doing the same thing
The week before you posted this article, I went through exactly the same process. However, for me it was not nearly as painful as it seems to have been for you.


I actually left the databases in place when I upgraded, but wrote a PHP script to change all of my varchar and char fields to varbinary and binary before the upgrade. After the upgrade (making sure to use --with-charset=utf8), I ran another PHP script that I wrote to convert all the tables and character fields back to varchar and char using utf8. I simultaneously performed the conversion to InnoDB. Note if you are already using InnoDB and any of your char or varchar fields are part of a foreign key constraint, you will need to remove the constraint before changing the field definition and put it back afterwards. My PHP script did this for me automatically.


When I was finished, everything worked fine. I had no garbled characters. This is possibly because even before the conversion I was running PHP using utf8 as the default charset, and running Apache using utf8 as the default charset. This probably prevented corruption of characters.


Just for everyone's information, if you want to run purely utf8, don't forget the following:


httpd.conf:
AddCharset UTF-8 .utf8
AddDefaultCharset UTF-8


php.ini
default_charset = "utf-8"


my.cnf
character-set-server=utf8
default-collation=utf8_unicode_ci


Then, if you're writing PHP scripts, soon after opening your connection to mysql, issue one of the following:
SET NAMES 'utf8';
OR, if you are running the mysqli extension:
mysqli_set_charset('utf8');


Finally, in your HTML, don't forget the following in your head section:



If you miss any of these pieces, you'll likely see garbled characters!!


Good Luck!


James Ryan
jamesr@totalinfosecurity.com


dereksivers
2006-01-18 10:39:47
PATTERN : HEX(C383C2__) TURNS INTO (C3__)
I FOUND A PATTERN to this:


Whenever I find a 2-character string with HEX(C383C2__) (where the last 2 hex-numbers are $$ anything)


IT NEEDS TO BE TURNED INTO (C3__) (those same 2 hex-numbers)


C383C2A5 needs to be converted to C3A5
C383C285 needs to be converted to C385
C383C2B0 needs to be converted to C3B0


etc

jasonhawkins
2006-01-21 10:00:13
utf8 email
The list of clients that actually support utf8 email is... well, not exhaustive.


Pretty much everything worth using on *ix does (i.e. Mutt on a unicode-compatible term, KMail, Thunderbird), but some of the most canonically popular mail clients, including Pegasus (what all the Mac-using professors at my college used) and Eudora (what we still use on winboxes here at CD Baby), do not support UTF-8 encoding (or unicode of any kind, btw). Not to mention the plethora of end-users using browser-based clients through Internet Explorer.


Given that, the problem so far has been: trying to figure out a way to get UTF-8 quashed into something all mail clients can deal with.


PHP's utf8_decode anglicizes most Romance languages quite well, simply removing accents and umlauts, etc. where appropriate. It barfs on other alphabets, though... if anyone knows of a way to do this without using php's strtr() and a huge freakin' mapping array... I'm all ears.


here's some good reading, especially for C programmers new to problems of unicode. The author, Marcus Kuhn, also has a great links section / ml guide:
http://www.cl.cam.ac.uk/~mgk25/unicode.html

RichardLynch
2006-01-27 14:00:02
utf8_decode
If you ever find yourself with crazy characters while using PHP's utf8_decode, then just don't use it...


utf8_decode down-grades to Latin1 (basic English, pretty much) -- If the characters you want aren't available IN Latin1, you're screwed.


You're better off keeping the data in UTF-8 all the way to the browser, and using:

?>



.
.
.


Things you should know:
1. Mozilla and standards-compliant browsers will honor the header(). Internet Explorer will not.
2. Internet Explorer needs the goofy META tag.
3. You should put the META tag first because it forces the browser to start all over with rendering, so anything before that gets done twice.


Modern browsers can handle UTF-8 just fine, mostly.


Some older browsers might be slightly better off with the results of utf8_decode...


But it's probably not worth detecting the browser and doing utf8_decode() just for that.


Visually-oriented people, or audio-geeks may want to think of the utf8_decode() function as:
down-sample to English characters with a pretty awful lossy compression algorithm where only the simplest non-US characters survive the conversion


It will save a few bytes, but unless you're sending out "War and Peace" it's not going to make that much difference for most websites.


I learned all this the hard way just last week.

jasonhawkins
2006-01-27 15:07:08
update: re utf8 email
Seems the inelegant-but-permissible solution to sending utf8 email into the aleatory realm of MUAs:
find out whether the email contains characters not within ASCII.
if so, send as utf8 email, assuming the person on the other side, since for the most part they're the ones who provided the variable data in the mail body, will be able to read it.
(i.e. use PHP's mb_send_mail())
if not, just send as a normal latin1 email.
(i.e. use trusty old mail())
The is_utf8 function Derek posted about up above is (luckily) well-commented. A simple removal of the first line of the regex and a little tidying was all it took!
Now if I can only get it to parse the message body a little faster... :)
kmh496
2006-02-08 23:59:06
about messing up chars like the apostrophe and quote mark
i found out that in php i needed this expression. I can't understand why either. i would submit into a utf-8 forum from a windows machine in korea, supposedly everything over here uses euc-kr, so it would appear strange in the utf-8. i found the following was necessary :


$contents = preg_replace("/'/", "\'", $contents) ;
$contents = preg_replace("/`/", "\'", $contents) ;
$contents=preg_replace("@\-@","-",$contents);
$contents=preg_replace("@\xe2\x80\x99@","\'",$contents); // this comes from the windows machine at the office
$contents=preg_replace("@\xe2\x80\x9c@","\"",$contents); // this comes from the windows machine at the office
$contents=preg_replace("@\xe2\x80\x9d@","\"",$contents); // this comes from the windows machine at the office
$contents=preg_replace("@\xe2\x80\xa6@"," ",$contents); // this comes from the windows machine at the office
$contents=preg_replace("@\xe2\x80\x94@","-",$contents); // this comes from the windows machine at the office
$contents=preg_replace("@\xe2\x80\x9[0-9a-fA-F]@","*",$contents); // this comes from the windows machine at the office
// which uses hangul encoding by default, some funky windows thing which comes as 3 bytes but is supposed
// to just be a single apostrophe
// use www.sirfsup.com/code/perl/bin_perl/dump.perl
// slashes



goodbye from south korea.

kmh496
2006-02-09 00:02:46
my conversion script for converting files.
also i used the following script, from teh sed/awk book, adapted to call 'iconv' for the data transformation. I would like to see the sed script you used. i wish you had included it in your posting. do you believe that was the primary culprit?


#!/bin/sh
this_directory=`pwd`
for x
do
echo -n "converting $x: "
if test "$x" = runiconv.sh; then
echo "not editing script itself!"
elif [ -d $x ]; then
(cp runiconv.sh $x; cd $x; sh runiconv.sh *; rm -f runiconv.sh cd .. )
elif test -s $x; then
iconv --from-code=euc-kr --to-code=UTF-8 < $x > $this_directory/$x$$ ;
if [ $? == 0 ]
then
cp $this_directory/$x$$ $x
rm -f $this_directory/$x$$
else
echo -n "ICONVE ERROr "
rm -f $this_directory/$x$$
fi
echo "done";
else
echo "original file is empty"
fi
done
echo "all done"