Corrupted UTF-8 characters with PHP and MySQL

by PapaScott on 05 May 2007

Someone recently asked me about corrupted characters in a PHP-MySQL project. They had recent stable versions of everything, MySQL 5.0 and PHP 5.1, using UTF-8 for everything, and they were getting corrupted accented characters saved in the database, junk with ‘Ã’s was coming out. They were asking me if there were perhaps some secret PHP setting they were missing to make things right. Not that I knew of.

I would have thought that in this day and age using UTF-8 with PHP and MySQL would just work. To try this out for myself, I downloaded some simple tutorial code, set up databases and HTML with UTF-8, and started creating records with names like Scött Hänsön and Mäcky Möüße. And sure enough, I got junk.

PHP and MySQL don’t just work with UTF-8. In particular, the client connection from PHP to MySQL defaults to latin1. There don’t seem to be any PHP settings to change this. On the MySQL side there are ‘default-character-set’ settings you can put in various sections of the my.cnf, but for me they didn’t seem to help.

I found two ways to solve the problem. The first is to issue a mysql_query("SET NAMES 'utf8') immediately after every mysql_connect to manually switch the connection to utf8. That’s a huge pain if you have a lot of code (even in my mini 5-file test-case I forgot one instance at first).

The other is to add a init-connect='SET NAMES utf8' setting to the [mysqld] section of my.cnf. init-connect automatically executes SQL commands at the beginning of every client connection. The catch is the commands won’t be executed for a user with SUPER privileges, so if you’re lazy and using ‘root’ to connect to your database, it won’t work.

By using various combinations of <meta charset="blah"/> in the HTML and SET NAMES for the database connection, you can create interesting errors. The ‘Ã’ errors seem to occur when correctly sent utf-8 is incorrectly saved over a latin1 connection, then read back out as utf-8. My friend said he was using SET NAMES 'utf8' with all his connects, I’m guessing he forgot it for INSERTs and UPDATEs. That’s just a hunch.

I wasn’t able to find a clear explanation of this in Google (the closest was in the comments to mysql_client_encoding() in the PHP documentation), so I thought write this up here. It’s clear to me, at least.

Lars Strojny May 5, 2007 at 13:54
Nils May 19, 2007 at 23:28

You can also set charactersetclient to utf8 in my.cnf.

Zaw Htet Wai May 28, 2007 at 09:03

I was also stucking in this utf-8 problem. The server’s MySQL version is 4.0.25 (which does not support utf-8) and PHP is 4. Although I can save the German chars in DB correctly, when I display in HTML, they appear in incorrect format. I tried with a lot of PHP built-in string functions (substrreplace, strpos, utf8decode, mbconvertencoding(“”, ‘UTF-8′, ‘HTML-ENTITIES’)) and encode, decode methods but couldn’t be helpful. When I chek the View-> Source, all chars are correct in Notepad, but not in HTML. So far the best solution for me is to replace the German chars with related HTML numerical code (eg. ä => ä) before you insert the data to DB. After that when u retrieve the data and display them to browser, no need to do anything. Everything will be in correct format.

Mark July 16, 2007 at 11:47

Thanks for mentioning this following super important comment regarding init-connect=’SET NAMES utf8′:
“The catch is the commands won’t be executed for a user with SUPER privileges, so if you’re lazy and using ‘root’ to connect to your database, it won’t work.”

Jimmie February 7, 2010 at 09:58

This is an issue that is occurring all over the net. I read stories/articles all the time and it is very difficult to read when you’re seeing ’ everywhere – for example “What’s” instead of “What’s” A classic example of this can be found on twitter.com Go to http://www.twitter.com and note where it says “What’s happening?” Now click in the edit box to change your status. As soon as you click, you will now notice the above text reads “What’s happening?”
It’s not just the apostrophe either, as you can see in this line copied from another site.
“Hey Jess, you wanna do something? I’m really bored.”
“Hey Jess, you wanna do something? I’m really bored.”
I use Opera myself, but even when I viewed the same page in IE it still came up the same. It’s not something I recall being a problem in the past. So why now? And what can someone from my end do about it? From what I can gather, you all are talking about rectifying the issue during the construction of the site. Is there something I can do as a viewer to counteract it? Is there a line of code I can add to my browser config or an addon I can download or such that will convert ’ back to apostrophes? Because, to be quite honest, it’s giving me the SH$@S!!

Comments on this entry are closed.

{ 1 trackback }

Previous post:

Next post: