1 / 3
Caption Text
2 / 3
Caption Two
3 / 3
Caption Three margin testing

Wednesday, October 21, 2009

MySQL and UTF-8

MySQL and UTF-8 [Web Application Component Toolkit]

Good support from 4.1

utf-8 is utf8 in MySQL.

A collation defines the sort order for the data, it may be case sensitive or not

To find out your current setup:

SHOW VARIABLES LIKE 'character_set_database';  SHOW VARIABLES LIKE 'character_set_client'; 

To see available character sets and collations on your database:

SHOW CHARACTER SET;  SHOW COLLATION LIKE 'utf8%'; 

Character set and collation can be set per server, database, table, connection;

Server (/etc/my.cnf):

[mysqld]  ...
default-character-set=utf8
default-collation=utf8_general_ci

Database:

(CREATE | ALTER) DATABASE ... DEFAULT CHARACTER SET utf8 

Table:

(CREATE | ALTER) TABLE ... DEFAULT CHARACTER SET utf8 

Connection:

SET NAMES 'utf8'; 

A PHP mysql connection (not totally confirmed, but see tests below) defaults to a latin1 connection, so, your first query after connection should be:

mysql_query("SET NAMES 'utf8'"); 

In php versions 5.2 and later, use

mysql_set_charset('utf8',$conn);  

The CONVERT() function can convert between charsets, eg:

INSERT INTO utf8table (utf8column)  SELECT CONVERT(latin1field USING utf8) FROM latin1table;  

As mentioned in charsets, field widths may need to be increased to deal with multi-byte characters.

Code to generate a mass change of collations:

<?php   // this script will output the queries need to change all fields/tables to a different collation // it is HIGHLY suggested you take a MySQL dump prior to running any of the generated // this code is provided as is and without any warranty   die("Make a backup of your MySQL database then remove this line");   set_time_limit(0);   // collation you want to change: $convert_from = 'latin1_swedish_ci';   // collation you want to change it to: $convert_to   = 'utf8_general_ci';   // character set of new collation: $character_set= 'utf8';   $show_alter_table = true; $show_alter_field = true;   // DB login information $username = 'user'; $password = 'pass'; $database = 'table'; $host     = 'localhost';   mysql_connect($host, $username, $password); mysql_select_db($database);   $rs_tables = mysql_query(" SHOW TABLES ") or die(mysql_error());   print '<pre>'; while ($row_tables = mysql_fetch_row($rs_tables)) {     $table = mysql_real_escape_string($row_tables[0]);          // Alter table collation     // ALTER TABLE `account` DEFAULT CHARACTER SET utf8     if ($show_alter_table) {         echo("ALTER TABLE `$table` DEFAULT CHARACTER SET $character_set;\r\n");     }       $rs = mysql_query(" SHOW FULL FIELDS FROM `$table` ") or die(mysql_error());     while ($row=mysql_fetch_assoc($rs)) {                  if ($row['Collation']!=$convert_from)             continue;           // Is the field allowed to be null?         if ($row['Null']=='YES') {             $nullable = ' NULL ';         } else {             $nullable = ' NOT NULL';         }           // Does the field default to null, a string, or nothing?         if ($row['Default']==NULL) {             $default = " DEFAULT NULL";         } else if ($row['Default']!='') {             $default = " DEFAULT '".mysql_real_escape_string($row['Default'])."'";         } else {             $default = '';         }           // Alter field collation:         // ALTER TABLE `account` CHANGE `email` `email` VARCHAR( 50 ) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL         if ($show_alter_field) {             $field = mysql_real_escape_string($row['Field']);             echo "ALTER TABLE `$table` CHANGE `$field` `$field` $row[Type] CHARACTER SET $character_set COLLATE $convert_to $nullable $default; \r\n";         }     } }   ?>

MySQL tables, columns and connections with UTF-8

What happens when you INSERT a UTF-8 string into a MySQL database using PHP's MySQL extension? Well that depends on what string you use. If you use something like "Iñtërnâtiônàlizætiøn" then doing just about anything in MySQL will be data-safe (but, of course, the collation will be incorrect). If you have characters that don't encode the same in UTF-8 and latin1 (e.g. text in Chinese, Russian, ...) then the behaviour depends on both the table definition and the encoding of the connection to the database.


Table or column charset

latin1 (MySQL default) utf8
default connection mysql_pconnect() data is binary-safe but not encoded properly* data is binary-safe but not encoded properly*
using SET NAMES 'utf8'; data destroyed; string is converted to "????" works fine!

* It seems that MySQL will look at the data as being a series of bytes all within the latin1 codepage. If you use the same code to read and write the data then it will round-trip fine (but MySQL's collation will obviously be wrong). If you use tools with a knowledge of the data types used by MySQL, like phpMyAdmin or even mysql_dump, then the wrong encoding is obvious.

(this test was performed with PHP versions 4.4.2 and 5.1.2 and MySQL 5.0.20; PHP source for test)

Further Reading

No comments:

Post a Comment

Featured Post

Windows和Ubuntu双系统完全独立的安装方法

http://www.ubuntuhome.com/windows-and-ubuntu-install.html  | Ubuntu Home Posted by Snow on 2012/06/25 安装Windows和Ubuntu双系统时,很多人喜欢先安装windows,然...