Good support from 4.1
utf-8 is utf8 in MySQL.
A collation defines the sort order for the data, it may be case sensitive or not
To find out your current setup:
SHOW VARIABLES LIKE 'character_set_database'; SHOW VARIABLES LIKE 'character_set_client';
To see available character sets and collations on your database:
SHOW CHARACTER SET; SHOW COLLATION LIKE 'utf8%';
Character set and collation can be set per server, database, table, connection;
Server (/etc/my.cnf):
[mysqld] ...
default-character-set=utf8
default-collation=utf8_general_ci
Database:
(CREATE | ALTER) DATABASE ... DEFAULT CHARACTER SET utf8
Table:
(CREATE | ALTER) TABLE ... DEFAULT CHARACTER SET utf8
Connection:
SET NAMES 'utf8';
A PHP mysql connection (not totally confirmed, but see tests below) defaults to a latin1 connection, so, your first query after connection should be:
mysql_query("SET NAMES 'utf8'");
In php versions 5.2 and later, use
mysql_set_charset('utf8',$conn);
The CONVERT() function can convert between charsets, eg:
INSERT INTO utf8table (utf8column) SELECT CONVERT(latin1field USING utf8) FROM latin1table;
As mentioned in charsets, field widths may need to be increased to deal with multi-byte characters.
Code to generate a mass change of collations:
<?php // this script will output the queries need to change all fields/tables to a different collation // it is HIGHLY suggested you take a MySQL dump prior to running any of the generated // this code is provided as is and without any warranty die("Make a backup of your MySQL database then remove this line"); set_time_limit(0); // collation you want to change: $convert_from = 'latin1_swedish_ci'; // collation you want to change it to: $convert_to = 'utf8_general_ci'; // character set of new collation: $character_set= 'utf8'; $show_alter_table = true; $show_alter_field = true; // DB login information $username = 'user'; $password = 'pass'; $database = 'table'; $host = 'localhost'; mysql_connect($host, $username, $password); mysql_select_db($database); $rs_tables = mysql_query(" SHOW TABLES ") or die(mysql_error()); print '<pre>'; while ($row_tables = mysql_fetch_row($rs_tables)) { $table = mysql_real_escape_string($row_tables[0]); // Alter table collation // ALTER TABLE `account` DEFAULT CHARACTER SET utf8 if ($show_alter_table) { echo("ALTER TABLE `$table` DEFAULT CHARACTER SET $character_set;\r\n"); } $rs = mysql_query(" SHOW FULL FIELDS FROM `$table` ") or die(mysql_error()); while ($row=mysql_fetch_assoc($rs)) { if ($row['Collation']!=$convert_from) continue; // Is the field allowed to be null? if ($row['Null']=='YES') { $nullable = ' NULL '; } else { $nullable = ' NOT NULL'; } // Does the field default to null, a string, or nothing? if ($row['Default']==NULL) { $default = " DEFAULT NULL"; } else if ($row['Default']!='') { $default = " DEFAULT '".mysql_real_escape_string($row['Default'])."'"; } else { $default = ''; } // Alter field collation: // ALTER TABLE `account` CHANGE `email` `email` VARCHAR( 50 ) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL if ($show_alter_field) { $field = mysql_real_escape_string($row['Field']); echo "ALTER TABLE `$table` CHANGE `$field` `$field` $row[Type] CHARACTER SET $character_set COLLATE $convert_to $nullable $default; \r\n"; } } } ?>
MySQL tables, columns and connections with UTF-8
What happens when you INSERT
a UTF-8 string into a MySQL database using PHP's MySQL extension? Well that depends on what string you use. If you use something like "Iñtërnâtiônàlizætiøn" then doing just about anything in MySQL will be data-safe (but, of course, the collation will be incorrect). If you have characters that don't encode the same in UTF-8 and latin1 (e.g. text in Chinese, Russian, ...) then the behaviour depends on both the table definition and the encoding of the connection to the database.
| Table or column charset | |
---|---|---|
| latin1 (MySQL default) | utf8 |
default connection mysql_pconnect() | data is binary-safe but not encoded properly* | data is binary-safe but not encoded properly* |
using SET NAMES 'utf8'; | data destroyed; string is converted to "????" | works fine! |
* It seems that MySQL will look at the data as being a series of bytes all within the latin1 codepage. If you use the same code to read and write the data then it will round-trip fine (but MySQL's collation will obviously be wrong). If you use tools with a knowledge of the data types used by MySQL, like phpMyAdmin or even mysql_dump, then the wrong encoding is obvious.
(this test was performed with PHP versions 4.4.2 and 5.1.2 and MySQL 5.0.20; PHP source for test)
No comments:
Post a Comment