MySQL Character Set and Collation

MySQL documentation states the following with respect to the "character set" and "collation" of the data storage in the database.

A character set is a set of symbols and encodings. A collation is a set of rules for comparing characters in a character set. Let's make the distinction clear with an example of an imaginary character set. Suppose that we have an alphabet with four letters: 'A', 'B', 'a', and 'b'. We give each letter a number: 'A' = 0, 'B' = 1, 'a' = 2, 'b' = 3. The letter 'A' is a symbol, the number 0 is the encoding for 'A', and the combination of all four letters and their encodings is a character set. Suppose that we want to compare two string values, 'A' and 'B'. The simplest way to do this is to look at the encodings: 0 for 'A' and 1 for 'B'. Because 0 is less than 1, we say 'A' is less than 'B'. What we've just done is apply a collation to our character set. The collation is a set of rules (only one rule in this case): compare the encodings. We call this simplest of all possible collations a binary collation. But what if we want to say that the lowercase and uppercase letters are equivalent? Then we would have at least two rules: (1) treat the lowercase letters 'a' and 'b' as equivalent to 'A' and 'B'; (2) then compare the encodings. We call this a case-insensitive collation. It's a little more complex than a binary collation.

-- End of MySQL documentation citation

Starting with MySQL version 4.1, there are five levels to which character sets and collations can be applied - server, database, table, column, and string constant. The rules for applying character set and collation occur according to precedence. The order of precedence is string constant, column, table, database, and server. The character set and collation specific to a particular string constant and column apply first. If those don't exist, the table default is used, and if that doesn't exist the database default, and finally the server default. By default, MySQL uses the latin1 character set and the latin1_swedish_ci collation. You can see which defaults apply to your installation by running the following statements:

mysql> SHOW VARIABLES LIKE 'character_set%';
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | latin1                     |
| character_set_connection | latin1                     |
| character_set_database   | latin1                     |
| character_set_results    | latin1                     |
| character_set_server     | latin1                     |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
7 rows in set (0.00 sec) 

The server defaults are set up in the configuration file (/etc/my.cnf). For example:

[mysqld]
default-character-set=euckr
default-collation=euckr_korean_ci

To create a table or a database that makes use of a different character set and collation, simply use the CHARACTER SET (CHARSET is a synonym) and COLLATION clauses.

mysql> CREATE TABLE tablename(column definitions, ...) CHARACTER SET euckr 
COLLATE euckr_korean_ci;
Query OK, 0 rows affected (0.12 sec)

mysql> CREATE DATABASE dbname CHARACTER SET euckr COLLATE euckr_korean_ci;
Query OK, 1 row affected (0.05 sec)

-- If you already have a database that you need to change the character set, 
-- use alter database command.
mysql> ALTER DATABASE dbname CHARACTER SET euckr COLLATE euckr_korean_ci;

Client, Results and Connection CHARACTER SET and COLLATION

Setting the correct character set on the server is only a half story. In order to display characters correctly on the client (be it a GUI or Web), it is important how the client gets the data from the server. To establish a client connection to the server with a desired character set, we need to define client connection settings with variables.

set character_set_client = euckr;
set character_set_results = euckr;
set character_set_connection = euckr;

Since those three variables are likely changed at the same time, MySQL offers a command that will allow you to change all three with one command.

set names = euckr;
or
set names euckr collate euckr_korean_ci;

If you wish to set up a server-level character set, you may also execute the mysqld with a command-line option. For example, in the /etc/rc.d/init.d/mysqld (for redhat):

/usr/bin/mysqld_safe --defaults-file=/etc/my.cnf --pid-file="$mypidfile" --character-set-server=euckr --collation-server=euckr_korean_ci >/dev/null 2>&1 &

To see the available character set, use the following MySQL command.

mysql> show character set

-- To change the database to UTF8, use the following:

mysql> ALTER DATABASE dbname CHARACTER SET utf8 
COLLATE utf8_general_ci; 

If you move your application from one server to another and the character set does not appear correctly, chances are that you have not set the client-side character set correctly. Open a library file where a database connection is made, and add the following statement just after selecting a database. This is an example using PHP programming language.

$db = mysql_connect("$host", "$user", "$pwd");
mysql_select_db("$database");
mysql_query("set names euckr collate euckr_korean_ci");

Replace the Korean character set with whichever character set that you would like to use.

Share this post

Comments (0)

    No comment

Leave a comment

All comments are moderated. Spammy and bot submitted comments are deleted. Please submit the comments that are helpful to others, and we'll approve your comments. A comment that includes outbound link will only be approved if the content is relevant to the topic, and has some value to our readers.


Login To Post Comment