(mysql.info) charset-unicode-sets
10.9.1 Unicode Character Sets
-----------------------------
MySQL has two Unicode character sets. You can store text in about 650
languages using these character sets.
* `ucs2' (UCS-2 Unicode) collations:
    * `ucs2_bin'
    * `ucs2_czech_ci'
    * `ucs2_danish_ci'
    * `ucs2_esperanto_ci'
    * `ucs2_estonian_ci'
    * `ucs2_general_ci' (default)
    * `ucs2_hungarian_ci'
    * `ucs2_icelandic_ci'
    * `ucs2_latvian_ci'
    * `ucs2_lithuanian_ci'
    * `ucs2_persian_ci'
    * `ucs2_polish_ci'
    * `ucs2_roman_ci'
    * `ucs2_romanian_ci'
    * `ucs2_slovak_ci'
    * `ucs2_slovenian_ci'
    * `ucs2_spanish2_ci'
    * `ucs2_spanish_ci'
    * `ucs2_swedish_ci'
    * `ucs2_turkish_ci'
    * `ucs2_unicode_ci'
* `utf8' (UTF-8 Unicode) collations:
    * `utf8_bin'
    * `utf8_czech_ci'
    * `utf8_danish_ci'
    * `utf8_esperanto_ci'
    * `utf8_estonian_ci'
    * `utf8_general_ci' (default)
    * `utf8_hungarian_ci'
    * `utf8_icelandic_ci'
    * `utf8_latvian_ci'
    * `utf8_lithuanian_ci'
    * `utf8_persian_ci'
    * `utf8_polish_ci'
    * `utf8_roman_ci'
    * `utf8_romanian_ci'
    * `utf8_slovak_ci'
    * `utf8_slovenian_ci'
    * `utf8_spanish2_ci'
    * `utf8_spanish_ci'
    * `utf8_swedish_ci'
    * `utf8_turkish_ci'
    * `utf8_unicode_ci'
The `ucs2_esperanto_ci' and `utf8_esperanto_ci' collations were added in
MySQL 5.0.13. The `ucs2_hungarian_ci' and `utf8_hungarian_ci'
collations were added in MySQL 5.0.19.
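Because the set of collations depends on the server version, it can help to ask the server which ones it actually provides. The statements below are a minimal sketch using the standard `SHOW COLLATION' statement; the `LIKE' patterns are illustrative, and on newer servers `utf8%' may also match character sets not covered in this section.

```sql
-- List the collations the connected server provides for each
-- Unicode character set named in this section.
SHOW COLLATION LIKE 'ucs2%';
SHOW COLLATION LIKE 'utf8%';

-- Check for one version-dependent collation by name.
-- An empty result means this server does not provide it.
SHOW COLLATION LIKE 'utf8_esperanto_ci';
```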
MySQL implements the `utf8_unicode_ci' collation according to the
Unicode Collation Algorithm (UCA) described at
`http://www.unicode.org/reports/tr10/'. The collation uses the
version-4.0.0 UCA weight keys:
`http://www.unicode.org/Public/UCA/4.0.0/allkeys-4.0.0.txt'. The
following discussion uses `utf8_unicode_ci', but it is also true for
`ucs2_unicode_ci'.
Currently, the `utf8_unicode_ci' collation has only partial support for
the Unicode Collation Algorithm. Some characters are not supported yet.
Also, combining marks are not fully supported. This affects primarily
Vietnamese and some minority languages in Russia such as Udmurt, Tatar,
Bashkir, and Mari.
The most significant feature in `utf8_unicode_ci' is that it supports
expansions; that is, when one character compares as equal to
combinations of other characters. For example, in German and some other
languages `ß' is equal to `ss'.
`utf8_general_ci' is a legacy collation that does not support
expansions. It can make only one-to-one comparisons between characters.
This means that comparisons for the `utf8_general_ci' collation are
faster, but slightly less correct, than comparisons for
`utf8_unicode_ci'.
For example, the following equalities hold in both `utf8_general_ci' and
`utf8_unicode_ci':
Ä = A
Ö = O
Ü = U
A difference between the collations is that this is true for
`utf8_general_ci':
ß = s
Whereas this is true for `utf8_unicode_ci':
ß = ss
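These equalities can be checked directly by comparing string literals under an explicit collation. The following is a sketch, assuming the client sends UTF-8 so that `SET NAMES utf8' makes the literals arrive intact; the column aliases are made up for illustration. According to the description above, the comparison of the German sharp s (`ß') with `s' should yield 1 under `utf8_general_ci', and the comparison with `ss' should yield 1 under `utf8_unicode_ci'.

```sql
-- Assumption: the client character set is UTF-8.
SET NAMES utf8;

SELECT
  'ß' = 's'  COLLATE utf8_general_ci AS general_s,   -- one-to-one comparison
  'ß' = 'ss' COLLATE utf8_general_ci AS general_ss,  -- no expansion support
  'ß' = 's'  COLLATE utf8_unicode_ci AS unicode_s,
  'ß' = 'ss' COLLATE utf8_unicode_ci AS unicode_ss,  -- expansion
  'Ä' = 'A'  COLLATE utf8_general_ci AS general_a,   -- holds in both collations
  'Ä' = 'A'  COLLATE utf8_unicode_ci AS unicode_a;
```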
MySQL implements language-specific collations for the `utf8' character
set only if the ordering with `utf8_unicode_ci' does not work well for a
language. For example, `utf8_unicode_ci' works fine for German and
French, so there is no need to create special `utf8' collations for
these two languages.
`utf8_general_ci' is also satisfactory for both German and French,
except that `ß' is equal to `s', and not to `ss'. If
this is acceptable for your application, then you should use
`utf8_general_ci' because it is faster. Otherwise, use
`utf8_unicode_ci' because it is more accurate.
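The choice between the two collations is normally made per column, and it can be overridden for a single query. The sketch below uses a hypothetical table `customer_demo' with made-up column names; only the `CHARACTER SET' and `COLLATE' clauses are the point.

```sql
-- Hypothetical table: one column per trade-off described above.
CREATE TABLE customer_demo (
  name_fast     VARCHAR(100) CHARACTER SET utf8 COLLATE utf8_general_ci,
  name_accurate VARCHAR(100) CHARACTER SET utf8 COLLATE utf8_unicode_ci
);

-- A column declared with the faster collation can still be sorted
-- with the more accurate one for an individual query.
SELECT name_fast
  FROM customer_demo
 ORDER BY name_fast COLLATE utf8_unicode_ci;
```

Note that an `ORDER BY' with a collation different from the column's own generally cannot use an index on that column for sorting, so the per-query override is best kept for occasional use.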
`utf8_swedish_ci', like other `utf8' language-specific collations, is
derived from `utf8_unicode_ci' with additional language rules. For
example, in Swedish, the following relationship holds, which is not
something expected by a German or French speaker:
Ü = Y < Ö
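To see the Swedish rule next to the generic Unicode ordering, the same rows can be sorted under both collations. This is a sketch, assuming a UTF-8 client; the temporary table `swedish_demo' is hypothetical.

```sql
-- Assumption: the client character set is UTF-8.
SET NAMES utf8;

-- Hypothetical scratch table holding a few single letters.
CREATE TEMPORARY TABLE swedish_demo (c CHAR(1)) CHARACTER SET utf8;
INSERT INTO swedish_demo VALUES ('Ö'), ('Y'), ('Ü'), ('O'), ('U');

-- Swedish rules: U-umlaut sorts with Y, and O-umlaut sorts after them.
SELECT c FROM swedish_demo ORDER BY c COLLATE utf8_swedish_ci;

-- Generic rules: the umlauted letters sort with their base letters.
SELECT c FROM swedish_demo ORDER BY c COLLATE utf8_unicode_ci;

-- The equality from the relationship above, tested directly.
SELECT 'Ü' = 'Y' COLLATE utf8_swedish_ci AS u_umlaut_equals_y;
```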
The `utf8_spanish_ci' and `utf8_spanish2_ci' collations correspond to
modern Spanish and traditional Spanish, respectively. In both
collations, `ñ' (n-tilde) is a separate letter between `n'
and `o'. In addition, for traditional Spanish, `ch' is a
separate letter between `c' and `d', and `ll' is a separate
letter between `l' and `m'.
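The difference between the two Spanish collations shows up when a digraph is compared with a string that shares its first letter. The following is a sketch, assuming a UTF-8 client; the column aliases are illustrative. If `ch' is a letter of its own that follows `c', then `ch' should sort after `cz' in traditional Spanish but before it in modern Spanish, and likewise for `ll' and `lz'.

```sql
-- Assumption: the client character set is UTF-8.
SET NAMES utf8;

SELECT
  'ch' > 'cz' COLLATE utf8_spanish_ci  AS modern_ch_after_cz,
  'ch' > 'cz' COLLATE utf8_spanish2_ci AS traditional_ch_after_cz,
  'll' > 'lz' COLLATE utf8_spanish_ci  AS modern_ll_after_lz,
  'll' > 'lz' COLLATE utf8_spanish2_ci AS traditional_ll_after_lz,
  -- n-tilde is a separate letter after n in both Spanish collations.
  'ñ'  > 'nz' COLLATE utf8_spanish_ci  AS n_tilde_after_nz;
```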