only for RuBoard - do not distribute or recompile

14.2 Alternative Character Sets

A character set defines the basic alphabet supported by a database. An encoding then maps the character set to the 1s and 0s understood by a computer. MySQL comes with support for a variety of character sets and encodings. Unfortunately, this support is very peculiar. MySQL ties the concepts of character set, encoding, and sorting rules together so that these three normally distinct concepts cannot be separated. While this scheme works well if you use one of the supported character sets with the built-in sorting rules, you may need to go to some extra lengths if the character sets that come with MySQL do not meet your needs.

First, you need to determine whether the character set needs special string collating routines for sorting and whether it needs multi-byte character support. If it needs either of these, it will need to be compiled in. Otherwise, support for the character set can be added by simply modifying the configuration.

14.2.1 Simple Character Sets

Additional character sets may be configured into MySQL if they don't require multi-byte character support or string collating routines. Adding a character set through configuration requires the following steps:

Add the new character set to the file sql/share/charsets/Index^[1]

^[1] The charsets directory may have different locations depending on your installation. This file might also be share/mysql/charsets, for example..
Create the configuration file for the new character set in sql/share/charsets.
Edit your configure.in file to include the character set in the next compile.
Recompile MySQL.

In this example, we will add a special character set called elvish. We first need to add it to the character set index file. The file looks like this:

$ cat sql/share/charsets/Index
# sql/share/charsets/Index
#
# This file lists all of the available character sets. Please keep this
# file sorted by character set number.


big5               1
czech              2
dec8               3
.
.
.
latin5            30
latin1_de         31

To add a new character set, simply add the character set to the end of the file with a unique index:

latin5            30
latin1_de         31
elvish            32

The next step is to create a configuration file in sql/share/charsets for your character set. You can base it on sql/share/charsets/latin1.conf.

$ cd sql/share/charsets
$ cp latin1.conf elvish.conf
$ vi elvish.conf

There are four array definitions in the configuration file. You need to edit each of these arrays to configure your character set. A # in the configuration file indicates a comment.

ctype

The ctype array^[2] defines the features of each character in the character set. It consists of 257 hexadecimal words. Each word corresponds to a character in the character set, plus an additional character for EOF (for legacy reasons), and is a bitmask that defines the features of its corresponding characters. Table 14-2 shows the possible features. These are also defined in include/m_ctype.h. The ctype value for each character is the union of all the features that describe it. For example, "A" is an uppercase character (0001) and a hexadecimal digit (0200), so its ctype is 0001 + 0200 = 0201 octal. In hexadecimal, this is 81. So ctype['A' + 1] should contain 0x81.

^[2] Note that the ctype array contains 257 words while the to_lower, to_upper, and sort_order arrays all contain 256 words. The ctype array is indexed by character value +1, while the others are indexed by character value.

Table 14-2. Character feature bitmask values

Feature

Bitmask (in octal)

Uppercase

0001

Lowercase

0002

Numeral (digit)

0004

Whitespace

0010

Punctuation

0020

Control character

0040

Blank

0100

Hexadecimal digit

0200

to_lower and to_upper

Table 14-2. Character feature bitmask values
Feature	Bitmask (in octal)
Uppercase	0001
Lowercase	0002
Numeral (digit)	0004
Whitespace	0010
Punctuation	0020
Control character	0040
Blank	0100
Hexadecimal digit	0200

The to_lower and to_upper arrays contain, for each character, the corresponding upper- and lowercase character. So, for example to_lower['A'] should contain 'a', and to_upper['a'] should contain 'A'.

sort_order

MySQL uses the sort_order array to determine the sort order of characters in your character set. For character sets in which you want the sorting to be case insensitive, this will be the same as the to_upper array. If the sorting rules for your character set are too complicated to be handled by a simple table, you will need to compile in support for your character set.

Once you have configured your character set, you are ready to compile MySQL to include it. Before recompiling MySQL, you need to edit configure.in and add your new character set to CHARSETS_AVAILABLE:

CHARSETS_AVAILABLE="big5 cp1251 cp1257
                    croat czech danish dec8 dos estonia euc_kr gb2312 gbk
                    german1 greek hebrew hp8 hungarian koi8_ru koi8_ukr
                    latin1 latin1_de latin2 latin5 sjis swe7 tis620 ujis
                    usa7 win1250 win1251ukr elvish"

The last step is to compile MySQL:

$ make
$ make install

14.2.2 Complex Character Sets

Character sets requiring special string collating routines for sorting or multi-byte character support will need custom work a configuration file simply cannot handle. To accomplish this task, you will need to be familiar with C programming, as you will need to create several C source files. If you plan to attempt this, it is highly recommended that you study the existing ctype-*.c files and base yours on these.

As with simple character sets, you need to edit the sql/share/charsets/Index and configure.in files. Instead of a configuration file, however, you need to create a C source file called ctype-charset.c. In our case, this file will be ctype-elvish.c. You can get a head start with your character set file by copying the source file for a similar character set.

The first task in this source file is to define ctype, to_lower, to_upper, and sort_order. These are the same arrays you configured in the previous section, except they are defined as C arrays. You'll need to create:

ctype_charset
to_upper_charset
to_lower_charset
sort_order_charset

For our example, we need to create ctype_elvish, to_upper_elvish, to_lower_elvish, and sort_order_elvish.

If you need string collating functions, you must write your own custom string collating functions. These functions should be named:

my_strnncoll_charset
my_strnxfrm_charset
my_strcoll_charset
my_strxfrm_charset
my_like_range_charset

We therefore need to create my_strnncoll_elvish, my_strnxfrm_elvish, my_strcoll_elvish, my_strxfrm_elvish, and my_like_range_elvish.

Multi-byte support requires the coding of special multi-byte functions:

ismbchar_charset
ismbhead_charset
mbcharlen_charset

We need to create ismbchar_elvish, ismbhead_elvish, and mbcharlen_elvish.

With your coding complete, you need to create a comment header for your ctype file. This header should look like this:

/*
 * This comment is parsed by configure to create ctype.c,
 * so don't change it unless you know what you are doing.
 *
 * .configure. number_charset=[character set index number from Index file]
 * .configure. strxfrm_multiply_charset=[max ratio that strings may grow
 *                                       during my_strxfrm_charset - a
 *                                       positive integer]
 * .configure. mbmaxlen_charset=[size in bytes of largest char in set]
 */

For our example, this might look like:

 * .configure. number_elvish=32
 * .configure. strxfrm_multiply_elvish=1
 * .configure. mbmaxlen_elvish=2

Finally, compile MySQL and install your new binaries supporting the custom character set.

only for RuBoard - do not distribute or recompile