4.1 Character set overview
A character set is a collection of symbols together with the rules for encoding them. Both Oracle and MySQL require you to choose a character set, and if the wrong one is chosen when the database is created, it may have to be replaced later. Replacing a character set is a relatively expensive operation and carries some risk. We therefore recommend choosing a character set that fits your needs at the start of the application, to avoid unnecessary changes later.
4.2 Introduction to character sets supported by MySQL
A MySQL server can support multiple character sets (use the show character set command to list them all). Different character sets can be specified for different columns on the same server, in the same database, and even in the same table. Compared with database management systems such as Oracle, which can use only one character set within a database, MySQL is considerably more flexible.
MySQL's character set support involves two concepts: the character set (CHARACTER SET) and the collation (COLLATION). The character set defines how MySQL stores strings, and the collation defines how strings are compared. The relationship between character sets and collations is one-to-many. MySQL supports more than 70 collations across more than 30 character sets.
Each character set has at least one collation. Use the SHOW COLLATION LIKE 'utf8%'; command to list the collations of the matching character sets.
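The split between storing and comparing can be illustrated outside the database. The following Python sketch is only an analogy and does not reflect how MySQL implements collations; the function names are hypothetical. One function models the character set (which bytes are written) and two others model different comparison rules applied to the same stored text, which is why one character set can have several collations.

```python
# Illustrative analogy only: this is not MySQL's collation implementation.
# All function names here are hypothetical.

def stored_bytes(text: str, charset: str) -> bytes:
    """Character set: decides which bytes are written for a string."""
    return text.encode(charset)


def equal_binary(a: str, b: str, charset: str = "utf-8") -> bool:
    """Binary-style rule: strings are equal only if their bytes match."""
    return stored_bytes(a, charset) == stored_bytes(b, charset)


def equal_case_insensitive(a: str, b: str) -> bool:
    """Case-insensitive-style rule: letter case is ignored when comparing."""
    return a.casefold() == b.casefold()


if __name__ == "__main__":
    print(equal_binary("MySQL", "mysql"))            # False
    print(equal_case_insensitive("MySQL", "mysql"))  # True
```

The stored bytes are the same in both cases; only the comparison rule changes the answer.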
4.3 Brief description of Unicode
Unicode is an encoding specification. Here we briefly describe the history of Unicode encoding.
Let's start with ASCII. ASCII is also an encoding standard, but it is a 7-bit code that defines only 128 characters, and even its 8-bit extensions can represent at most 256. It was designed for English; for scripts such as Chinese and Arabic, 256 characters are obviously not enough. Various countries and organizations therefore defined standards for their own languages, such as gb2312 and big5. Having every region set its own standard has obvious drawbacks, since text encoded under one standard is unreadable under another, and so the Unicode encoding specification came into being.
Unicode is also a character encoding scheme, but it is designed by an international organization and can accommodate the characters of all the world's languages. The formal name of the closely aligned ISO/IEC 10646 standard is "Universal Multiple-Octet Coded Character Set", or UCS for short. UCS is sometimes loosely read as an abbreviation of "Unicode Character Set".
There are two fixed-width encoding forms, UCS-2 and UCS-4. The former uses 2 bytes to represent a character, and the latter uses 4 bytes. Taking the commonly used UCS-2 as an example, it can represent 2^16 = 65536 code values, which is enough for essentially all European and American characters and most Asian characters.
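The arithmetic can be checked with a short Python sketch. Python has no codec named UCS-2 or UCS-4, so this example assumes `utf-16-be` and `utf-32-be` as stand-ins; for the characters used here they yield the same 2-byte and 4-byte widths. The function names are hypothetical.

```python
def code_values(num_bytes: int) -> int:
    """Number of distinct values that fit in num_bytes bytes."""
    return 2 ** (8 * num_bytes)


def fixed_width_sizes(ch: str) -> tuple:
    """Bytes used for one character in a 2-byte and a 4-byte form.

    utf-16-be / utf-32-be are assumed stand-ins for UCS-2 / UCS-4; the
    "-be" variants write no byte order mark, so only the character counts.
    """
    return (len(ch.encode("utf-16-be")), len(ch.encode("utf-32-be")))


if __name__ == "__main__":
    print(code_values(2))          # 65536
    print(fixed_width_sizes("A"))  # (2, 4)
    print(fixed_width_sizes("中"))  # (2, 4)
```

Every character costs the same number of bytes in a fixed-width form, including plain English letters.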
4.4 How to choose a suitable character set
We recommend using the smallest character set that fully meets the application's needs. A smaller character set saves storage space and reduces the number of bytes sent over the network, and the reduced storage indirectly improves system performance.
Several character sets can store Chinese characters, such as utf8, gb2312, gbk, and latin1, but the commonly used ones are gb2312 and gbk. Because gb2312 covers fewer characters than gbk, some rare characters (for example, 洺) cannot be stored in it. When choosing a character set, weigh how likely such rare characters are to appear in the application and what the impact would be. If you cannot be sure they will not appear, it is best to choose gbk.
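The space difference can be seen by encoding the same text in two character sets. This Python sketch uses Python's own `gbk` and `utf-8` codecs as stand-ins for the MySQL character sets gbk and utf8; the helper name and the sample string are assumptions chosen for illustration.

```python
def encoded_size(text: str, charset: str) -> int:
    """Number of bytes needed to store text in the given character set."""
    return len(text.encode(charset))


if __name__ == "__main__":
    sample = "数据库"  # three Chinese characters (illustrative sample)
    for charset in ("gbk", "utf-8"):
        print(charset, encoded_size(sample, charset))
    # gbk 6
    # utf-8 9
```

For text that is mostly Chinese, gbk needs 2 bytes per character where UTF-8 needs 3, while plain ASCII text takes 1 byte per character in both.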
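Whether a particular character fits a candidate character set can be tested before committing to it. The sketch below relies on Python's `gb2312` and `gbk` codecs, which are assumed here to match the repertoires of the MySQL character sets of the same names; `can_store` is a hypothetical helper.

```python
def can_store(text: str, charset: str) -> bool:
    """Return True if every character in text exists in the character set."""
    try:
        text.encode(charset)
    except UnicodeEncodeError:
        return False
    return True


if __name__ == "__main__":
    for charset in ("gb2312", "gbk"):
        print(charset, can_store("洺", charset))
    # gb2312 False
    # gbk True
```

Running a check like this over a sample of real application data (names, place names) gives a rough idea of how often rare characters would be lost under gb2312.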
4.5 MySQL character set settings
MySQL's character set and collation have default settings at four levels: server, database, table, and column. They are set in different places and serve different purposes.
The server character set and collation are determined when the MySQL server starts.
They can be set in my.cnf:
[mysqld]
default-character-set=utf8
or specified in the startup options:
mysqld --default-character-set=utf8
or specified during compilation:
./configure --with-charset=utf8
If the server character set is not specified, latin1 is used by default. The three settings above specify only the character set, not the collation, so the character set's default collation is used. To use a non-default collation, specify the collation together with the character set, for example with the collation-server option. In later MySQL releases the default-character-set server option is replaced by character-set-server.
Use the show variables like 'character_set_server'; and show variables like 'collation_server'; commands to query the character set and collation of the current server.