After synchronizing the Hive table to MySQL today, one of the columns is the only column, but when querying in MySQL The values queried by count
and distinct count
are different. From this point of view, there is duplicate data (which should not be the case, because in Hive, these two values are the same ), then I checked out the duplicate data and found that it was a case problem. Then I checked and found that by default in the MySQL database, all related operations on string fields are "case insensitive".
This is different from other popular databases.
MySQL allows you to specify case sensitivity when querying. You need to use the keyword BINARY
, the query is as follows:
SELECT * FROM student WHERE BINARY name = 'ZhangSan'; --或者 SELECT * FROM student WHERE name = BINARY 'ZhangSan';
Many times when the above problems are found in the MySQL database, the system has been running for a period of time, and the cost of using method two or three may be very high.
The biggest advantage of using this method is that it can quickly implement functions.
But this method also has great limitations: it may cause query performance to decrease because the index cannot be used.
The reason is easy to understand, because the index for the query field is also established in a case-insensitive manner.
Unless the amount of data is not large, or you don’t care about this performance loss in your application, you can only choose method two or three.
Specify specific field case sensitivity when creating the table, the example is as follows:
CREATE TABLE student ( ... name VARCHAR(64) BINARY NOT NULL, ... )
Keywords BINARY
specifies that the name field is case-sensitive.
In this way, even if the BINARY
keyword is not used in the query, the query statement is case-sensitive.
The name-related index created on this basis is also case-sensitive, so the index can be used to improve performance.
MySQL allows the use of the BINARY
keyword on most string types to indicate that all operations on this field are case-sensitive. For more information, see the MySQL official documentation.
This approach allows designers to precisely control whether each field is case-sensitive. In the design of many systems, it is often expected that all fields, or even most fields, are case-sensitive. MySQL also provides a solution, which requires using method three.
Execute the show create table <tablename>
command in MySQL, and you can see the table creation statement of a table. The example is as follows:
CREATE TABLE `table1` ( `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT, `field1` text COLLATE utf8_general_ci NOT NULL COMMENT '字段1', `field2` varchar(128) COLLATE utf8_general_ci NOT NULL DEFAULT '' COMMENT '字段2', PRIMARY KEY (`id`) ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8_unicode_ci;
We can understand most fields, but what we want to look at today is the COLLATE keyword. What does the utf8_general_ci
corresponding to this value mean? here we come to find out.
Those developed using Navicat may look familiar, because the answers have been given in the options:
The so-called utf8_general_ci
, In fact, it is a rule used for sorting. For those character type columns in MySQL, such as VARCHAR, CHAR, and TEXT type columns, a COLLATE type is required to tell MySQL how to sort and compare the column. In short, COLLATE will affect the order of the ORDER BY statement, will affect the results filtered out by the greater than and less sign in the WHERE condition, and will affect DISTINCT , GROUP BY, HAVING query results. In addition, when MySQL builds an index, if the index column is of character type, it will also affect index creation, but we cannot perceive this impact. In short, Anywhere that involves character type comparison or sorting will be related to COLLATE.
The core of various operations involving strings must involve the character sorting rule (COLLATE, also translated as "checking"). Whether MySQL's string operations are case-sensitive essentially depends on the COLLATE collation it uses.
utf8_general_ci
is a specific COLLATE value. Each specific COLLATE corresponds to a unique character set. It can be seen that the character set corresponding to this COLLATE is utf8
. Related to the issue of case sensitivity is its suffix _ci
, which the official MySQL documentation explains is the abbreviation of Case Ignore
, which means it is not case sensitive. Since MySQL specifies utf8_general_ci
as the default COLLATE of the character set utf8, this also leads to the phenomenon mentioned at the beginning of the article. At the same time, MySQL also provides other COLLATE value options, utf8_bin
is case-sensitive. In fact, all case-sensitive COLLATEs are suffixed with _bin
or _cs
, the former is the abbreviation of Binary
, and the latter is Case Sensitive
abbreviation of.
COLLATE is usually related to data encoding (CHARSET). Generally speaking, each CHARSET has multiple COLLATEs it supports, and each CHARSET specifies one. COLLATE is the default value. For example, the default COLLATE for Latin1 encoding is latin1_swedish_ci
, the default COLLATE for GBK encoding is gbk_chinese_ci
, and the default value for utf8mb4 encoding is utf8mb4_general_ci
.
By the way, here is a digression. There are two encodings in MySQL: utf8 and utf8mb4. In MySQL, please forget utf8 and always use utf8mb4. This is a legacy issue of MySQL. The utf8 in MySQL can only support character encodings up to 3bytes in length. For some text that needs to occupy 4bytes, MySQL's utf8 does not support it. You must use utf8mb4.
Many COLLATEs have the words _ci
, which is the abbreviation of Case Insensitive
, that is, case-insensitive, that is, "A"
and "a"
are treated equally when sorting and comparing. selection * from table1 where field1="a"
You can also select the value of field1 as "A"
. At the same time, for those COLLATEs with the _cs
suffix, it is Case Sensitive
, that is, case sensitive.
Use the show collation
command in MySQL to view all COLLATEs supported by MySQL. Taking utf8mb4 as an example, all COLLATEs supported by this encoding are as shown in the figure below.
In the picture we can see the sorting rules of many countries’ languages. The three commonly used ones in China are utf8mb4_general_ci
(default), utf8mb4_unicode_ci
, and utf8mb4_bin
. Let’s explore the differences between these three:
UTF8mb4_bin’s comparison method is to treat all characters as binary strings and then compare them from the highest bit to the lowest bit. So obviously it's case sensitive.
There is actually no difference between utf8mb4_unicode_ci and utf8mb4_general_ci for Chinese and English. For the system we developed for domestic use, you can choose any one. It's just that for the letters in some Western countries, utf8mb4_unicode_ci is more in line with their language habits than utf8mb4_general_ci. General is an older standard of MySQL. For example, the German letter "ß"
is equivalent to the two letters "ss"
in utf8mb4_unicode_ci (this is in line with German habits), while in utf8mb4_general_ci , but it is equivalent to the letters "s"
. However, the subtle differences between the two encodings are difficult to perceive for normal development. We rarely use text fields to sort directly. To take a step back, even if one or two letters are misaligned, can it really bring catastrophic consequences to the system? Judging from various posts and discussions found on the Internet, more people recommend using utf8mb4_unicode_ci, but they are not very resistant to systems that use the default value, and do not think there is any big problem. Conclusion: It is recommended to use utf8mb4_unicode_ci. For systems that already use utf8mb4_general_ci, there is no need to spend time modifying it.
Another thing to note is that starting from MySQL 8.0, MySQL's default CHARSET is no longer Latin1, but has been changed to utf8mb4 (reference link), and the default COLLATE has also been changed to utf8mb4_0900_ai_ci. utf8mb4_0900_ai_ci is generally a further subdivision of unicode. 0900 refers to the number of the unicode comparison algorithm (Unicode Collation Algorithm version), ai means accent insensitive (pronunciation is irrelevant), such as e, è, é, ê and ë are treated equally of. Related reference link 1, related reference link 2
MySQL database allows the library, table and Column Specify Collation at three levels. When specified together, the precedence relationship is: column > table > library.
COLLATE can be set at the instance level, library level, table level, column level, and SQL specifies . When specified simultaneously, the precedence relationship is: SQL specification > column > table > library > instance level.
The instance-level COLLATE setting is the collation_connection system variable in the MySQL configuration file or startup directive.
The library level setting COLLATE statement is as follows:
CREATE DATABASE <db_name> DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
如果库级别没有设置 CHARSET 和 COLLATE,则库级别默认的 CHARSET 和 COLLATE 使用实例级别的设置。在 MySQL 8.0 以下版本中,你如果什么都不修改,默认的 CHARSET 是 Latin1,默认的 COLLATE 是 latin1_swedish_ci。从 MySQL 8.0 开始,默认的 CHARSET 已经改为了 utf8mb4,默认的 COLLATE 改为了 utf8mb4_0900_ai_ci。
表级别的 COLLATE 设置,则是在 CREATE TABLE 的时候加上相关设置语句,例如:
CREATE TABLE table_name ( …… ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci COMMENT = '表注释';
如果表级别没有设置 CHARSET 和 COLLATE,则表级别会继承库级别的 CHARSET 与 COLLATE。
列级别的设置,则在 CREATE TABLE 中声明列的时候指定,例如
CREATE TABLE ( `field1` VARCHAR(64) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NOT NULL DEFAULT '' COMMENT '字段1', …… ) ……
如果列级别没有设置 CHARSET 和 COLATE,则列级别会继承表级别的 CHARSET 与 COLLATE。
最后,你也可以在写 SQL 查询的时候显示声明 COLLATE 来覆盖任何库表列的 COLLATE 设置,不太常用,了解即可:
SELECT DISTINCT field1 COLLATE utf8mb4_general_ci FROM table1; SELECT field1, field2 FROM table1 ORDER BY field1 COLLATE utf8mb4_unicode_ci;
如果全都显示设置了,那么优先级顺序是 SQL 语句 > 列级别设置 > 表级别设置 > 库级别设置 > 实例级别设置。
也就是说列上所指定的 COLLATE可以覆盖表上指定的 COLLATE,表上指定的 COLLATE 可以覆盖库级别的 COLLATE。如果没有指定,则继承下一级的设置。
即列上面没有指定 COLLATE,则该列的 COLLATE 和表上设置的一样。
The above is the detailed content of How to distinguish between upper and lower case when querying data stored in MySQL. For more information, please follow other related articles on the PHP Chinese website!