
Data consistency verification and repair procedures in MySQL non-master-slave environment

高洛峰
Release: 2016-11-21 14:18:26

1. Introduction

Project address: https://github.com/seanlook/p...

The pt-table-checksum tool is commonly used for data consistency checks in master-slave environments; I wrote about its principle and workflow earlier in "Use pt-table-checksum to check MySQL data consistency in the production environment". In DBA work, however, there are also cases where consistency must be checked between two tables that have no master-slave relationship at all. The pt tool works by replaying, via the binlog, the checksum statements executed on the master against the slave, so it no longer applies here.

Such special needs do come up. One example is migrating from an Alibaba Cloud RDS instance to a self-built MySQL instance: the Data Transmission Service works by table-based batch data extraction plus binlog subscription, but it forces row-format binlog, and pt-table-checksum does not have the privilege on RDS to temporarily switch the session binlog format to statement. Another example is converting the character set of an entire database: the table definitions are all utf8, but the application connects with the default latin1. To unify the connection character set and the table character set, the data can only be exported as latin1 and re-imported as utf8. A consistency check is needed in this case too; quite apart from the fact that the binlog parser in use (such as canal) does not support statement format, the byte content of the old and new databases differs, so the checksums computed by pt-table-checksum would differ and be useless.

That is why I came up with the idea of referring to pt-table-checksum and writing my own tool: px-table-checksum.

2. Implementation method

The overall idea borrows from pt-table-checksum: fetch a batch of rows, say 1000 (a chunk), from the source database, calculate its CRC32 value, run the same statement on the target database, and store both results in a third database; finally, check whether the crc values of chunks with the same number match. Knowing about an inconsistency is not enough, though; you need to be able to repair the differences quickly and conveniently. So, for the inconsistent chunks, the tool goes back to the target and source databases to find the inconsistent rows, determines whether they are missing, redundant, or modified, and generates the repair SQL; a configuration option decides whether the repair is applied automatically.
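As a minimal sketch of this chunk-checksum idea (in the style of pt-table-checksum's aggregated CRC32; the table, column, and function names are illustrative, not the actual px-table-checksum code):

    import MySQLdb

    def chunk_checksum(conn, table, columns, start_pk, chunk_size=1000):
        # Aggregate one chunk into (row count, combined crc), computed inside
        # MySQL by XOR-ing the CRC32 of each row's concatenated columns.
        cols = ", ".join(columns)
        sql = ("SELECT COUNT(*) AS cnt, "
               "COALESCE(LOWER(CONV(BIT_XOR(CAST(CRC32(CONCAT_WS('#', %s)) "
               "AS UNSIGNED)), 10, 16)), 0) AS crc "
               "FROM (SELECT %s FROM %s WHERE id >= %%s "
               "ORDER BY id LIMIT %%s) t" % (cols, cols, table))
        cur = conn.cursor()
        cur.execute(sql, (start_pk, chunk_size))
        return cur.fetchone()   # (cnt, crc), stored per chunk number

The same statement is then replayed on the target database, and the two (cnt, crc) pairs for each chunk number are compared in the results store.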

That raises several questions:

How to determine a batch, that is, how to get the next chunk? I did not want to do what pt-table-checksum does, dynamically adjusting the chunk size according to load and even suspending the check when the number of active threads exceeds a threshold; that would be too much work. Currently the number of rows per chunk is fixed, configurable to 1000, 2000, and so on.

So paging queries are used. Based on the (auto-increment or composite) primary key or unique index, each query fetches the next batch in ascending order with a LIMIT of, say, 1000, and the last row becomes the starting point of the next batch. This requires analyzing the key structure of the table and building the query conditions accordingly. Currently only tables with a primary key or unique key can be checked.
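A hedged sketch of that keyset pagination, assuming a single auto-increment primary key named id (the real tool analyzes the table's keys to build the conditions):

    def next_chunk_start(cursor, table, last_id, chunk_size=1000):
        # Fetch the next chunk's keys in ascending order; the last key
        # fetched becomes the starting point of the following batch.
        cursor.execute("SELECT id FROM %s WHERE id > %%s "
                       "ORDER BY id LIMIT %%s" % table,
                       (last_id, chunk_size))
        rows = cursor.fetchall()
        if len(rows) < chunk_size:
            return None        # short or empty chunk: this was the last batch
        return rows[-1][0]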

How to ensure that the source and target databases run the same SQL? In an earlier version, the target and source databases each used multiple threads to calculate chunks and store the results. Then I realized a serious bug: with both sides fetching 1000 rows at a time, if the target database has less data, its next chunk starts at a different row, and the comparison results become a complete mess.

So chunks with the same number must have the same starting point. I therefore use a queue to store all the checksum SQL that has already run on the source database, and imitate the pt tool by replaying it on the target database. Since multiple threads compare multiple tables at the same time, an in-process queue might consume too much memory, so a redis queue is used.
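A minimal sketch of that handoff with hotqueue (the queue name and message format are assumptions for illustration):

    from hotqueue import HotQueue

    # One of several queues; work is spread across REDIS_QUEUE_CNT of these.
    queue = HotQueue("px_checksum_0", host="127.0.0.1", port=6379, db=0)

    def enqueue_chunk(chunk_no, checksum_sql):
        # Producer (source-database thread): after running a checksum SQL on
        # the source, push the exact statement so the target replays it.
        queue.put({"chunk_no": chunk_no, "sql": checksum_sql})

    def replay_chunks(target_cursor):
        # Consumer (target-database thread): replay the identical statements.
        for msg in queue.consume():
            target_cursor.execute(msg["sql"])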

Calculate crc32 directly in the database, or fetch the data and calculate it in memory? I checked the source code of pt-table-checksum: it calculates in the database. But, as mentioned in the first section, when the target and source databases must use different character sets to read the correct data, the rows can only be fetched and then compared. So px-table-checksum supports both, controlled by a single configuration item.
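A sketch of the in-memory alternative, assuming the '#' separator mirrors the SQL version (Python 2.7, where str is bytes):

    import zlib

    def chunk_crc32_in_python(rows):
        # Fold each fetched row into one checksum, mirroring
        # BIT_XOR(CRC32(...)) in SQL; used when source and target
        # must be read with different character sets.
        crc = 0
        for row in rows:
            line = "#".join("" if v is None else str(v) for v in row)
            crc ^= zlib.crc32(line) & 0xffffffff
        return crc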

When several tables are checked at once, the source database's SQL piles up in the queue, and by the time the target database dequeues and executes a statement, a second may have passed. If the source data was modified in the meantime and the change has already been synchronized to the target, the calculated results will differ even though the data is actually consistent. There is no way to handle this, and it is the biggest flaw of px-table-checksum compared with pt-table-checksum.

To reduce such problems as much as possible (master-slave delay can cause them too), multiple redis queues are used and the target database runs multiple checking threads. For example, if 8 tables are checked at the same time, there are 8 corresponding source-database checking threads, but, depending on how heavily the tables are written, you can configure, say, 4 redis queues (chunks are currently assigned to them randomly) and 10 target-database checking threads to reduce the inaccuracy. In my view, inconsistent data gets recorded either way: if there is little of it, verify it manually; if there is a lot, run the check again. If the same rows are inconsistent twice in a row, something is really wrong.

3. Limitations

If the source table data changes frequently during the check, the results may be inaccurate; this is the fourth problem described above. Each transaction this program checks is separate, unlike the pt tool, which strictly guarantees the transaction order of each checksum SQL. But if inconsistent data shows up, checking again will usually settle it. In my production use it has been 99.9% accurate.

The table must have a primary key or unique index; the program checks for this and exits if there is none.

Binary fields such as varbinary and blob do not support repair.
Strictly speaking, it is not completely unsupported; it depends on usage. If the application converts characters to bytes before storing them in MySQL, repair is not supported. The workaround is to use the hex() function when reading from the source database, and wrap the value in unhex() in the repair SQL before writing it back, as the sketch below shows.
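A minimal sketch of that workaround (demo_tbl, the payload column, and the helper name are illustrative only):

    def gen_binary_fix(row_id, hex_value):
        # hex_value was read on the source with, e.g.:
        #   SELECT id, HEX(payload) FROM demo_tbl WHERE id = ...
        # UNHEX() restores the original bytes when the repair SQL
        # runs on the target.
        return ("UPDATE demo_tbl SET payload = UNHEX('%s') "
                "WHERE id = %d" % (hex_value, row_id))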

4. Instructions for use

This Python program is developed on 2.7 and has not been tested on 2.6 or 3.x. Install MySQLdb and hotqueue before use:

$ sudo pip install MySQL-python hotqueue

The tables to compare and all options are set in configuration files rather than on the command line (forgive this usage; parsing command-line parameters would have increased the amount of code).

4.1 px-table-checksum.py

The main program. Run python px-table-checksum.py to perform the consistency check, but be sure to understand the configuration file options below first.

4.2 settings_checksum.py

Configuration options (a sample file is sketched after the list):

CHUNK_SIZE: The number of chunk rows extracted each time

REDIS_INFO: Specify the redis queue address to use

REDIS_QUEUE_CNT: The number of redis queues; on the consumer (target database) side, one thread watches each queue

REDIS_POOL_CNT: Size of the producer (source database) redis client connection pool. This design eases the problems caused by the GIL and separates the enqueuing end from the dequeuing end, because with many tables a large amount of SQL may be enqueued in a short time, and it avoids hotqueue contention.

CALC_CRC32_DB: True means the checksum value is calculated in the db; False means the chunk data is fetched and calculated in Python. The default depends on the connection character set.

DO_COMPARE: Operation mode

0: Only extract data and calculate, do not compare consistency. You can compare later using mode 2

1: Calculate and compare. The most common mode; before each calculation, the previous results for the tables being checked are deleted. The comparison result only reports which chunk numbers are inconsistent.

2: No calculation, only comparison based on the existing t_checksum results. Also common, since calculation consumes database resources; this compares only the inconsistencies already present in the checksum results. Similar to the pt tool's --replicate-check-only option.

GEN_DATAFIX:

Used in combination with DO_COMPARE. If True, the specific inconsistent rows are located for each inconsistent chunk and repair SQL is generated; if False, nothing is done.


RUN_DATAFIX:

Used in combination with GEN_DATAFIX. If True, the generated repair SQL is executed on the target database. Use this with care: if you turn it on for a one-off repair, remember to set it back to False when done, otherwise the next check of another table could cause an accident. For this reason I added a confirmation prompt to this option.


DB_CHECKSUM: A dictionary that specifies where the checksum results are stored.

The configuration file contains an example; db_name must be specified, and the table is created automatically.
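Putting these options together, a hypothetical settings_checksum.py might look like this (all values, and the exact keys of the two dictionaries, are illustrative, not the tool's defaults):

    # settings_checksum.py -- illustrative values only
    CHUNK_SIZE = 2000
    REDIS_INFO = {"host": "127.0.0.1", "port": 6379, "db": 0}
    REDIS_QUEUE_CNT = 4
    REDIS_POOL_CNT = 8
    CALC_CRC32_DB = True   # True: checksum in the db; False: crc32 in python
    DO_COMPARE = 1         # 0: calculate only; 1: calc and compare; 2: compare only
    GEN_DATAFIX = True     # locate inconsistent rows and generate repair sql
    RUN_DATAFIX = False    # keep False unless you really mean to apply the fixes
    DB_CHECKSUM = {
        "db_name": "px_checksum",  # required; result table is created automatically
        "host": "127.0.0.1",
        "port": 3306,
        "user": "checker",
        "password": "secret",
    }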


4.3 settings_cs_tables.py

The configuration file above controls how the program runs; this one specifies the source and target database connection information and which tables to verify. A sample file is sketched after the list.

TABLES_CHECK: Dictionary specifying which tables to check for consistency. The db name is the key and a list of table names is the value. Checking an entire db is not supported yet, and comparing more than 8 tables at the same time is not recommended

DB_SOURCE: Dictionary, specifying the connection information of the source database

DB_TARGET: Dictionary, specifying the connection information of the target database
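A hypothetical settings_cs_tables.py for checking two tables in one database (connection keys and values are assumptions):

    # settings_cs_tables.py -- illustrative values only
    TABLES_CHECK = {
        "orders_db": ["t_order", "t_order_item"],  # db name -> list of tables
    }
    DB_SOURCE = {"host": "rds.example.com", "port": 3306,
                 "user": "checker", "password": "secret", "charset": "latin1"}
    DB_TARGET = {"host": "10.0.0.2", "port": 3306,
                 "user": "checker", "password": "secret", "charset": "utf8"}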
