1. Introduction to kingbus
1.1 What is kingbus?
Kingbus is a distributed MySQL binlog storage system built on the Raft consensus protocol. It can act as a MySQL slave, synchronizing the binlog from the real master and storing it in a distributed cluster, and at the same time act as a MySQL master, serving the binlog in the cluster to downstream slaves. Kingbus has the following characteristics:
Compatible with the MySQL replication protocol: kingbus synchronizes the binlog from the master by GTID, and slaves can likewise pull the binlog from kingbus by GTID.
Cross-region data replication: kingbus replicates data across regions through the Raft protocol. Binlog data written to the cluster is guaranteed to be strongly consistent across multiple nodes, and the binlog order is exactly the same as on the master.
High availability: because kingbus is built on the Raft consensus protocol, the binlog pull and push services remain available as long as more than half of the nodes in the cluster are alive.
1.2 What problems can kingbus solve?
Kingbus can reduce the master's outbound network traffic. In a replication topology with one master and many slaves, the master must send the binlog to every slave, so with enough slaves the traffic can easily hit the limit of the master's network card. For example, when the master deletes a large table or runs an online DDL, a huge number of binlog events can be generated in a short time. If 10 slaves are attached to the master, the traffic on the master's network card is amplified 10 times; with a Gigabit card, producing more than about 10 MB/s of binlog is enough to saturate it (see the quick calculation below). By connecting the slaves to kingbus instead, the transmission load can be spread across multiple machines.
Kingbus simplifies master failover: promote one of the slaves connected to kingbus to be the new master and point kingbus at it. The other slaves stay connected to kingbus, so the replication topology remains unchanged.
Kingbus saves the space the master uses to store binlog files. MySQL typically runs on relatively expensive SSDs, so if binlog files occupy a lot of space, less room is left for the actual data. By keeping the full binlog history in kingbus, the master only needs to retain a small number of binlog files.
Kingbus supports heterogeneous replication. Alibaba's open-source Canal can connect to kingbus, which continuously pushes binlog events to it; Canal forwards them to a Kafka message queue, from which they are finally stored in HBase. Business teams can then query the data with SQL through Hive for real-time analysis.
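As a quick back-of-the-envelope check of the traffic figures in the first point above (assuming roughly 10 MB/s of binlog produced on the master and 10 attached slaves):

$$10\ \text{MB/s} \times 10\ \text{slaves} = 100\ \text{MB/s} \approx 800\ \text{Mb/s},$$

which is close to the usable capacity of a Gigabit network card (1 Gb/s is about 125 MB/s).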
2. Kingbus overall architecture
The overall structure of kingbus is shown in the figure below:
Storage is responsible for storing raft log entries and metadata. In kingbus, the raft log and the MySQL binlog are unified: the data part of a raft log entry is a binlog event, and the two are distinguished only by different header information, so there is no need to store them separately, which saves storage space. Metadata is needed because kingbus must persist some meta-information, such as raft voting state and the full content of a few special binlog events (for example FORMAT_DESCRIPTION_EVENT).
Raft replication provides leader election, log replication, and related functions for the kingbus cluster, using the etcd raft library.
Binlog syncer runs only on the lead node of the raft cluster, so there is only one syncer in the whole cluster. The syncer pretends to be a slave and establishes a master-slave replication connection to the master. Based on the executed_gtid_set sent by the syncer, the master filters out the binlog events the syncer has already received and sends only the ones it has not; this is fully compatible with the MySQL master-slave replication mechanism. After receiving a binlog event, the syncer does some processing according to the event type, then wraps the event in a message and proposes it to the raft cluster. Through the raft algorithm, the binlog event is stored on multiple nodes with strong consistency.
Binlog server is a master that implements the replication protocol. A real slave connects to the port the binlog server listens on, and the binlog server sends it binlog events; the whole sending process is implemented with reference to the MySQL replication protocol. When there is no binlog event to send, the binlog server periodically sends heartbeat events to the slave to keep the replication connection alive.
API server is responsible for the management of the entire kingbus cluster, including the following:
Raft cluster membership operations: view cluster status, add a node, remove a node, update node information, and so on.
Binlog syncer related operations: start a binlog syncer, stop the binlog syncer, and check the binlog syncer status.
Binlog server related operations: start a binlog server, stop the binlog server, and check the binlog server status. Failures in the server layer do not affect the raft layer; a server can be thought of as a plug-in that is started and stopped on demand. To extend kingbus in the future, you only need to implement a server with the relevant logic: for example, a Kafka-protocol server would let clients consume the messages in kingbus through a Kafka client.
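As a rough illustration of how such a management API is typically consumed, the Go snippet below issues a plain HTTP request to an admin endpoint. The address and route used here are hypothetical placeholders, not kingbus's documented API:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
)

func main() {
	// Hypothetical admin address and route for checking binlog syncer status;
	// substitute the real host, port, and path from the kingbus documentation.
	resp, err := http.Get("http://127.0.0.1:9595/api/v1/syncer/status")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))
}
```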
3. Kingbus core implementation
3.1 Core implementation of storage
Storage holds two forms of log: the raft log, which is generated and used by the raft algorithm, and the user log (that is, MySQL binlog events). In the design of storage, the two forms are combined into a single log entry and distinguished only by different header information. Storage consists of data files and index files, as shown in the following figure:
Each segment has a fixed size (1GB) and is append-only. Its name is first_raft_index-last_raft_index, indicating the raft index range the segment covers.
Only the last segment is writable, and its file name is first_raft_index-inprogress; all other segments are read-only.
Read-only segments and their corresponding index files are written and read through mmap.
The index of the last segment is kept both on disk and in memory, so reading the index only requires reading from memory. A minimal sketch of this layout follows.
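The sketch below models the layout described above in Go; the type and field names are assumptions for illustration rather than kingbus's actual identifiers:

```go
package storage

import "fmt"

// The type and field names below are illustrative assumptions, not kingbus's
// actual code; they only sketch the segment/index scheme described above.

const segmentSize = 1 << 30 // each data file (segment) is a fixed 1GB

// entryHeader is the per-entry header that lets raft-internal entries and
// entries whose payload is a MySQL binlog event share one log.
type entryHeader struct {
	RaftIndex uint64
	RaftTerm  uint64
	Type      uint8  // e.g. raft metadata vs. binlog-event payload
	DataLen   uint32 // length of the payload that follows the header
}

// indexEntry maps a raft index to the payload's offset inside a segment.
// Sealed (read-only) segments read their index via mmap; the active segment
// also keeps a copy of its index in memory, so lookups never touch disk.
type indexEntry struct {
	RaftIndex uint64
	Offset    uint64
}

// segmentName reproduces the naming rule: sealed segments are named
// "<first_raft_index>-<last_raft_index>", the writable one
// "<first_raft_index>-inprogress".
func segmentName(firstIndex, lastIndex uint64, sealed bool) string {
	if sealed {
		return fmt.Sprintf("%d-%d", firstIndex, lastIndex)
	}
	return fmt.Sprintf("%d-inprogress", firstIndex)
}
```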
3.2 Use of etcd raft library
The etcd raft library processes appended logs, committed entries, and so on in a single goroutine (see the link for the specific functions). The processing time of this loop must therefore be as short as possible: if it exceeds the raft election timeout, the cluster will trigger a new election. This point needs special attention.
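For reference, below is a minimal sketch of the standard etcd raft driving loop; the import path assumes a recent etcd release (kingbus itself may pin an older path). Everything inside the Ready branch runs on this one goroutine, which is why applying committed entries must stay fast:

```go
package raftloop

import (
	"time"

	"go.etcd.io/etcd/raft/v3"
	"go.etcd.io/etcd/raft/v3/raftpb"
)

func runRaft() {
	storage := raft.NewMemoryStorage()
	cfg := &raft.Config{
		ID:              1,
		ElectionTick:    10,
		HeartbeatTick:   1,
		Storage:         storage,
		MaxSizePerMsg:   1 << 20,
		MaxInflightMsgs: 256,
	}
	node := raft.StartNode(cfg, []raft.Peer{{ID: 1}})

	ticker := time.NewTicker(100 * time.Millisecond)
	defer ticker.Stop()

	for {
		select {
		case <-ticker.C:
			node.Tick()
		case rd := <-node.Ready():
			// Persist hard state and new entries, send messages to peers,
			// then apply committed entries. All of this runs on this single
			// goroutine, so it must finish well within the election timeout
			// (snapshot handling is omitted here for brevity).
			if !raft.IsEmptyHardState(rd.HardState) {
				storage.SetHardState(rd.HardState)
			}
			storage.Append(rd.Entries)
			send(rd.Messages)
			for _, entry := range rd.CommittedEntries {
				apply(entry) // e.g. hand a binlog payload to storage
			}
			node.Advance()
		}
	}
}

func send(msgs []raftpb.Message) { /* transport to peers, omitted */ }
func apply(e raftpb.Entry)       { /* process a committed entry, omitted */ }
```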
3.3 Core implementation of binlog syncer
The main jobs of the binlog syncer are:
Pull binlog events.
Parse and process binlog events.
Submit binlog events to the raft cluster. Clearly, a pipeline can speed up the whole process: kingbus uses a separate goroutine for each stage and connects the stages with pipes (see the sketch below). Because the binlog syncer receives binlog events one at a time, it cannot by itself guarantee transaction integrity: if the syncer crashes and has to reconnect to the master, the last transaction it received may be incomplete. The syncer therefore needs the ability to recognize whether a transaction is complete, and kingbus implements this transaction-integrity parsing with reference to the MySQL source code.
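Below is a minimal sketch of that pipeline shape in Go; the event type and stage bodies are placeholders for illustration, not kingbus's actual code:

```go
package syncer

// Pipeline sketch: pull -> parse -> propose, one goroutine per stage,
// connected by channels. All names here are illustrative placeholders.

type binlogEvent struct {
	Header []byte
	Body   []byte
}

func runPipeline(done <-chan struct{}) {
	rawEvents := make(chan binlogEvent, 1024)    // pulled from the master
	parsedEvents := make(chan binlogEvent, 1024) // after type-specific handling

	// Stage 1: pretend to be a slave and pull events from the master.
	go func() {
		defer close(rawEvents)
		for {
			select {
			case <-done:
				return
			case rawEvents <- pullFromMaster():
			}
		}
	}()

	// Stage 2: parse each event, e.g. track GTIDs and detect transaction
	// boundaries so a half-received transaction can be handled on restart.
	go func() {
		defer close(parsedEvents)
		for ev := range rawEvents {
			parsedEvents <- handleEvent(ev)
		}
	}()

	// Stage 3: wrap the event in a message and propose it to the raft
	// cluster, which replicates it to a quorum of nodes.
	go func() {
		for ev := range parsedEvents {
			proposeToRaft(ev)
		}
	}()
}

func pullFromMaster() binlogEvent           { return binlogEvent{} } // placeholder
func handleEvent(e binlogEvent) binlogEvent { return e }             // placeholder
func proposeToRaft(e binlogEvent)           { /* node.Propose, omitted */ }
```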
3.4 Core implementation of binlog server
The binlog server implements the functionality of a master. When a slave establishes a replication connection to the binlog server, it sends a series of commands that the binlog server must respond to before it finally starts streaming binlog events. For each slave, the binlog server starts a goroutine that continuously reads the raft log, strips the raft header to recover the binlog event, and sends it to the slave.
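Below is a minimal sketch of that per-slave send loop in Go; the helper functions and the heartbeat interval are assumptions for illustration, not kingbus's actual code:

```go
package server

import "time"

// All helpers and the heartbeat interval below are illustrative assumptions.

// slaveConn abstracts a slave's replication connection.
type slaveConn interface {
	WriteEvent(event []byte)
}

// serveSlave runs in its own goroutine for each connected slave. It tails the
// replicated log from the position the slave asked for, strips the raft header
// so only the original binlog event remains, and writes the event to the
// connection. When nothing new arrives, it periodically sends a heartbeat
// event so the slave keeps the replication connection alive.
func serveSlave(conn slaveConn, startIndex uint64, stop <-chan struct{}) {
	const heartbeatInterval = 30 * time.Second // assumption, not kingbus's value
	heartbeat := time.NewTicker(heartbeatInterval)
	defer heartbeat.Stop()

	next := startIndex
	for {
		select {
		case <-stop:
			return
		case <-heartbeat.C:
			conn.WriteEvent(heartbeatEvent())
		default:
			entry, ok := readRaftLog(next)
			if !ok {
				time.Sleep(10 * time.Millisecond) // nothing new to send yet
				continue
			}
			conn.WriteEvent(stripRaftHeader(entry)) // raw binlog event bytes
			next++
			heartbeat.Reset(heartbeatInterval)
		}
	}
}

func readRaftLog(index uint64) (entry []byte, ok bool) { return nil, false } // placeholder
func stripRaftHeader(entry []byte) []byte              { return entry }      // placeholder
func heartbeatEvent() []byte                           { return nil }        // placeholder
```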