As businesses grow and become more complex, almost every company's system evolves from a monolith to a distributed architecture, and in particular to a microservice architecture. At that point the problem of distributed transactions becomes unavoidable.
This article first introduces the relevant basic theory, then summarizes the classic distributed transaction solutions, and finally presents solutions to the out-of-order execution of sub-transactions (idempotence, empty compensation, and hanging).
Basic Theory
Before explaining the specific solutions, let us first go through the basic theory behind distributed transactions.
Let's take a transfer as an example. A needs to transfer 100 yuan to B, so A's balance must decrease by 100 yuan and B's balance must increase by 100 yuan. The whole transfer must guarantee that A-100 and B+100 either both succeed or both fail. Let's see how this problem is solved in various scenarios.
Transaction
The ability to execute multiple statements as a single unit is called a database transaction. A database transaction guarantees that all operations within its scope either all succeed or all fail.
Transactions have 4 attributes: atomicity, consistency, isolation, and durability. These four properties are often called ACID properties.
- Atomicity: All operations in a transaction either complete in full or do not happen at all; the transaction never ends in an intermediate state. If an error occurs during execution, the database is restored to the state before the transaction started, as if the transaction had never been executed.
- Consistency: The integrity of the database is preserved both before the transaction starts and after it ends. Integrity constraints, including foreign key constraints and application-defined constraints, are not violated.
- Isolation: The database allows multiple concurrent transactions to read and modify its data at the same time. Isolation prevents the data inconsistency that interleaved execution of concurrent transactions could otherwise cause.
- Durability: After the transaction is completed, the modification to the data is permanent and will not be lost even if the system fails.
If our business system is not complicated and can modify data and complete transfers in one database and one service, then we can use database transactions to ensure the correct completion of the transfer business.
Distributed Transaction
An inter-bank transfer is a typical distributed transaction scenario. Suppose A needs to transfer money to B across banks. The transfer involves data held by two banks, so the local transaction of a single database cannot guarantee the ACID properties of the transfer; this can only be solved with a distributed transaction.
A distributed transaction is one in which the transaction initiator, the resources and resource managers, and the transaction coordinator are located on different nodes of a distributed system. In the transfer above, the A-100 operation and the B+100 operation are not located on the same node. In essence, distributed transactions exist to guarantee the correct execution of data operations in distributed scenarios.
In a distributed environment, distributed transactions relax the requirements for consistency and isolation in order to meet the needs of availability, performance, and degraded service. On the one hand, they follow the BASE theory (BASE covers a lot of ground; interested readers can refer to BASE theory):
- Basically Available
- Soft state
- Eventual consistency
On the other hand, distributed transactions also partially follow the ACID properties:
- Atomicity: strictly followed
- Consistency: strictly followed after the transaction completes; consistency during the transaction can be relaxed appropriately
- Isolation: parallel transactions must not affect each other; the visibility of intermediate transaction results may be relaxed within safe limits
- Durability: strictly followed
Distributed transaction solutions
Because no distributed transaction solution can provide full ACID guarantees, there is no perfect solution that solves every business problem. In practice, the most suitable distributed transaction solution is chosen according to the characteristics of the business.
Two-phase commit/XA
XA is a distributed transaction specification proposed by the X/Open organization. The XA specification mainly defines the interface between the (global) transaction manager (TM) and the (local) resource managers (RM). Local databases such as MySQL play the role of RM in XA.
XA is divided into two phases:
The first phase (prepare): every participant RM prepares to execute its transaction and locks the resources it needs. When a participant is ready, it reports to the TM that it is ready.
The second phase (commit/rollback): when the transaction manager (TM) has confirmed that all participants (RM) are ready, it sends the commit command to all participants.
Most mainstream databases currently support XA transactions, including MySQL, Oracle, SQL Server, and PostgreSQL.
An XA transaction involves one or more resource managers (RM), a transaction manager (TM), and an application program (AP).
The three roles of RM, TM, and AP are the classic role division, and they run through the later transaction modes such as SAGA and TCC.
Taking the above transfer as an example, a successfully completed XA transaction sequence diagram is as follows:
If any participant fails to prepare, the TM notifies all participants that have already completed prepare to roll back.
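The two phases and the rollback-on-failure rule can be sketched in a few lines of Go. This is only an in-memory illustration, not real XA: the `Participant` interface, the `account` type, and `RunTwoPhase` are hypothetical names invented for this sketch, and a real RM would be a database speaking the XA protocol.

```go
package main

import (
	"errors"
	"fmt"
)

// Participant is a hypothetical stand-in for an XA resource manager (RM).
type Participant interface {
	Prepare() error // phase 1: check, lock resources, and vote
	Commit()        // phase 2: make the prepared work permanent
	Rollback()      // phase 2: discard the prepared work
}

// account is a toy RM that holds one balance.
type account struct {
	name    string
	balance int
	delta   int  // change requested by the current transaction
	pending bool // true between a successful Prepare and Commit/Rollback
}

func (a *account) Prepare() error {
	if a.balance+a.delta < 0 {
		return errors.New(a.name + ": insufficient balance")
	}
	a.pending = true
	return nil
}

func (a *account) Commit() {
	if a.pending {
		a.balance += a.delta
		a.pending = false
	}
}

func (a *account) Rollback() { a.pending = false }

// RunTwoPhase plays the role of the TM.
func RunTwoPhase(parts ...Participant) error {
	prepared := make([]Participant, 0, len(parts))
	for _, p := range parts {
		if err := p.Prepare(); err != nil {
			// One participant failed to prepare: roll back those already prepared.
			for _, q := range prepared {
				q.Rollback()
			}
			return err
		}
		prepared = append(prepared, p)
	}
	// Every participant is ready: send commit to all of them.
	for _, p := range parts {
		p.Commit()
	}
	return nil
}

func main() {
	a := &account{name: "A", balance: 100, delta: -100}
	b := &account{name: "B", balance: 0, delta: 100}
	fmt.Println(RunTwoPhase(a, b), a.balance, b.balance)
}
```

The sketch leaves out what makes real XA hard: the TM must persist its decision so that it can finish phase two after a crash, and locks stay held between the two phases.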
The characteristics of XA transactions are:
- Simple and easy to understand, easier to develop
- Long-term locking of resources, low concurrency
Readers who want to study XA further in Go, PHP, Python, Java, C#, Node, and other languages can refer to DTM.
SAGA
Saga is a solution proposed in the database paper "Sagas". The core idea is to split a long transaction into multiple short local transactions, coordinated by a Saga transaction coordinator. If every step completes normally, the transaction completes normally. If a step fails, the compensation operations are invoked one by one in reverse order.
Taking the above transfer as an example, a successfully completed SAGA transaction sequence diagram is as follows:
Once a Saga enters the Cancel stage, Cancel is not allowed to fail in terms of business logic. If success is not returned because of a network or other temporary failure, the TM keeps retrying until Cancel returns success.
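The forward and compensation flow can be sketched as follows. This is a simplified in-memory illustration under our own assumptions, not the DTM API: `Step`, `RunSaga`, and the two error variables are hypothetical names, and the retry loop has no backoff or persistence, which a real coordinator would need.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical errors used only by this sketch.
var (
	errTemporary = errors.New("temporary network failure")
	errFrozen    = errors.New("account B is frozen")
)

// Step pairs a forward action with its compensation.
type Step struct {
	Name       string
	Action     func() error
	Compensate func() error
}

// RunSaga executes the steps in order. When a step fails, it compensates the
// steps that already succeeded, in reverse order, retrying each compensation
// until it reports success.
func RunSaga(steps []Step) error {
	for i, s := range steps {
		if err := s.Action(); err != nil {
			for j := i - 1; j >= 0; j-- {
				for steps[j].Compensate() != nil {
					// Retry immediately; a real coordinator would wait between attempts.
				}
			}
			return fmt.Errorf("step %q failed: %w", s.Name, err)
		}
	}
	return nil
}

func main() {
	balance := map[string]int{"A": 100, "B": 0}
	attempts := 0
	steps := []Step{
		{
			Name:   "TransOut",
			Action: func() error { balance["A"] -= 100; return nil },
			Compensate: func() error {
				attempts++
				if attempts == 1 {
					return errTemporary // the first compensation call is lost
				}
				balance["A"] += 100
				return nil
			},
		},
		{
			Name:       "TransIn",
			Action:     func() error { return errFrozen },
			Compensate: func() error { return nil },
		},
	}
	err := RunSaga(steps)
	fmt.Println(err, balance["A"], balance["B"], attempts)
}
```

Between the deduction and its compensation, A's balance is visibly lower. That window is the weak consistency discussed in the characteristics below.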
Characteristics of Saga transactions:
- High concurrency, no need to lock resources for a long time like XA transactions
- Both the normal operations and the compensation operations must be defined, so the development effort is larger than for XA
- Consistency is weak. For a transfer, user A's money may already have been deducted when the transfer ultimately fails
The paper covers much more about SAGA, including two recovery strategies and the concurrent execution of branch transactions. Our discussion here covers only the simplest form of SAGA.
SAGA suits many scenarios, in particular long transactions and business scenarios that are not sensitive to intermediate results.
If readers want to further study SAGA, they can refer to DTM, which includes examples of SAGA success and failure rollback, as well as the handling of various network exceptions.
TCC
The concept of TCC (Try-Confirm-Cancel) was first presented by Pat Helland in the 2007 paper "Life beyond Distributed Transactions: an Apostate's Opinion".
TCC is divided into 3 phases:
- Try phase: try to execute, complete all business checks (consistency), reserve necessary business resources (quasi-isolation)
- Confirm phase: confirm execution and actually execute the business, without any business checks, using only the business resources reserved in the Try phase. The Confirm operation must be idempotent, and it must be retried after a failure.
- Cancel phase: cancel execution and release the business resources reserved in the Try phase. Exception handling in the Cancel phase is basically the same as in the Confirm phase, and it also requires an idempotent design.
Taking the above transfer as an example, the amount is usually frozen in Try but not deducted, the amount is deducted in Confirm, and the amount is unfrozen in Cancel. A successfully completed TCC transaction sequence diagram is as follows:
TCC's Confirm/Cancel phases are not allowed to return failure in terms of business logic. If success cannot be returned because of a network or other temporary failure, the TM keeps retrying until Confirm/Cancel returns success.
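The freeze, deduct, and unfreeze steps for the paying account can be sketched like this. It is a minimal in-memory illustration, not a framework API: `Account`, `TryOut`, `ConfirmOut`, and `CancelOut` are hypothetical names, and reservations are keyed by a global transaction id (`gid`) so that repeated Confirm or Cancel calls are harmless.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrInsufficient is returned when Try cannot reserve the requested amount.
var ErrInsufficient = errors.New("insufficient available balance")

// Account is a hypothetical TCC participant for the paying side of a transfer.
type Account struct {
	Balance  int
	reserved map[string]int // gid -> amount frozen by Try
}

func NewAccount(balance int) *Account {
	return &Account{Balance: balance, reserved: map[string]int{}}
}

// available is the balance minus everything currently frozen.
func (a *Account) available() int {
	sum := 0
	for _, v := range a.reserved {
		sum += v
	}
	return a.Balance - sum
}

// TryOut performs the business check and freezes the amount without deducting it.
func (a *Account) TryOut(gid string, amount int) error {
	if _, dup := a.reserved[gid]; dup {
		return nil // repeated Try for the same transaction
	}
	if a.available() < amount {
		return ErrInsufficient
	}
	a.reserved[gid] = amount
	return nil
}

// ConfirmOut deducts the frozen amount. Calling it again is a no-op.
func (a *Account) ConfirmOut(gid string) {
	if amount, ok := a.reserved[gid]; ok {
		a.Balance -= amount
		delete(a.reserved, gid)
	}
}

// CancelOut releases the frozen amount. Calling it again is a no-op.
func (a *Account) CancelOut(gid string) {
	delete(a.reserved, gid)
}

func main() {
	a := NewAccount(100)
	fmt.Println(a.TryOut("gid-1", 100), a.Balance) // frozen, not yet deducted
	fmt.Println(a.TryOut("gid-2", 50))             // rejected: the funds are frozen
	a.ConfirmOut("gid-1")
	a.ConfirmOut("gid-1") // a repeated Confirm changes nothing
	fmt.Println(a.Balance)
}
```

This sketch only covers repeated Confirm/Cancel calls. It does not protect against a late Try that arrives after Cancel, which is the hanging problem discussed later in the article.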
TCC features are as follows:
- High concurrency and no long-term resource locking.
- The development effort is large, because the Try/Confirm/Cancel interfaces must all be provided.
- Consistency is good: unlike SAGA, the money is never deducted for a transfer that then fails.
- TCC is suitable for order-type business and business with constraints on the intermediate state
If readers want to further study TCC, they can refer to DTM
Local message table
This solution originated from an article that eBay architect Dan Pritchett published with the ACM in 2008. The core of the design is to use messages to asynchronously guarantee the execution of tasks that require distributed processing.
The general process is as follows:
Writing the local message and performing the business operation are placed in one transaction, which guarantees the atomicity of the business operation and the message sending: they either all succeed or all fail.
Fault tolerance mechanism:
- When the balance deduction transaction fails, the transaction is rolled back directly and no subsequent steps run
- If polling and producing the message fails, or the increase-balance transaction fails, the step is retried
Features of the local message table:
- A long transaction only needs to be split into multiple tasks, which is simple to use
- The producer needs to create an additional message table
- Each local message table needs to be polled
- If the consumer logic cannot succeed through retry, more mechanisms are needed to roll back the operation
Applicable to businesses that can be executed asynchronously and whose subsequent operations do not require rollback
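The mechanism can be sketched in memory as follows. This is an illustration under our own assumptions rather than production code: `ProducerDB`, `ConsumerDB`, `Message`, and the method names are hypothetical, and a plain function stands in for the single database transaction that would cover both the balance update and the message insert.

```go
package main

import (
	"errors"
	"fmt"
)

var (
	ErrInsufficient = errors.New("insufficient balance")
	errUnavailable  = errors.New("consumer unavailable") // simulated delivery failure
)

// Message is a row in the hypothetical local message table.
type Message struct {
	ID     int
	Amount int
	Sent   bool
}

// ProducerDB mimics bank A's database. The balance and the message table live
// in the same database, so one local transaction can cover both.
type ProducerDB struct {
	Balance  int
	Messages []*Message
}

// Deduct changes the balance and writes the message as one unit.
func (db *ProducerDB) Deduct(amount int) error {
	if db.Balance < amount {
		return ErrInsufficient // nothing is written: the local transaction rolls back
	}
	db.Balance -= amount
	db.Messages = append(db.Messages, &Message{ID: len(db.Messages) + 1, Amount: amount})
	return nil
}

// Poll tries to deliver every unsent message and marks a message as sent only
// after delivery succeeded, so failed deliveries are retried by the next poll.
func (db *ProducerDB) Poll(deliver func(Message) error) {
	for _, m := range db.Messages {
		if m.Sent {
			continue
		}
		if err := deliver(*m); err == nil {
			m.Sent = true
		}
	}
}

// ConsumerDB mimics bank B. Remembering processed message IDs keeps the credit
// idempotent when the same message is delivered more than once.
type ConsumerDB struct {
	Balance   int
	processed map[int]bool
}

func NewConsumerDB() *ConsumerDB { return &ConsumerDB{processed: map[int]bool{}} }

func (db *ConsumerDB) Credit(m Message) error {
	if db.processed[m.ID] {
		return nil
	}
	db.Balance += m.Amount
	db.processed[m.ID] = true
	return nil
}

func main() {
	a := &ProducerDB{Balance: 100}
	b := NewConsumerDB()
	_ = a.Deduct(100)
	a.Poll(func(Message) error { return errUnavailable }) // delivery fails, message stays pending
	a.Poll(b.Credit)                                      // the retry succeeds
	a.Poll(b.Credit)                                      // nothing left to send
	fmt.Println(a.Balance, b.Balance)
}
```

The consumer is made idempotent because a message can be delivered again whenever the producer fails to record that it was sent.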
Transaction Message
In the local message table solution above, the producer needs to create an additional message table and poll it, which places a heavy burden on the business. Alibaba's open source RocketMQ has officially supported transaction messages since version 4.3. A transaction message essentially moves the local message table into RocketMQ, which solves the atomicity problem between sending the message and executing the local transaction on the producer side.
Sending and committing a transaction message:
- Send the message (half message)
- The server stores the message and responds with the write result
- Execute the local transaction according to the send result (if the write failed, the half message is not visible to the business and the local logic is not executed)
- Execute Commit or Rollback according to the local transaction status (a Commit publishes the message and makes it visible to consumers)
The normal sending flow chart is as follows:
Compensation process:
For a transaction message that has received neither Commit nor Rollback (a message in pending status), the server initiates a "check-back".
The producer receives the check-back message and returns the status of the local transaction corresponding to the message, which is Commit or Rollback.
The transaction message scheme is very similar to the local message table mechanism. The main difference is that the original local table operations are replaced by a check-back (reverse query) interface.
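The half message, commit/rollback, and check-back flow can be sketched with a toy in-memory broker. Everything here is hypothetical and written for illustration: `Broker`, `SendHalf`, `End`, `CheckBack`, and `SendInTransaction` are not the RocketMQ client API, and the `loseReport` flag simply simulates a producer that crashes before reporting the local transaction result.

```go
package main

import (
	"errors"
	"fmt"
)

var ErrInsufficient = errors.New("insufficient balance")

// State is the status of a transaction message on the broker.
type State int

const (
	Pending State = iota // half message: stored but invisible to consumers
	Committed
	RolledBack
)

// Broker is a toy message server that supports half messages.
type Broker struct {
	states map[string]State
	Queue  []string // messages visible to consumers
}

func NewBroker() *Broker { return &Broker{states: map[string]State{}} }

// SendHalf stores the message in pending status.
func (b *Broker) SendHalf(id string) { b.states[id] = Pending }

// End resolves a pending message; only a commit publishes it.
func (b *Broker) End(id string, s State) {
	if cur, ok := b.states[id]; !ok || cur != Pending {
		return // unknown or already resolved
	}
	b.states[id] = s
	if s == Committed {
		b.Queue = append(b.Queue, id)
	}
}

// CheckBack asks the producer for the local transaction status of every
// message that is still pending.
func (b *Broker) CheckBack(query func(id string) State) {
	for id, s := range b.states {
		if s == Pending {
			b.End(id, query(id))
		}
	}
}

// SendInTransaction sends the half message, runs the local transaction, and
// reports Commit or Rollback unless loseReport simulates a lost report.
func SendInTransaction(b *Broker, id string, localTx func() error, loseReport bool) error {
	b.SendHalf(id)
	err := localTx()
	if loseReport {
		return err
	}
	if err != nil {
		b.End(id, RolledBack)
	} else {
		b.End(id, Committed)
	}
	return err
}

func main() {
	broker := NewBroker()
	balance := 100
	done := map[string]bool{} // local record of finished transactions
	deduct := func() error {
		if balance < 100 {
			return ErrInsufficient
		}
		balance -= 100
		done["msg-1"] = true
		return nil
	}
	_ = SendInTransaction(broker, "msg-1", deduct, true) // the commit report is lost
	fmt.Println(len(broker.Queue))                       // the message is still invisible
	broker.CheckBack(func(id string) State {
		if done[id] {
			return Committed
		}
		return RolledBack
	})
	fmt.Println(broker.Queue, balance)
}
```

The check-back callback plays the part that polling the local message table played in the previous solution: it lets the server find out whether the local transaction actually committed.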
The characteristics of transaction messages are as follows:
- A long transaction only needs to be split into multiple tasks and to provide a check-back (reverse query) interface, which is simple to use
- If the consumer's logic cannot succeed through retry, more mechanisms are needed to roll back the operation
Applicable to businesses that can be executed asynchronously and whose subsequent operations do not require rollback
If readers want to further study transaction messages, they can refer to DTM or RocketMQ
Best effort notification
The notifying party does its best, through some mechanism, to notify the receiving party of the business processing result. Specifically, this includes:
A repeated-notification mechanism. Because the receiving party may not receive a notification, there must be a mechanism to send it repeatedly.
A message reconciliation mechanism. If the receiver still has not been notified despite the best effort, or if the receiver wants to consume the message again after consuming it, the receiver can actively query the notifier for the message.
The local message table and transaction messages introduced earlier are both reliable messages. How are they different from the best effort notification introduced here?
Reliable message consistency: the notifying party must ensure that the message is sent out and delivered to the receiving party. The reliability of the message is guaranteed primarily by the notifying party.
Best effort notification: the notifying party does its best to notify the receiving party of the business processing result, but the message may not be received. In that case the receiving party must actively call the notifying party's interface to query the business processing result. The reliability of the notification depends primarily on the receiving party.
As a solution, best effort notification requires:
- An interface through which the recipient of the notification can query the business processing result
- A message queue ACK mechanism: the message queue gradually increases the notification interval, at 1min, 5min, 10min, 30min, 1h, 2h, 5h, and 10h, until the upper limit of the required notification time window is reached. No further notifications are sent after that
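The increasing retry schedule can be sketched as below. This is a small illustration, not a message queue implementation: `NextDelay`, `Notify`, and `noSleep` are hypothetical names, and the sleep function is injected so that the demo does not actually wait for hours.

```go
package main

import (
	"fmt"
	"time"
)

// retryIntervals is the notification schedule listed above.
var retryIntervals = []time.Duration{
	1 * time.Minute, 5 * time.Minute, 10 * time.Minute, 30 * time.Minute,
	1 * time.Hour, 2 * time.Hour, 5 * time.Hour, 10 * time.Hour,
}

// NextDelay returns how long to wait before retry number n (0-based).
// The second result is false once the notification window is exhausted.
func NextDelay(n int) (time.Duration, bool) {
	if n < 0 || n >= len(retryIntervals) {
		return 0, false
	}
	return retryIntervals[n], true
}

// Notify calls send until the receiver acknowledges or the schedule runs out.
// It returns the number of attempts made and whether an ACK was received.
// Production code would pass time.Sleep (or a timer) as sleep.
func Notify(send func() bool, sleep func(time.Duration)) (attempts int, acked bool) {
	for n := 0; ; n++ {
		attempts++
		if send() {
			return attempts, true
		}
		delay, ok := NextDelay(n)
		if !ok {
			return attempts, false // window exhausted: stop notifying
		}
		sleep(delay)
	}
}

// noSleep skips the waiting so the example finishes immediately.
func noSleep(time.Duration) {}

func main() {
	calls := 0
	attempts, acked := Notify(func() bool { calls++; return calls == 3 }, noSleep)
	fmt.Println(attempts, acked)
}
```

When `Notify` gives up, the receiver is expected to fall back to the query interface described in the first bullet.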
Best effort notification suits business notification scenarios. For example, the results of WeChat transactions are notified to each merchant through best effort notification, with both callback notifications and a transaction query interface.
AT transaction mode
This is a transaction mode in Alibaba's open source project Seata, also known as FMT at Ant Financial. Its advantage is that it is used in a way similar to XA mode: the business does not need to write compensation operations, and rollback is completed automatically by the framework. Its disadvantage is also similar to XA: locks are held for a long time, so it does not suit high-concurrency scenarios. In terms of performance, AT mode is better than XA, but it introduces new problems such as dirty rollback. Interested readers can refer to seata-AT.
Exception handling
Problems such as network and business failures may occur at every step of a distributed transaction. These problems require the business side of a distributed transaction to achieve three properties: protection against empty rollback, idempotence, and protection against hanging.
Exceptions
The following uses TCC transactions to illustrate these exceptions:
Empty rollback:
The second-phase Cancel method is called although the Try method of the TCC resource was never called. The Cancel method needs to recognize that this is an empty rollback and directly return success.
The cause is that when a branch transaction suffers a service outage or a network abnormality, the branch transaction call is recorded as failed, although the Try phase was never actually executed. When the fault is recovered, the distributed transaction is rolled back and the second-phase Cancel method is called, resulting in an empty rollback.
Idempotence:
Since any request may experience network anomalies and be sent repeatedly, all distributed transaction branches need to guarantee idempotence.
Hanging:
Hanging means that, for a distributed transaction, the second-phase Cancel interface is executed before the Try interface.
The cause is that when a branch transaction Try is called over RPC, the branch transaction is registered first and then the RPC call is made. If the network is congested at that moment, the RPC times out and the TM notifies the RM to roll back the distributed transaction. The Try RPC request may reach the participant and actually be executed only after the rollback has completed.
Let’s look at a timing diagram of network anomalies to better understand the above problems
- When the business processes request 4, Cancel is executed before Try, so empty rollback must be handled
- When the business processes request 6, Cancel is executed repeatedly, so it must be idempotent
- When the business processes request 8, Try is executed after Cancel, so hanging must be handled
Faced with the complex network anomalies above, the solution currently proposed by various companies is for the business side to use a unique key to query whether the associated operation has been completed and, if so, to return success directly. The judgment logic involved is complex, error-prone, and a heavy burden on the business.
Sub-transaction barrier
The project https://github.com/yedf/dtm introduces a sub-transaction barrier technique that achieves the following effect. See the schematic diagram:
After all these requests reach the sub-transaction barrier, abnormal requests are filtered out and normal requests pass the barrier. Once developers use the sub-transaction barrier, all of the exceptions above are handled properly; business developers only need to focus on the actual business logic, and their burden is greatly reduced.
The sub-transaction barrier provides the method ThroughBarrierCall. The prototype of the method is:
func ThroughBarrierCall(db *sql.DB, transInfo *TransInfo, busiCall BusiFunc)
Business developers write their own logic in busiCall and pass it to this function. ThroughBarrierCall guarantees that busiCall is not called in scenarios such as empty rollback and hanging; when the business is called repeatedly, idempotence control ensures that it is committed only once.
The sub-transaction barrier manages TCC, SAGA, transaction messages, and so on, and can also be extended to other areas.
Principle of sub-transaction barrier
The principle of the sub-transaction barrier technique is to create a branch transaction status table, sub_trans_barrier, in the local database. Its unique key is global transaction id - sub-transaction id - sub-transaction branch name (try|confirm|cancel).
- Open a transaction
- If it is a Try branch, insert ignore gid-branchid-try. If the insert succeeds, call the logic within the barrier
- If it is a Confirm branch, insert ignore gid-branchid-confirm. If the insert succeeds, call the logic within the barrier
- If it is a Cancel branch, insert ignore gid-branchid-try and then insert ignore gid-branchid-cancel. If try is not inserted and cancel is inserted successfully, call the logic within the barrier
- If the logic within the barrier returns success, commit the transaction and return success
- If the logic within the barrier returns an error, roll back the transaction and return an error
Under this mechanism, the problems related to network exceptions are solved:
- Empty compensation control: if Try was not executed and Cancel is executed directly, the Cancel branch's insert of gid-branchid-try succeeds, so the logic within the barrier is not run, which guarantees empty compensation control
- Idempotence control: no branch can insert the same unique key twice, which guarantees that the logic is not executed repeatedly
- Anti-hanging control: if Try is executed after Cancel, the insert of gid-branchid-try does not succeed, so Try is not executed, which guarantees anti-hanging control
SAGA, transaction messages, and so on use a similar mechanism.
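The insert-ignore rules above can be tried out with an in-memory sketch. This is not the dtm implementation: the `Barrier` type and its `Call` method are hypothetical, a Go map stands in for the sub_trans_barrier table and its unique key, and deleting the rows inserted by a failed call stands in for rolling back the database transaction.

```go
package main

import (
	"errors"
	"fmt"
)

// errBusiness is a hypothetical business failure used by the demo.
var errBusiness = errors.New("business logic failed")

// Barrier simulates the sub_trans_barrier table as a set of unique keys.
type Barrier struct {
	rows map[string]bool
}

func NewBarrier() *Barrier { return &Barrier{rows: map[string]bool{}} }

// Call runs busiCall only when the request passes the barrier. Filtered
// requests return nil, that is, they report success without doing any work.
func (b *Barrier) Call(gid, branchID, op string, busiCall func() error) error {
	var inserted []string
	// insert mimics INSERT IGNORE: it reports whether a new row was written.
	insert := func(o string) bool {
		key := gid + "-" + branchID + "-" + o
		if b.rows[key] {
			return false
		}
		b.rows[key] = true
		inserted = append(inserted, key)
		return true
	}

	var pass bool
	if op == "cancel" {
		tryInserted := insert("try")
		cancelInserted := insert("cancel")
		// Try must have run before (its row already existed) and this must be
		// the first Cancel.
		pass = !tryInserted && cancelInserted
	} else {
		pass = insert(op)
	}
	if !pass {
		return nil // the rows written above stay, which blocks a late Try
	}
	if err := busiCall(); err != nil {
		for _, key := range inserted {
			delete(b.rows, key) // simulate rolling back the transaction
		}
		return err
	}
	return nil
}

func main() {
	b := NewBarrier()
	run := func(name string) func() error {
		return func() error { fmt.Println("executed:", name); return nil }
	}
	_ = b.Call("gid1", "branch1", "cancel", run("cancel")) // empty rollback: filtered
	_ = b.Call("gid1", "branch1", "try", run("try"))       // hanging Try: filtered
	_ = b.Call("gid2", "branch1", "try", run("try"))       // normal Try: executed
	_ = b.Call("gid2", "branch1", "try", run("try"))       // duplicate Try: filtered
	_ = b.Call("gid2", "branch1", "cancel", run("cancel")) // normal Cancel: executed
	err := b.Call("gid3", "branch1", "try", func() error { return errBusiness })
	fmt.Println(err) // the barrier row was rolled back, so a retry can pass
}
```

In a real database the barrier insert and the business logic share one transaction, which is what makes the filtering decision and the business change atomic. The map-based rollback here only imitates that.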
Summary of sub-transaction barrier
The sub-transaction barrier technique was pioneered by https://github.com/yedf/dtm. Its significance lies in designing a simple, easy-to-implement algorithm and providing a simple, easy-to-use interface. With the help of these two, developers are completely freed from handling network exceptions.
This technique currently needs to be paired with the yedf/dtm transaction manager. SDKs are currently available for Go and Python developers, and SDKs for other languages are planned. For other distributed transaction frameworks, as long as the appropriate distributed transaction information is provided, the technique can be implemented quickly according to the principle above.
Summary
This article introduces some basic theory of distributed transactions and explains the commonly used distributed transaction solutions. The second half of the article also covers the causes and classification of transaction exceptions, and an elegant solution to them.
yedf/dtm supports TCC, XA, SAGA, transaction messages, and best-effort notifications (implemented using transaction messages), and provides HTTP and gRPC protocol support, which makes it very easy to integrate.
yedf/dtm already supports clients in Python, Java, PHP, C#, Node, and other languages. See: SDK for each language.
Welcome everyone to visit the https://github.com/yedf/dtm project and give a star to support it!