Today’s businesses, especially those that prioritize digital transformation, are in dire need of real-time data. Traditional weekly and monthly batch processing can no longer meet demand. However, it is not easy to obtain real-time data from multiple sources and use it to automate processes and dynamically optimize decisions.
Recently, we encountered a challenge when re-architecting a customer's legacy system and splitting its monolithic architecture into microservices. We changed the database and modernized the system module by module. During this migration, we needed to keep both databases in sync, because different modules may require the same data: the old system requires data that the new system writes to the new database, and vice versa.
We researched Change Data Capture (CDC) technology to determine whether it fit our needs. This article covers the definition of CDC, the tools we tested, how they work, and their advantages. It also shares some use cases and suggestions to help other engineers choose the appropriate CDC tool for a specific situation.
Change data capture refers to the process of detecting and capturing changes in a source system and then delivering those changes to a target system in near real time. These changes may include insert, update, and delete operations, as well as DDL changes to the database structure.
CDC tools work by monitoring the source system for data changes. Once a change is detected, the CDC tool captures it and records it in a designated location, such as a database or log file. The captured data is then processed, transformed, and loaded into a target system, such as a data warehouse or analytics platform.
There are many ways to capture database changes. Let’s take a look at some of them:
In the audit-column method, we maintain audit columns such as CREATED_AT, LAST_UPDATED or DATE_MODIFIED in the source tables and detect changes by periodically querying the source for rows whose audit values are newer than the last check. Note that this method does not capture delete operations, because a deleted row is no longer there to be queried.
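The polling idea above can be sketched with Python's built-in `sqlite3` module. This is a minimal illustration rather than a production design: the `customers` table, its integer `last_updated` column, and the `fetch_changes_since` helper are hypothetical names chosen for the example.

```python
import sqlite3


def fetch_changes_since(conn, watermark):
    """Return rows whose last_updated is newer than the watermark, oldest first."""
    cur = conn.execute(
        "SELECT id, name, last_updated FROM customers "
        "WHERE last_updated > ? ORDER BY last_updated",
        (watermark,),
    )
    return cur.fetchall()


# Hypothetical source table with an audit column (integer timestamps for brevity).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, last_updated INTEGER)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Ada", 100), (2, "Bob", 105)],
)

# First poll: everything is newer than the initial watermark.
watermark = 0
first_poll = fetch_changes_since(conn, watermark)
watermark = max(row[2] for row in first_poll)

# The source changes: one update and one delete.
conn.execute("UPDATE customers SET name = 'Bobby', last_updated = 110 WHERE id = 2")
conn.execute("DELETE FROM customers WHERE id = 1")

# Second poll: the update is picked up, but the delete leaves no row to find.
second_poll = fetch_changes_since(conn, watermark)
print(second_poll)
```

The second poll returns only the updated row, which shows the limitation noted above: the deleted row simply disappears from the query results.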
A trigger is a function in the database that runs automatically in response to specific events. Although triggers can capture any change, including delete operations, they reduce database performance because each change requires additional writes.
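As a rough sketch of trigger-based capture, the following uses SQLite through Python's `sqlite3` module. The `customers` table, the `customers_changes` shadow table, and the trigger names are assumptions made for illustration; a real deployment would use the trigger syntax of its own database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);

    -- Hypothetical shadow table that receives one row per captured change.
    CREATE TABLE customers_changes (
        change_id   INTEGER PRIMARY KEY AUTOINCREMENT,
        op          TEXT NOT NULL,
        customer_id INTEGER NOT NULL,
        name        TEXT
    );

    CREATE TRIGGER customers_ai AFTER INSERT ON customers BEGIN
        INSERT INTO customers_changes (op, customer_id, name)
        VALUES ('INSERT', NEW.id, NEW.name);
    END;

    CREATE TRIGGER customers_au AFTER UPDATE ON customers BEGIN
        INSERT INTO customers_changes (op, customer_id, name)
        VALUES ('UPDATE', NEW.id, NEW.name);
    END;

    CREATE TRIGGER customers_ad AFTER DELETE ON customers BEGIN
        INSERT INTO customers_changes (op, customer_id, name)
        VALUES ('DELETE', OLD.id, OLD.name);
    END;
    """
)

# Each statement below causes a second write into customers_changes.
conn.execute("INSERT INTO customers (id, name) VALUES (1, 'Ada')")
conn.execute("UPDATE customers SET name = 'Ada L.' WHERE id = 1")
conn.execute("DELETE FROM customers WHERE id = 1")

changes = conn.execute(
    "SELECT op, customer_id, name FROM customers_changes ORDER BY change_id"
).fetchall()
print(changes)
```

Unlike audit-column polling, the delete is recorded here, but every change to `customers` now costs an extra write to the shadow table, which is the performance trade-off described above.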
Databases keep a transaction log that records every change so that the database can recover after a crash. With log-based CDC, new database transactions are read directly from this native log, which allows changes to be captured without scanning the source tables and is therefore more efficient.
This approach is similar to event sourcing in event-driven architecture. Whenever the system state changes, we record it as an event. The recorded events can be replayed in the same order to reconstruct the system state at any time.
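The replay idea can be illustrated with a small in-memory sketch. The `ChangeEvent` record, its fields, and the `replay` function are hypothetical and stand in for whatever event format a real system would use.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ChangeEvent:
    seq: int                     # position of the event in the recorded order
    op: str                      # "insert", "update", or "delete"
    key: int                     # primary key of the affected record
    value: Optional[str] = None  # new value; unused for deletes


def replay(events, up_to_seq=None):
    """Rebuild state by applying events in order, optionally stopping at up_to_seq."""
    state = {}
    for event in sorted(events, key=lambda e: e.seq):
        if up_to_seq is not None and event.seq > up_to_seq:
            break
        if event.op == "delete":
            state.pop(event.key, None)
        else:
            # Inserts and updates both set the latest value for the key.
            state[event.key] = event.value
    return state


events = [
    ChangeEvent(1, "insert", 1, "Ada"),
    ChangeEvent(2, "insert", 2, "Bob"),
    ChangeEvent(3, "update", 2, "Bobby"),
    ChangeEvent(4, "delete", 1),
]

current_state = replay(events)                # state after all events
state_at_seq_2 = replay(events, up_to_seq=2)  # state as of event 2
print(current_state, state_at_seq_2)
```

Because the events are applied strictly in sequence order, stopping the replay early yields the state as it was at that point in time.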
CDC is critical in many scenarios, depending on the situation, application, architecture and business needs. Here are some of the ways CDC helps with the engineering process:
There are several CDC tools on the market, such as Oracle GoldenGate, Debezium, IBM InfoSphere, Striim, StreamSets and Qlik Replicate. Some of these tools are open source and others are commercial. They typically support on-premises and cloud environments and can handle a variety of data sources. When choosing, consider the following:
As businesses become technology-driven, historical and current data will become a critical differentiator. Achieving accurate, timely, efficient and cost-effective change data capture will be an important part of any technology transformation initiative. We hope this article helps when you face this situation.