In distributed, large-scale data collection, managing information sources is particularly important. To guarantee that a given task is processed by only one collector at a time, task scheduling must be unique. A distributed collection system therefore usually includes a scheduling module whose main responsibility is to distribute collection tasks and ensure that each task is handed out only once.
Because the system is distributed, it involves multiple servers (machines), each server runs multiple collectors (processes), and each collector may use multiple threads. The lock mechanism in the task scheduling module is therefore particularly important. Depending on the application's architecture, lock implementations usually fall into the following types:
If the handler is single-process and multi-threaded, in Python you can use the Lock object of the threading module to serialize access to shared variables and achieve thread safety.
In the single-machine, multi-process case, in Python you can use the Lock object of the multiprocessing module.
In a multi-machine, multi-process deployment, you have to rely on a third-party component that stores the lock object to implement a distributed synchronization lock.
Since the scheduling module is a multi-machine, multi-process, multi-threaded mechanism, it falls under the third type.
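The first case can be illustrated in Java, the language of the scheduling module, as well as in Python. The sketch below uses java.util.concurrent.locks.ReentrantLock as the Java counterpart of Python's threading.Lock. The class names LocalTaskCounter and LocalLockDemo are hypothetical and are not part of the system described here.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical in-process task counter guarded by a lock (illustrative names only).
class LocalTaskCounter {
    private final ReentrantLock lock = new ReentrantLock();
    private int nextTaskId = 0;

    // Hands out each task id exactly once, even when called from many threads.
    int claimNextTask() {
        lock.lock();
        try {
            return nextTaskId++;
        } finally {
            lock.unlock(); // always release, even if the body throws
        }
    }
}

class LocalLockDemo {
    // Starts several threads that claim task ids and returns how many distinct ids were seen.
    static int run(int threads, int claimsPerThread) {
        LocalTaskCounter counter = new LocalTaskCounter();
        Set<Integer> seen = ConcurrentHashMap.newKeySet();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < claimsPerThread; j++) {
                    seen.add(counter.claimNextTask());
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new RuntimeException(e);
            }
        }
        return seen.size();
    }
}
```

A lock like this only coordinates threads inside one JVM. Collectors on other processes or machines never see it, which is why the multi-machine case needs an external lock store.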
Distributed lock implementation methods
The current mainstream distributed lock implementation methods are as follows:
Based on a database, such as MySQL
Based on a cache, such as Redis
Based on ZooKeeper
Each method has its own merits. After weighing them, Redis is the most suitable choice here. The main reasons are:
Redis operates in memory, so access is faster than with a database, and performance under high concurrency does not drop much once locking is added
Redis can set a time to live (TTL) on keys
Redis is simple to use and has low overall implementation overhead
However, a distributed lock implemented with Redis also needs to meet the following conditions:
Only one thread can hold the lock at a time; other threads must wait until the lock is released
The lock operation must be atomic
No deadlock may occur, for example when the thread that has acquired the lock exits abnormally before releasing it, leaving other threads waiting in a loop for the release
The lock must be acquired and released by the same thread
We use Redis to implement a distributed synchronization lock that ensures data consistency, which requires the following characteristics:
Mutual exclusion: only one thread can acquire the lock at a time
Use the Redis TTL to ensure that no deadlock occurs. Lock expiration can, however, let multiple threads hold the lock at the same time, so the expiration time must be set sensibly to avoid this
Use the uniqueness of the lock value to ensure that the lock is not deleted by mistake
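The mutual-exclusion and uniqueness requirements can be sketched with an in-memory stand-in for the lock store. This is a model of the semantics only, not the article's implementation. TokenLock and TokenLockDemo are hypothetical names, and a ConcurrentHashMap plays the role of Redis. In Redis itself, the acquire step is typically a single SET with the NX option, and the release step must compare the stored value and delete it atomically.

```java
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// In-memory stand-in for the lock store, used only to illustrate the semantics.
class TokenLock {
    private final ConcurrentHashMap<String, String> store = new ConcurrentHashMap<>();

    // Atomically sets the key if absent. Returns a unique owner token on success,
    // or null if another owner already holds the lock.
    String tryAcquire(String key) {
        String token = UUID.randomUUID().toString();
        return store.putIfAbsent(key, token) == null ? token : null;
    }

    // Atomically deletes the key only if it still holds the caller's token.
    boolean release(String key, String token) {
        return token != null && store.remove(key, token);
    }
}

class TokenLockDemo {
    static boolean[] run() {
        TokenLock lock = new TokenLock();
        String key = "Dispatcher_Task_Lock";
        String a = lock.tryAcquire(key);                        // first caller wins
        String b = lock.tryAcquire(key);                        // second caller is refused
        boolean wrongOwner = lock.release(key, "not-the-owner"); // ignored, lock stays
        boolean rightOwner = lock.release(key, a);               // owner deletes the lock
        String c = lock.tryAcquire(key);                        // lock is free again
        return new boolean[] { a != null, b == null, !wrongOwner, rightOwner, c != null };
    }
}
```

Storing a random token as the lock value is what lets release refuse a caller that does not own the lock.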
In practice, I separated the scheduling module from the rest of the collection system and built it as an independent service on the Java client JRedis (a high-performance Java client for connecting to the Redis distributed hash key-value database), using Spring Boot to implement synchronous and asynchronous functions. Other collectors request the collection tasks to be processed over HTTP. The process is roughly as follows:
The collector sends a task request to the dispatching center through HTTP;
The dispatching center checks whether the lock exists; if it does, an empty set is returned directly;
If the lock does not exist, the request acquires the lock, and the corresponding collection tasks are then obtained according to the source rules;
The acquired tasks are returned (or an empty result if there is no pending task), and the lock is then deleted.
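The steps above can be sketched as a small in-memory dispatcher. This is an illustration under stated assumptions, not the service itself. Dispatcher and requestTasks are hypothetical names, an AtomicBoolean stands in for the Redis lock key, and a queue stands in for the task source and its source rules.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical dispatcher: the AtomicBoolean models the lock, the queue models pending tasks.
class Dispatcher {
    private final AtomicBoolean locked = new AtomicBoolean(false);
    private final Queue<String> pending = new ArrayDeque<>();

    Dispatcher(List<String> tasks) {
        pending.addAll(tasks);
    }

    // Handles one collector request and returns at most max tasks.
    List<String> requestTasks(int max) {
        List<String> result = new ArrayList<>();
        // Check and take the lock in one atomic step; if it is held, return empty.
        if (!locked.compareAndSet(false, true)) {
            return result;
        }
        try {
            // Fetch the pending tasks while holding the lock.
            while (result.size() < max && !pending.isEmpty()) {
                result.add(pending.poll());
            }
        } finally {
            locked.set(false); // delete the lock even if fetching fails
        }
        return result;
    }
}
```

The sketch fuses the existence check and the lock write into one compareAndSet call, so two concurrent requests cannot both pass the check. Releasing in a finally block keeps an exception in the task logic from leaving the lock in place.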
The code implementation of the scheduling module is roughly as follows:
public static List<...> ...(...
        HashServiceInterface hif, ZSetServiceInterface zScoreSet, String dicName) {
    List<...> result = ...;
    try {
        String dicNameLock = "Dispatcher_Task_Lock"; // Task scheduling lock
        if (!redisHashUtils.keyIsExit(dicNameLock, lockKeyValue)) { // Determine whether the lock exists
            // Add the lock (write the task's unique identifier into the record)
            redisHashUtils.addOneData(dicNameLock, lockKeyValue, DateUtil.getYMDHMS());
            // Processing task logic
            .......
            // Delete the lock once the tasks have been obtained
            .......
        } else { // The lock already exists
            System.out.println("Processing task, temporarily returning an empty collection....");
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
    return result;
}
In practice, when adding a lock, you must also set an expiration time on it. Otherwise, if some unknown exception occurs, the lock may never be released and the collectors will never be able to obtain collection tasks.
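The effect of an expiration time can be shown with a small in-memory model. This is a sketch, not the production code. ExpiringLock and ExpiringLockDemo are hypothetical names, and the clock is passed in as a parameter so that expiry can be shown without sleeping. In Redis, the equivalent is to set the key and its TTL in one command, for example SET with the NX and PX options.

```java
import java.util.concurrent.ConcurrentHashMap;

// In-memory model of a lock whose entries carry an expiry deadline.
class ExpiringLock {
    private final ConcurrentHashMap<String, Long> expiresAt = new ConcurrentHashMap<>();

    // Acquires the lock if it is absent or the previous holder's TTL has elapsed.
    // compute() runs atomically per key, so the check and the write cannot interleave.
    boolean tryAcquire(String key, long nowMillis, long ttlMillis) {
        boolean[] acquired = { false };
        expiresAt.compute(key, (k, deadline) -> {
            if (deadline == null || deadline <= nowMillis) {
                acquired[0] = true;
                return nowMillis + ttlMillis; // new holder, new deadline
            }
            return deadline; // still held, keep the existing deadline
        });
        return acquired[0];
    }
}

class ExpiringLockDemo {
    static boolean[] run() {
        ExpiringLock lock = new ExpiringLock();
        String key = "Dispatcher_Task_Lock";
        boolean first = lock.tryAcquire(key, 0L, 30_000L);        // holder takes the lock, then crashes
        boolean blocked = lock.tryAcquire(key, 1_000L, 30_000L);  // still within the TTL
        boolean recovered = lock.tryAcquire(key, 30_000L, 30_000L); // TTL elapsed, lock is usable again
        return new boolean[] { first, !blocked, recovered };
    }
}
```

Without the deadline, the second and third attempts would both fail forever after the holder crashed. With it, the lock recovers on its own once the TTL elapses, so the TTL should be longer than the longest expected task-fetch time.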