This article covers Redis' high-availability sentinels, including their functions, architecture, deployment, and configuration. I hope it will be helpful to everyone.
Before introducing Sentinel, let's first review Redis' high-availability technologies from a macro perspective. They include: persistence, replication, sentinel, and cluster. Their main functions and the problems they solve are, briefly:

Persistence: the lowest level of high availability; it keeps data on disk at the single-machine level, so data is not lost when the process restarts.

Replication: the foundation of multi-machine high availability; it keeps redundant copies of the data on other machines, but failover must be done manually, and write operations and storage capacity remain limited to a single machine.

Sentinel: built on top of replication; it adds automatic failover of the master node, but write load balancing and the single-machine storage limit remain unsolved.

Cluster: solves the remaining problems by load balancing writes and scaling storage across machines.
Now let's talk about the sentinel.
Redis Sentinel was introduced in Redis 2.8. Its core function is automatic failover of the master node. The Redis official documentation describes Sentinel's functions as follows:

Monitoring: Sentinel constantly checks whether the master and slave nodes are working properly.

Notification: when a monitored node has a problem, Sentinel can notify the system administrator or other applications through an API.

Automatic failover: when the master node fails, Sentinel starts a failover operation, promotes one of the slaves to be the new master, makes the other slaves replicate from it, and informs applications of the new address.

Configuration provider: clients connect to the sentinels during initialization to obtain the address of the current master node.
Among these, the monitoring and automatic failover functions allow Sentinel to detect master node failures in time and complete the failover on its own, while the configuration provider and notification functions only show up in the interaction with the client.
A note on how the word "client" is used in this article: in earlier articles, anything that accessed the redis server through its API was called a client, including redis-cli and Java clients such as Jedis. To make the distinctions easier to follow, the "client" in this article does not include redis-cli, but refers to the more sophisticated clients: redis-cli uses the low-level interfaces provided by redis, whereas the clients discussed here encapsulate those interfaces and functions in order to take full advantage of Sentinel's configuration provider and notification capabilities.
The typical sentinel architecture diagram is as follows:
It consists of two parts, sentinel nodes and data nodes:

Sentinel nodes: the sentinel system is made up of one or more sentinel nodes, which are special redis nodes that do not store data.

Data nodes: the master node and the slave nodes are the data nodes; they are ordinary redis nodes that store data.
This part deploys a simple sentinel system consisting of 1 master node, 2 slave nodes, and 3 sentinel nodes. For convenience, all of these nodes are deployed on a single machine (LAN IP: 192.168.92.128) and distinguished by port number, and the node configurations are kept as simple as possible.
The master-slave nodes in a sentinel system are configured exactly like ordinary master-slave nodes and need no additional configuration. Below are the configuration files of the master node (port=6379) and the two slave nodes (port=6380/6381); they are fairly simple and are not described further.
#redis-6379.conf
port 6379
daemonize yes
logfile "6379.log"
dbfilename "dump-6379.rdb"

#redis-6380.conf
port 6380
daemonize yes
logfile "6380.log"
dbfilename "dump-6380.rdb"
slaveof 192.168.92.128 6379

#redis-6381.conf
port 6381
daemonize yes
logfile "6381.log"
dbfilename "dump-6381.rdb"
slaveof 192.168.92.128 6379
After the configuration is complete, start the master node and then the slave nodes:

redis-server redis-6379.conf
redis-server redis-6380.conf
redis-server redis-6381.conf

After the nodes start, connect to the master node and check that the master-slave status is normal, as shown below.
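For example (a minimal check against the deployment above; output abridged, offset/lag values illustrative), info replication on the master should report both slaves as online:

redis-cli -p 6379 info replication
# Replication
role:master
connected_slaves:2
slave0:ip=192.168.92.128,port=6380,state=online,offset=100,lag=0
slave1:ip=192.168.92.128,port=6381,state=online,offset=100,lag=0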
The sentinel node is essentially a special Redis node.
The configurations of the three sentinel nodes are almost identical; the main difference is the port number (26379/26380/26381). Below, the 26379 node is used as an example to introduce how to configure and start a sentinel node; the configuration is kept as simple as possible here, and more options are introduced later.
#sentinel-26379.conf
port 26379
daemonize yes
logfile "26379.log"
sentinel monitor mymaster 192.168.92.128 6379 2
Here, the configuration sentinel monitor mymaster 192.168.92.128 6379 2 means: this sentinel node monitors the master node 192.168.92.128:6379; the master node's name is mymaster; and the final 2 relates to the failure judgment of the master node: at least 2 sentinel nodes must agree before the master node can be judged as failed and a failover carried out.

There are two ways to start a sentinel node; their effect is exactly the same:

redis-sentinel sentinel-26379.conf
redis-server sentinel-26379.conf --sentinel
3. Summary
Once everything is configured and started as described above, the whole sentinel system is up and running. You can verify this by connecting to a sentinel node with redis-cli, as shown below.
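For instance (output abridged; the master0 line is the one to check):

redis-cli -p 26379 info sentinel
# Sentinel
sentinel_masters:1
sentinel_tilt:0
master0:name=mymaster,status=ok,address=192.168.92.128:6379,slaves=2,sentinels=3

status=ok, slaves=2, and sentinels=3 show that the sentinel considers the master healthy and has discovered both slaves and all three sentinels.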
A few points about the process of setting up a sentinel system deserve attention:

(1) The master-slave nodes in a sentinel system are no different from ordinary master-slave nodes; failure discovery and failover are controlled and carried out by the sentinels.

(2) Sentinel nodes are essentially redis nodes.

(3) Each sentinel node only needs to be configured to monitor the master node; it then automatically discovers the other sentinel nodes and the slave nodes.

(4) During sentinel node startup and during failover, the configuration files of the nodes are rewritten (config rewrite).
The previous section demonstrated two of the sentinel's functions, monitoring and automatic failover; this section uses a client to demonstrate the other two, configuration provider and notification.
Before explaining how the client works internally, let's demonstrate its usage, taking the Java client Jedis as an example. The code below connects to the sentinel system we just built and performs read and write operations (it only demonstrates how to connect to the sentinels; exception handling, resource cleanup, and so on are omitted).
import java.util.HashSet;
import java.util.Set;
import redis.clients.jedis.Jedis;
import redis.clients.jedis.JedisSentinelPool;

public static void testSentinel() throws Exception {
    String masterName = "mymaster";
    Set<String> sentinels = new HashSet<>();
    sentinels.add("192.168.92.128:26379");
    sentinels.add("192.168.92.128:26380");
    sentinels.add("192.168.92.128:26381");
    JedisSentinelPool pool = new JedisSentinelPool(masterName, sentinels); // the constructor does a lot of work during initialization
    Jedis jedis = pool.getResource();
    jedis.set("key1", "value1");
    pool.close();
}
Throughout the process, our code connects to the master node without explicitly specifying its address, and although the code contains nothing about failover, it switches to the new master automatically after the sentinels complete a failover. This is possible because the JedisSentinelPool constructor does the relevant work, mainly the following two things:
(1) Traverse the sentinel nodes to obtain the master node's information: iterate over the sentinel nodes and, from one of them plus masterName, obtain the master node's information; this is done by calling the sentinel node's sentinel get-master-addr-by-name command, for example:
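A typical invocation against the deployment above looks like this:

redis-cli -p 26379 sentinel get-master-addr-by-name mymaster
1) "192.168.92.128"
2) "6379"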
Once the master node's information is obtained, the traversal stops (so in general the loop ends at the first sentinel node).
(2) Subscribe to the sentinels for notifications: this way, when a failover occurs, the client receives notification from the sentinels and switches to the new master node. Concretely, the client uses redis's publish/subscribe functionality to start a separate thread for each sentinel node that subscribes to that sentinel's +switch-master channel; when a message arrives, the connection pool is reinitialized.
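The following is a minimal sketch of that mechanism, reduced to a single sentinel for illustration (JedisSentinelPool does the same thing in one background thread per sentinel node). The +switch-master message carries the master name followed by the old and new addresses:

import redis.clients.jedis.Jedis;
import redis.clients.jedis.JedisPubSub;

public class SwitchMasterListener {
    public static void main(String[] args) {
        // subscribe() blocks, which is why a real client runs it on a dedicated thread
        Jedis sentinel = new Jedis("192.168.92.128", 26379);
        sentinel.subscribe(new JedisPubSub() {
            @Override
            public void onMessage(String channel, String message) {
                // message format: "<master-name> <old-ip> <old-port> <new-ip> <new-port>"
                String[] parts = message.split(" ");
                if ("mymaster".equals(parts[0])) {
                    System.out.println("Master switched to " + parts[3] + ":" + parts[4]);
                    // a real client would reinitialize its connection pool here
                }
            }
        }, "+switch-master");
    }
}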
This look at how the client works helps deepen the understanding of the sentinel's functions:
(1) Configuration provider: the client can obtain the master node's information from a sentinel node using masterName; here the sentinel plays the role of configuration provider.
Note that the sentinel is only a configuration provider, not a proxy. The difference: with a configuration provider, once the client has obtained the master node's information through the sentinel, it establishes a connection directly to the master node, and subsequent requests (such as set/get) are sent directly to the master; with a proxy, every client request would be sent to the sentinel, which would then handle it through the master node.
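To make the distinction concrete, here is a small sketch using Jedis' sentinelGetMasterAddrByName wrapper around the command shown earlier: the sentinel is consulted once for the address, after which every data command goes straight to the master. (The configuration experiment that follows makes the same point from the deployment side.)

import java.util.List;
import redis.clients.jedis.Jedis;

public class ConfigProviderDemo {
    public static void main(String[] args) {
        // ask a sentinel for the current master's address (configuration provider role)
        Jedis sentinel = new Jedis("192.168.92.128", 26379);
        List<String> addr = sentinel.sentinelGetMasterAddrByName("mymaster");
        sentinel.close();
        // connect directly to the master; the sentinel is no longer involved
        Jedis master = new Jedis(addr.get(0), Integer.parseInt(addr.get(1)));
        master.set("key1", "value1"); // sent to the master, not to the sentinel
        master.close();
    }
}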
An example illustrates nicely that the sentinel acts as a configuration provider rather than a proxy. In the sentinel system deployed earlier, modify the sentinel nodes' configuration files as follows:
sentinel monitor mymaster 192.168.92.128 6379 2
is changed to:
sentinel monitor mymaster 127.0.0.1 6379 2
Then run the client code shown earlier on another machine on the LAN: the client fails to connect to the master node. This is because the sentinel is only a configuration provider: the client queries it and is told that the master's address is 127.0.0.1:6379, so it tries to establish a redis connection to 127.0.0.1:6379, which of course fails. If the sentinel were a proxy, this problem would not occur.

(2) Notification: after a failover completes, the sentinel nodes send the new master node's information to the client so that it can switch to the new master promptly.
The preceding sections covered the basics of deploying and using sentinels; this part introduces the basic principles of how sentinels are implemented.
As a redis node running in a special mode, a sentinel node supports a set of commands different from those of ordinary redis nodes. In operations work we can query or modify the sentinel system through these commands; more importantly, the sentinel system cannot implement functions such as failure discovery and failover without communication between the sentinel nodes, and a large part of that communication is carried out through the commands the sentinel nodes support. The main commands supported by sentinel nodes are introduced below.
(1) Basic queries: through these commands you can inspect the topology, node information, configuration, and so on of the sentinel system, for example as listed below.
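Typical examples (these are standard sentinel commands; the original list is reproduced only in part here):

info sentinel: basic information about this sentinel node and the master nodes it monitors
sentinel masters: detailed information about all monitored master nodes
sentinel slaves mymaster: information about the slave nodes of mymaster
sentinel sentinels mymaster: information about the other sentinel nodes monitoring mymaster
sentinel get-master-addr-by-name mymaster: the address of the master node mymaster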
(2) Add/remove monitoring of the master node
sentinel monitor mymaster2 192.168.92.128 16379 2: works exactly like the sentinel monitor directive in the configuration file used when deploying a sentinel node, and is not described further
sentinel remove mymaster2: Cancel the monitoring of the master node mymaster2 by the current sentinel node
(3) Forced failover
sentinel failover mymaster: forces a failover for mymaster even if the current master node is running well; for example, if the machine hosting the current master node is about to be decommissioned, the failover command can be used to fail over in advance.
As for how the sentinel works, the key is to understand the following concepts.
(1) Scheduled tasks: each sentinel node maintains 3 scheduled tasks, whose functions are: obtaining the latest master-slave topology by sending the info command to the master and slave nodes; obtaining information about the other sentinel nodes through the publish/subscribe functionality; and performing heartbeat detection by sending the ping command to the other nodes to determine whether they are offline.
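A quick way to observe the publish/subscribe task (against the deployment above): each sentinel publishes a hello message on the __sentinel__:hello channel of every data node roughly every 2 seconds, carrying its own address and run ID together with the master's current configuration, so subscribing on the master makes the sentinels' announcements visible:

redis-cli -p 6379 subscribe __sentinel__:hello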
(2) Subjective offline: in the heartbeat-detection task, if another node does not reply within a certain time, the sentinel node marks it as subjectively offline. As the name implies, subjective offline means one sentinel node "subjectively" judges a node to be offline; its counterpart is objective offline.
(3) Objective offline: after marking the master node subjectively offline, the sentinel node asks the other sentinel nodes about the master node's state via the sentinel is-master-down-by-addr command; if the number of sentinels judging the master node to be offline reaches a certain value, the master node is marked objectively offline.
Note that objective offline is a concept that exists only for the master node; if a slave node or a sentinel node fails, after being marked subjectively offline by a sentinel there are no subsequent objective offline or failover operations.
(4) Electing the leader sentinel node: once the master node is judged objectively offline, the sentinel nodes negotiate to elect a leader sentinel node, and that leader carries out the failover.
All sentinels monitoring the master node are potential leaders. The election uses the Raft algorithm, whose basic idea is first come, first served: within one round of election, when sentinel A sends B a request to become leader, B agrees as long as it has not already agreed to another sentinel. The details of the election are omitted here; in general the election completes very quickly, and whichever sentinel first detects the objective offline usually becomes the leader.
(5) Failover: the elected leader sentinel starts the failover, which roughly consists of 3 steps: first, select the new master node among the slaves of the failed master (unhealthy slaves are filtered out, and among the remainder selection goes by slave priority, then by replication offset, then by the smallest run ID); second, promote the selected slave with slaveof no one and point the remaining slaves at the new master; third, configure the old master as a slave of the new master, so that it replicates from the new master when it comes back online.
The following introduces several configurations related to Sentinel.
(1) sentinel monitor {masterName} {masterIp} {masterPort} {quorum}
sentinel monitor is the sentinel's core configuration and was already explained when the sentinel node was deployed earlier. masterName specifies the master node's name, masterIp and masterPort specify its address, and quorum is the threshold on the number of sentinels required to judge the master node objectively offline: when the number of sentinels judging the master to be offline reaches quorum, the master is marked objectively offline. The recommended value is half the number of sentinels plus 1 (e.g., 3 for a system with 5 sentinels).
(2) sentinel down-after-milliseconds {masterName} {time}
sentinel down-after-milliseconds relates to the subjective-offline judgment: the sentinel performs heartbeat detection on other nodes with the ping command, and if a node does not reply within the time configured by down-after-milliseconds, the sentinel marks it subjectively offline. This configuration applies to the subjective-offline judgment of master nodes, slave nodes, and sentinel nodes alike.
The default value of down-after-milliseconds is 30000, i.e. 30s; it can be adjusted to the network environment and application requirements. The larger the value, the looser the subjective-offline judgment: the advantage is a smaller chance of misjudgment, the disadvantage is that failure discovery and failover take longer and clients wait longer. For example, if the application has high availability requirements, the value can be reduced appropriately so that failover completes as soon as possible after a failure; if the network environment is poor, the threshold can be raised appropriately to avoid frequent misjudgments.
(3) sentinel parallel-syncs {masterName} {number}
sentinel parallel-syncs relates to slave replication after a failover: it specifies how many slave nodes initiate replication to the new master node at a time. For example, suppose that after the master switch completes, 3 slave nodes need to start replicating from the new master: with parallel-syncs=1 the slaves start replicating one at a time, and with parallel-syncs=3 all 3 start together.
The larger the value of parallel-syncs, the sooner the slave nodes finish replication, but the greater the network and disk load on the master node; it should be set according to the actual situation. For example, if the master node's load is low and the application has high availability requirements for the slave nodes, parallel-syncs can be increased appropriately. The default value of parallel-syncs is 1.
(4) sentinel failover-timeout {masterName} {time}
sentinel failover-timeout relates to timeout judgment during failover, but the parameter does not bound the whole failover phase; it bounds several of its sub-stages. For example, if promoting the slave to the new master takes longer than the timeout, or if a slave's initiation of replication to the new master (excluding the time to copy the data) takes longer than the timeout, the failover fails with a timeout.
The default value of failover-timeout is 180000, i.e. 180s; if a failover times out, the next attempt uses twice that value.
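Putting the four parameters together, a tuned sentinel configuration for the deployment above might look like this (the values shown are the defaults, apart from the quorum of 2 for 3 sentinels):

sentinel monitor mymaster 192.168.92.128 6379 2
sentinel down-after-milliseconds mymaster 30000
sentinel parallel-syncs mymaster 1
sentinel failover-timeout mymaster 180000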
(5) Besides the parameters above, there are others, such as those related to security verification, which are not covered here.
(1) There should be more than one sentinel node. On one hand this adds redundancy, preventing the sentinel itself from becoming the high-availability bottleneck; on the other it reduces misjudgments of offline status. In addition, these sentinel nodes should be deployed on different physical machines.
(2) The number of sentinel nodes should be odd, so the sentinels can easily reach "decisions" by voting: the leader election decision, the objective-offline decision, and so on.
(3) The configurations of the sentinel nodes should be consistent, including hardware and parameters; in addition, all nodes should use ntp or a similar service to keep their clocks accurate and consistent.
(4) Sentinel's configuration provider and notification functions require client support, as with Jedis above; if the library a developer uses does not provide that support, the developer may need to implement it themselves.
(5) When the nodes of a sentinel system are deployed in docker (or other software that performs port mapping), take special care: port mapping may keep the sentinel system from working normally, because the sentinel's work relies on communication with the other nodes, and docker's port mapping may make the sentinel unable to connect to them. For example, sentinels discover one another through the IP and port each declares to the outside; if a sentinel A is deployed in docker with port mapping, the other sentinels cannot reach A using the port A declares.
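For this docker/NAT scenario, sentinel provides announce options that let a node advertise an address other nodes can actually reach instead of the auto-detected one; a sketch, with placeholder values standing in for the host's real IP and mapped port:

# advertise a reachable address instead of the auto-detected one (placeholder values)
sentinel announce-ip 192.168.92.128
sentinel announce-port 26379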
This article first introduced Sentinel's functions: monitoring, automatic failover, configuration provider, and notification; then it described how to deploy a sentinel system and how to access it from a client; after that it briefly explained the basic principles of the sentinel's implementation; finally it offered some suggestions for using sentinels in practice.
Building on master-slave replication, Sentinel adds automatic failover of the master node, further improving Redis' high availability; but Sentinel's shortcoming is equally obvious: it cannot fail over slave nodes automatically. In a read-write-separation scenario, a slave node failure makes the read service unavailable, and we must monitor and switch slave nodes ourselves.
In addition, Sentinel still does not solve the problems that write operations cannot be load balanced and that storage capacity is limited by a single machine; solving these requires a cluster, which I will introduce in a later article.