New Redis Cluster meta-data handling
This blog post describes the new algorithm used in Redis Cluster in order to propagate and update metadata, that is hopefully significantly safer than the previous algorithm used. The Redis Cluster specification was not yet updated, as I'm
This blog post describes the new algorithm used in Redis Cluster in order to propagate and update metadata, that is hopefully significantly safer than the previous algorithm used. The Redis Cluster specification was not yet updated, as I'm rewriting it from scratch, so this blog post serves as a first way to share the algorithm with the community.Let's start with the problem to solve. Redis Cluster uses a master - slave design in order to recover from nodes failures. The key space is partitioned across the different masters in the cluster, using a concept that we call "hash slots". Basically every key is hashed into a number between 0 and 16383. If a given key hashes to 15, it means it is in the hash slot number 15. These 16k hash slots are split among the different masters.
At every single time only one master should serve a given hash slot. Slaves just replicate the master dataset so that it is possible to fail over a master and put the cluster again into an usable state where all the hash slots are served by one node.
Redis Cluster is client assisted and nodes are not capable to forward queries to other nodes. However nodes are able to redirect a client to the right node every time a client tries to access a key that is served by a different node. This means that every node in the cluster should know the map between the hash slots and the nodes serving them.
The problem I was trying to solve is, how to take this map in sync between nodes in a safe way? A safe way means that even in the event of net splits, eventually all the nodes will agree about the hash slots configuration.
Another problem to solve was the slave promotion. A master can have multiple slaves, how to detect, and how to act, when a master is failing and a slave should be promoted to replace it?
Metadata is not data
====================
In the case of Redis Cluster handling of metadata is significantly different than the way the user data itself is handled. The focus of Redis Cluster is:
1) Speed.
2) No need for merge operations, so that it is semantically simple to handle the very large values typical of Redis.
3) The ability to retain most writes originating from clients connected to the majority of masters.
Given the priorities, Redis Cluster, like the vanilla single node version of Redis, uses asynchronous replication where changes to the data set are streamed to slave nodes with an asynchronous acknowledgement from slaves. In other words when a node receives a write, the client most of the times directly talk with the node in charge for the key hash slot, and the node has no other chatting to do with other nodes.
However this means that Redis Cluster is not a true CP system: there is a window where writes can be lost. The trivial case to lose a write is to write to a master that stops working after accepting our write and replying to the client, but before being able to propagate the write to its slaves.
This window is very small, in the sub-millisecond range. However when a client is partitioned away from the majority of the master nodes there is a bigger window to lose data, as a partitioned master will continue to accept writes for some time, while on the majority side the same master may be failed over by a slave. So when the partition will be fixed, the master will be reconfigured as a slave and writes will be lost.
Apart from the replicas, a key is stored into a single master node, so there is no need to agree or merge its value. Given the design, there is no need to use an agreement protocol in order to write or read data to/from the cluster. However metadata is a different story, we want that every node has a coherent vision of the cluster setup, and that the right configuration is eventually propagated to all the nodes even in case of failures and network partitions.
Using an agreement algorithm
============================
The simplest way to solve such a problem is to use a consensus algorithm such as Paxos or Raft, and this was the direction I was going to take. However implementing consensus algorithms is hard. Actually it is so hard that often years are needed for implementations to stabilize.
At some point I noticed something that makes the work of Redis Cluster nodes simpler, that is, information about hash slots is always idempotent. If hash slot 5 is served by A, and later because the configuration changes hash slot 5 is served by B, nodes don't need to know what happened, everything they need is that the configuration for an hash slot was updated.
This changes everything basically: agreement protocols are able to take a state machine synchronized by running the same sequence of operations in every node. If the state machine is deterministic, then the internal state will be the same in all the nodes eventually. However all the state Redis Cluster needs, for a given slot, can be embedded into a single packet.
Because of this we don't need a log of operations stored on disk, nor a way to make sure to fetch all the operations still not fetched, or to figure out what should be applied and what not, all the state can be copied in a single message. In short Redis Cluster does not require a full agreement protocol so I stolen just what I needed from Raft, and tried to figure out the rest.
Failure detection
=================
In order to see if a node has issues, Redis Cluster still uses the old algorithm that is based on gossip. Nodes inform other nodes about the state of a few random nodes using ping / pong packets. These ping / pong packets are in turn used to check if a node is reachable from the point of view of the sender of the ping. If the (informal) majority of masters agree that a given node is not reachable, then the node is flagged as failing. I said "informal" as there is no true agreement here, but simply:
1) Every node flags other nodes are failing if the majority of master nodes signaled the node as down in a given time range.
2) Every node removes the flag if the node is back reachable and is a salve, or a master that after some time is still serving slots from our point of view (was not failed over).
The failure detection is completely informal and has the only property that eventually all the nodes will agree on the failure: either the majority of nodes will mark it as failing resulting into a chain effect that will force all the other nodes to mark the node as failing, OR there is no majority and if the node is reachable again everybody will clear the flag.
The point here is that the failure detection does not require any safety, it is only useful in order to trigger the safe part of the algorithm, that is, replacing the master with a slave and update the cluster configuration.
Slave promotion
===============
Promoting a slave must be a safe operation, and one that should ensure that the configuration change is propagated across the cluster as soon as possible.
Slave promotion is up to slaves and uses a mechanism very similar to the Raft algorithm leader election. This is what happens:
1) A slave detects its master is failing.
2) The slave will try to win the election in order to promote itself to master.
3) If it is successful, it will change its state and will advertise the new configuration.
4) If it is unsuccessful it will try again to win the election after some time.
Every slave will try to start the election at a slightly different time in order to avoid a split brain condition that will require a new election. Redis Cluster uses a random delay that is driven by the number of seconds a slave was disconnected from the master, in order to favor slaves that were able to talk with the master more recently (slaves with too old data don't try at all to get elected).
Every cluster node has the concept of currentTerm as in Raft, that is called currentEpoch in Redis Cluster. Every node tries to have a currentEpoch that is the highest found among other nodes, so this information is always added in ping /pong packets headers. Every time a node sees a currentEpoch of another node that is greater than its epoch, it updates its currentEpoch.
The election is like Raft: a slave that tries to get elected increments its currentEpoch and sends a failover-auth-request to every master hoping to get its vote. Masters refuse to vote if the master instance of the slave is not failing from the point of view of a given master, or if the currentTerm advertised by the slave is smaller than the currentTerm of the receiving master.
Also masters vote a single time for every epoch: this ensures that for every epoch we can have just a single winner, this is central both in the Raft algorithm and in the Redis Cluster slave election.
Basically, if a slave wins the election, it uses the epoch at which the election was won as the version of its configuration, and newer configurations always win over older configurations.
The configEpoch
===============
In order to make more clear how it works, we need to add some more information. Basically every ping / pong packet does not just publish the currentEpoch, but also the configEpoch, that is, the epoch at which the master started to serve its set of hash slots. Initially when a cluster is created every master uses a configEpoch of zero. As failover events happen, there will be nodes with greater configEpoch values.
As in the old algorithm, the ping and pong packets also carry a bitmap with the hash slots served by a given node. Since every node knows the last observed configEpoch of every other node, it can detect configuration changes to incorporate.
For instance if node B claims to serve hash slot 5 that was previously served by node A, but the configEpoch of node B is greater than the configEpoch we have for A, then we accept the new configuration.
The same mechanism is also used in order to reconfigure a reappearing master as a slave, or to reconfigure all the other slaves after a failover. The old master's served slots count will drop to zero, and the nodes will switch as replicas of the node that is serving the slots now.
The real algorithm used has more details that don't change the semantics, but make everything more fluid in common cases. For example after a slave wins an election it broadcasts a PONG to every node in order to make the configuration change faster, and to prevent other slaves from initiating a new useless election.
Similarly a master that was partitioned out from the majority for enough time (the same time needed to flag it as failing) stop accepting writes, and will not accept writes for a few more seconds even after the majority of masters is reachable again, in order to give some time to the other nodes to inform it of configuration changes. This makes less likely that a client with an old routing table will try and succeed to write to the returning master that is now failed over.
From the point of view of the code, the implementation is requiring a minor amount of code, as everything was already implemented in the old algorithm even if in a broken way, it was unsafe but the basic abstractions and message formats were ok.
All in all I'm failing in love with distributed programming and I hope to learn more in the next weeks... Comments

ホットAIツール

Undresser.AI Undress
リアルなヌード写真を作成する AI 搭載アプリ

AI Clothes Remover
写真から衣服を削除するオンライン AI ツール。

Undress AI Tool
脱衣画像を無料で

Clothoff.io
AI衣類リムーバー

AI Hentai Generator
AIヘンタイを無料で生成します。

人気の記事

ホットツール

メモ帳++7.3.1
使いやすく無料のコードエディター

SublimeText3 中国語版
中国語版、とても使いやすい

ゼンドスタジオ 13.0.1
強力な PHP 統合開発環境

ドリームウィーバー CS6
ビジュアル Web 開発ツール

SublimeText3 Mac版
神レベルのコード編集ソフト(SublimeText3)

ホットトピック









Redisクラスターモードは、シャードを介してRedisインスタンスを複数のサーバーに展開し、スケーラビリティと可用性を向上させます。構造の手順は次のとおりです。異なるポートで奇妙なRedisインスタンスを作成します。 3つのセンチネルインスタンスを作成し、Redisインスタンスを監視し、フェールオーバーを監視します。 Sentinel構成ファイルを構成し、Redisインスタンス情報とフェールオーバー設定の監視を追加します。 Redisインスタンス構成ファイルを構成し、クラスターモードを有効にし、クラスター情報ファイルパスを指定します。各Redisインスタンスの情報を含むnodes.confファイルを作成します。クラスターを起動し、CREATEコマンドを実行してクラスターを作成し、レプリカの数を指定します。クラスターにログインしてクラスター情報コマンドを実行して、クラスターステータスを確認します。作る

Redisはハッシュテーブルを使用してデータを保存し、文字列、リスト、ハッシュテーブル、コレクション、注文コレクションなどのデータ構造をサポートします。 Redisは、スナップショット(RDB)を介してデータを維持し、書き込み専用(AOF)メカニズムを追加します。 Redisは、マスタースレーブレプリケーションを使用して、データの可用性を向上させます。 Redisは、シングルスレッドイベントループを使用して接続とコマンドを処理して、データの原子性と一貫性を確保します。 Redisは、キーの有効期限を設定し、怠zyな削除メカニズムを使用して有効期限キーを削除します。

Redis-Serverが見つからない問題を解決するための手順:インストールを確認して、Redisが正しくインストールされていることを確認します。環境変数Redis_hostとredis_portを設定します。 Redis Server Redis-Serverを起動します。サーバーがRedis-Cli pingを実行しているかどうかを確認します。

Redisのすべてのキーを表示するには、3つの方法があります。キーコマンドを使用して、指定されたパターンに一致するすべてのキーを返します。スキャンコマンドを使用してキーを繰り返し、キーのセットを返します。情報コマンドを使用して、キーの総数を取得します。

Redisバージョン番号を表示するには、次の3つの方法を使用できます。(1)情報コマンドを入力し、(2) - versionオプションでサーバーを起動し、(3)構成ファイルを表示します。

Redisソースコードを理解する最良の方法は、段階的に進むことです。Redisの基本に精通してください。開始点として特定のモジュールまたは機能を選択します。モジュールまたは機能のエントリポイントから始めて、行ごとにコードを表示します。関数コールチェーンを介してコードを表示します。 Redisが使用する基礎となるデータ構造に精通してください。 Redisが使用するアルゴリズムを特定します。

Redis指令を使用するには、次の手順が必要です。Redisクライアントを開きます。コマンド(動詞キー値)を入力します。必要なパラメーターを提供します(指示ごとに異なります)。 Enterを押してコマンドを実行します。 Redisは、操作の結果を示す応答を返します(通常はOKまたは-ERR)。

Redisパスワードを設定するには、構成ファイルのexaclePassを必要なパスワードに変更し、サービスを再起動します。パスワードで保護されたインスタンスに接続するときは、Redis-CLIコマンドを使用して、ホスト名/IP、ポート、およびパスワードを提供します。パスワードのセキュリティに注意を払い、セキュリティを改善するために定期的に変更してください。
