如何使用 etcd Raft 库构建自己的分布式 KV 存储系统-Golang-PHP中文网

How to Build Your Own Distributed KV Storage System Using the etcd Raft Library

介绍

raftexample是etcd提供的示例，演示了etcd raft共识算法库的使用。 raftexample 最终实现了一个提供 REST API 的分布式键值存储服务。

本文将对raftexample的代码进行阅读和分析，希望能够帮助读者更好地理解如何使用etcd raft库以及raft库的实现逻辑。

建筑学

raftexample的架构非常简单，主要文件如下：

main.go: 负责组织 raft 模块、httpapi 模块、kvstore 模块之间的交互；
raft.go： 负责与raft库交互，包括提交提案、接收需要发送的RPC消息、进行网络传输等；
httpapi.go： 负责提供REST API，作为用户请求的入口；
kvstore.go: 负责持久化存储提交的日志条目，相当于raft协议中的状态机。

写请求的处理流程

写入请求通过 HTTP PUT 请求到达 httpapi 模块的 ServeHTTP 方法。

curl -L http://127.0.0.1:12380/key -XPUT -d value

登录后复制

通过switch匹配到HTTP请求方法后，进入PUT方法处理流程：

从HTTP请求体中读取内容（即值）；
通过kvstore模块的Propose方法构建提案（添加以key为key、value为value的键值对）；
由于没有数据可返回，所以响应客户端204 StatusNoContent；

通过 raft 算法库提供的 Propose 方法将提案提交到 raft 算法库。

提案的内容可以是添加新的键值对、更新现有的键值对等

// httpapi.go
v, err := io.ReadAll(r.Body)
if err != nil {
    log.Printf("Failed to read on PUT (%v)\n", err)
    http.Error(w, "Failed on PUT", http.StatusBadRequest)
    return
}
h.store.Propose(key, string(v))
w.WriteHeader(http.StatusNoContent)

登录后复制

接下来，让我们看看kvstore模块的Propose方法，看看提案是如何构造和处理的。

在Propose方法中，我们首先使用gob对要写入的键值对进行编码，然后将编码内容传递给proposeC，proposeC是负责将kvstore模块构建的proposal传输到raft模块的通道。

// kvstore.go
func (s *kvstore) Propose(k string, v string) {
    var buf strings.Builder
    if err := gob.NewEncoder(&buf).Encode(kv{k, v}); err != nil {
        log.Fatal(err)
    }
    s.proposeC <- buf.String()
}

登录后复制

kvstore构造并传递给proposeC的proposal由raft模块中的serveChannels方法接收和处理。

在确认proposeC尚未关闭后，raft模块使用raft算法库提供的Propose方法将proposal提交给raft算法库进行处理。

// raft.go
select {
    case prop, ok := <-rc.proposeC:
    if !ok {
        rc.proposeC = nil
    } else {
        rc.node.Propose(context.TODO(), []byte(prop))
    }

登录后复制

提案提交后，遵循raft算法流程。提案最终将转发到领导节点（如果当前节点不是领导节点，并且您允许追随者转发提案，由 DisableProposalForwarding 配置控制）。 Leader 会将提案作为日志条目添加到其 raft 日志中，并与其他 follower 节点同步。被视为已提交后，将应用到状态机并将结果返回给用户。

但是，由于etcd raft库本身不处理节点之间的通信、追加到raft日志、应用到状态机等，所以raft库只准备这些操作所需的数据。实际操作必须由我们来执行。

因此，我们需要从 raft 库接收这些数据，并根据其类型进行相应的处理。 Ready方法返回一个只读通道，通过该通道我们可以接收需要处理的数据。

需要注意的是，接收到的数据包含多个字段，例如要应用的快照、要附加到 raft 日志的日志条目、要通过网络传输的消息等。

继续我们的写请求示例（Leader 节点），收到相应数据后，我们需要持久保存快照、HardState 和 Entries，以处理服务器崩溃引起的问题（例如，一个 follower 为多个候选人投票）。 HardState 和 Entries 共同构成了本文中提到的所有服务器上的持久状态。持久保存它们后，我们可以应用快照并追加到 raft 日志中。

Since we are currently the leader node, the raft library will return MsgApp type messages to us (corresponding to AppendEntries RPC in the paper). We need to send these messages to the follower nodes. Here, we use the rafthttp provided by etcd for node communication and send the messages to follower nodes using the Send method.

// raft.go
case rd := <-rc.node.Ready():
    if !raft.IsEmptySnap(rd.Snapshot) {
        rc.saveSnap(rd.Snapshot)
    }
    rc.wal.Save(rd.HardState, rd.Entries)
    if !raft.IsEmptySnap(rd.Snapshot) {
        rc.raftStorage.ApplySnapshot(rd.Snapshot)
        rc.publishSnapshot(rd.Snapshot)
    }
    rc.raftStorage.Append(rd.Entries)
    rc.transport.Send(rc.processMessages(rd.Messages))
    applyDoneC, ok := rc.publishEntries(rc.entriesToApply(rd.CommittedEntries))
    if !ok {
        rc.stop()
        return
    }
    rc.maybeTriggerSnapshot(applyDoneC)
    rc.node.Advance()

登录后复制

Next, we use the publishEntries method to apply the committed raft log entries to the state machine. As mentioned earlier, in raftexample, the kvstore module acts as the state machine. In the publishEntries method, we pass the log entries that need to be applied to the state machine to commitC. Similar to the earlier proposeC, commitC is responsible for transmitting the log entries that the raft module has deemed committed to the kvstore module for application to the state machine.

// raft.go
rc.commitC <- &commit{data, applyDoneC}

登录后复制

In the readCommits method of the kvstore module, messages read from commitC are gob-decoded to retrieve the original key-value pairs, which are then stored in a map structure within the kvstore module.

// kvstore.go
for commit := range commitC {
    ...
    for _, data := range commit.data {
        var dataKv kv
        dec := gob.NewDecoder(bytes.NewBufferString(data))
        if err := dec.Decode(&dataKv); err != nil {
            log.Fatalf("raftexample: could not decode message (%v)", err)
        }
        s.mu.Lock()
        s.kvStore[dataKv.Key] = dataKv.Val
        s.mu.Unlock()
    }
    close(commit.applyDoneC)
}

登录后复制

Returning to the raft module, we use the Advance method to notify the raft library that we have finished processing the data read from the Ready channel and are ready to process the next batch of data.

Earlier, on the leader node, we sent MsgApp type messages to the follower nodes using the Send method. The follower node's rafthttp listens on the corresponding port to receive requests and return responses. Whether it's a request received by a follower node or a response received by a leader node, it will be submitted to the raft library for processing through the Step method.

raftNode implements the Raft interface in rafthttp, and the Process method of the Raft interface is called to handle the received request content (such as MsgApp messages).

// raft.go
func (rc *raftNode) Process(ctx context.Context, m raftpb.Message) error {
    return rc.node.Step(ctx, m)
}

登录后复制

The above describes the complete processing flow of a write request in raftexample.

Summary

This concludes the content of this article. By outlining the structure of raftexample and detailing the processing flow of a write request, I hope to help you better understand how to use the etcd raft library to build your own distributed KV storage service.

If there are any mistakes or issues, please feel free to comment or message me directly. Thank you.