How etcd raft Implements Linearizable Read

Date: 2020-07-08


Put simply, a linearizable read is a read request that must return the latest committed data and never stale data.

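In a system that uses the Raft protocol to keep multiple replicas strongly consistent, both reads and writes can be satisfied by running them through a round of the Raft protocol. In real systems, however, reads usually make up a large share of the traffic, and if every read has to go through a Raft round that persists a log entry to disk, performance suffers badly. Optimizing read performance is therefore critical.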

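From the Raft protocol we know that the leader holds the latest state, so if all reads go to the leader, the leader can return the result to the client directly. However, under a network partition combined with significant clock drift between machines, this can return old data, i.e. a stale read, which violates linearizable read. For example, suppose a partition separates the leader from the other followers, the followers have elected a new leader, and the new leader has already committed a batch of data. Because clocks on different machines run at different speeds, the old leader may not notice that its lease has expired. It still believes it is the legitimate leader and answers the client directly, which results in a stale read.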

The author of Raft proposed a scheme called ReadIndex:

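When the leader receives a read request, it records the current commit index as the read index. Before returning the result to the client, the leader must first confirm that it really is still the leader, which it does by sending one round of heartbeats to all other peers. If a majority responds, then the node was still the leader at least at the moment the read request arrived. After that it only needs to wait until the state machine has applied up to the recorded commit index, and then it can return the result.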
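The scheme can be sketched as a small single-process model. Everything here is an assumption made for illustration: the leader type, its fields, errNotLeader and the simulated heartbeat functions are hypothetical and much simpler than the real etcd code, which is asynchronous.

```go
package main

import (
	"errors"
	"fmt"
)

// leader is a toy model of a Raft leader; all names are hypothetical.
type leader struct {
	commitIndex  uint64
	appliedIndex uint64
	peers        []func() bool // each func simulates one heartbeat round trip
}

var errNotLeader = errors.New("leadership not confirmed by a majority")

// readIndex returns the index at which a read may safely be served.
func (l *leader) readIndex() (uint64, error) {
	ri := l.commitIndex // remember the commit index at arrival time

	acks := 1 // the leader counts itself
	for _, heartbeat := range l.peers {
		if heartbeat() {
			acks++
		}
	}
	quorum := (len(l.peers)+1)/2 + 1
	if acks < quorum { // a majority must still accept this node as leader
		return 0, errNotLeader
	}

	// Wait until the state machine has applied up to ri.
	// The toy model applies entries synchronously.
	for l.appliedIndex < ri {
		l.appliedIndex++
	}
	return ri, nil
}

func main() {
	ok := func() bool { return true }
	lost := func() bool { return false }

	healthy := &leader{commitIndex: 7, appliedIndex: 5, peers: []func() bool{ok, lost}}
	fmt.Println(healthy.readIndex())

	isolated := &leader{commitIndex: 7, appliedIndex: 5, peers: []func() bool{lost, lost}}
	fmt.Println(isolated.readIndex())
}
```

In a three-node cluster one peer ack plus the leader itself is a majority, so the first read succeeds at index 7, while the partitioned leader refuses to answer instead of risking a stale read.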

func (n *node) ReadIndex(ctx context.Context, rctx []byte) error {
    return n.step(ctx, pb.Message{Type: pb.MsgReadIndex, Entries: []pb.Entry{{Data: rctx}}})
}

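When handling a read request, the application goroutine calls this function. The rctx parameter acts as the ID of the read request and must be globally unique. step pushes a MsgReadIndex message into recvc, and the goroutine running the node's entry function

func (n *node) run(r *raft)

takes this message out of recvc and processes it: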

case m := <-n.recvc:
    // filter out response message from unknown From.
    if _, ok := r.prs[m.From]; ok || !IsResponseMsg(m.Type) {
        r.Step(m) // raft never returns an error
    }
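The hand-off between the caller and the run loop follows a common Go pattern: one goroutine owns the state and everyone else talks to it through a channel. A minimal sketch, assuming a hypothetical message type and a toy node that only records what it saw (the real node carries far more state and channels):

```go
package main

import "fmt"

// message is a hypothetical, stripped-down stand-in for pb.Message.
type message struct {
	Type string
	Ctx  []byte
}

// node is a toy model: recvc carries messages into the run loop.
type node struct {
	recvc chan message
	done  chan struct{}
	seen  []string // touched only by the run goroutine until done is closed
}

func newNode() *node {
	return &node{recvc: make(chan message), done: make(chan struct{})}
}

// step hands a message to the run loop and blocks until it is accepted.
func (n *node) step(m message) { n.recvc <- m }

// run is the single goroutine that processes messages in arrival order.
func (n *node) run() {
	defer close(n.done)
	for m := range n.recvc {
		n.seen = append(n.seen, m.Type)
	}
}

func main() {
	n := newNode()
	go n.run()

	n.step(message{Type: "MsgReadIndex", Ctx: []byte("req-1")})

	close(n.recvc) // shut the loop down
	<-n.done       // wait for run to finish before reading n.seen
	fmt.Println(n.seen)
}
```

Because only the run goroutine mutates the state, no lock is needed around it, which is the main appeal of this design.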

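Step(m) eventually calls step(m) on the raft struct. step is a function-valued field (effectively a function pointer) that, depending on the node's role, runs stepLeader(), stepFollower() or stepCandidate().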
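The role-based dispatch can be illustrated with a function-valued field that is swapped on every role change. This is a simplified sketch: the raft struct below, the string messages and the become* helpers are stand-ins invented for the example, and the handlers only label the message instead of doing real work.

```go
package main

import "fmt"

// raft is a toy stand-in for the real struct; only the dispatch idea is modelled.
type raft struct {
	step func(r *raft, msg string) string
}

func stepLeader(r *raft, msg string) string    { return "leader: " + msg }
func stepFollower(r *raft, msg string) string  { return "follower: " + msg }
func stepCandidate(r *raft, msg string) string { return "candidate: " + msg }

// Changing role just swaps the function stored in the step field.
func (r *raft) becomeLeader()    { r.step = stepLeader }
func (r *raft) becomeFollower()  { r.step = stepFollower }
func (r *raft) becomeCandidate() { r.step = stepCandidate }

// Step is the single entry point; it delegates to whichever handler is installed.
func (r *raft) Step(msg string) string { return r.step(r, msg) }

func main() {
	r := &raft{}
	r.becomeFollower()
	fmt.Println(r.Step("MsgReadIndex"))

	r.becomeCandidate()
	r.becomeLeader()
	fmt.Println(r.Step("MsgReadIndex"))
}
```

The caller never branches on the role; the same Step call reaches a different handler after each transition.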

If the node is the leader, the relevant part of stepLeader() is:

    case pb.MsgReadIndex:
        if r.raftLog.zeroTermOnErrCompacted(r.raftLog.term(r.raftLog.committed)) != r.Term {
            // Reject read only request when this leader has not committed any log entry at its term.
            return
        }
        if r.quorum() > 1 {
            switch r.readOnly.option {
            case ReadOnlySafe:
                r.readOnly.addRequest(r.raftLog.committed, m)
                r.bcastHeartbeatWithCtx(m.Entries[0].Data)
            case ReadOnlyLeaseBased:
                var ri uint64
                if r.checkQuorum {
                    ri = r.raftLog.committed
                }
                if m.From == None || m.From == r.id { // from local member
                    r.readStates = append(r.readStates, ReadState{Index: r.raftLog.committed, RequestCtx: m.Entries[0].Data})
                } else {
                    r.send(pb.Message{To: m.From, Type: pb.MsgReadIndexResp, Index: ri, Entries: m.Entries})
                }
            }
        }

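First, the r.raftLog.zeroTermOnErrCompacted check verifies that the leader has committed an entry in its current term. Section 5.4 (Safety) of the short Raft paper explains why this is needed and what goes wrong without it, and gives a counterexample.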
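The term check can be pictured with a toy log. The entry type and committedInCurrentTerm below are hypothetical names for this sketch. A missing entry is treated as "no match", which mirrors the idea of falling back to term 0 when the entry has been compacted.

```go
package main

import "fmt"

// entry is a hypothetical, minimal log entry.
type entry struct{ Term, Index uint64 }

// committedInCurrentTerm reports whether the entry at the commit index was
// written in the leader's current term. A missing entry (for example, one
// that was compacted away) counts as term 0 and therefore never matches.
func committedInCurrentTerm(log []entry, committed, currentTerm uint64) bool {
	for _, e := range log {
		if e.Index == committed {
			return e.Term == currentTerm
		}
	}
	return false
}

func main() {
	// A leader elected in term 3 inherits entries that were written in term 2.
	log := []entry{{Term: 2, Index: 1}, {Term: 2, Index: 2}, {Term: 2, Index: 3}}
	fmt.Println(committedInCurrentTerm(log, 3, 3)) // false: reads are rejected

	// Once an entry from term 3 is committed, reads can be accepted.
	log = append(log, entry{Term: 3, Index: 4})
	fmt.Println(committedInCurrentTerm(log, 4, 3)) // true
}
```

Until the check passes, the new leader cannot be sure its commit index reflects everything earlier leaders committed, so rejecting the read is the safe choice.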

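Second, the ReadIndex scheme discussed in this article corresponds to the ReadOnlySafe option branch. There, addRequest(...) saves the commit index at the time the read request arrived and maintains some bookkeeping state, while bcastHeartbeatWithCtx(...) prepares the MsgHeartbeat messages to be sent to the peers. When the node receives a heartbeat response, MsgHeartbeatResp, it is handled as follows.

Only the code relevant to this logic is kept: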

    case pb.MsgHeartbeatResp:
        if r.readOnly.option != ReadOnlySafe || len(m.Context) == 0 {
            return
        }
        ackCount := r.readOnly.recvAck(m)
        if ackCount < r.quorum() {
            return
        }
        rss := r.readOnly.advance(m)
        for _, rs := range rss {
            req := rs.req
            if req.From == None || req.From == r.id { // from local member
                r.readStates = append(r.readStates, ReadState{Index: rs.index, RequestCtx: req.Entries[0].Data})
            } else {
                r.send(pb.Message{To: req.From, Type: pb.MsgReadIndexResp, Index: rs.index, Entries: req.Entries})
            }
        }

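Processing continues only when the ReadOnlySafe option is in use. Once heartbeat responses from a majority have been received, the commit index recorded for the corresponding read request and its request ID are taken from the saved state and filled into a ReadState. The ReadState struct looks like this: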

type ReadState struct {
    Index      uint64
    RequestCtx []byte
}

As you can see, a ReadState holds the raft commit index at the moment the read request reached the node, together with the request ID.
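The bookkeeping behind addRequest, recvAck and advance can be sketched roughly as follows. This is an assumption-laden toy, not the etcd type: it uses string contexts and plain peer IDs instead of messages, and the readIndexStatus and readOnly definitions here are written for the example only.

```go
package main

import "fmt"

// readIndexStatus tracks one pending read (hypothetical, simplified).
type readIndexStatus struct {
	index uint64          // commit index when the request arrived
	acks  map[uint64]bool // peers that acknowledged the heartbeat
}

type readOnly struct {
	pending map[string]*readIndexStatus
	queue   []string // request contexts in arrival order
}

func newReadOnly() *readOnly {
	return &readOnly{pending: make(map[string]*readIndexStatus)}
}

// addRequest remembers the commit index for a new request context.
func (ro *readOnly) addRequest(index uint64, ctx string) {
	if _, ok := ro.pending[ctx]; ok {
		return // duplicate context: keep the first record
	}
	ro.pending[ctx] = &readIndexStatus{index: index, acks: make(map[uint64]bool)}
	ro.queue = append(ro.queue, ctx)
}

// recvAck records a peer's ack and returns the ack count including the leader.
func (ro *readOnly) recvAck(from uint64, ctx string) int {
	rs, ok := ro.pending[ctx]
	if !ok {
		return 0
	}
	rs.acks[from] = true
	return len(rs.acks) + 1
}

// advance releases ctx and every request queued before it, returning their indexes.
func (ro *readOnly) advance(ctx string) []uint64 {
	if _, ok := ro.pending[ctx]; !ok {
		return nil
	}
	var released []uint64
	for i, c := range ro.queue {
		released = append(released, ro.pending[c].index)
		delete(ro.pending, c)
		if c == ctx {
			ro.queue = ro.queue[i+1:]
			break
		}
	}
	return released
}

func main() {
	ro := newReadOnly()
	ro.addRequest(5, "req-a")
	ro.addRequest(6, "req-b")

	// In a three-node cluster, one peer ack plus the leader is a majority.
	if ro.recvAck(2, "req-b") >= 2 {
		fmt.Println(ro.advance("req-b")) // [5 6]
	}
}
```

Releasing earlier requests together with the confirmed one is safe in this model because a heartbeat sent later confirms leadership for every request that arrived before it.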

The ReadState is then appended to the readStates slice in the raft struct. That slice is included in the Ready struct, which is popped out of readyc for the application to consume.

Let's look at how etcdserver uses it:

First, in the goroutine that consumes Ready:

if len(rd.ReadStates) != 0 {
    select {
    case r.readStateC <- rd.ReadStates[len(rd.ReadStates)-1]:
    case <-time.After(internalTimeout):
        plog.Warningf("timed out sending read state")
    case <-r.stopped:
        return
    }
}

The key point here is that the ReadState from Ready (only the last one in the batch) is put into readStateC, which is a channel with a buffer size of 1.

Then, in another goroutine in etcdserver that runs linearizableReadLoop():

// Issue ReadIndex; ctx is the request ID.
if err := s.r.ReadIndex(cctx, ctx); err != nil {
    cancel()
    if err == raft.ErrStopped {
        return
    }
    plog.Errorf("failed to get read index from raft: %v", err)
    nr.notify(err)
    continue
}

// Wait for the ReadState matching this request ID to come out of readStateC.
for !timeout && !done {
    select {
    case rs = <-s.r.readStateC:
        done = bytes.Equal(rs.RequestCtx, ctx)
        if !done {
            // a previous request might time out. now we should ignore the response of it and
            // continue waiting for the response of the current requests.
            plog.Warningf("ignored out-of-date read index response (want %v, got %v)", rs.RequestCtx, ctx)
        }
    case <-time.After(s.Cfg.ReqTimeout()):
        plog.Warningf("timed out waiting for read index response")
        nr.notify(ErrTimeout)
        timeout = true
    case <-s.stopping:
        return
    }
}
if !done {
    continue
}

// Wait until the applied index is greater than or equal to the commit index.
if ai := s.getAppliedIndex(); ai < rs.Index {
    select {
    case <-s.applyWait.Wait(rs.Index):
    case <-s.stopping:
        return
    }
}
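The final wait can be pictured as a small "notify me when index N has been applied" helper. The applyWait type below is a hypothetical sketch written for this article's flow. It borrows the Wait and Trigger names from the snippet above but makes no claim about how etcd implements them.

```go
package main

import (
	"fmt"
	"sync"
)

// applyWait lets readers block until a given index has been applied.
// Hypothetical sketch; not the etcd implementation.
type applyWait struct {
	mu      sync.Mutex
	applied uint64
	waiters map[uint64][]chan struct{}
}

func newApplyWait() *applyWait {
	return &applyWait{waiters: make(map[uint64][]chan struct{})}
}

// Wait returns a channel that is closed once the applied index reaches index.
func (w *applyWait) Wait(index uint64) <-chan struct{} {
	w.mu.Lock()
	defer w.mu.Unlock()
	ch := make(chan struct{})
	if w.applied >= index {
		close(ch) // already applied: the caller does not block
		return ch
	}
	w.waiters[index] = append(w.waiters[index], ch)
	return ch
}

// Trigger records a new applied index and wakes every waiter at or below it.
func (w *applyWait) Trigger(applied uint64) {
	w.mu.Lock()
	defer w.mu.Unlock()
	if applied > w.applied {
		w.applied = applied
	}
	for idx, chans := range w.waiters {
		if idx <= w.applied {
			for _, ch := range chans {
				close(ch)
			}
			delete(w.waiters, idx)
		}
	}
}

func main() {
	w := newApplyWait()
	ready := w.Wait(7) // the read index returned by ReadIndex

	w.Trigger(5) // applied index 5: not enough yet
	select {
	case <-ready:
		fmt.Println("unexpected: served too early")
	default:
		fmt.Println("still waiting")
	}

	w.Trigger(7) // applied index has caught up
	<-ready
	fmt.Println("read can be served")
}
```

Closing a channel wakes every goroutine waiting on it, so several reads that share the same index are released together.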

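That completes the ReadIndex flow. To summarize, there are just four steps:

The leader checks whether it has committed an entry in the current term.
The leader records the current commit index, then broadcasts a heartbeat to all peers.
A response from a majority means it was still the leader when the read request arrived; it then waits until the apply index is greater than or equal to the recorded commit index.
The result is returned.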
