Put simply, a linearizable read means that a read request must return the most recently committed data and never stale data.
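From the Raft protocol we know that the leader holds the latest state, so if every read request goes through the leader, it would seem the leader can return the result to the client directly. However, under a network partition combined with significant clock drift between machines, this can return old data, i.e. a stale read, which violates linearizability. For example, suppose a network partition separates the leader from the other followers, the followers have elected a new leader, and the new leader has already committed a batch of data. Because the clocks on different machines run at different speeds, the old leader may not notice that its lease has expired. Still believing it is the legitimate leader, it returns the result to the client directly and thereby causes a stale read.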
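When the leader receives a read request, it records the current commit index as the read index. Before returning the result to the client, the leader must first confirm that it really is still the leader. It does so by sending one round of heartbeats to all other peers. If a majority respond, then at least at the moment the read request reached this node, the node was still the leader. From there it only has to wait until the state machine has applied entries up to the recorded commit index, and then it can return the result.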
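The steps just described can be sketched in a few lines. This is a toy model under stated assumptions, not etcd's implementation: the leader type, its heartbeat and applyUpTo callbacks, and errNotLeader are all hypothetical names, and the heartbeat round is modelled as a synchronous call that returns the number of peers that acknowledged.

```go
package main

import "errors"

// errNotLeader is returned when a heartbeat round fails to reach a quorum.
var errNotLeader = errors.New("leadership not confirmed by a quorum")

// leader is a toy stand-in for a Raft leader; every name here is hypothetical.
type leader struct {
	commitIndex  uint64
	appliedIndex uint64
	clusterSize  int
	// heartbeat sends one heartbeat round and returns how many peers
	// (not counting the leader itself) acknowledged it.
	heartbeat func() int
	// applyUpTo blocks until the state machine has applied index idx.
	applyUpTo func(idx uint64)
}

// readIndex returns the index a read must wait for, or an error if the
// node could not prove that it is still the leader.
func (l *leader) readIndex() (uint64, error) {
	ri := l.commitIndex // step 1: remember the commit index as the read index

	acks := l.heartbeat() + 1 // step 2: confirm leadership, counting ourselves
	if acks < l.clusterSize/2+1 {
		return 0, errNotLeader
	}

	if l.appliedIndex < ri { // step 3: let the state machine catch up
		l.applyUpTo(ri)
		l.appliedIndex = ri
	}
	return ri, nil // step 4: the read can now be served from local state
}
```

Note that the read index is captured before the heartbeat round, so entries committed while the round is in flight do not have to be applied before this particular read is answered.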
    func (n *node) ReadIndex(ctx context.Context, rctx []byte) error {
        return n.step(ctx, pb.Message{Type: pb.MsgReadIndex, Entries: []pb.Entry{{Data: rctx}}})
    }
    func (n *node) run(r *raft) {
        // ...
        case m := <-n.recvc:
            // filter out response message from unknown From.
            if _, ok := r.prs[m.From]; ok || !IsResponseMsg(m.Type) {
                r.Step(m) // raft never returns an error
            }
        // ...
    }
    case pb.MsgReadIndex:
        if r.raftLog.zeroTermOnErrCompacted(r.raftLog.term(r.raftLog.committed)) != r.Term {
            // Reject read only request when this leader has not committed any log entry at its term.
            return
        }
        if r.quorum() > 1 {
            switch r.readOnly.option {
            case ReadOnlySafe:
                r.readOnly.addRequest(r.raftLog.committed, m)
                r.bcastHeartbeatWithCtx(m.Entries[0].Data)
            case ReadOnlyLeaseBased:
                var ri uint64
                if r.checkQuorum {
                    ri = r.raftLog.committed
                }
                if m.From == None || m.From == { // from local member
                    r.readStates = append(r.readStates, ReadState{Index: r.raftLog.committed, RequestCtx: m.Entries[0].Data})
                } else {
                    r.send(pb.Message{To: m.From, Type: pb.MsgReadIndexResp, Index: ri, Entries: m.Entries})
                }
            }
        }
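First, the r.raftLog.zeroTermOnErrCompacted check verifies that the leader has already committed an entry in its current term. Section 5.4 (Safety) of the short Raft paper explains why this is required and what goes wrong without it, and gives a counterexample.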
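Second, the ReadIndex scheme discussed in this article corresponds to the ReadOnlySafe option branch. In that branch, addRequest(...) saves the commit index as of the moment the read request arrived and maintains some bookkeeping state, while bcastHeartbeatWithCtx(...) prepares the MsgHeartbeat messages to be sent to the peers. When the node receives a heartbeat response, MsgHeartbeatResp, it handles it as follows: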
    case pb.MsgHeartbeatResp:
        if r.readOnly.option != ReadOnlySafe || len(m.Context) == 0 {
            return
        }
        ackCount := r.readOnly.recvAck(m)
        if ackCount < r.quorum() {
            return
        }
        rss := r.readOnly.advance(m)
        for _, rs := range rss {
            req := rs.req
            if req.From == None || req.From == { // from local member
                r.readStates = append(r.readStates, ReadState{Index: rs.index, RequestCtx: req.Entries[0].Data})
            } else {
                r.send(pb.Message{To: req.From, Type: pb.MsgReadIndexResp, Index: rs.index, Entries: req.Entries})
            }
        }
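Processing continues only under the ReadOnlySafe option. Once heartbeat responses from a majority have been received, the commit index recorded when the read request arrived and the request ID are taken out of the saved state and filled into a ReadState, whose structure is: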
    type ReadState struct {
        Index      uint64
        RequestCtx []byte
    }
As you can see, a ReadState holds the Raft commit index as of the moment the read request reached the node, together with the request ID.
The ReadState is then appended to the readStates slice in the raft struct. readStates is included in the Ready struct, which the application receives from readyc.
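The bookkeeping behind addRequest, recvAck and advance can be approximated with a small queue. The sketch below is a simplified, hypothetical analogue and not etcd's readOnly struct: readQueue, pendingRead and readState are invented names, request contexts are plain strings, and the leader's own vote is added inside ack. It assumes that a quorum ack for one request also releases every request queued before it, since the same heartbeat round confirms leadership for those earlier arrivals as well.

```go
package main

// pendingRead tracks one read request waiting for leadership confirmation.
type pendingRead struct {
	index uint64          // commit index when the request arrived
	acks  map[uint64]bool // IDs of peers that acknowledged the heartbeat
}

// readState mirrors the two fields of ReadState, with a string context.
type readState struct {
	Index      uint64
	RequestCtx string
}

// readQueue is a hypothetical, simplified pending-read tracker.
type readQueue struct {
	pending map[string]*pendingRead // keyed by request context (request ID)
	order   []string                // request IDs in arrival order
}

func newReadQueue() *readQueue {
	return &readQueue{pending: make(map[string]*pendingRead)}
}

// add records a read request together with the commit index at arrival.
func (q *readQueue) add(ctx string, commitIndex uint64) {
	if _, ok := q.pending[ctx]; ok {
		return // duplicate request ID, keep the first record
	}
	q.pending[ctx] = &pendingRead{index: commitIndex, acks: make(map[uint64]bool)}
	q.order = append(q.order, ctx)
}

// ack records a heartbeat response from peer for request ctx and returns
// the number of acks so far, counting the leader itself.
func (q *readQueue) ack(ctx string, peer uint64) int {
	p, ok := q.pending[ctx]
	if !ok {
		return 0
	}
	p.acks[peer] = true
	return len(p.acks) + 1
}

// advance releases ctx and every request queued before it.
func (q *readQueue) advance(ctx string) []readState {
	if _, ok := q.pending[ctx]; !ok {
		return nil
	}
	var out []readState
	for i, id := range q.order {
		out = append(out, readState{Index: q.pending[id].index, RequestCtx: id})
		delete(q.pending, id)
		if id == ctx {
			q.order = q.order[i+1:]
			break
		}
	}
	return out
}
```

A caller would invoke add when a read arrives, ack for each heartbeat response, and advance once the ack count reaches the quorum size.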
    if len(rd.ReadStates) != 0 {
        select {
        case r.readStateC <- rd.ReadStates[len(rd.ReadStates)-1]:
        case <-time.After(internalTimeout):
            plog.Warningf("timed out sending read state")
        case <-r.stopped:
            return
        }
    }
    // Issue ReadIndex; ctx carries the request ID.
    if err := s.r.ReadIndex(cctx, ctx); err != nil {
        cancel()
        if err == raft.ErrStopped {
            return
        }
        plog.Errorf("failed to get read index from raft: %v", err)
        nr.notify(err)
        continue
    }

    // Wait for the ReadState matching this request ID to arrive on readStateC.
    for !timeout && !done {
        select {
        case rs = <-s.r.readStateC:
            done = bytes.Equal(rs.RequestCtx, ctx)
            if !done {
                // a previous request might time out. now we should ignore the response of it and
                // continue waiting for the response of the current requests.
                plog.Warningf("ignored out-of-date read index response (want %v, got %v)", rs.RequestCtx, ctx)
            }
        case <-time.After(s.Cfg.ReqTimeout()):
            plog.Warningf("timed out waiting for read index response")
            nr.notify(ErrTimeout)
            timeout = true
        case <-s.stopping:
            return
        }
    }
    if !done {
        continue
    }

    // Wait until the applied index is at least the commit index recorded in rs.Index.
    if ai := s.getAppliedIndex(); ai < rs.Index {
        select {
        case <-s.applyWait.Wait(rs.Index):
        case <-s.stopping:
            return
        }
    }
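The last step above blocks on a channel until the state machine has caught up with the read index. One way such a waiter could be built is sketched below. This is an assumption-laden illustration rather than etcd's applyWait: applyWaiter, wait and trigger are hypothetical names, and the sketch assumes the apply loop calls trigger with a non-decreasing index after each batch is applied.

```go
package main

import "sync"

// applyWaiter lets readers block until a given log index has been applied.
// Hypothetical, simplified analogue of the applyWait used above.
type applyWaiter struct {
	mu      sync.Mutex
	applied uint64
	waiters map[uint64][]chan struct{}
}

func newApplyWaiter() *applyWaiter {
	return &applyWaiter{waiters: make(map[uint64][]chan struct{})}
}

// wait returns a channel that is closed once index idx has been applied.
// If idx is already applied, the returned channel is closed immediately.
func (w *applyWaiter) wait(idx uint64) <-chan struct{} {
	w.mu.Lock()
	defer w.mu.Unlock()
	ch := make(chan struct{})
	if idx <= w.applied {
		close(ch)
		return ch
	}
	w.waiters[idx] = append(w.waiters[idx], ch)
	return ch
}

// trigger records that every entry up to idx has been applied and wakes
// all readers waiting for an index at or below idx.
func (w *applyWaiter) trigger(idx uint64) {
	w.mu.Lock()
	defer w.mu.Unlock()
	if idx <= w.applied {
		return
	}
	w.applied = idx
	for i, chans := range w.waiters {
		if i <= idx {
			for _, ch := range chans {
				close(ch)
			}
			delete(w.waiters, i)
		}
	}
}
```

Closing the channel rather than sending on it lets any number of readers wait on the same index, and lets a reader combine the wait with a shutdown channel in a single select, as the code above does with s.stopping.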
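etcd implements read-only queries not only on the leader but also on followers. The principle is the same, except that when a read request reaches a follower, the follower has to ask the leader for the commit index. Before returning the commit index to the follower, the leader likewise goes through the ReadIndex flow described above, because it too must check whether it is still the leader. The code is not repeated here.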
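Since the follower path is described only in prose, here is a rough sketch of its shape. All names are hypothetical: follower, askLeader and waitApplied are invented for this illustration, the leader round trip is modelled as a plain function call, and the state machine is a string map. It is not etcd's code.

```go
package main

import "errors"

// errNoLeader is a hypothetical error for when the leader cannot be reached
// or cannot confirm its leadership.
var errNoLeader = errors.New("no confirmed leader available")

// follower is a toy model of a follower serving reads via ReadIndex.
type follower struct {
	appliedIndex uint64
	store        map[string]string
	// askLeader returns a read index that the leader has confirmed with a
	// quorum heartbeat, or an error.
	askLeader func() (uint64, error)
	// waitApplied blocks until the local state machine has applied idx.
	waitApplied func(idx uint64)
}

// get serves a read locally after obtaining the read index from the leader
// and waiting for the local state machine to catch up to it.
func (f *follower) get(key string) (string, error) {
	ri, err := f.askLeader()
	if err != nil {
		return "", err
	}
	if f.appliedIndex < ri {
		f.waitApplied(ri)
	}
	return f.store[key], nil
}
```

The only difference from the leader path is where the read index comes from. The wait for the local applied index is what keeps the follower from answering with data older than the index the leader confirmed.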