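[Learning Source Code Together: Microservices] Netflix Eureka Source Code, Part 10: Service Shutdown and Instance Eviction, or How Long Does It Take for Other Instances to Notice That a Client Has Gone Offline?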

Date: 2020-04-22

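Preface

Recap

In the previous installment we covered how the client sends heartbeats to the server, by default once every 30 seconds. On receipt, the server updates a timestamp field on the lease in the registry, and that completes one heartbeat (lease renewal).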

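In this installment

This installment covers two topics and one question, a question I ran into for real at work.

Say there are two services, A and B, both registered with the same registry. When B goes offline, how long does it take for A to notice?

If you have wondered about this too, read on: the answer is revealed at the end of the article.

Contents:

How the client notifies the server when a service instance shuts down
Scheduled instance eviction on the server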

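Technical highlight: a time-compensation mechanism for scheduled tasks that fire late

The server-side scheduled task that detects failed instances and evicts them contains one cleverly designed detail: a time-compensation mechanism.

With any scheduled task that is meant to fire at fixed points in time, something can prevent it from running at one of those points. When it does run again, any calculation that assumes a fixed gap between runs will be off. Eureka compensates for this when it computes the time difference, which neatly solves the problem.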

Note

Original work takes effort; if you repost this, please credit the source: 一枝花算不算浪漫

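Source code analysis

How the client notifies the server when a service instance shuts down

As before, we start from DiscoveryClient, which has a shutdown() method. Note that the listing reproduced below is the server-side renewLease()/renew() code from the previous installment rather than shutdown() itself; what shutdown() does is described right after the listing: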

@PUT
public Response renewLease(
        @HeaderParam(PeerEurekaNode.HEADER_REPLICATION) String isReplication,
        @QueryParam("overriddenstatus") String overriddenStatus,
        @QueryParam("status") String status,
        @QueryParam("lastDirtyTimestamp") String lastDirtyTimestamp) {
    boolean isFromReplicaNode = "true".equals(isReplication);
    boolean isSuccess = registry.renew(app.getName(), id, isFromReplicaNode);

    // Some code omitted

    logger.debug("Found (Renew): {} - {}; reply status={}" + app.getName(), id, response.getStatus());
    return response;
}


public boolean renew(String appName, String id, boolean isReplication) {
    RENEW.increment(isReplication);
    Map<String, Lease<InstanceInfo>> gMap = registry.get(appName);
    Lease<InstanceInfo> leaseToRenew = null;
    if (gMap != null) {
        leaseToRenew = gMap.get(id);
    }
    if (leaseToRenew == null) {
        RENEW_NOT_FOUND.increment(isReplication);
        logger.warn("DS: Registry: lease doesn't exist, registering resource: {} - {}", appName, id);
        return false;
    } else {
        InstanceInfo instanceInfo = leaseToRenew.getHolder();
        if (instanceInfo != null) {
            // touchASGCache(instanceInfo.getASGName());
            InstanceStatus overriddenInstanceStatus = this.getOverriddenInstanceStatus(
                    instanceInfo, leaseToRenew, isReplication);
            if (overriddenInstanceStatus == InstanceStatus.UNKNOWN) {
                logger.info("Instance status UNKNOWN possibly due to deleted override for instance {}"
                        + "; re-register required", instanceInfo.getId());
                RENEW_NOT_FOUND.increment(isReplication);
                return false;
            }
            if (!instanceInfo.getStatus().equals(overriddenInstanceStatus)) {
                Object[] args = {
                        instanceInfo.getStatus().name(),
                        instanceInfo.getOverriddenStatus().name(),
                        instanceInfo.getId()
                };
                logger.info(
                        "The instance status {} is different from overridden instance status {} for instance {}. "
                                + "Hence setting the status to overridden status", args);
                instanceInfo.setStatusWithoutDirty(overriddenInstanceStatus);
            }
        }
        renewsLastMin.increment();
        leaseToRenew.renew();
        return true;
    }
}

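What shutdown() does is simple: it releases resources, cancels the scheduled tasks, and so on. The part worth attention is how it notifies the server, and how the server then takes the instance offline. The request to the server is made in unregister(), which calls the cancel method of the Jersey-based HTTP client; that issues the @DELETE request handled by InstanceResource on the server. (Every client-to-server call we have seen so far is made as a RESTful HTTP request, which is a neat design.)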
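
The sequence just described can be sketched as follows. This is a minimal model under stated assumptions, not Eureka's code: ShutdownSketch, RegistrationTransport and the step names are all hypothetical, and the transport is a stand-in for the HTTP client that sends the DELETE request.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical, simplified model of a client shutdown sequence (not Eureka's classes).
final class ShutdownSketch {
    /** Stand-in for the HTTP transport; cancel() models the DELETE request and returns a status code. */
    interface RegistrationTransport {
        int cancel(String appName, String instanceId);
    }

    private final AtomicBoolean isShutdown = new AtomicBoolean(false);
    private final List<String> steps = new ArrayList<>();
    private final RegistrationTransport transport;
    private final String appName;
    private final String instanceId;

    ShutdownSketch(RegistrationTransport transport, String appName, String instanceId) {
        this.transport = transport;
        this.appName = appName;
        this.instanceId = instanceId;
    }

    /** Runs the shutdown sequence at most once, however many times it is called. */
    void shutdown() {
        if (!isShutdown.compareAndSet(false, true)) {
            return; // already shut down
        }
        steps.add("cancel-scheduled-tasks");                // stop heartbeat and registry-fetch timers
        int status = transport.cancel(appName, instanceId); // tell the server this instance is leaving
        steps.add("unregister:" + status);
        steps.add("close-transport");                       // release remaining resources
    }

    List<String> steps() {
        return steps;
    }

    public static void main(String[] args) {
        ShutdownSketch client = new ShutdownSketch((app, id) -> 200, "SERVICE-B", "host-1:8080");
        client.shutdown();
        client.shutdown(); // second call is a no-op
        System.out.println(client.steps());
    }
}
```

The compare-and-set guard is the design point here: a shutdown hook and an explicit call can both fire, and the server should receive only one cancel request.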

Now let's look at the entry point on the server that receives this request:

InstanceResource.cancelLease():

@DELETE
public Response cancelLease(
        @HeaderParam(PeerEurekaNode.HEADER_REPLICATION) String isReplication) {
    boolean isSuccess = registry.cancel(app.getName(), id,
            "true".equals(isReplication));

    if (isSuccess) {
        logger.debug("Found (Cancel): " + app.getName() + " - " + id);
        return Response.ok().build();
    } else {
        logger.info("Not Found (Cancel): " + app.getName() + " - " + id);
        return Response.status(Status.NOT_FOUND).build();
    }
}

Following the call further down, we reach AbstractInstanceRegistry.internalCancel:

protected boolean internalCancel(String appName, String id, boolean isReplication) {
    try {
        read.lock();
        CANCEL.increment(isReplication);
        // Look up this application's lease map in the registry by appName
        Map<String, Lease<InstanceInfo>> gMap = registry.get(appName);
        Lease<InstanceInfo> leaseToCancel = null;
        if (gMap != null) {
            // Remove the lease from the map by instance id
            leaseToCancel = gMap.remove(id);
        }
        
        // Record the cancellation in the recently-canceled queue
        synchronized (recentCanceledQueue) {
            recentCanceledQueue.add(new Pair<Long, String>(System.currentTimeMillis(), appName + "(" + id + ")"));
        }
        InstanceStatus instanceStatus = overriddenInstanceStatusMap.remove(id);
        if (instanceStatus != null) {
            logger.debug("Removed instance id {} from the overridden map which has value {}", id, instanceStatus.name());
        }
        if (leaseToCancel == null) {
            CANCEL_NOT_FOUND.increment(isReplication);
            logger.warn("DS: Registry: cancel failed because Lease is not registered for: {}/{}", appName, id);
            return false;
        } else {
            // Call cancel() on the lease to mark it as taken offline
            leaseToCancel.cancel();
            InstanceInfo instanceInfo = leaseToCancel.getHolder();
            String vip = null;
            String svip = null;
            if (instanceInfo != null) {
                instanceInfo.setActionType(ActionType.DELETED);
                // Add this instance to the recently-changed queue
                recentlyChangedQueue.add(new RecentlyChangedItem(leaseToCancel));
                instanceInfo.setLastUpdatedTimestamp();
                vip = instanceInfo.getVIPAddress();
                svip = instanceInfo.getSecureVipAddress();
            }
            // Invalidate the registry's read-write cache
            invalidateCache(appName, vip, svip);
            logger.info("Cancelled instance {}/{} (replication={})", appName, id, isReplication);
            return true;
        }
    } finally {
        read.unlock();
    }
}

Next, Lease.cancel:

public void cancel() {
    // This only records the timestamp at which the instance went offline
    if (evictionTimestamp <= 0) {
        evictionTimestamp = System.currentTimeMillis();
    }
}

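The comments above cover it, but to summarize:

1. Take the read lock, which allows several instances to be canceled concurrently
2. Look up the application's lease map in the registry by appName
3. Remove the matching lease by instance id
4. Add the instance to recentCanceledQueue
5. Record the eviction timestamp on the Lease
6. Add the instance to recentlyChangedQueue
7. Call invalidateCache() to invalidate the registry's read-write cache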

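A few more words on steps 6 and 7. In Part 8 ([Learning Source Code Together: Microservices] Netflix Eureka Source Code, Part 8: EurekaClient Service Discovery and the Design of Registry Fetching), we saw that when a client first fetches the registry delta, the data comes from recentlyChangedQueue. It is placed in the read-write cache, then synchronized to the read-only cache, and subsequent fetches are served directly from the read-only cache.

This creates a lag. When a service goes offline, the read-write cache is updated right away but the read-only cache is not. Only when the scheduled task copies the read-write cache into the read-only cache, up to 30s later, can other clients see that the instance is gone.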
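
To make the lag concrete, here is a minimal sketch of a two-level cache, assuming a read-only level that is refreshed only by a periodic task. TwoLevelCacheSketch and its method names are hypothetical; Eureka's real response cache is considerably more involved.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical two-level cache; illustrative only, not Eureka's response cache implementation.
final class TwoLevelCacheSketch {
    private final Map<String, String> readOnly = new ConcurrentHashMap<>();
    private final Map<String, String> readWrite = new ConcurrentHashMap<>();
    private final Function<String, String> loader; // rebuilds a payload from the registry

    TwoLevelCacheSketch(Function<String, String> loader) {
        this.loader = loader;
    }

    /** Client reads: served from the read-only level, filled from the read-write level on a miss. */
    String get(String key) {
        return readOnly.computeIfAbsent(key, k -> readWrite.computeIfAbsent(k, loader));
    }

    /** Called on register or cancel: only the read-write level is invalidated. */
    void invalidate(String key) {
        readWrite.remove(key);
    }

    /** The periodic task: refresh every read-only entry from the read-write level. */
    void syncReadOnly() {
        for (String key : readOnly.keySet()) {
            readOnly.put(key, readWrite.computeIfAbsent(key, loader));
        }
    }

    public static void main(String[] args) {
        Map<String, String> registry = new ConcurrentHashMap<>();
        registry.put("apps", "A,B");
        TwoLevelCacheSketch cache = new TwoLevelCacheSketch(registry::get);

        System.out.println(cache.get("apps")); // both levels now hold "A,B"
        registry.put("apps", "A");             // B cancels its lease
        cache.invalidate("apps");
        System.out.println(cache.get("apps")); // still the stale value from the read-only level
        cache.syncReadOnly();                  // the periodic task runs
        System.out.println(cache.get("apps")); // now reflects B's removal
    }
}
```

Between invalidate() and the next syncReadOnly(), readers keep getting the stale entry, which is exactly the window in which service A still believes B is up.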

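To go with the text, here is a flow chart of EurekaClient shutdown. The red lines are the shutdown path; the black lines are the registry-fetch path through which other clients notice the shutdown:

Keep one thing in mind: this is a graceful shutdown that goes through the shutdown logic. If a service suddenly crashes on its own, how does the registry detect that it is gone? Read on.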

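Scheduled instance eviction on the server

Take the scenario mentioned above: a client service crashes without ever running its shutdown method. How does the registry notice that the instance is gone and remove it from the registry?

We know that Eureka relies on heartbeats to tell whether an instance is still alive. A service that has crashed stops sending heartbeats, so if no heartbeat arrives from an instance for a certain period, the server evicts it on the assumption that it is already down.

The entry point for this automatic detection is in EurekaBootStrap. During server start-up initialization, the call registry.openForTraffic(applicationInfoManager, registryCount) sets up a scheduled task that checks service instances (this entry point is well hidden; I only found it by reading other people's analyses online). Let's go straight to the code.

EurekaBootStrap.initEurekaServerContext():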

protected void initEurekaServerContext() throws Exception {
    // Some code omitted...
    
    int registryCount = registry.syncUp();
    registry.openForTraffic(applicationInfoManager, registryCount);
}

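We have seen this code many times. syncUp pulls the registry data from the other Eureka servers and returns registryCount, the number of instances obtained, which is then compared against the instance count of the local registry, and so on.

Next is openForTraffic, which computes expectedNumberOfRenewsPerMin, the number of heartbeats expected from all instances in one minute. (Bookmark this: the Eureka server's self-preservation mechanism uses this field, and it is covered in detail later.) The way it is set here also has a bug.

At the very end of the method is super.postInit(); this is where the scheduled task for automatic instance detection actually lives. After all that digging, such important logic turns out to be tucked away in this inconspicuous spot.

PeerAwareInstanceRegistryImpl.java: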

public int syncUp() {
    // Copy entire entry from neighboring DS node
    int count = 0;

    for (int i = 0; ((i < serverConfig.getRegistrySyncRetries()) && (count == 0)); i++) {
        if (i > 0) {
            try {
                Thread.sleep(serverConfig.getRegistrySyncRetryWaitMs());
            } catch (InterruptedException e) {
                logger.warn("Interrupted during registry transfer..");
                break;
            }
        }
        Applications apps = eurekaClient.getApplications();
        for (Application app : apps.getRegisteredApplications()) {
            for (InstanceInfo instance : app.getInstances()) {
                try {
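                    // isRegisterable: whether the instance may register with the registry this server belongs to. For non-AWS deployments this always returns true, so count ends up as the total number of instances in the neighboring registry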
                    if (isRegisterable(instance)) {
                        register(instance, instance.getLeaseInfo().getDurationInSecs(), true);
                        count++;
                    }
                } catch (Throwable t) {
                    logger.error("During DS init copy", t);
                }
            }
        }
    }
    return count;
}

@Override
public void openForTraffic(ApplicationInfoManager applicationInfoManager, int count) {
    // Renewals happen every 30 seconds and for a minute it should be a factor of 2.
    // With 20 service instances, multiplying by 2 means 40 heartbeats are expected per minute.
    // There is a bug here: count * 2 is hard-coded, presumably on the assumption of a 30-second heartbeat
    // interval (hence 2 heartbeats per minute), but the heartbeat interval is configurable.
    // On the master branch this has been changed to:
    /**
     * this.expectedNumberOfClientsSendingRenews = this.expectedNumberOfClientsSendingRenews + 1;
     * updateRenewsPerMinThreshold();
     *
     * The key part is updateRenewsPerMinThreshold:
     * this.numberOfRenewsPerMinThreshold = (int) (this.expectedNumberOfClientsSendingRenews * (60.0 / serverConfig.getExpectedClientRenewalIntervalSeconds()) * serverConfig.getRenewalPercentThreshold());
     * This reads the configured heartbeat interval and divides 60s by it.
     */
    this.expectedNumberOfRenewsPerMin = count * 2;
    // numberOfRenewsPerMinThreshold = count * 2 * 0.85; with 20 instances that is 34 heartbeats expected per minute
    this.numberOfRenewsPerMinThreshold =
            (int) (this.expectedNumberOfRenewsPerMin * serverConfig.getRenewalPercentThreshold());
    logger.info("Got " + count + " instances from neighboring DS node");
    logger.info("Renew threshold is: " + numberOfRenewsPerMinThreshold);
    this.startupTime = System.currentTimeMillis();
    if (count > 0) {
        this.peerInstancesTransferEmptyOnStartup = false;
    }
    DataCenterInfo.Name selfName = applicationInfoManager.getInfo().getDataCenterInfo().getName();
    boolean isAws = Name.Amazon == selfName;
    if (isAws && serverConfig.shouldPrimeAwsReplicaConnections()) {
        logger.info("Priming AWS connections for all replicas..");
        primeAwsReplicas(applicationInfoManager);
    }
    logger.info("Changing status to UP");
    applicationInfoManager.setInstanceStatus(InstanceStatus.UP);
    // This call starts the task that automatically evicts service instances
    super.postInit();
}

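As for syncUp, it is enough for now to know that it fetches the registry from the other servers and returns the number of registered instances; a later installment goes into more detail.
Then comes openForTraffic, whose first statement is this.expectedNumberOfRenewsPerMin = count * 2;. Here count is the total number of instances in the neighboring registry. Why multiply by 2? The field holds the number of heartbeats expected from all instances in one minute, and since renew runs every 30s by default, the code simply assumes two per minute.
You have probably spotted it already: this is an obvious bug. The renewal interval is configurable, and if it is set to 10s the factor should be 6. I checked our company code, which uses spring-cloud Finchley.RELEASE; the netflix eureka version it depends on is 1.9.2, and the problem is still there.
I also checked the master branch, where this bug has been fixed; the fix is the updateRenewsPerMinThreshold() code quoted in the comment in the listing above.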

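There are more bugs in this area; for instance, registration and cancellation adjust the count with hard-coded +2 and -2. A later article covers these in more detail.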
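
The difference between the two calculations is easy to see with numbers. The sketch below is an illustration only: RenewThresholdSketch and its method names are made up, and the formulas are taken from the listing and the master-branch snippet quoted in its comment.

```java
// Hypothetical helper comparing the hard-coded threshold with an interval-aware one.
final class RenewThresholdSketch {
    /** Variant from the listing: assumes every instance renews twice per minute. */
    static int hardCoded(int instances, double renewalPercentThreshold) {
        return (int) (instances * 2 * renewalPercentThreshold);
    }

    /** Interval-aware variant, following the master-branch formula quoted in the listing's comment. */
    static int intervalAware(int instances, int renewalIntervalSeconds, double renewalPercentThreshold) {
        return (int) (instances * (60.0 / renewalIntervalSeconds) * renewalPercentThreshold);
    }

    public static void main(String[] args) {
        // 20 instances, 85% threshold
        System.out.println(hardCoded(20, 0.85));         // 34
        System.out.println(intervalAware(20, 30, 0.85)); // 34, same as hard-coded at the default interval
        System.out.println(intervalAware(20, 10, 0.85)); // 102, what a 10s interval actually implies
    }
}
```

With a 10s interval the server really receives about 120 heartbeats a minute from 20 instances, so a hard-coded threshold of 34 is far too lenient: most instances could die before self-preservation kicks in.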

Now let's continue with the scheduled task that detects dead instances:

AbstractInstanceRegistry.java:

protected void postInit() {
    renewsLastMin.start();
    if (evictionTaskRef.get() != null) {
        evictionTaskRef.get().cancel();
    }
    evictionTaskRef.set(new EvictionTask());
    evictionTimer.schedule(evictionTaskRef.get(),
            serverConfig.getEvictionIntervalTimerInMs(),
            serverConfig.getEvictionIntervalTimerInMs());
}

class EvictionTask extends TimerTask {
    private final AtomicLong lastExecutionNanosRef = new AtomicLong(0l);

    @Override
    public void run() {
        try {
            // Get the compensation time, which may be greater than 0
            long compensationTimeMs = getCompensationTimeMs();
            logger.info("Running the evict task with compensationTime {}ms", compensationTimeMs);
            evict(compensationTimeMs);
        } catch (Throwable e) {
            logger.error("Could not run the evict task", e);
        }
    }

    /**
     * compute a compensation time defined as the actual time this task was executed since the prev iteration,
     * vs the configured amount of time for execution. This is useful for cases where changes in time (due to
     * clock skew or gc for example) causes the actual eviction task to execute later than the desired time
     * according to the configured cycle.
     */
    long getCompensationTimeMs() {
        // 1st run: read the current time, say currNanos = 20:00:00
        // 2nd run: currNanos = 20:01:00
        // 3rd run: currNanos = 20:03:00. The task should run every 60s, but a full GC or some other cause made it miss 20:02:00
        long currNanos = getCurrentTimeNano();

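        // Fetch the time this EvictionTask last ran. getAndSet atomically stores the new value and returns the previous one
        // 1st run: stores 20:00:00 in lastExecutionNanosRef, gets back 0, and the method returns 0
        // 2nd run: lastNanos is 20:00:00
        // 3rd run: lastNanos is 20:01:00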
        long lastNanos = lastExecutionNanosRef.getAndSet(currNanos);
        if (lastNanos == 0l) {
            return 0l;
        }

        // 2nd run: elapsedMs = 60s
        // 3rd run: elapsedMs = 120s
        long elapsedMs = TimeUnit.NANOSECONDS.toMillis(currNanos - lastNanos);
        // 2nd run: with the default eviction interval of 60s, compensationTime = 0
        // 3rd run: with the default eviction interval of 60s, compensationTime = 60s
        long compensationTime = elapsedMs - serverConfig.getEvictionIntervalTimerInMs();
        return compensationTime <= 0l ? 0l : compensationTime;
    }

    long getCurrentTimeNano() {  // for testing
        return System.nanoTime();
    }

}

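1. postInit schedules the EvictionTask with a period of serverConfig.getEvictionIntervalTimerInMs(), which defaults to 60s.
2. EvictionTask then runs on that schedule. It is annotated above; to walk through it again:
   2.1 First it gets the compensation time, compensationTimeMs. This value is key.
   2.2 Then it calls evict to remove instances whose leases have expired without a heartbeat.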

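Look at getCompensationTimeMs, which I have annotated in detail. Its purpose is to handle the case where the server, for whatever reason, did not run the task at its scheduled trigger point. In that case elapsedMs exceeds 60s, and the returned compensationTime is the actual delay that has to be compensated for.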
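
The three-run scenario in the comments can be reproduced with a small sketch. It follows the same logic as getCompensationTimeMs, but CompensationSketch is a hypothetical class and the clock value is passed in as a parameter (an assumption made purely so that the example is deterministic).

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-alone version of the compensation calculation with an injectable clock.
final class CompensationSketch {
    private final AtomicLong lastExecutionNanos = new AtomicLong(0L);
    private final long intervalMs;

    CompensationSketch(long intervalMs) {
        this.intervalMs = intervalMs;
    }

    /** Returns how many milliseconds later than the configured interval this run happened. */
    long compensationMs(long nowNanos) {
        long last = lastExecutionNanos.getAndSet(nowNanos); // remember this run, get the previous one
        if (last == 0L) {
            return 0L; // first run: nothing to compare against
        }
        long elapsedMs = TimeUnit.NANOSECONDS.toMillis(nowNanos - last);
        return Math.max(0L, elapsedMs - intervalMs);
    }

    public static void main(String[] args) {
        CompensationSketch task = new CompensationSketch(60_000L);
        long t0 = TimeUnit.SECONDS.toNanos(1);        // arbitrary non-zero start, "20:00:00"
        long t1 = t0 + TimeUnit.SECONDS.toNanos(60);  // on time, "20:01:00"
        long t2 = t1 + TimeUnit.SECONDS.toNanos(120); // one run missed, "20:03:00"

        System.out.println(task.compensationMs(t0)); // 0
        System.out.println(task.compensationMs(t1)); // 0
        System.out.println(task.compensationMs(t2)); // 60000
    }
}
```

The 60000 ms returned by the late run is what gets passed to evict as additionalLeaseMs, so leases are not treated as expired merely because the server itself was paused.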

Next, the evict logic:

public void evict(long additionalLeaseMs) {

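    // Whether expired leases may be evicted at all. This checks for self-preservation mode; while the server is in self-preservation, no instance is evicted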
    if (!isLeaseExpirationEnabled()) {
        logger.debug("DS: lease expiration is currently disabled.");
        return;
    }

    List<Lease<InstanceInfo>> expiredLeases = new ArrayList<>();
    for (Entry<String, Map<String, Lease<InstanceInfo>>> groupEntry : registry.entrySet()) {
        Map<String, Lease<InstanceInfo>> leaseMap = groupEntry.getValue();
        if (leaseMap != null) {
            for (Entry<String, Lease<InstanceInfo>> leaseEntry : leaseMap.entrySet()) {
                Lease<InstanceInfo> lease = leaseEntry.getValue();
                if (lease.isExpired(additionalLeaseMs) && lease.getHolder() != null) {
                    expiredLeases.add(lease);
                }
            }
        }
    }

    int registrySize = (int) getLocalRegistrySize();
    int registrySizeThreshold = (int) (registrySize * serverConfig.getRenewalPercentThreshold());
    int evictionLimit = registrySize - registrySizeThreshold;

    int toEvict = Math.min(expiredLeases.size(), evictionLimit);
    if (toEvict > 0) {
        logger.info("Evicting {} items (expired={}, evictionLimit={})", toEvict, expiredLeases.size(), evictionLimit);

        Random random = new Random(System.currentTimeMillis());
        for (int i = 0; i < toEvict; i++) {
            // Pick a random item (Knuth shuffle algorithm)
            int next = i + random.nextInt(expiredLeases.size() - i);
            Collections.swap(expiredLeases, i, next);
            Lease<InstanceInfo> lease = expiredLeases.get(i);

            String appName = lease.getHolder().getAppName();
            String id = lease.getHolder().getId();
            EXPIRED.increment();
            logger.warn("DS: Registry: expired lease for {}/{}", appName, id);
            internalCancel(appName, id, false);
        }
    }
}

public boolean isLeaseExpirationEnabled() {
    if (!isSelfPreservationModeEnabled()) {
        // The self preservation mode is disabled, hence allowing the instances to expire.
        return true;
    }

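    // This line is what triggers self-preservation. numberOfRenewsPerMinThreshold is the minimum number of heartbeats
    // expected from all instances in one minute; getNumOfRenewsInLastMin() is how many actually arrived in the last minute.
    // If too few arrived (say 20 against a threshold of 100), this returns false and eviction is skipped.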
    return numberOfRenewsPerMinThreshold > 0 && getNumOfRenewsInLastMin() > numberOfRenewsPerMinThreshold;
}

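Start with isLeaseExpirationEnabled, which decides whether self-preservation applies. The logic is simple: it compares the number of heartbeats received from all instances in the last minute against numberOfRenewsPerMinThreshold (the expected heartbeats per minute across all instances × 85%). Eviction is allowed only if the count is greater than numberOfRenewsPerMinThreshold; otherwise the server is in self-preservation mode. The next installment covers this method in detail.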
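
As a quick illustration of that comparison, here is the same decision as a pure function. SelfPreservationSketch is a hypothetical class; the inputs are passed as plain parameters rather than read from server state.

```java
// Hypothetical pure-function version of the self-preservation check.
final class SelfPreservationSketch {
    /** Returns true when expired leases may be evicted. */
    static boolean leaseExpirationEnabled(boolean selfPreservationEnabled,
                                          int renewsInLastMin,
                                          int renewsPerMinThreshold) {
        if (!selfPreservationEnabled) {
            return true; // feature switched off: always allow eviction
        }
        // Evict only while heartbeats are arriving at a healthy rate.
        return renewsPerMinThreshold > 0 && renewsInLastMin > renewsPerMinThreshold;
    }

    public static void main(String[] args) {
        // 20 instances at the default interval give a threshold of 34
        System.out.println(leaseExpirationEnabled(true, 40, 34));  // true: healthy, eviction allowed
        System.out.println(leaseExpirationEnabled(true, 20, 34));  // false: self-preservation, nothing is evicted
        System.out.println(leaseExpirationEnabled(false, 20, 34)); // true: self-preservation disabled
    }
}
```

Note the strict comparison: receiving exactly the threshold number of heartbeats still counts as self-preservation.
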
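If instances may be evicted, the method goes on to walk every application in the registry and check each instance's lease to see whether its last heartbeat is older than the allowed time. The key call is lease.isExpired(additionalLeaseMs):

Lease.isExpired():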

/**
 * Checks if the lease of a given {@link com.netflix.appinfo.InstanceInfo} has expired or not.
 *
 * Note that due to renew() doing the 'wrong" thing and setting lastUpdateTimestamp to +duration more than
 * what it should be, the expiry will actually be 2 * duration. This is a minor bug and should only affect
 * instances that ungracefully shutdown. Due to possible wide ranging impact to existing usage, this will
 * not be fixed.
 *
 * @param additionalLeaseMs any additional lease time to add to the lease evaluation in ms.
 */
public boolean isExpired(long additionalLeaseMs) {
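    // lastUpdateTimestamp is refreshed on every successful renew; think of it as the time of last activity.
    // See Lease.renew: lastUpdateTimestamp = System.currentTimeMillis() + duration;
    // duration defaults to 90s (DEFAULT_LEASE_DURATION in LeaseInfo).
    // The check therefore reads: current time > lastUpdateTimestamp + 90s + compensation time.
    /**
     * Ignore the compensation time for now and assume it is 0. On its face the check says: if the current time is
     * more than 90s past the last renewal, the instance has expired.
     * But because lastUpdateTimestamp = System.currentTimeMillis() + duration, an instance is in fact only treated
     * as expired once more than 180s have passed without a renewal.
     *
     * additionalLeaseMs is a fault-tolerance mechanism and a way of keeping the registry eventually consistent:
     * when the scheduled task fails to run on time for reasons outside its control, this value absorbs the delay.
     * In other words: if a service crashes, it takes at least 180s before the registry evicts it.
     */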
    return (evictionTimestamp > 0 || System.currentTimeMillis() > (lastUpdateTimestamp + duration + additionalLeaseMs));
}

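As the comments explain, with the compensation time taken as 0, the condition System.currentTimeMillis() > lastUpdateTimestamp + duration + additionalLeaseMs means that a crashed service takes at least 180s (with the default 90s duration) to be evicted by the registry.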

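That covers the code. There is also an Easter egg in the Javadoc comment on this method. It says that renew() contains a bug: it should not add an extra duration (90s by default), and because it does, the effective expiry here is 2 * duration, i.e. at least 180s before eviction. But since changing it could have a wide-ranging impact on existing usage, it will not be fixed.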

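For reference, this is what renew() gets wrong: it sets lastUpdateTimestamp = System.currentTimeMillis() + duration.

So it really does add one duration too many. Reading that comment, you can almost picture the author as a bashful bride: I did it wrong, and I'm not going to fix it, hmph!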
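
The doubling is easy to verify with a toy model. LeaseSketch below is hypothetical and keeps only the two pieces of logic quoted in this article, renew() and the timestamp comparison in isExpired(), with the clock passed in as a parameter so the result is deterministic.

```java
// Hypothetical toy lease that reproduces the "2 * duration" behavior described in the Javadoc.
final class LeaseSketch {
    private final long durationMs;
    private long lastUpdateTimestamp;

    LeaseSketch(long durationMs, long nowMs) {
        this.durationMs = durationMs;
        renew(nowMs);
    }

    /** Mirrors the quirk: the stored timestamp is already "now + duration". */
    void renew(long nowMs) {
        lastUpdateTimestamp = nowMs + durationMs;
    }

    /** Same comparison as the listing, minus the evictionTimestamp check. */
    boolean isExpired(long nowMs, long additionalLeaseMs) {
        return nowMs > lastUpdateTimestamp + durationMs + additionalLeaseMs;
    }

    public static void main(String[] args) {
        LeaseSketch lease = new LeaseSketch(90_000L, 0L); // last heartbeat at t = 0, duration 90s

        System.out.println(lease.isExpired(90_001L, 0L));       // false: 90s have passed, still not expired
        System.out.println(lease.isExpired(180_001L, 0L));      // true: expires only after 180s
        System.out.println(lease.isExpired(180_001L, 60_000L)); // false: a late eviction run adds 60s of grace
    }
}
```

The last line also shows how the compensation time feeds in: a delayed eviction run extends the grace period instead of expiring leases early.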

Back to the main thread: let's continue with the rest of evict():