Better Retries with Exponential Backoff and Jitter

Date: 2019-11-07


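1. Overview

In this tutorial, we'll explore how to improve client retries with two different strategies: exponential backoff and jitter.

2. Retry

In a distributed system, network communication among the many components can fail at any time.

Client applications deal with these failures by implementing retries.

Let's assume we have a client application that invokes a remote service, the PingPongService.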

// Remote service that the client calls; a failed call throws PingPongServiceException.
interface PingPongService {
    String call(String ping) throws PingPongServiceException;
}

If the PingPongService throws a PingPongServiceException, the client application has to retry the call. In the following sections, we'll look at ways to implement client retries.
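To make the idea concrete, here is a minimal sketch of a hand-rolled retry loop. It is not part of the original example: the NaiveRetry class and the callWithRetry helper are hypothetical names, and a RuntimeException stands in for PingPongServiceException. Note that this version retries immediately, with no wait between attempts.

```java
import java.util.function.Supplier;

public class NaiveRetry {

    // Hypothetical helper: calls the supplier up to maxAttempts times,
    // retrying immediately whenever it throws a RuntimeException.
    static <T> T callWithRetry(Supplier<T> call, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.get();
            } catch (RuntimeException e) {
                last = e; // remember the failure and try again right away
            }
        }
        throw last; // every attempt failed
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulated flaky service: fails twice, then answers.
        String reply = callWithRetry(() -> {
            if (++calls[0] < 3) {
                throw new IllegalStateException("service unavailable");
            }
            return "pong";
        }, 5);
        System.out.println(reply + " after " + calls[0] + " attempts");
    }
}
```

A loop like this handles the failure, but because it never pauses, many clients running it at once can flood a struggling service.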

3. Resilience4j Retry

For our example, we'll use the Resilience4j library, particularly its retry module. We need to add the resilience4j-retry module to our pom.xml:

<dependency>
    <groupId>io.github.resilience4j</groupId>
    <artifactId>resilience4j-retry</artifactId>
</dependency>

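For a refresher on using retries, don't forget to check out our Guide to Resilience4j.

4. Exponential Backoff

Client applications must implement retries responsibly. When clients retry failed calls without waiting, they may overwhelm the system and contribute to further degradation of a service that is already in distress.

Exponential backoff is a common strategy for handling retries of failed network calls. In simple terms, the client waits progressively longer intervals between consecutive retries: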

wait_interval = base * multiplier^n

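where,

base is the initial interval, i.e., the wait before the first retry
n is the number of failures that have occurred
multiplier is an arbitrary multiplier that can be replaced with any suitable value

With this approach, we give the system breathing room to recover from intermittent failures or from more severe problems.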
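As a worked example of the formula, the following sketch prints the first few wait intervals. The BackoffIntervals class, the 1000 ms base, and the multiplier of 2 are illustrative assumptions, and n is counted from 0 for the first retry so that the first wait equals base.

```java
public class BackoffIntervals {

    // wait_interval = base * multiplier^n, in milliseconds.
    // n is 0 for the first retry (an assumption), so the first wait equals base.
    static long waitInterval(long baseMillis, double multiplier, int n) {
        return Math.round(baseMillis * Math.pow(multiplier, n));
    }

    public static void main(String[] args) {
        long base = 1000L;       // assumed initial interval of one second
        double multiplier = 2.0; // assumed multiplier
        for (int n = 0; n < 4; n++) {
            System.out.println("n=" + n + " wait=" + waitInterval(base, multiplier, n) + " ms");
        }
        // prints waits of 1000, 2000, 4000 and 8000 ms
    }
}
```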

We can use the exponential backoff algorithm in Resilience4j retry by configuring its IntervalFunction, which accepts an initialInterval and a multiplier.

The retry mechanism uses the IntervalFunction as a sleep function:

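// Wait INITIAL_INTERVAL before the first retry, then multiply the wait by MULTIPLIER each time.
IntervalFunction intervalFn =
  IntervalFunction.ofExponentialBackoff(INITIAL_INTERVAL, MULTIPLIER);

// maxAttempts caps the total number of attempts, including the first call.
RetryConfig retryConfig = RetryConfig.custom()
  .maxAttempts(MAX_RETRIES)
  .intervalFunction(intervalFn)
  .build();
Retry retry = Retry.of("pingpong", retryConfig);

// Wrap the remote call so that failures are retried according to retryConfig.
Function<String, String> pingPongFn = Retry
    .decorateFunction(retry, ping -> service.call(ping));
pingPongFn.apply("Hello");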

Let's simulate a real-world scenario and assume that several clients invoke the PingPongService concurrently:

// newFixedThreadPool and nCopies are static imports from Executors and Collections.
ExecutorService executors = newFixedThreadPool(NUM_CONCURRENT_CLIENTS);
List<Callable<String>> tasks = nCopies(NUM_CONCURRENT_CLIENTS, () -> pingPongFn.apply("Hello"));
executors.invokeAll(tasks);

Let's look at the remote invocation logs for NUM_CONCURRENT_CLIENTS = 4:

[thread-1] At 00:37:42.756
[thread-2] At 00:37:42.756
[thread-3] At 00:37:42.756
[thread-4] At 00:37:42.756

[thread-2] At 00:37:43.802
[thread-4] At 00:37:43.802
[thread-1] At 00:37:43.802
[thread-3] At 00:37:43.802

[thread-2] At 00:37:45.803
[thread-1] At 00:37:45.803
[thread-4] At 00:37:45.803
[thread-3] At 00:37:45.803

[thread-2] At 00:37:49.808
[thread-3] At 00:37:49.808
[thread-4] At 00:37:49.808
[thread-1] At 00:37:49.808

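We can see a clear pattern here: the clients wait for exponentially growing intervals, but on every retry all of them call the remote service at precisely the same time (collisions).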

![img](https://user-gold-cdn.xitu.io/2019/9/22/16d593f9934dce43?w=600&h=371&f=png&s=8719)

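We have addressed only part of the problem: we no longer hammer the remote service with retries, but instead of spreading the workload over time, we have interspersed bursts of work with longer idle periods. This behavior is akin to the thundering herd problem.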

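5. Introducing Jitter

In our previous approach, the client waits grew progressively longer but stayed synchronized. Adding jitter provides a way to break the synchronization across clients and thereby avoid collisions. In this approach, we add randomness to the wait intervals.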

wait_interval = (base * multiplier^n) +/- (random_interval)

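where random_interval is added (or subtracted) to break the synchronization across clients.

We won't go into the mechanics of computing the random interval, but the randomization must spread the spikes out into a much smoother distribution of client calls.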
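For illustration only, one simple way to pick the random interval is to draw the wait uniformly from a band around the exponential interval. The sketch below assumes that approach; the JitteredBackoff class and withJitter helper are hypothetical names, and this is not necessarily how any particular library computes its jitter.

```java
import java.util.concurrent.ThreadLocalRandom;

public class JitteredBackoff {

    // Assumed jitter scheme: pick a wait uniformly from
    // [interval - delta, interval + delta], where delta = randomizationFactor * interval.
    static long withJitter(long intervalMillis, double randomizationFactor) {
        double delta = randomizationFactor * intervalMillis;
        double min = intervalMillis - delta;
        double max = intervalMillis + delta;
        return Math.round(min + ThreadLocalRandom.current().nextDouble() * (max - min));
    }

    public static void main(String[] args) {
        long interval = 1000L; // assumed exponential interval for this retry
        double factor = 0.5;   // assumed randomization factor
        // Each simulated client draws its own wait, so the waits rarely coincide.
        for (int client = 1; client <= 4; client++) {
            System.out.println("client-" + client + " waits " + withJitter(interval, factor) + " ms");
        }
    }
}
```

With a factor of 0.5 and an interval of 1000 ms, every wait falls between 500 and 1500 ms; a factor of 0 reduces to plain exponential backoff.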

We can use exponential backoff with jitter in Resilience4j retry by configuring an exponential random backoff IntervalFunction, which also accepts a randomizationFactor:

IntervalFunction intervalFn =
  IntervalFunction.ofExponentialRandomBackoff(INITIAL_INTERVAL, MULTIPLIER, RANDOMIZATION_FACTOR);

Let's return to our real-world scenario and look at the remote invocation logs with jitter:

[thread-2] At 39:21.297
[thread-4] At 39:21.297
[thread-3] At 39:21.297
[thread-1] At 39:21.297

[thread-2] At 39:21.918
[thread-3] At 39:21.868
[thread-4] At 39:22.011
[thread-1] At 39:22.184

[thread-1] At 39:23.086
[thread-2] At 39:23.939
[thread-3] At 39:24.152
[thread-4] At 39:24.977

[thread-3] At 39:26.861
[thread-1] At 39:28.617
[thread-4] At 39:28.942
[thread-2] At 39:31.039

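Now we have a much better spread. We have eliminated both the collisions and the idle time, and we end up with an almost constant rate of client calls, apart from the initial surge.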

![img](https://user-gold-cdn.xitu.io/2019/9/22/16d593f9c517b558?w=600&h=371&f=png&s=9197)

Note: we have exaggerated the intervals for illustration; in real-world scenarios, the gaps would be smaller.
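The contrast between the two logs can be reproduced with a small simulation that uses simulated milliseconds instead of real sleeps. This is a sketch under stated assumptions: the RetrySpreadSimulation class and retryOffsets helper are hypothetical, and the jitter is drawn uniformly within plus or minus randomizationFactor times the interval, which may differ from what Resilience4j does internally.

```java
import java.util.Random;
import java.util.Set;
import java.util.TreeSet;

public class RetrySpreadSimulation {

    // Returns the distinct times (ms after the first call) at which the clients
    // perform retry number n (0-based). A randomizationFactor of 0 means no jitter.
    static Set<Long> retryOffsets(int clients, int n, long baseMillis, double multiplier,
                                  double randomizationFactor, Random random) {
        Set<Long> offsets = new TreeSet<>();
        for (int c = 0; c < clients; c++) {
            long offset = 0;
            for (int i = 0; i <= n; i++) {
                double interval = baseMillis * Math.pow(multiplier, i);
                double delta = randomizationFactor * interval;
                // Assumed jitter: uniform draw in [interval - delta, interval + delta].
                offset += Math.round(interval - delta + random.nextDouble() * 2 * delta);
            }
            offsets.add(offset);
        }
        return offsets;
    }

    public static void main(String[] args) {
        Random random = new Random(42);
        // Without jitter, all four clients retry at the same instants.
        System.out.println("no jitter:   " + retryOffsets(4, 2, 1000L, 2.0, 0.0, random));
        // With jitter, the retry times are spread out.
        System.out.println("with jitter: " + retryOffsets(4, 2, 1000L, 2.0, 0.5, random));
    }
}
```

Without jitter, the third retry of every client lands at 1000 + 2000 + 4000 = 7000 ms, matching the synchronized pattern in the first log; with jitter, each client accumulates its own random offsets.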