Source: https://www.iteblog.com/archives/1907.html
When running Apache Spark, jobs execute in a distributed fashion across different nodes. On a large cluster in particular, it is common for nodes to run into various problems, such as a failing disk. As we all know, Apache Spark is a high-performance, fault-tolerant distributed computing framework: once it learns that the machine hosting a computation has a problem (for example, a disk failure), it reschedules that Task based on the previously generated lineage.
Now consider the following scenario: the rescheduled task may well be dispatched to the same problematic node or executor again, fail again for the same reason, and keep failing until the retry limit is exhausted and the whole job fails.
For us humans, the scenario above can be avoided by taking certain measures, but versions of Apache Spark before 2.2.0 could not avoid it. Fortunately, engineers from Cloudera solved this problem by introducing a blacklist mechanism (see SPARK-8425 for details, and the Design Doc for Blacklist Mechanism for the design), which shipped with Apache Spark 2.2.0, though it is currently still experimental.
The blacklist mechanism works by keeping a record of executors (Executors) and nodes (Hosts) that have previously had problems. When a task (Task) fails, the blacklist mechanism tracks the executor and host associated with that task and records the information; once the number of failed task schedulings on a node exceeds a threshold (2 by default), the scheduler stops dispatching tasks to that node. The scheduler can even kill the executors on that machine; all of this can be enabled through the corresponding configuration options.
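The per-task bookkeeping described above can be sketched as follows. This is a simplified model for illustration only, not Spark's actual `TaskSetBlacklist` implementation; the class and method names are hypothetical, and the thresholds mirror the defaults of `spark.blacklist.task.maxTaskAttemptsPerExecutor` (1) and `spark.blacklist.task.maxTaskAttemptsPerNode` (2) from the table below.

```python
from collections import defaultdict

# Illustrative thresholds, matching the documented Spark defaults
MAX_ATTEMPTS_PER_EXECUTOR = 1  # spark.blacklist.task.maxTaskAttemptsPerExecutor
MAX_ATTEMPTS_PER_NODE = 2      # spark.blacklist.task.maxTaskAttemptsPerNode

class TaskBlacklist:
    """Tracks failures of a given task per executor and per node."""

    def __init__(self):
        self.failures_per_executor = defaultdict(int)  # (task, executor) -> count
        self.failures_per_node = defaultdict(int)      # (task, node) -> count

    def record_failure(self, task, executor, node):
        # A failure counts against both the executor and its host node.
        self.failures_per_executor[(task, executor)] += 1
        self.failures_per_node[(task, node)] += 1

    def is_executor_blacklisted(self, task, executor):
        return self.failures_per_executor[(task, executor)] >= MAX_ATTEMPTS_PER_EXECUTOR

    def is_node_blacklisted(self, task, node):
        return self.failures_per_node[(task, node)] >= MAX_ATTEMPTS_PER_NODE

bl = TaskBlacklist()
bl.record_failure("task-0", "exec-1", "host-a")
# After one failure, exec-1 is blacklisted for task-0, but host-a is not yet:
# the node threshold (2) has not been reached.
```

A scheduler using this model would skip any executor or node for which the corresponding `is_*_blacklisted` check returns true when placing retries of that task.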
You can see an executor's status (Status) on the Apache Spark web UI: if an executor is blacklisted, its status shows as Blacklisted on the page, otherwise Active, as shown in the figure below:
Currently the blacklist mechanism is controlled by a series of parameters, mainly the following:
Parameter | Default | Meaning |
---|---|---|
spark.blacklist.enabled | false | If set to true, Spark will no longer schedule tasks on blacklisted executors. The blacklisting algorithm can be further controlled by the other "spark.blacklist" configuration options described below. |
spark.blacklist.timeout | 1h | (Experimental) How long a node or executor is blacklisted for the entire application, before it is unconditionally removed from the blacklist to attempt running new tasks. |
spark.blacklist.task.maxTaskAttemptsPerExecutor | 1 | (Experimental) For a given task, how many times it can be retried on one executor before the executor is blacklisted for that task. |
spark.blacklist.task.maxTaskAttemptsPerNode | 2 | (Experimental) For a given task, how many times it can be retried on one node, before the entire node is blacklisted for that task. |
spark.blacklist.stage.maxFailedTasksPerExecutor | 2 | (Experimental) How many different tasks must fail on one executor, within one stage, before the executor is blacklisted for that stage. |
spark.blacklist.stage.maxFailedExecutorsPerNode | 2 | (Experimental) How many different executors are marked as blacklisted for a given stage, before the entire node is marked as failed for the stage. |
spark.blacklist.application.maxFailedTasksPerExecutor | 2 | (Experimental) How many different tasks must fail on one executor, in successful task sets, before the executor is blacklisted for the entire application. Blacklisted executors will be automatically added back to the pool of available resources after the timeout specified by spark.blacklist.timeout. Note that with dynamic allocation, though, the executors may get marked as idle and be reclaimed by the cluster manager. |
spark.blacklist.application.maxFailedExecutorsPerNode | 2 | (Experimental) How many different executors must be blacklisted for the entire application, before the node is blacklisted for the entire application. Blacklisted nodes will be automatically added back to the pool of available resources after the timeout specified by spark.blacklist.timeout. Note that with dynamic allocation, though, the executors on the node may get marked as idle and be reclaimed by the cluster manager. |
spark.blacklist.killBlacklistedExecutors | false | (Experimental) If set to "true", allow Spark to automatically kill, and attempt to re-create, executors when they are blacklisted. Note that, when an entire node is added to the blacklist, all of the executors on that node will be killed. |
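The parameters above are plain Spark configuration options, so they can be passed to spark-submit with `--conf`. A minimal sketch, assuming a hypothetical application class `com.example.MyApp` packaged in `myapp.jar` (the `--conf` keys are the real parameter names from the table; the values shown simply override the defaults):

```shell
spark-submit \
  --conf spark.blacklist.enabled=true \
  --conf spark.blacklist.timeout=1h \
  --conf spark.blacklist.task.maxTaskAttemptsPerNode=2 \
  --conf spark.blacklist.killBlacklistedExecutors=true \
  --class com.example.MyApp \
  myapp.jar
```

The same keys can equally be set in `spark-defaults.conf` or programmatically on a SparkConf before the application starts.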
Because the blacklist mechanism is still experimental, some of the parameters above may change in later Spark releases.