Spark executors getting killed by YARN

While one of our Spark jobs was running, its executors kept dying. At first I assumed the data volume was too large and the executors were running out of memory, but after estimating the data size, that didn't add up. So the first step was to observe the executor's memory distribution with jvisualvm:
[Figure: jvisualvm memory view of the executor's heap generations]

The process died before the old generation had even filled up, so this was not a JVM-level OOM.
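For reference, a minimal sketch of how an executor JVM can be exposed so jvisualvm can attach remotely over JMX. The port, the disabled auth/SSL, and the app name are illustrative assumptions; with several executors per node a fixed port will collide, and a real cluster should secure the endpoint:

from pyspark.sql import SparkSession

# Standard JVM flags that open a JMX endpoint on each executor,
# passed through Spark's spark.executor.extraJavaOptions setting.
jmx_opts = (
    "-Dcom.sun.management.jmxremote "
    "-Dcom.sun.management.jmxremote.port=9999 "
    "-Dcom.sun.management.jmxremote.authenticate=false "
    "-Dcom.sun.management.jmxremote.ssl=false"
)

spark = (
    SparkSession.builder
    .appName("executor-memory-inspection")  # hypothetical app name
    .config("spark.executor.extraJavaOptions", jmx_opts)
    .getOrCreate()
)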
A closer look at the corresponding NodeManager's log turned up the following:

WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Container [pid=1151,containerID=container_1578970174552_5615_01_000003] is running beyond physical memory limits. Current usage: 4.3 GB of 4 GB physical memory used; 7.8 GB of 8.4 GB virtual memory used. Killing container.
Dump of the process-tree for container_1578970174552_5615_01_000003 :
        |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
        |- 1585 1175 1585 1151 (python) 50 67 567230464 8448 python -m pyspark.daemon
        |- 1596 1585 1585 1151 (python) 1006 81 1920327680 303705 python -m pyspark.daemon
        |- 1175 1151 1151 1151 (java) ...
        |- 1151 1146 1151 1151 (bash) ...
INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 1152 for container-id container_1578970174552_5615_01_000004: 4.3 GB of 4 GB physical memory used; 7.8 GB of 8.4 GB virtual memory used

The log says a container's processes exceeded the physical memory threshold, so YARN killed the container. Note that this accounting is based on the process tree: our Spark job launches Python processes and ships data to them through PySpark, which means the same data lives in both the JVM and the Python processes. Counted across the whole process tree, it is duplicated at least twofold, making the "threshold" easy to exceed.
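As a concrete illustration (a sketch, not the original job): any Python-evaluated function is enough to make the executor fork the "python -m pyspark.daemon" workers visible in the process dump above, with partitions serialized across to them:

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import LongType

spark = SparkSession.builder.appName("pyspark-daemon-demo").getOrCreate()

# A Python UDF cannot run inside the JVM, so the executor forks
# pyspark.daemon workers and pickles rows over to them; those rows are
# then resident in the JVM heap and in the Python workers' RSS at the
# same time, and YARN's process-tree accounting sums both.
@udf(LongType())
def plus_one(x):
    return x + 1

result = spark.range(10_000_000).select(plus_one("id").alias("y")).agg({"y": "sum"})
result.show()  # action that actually spins up the Python workers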

In YARN, the NodeManager monitors each container's resource usage and enforces upper limits on both physical and virtual memory; a container that exceeds either limit is killed.

virtual memory limit = physical memory limit × yarn.nodemanager.vmem-pmem-ratio (default 2.1)
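The numbers in the log above line up with this formula: a 4 GB physical limit × 2.1 gives exactly the 8.4 GB virtual memory limit the container was measured against.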

Both checks can be switched off with the following two properties; note that they must be set on the NodeManager side, in yarn-site.xml, and the NodeManager must be restarted for them to take effect:

<!-- physical memory check -->
<property>
  <name>yarn.nodemanager.pmem-check-enabled</name>
  <value>false</value>
  <description>Whether physical memory limits will be enforced for containers.</description>
</property>
<!-- virtual memory check -->
<property>
  <name>yarn.nodemanager.vmem-check-enabled</name>
  <value>false</value>
  <description>Whether virtual memory limits will be enforced for containers.</description>
</property>

There is also a case tied to Spark's own memory settings; see:

Spark on Yarn: why containers get killed for exceeding memory limits
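For the PySpark case specifically, a gentler fix than disabling the checks is to keep YARN's enforcement and budget for the Python workers explicitly via the executor memory overhead (spark.executor.memoryOverhead in Spark 2.3+, spark.yarn.executor.memoryOverhead in older releases). A sketch, with illustrative sizes:

from pyspark.sql import SparkSession

# The YARN container's physical limit is roughly spark.executor.memory
# plus spark.executor.memoryOverhead (default: max(384 MB, 10% of
# executor memory)). Raising the overhead leaves headroom for the
# pyspark.daemon workers, which live in the container's process tree
# but outside the JVM heap. The 4g/2g split below is an illustrative
# assumption, not a recommendation.
spark = (
    SparkSession.builder
    .appName("overhead-tuned-job")  # hypothetical app name
    .config("spark.executor.memory", "4g")
    .config("spark.executor.memoryOverhead", "2g")
    .getOrCreate()
)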

Other references:

[Hadoop] "Container is running beyond physical memory limits" error when running MR jobs
How to set yarn.nodemanager.pmem-check-enabled?
