一、事故背景java
上週三凌晨,我負責的某個模塊在多臺機器上連續發生coredump,幸虧發生在業務低峯期,並且該模塊提供的功能也不是核心流程功能,因此對線上業務影響比較小。發生coredump後,運維收到報警後立馬拉起了服務,服務宕機時間爲3分鐘左右。linux
二、事故分析程序員
次日當即組織了事故分析小組,對事故發生緣由進行了排查,coredump的時候JVM保存了coredump文件,運維幫忙轉換成了問題分析結果文件,以下框架
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (malloc) failed to allocate 12288 bytes for committing reserved memory.
# Possible reasons:
# The system is out of physical RAM or swap space
# In 32 bit mode, the process size limit was hit
# Possible solutions:
# Reduce memory load on the system
# Increase physical memory or swap space
# Check if swap backing store is full
# Use 64 bit Java on a 64 bit OS
# Decrease Java heap size (-Xmx/-Xms)
# Decrease number of Java threads
# Decrease Java thread stack sizes (-Xss)
# Set larger code cache with -XX:ReservedCodeCacheSize=
# This output file may be truncated or incomplete.
#
# Out of Memory Error (os_linux.cpp:2756), pid=43471, tid=140153494918912
#
# JRE version: Java(TM) SE Runtime Environment (7.0_75-b13) (build 1.7.0_75-b13)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (24.75-b04 mixed mode linux-amd64 compressed oops)
# Core dump written. Default location: /home/xiaoju/marketing-mall/core or core.43471 (max size 100000000 kB). To ensure a full core dump, try "ulimit -c unlimited" before s
tarting Java again
#運維
--------------- T H R E A D ---------------異步
Current thread (0x00007f8080013800): JavaThread "ThreadPool_2" [_thread_new, id=31832, stack(0x00007f7807375000,0x00007f7807476000)]jvm
Stack: [0x00007f7807375000,0x00007f7807476000], sp=0x00007f7807474980, free space=1022k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V [libjvm.so+0x9a18da] VMError::report_and_die()+0x2ea
V [libjvm.so+0x4975ab] report_vm_out_of_memory(char const*, int, unsigned long, char const*)+0x9b
V [libjvm.so+0x81ebae] os::Linux::commit_memory_impl(char*, unsigned long, bool)+0xfe
V [libjvm.so+0x81ec6c] os::pd_commit_memory(char*, unsigned long, bool)+0xc
V [libjvm.so+0x8169da] os::commit_memory(char*, unsigned long, bool)+0x2a
V [libjvm.so+0x81d13d] os::pd_create_stack_guard_pages(char*, unsigned long)+0x6d
V [libjvm.so+0x953e1e] JavaThread::create_stack_guard_pages()+0x5e
V [libjvm.so+0x95a764] JavaThread::run()+0x34
V [libjvm.so+0x820b88] java_start(Thread*)+0x108分佈式
--------------- P R O C E S S ---------------函數
Java Threads: ( => current thread )
=>0x00007f8080013800 JavaThread "ThreadPool_2" [_thread_new, id=31832, stack(0x00007f7807375000,0x00007f7807476000)]
0x00007f8080013000 JavaThread "ThreadPool_1" [_thread_blocked, id=31828, stack(0x00007f780eff1000,0x00007f780f0f2000)]
0x00007f808000a000 JavaThread "ThreadPool_0" [_thread_blocked, id=31827, stack(0x00007f7807577000,0x00007f7807678000)]
0x00007f8080009800 JavaThread "ThreadPool_0" [_thread_blocked, id=31824, stack(0x00007f7807779000,0x00007f780787a000)]
0x00007f8424261000 JavaThread "dubboConsumer-100.69.174.35:7101-thread-619" daemon [_thread_blocked, id=31816, stack(0x00007f7807476000,0x00007f7807577000)]
0x00007f80c0012800 JavaThread "ThreadPool_0" [_thread_blocked, id=31803, stack(0x00007f780797b000,0x00007f7807a7c000)]
0x00007f83fc09d000 JavaThread "dubboConsumer-100.69.112.55:7101-thread-621" daemon [_thread_blocked, id=31800, stack(0x00007f7807678000,0x00007f7807779000)]oop
基本的意思是說內存空間不足致使malloc系統調用失敗,同時咱們還發現打印出來的coredump時的java線程數以ThreadPool(原有名稱替換爲ThreadPool)開頭的線程數有30000多個,而後定位到「ThreadPool」的線程池的使用位置,發現近期的改動是將ThreadPoolExecutor該成了二方庫的ThreadPoolExecutor。當時改動的緣由是使用二方庫線程池會將flag傳遞到線程內,方便排查問題。在使用方式上沒有改動,每一次調用建立一個線程池,異步處理多個任務,同步等待拿處處理任務結果,調用shutdown方法。
咱們發現了發生coredump的基本緣由是發生了內存泄露,聲明的線程沒有及時回收致使線程佔用空間過大,查看java啓動參數尚未配置-Xss(thread stack size),因此對於linux 64位系統,jvm默認設置線程棧大小位1M,因此就ThreadPool開頭的線程就已經佔用了30多G的空間,線上的物理內存爲128G,通常佔用90到100G,因此ThreadPool開頭的線程直接致使內存耗盡,進程coredump。
在明確了coredump發生的緣由後,還要確認爲何建立了那麼多以ThreadPool開頭的線程,在查看二方庫的ThreadPoolExecutor的構造函數後一切真相大白了。二方庫的ThreadPoolExecutor繼承了jdk的ThreadPoolExecutor,二方庫的ThreadPoolExecutor的構造函數初始化了父類線程池,同時也建立了一個定時調度線程池,用於統計該線程池內部使用線程數,空閒線程數,任務處理數等一些基本日誌信息,而shutdown沒有從新父類的shutdown方法,致使線程池釋放的時候只釋放了父類的,沒有釋放掉統計線程池,因此每一次接口調用都會增長一個線程,而這個接口調用量不大,線程數就會慢慢堆積知道內存耗盡。分佈式調用框架使用了robin的方式也就是輪訓,因此在單位時間內每一臺機器上的接口調用次數差很少,建立的統計線程池也差很少,所以這四臺機器也就在相近的時間前後dump,自此,本次coredump事故發生的來龍去脈都浮出了水面。
三、反思
本次事故是由多個因素致使的:
總之,程序員在平常工做開發過程當中,仍是要敬畏每一行代碼,認真仔細,養成良好的編碼習慣。