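You may well have run into this in production: a server running Java behaves normally right after a release, but after running for a while its CPU usage or load climbs sharply. In the milder case, load or CPU usage creeps up day after day; in the worse case, it spikes at random and then returns to normal, which causes a lot of trouble for operations and development staff. Once such a problem has occurred there are, of course, follow-up improvements to consider, such as code review before release, isolation and degradation of strongly and weakly dependent services, unit tests, regression tests, SQL release review, infrastructure and business monitoring, and the related processes and policies.
If CPU usage or load spikes and stays high for a long time, plenty of troubleshooting procedures can be found online, such as the following.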
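Method 1
1. Use top to find the PID of the process that is using the most CPU
top
2. Get the per-thread information for that process
ps -mp PID -o THREAD,tid,time | sort -rn
3. Convert the ID of the thread you are interested in to hexadecimal
printf "%x\n" tid
4. Print the thread's stack trace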
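jstack pid | grep -A 30 tid   # tid is the hexadecimal value from step 3 (jstack shows it as nid=0x<tid>); -A 30 also prints the 30 lines after the match, so the stack frames are included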
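Steps 3 and 4 can also be done on a saved thread dump. The following is a minimal Python sketch, not part of the original procedure: it converts a decimal thread ID to the hexadecimal form and pulls that thread's entry out of jstack text. The function names and the sample dump (including the class com.example.Busy) are hypothetical and exist only for illustration.

```python
import re

# Hypothetical, shortened jstack output used only for illustration.
SAMPLE_DUMP = """\
"main" #1 prio=5 os_prio=0 tid=0x00007f0a1c00a000 nid=0x1a2b runnable [0x00007f0a245f5000]
   java.lang.Thread.State: RUNNABLE
    at com.example.Busy.loop(Busy.java:12)

"GC Thread#0" os_prio=0 tid=0x00007f0a1c020000 nid=0x1a2c runnable
"""


def tid_to_nid(tid: int) -> str:
    """Convert a decimal thread ID to the hex form that follows nid= in the dump."""
    return f"0x{tid:x}"


def extract_thread_stack(jstack_output: str, tid: int) -> str:
    """Return the dump entry (header plus frames) for a decimal thread ID, or ''."""
    # Match the nid as a whole token so that 0x1a does not also match 0x1ab.
    pattern = re.compile(r"\bnid=" + re.escape(tid_to_nid(tid)) + r"\b")
    collected = []
    capturing = False
    for line in jstack_output.splitlines():
        if not capturing and pattern.search(line):
            capturing = True
        elif capturing and not line.strip():
            break  # a blank line ends the thread entry
        if capturing:
            collected.append(line)
    return "\n".join(collected)


if __name__ == "__main__":
    print(tid_to_nid(6699))
    print(extract_thread_stack(SAMPLE_DUMP, 6699))
```

In practice the dump text would come from running jstack against the process found in step 1, and the thread ID from the ps output in step 2.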
Method 2 (recommended)
This script quickly locates the busiest threads and shows each thread's CPU usage
#!/bin/bash
# @Function
# Find the threads of Java processes that consume the most CPU and print their stack traces.
#
# @Usage
#   $ ./javacpu -h
#
PROG=`basename $0`

usage() {
    cat <<EOF
Usage: ${PROG} [OPTION]...
Find the threads of Java processes that consume the most CPU and print their stacks.
Example: ${PROG} -c 10

Options:
    -p, --pid     find the busiest threads in the specified java process,
                  default: all java processes.
    -c, --count   set the number of threads to show, default is 5
    -h, --help    display this help and exit
EOF
    exit $1
}

ARGS=`getopt -n "$PROG" -a -o c:p:h -l count:,pid:,help -- "$@"`
[ $? -ne 0 ] && usage 1
eval set -- "${ARGS}"

while true; do
    case "$1" in
    -c|--count)
        count="$2"
        shift 2
        ;;
    -p|--pid)
        pid="$2"
        shift 2
        ;;
    -h|--help)
        usage
        ;;
    --)
        shift
        break
        ;;
    esac
done
# Default matches the help text above (the original defaulted to 10 here).
count=${count:-5}

redEcho() {
    [ -c /dev/stdout ] && {
        # If stdout is a console, turn on color output.
        echo -ne "\033[1;31m"
        echo -n "$@"
        echo -e "\033[0m"
    } || echo "$@"
}

# Check that the jstack command is available, falling back to JAVA_HOME.
if ! which jstack &> /dev/null; then
    [ -n "$JAVA_HOME" ] && [ -f "$JAVA_HOME/bin/jstack" ] && [ -x "$JAVA_HOME/bin/jstack" ] && {
        export PATH="$JAVA_HOME/bin:$PATH"
    } || {
        redEcho "Error: jstack not found on PATH and JAVA_HOME!"
        exit 1
    }
fi

# Prefix for the temporary jstack files; they are removed when the script exits.
uuid=`date +%s`_${RANDOM}_$$

cleanupWhenExit() {
    rm /tmp/${uuid}_* &> /dev/null
}
trap "cleanupWhenExit" EXIT

# Read "pid lwp user comm pcpu" lines from stdin and print the stack of each thread.
printStackOfThread() {
    while read threadLine; do
        pid=`echo ${threadLine} | awk '{print $1}'`
        threadId=`echo ${threadLine} | awk '{print $2}'`
        threadId0x=`printf %x ${threadId}`
        user=`echo ${threadLine} | awk '{print $3}'`
        pcpu=`echo ${threadLine} | awk '{print $5}'`

        # Run jstack only once per process and reuse the saved dump.
        jstackFile=/tmp/${uuid}_${pid}
        [ ! -f "${jstackFile}" ] && {
            jstack ${pid} > ${jstackFile} || {
                redEcho "Fail to jstack java process ${pid}!"
                rm ${jstackFile}
                continue
            }
        }

        redEcho "The stack of busy(${pcpu}%) thread(${threadId}/0x${threadId0x}) of java process(${pid}) of user(${user}):"
        # The trailing space keeps 0x1a from also matching 0x1ab.
        sed "/nid=0x${threadId0x} /,/^$/p" -n ${jstackFile}
    done
}

if [ -z "${pid}" ]; then
    # No PID given: look at the threads of every java process.
    ps -Leo pid,lwp,user,comm,pcpu --no-headers | awk '$4=="java"{print $0}' |
        sort -k5 -r -n | head --lines "${count}" | printStackOfThread
else
    # Only the threads of the given java process ("&&", not the awk range operator ",").
    ps -Leo pid,lwp,user,comm,pcpu --no-headers | awk -v "pid=${pid}" '$1==pid && $4=="java"{print $0}' |
        sort -k5 -r -n | head --lines "${count}" | printStackOfThread
fi
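The core of the script above is the ps/awk/sort/head pipeline that ranks threads by CPU usage. As a rough illustration of that step only, here is a Python sketch that applies the same filtering and ranking to text in the format produced by ps -Leo pid,lwp,user,comm,pcpu --no-headers. The names ThreadUsage, top_java_threads, and SAMPLE_PS are hypothetical, and the sample rows are made up.

```python
from typing import List, NamedTuple, Optional

# Made-up sample in the column order pid, lwp, user, comm, pcpu.
SAMPLE_PS = """\
 2201  2201 app      java             0.0
 2201  2215 app      java            87.5
 2201  2216 app      java            12.3
 3100  3100 root     sshd             0.1
 4400  4407 batch    java            45.0
"""


class ThreadUsage(NamedTuple):
    pid: int     # process ID
    tid: int     # thread ID (the lwp column of ps)
    user: str
    pcpu: float  # CPU usage in percent


def top_java_threads(ps_output: str, count: int = 5,
                     pid: Optional[int] = None) -> List[ThreadUsage]:
    """Return the `count` busiest java threads, optionally limited to one process."""
    threads = []
    for line in ps_output.splitlines():
        parts = line.split()
        if len(parts) < 5:
            continue  # skip blank or malformed lines
        comm = " ".join(parts[3:-1])  # comm sits between user and pcpu
        if comm != "java":
            continue
        if pid is not None and int(parts[0]) != pid:
            continue
        threads.append(
            ThreadUsage(int(parts[0]), int(parts[1]), parts[2], float(parts[-1])))
    # Highest CPU usage first, like `sort -k5 -r -n | head`.
    threads.sort(key=lambda t: t.pcpu, reverse=True)
    return threads[:count]


if __name__ == "__main__":
    for t in top_java_threads(SAMPLE_PS, count=2):
        print(f"pid={t.pid} tid={t.tid} (0x{t.tid:x}) user={t.user} cpu={t.pcpu}%")
```

Requiring both the process ID and the command name to match on the same row is the point to watch here; a range-style match would let unrelated rows through.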
Method 3 (for Java servers whose load fluctuates at random)
#!/usr/bin/env python
# desc: when the 1-minute load average reaches 11 or more, dump the Java thread stacks with jstack
import datetime
import os
import threading
import time

LOAD_THRESHOLD = 11   # 1-minute load average that triggers a dump
CHECK_INTERVAL = 5    # seconds between checks

# Monitoring window: from 00:00 today until 23:59 two days later.
# It is computed once at startup; the original recomputed it on every call,
# so the window check could never fail.
TODAY = datetime.date.today()
START_TIME = datetime.datetime.combine(TODAY, datetime.time(0, 0))
END_TIME = datetime.datetime.combine(TODAY + datetime.timedelta(days=2), datetime.time(23, 59))


def load_stat():
    with open("/proc/loadavg") as f:
        info = f.read().split()
    loadavg = {'lavg_1': info[0], 'lavg_5': info[1], 'lavg_15': info[2]}

    if not (START_TIME <= datetime.datetime.now() <= END_TIME):
        return  # outside the window: no new timer is started, so the script ends

    if float(loadavg['lavg_1']) >= LOAD_THRESHOLD:
        tm = time.strftime("%Y-%m-%d_%H-%M-%S", time.localtime())
        # jps may list several JVMs, so dump each one into its own files.
        pids = os.popen("jps | grep -v Jps | awk '{print $1}'").read().split()
        for pid in pids:
            # Full thread dump of the process.
            stack = os.popen("jstack " + pid).read()
            with open('java_stack_' + pid + '_' + tm + '.txt', 'w') as log_f:
                log_f.write(stack)
            # Per-thread CPU usage, busiest first.
            top_tid_info = os.popen("ps -mp " + pid + " -o THREAD,tid,time | sort -rn").read()
            with open('tid_cpu_' + pid + '_' + tm + '.txt', 'w') as log_f2:
                log_f2.write(top_tid_info)

    # Check again after CHECK_INTERVAL seconds.
    threading.Timer(CHECK_INTERVAL, load_stat).start()


load_stat()
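The decision at the heart of Method 3 can be isolated from the jstack and ps calls and checked on its own. The sketch below is an illustration under stated assumptions rather than the original script: it assumes the monitoring window is fixed when monitoring starts, and the names parse_loadavg, monitoring_deadline, should_dump, and SAMPLE_LOADAVG are hypothetical.

```python
from datetime import datetime, time, timedelta
from typing import Tuple

# Made-up sample in the format of /proc/loadavg:
# 1-, 5-, 15-minute averages, runnable/total tasks, last PID.
SAMPLE_LOADAVG = "12.40 8.10 5.02 3/512 12345\n"


def parse_loadavg(text: str) -> Tuple[float, float, float]:
    """Return the 1-, 5- and 15-minute load averages as floats."""
    fields = text.split()
    return float(fields[0]), float(fields[1]), float(fields[2])


def monitoring_deadline(start: datetime, days: int = 2) -> datetime:
    """Return 23:59 on the day that is `days` days after the start date."""
    return datetime.combine(start.date() + timedelta(days=days), time(23, 59))


def should_dump(loadavg_text: str, now: datetime, deadline: datetime,
                threshold: float = 11.0) -> bool:
    """True if we are still inside the window and the 1-minute load is high enough."""
    one_minute, _, _ = parse_loadavg(loadavg_text)
    return now <= deadline and one_minute >= threshold


if __name__ == "__main__":
    started = datetime.now()
    deadline = monitoring_deadline(started)
    print(parse_loadavg(SAMPLE_LOADAVG))
    print(should_dump(SAMPLE_LOADAVG, started, deadline))
```

Keeping the threshold and window checks in pure functions like these makes it possible to test them with fixed timestamps before attaching them to a timer on a real server.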