用top命令,發現java進程(pid:8424)CPU資源消耗達到192%,以下圖: java
肯定java資源消耗過多。web
jstack 8424 > /tmp/jstack.txt
堆棧信息見附件1oop
命令以下:.net
top -H -p8424
結果以下:線程
8320 nwd 20 0 6275m 4.8g 7044 R 44.6 62.7 3:57.00 java 8365 nwd 20 0 6275m 4.8g 7044 R 32.4 62.7 4:06.09 java 19016 nwd 20 0 6275m 4.8g 7044 R 22.5 62.7 0:24.47 java 8289 nwd 20 0 6275m 4.8g 7044 S 14.1 62.7 16:27.13 java 8356 nwd 20 0 6275m 4.8g 7044 S 13.5 62.7 6:40.18 java 8333 nwd 20 0 6275m 4.8g 7044 R 10.3 62.7 5:21.39 java
從上面查詢出來的結果能夠找到cpu消耗靠前的線程的pid,下一步根據pid從堆棧裏面查找相應的線程。注意:這裏pid是十進制的,線程nid是十六進制,要作進制轉換。進制轉換網址:http://tool.oschina.net/hexconvert/, 或者參考以下命令:netty
[ylp@ylp-web-01 ~]$ printf "%x\n" 14065 36f1
下面分析cpu資源消耗靠前的三個線程。code
"epollEventLoopGroup-3-5" prio=10 tid=0x00007fed00272000 nid=0x2080 runnable [0x00007fece29e6000] java.lang.Thread.State: RUNNABLE at sun.misc.Unsafe.unpark(Native Method) at java.util.concurrent.locks.LockSupport.unpark(LockSupport.java:152) at java.util.concurrent.locks.AbstractQueuedSynchronizer.unparkSuccessor(AbstractQueuedSynchronizer.java:662) at java.util.concurrent.locks.AbstractQueuedSynchronizer.release(AbstractQueuedSynchronizer.java:1263) at java.util.concurrent.locks.ReentrantLock.unlock(ReentrantLock.java:460) at java.util.concurrent.LinkedBlockingQueue.signalNotEmpty(LinkedBlockingQueue.java:174) at java.util.concurrent.LinkedBlockingQueue.offer(LinkedBlockingQueue.java:429) at com.dianping.cat.message.io.DefaultMessageQueue.offer(DefaultMessageQueue.java:22) at com.dianping.cat.analysis.PeriodTask.enqueue(PeriodTask.java:44) at com.dianping.cat.analysis.Period.distribute(Period.java:85) at com.dianping.cat.analysis.RealtimeConsumer.consume(RealtimeConsumer.java:53) at com.dianping.cat.analysis.DefaultMessageHandler.handle(DefaultMessageHandler.java:28) at com.dianping.cat.analysis.TcpSocketReceiver$MessageDecoder.decode(TcpSocketReceiver.java:147) at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:247) at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:147) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319) at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787) at io.netty.channel.epoll.EpollSocketChannel$EpollSocketUnsafe.epollInReady(EpollSocketChannel.java:722) at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:326) at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:264) at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116) at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137) at java.lang.Thread.run(Thread.java:745) Locked ownable synchronizers: - None
當前線程是消息接收器,從客戶端傳過來消息,而後交給consumer消費。進程
"cat-Message-Gzip-19" daemon prio=10 tid=0x00007fed1c203800 nid=0x20ad runnable [0x00007fece0eb9000] java.lang.Thread.State: RUNNABLE at sun.misc.Unsafe.unpark(Native Method) at java.util.concurrent.locks.LockSupport.unpark(LockSupport.java:152) at java.util.concurrent.locks.AbstractQueuedSynchronizer.unparkSuccessor(AbstractQueuedSynchronizer.java:662) at java.util.concurrent.locks.AbstractQueuedSynchronizer.release(AbstractQueuedSynchronizer.java:1263) at java.util.concurrent.locks.ReentrantLock.unlock(ReentrantLock.java:460) at java.util.concurrent.LinkedBlockingQueue.signalNotEmpty(LinkedBlockingQueue.java:174) at java.util.concurrent.LinkedBlockingQueue.offer(LinkedBlockingQueue.java:429) at com.dianping.cat.consumer.dump.LocalMessageBucketManager$MessageGzip.gzipMessage(LocalMessageBucketManager.java:303) at com.dianping.cat.consumer.dump.LocalMessageBucketManager$MessageGzip.run(LocalMessageBucketManager.java:332) at java.lang.Thread.run(Thread.java:745) at org.unidal.helper.Threads$RunnableThread.run(Threads.java:294) Locked ownable synchronizers: - None
當前線程是處理消息分析結果壓縮和解壓的。圖片
"Cat-RealtimeConsumer-ProblemAnalyzer-10-1" daemon prio=10 tid=0x00007fed181ed800 nid=0x4a48 runnable [0x00007fecd9c1c000] java.lang.Thread.State: TIMED_WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for <0x0000000711fd73b8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2082) at java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:467) at com.dianping.cat.message.io.DefaultMessageQueue.poll(DefaultMessageQueue.java:49) at com.dianping.cat.analysis.AbstractMessageAnalyzer.analyze(AbstractMessageAnalyzer.java:40) at com.dianping.cat.analysis.PeriodTask.run(PeriodTask.java:80) at java.lang.Thread.run(Thread.java:745) at org.unidal.helper.Threads$RunnableThread.run(Threads.java:294) Locked ownable synchronizers: - None
該線程是分析當前消息是不是異常消息,若是是異常消息,要統計異常數及異常堆棧信息。ip
分析發現,資源消耗靠前的線程都是在正常處理業務,懷疑是否是消息負載太大致使CPU效果太高。想起來Cat自身有個state報表,查看一下當前消息處理的量。
處理這種問題,首先在合適的時機打印出堆棧信息,找到資源消耗靠前的線程重點分析。
附件: 附件1
#!/bin/sh cd `dirname $0` APP_HOME=`pwd` typeset top=${1:-10} #typeset pid=${2:-$(pgrep -u $USER java)} typeset pid=`ps -ef | grep java | grep "${APP_HOME}" |awk '{print $2}'` typeset tmp_file=./java${pid}.trace top -H -p $pid -b -d 1 -n 1 | head -20 > $tmp_file $JAVA_HOME/bin/jstack -l $pid >> $tmp_file ps H -eo user,pid,ppid,tid,time,%cpu --sort=%cpu --no-headers\ | tail -$top\ | awk -v "pid=$pid" '$2==pid{print $4"\t"$6}'\ | while read line; do typeset nid=$(echo "$line"|awk '{printf("0x%x",$1)}') typeset cpu=$(echo "$line"|awk '{print $2}') awk -v "cpu=$cpu" '/nid='"$nid"'/,/^$/{print $0"\t"(isF++?"":"cpu="cpu"%");}' $tmp_file done