Drawing on the drawbacks and architectural lessons from the earlier round of building the crowdfunding platform, this iteration introduced Dubbo for SOA service governance, Nginx + Tomcat as the web container, Java as the back-end language with Spring + MyBatis as the framework, btl as the front-end template engine, and a native + H5 approach for the app. The architecture corresponds to the third-generation system described in my earlier article 從零到百億互聯網金融架構發展史 (the evolution of an internet finance architecture from zero to tens of billions). That article focused on how the architecture evolved; this one focuses on the problems we ran into on the third-generation platform and how we solved them.
First, the deployment architecture of the crowdfunding system (see the figure below). Requests from both the website and the app first hit the front-most Nginx. If the request is for static content, Nginx handles it and returns directly; dynamic requests are forwarded to the Tomcat front-end service layer. The front-end service layer deals only with page-level business logic and never touches the database: requests that only involve page rendering and no database access are processed and returned there directly, while anything involving database operations or core business logic is passed on via Dubbo to the back-end access-layer or core-layer services.
[Figure: deployment architecture of the crowdfunding platform]
The Havoc Caused by a Script
During production testing after go-live, we found that Tomcat would inexplicably go down after running for a while, and the back-end Tomcats went down noticeably more often. A back-end Tomcat going down did not affect the front-end pages, but it did break back-end transactions.
JVM Parameter Configuration
The Tomcat business log showed the following error:
2016-04-14 12:01:55,025 - org.jboss.netty.channel.DefaultChannelPipeline -59679839 [New I/O worker #29] WARN null - [DUBBO] An exception was thrown by a user handler while handling an exception event ([id: 0x5f980c11, /xxx:55386 => /xxx:6666] EXCEPTION: com.alibaba.dubbo.remoting.ExecutionException: class com.alibaba.dubbo.remoting.transport.dispatcher.all.AllChannelHandler error when process received event .), dubbo version: 2.8.4, current host: xxx
com.alibaba.dubbo.remoting.ExecutionException: class com.alibaba.dubbo.remoting.transport.dispatcher.all.AllChannelHandler error when process caught event .
at com.alibaba.dubbo.remoting.transport.dispatcher.all.AllChannelHandler.caught(AllChannelHandler.java:67)
at com.alibaba.dubbo.remoting.transport.AbstractChannelHandlerDelegate.caught(AbstractChannelHandlerDelegate.java:44)
at com.alibaba.dubbo.remoting.transport.AbstractChannelHandlerDelegate.caught(AbstractChannelHandlerDelegate.java:44)
at com.alibaba.dubbo.remoting.transport.AbstractPeer.caught(AbstractPeer.java:127)
at com.alibaba.dubbo.remoting.transport.netty.NettyHandler.exceptionCaught(NettyHandler.java:112)
at com.alibaba.dubbo.remoting.transport.netty.NettyCodecAdapter$InternalDecoder.exceptionCaught(NettyCodecAdapter.java:165)
at org.jboss.netty.channel.Channels.fireExceptionCaught(Channels.java:525)
at org.jboss.netty.channel.AbstractChannelSink.exceptionCaught(AbstractChannelSink.java:48)
at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:296)
at com.alibaba.dubbo.remoting.transport.netty.NettyCodecAdapter$InternalDecoder.messageReceived(NettyCodecAdapter.java:148)
at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:268)
at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:255)
at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88)
at org.jboss.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:109)
at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:312)
at org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:90)
at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method)
at java.lang.Thread.start(Thread.java:714)
at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:949)
at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1360)
at com.alibaba.dubbo.remoting.transport.dispatcher.all.AllChannelHandler.caught(AllChannelHandler.java:65)
... 19 more
Looking at the output log, we found this line:
SEVERE: The web application [/xxx] appears to have started a thread named [DubboResponseTimeoutScanTimer] but has failed to stop it. This is very likely to create a memory leak.
The message suggested a memory leak, an error we had never run into before, and we were at a loss. After a restart we first ran jstat -gc xxx 1000 30 to watch the Java process's GC behaviour and saw a long run of minor GCs within those 30 seconds, so we suspected the young generation was too small. The memory settings for the various regions were configured as follows:
-Xms10g -Xmx10g -XX:PermSize=1g -XX:MaxPermSize=2g -Xshare:off -Xmn1024m
Following the usual guideline for the young generation's share of the heap, we changed it to -Xmn4g. After restarting, the minor GC frequency did drop, but about three hours into testing Tomcat was reported down again. Going through the startup parameters once more, we noticed -XX:+DisableExplicitGC, which explicitly disables System.gc(). However, many frameworks that use java.nio rely on System.gc() to trigger a full GC so that DirectByteBuffer objects that are no longer needed release the native memory they hold; disabling it can therefore lead to OOM. Suspecting this parameter was the culprit, we removed it from the startup options.
To be able to analyse heap usage in detail the next time something went wrong, we added -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/usr/local/logs/java/ to the startup parameters so that a heap dump file would be written when Tomcat went down, and we also ran jvisualvm to monitor memory and thread usage in real time.
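Putting these changes together, the adjusted startup options looked roughly like the sketch below (assuming they are set through JAVA_OPTS in Tomcat's setenv.sh; only the values discussed above are shown):

# Young generation enlarged to 4g; -XX:+DisableExplicitGC removed so that NIO-based
# frameworks can still trigger the full GC that frees DirectByteBuffer native memory;
# a heap dump is written automatically on OutOfMemoryError.
JAVA_OPTS="-Xms10g -Xmx10g -Xmn4g -XX:PermSize=1g -XX:MaxPermSize=2g -Xshare:off \
  -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/usr/local/logs/java/"
export JAVA_OPTS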
Database Connection Pool
We kept stressing the system with the load-testing tool, and noticed that threads named com.mchange.v2.async.ThreadPoolAsynchronousRunner$PoolThread-#xxx (c3p0's helper threads) kept growing in number, while Tomcat reported the errors below (a quick way to count these threads is shown right after the stack trace):
2016-04-13 16:55:15,175 - com.alibaba.dubbo.common.threadpool.support.AbortPolicyWithReport -83649035 [New I/O worker #27] WARN - [DUBBO] Thread pool is EXHAUSTED! Thread Name: DubboServerHandler-xxx:6666, Pool Size: 200 (active: 200, core: 200, max: 200, largest: 200), Task: 692 (completed: 492), Executor status:(isShutdown:false, isTerminated:false, isTerminating:false), in dubbo://xxx:6666!, dubbo version: 2.8.4, current host: xxx
2016-04-13 16:55:15,176 - com.alibaba.dubbo.common.threadpool.support.AbortPolicyWithReport -83649036 [New I/O worker #27] WARN - [DUBBO] Thread pool is EXHAUSTED! Thread Name: DubboServerHandler-xxx:6666, Pool Size: 200 (active: 200, core: 200, max: 200, largest: 200), Task: 692 (completed: 492), Executor status:(isShutdown:false, isTerminated:false, isTerminating:false), in dubbo://xxx:6666!, dubbo version: 2.8.4, current host: xxx
2016-04-13 16:55:15,177 - org.jboss.netty.channel.DefaultChannelPipeline -83649037 [New I/O worker #27] WARN - [DUBBO] An exception was thrown by a user handler while handling an exception event ([id: 0x2f345d45, /192.168.57.20:36475 => /xxx:6666] EXCEPTION: com.alibaba.dubbo.remoting.ExecutionException: class com.alibaba.dubbo.remoting.transport.dispatcher.all.AllChannelHandler error when process received event .), dubbo version: 2.8.4, current host: xxx
com.alibaba.dubbo.remoting.ExecutionException: class com.alibaba.dubbo.remoting.transport.dispatcher.all.AllChannelHandler error when process caught event .
at com.alibaba.dubbo.remoting.transport.dispatcher.all.AllChannelHandler.caught(AllChannelHandler.java:67)
at com.alibaba.dubbo.remoting.transport.AbstractChannelHandlerDelegate.caught(AbstractChannelHandlerDelegate.java:44)
at com.alibaba.dubbo.remoting.transport.AbstractChannelHandlerDelegate.caught(AbstractChannelHandlerDelegate.java:44)
at com.alibaba.dubbo.remoting.transport.AbstractPeer.caught(AbstractPeer.java:127)
at com.alibaba.dubbo.remoting.transport.netty.NettyHandler.exceptionCaught(NettyHandler.java:112)
at com.alibaba.dubbo.remoting.transport.netty.NettyCodecAdapter$InternalDecoder.exceptionCaught(NettyCodecAdapter.java:165)
at org.jboss.netty.channel.Channels.fireExceptionCaught(Channels.java:525)
at org.jboss.netty.channel.AbstractChannelSink.exceptionCaught(AbstractChannelSink.java:48)
at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:296)
at com.alibaba.dubbo.remoting.transport.netty.NettyCodecAdapter$InternalDecoder.messageReceived(NettyCodecAdapter.java:148)
at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:268)
at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:255)
at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88)
at org.jboss.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:109)
at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:312)
at org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:90)
at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.concurrent.RejectedExecutionException: Thread pool is EXHAUSTED! Thread Name: DubboServerHandler-xxx:6666, Pool Size: 200 (active: 200, core: 200, max: 200, largest: 200), Task: 692 (completed: 492), Executor status:(isShutdown:false, isTerminated:false, isTerminating:false), in dubbo://xxx:6666!
at com.alibaba.dubbo.common.threadpool.support.AbortPolicyWithReport.rejectedExecution(AbortPolicyWithReport.java:53)
at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:821)
at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1372)
at com.alibaba.dubbo.remoting.transport.dispatcher.all.AllChannelHandler.caught(AllChannelHandler.java:65)
... 19 more
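One quick way to confirm that these pool threads keep piling up is to count them in a thread dump; a sketch, where xxx stands for the Tomcat process id:

# Count c3p0 helper threads in a live thread dump
jstack xxx | grep -F 'ThreadPoolAsynchronousRunner$PoolThread' | wc -l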
All of this pointed at the database connection pool. We had already hit some issues with c3p0 earlier and wanted better visibility into pool usage, so we decided to replace the pool with Druid, which we had already used and tested in another project; the switch therefore went to production very quickly. Continuing the load test, Druid's monitoring page (at /druid/index.html under the project context) showed that every front-end call took one more connection from the pool and that connections were never released once the call finished. As shown in the red area of the figure below, we raised the pool size to 1,000 and it filled up in no time.
[Figure: Druid monitoring page, active connection count climbing and never being released]
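For reference, this is roughly what a minimal programmatic Druid setup with the stat filter looks like; our project wired it up through Spring, and the JDBC URL, credentials and class names here are placeholders:

import com.alibaba.druid.pool.DruidDataSource;

import javax.sql.DataSource;
import java.sql.SQLException;

public class DataSourceConfig {

    // Druid data source with the stat filter enabled; the statistics it collects
    // are what the monitoring page under <context>/druid/index.html displays
    // (that page also needs Druid's StatViewServlet registered in web.xml).
    public static DataSource buildDataSource() throws SQLException {
        DruidDataSource ds = new DruidDataSource();
        ds.setUrl("jdbc:mysql://xxx:3306/xxx");  // placeholder JDBC URL
        ds.setUsername("xxx");                   // placeholder credentials
        ds.setPassword("xxx");
        ds.setInitialSize(10);
        ds.setMaxActive(1000);                   // the limit we raised during the load test
        ds.setFilters("stat");                   // enable SQL/connection statistics
        return ds;
    }
}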
From this we concluded that connections were clearly not being released after the SQL had run, so once the pool was exhausted subsequent threads could no longer execute. A code review confirmed it; see the code below. We had originally used SqlSessionFactory directly, and with SqlSessionFactory you must call session.close() after executing your SQL, otherwise the connection is never returned to the pool.
public class SessionFactory {

    @Resource
    private SqlSessionFactory coreSqlSessionFactory;

    // Opens a brand-new SqlSession (and borrows a pooled connection) on every call
    protected SqlSession getSession() {
        return coreSqlSessionFactory.openSession();
    }
}
public class BaseDao extends SessionFactory {

    public void add(Entity entity) {
        // The session opened by getSession() is never closed, so the borrowed
        // connection is never returned to the pool; this is the leak.
        this.getSession().update(entity.getClass().getSimpleName() + "."
                + Thread.currentThread().getStackTrace()[2].getMethodName(), entity);
    }
}
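For comparison, had we stayed with SqlSessionFactory, every session opened in the DAO would have needed an explicit commit and close, roughly as in the sketch below (the Entity type is the one above; the hard-coded statement id is only illustrative):

import org.apache.ibatis.session.SqlSession;

public class BaseDao extends SessionFactory {

    public void add(Entity entity) {
        SqlSession session = getSession();   // borrows a connection from the pool
        try {
            session.update(entity.getClass().getSimpleName() + ".add", entity);
            session.commit();                // openSession() defaults to autoCommit=false
        } finally {
            session.close();                 // always return the connection to the pool
        }
    }
}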
With SqlSessionTemplate, however, no manual code is needed to close the session, so we changed the SessionFactory class above to use SqlSessionTemplate (below), and the problem was solved.
public class SessionFactory {

    @Resource
    public SqlSessionTemplate coreSqlSession;

    // SqlSessionTemplate is thread-safe and Spring-managed; it releases its
    // connection after each statement, so callers never need to close it.
    protected SqlSessionTemplate getSession() {
        return coreSqlSession;
    }
}
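In our project the SqlSessionTemplate was wired up through Spring configuration; a roughly equivalent Java-config sketch (bean names are illustrative) looks like this:

import org.apache.ibatis.session.SqlSessionFactory;
import org.mybatis.spring.SqlSessionTemplate;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class MyBatisConfig {

    @Bean
    public SqlSessionTemplate coreSqlSession(SqlSessionFactory coreSqlSessionFactory) {
        // The template borrows a connection per statement and hands it straight
        // back to the pool, which is why the DAO no longer leaks connections.
        return new SqlSessionTemplate(coreSqlSessionFactory);
    }
}

The DAO code itself stays the same; only the session handling changes.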
The Mysterious Script
After these optimisations we thought the problem was solved, but a while later Tomcat mysteriously died again. We kept analysing GC behaviour, periodically generating heap dumps with jmap -dump:live,format=b,file=dump.hprof xxx to compare heap usage, and monitoring thread usage, and found nothing wrong. The problem dogged us for days; we watched the port every day and, the moment Tomcat went down, notified the operations staff to restart it. In parallel we went through every resource we could find on Tomcat dying on its own and tried each suggested cause and fix on our servers, none of which made any difference.
Finally, while googling yet more reasons for Tomcat going down, we came across the article 「Tomcat進程意外退出的問題分析」 (an analysis of unexpected Tomcat process exits) and immediately thought of a script we had recently started using. Because shutting our Tomcats down through the stock shutdown script was disabled, we had written a script to make it convenient to start, stop and restart Tomcat, and that script was the reason Tomcat kept going down. No, not the reason: the culprit! Here is the script:
#!/bin/sh
#
proc_dir="/usr/local/xxx/tomcat-zc-web/bin"
proc_name=$2
if [ x$proc_name != x ]
then
    proc_dir=${proc_dir//xxx/$proc_name}
fi
#echo $proc_dir
function stop () {
    kill -9 `ps -ef | grep $proc_dir | grep -v grep | awk '{print $2}'`
}
function start () {
    cd $proc_dir
    ./startup.sh
    tail -300f /usr/local/logs/tomcat-business/$proc_name.log
}
case $1 in
    start)
        start;;
    stop)
        stop;;
    restart)
        stop
        start;;
esac
The problem came from this single line: tail -300f /usr/local/logs/tomcat-business/$proc_name.log. In the other projects that used this script the line was not there; the usual procedure was:
1. Run tomcat.sh start xxx to start Tomcat.
2. Run tail -300f /usr/local/logs/tomcat-business/xxx.log to check the startup log and confirm the start succeeded.
For this release, to save a step, we appended the log-viewing command directly after the start command, so that running tomcat.sh start xxx would both start Tomcat and stream its log. It seemed like a great idea at the time.
The reason is that, because of the tail -300f xxx command, the script keeps running and the Tomcat process remains a child of the shell script. If the shell script is then terminated, the system kills the Tomcat process along with it and Tomcat goes down. After removing that line from our script, Tomcat behaved normally again. For the deeper analysis, see the article 「Tomcat進程意外退出的問題分析」 mentioned above; its analysis is quite thorough. It left me with the feeling that Alibaba's engineers really know their stuff, and indeed the article was written by an Alibaba employee.
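For completeness, this is roughly what the fixed start() function looks like once the blocking tail is removed: the script now exits right after startup, so Tomcat's lifetime is no longer tied to the script (the tail command can still be run manually afterwards):

function start () {
    cd $proc_dir
    ./startup.sh
    # No tail here any more; check /usr/local/logs/tomcat-business/$proc_name.log manually.
}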