順風車運營研發團隊 黃桃php
問題現象
某機器這段時間出現cpu-idle掉地的報警 如圖:php-fpm
緣由分析優化
查看當時的監控(php-fpm-idle、cpu-idle,io-wait、io-write 等)
(1)php-fpm-idle今天出現兩次突降,一次是12點左右,一次是16:30左右,查看整週也常常出現突降,如圖spa
(2)io-wait在11:58分時 忽然升高 3d
(3)io-write也在11:58時出現大量寫: 日誌
(4)cpu-idle當時短暫出現下降,以後出現徒增,可是結合整週的曲線來看,一直維持在70-80之間,徒增緣由待分析:code
(5)緣由推測:由於當時出現了大批量的寫日誌,致使io-wait上升,php-fpm進程由於寫文件出現延時,形成總體響應過慢,從而致使了fpm掉地; 對同一個文件進行 write時,大批量並行會出現等待,阻塞blog
驗證推測
(1)查看當時的php-fpm的慢日誌,看當時阻塞的地方,基本是在調用fwrite阻塞進程
(2)查看當時的程序日誌trace.log的大小,日誌文件越大的時間段,正好是fpm-idle下滑嚴重的階段: 圖片
(3)在經過 sar命令,驗證下當時的寫入磁盤狀況,在出現掉地的時間段確實出現極大的寫入,wr_sec/s 從每秒幾百低峯的幾百增加到十幾萬:
問題緣由及優化建議
寫入日誌大可能兩方面緣由:
(1)當時請求暴增
(2)請求未暴增,可是某些請求觸發了某些不合理的打日誌
驗證緣由1,出現問題時間段每秒的流量
32 [02/Jul/2018:12:01:12 18 [02/Jul/2018:12:01:13 18 [02/Jul/2018:12:01:14 42 [02/Jul/2018:12:01:15 30 [02/Jul/2018:12:01:16 35 [02/Jul/2018:12:01:17 26 [02/Jul/2018:12:01:18 30 [02/Jul/2018:12:01:19 1 [02/Jul/2018:12:01:22 108 [02/Jul/2018:12:01:24 17 [02/Jul/2018:12:01:25 1 [02/Jul/2018:12:01:27 1 [02/Jul/2018:12:01:29 1 [02/Jul/2018:12:01:30 9 [02/Jul/2018:12:01:33 1 [02/Jul/2018:12:01:31 1 [02/Jul/2018:12:01:32 146 [02/Jul/2018:12:01:33 62 [02/Jul/2018:12:01:34 44 [02/Jul/2018:12:01:35 1 [02/Jul/2018:12:01:37 1 [02/Jul/2018:12:01:38 1 [02/Jul/2018:12:01:41 2 [02/Jul/2018:12:01:44 2 [02/Jul/2018:12:01:50 1 [02/Jul/2018:12:01:45 12 [02/Jul/2018:12:01:50 2 [02/Jul/2018:12:01:45 7 [02/Jul/2018:12:01:50 1 [02/Jul/2018:12:01:46 1 [02/Jul/2018:12:01:50 1 [02/Jul/2018:12:01:46 15 [02/Jul/2018:12:01:50 7 [02/Jul/2018:12:01:48 2 [02/Jul/2018:12:01:50 1 [02/Jul/2018:12:01:48 342 [02/Jul/2018:12:01:50 65 [02/Jul/2018:12:01:51 46 [02/Jul/2018:12:01:52 54 [02/Jul/2018:12:01:53 1 [02/Jul/2018:12:01:55 1 [02/Jul/2018:12:01:56 1 [02/Jul/2018:12:01:57 1 [02/Jul/2018:12:01:59 16 [02/Jul/2018:12:02:03 1 [02/Jul/2018:12:02:01 1 [02/Jul/2018:12:02:02 42 [02/Jul/2018:12:02:03 1 [02/Jul/2018:12:02:02 187 [02/Jul/2018:12:02:03 39 [02/Jul/2018:12:02:04 40 [02/Jul/2018:12:02:05 25 [02/Jul/2018:12:02:06 44 [02/Jul/2018:12:02:07 29 [02/Jul/2018:12:02:08
正常狀況是QPS是30左右,出問題時間段則很是不穩定,時高時低,相差很是大,例如在 12:01:50 時 342 qps,可是前十幾秒則基本都是個位數;緣由?前幾十秒都阻塞了,響應不了,積累到了12:01:50才響應;總體流量並未出現暴增;
驗證緣由2,查看當時traceId日誌,查看寫進去的內容:
發現這段寫入很是巨大,一行就119kb ,總共寫了 33555 行,總大小佔:33555 * 119KB = 3993045KB =3899M 基本能夠判定爲這行出的問題了
優化建議:在底層打日誌的類中對字符串長度作限制,避免這種大批量的寫入;