【PHP問題定位】2018-07-02 fpm掉底分析

順風車運營研發團隊 黃桃php

問題現象
某機器這段時間出現cpu-idle掉地的報警 如圖:php-fpm

clipboard.png

緣由分析優化

查看當時的監控(php-fpm-idle、cpu-idle,io-wait、io-write 等)
(1)php-fpm-idle今天出現兩次突降,一次是12點左右,一次是16:30左右,查看整週也常常出現突降,如圖spa

clipboard.png

(2)io-wait在11:58分時 忽然升高 3d

clipboard.png

(3)io-write也在11:58時出現大量寫: 日誌

clipboard.png

(4)cpu-idle當時短暫出現下降,以後出現徒增,可是結合整週的曲線來看,一直維持在70-80之間,徒增緣由待分析:code

clipboard.png

(5)緣由推測:由於當時出現了大批量的寫日誌,致使io-wait上升,php-fpm進程由於寫文件出現延時,形成總體響應過慢,從而致使了fpm掉地; 對同一個文件進行 write時,大批量並行會出現等待,阻塞blog

驗證推測
(1)查看當時的php-fpm的慢日誌,看當時阻塞的地方,基本是在調用fwrite阻塞進程

clipboard.png

(2)查看當時的程序日誌trace.log的大小,日誌文件越大的時間段,正好是fpm-idle下滑嚴重的階段: 圖片

clipboard.png

(3)在經過 sar命令,驗證下當時的寫入磁盤狀況,在出現掉地的時間段確實出現極大的寫入,wr_sec/s 從每秒幾百低峯的幾百增加到十幾萬:

clipboard.png

問題緣由及優化建議
寫入日誌大可能兩方面緣由:

(1)當時請求暴增
(2)請求未暴增,可是某些請求觸發了某些不合理的打日誌
驗證緣由1,出現問題時間段每秒的流量

32 [02/Jul/2018:12:01:12
18 [02/Jul/2018:12:01:13
18 [02/Jul/2018:12:01:14
42 [02/Jul/2018:12:01:15
30 [02/Jul/2018:12:01:16
35 [02/Jul/2018:12:01:17
26 [02/Jul/2018:12:01:18
30 [02/Jul/2018:12:01:19
1 [02/Jul/2018:12:01:22
108 [02/Jul/2018:12:01:24
17 [02/Jul/2018:12:01:25
1 [02/Jul/2018:12:01:27
1 [02/Jul/2018:12:01:29
1 [02/Jul/2018:12:01:30
9 [02/Jul/2018:12:01:33
1 [02/Jul/2018:12:01:31
1 [02/Jul/2018:12:01:32
146 [02/Jul/2018:12:01:33
62 [02/Jul/2018:12:01:34
44 [02/Jul/2018:12:01:35
1 [02/Jul/2018:12:01:37
1 [02/Jul/2018:12:01:38
1 [02/Jul/2018:12:01:41
2 [02/Jul/2018:12:01:44
2 [02/Jul/2018:12:01:50
1 [02/Jul/2018:12:01:45
12 [02/Jul/2018:12:01:50
2 [02/Jul/2018:12:01:45
7 [02/Jul/2018:12:01:50
1 [02/Jul/2018:12:01:46
1 [02/Jul/2018:12:01:50
1 [02/Jul/2018:12:01:46
15 [02/Jul/2018:12:01:50
7 [02/Jul/2018:12:01:48
2 [02/Jul/2018:12:01:50
1 [02/Jul/2018:12:01:48
342 [02/Jul/2018:12:01:50
65 [02/Jul/2018:12:01:51
46 [02/Jul/2018:12:01:52
54 [02/Jul/2018:12:01:53
1 [02/Jul/2018:12:01:55
1 [02/Jul/2018:12:01:56
1 [02/Jul/2018:12:01:57
1 [02/Jul/2018:12:01:59
16 [02/Jul/2018:12:02:03
1 [02/Jul/2018:12:02:01
1 [02/Jul/2018:12:02:02
42 [02/Jul/2018:12:02:03
1 [02/Jul/2018:12:02:02
187 [02/Jul/2018:12:02:03
39 [02/Jul/2018:12:02:04
40 [02/Jul/2018:12:02:05
25 [02/Jul/2018:12:02:06
44 [02/Jul/2018:12:02:07
29 [02/Jul/2018:12:02:08

正常狀況是QPS是30左右,出問題時間段則很是不穩定,時高時低,相差很是大,例如在 12:01:50 時 342 qps,可是前十幾秒則基本都是個位數;緣由?前幾十秒都阻塞了,響應不了,積累到了12:01:50才響應;總體流量並未出現暴增;

驗證緣由2,查看當時traceId日誌,查看寫進去的內容:

發現這段寫入很是巨大,一行就119kb ,總共寫了 33555 行,總大小佔:33555 * 119KB = 3993045KB =3899M 基本能夠判定爲這行出的問題了
圖片描述

優化建議:在底層打日誌的類中對字符串長度作限制,避免這種大批量的寫入;

相關文章
相關標籤/搜索