Linux IO Monitoring and In-Depth Analysis

Date: 2019-11-11

Tags: linux, monitoring, in-depth analysis. Category: Linux

Original article:

https://jaminzhang.github.io/os/Linux-IO-Monitoring-and-Deep-Analysis/html


Introduction

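In a phone interview yesterday, the interviewer asked how to analyze system IO. My first reaction was to use iotop to look at the IO read/write rate of each process, and then use iostat to look at the CPU's %iowait share (%iowait: the percentage of time the CPU spends waiting for input/output to complete; a %iowait value that is too high suggests the disk may be an I/O bottleneck).
But my answer was not very complete. Admittedly, I wrote an article on using Linux iostat quite a while ago, but I had not analyzed IO state on a real system for a long time, so I had forgotten several analysis tools and parameters (which shows that becoming familiar with a piece of knowledge or a skill takes continual application and repeated study; practice really does make perfect. But I digress, so back to IO monitoring and analysis). The interviewer then hinted that I should also look at the %util parameter (which indicates how busy the disk is). As soon as he said it, I remembered it too. It is one of the parameters that is routinely checked.
Below I look up the relevant material again and relearn it. It still takes frequent use in real work to become proficient.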

1 System-level IO monitoring

iostat

 
  [root@xxxx_wan360_game ~]# iostat -xdm 1
  Linux 2.6.32-358.el6.x86_64 (xxxx_wan360_game)  12/06/2016  _x86_64_  (8 CPU)

  Device:  rrqm/s  wrqm/s   r/s   w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  svctm  %util
  xvdep1     0.00    0.00  0.01  1.31   0.00   0.02     31.35      0.00   1.63   0.06   0.01

  Device:  rrqm/s  wrqm/s   r/s   w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  svctm  %util
  xvdep1     0.00    0.00  0.00  0.00   0.00   0.00      0.00      0.00   0.00   0.00   0.00

  Device:  rrqm/s  wrqm/s   r/s   w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  svctm  %util
  xvdep1     0.00    0.00  0.00  0.00   0.00   0.00      0.00      0.00   0.00   0.00   0.00

  Device:  rrqm/s  wrqm/s   r/s   w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  svctm  %util
  xvdep1     0.00    0.00  0.00  3.00   0.00   0.01      8.00      0.00   0.00   0.00   0.00

  # iostat options
  -x  Display extended statistics. This option works with post 2.5 kernels since it
      needs /proc/diskstats file or a mounted sysfs to get the statistics. This option
      may also work with older kernels (e.g. 2.4) only if extended statistics are
      available in /proc/partitions (the kernel needs to be patched for that).
  -d  Display the device utilization report.
  -m  Display statistics in megabytes per second instead of blocks or kilobytes per
      second. Data displayed are valid only with kernels 2.4 and later.
 

rrqm/s
       The number of read requests merged per second that were queued to the device.

wrqm/s
       The number of write requests merged per second that were queued to the device.

r/s
       The number of read requests that were issued to the device per second.

w/s
       The number of write requests that were issued to the device per second.

rMB/s
       The number of megabytes read from the device per second.

wMB/s
       The number of megabytes written to the device per second.


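avgrq-sz
	The average size (in sectors) of the requests that were issued to the device.
	A disk sector is a physical property of the disk: it is the smallest unit the disk device can address. The sector size of a disk can be checked with the fdisk -l command.
	The commonly mentioned "block", by contrast, is a file system abstraction, not a property of the disk itself.
	Another explanation:
	The size of the IO requests submitted to the driver layer, generally no smaller than 4K and no larger than max(readahead_kb, max_sectors_kb).
	It can be used to judge the current IO pattern: in general, and especially when the disk is busy, a larger value indicates sequential IO and a smaller value indicates random IO.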
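
To relate avgrq-sz to the "4K" figure above, it helps to convert sectors to bytes. The sketch below assumes 512-byte sectors, which is the unit the Linux block layer uses in its statistics; the helper name avgrq_sz_to_kib is made up for this illustration.

```python
SECTOR_BYTES = 512  # assumption: block-layer statistics count 512-byte sectors


def avgrq_sz_to_kib(avgrq_sz_sectors: float) -> float:
    """Convert an avgrq-sz value (in sectors) to KiB per request."""
    return avgrq_sz_sectors * SECTOR_BYTES / 1024


# The two non-zero avgrq-sz values from the iostat samples above.
for value in (8.00, 31.35):
    print(f"avgrq-sz {value:6.2f} sectors -> {avgrq_sz_to_kib(value):.2f} KiB per request")
```

Under that assumption, the last sample above (avgrq-sz 8.00) corresponds to 4 KiB requests, the small end of the range.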
	 
avgqu-sz
	The average queue length of the requests that were issued to the device.

await
	The average time (in milliseconds) for I/O requests issued to the device to be served.
	This includes the time spent by the requests in queue and the time spent servicing them.
	Another explanation:
	The average time (in milliseconds) taken to handle each I/O request. This can be understood as the I/O response time.
	Generally, system I/O response time should be below 5 ms; anything above 10 ms is on the high side.

svctm
	The average service time (in milliseconds) for I/O requests that were issued to the device.
	Warning! Do not trust this field any more. This field will be removed in a future sysstat version.
	Another explanation:
	The service time of a single IO request. For a single disk doing fully random reads it is roughly 7 ms, i.e. seek time + rotational latency.
	
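%util
	Percentage of elapsed time during which I/O requests were issued to the device
	(bandwidth utilization for the device). Device saturation occurs when this value is close to 100%.
	Another explanation:
	Represents how busy the disk is. 100% means the disk is busy, 0% means the disk is idle. Note, however, that a busy disk does not imply high disk (bandwidth) utilization.
	It is the total time spent processing I/O within the sampling interval, divided by the length of the interval.
	For example, if the interval is 1 second and the device spends 0.8 seconds processing I/O and 0.2 seconds idle, then %util = 0.8/1 = 80%,
	so this parameter hints at how busy the device is. Generally, a value of 100% means the device is close to running at full load
	(of course, with multiple disks, even a %util of 100% does not necessarily mean a bottleneck has been reached, because the disks can serve requests concurrently).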
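
To make the fields above concrete, here is a minimal sketch of how they can be derived from two snapshots of /proc/diskstats, the file that the -x option description mentions. It assumes the classic layout of that file (major, minor, device name, then eleven counters) and uses the usual difference-over-interval formulas. The names DiskSample, parse_diskstats_line and derive_metrics are made up for this illustration, and the two sample lines are invented values, not real measurements.

```python
from typing import Dict, NamedTuple, Tuple


class DiskSample(NamedTuple):
    rd_ios: int         # reads completed
    rd_merges: int      # reads merged
    rd_sectors: int     # sectors read
    rd_ticks: int       # ms spent reading
    wr_ios: int         # writes completed
    wr_merges: int      # writes merged
    wr_sectors: int     # sectors written
    wr_ticks: int       # ms spent writing
    in_flight: int      # I/Os currently in progress (a gauge, not a counter)
    io_ticks: int       # ms during which the device had I/O in progress
    time_in_queue: int  # weighted ms spent doing I/O


def parse_diskstats_line(line: str) -> Tuple[str, DiskSample]:
    """Parse one /proc/diskstats line; counters added by newer kernels are ignored."""
    fields = line.split()
    return fields[2], DiskSample(*(int(f) for f in fields[3:14]))


def derive_metrics(prev: DiskSample, curr: DiskSample, interval_s: float) -> Dict[str, float]:
    """Derive iostat-style values from two samples taken interval_s seconds apart."""
    d = DiskSample(*(c - p for p, c in zip(prev, curr)))
    ios = d.rd_ios + d.wr_ios
    interval_ms = interval_s * 1000.0
    mib = 1024 * 1024
    return {
        "rrqm/s": d.rd_merges / interval_s,
        "wrqm/s": d.wr_merges / interval_s,
        "r/s": d.rd_ios / interval_s,
        "w/s": d.wr_ios / interval_s,
        "rMB/s": d.rd_sectors * 512 / mib / interval_s,
        "wMB/s": d.wr_sectors * 512 / mib / interval_s,
        "avgrq-sz": (d.rd_sectors + d.wr_sectors) / ios if ios else 0.0,
        "avgqu-sz": d.time_in_queue / interval_ms,
        "await": (d.rd_ticks + d.wr_ticks) / ios if ios else 0.0,
        "svctm": d.io_ticks / ios if ios else 0.0,
        "%util": min(100.0, d.io_ticks * 100.0 / interval_ms),
    }


# Invented samples one second apart: 300 writes, device busy for 800 of 1000 ms.
_, before = parse_diskstats_line("202 65 xvdep1 100 10 800 50 200 20 1600 300 0 250 350")
_, after = parse_diskstats_line("202 65 xvdep1 100 10 800 50 500 20 4000 2700 0 1050 2750")
for key, value in derive_metrics(before, after, 1.0).items():
    print(f"{key:>9}: {value:.2f}")
```

The %util line is exactly the 0.8/1 = 80% calculation described above: the busy-time counter advanced by 800 ms during a 1000 ms interval.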

%iowait
	Show the percentage of time that the CPU or CPUs were idle during
	which the system had an outstanding disk I/O request.

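Summary:
iostat reports the IO submitted directly to the device by the generic block layer after merging (rrqm/s, wrqm/s). It reflects the overall IO state of the system,
but it has the following 2 drawbacks:

1. It is far from the application layer and does not correspond to the write and read calls in the code (because of readahead, the page cache, the IO scheduler and similar factors, a correspondence is hard to establish anyway).
2. It is system-level and cannot be narrowed down to a process. For example, it can only tell you that the disk is busy right now, but not who is keeping it busy or with what.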

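A summary from another source:

If %iowait is too high, the disk may be an I/O bottleneck.
If %util is close to 100%, too many I/O requests are being generated, the I/O system is at full load, and the disk may be a bottleneck.
If svctm is fairly close to await, I/O requests spend almost no time waiting.
If await is much larger than svctm, the I/O queue is too long and I/O responses are too slow, so some tuning is needed. (Keep in mind the warning above that svctm is no longer trustworthy.)
If avgqu-sz is fairly large, that also means a lot of IO is waiting.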
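
The rules of thumb above can be written down as a small check. This is only a sketch: the function name diagnose is made up, the 10 ms await limit is the figure quoted earlier in this article, and the other thresholds (90% for "close to 100%", a queue length of 1, a factor of 2 for "much larger") are assumptions chosen for illustration that should be adapted to the actual hardware.

```python
from typing import Dict, List


def diagnose(metrics: Dict[str, float],
             util_high: float = 90.0,       # assumed meaning of "close to 100%"
             await_high_ms: float = 10.0,   # "above 10 ms is high" from the text
             queue_high: float = 1.0,       # assumed meaning of "fairly large"
             wait_ratio: float = 2.0) -> List[str]:  # assumed "much larger"
    """Apply the iostat rules of thumb to one device sample."""
    findings = []
    if metrics["%util"] >= util_high:
        findings.append("%util is close to 100%: the device may be saturated")
    if metrics["await"] > await_high_ms:
        svctm = metrics.get("svctm", 0.0)
        if svctm > 0 and metrics["await"] > wait_ratio * svctm:
            findings.append("await is high and much larger than svctm: requests are queueing")
        else:
            findings.append("await is high: I/O response time is slow")
    if metrics["avgqu-sz"] > queue_high:
        findings.append("avgqu-sz is large: many requests are waiting")
    return findings


# First sample from the iostat output above (an almost idle disk).
idle = {"%util": 0.01, "await": 1.63, "svctm": 0.06, "avgqu-sz": 0.00}
# Invented values for a busy disk.
busy = {"%util": 99.0, "await": 25.0, "svctm": 5.0, "avgqu-sz": 4.0}
print(diagnose(idle))
for finding in diagnose(busy):
    print(finding)
```

The await/svctm comparison is only evaluated once await itself is high. Otherwise an idle disk such as the first sample (await 1.63, svctm 0.06) would be flagged even though nothing is wrong.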

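2 Process-level IO monitoring

iotop and pidstat

iotop: as the name suggests, top for IO.
pidstat: as the name suggests, stat for processes (pid). A process's stats naturally include its IO.

Both commands report IO per process, so they can answer the following two questions:

Which processes on the system are currently using IO, and what is each one's share?
Are the processes using IO reading or writing, and how much?

pidstat has many options; use whichever you need.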

 
  [root@xxxx_wan360_game ~]# pidstat -d 1    # show IO only
  Linux 2.6.32-358.el6.x86_64 (xxxx_wan360_game)  12/06/2016  _x86_64_  (8 CPU)

  05:28:57 PM    PID  kB_rd/s  kB_wr/s  kB_ccwr/s  Command
  05:28:58 PM     50     0.00     4.00       0.00  sync_supers
  05:28:58 PM    627     0.00     8.00       0.00  flush-202:65
  05:28:58 PM   3852     0.00     8.00       0.00  cente_s0001
  05:28:58 PM   3860     0.00     4.00       0.00  game_s0001
  05:28:58 PM   3864     0.00     4.00       0.00  game_s0001
  05:28:58 PM   3868     0.00     4.00       0.00  game_s0001
  05:28:58 PM   3876     0.00     4.00       0.00  gate_s0001
  05:28:58 PM   3880     0.00     4.00       0.00  gate_s0001

  05:28:58 PM    PID  kB_rd/s  kB_wr/s  kB_ccwr/s  Command

  05:28:59 PM    PID  kB_rd/s  kB_wr/s  kB_ccwr/s  Command
  05:29:00 PM  23922     0.00    20.00       0.00  filebeat

  # pidstat -u -r -d -t 1
  # -u  CPU usage
  # -r  page faults and memory information
  # -d  IO information
  # -t  report per thread
  # 1   sample once per second
  [root@xxxx_wan360_game ~]# pidstat -u -r -d -t 1
  Linux 2.6.32-358.el6.x86_64 (xxxx_wan360_game)  12/06/2016  _x86_64_  (8 CPU)

  05:32:11 PM  TGID   TID   %usr  %system  %guest  %CPU  CPU  Command
  05:32:12 PM  3856     -   3.74     0.93    0.00  4.67    5  game_s0001
  05:32:12 PM     -  3856   4.67     0.93    0.00  5.61    5  |__game_s0001
  05:32:12 PM     -  3922   0.93     0.00    0.00  0.93    2  |__game_s0001
  05:32:12 PM  3880     -   0.00     0.93    0.00  0.93    3  gate_s0001
  05:32:12 PM     -  3908   0.00     0.93    0.00  0.93    3  |__gate_s0001
  05:32:12 PM  6832     -   1.87     4.67    0.00  6.54    4  pidstat
  05:32:12 PM     -  6832   1.87     4.67    0.00  6.54    4  |__pidstat

  05:32:11 PM  TGID   TID  minflt/s  majflt/s     VSZ   RSS  %MEM  Command
  05:32:12 PM  6803     -      5.61      0.00    4124   796  0.00  iostat
  05:32:12 PM     -  6803      5.61      0.00    4124   796  0.00  |__iostat
  05:32:12 PM  6832     -   1321.50      0.00  101432  1280  0.01  pidstat
  05:32:12 PM     -  6832   1321.50      0.00  101432  1280  0.01  |__pidstat
  05:32:12 PM  8391     -      0.93      0.00   17992  1176  0.01  zabbix_agentd
  05:32:12 PM     -  8391      0.93      0.00   17992  1176  0.01  |__zabbix_agentd
  05:32:12 PM  8392     -      2.80      0.00   20064  1320  0.01  zabbix_agentd
  05:32:12 PM     -  8392      2.80      0.00   20064  1320  0.01  |__zabbix_agentd

  05:32:11 PM  TGID   TID  kB_rd/s  kB_wr/s  kB_ccwr/s  Command
  05:32:12 PM  1894     -     0.00     3.74       0.00  mysqld
  05:32:12 PM     -  1923     0.00     3.74       0.00  |__mysqld
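
For a rough idea of where per-process numbers like kB_rd/s, kB_wr/s and kB_ccwr/s can come from: the kernel exposes per-process IO counters in /proc/<pid>/io, including read_bytes, write_bytes and cancelled_write_bytes. The sketch below turns two snapshots of that file into pidstat-style rates. The function names are made up, the snapshot text is invented, and the mapping of columns to counters is my understanding rather than something taken from the pidstat source.

```python
from typing import Dict


def parse_proc_io(text: str) -> Dict[str, int]:
    """Parse the 'key: value' lines of a /proc/<pid>/io snapshot."""
    counters = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            counters[key.strip()] = int(value)
    return counters


def io_rates_kb(prev: Dict[str, int], curr: Dict[str, int], interval_s: float) -> Dict[str, float]:
    """Turn two snapshots taken interval_s seconds apart into kB/s rates."""
    def rate(key: str) -> float:
        return (curr[key] - prev[key]) / 1024 / interval_s

    return {
        "kB_rd/s": rate("read_bytes"),
        "kB_wr/s": rate("write_bytes"),
        "kB_ccwr/s": rate("cancelled_write_bytes"),
    }


# Invented snapshots one second apart; on a real system the text would come
# from reading /proc/<pid>/io, which normally needs to be your own process or root.
BEFORE = "rchar: 1000\nwchar: 2000\nsyscr: 10\nsyscw: 20\nread_bytes: 0\nwrite_bytes: 8192\ncancelled_write_bytes: 0\n"
AFTER = "rchar: 1000\nwchar: 6096\nsyscr: 10\nsyscw: 21\nread_bytes: 0\nwrite_bytes: 12288\ncancelled_write_bytes: 0\n"
print(io_rates_kb(parse_proc_io(BEFORE), parse_proc_io(AFTER), 1.0))
```

In this invented example the process wrote 4096 bytes to storage during the second, which would show up as 4.00 in the kB_wr/s column.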
 

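Summary:

Process-level IO monitoring:

It can answer the 2 questions that system-level IO monitoring cannot.
It is relatively close to the application layer (for example, it can report how much a process reads and writes).

However, it still cannot be tied to the application layer's read and write calls, and its granularity is coarse: it cannot tell you which files the process is reading and writing, how long the calls take, or how large they are.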

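3 Application-level IO monitoring

ioprofile

The ioprofile command is essentially lsof + strace. ioprofile can answer the following three questions:

Which files did the process read and write at the application level (read, write) during a given period?
How many reads and writes were there? (number of read and write calls)
How much data was read and written? (number of bytes passed to read and write)

Note: ioprofile only supports multi-threaded programs, not single-threaded ones. For application-level IO analysis of a single-threaded program, strace is enough.

Summary: ioprofile is essentially strace, so it can show the trace of read and write calls and can be used for application-level IO analysis (it cannot help with mmap-based IO).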
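
To illustrate the "essentially strace" idea, the sketch below summarizes read and write calls from strace output, answering the three questions above for one trace. It is not ioprofile itself. It assumes the trace was captured with something like strace -f -y -e trace=read,write -p <pid>, where -y makes strace print the path behind each file descriptor. The function name summarize_strace and the sample lines are invented, and calls that strace splits into "unfinished" and "resumed" lines are simply skipped.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Matches e.g.  [pid  3856] write(5</data/db.bin>, "x"..., 100) = 100
# Failed calls ("= -1 EAGAIN ...") do not match because the result must be a count.
CALL_RE = re.compile(r"\b(read|write)\((\d+)(?:<([^>]*)>)?,.*\)\s+=\s+(\d+)")


def summarize_strace(lines: Iterable[str]) -> Dict[Tuple[str, str], Dict[str, int]]:
    """Count calls and bytes per (syscall, file) from strace output lines."""
    stats = defaultdict(lambda: {"calls": 0, "bytes": 0})
    for line in lines:
        match = CALL_RE.search(line)
        if not match:
            continue
        syscall, fd, path, nbytes = match.groups()
        target = path or f"fd {fd}"  # without -y only the descriptor number is known
        entry = stats[(syscall, target)]
        entry["calls"] += 1
        entry["bytes"] += int(nbytes)
    return dict(stats)


# Invented sample lines in strace's output format.
SAMPLE = [
    'read(3</var/log/app.log>, "abc"..., 4096) = 3',
    '[pid  3856] write(5</data/db.bin>, "x"..., 100) = 100',
    '[pid  3856] write(5</data/db.bin>, "y"..., 50) = 50',
    'read(4, 0x7ffd1234, 10) = -1 EAGAIN (Resource temporarily unavailable)',
    'close(3</var/log/app.log>) = 0',
]
for (syscall, target), entry in sorted(summarize_strace(SAMPLE).items()):
    print(f"{syscall:5} {target:20} calls={entry['calls']} bytes={entry['bytes']}")
```

As with ioprofile, IO done through mmap never appears as read or write calls, so it is invisible to this kind of summary.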

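4 File-level IO monitoring

File-level IO monitoring can complement application-level and process-level IO analysis.
File-level IO analysis focuses on a single file and answers which processes are currently reading or writing it.

lsof or ls /proc/pid/fd
inodewatch.stp

lsof tells you which processes currently have the file open.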

 
  [root@xxxx_wan360_game ~]# lsof ./    # the current directory is currently open by the bash and lsof processes
  COMMAND    PID  USER   FD  TYPE  DEVICE  SIZE/OFF      NODE  NAME
  lsof      8932  root  cwd   DIR  202,65      4096  33652737  .
  lsof      8938  root  cwd   DIR  202,65      4096  33652737  .
  bash     16678  root  cwd   DIR  202,65      4096  33652737  .
 

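lsof can only give a static snapshot, and "open" does not necessarily mean "being read".
For commands such as cat and echo, opening and reading are over in an instant, so lsof can hardly catch them. The SystemTap script inodewatch.stp can be used to fill that gap.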
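
The ls /proc/pid/fd approach can also be scripted. The sketch below walks the fd directories under /proc and reports which PIDs hold a given file open, as a rough stand-in for lsof <file>. The function name pids_with_file_open and the proc_root parameter are made up for this illustration, and the path in the usage line is hypothetical. Like lsof, it only gives a static snapshot, and without root it only sees processes whose fd directory is readable.

```python
import os
from typing import List


def pids_with_file_open(target: str, proc_root: str = "/proc") -> List[int]:
    """Return the PIDs that have `target` open, by scanning <proc_root>/<pid>/fd."""
    target = os.path.realpath(target)
    holders = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        fd_dir = os.path.join(proc_root, entry, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited, or no permission to look at it
        for fd in fds:
            try:
                link = os.readlink(os.path.join(fd_dir, fd))
            except OSError:
                continue  # descriptor was closed in the meantime
            if link == target:
                holders.append(int(entry))
                break
    return sorted(holders)


if __name__ == "__main__" and os.path.isdir("/proc"):
    # Hypothetical path; replace it with the file you are interested in.
    print(pids_with_file_open("/var/log/messages"))
```

Short-lived opens, such as those done by cat or echo, will usually be missed by a scan like this for the same reason lsof misses them.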

Ref

IO monitoring and analysis on Linux
Analyzing IO performance with iostat
Performance optimization: analyzing IO bottlenecks
