Basic Linux Operations Metrics

Date: 2019-11-16

Tags: linux, basic metrics. Category: Linux


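1. Basic Linux Operations Metrics

In operations work, having problems is not what we fear. What we fear is that when a problem occurs we cannot capture the scene and are left groping in the dark. That is why it matters so much to rely on a strong monitoring system and collect as many metrics as possible. But which metrics are actually meaningful? In the spirit of learning from practice, the experience that engineers have accumulated through long hands-on work is the most valuable.

From the long-term working practice of our operations engineers, we have summarized the metrics that are frequently consulted during system operations. They fall mainly into the following categories: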

CPU
Load
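Memory
Disk
IO
Network
Kernel parameters
ss statistics output
Port collection
Liveness of core service processes
Resource consumption of key business processes
NTP offset collection
DNS resolution collection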

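The detailed metrics in each category are listed below. All of them are supported directly by the agent component of open-falcon. falcon-agent collects these metrics at a fixed interval (currently 60 seconds) and reports them to the server.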

2. CPU Metrics

How it is computed: the values are obtained by sampling /proc/stat. The statistics output of the sar command is a useful reference for understanding them.

cpu.idle: Percentage of time that the CPU or CPUs were idle and the system did not have an outstanding disk I/O request.
cpu.busy: the counterpart of cpu.idle; its value equals 100 minus cpu.idle.
cpu.guest: Percentage of time spent by the CPU or CPUs to run a virtual processor.
cpu.iowait: Percentage of time that the CPU or CPUs were idle during which the system had an outstanding disk I/O request.
cpu.irq: Percentage of time spent by the CPU or CPUs to service hardware interrupts.
cpu.softirq: Percentage of time spent by the CPU or CPUs to service software interrupts.
cpu.nice: Percentage of CPU utilization that occurred while executing at the user level with nice priority.
cpu.steal: Percentage of time spent in involuntary wait by the virtual CPU or CPUs while the hypervisor was servicing another virtual processor.
cpu.system: Percentage of CPU utilization that occurred while executing at the system level (kernel).
cpu.user: Percentage of CPU utilization that occurred while executing at the user level (application).
cpu.cnt: number of CPU cores.
cpu.switches: number of CPU context switches; counter type.
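
To make the /proc/stat calculation concrete, here is a minimal Python sketch of how the percentages could be derived from two snapshots of the aggregate cpu line. It is not falcon-agent's code: the function names and sample values are made up for illustration, and leaving guest time out of the total is an assumption.

```python
# Field order of the aggregate "cpu" line in /proc/stat, as documented in proc(5).
FIELDS = ("user", "nice", "system", "idle", "iowait",
          "irq", "softirq", "steal", "guest")


def parse_cpu_line(stat_text):
    """Return {field: jiffies} for the aggregate 'cpu' line of /proc/stat."""
    for line in stat_text.splitlines():
        parts = line.split()
        if parts and parts[0] == "cpu":
            values = [int(v) for v in parts[1:1 + len(FIELDS)]]
            return dict(zip(FIELDS, values))
    raise ValueError("no aggregate cpu line found")


def cpu_percentages(before, after):
    """Turn two snapshots into cpu.* percentages over the interval."""
    deltas = {k: after[k] - before[k] for k in before if k in after}
    # Assumption: guest time is left out of the total because the kernel
    # already accounts for it inside user time.
    total = sum(v for k, v in deltas.items() if k != "guest")
    if total <= 0:
        return {}
    result = {"cpu." + k: 100.0 * v / total for k, v in deltas.items()}
    result["cpu.busy"] = 100.0 - result["cpu.idle"]
    return result


# Two made-up snapshots taken some time apart.
SAMPLE_BEFORE = "cpu  100 0 50 800 50 0 0 0 0 0\ncpu0 100 0 50 800 50 0 0 0 0 0\n"
SAMPLE_AFTER = "cpu  160 0 90 1650 100 0 0 0 0 0\ncpu0 160 0 90 1650 100 0 0 0 0 0\n"

metrics = cpu_percentages(parse_cpu_line(SAMPLE_BEFORE),
                          parse_cpu_line(SAMPLE_AFTER))
```

On a real host the two snapshots would come from reading /proc/stat twice, one collection interval apart.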

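3. Disk Metrics

How it is computed: first read /proc/mounts to get all mount points, then use syscall.Statfs_t to get block and inode usage. Every metric carries a set of tags of the form mount=$mount,fstype=$fstype, where $mount is the mount point (for example /home) and $fstype is the file system type (for example ext4).

df.bytes.free: available disk space, int64
df.bytes.free.percent: available disk space as a percentage of the total, float64, for example 32.1
df.bytes.total: total disk size, int64
df.bytes.used: used disk space, int64
df.bytes.used.percent: used disk space as a percentage of the total, float64
df.inodes.total: total number of inodes, int64
df.inodes.free: number of free inodes, int64
df.inodes.free.percent: percentage of free inodes, float64
df.inodes.used: number of inodes in use, int64
df.inodes.used.percent: percentage of inodes in use, float64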
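
The same two steps (list the mount points, then query block and inode usage) can be sketched in Python with os.statvfs, which exposes the same counters as the Go Statfs_t structure. This is only an illustration: the helper names are hypothetical, and computing percentages against used+free (the way df does) is an assumption about the agent's behavior.

```python
import os


def parse_mounts(mounts_text):
    """Return [(mount_point, fstype), ...] from /proc/mounts content."""
    mounts = []
    for line in mounts_text.splitlines():
        parts = line.split()
        if len(parts) >= 3:
            mounts.append((parts[1], parts[2]))
    return mounts


def percent(part, whole):
    """Percentage that tolerates pseudo file systems reporting zero sizes."""
    return 100.0 * part / whole if whole else 0.0


def df_metrics(frsize, blocks, bfree, bavail, files, ffree):
    """Compute df.* metrics from raw statvfs counters."""
    bytes_used = (blocks - bfree) * frsize
    bytes_free = bavail * frsize  # space available to unprivileged users
    # Assumption: percentages use used+free as the base, as df does,
    # so blocks reserved for root are not counted.
    base = bytes_used + bytes_free
    inodes_used = files - ffree
    return {
        "df.bytes.total": blocks * frsize,
        "df.bytes.used": bytes_used,
        "df.bytes.free": bytes_free,
        "df.bytes.used.percent": percent(bytes_used, base),
        "df.bytes.free.percent": percent(bytes_free, base),
        "df.inodes.total": files,
        "df.inodes.used": inodes_used,
        "df.inodes.free": ffree,
        "df.inodes.used.percent": percent(inodes_used, files),
        "df.inodes.free.percent": percent(ffree, files),
    }


def collect(mount_point):
    """Query one mount point on a live system."""
    st = os.statvfs(mount_point)
    return df_metrics(st.f_frsize, st.f_blocks, st.f_bfree,
                      st.f_bavail, st.f_files, st.f_ffree)
```

A collector would call collect(mount) for every pair returned by parse_mounts and attach mount and fstype as tags.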

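4. megacli Tool Output

The megacli tool is used to read RAID information. Every metric carries a set of tags that identify the PD or VD it belongs to. The PD format is PD=Enclosure_ID:SLOT_ID; for example, PD=32:0 denotes the first physical disk and VD=0 denotes the first logical disk.

sys.disk.lsiraid.pd.Media_Error_Count: this metric and the three below are currently collected for data only; they do not necessarily mean the disk is damaged (only that the probability of damage is higher)
sys.disk.lsiraid.pd.Other_Error_Count
sys.disk.lsiraid.pd.Predictive_Failure_Count
sys.disk.lsiraid.pd.Drive_Temperature
sys.disk.lsiraid.pd.Firmware_state: a nonzero value means this physical disk has a problem
sys.disk.lsiraid.vd.cache_policy: a nonzero value means the cache policy of this logical disk does not match the configured setting
sys.disk.lsiraid.vd.state: a nonzero value means this logical disk has a problem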

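5. SMART Tool Output

The smartctl tool is used to read disk SMART information. All of these metrics are currently collected for data only; they do not necessarily mean the disk is damaged (only that the probability is higher). Every metric carries a set of tags identifying the device, for example device=/dev/sda.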

sys.disk.smart.Reallocated_Sector_Ct
sys.disk.smart.Spin_Retry_Count
sys.disk.smart.Reallocated_Event_Count
sys.disk.smart.Current_Pending_Sector
sys.disk.smart.Offline_Uncorrectable
sys.disk.smart.Temperature_Celsius

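6. Partition Read/Write Monitoring

Tests whether every mounted partition is readable and writable. Every metric carries a set of tags identifying the mount point, for example mount=/home

sys.disk.rw: a nonzero value means reads or writes on this partition are failing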
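
The source does not say how the agent performs this test. One plausible approach, sketched below in Python purely as an assumption, is to write a small temporary file under the mount point and read it back. The function name check_rw and the probe payload are hypothetical.

```python
import tempfile


def check_rw(mount_point):
    """Return 0 if a probe file can be written and read back, else 1."""
    payload = b"rw-probe"
    try:
        # The temporary file is removed automatically when the block exits.
        with tempfile.NamedTemporaryFile(dir=mount_point) as probe:
            probe.write(payload)
            probe.flush()
            probe.seek(0)
            ok = probe.read() == payload
    except OSError:
        # Covers read-only file systems, missing paths, and I/O errors.
        return 1
    return 0 if ok else 1
```

The return value follows the convention of sys.disk.rw: 0 for a healthy partition, nonzero for a failure.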

7. IO Metrics

How it is computed: /proc/diskstats is sampled once per second and the differences are computed; all of these are counter type. Every metric carries a set of tags of the form device=$device identifying the specific device, for example sda1 or sdb. See the iostat documentation for the meaning of each metric.

disk.io.ios_in_progress: Number of actual I/O requests currently in flight.
disk.io.msec_read: Total number of ms spent by all reads.
disk.io.msec_total: Amount of time during which ios_in_progress >= 1.
disk.io.msec_weighted_total: Measure of recent I/O completion time and backlog.
disk.io.msec_write: Total number of ms spent by all writes.
disk.io.read_merged: Adjacent read requests merged in a single req.
disk.io.read_requests: Total number of reads completed successfully.
disk.io.read_sectors: Total number of sectors read successfully.
disk.io.write_merged: Adjacent write requests merged in a single req.
disk.io.write_requests: Total number of writes completed successfully.
disk.io.write_sectors: Total number of sectors written successfully.
disk.io.read_bytes: a number in bytes
disk.io.write_bytes: a number in bytes
disk.io.avgrq_sz: this and the values below are the ones shown by iostat -x 1
disk.io.avgqu-sz
disk.io.await
disk.io.svctm
disk.io.util: a percentage; for example, 56.43 means 56.43%
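
As an illustration of the difference calculation, the following Python sketch parses two /proc/diskstats snapshots and derives the iostat-style values from their deltas. The formulas follow the usual iostat definitions; whether falcon-agent uses exactly these is an assumption, and the sample lines are made up.

```python
# The 11 classic per-device counters in /proc/diskstats, after major, minor, name.
KEYS = ("read_requests", "read_merged", "read_sectors", "msec_read",
        "write_requests", "write_merged", "write_sectors", "msec_write",
        "ios_in_progress", "msec_total", "msec_weighted_total")

SECTOR_BYTES = 512  # /proc/diskstats always counts 512-byte sectors


def parse_diskstats(text):
    """Return {device: {counter: value}} from /proc/diskstats content."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 14:
            stats[parts[2]] = dict(zip(KEYS, (int(v) for v in parts[3:14])))
    return stats


def io_metrics(before, after, interval_sec):
    """Derive iostat-style values for one device from two snapshots."""
    d = {k: after[k] - before[k] for k in KEYS}
    n_io = d["read_requests"] + d["write_requests"]
    interval_ms = interval_sec * 1000.0
    per_io = (lambda x: x / n_io) if n_io else (lambda x: 0.0)
    return {
        "disk.io.read_bytes": d["read_sectors"] * SECTOR_BYTES,
        "disk.io.write_bytes": d["write_sectors"] * SECTOR_BYTES,
        "disk.io.avgrq_sz": per_io(d["read_sectors"] + d["write_sectors"]),
        "disk.io.avgqu-sz": d["msec_weighted_total"] / interval_ms,
        "disk.io.await": per_io(d["msec_read"] + d["msec_write"]),
        "disk.io.svctm": per_io(d["msec_total"]),
        "disk.io.util": 100.0 * d["msec_total"] / interval_ms,
    }


# Made-up snapshots of the same device taken one second apart.
BEFORE = "   8       0 sda 1000 10 20000 500 2000 20 40000 1500 0 1200 2000\n"
AFTER = "   8       0 sda 1100 10 21600 600 2100 20 43200 1900 0 1764 2500\n"

sda = io_metrics(parse_diskstats(BEFORE)["sda"], parse_diskstats(AFTER)["sda"], 1)
```

Note that ios_in_progress is an instantaneous value rather than a running total, so the sketch does not use its difference.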

8. Machine Load Metrics

How it is computed: read /proc/loadavg; all of these are raw value type:

load.1min
load.5min
load.15min
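
Reading these three values takes only a few lines. The Python sketch below assumes the standard /proc/loadavg layout, in which the first three fields are the 1, 5 and 15 minute averages; the function name and sample line are illustrative.

```python
def parse_loadavg(text):
    """Return the three load averages from /proc/loadavg content."""
    one, five, fifteen = text.split()[:3]
    return {
        "load.1min": float(one),
        "load.5min": float(five),
        "load.15min": float(fifteen),
    }


# Made-up sample in the format of /proc/loadavg.
load = parse_loadavg("0.52 0.41 0.36 2/345 12345\n")
```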

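9. Memory Metrics

How it is computed: read the contents of /proc/meminfo, where mem.memfree is free+buffers+cached and mem.memused=mem.memtotal-mem.memfree. See the output and documentation of the free command for the meaning of each metric.

mem.memtotal: total memory size
mem.memused: amount of memory in use
mem.memused.percent: percentage of memory in use
mem.memfree
mem.memfree.percent
mem.swaptotal: total swap size
mem.swapused: amount of swap in use
mem.swapused.percent: percentage of swap in use
mem.swapfree
mem.swapfree.percent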
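
The two formulas above translate directly into code. The Python sketch below applies them to /proc/meminfo content; reporting the results in bytes (meminfo itself uses kB) is an assumption, and the sample values are made up.

```python
def parse_meminfo(text):
    """Return {key: value} from /proc/meminfo content (most values are in kB)."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields:
            info[key.strip()] = int(fields[0])
    return info


def mem_metrics(info):
    """Apply memfree = free+buffers+cached and memused = memtotal-memfree."""
    kb = 1024  # assumption: report bytes rather than the kB used by meminfo
    total = info["MemTotal"] * kb
    free = (info["MemFree"] + info["Buffers"] + info["Cached"]) * kb
    used = total - free
    swap_total = info["SwapTotal"] * kb
    swap_free = info["SwapFree"] * kb
    swap_used = swap_total - swap_free

    def pct(part, whole):
        return 100.0 * part / whole if whole else 0.0

    return {
        "mem.memtotal": total,
        "mem.memused": used,
        "mem.memused.percent": pct(used, total),
        "mem.memfree": free,
        "mem.memfree.percent": pct(free, total),
        "mem.swaptotal": swap_total,
        "mem.swapused": swap_used,
        "mem.swapused.percent": pct(swap_used, swap_total),
        "mem.swapfree": swap_free,
        "mem.swapfree.percent": pct(swap_free, swap_total),
    }


SAMPLE = """MemTotal:        8000000 kB
MemFree:         1000000 kB
Buffers:          500000 kB
Cached:          2500000 kB
SwapTotal:       2000000 kB
SwapFree:        1500000 kB
"""

mem = mem_metrics(parse_meminfo(SAMPLE))
```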

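10. Network Metrics

How it is computed: read the contents of /proc/net/dev. Every metric carries a set of tags of the form iface=$iface identifying the specific interface, for example eth0. Metrics with in in their name describe inbound traffic, out describes outbound traffic, and total is the sum in+out. The supported metrics are: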

net.if.in.bytes
net.if.in.compressed
net.if.in.dropped
net.if.in.errors
net.if.in.fifo.errs
net.if.in.frame.errs
net.if.in.multicast
net.if.in.packets
net.if.out.bytes
net.if.out.carrier.errs
net.if.out.collisions
net.if.out.compressed
net.if.out.dropped
net.if.out.errors
net.if.out.fifo.errs
net.if.out.packets
net.if.total.bytes
net.if.total.dropped
net.if.total.errors
net.if.total.packets
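
The mapping from /proc/net/dev columns to these metric names can be sketched in Python as follows. The column order is the standard one (eight receive counters followed by eight transmit counters); the function name and the sample data are illustrative, not taken from the agent.

```python
# Column order in /proc/net/dev: 8 receive counters, then 8 transmit counters.
RX = ("bytes", "packets", "errors", "dropped",
      "fifo.errs", "frame.errs", "compressed", "multicast")
TX = ("bytes", "packets", "errors", "dropped",
      "fifo.errs", "collisions", "carrier.errs", "compressed")
TOTALS = ("bytes", "packets", "errors", "dropped")


def parse_net_dev(text):
    """Return {iface: {metric: value}} from /proc/net/dev content."""
    result = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        values = rest.split()
        if not sep or len(values) < 16:
            continue  # header lines contain no "iface:" prefix
        nums = [int(v) for v in values[:16]]
        metrics = {}
        for key, value in zip(RX, nums[:8]):
            metrics["net.if.in." + key] = value
        for key, value in zip(TX, nums[8:]):
            metrics["net.if.out." + key] = value
        for key in TOTALS:
            metrics["net.if.total." + key] = (
                metrics["net.if.in." + key] + metrics["net.if.out." + key])
        result[name.strip()] = metrics
    return result


SAMPLE = """Inter-|   Receive                                                |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
  eth0:    1000      10    1    2    0     0          0         3     2000      20    0    1    0     0       0          0
    lo:     500       5    0    0    0     0          0         0      500       5    0    0    0     0       0          0
"""

ifaces = parse_net_dev(SAMPLE)
```

Each inner dictionary would be reported with the tag iface set to its key, for example iface=eth0.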

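11. Port Metrics

How it is computed: ss -ln is used to determine whether the specified port is in the listen state. The metric is raw value type: 1 means the port is listening and 0 means it is not. Every metric carries a set of tags of the form port=$port, where $port is the specific port.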

net.port.listen
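
A rough Python sketch of this check is shown below. It scans ss output for rows in the LISTEN state and extracts the port from the local address column. The column layout of ss varies between iproute2 versions, so locating the address relative to the LISTEN token is an assumption, as are the function names and the sample output.

```python
def listening_ports(ss_output):
    """Return the set of ports that appear in LISTEN state in ss output."""
    ports = set()
    for line in ss_output.splitlines():
        parts = line.split()
        if "LISTEN" not in parts:
            continue
        idx = parts.index("LISTEN")
        # Assumed layout: State Recv-Q Send-Q Local-Address:Port ...
        if len(parts) <= idx + 3:
            continue
        local = parts[idx + 3]
        host, sep, port = local.rpartition(":")
        if sep and port.isdigit():
            ports.add(int(port))
    return ports


def net_port_listen(ss_output, port):
    """Return 1 if the port is listening, else 0."""
    return 1 if port in listening_ports(ss_output) else 0


# Made-up sample in the style of `ss -ln` output.
SAMPLE = """Netid State  Recv-Q Send-Q Local Address:Port  Peer Address:Port
tcp   LISTEN 0      128    0.0.0.0:22          0.0.0.0:*
tcp   LISTEN 0      128    [::]:8080           [::]:*
udp   UNCONN 0      0      127.0.0.1:323       0.0.0.0:*
u_str LISTEN 0      128    /run/example.sock 12345 * 0
"""
```

On a live host the input could come from subprocess.run(["ss", "-ln"], capture_output=True, text=True).stdout.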

12. Machine Kernel Configuration

kernel.maxfiles: read from /proc/sys/fs/file-max
kernel.files.allocated: the first field of /proc/sys/fs/file-nr
kernel.files.left: value = kernel.maxfiles - kernel.files.allocated
kernel.maxproc: read from /proc/sys/kernel/pid_max
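
These four values can be derived from the three files in a few lines. The Python sketch below takes the file contents as strings so that it can be tried without a Linux host; the function name and sample values are illustrative.

```python
def kernel_metrics(file_max_text, file_nr_text, pid_max_text):
    """Build the kernel.* metrics from the contents of the three /proc files."""
    maxfiles = int(file_max_text.split()[0])
    # file-nr holds three fields; the first is the number of allocated handles.
    allocated = int(file_nr_text.split()[0])
    return {
        "kernel.maxfiles": maxfiles,
        "kernel.files.allocated": allocated,
        "kernel.files.left": maxfiles - allocated,
        "kernel.maxproc": int(pid_max_text.split()[0]),
    }


# Made-up contents of file-max, file-nr and pid_max.
kernel = kernel_metrics("811184\n", "1632\t0\t811184\n", "32768\n")
```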

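13. NTP Metrics

ntpq -pn is used to obtain the offset of the local clock relative to the NTP server.

sys.ntp.offset: local clock offset in ms; a value that is too large, or exactly 0, indicates an anomaly and should raise an alert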
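
To illustrate, the Python sketch below extracts the offset from ntpq -pn output and applies the alert rule described above. Choosing the peer marked with * (the currently selected peer) is an assumption about which row the agent uses, and the limit_ms threshold is a hypothetical parameter that the source does not specify.

```python
def parse_ntp_offset(ntpq_output):
    """Return the offset (ms) of the selected peer, or None if there is none."""
    for line in ntpq_output.splitlines():
        if not line.startswith("*"):
            continue  # assumption: only the currently selected peer counts
        parts = line.split()
        # Columns: remote refid st t when poll reach delay offset jitter
        if len(parts) >= 10:
            return float(parts[8])
    return None


def ntp_needs_alert(offset_ms, limit_ms):
    """Alert when there is no offset, it is exactly 0, or it exceeds the limit."""
    return offset_ms is None or offset_ms == 0 or abs(offset_ms) > limit_ms


# Made-up sample in the style of `ntpq -pn` output.
SAMPLE = """     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*192.168.1.1     .GPS.            1 u   33   64  377    0.345   -0.123   0.045
+192.168.1.2     192.168.1.1      2 u   12   64  377    0.512    0.250   0.060
"""

offset = parse_ntp_offset(SAMPLE)
```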

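14. Process Monitoring

proc.num: the number of instances of a given process. There are two cases. One matches on the process name, for example name=sshd. The other matches on cmdline: Java applications, for example, may all have the process name java, so the first approach cannot tell them apart; in that case configure cmdline, for example cmdline=./falcon_agent-c./cfg.ini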
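
The two matching modes can be sketched in Python as below. This is an illustration only: exact matching on the name, substring matching on the command line, and joining the command-line arguments with spaces are all assumptions, and the function names are hypothetical.

```python
import os


def read_procs(proc_root="/proc"):
    """Return [{'name': ..., 'cmdline': ...}, ...] for running processes."""
    procs = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "comm")) as f:
                name = f.read().strip()
            with open(os.path.join(proc_root, entry, "cmdline"), "rb") as f:
                raw = f.read()
        except OSError:
            continue  # the process exited while it was being read
        # Arguments in /proc/<pid>/cmdline are separated by NUL bytes.
        cmdline = raw.replace(b"\0", b" ").decode(errors="replace").strip()
        procs.append({"name": name, "cmdline": cmdline})
    return procs


def proc_num(procs, name=None, cmdline=None):
    """Count processes matching an exact name and/or a cmdline substring."""
    count = 0
    for proc in procs:
        if name is not None and proc["name"] != name:
            continue
        if cmdline is not None and cmdline not in proc["cmdline"]:
            continue
        count += 1
    return count


# Made-up process table for illustration.
SAMPLE_PROCS = [
    {"name": "sshd", "cmdline": "/usr/sbin/sshd -D"},
    {"name": "java", "cmdline": "java -jar a.jar"},
    {"name": "java", "cmdline": "java -jar b.jar"},
]
```

With the sample table, matching on name=java cannot separate the two applications, while matching on a cmdline fragment such as b.jar selects exactly one of them.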