Address of this article: http://tiewei.github.io/devops/howto-use-cgroup/html
While introducing Docker, I mentioned that LXC uses cgroups to provide resource quotas and control. This article covers how to use cgroups and the commands for operating them. Most of the content comes from:
[1] https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Resource_Management_Guide/ch01.html
[2] https://www.kernel.org/doc/Documentation/cgroups/cgroups.txt
cgroup
cgroups partition the resources of a machine (CPU, memory, network) into slices, so that processes cannot starve one another through resource contention.
Terminology
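- cgroup - associates a set of tasks with a set of subsystem configuration parameters. A task corresponds to a process, and the cgroup is the smallest unit of resource partitioning.
- subsystem - a resource controller; each subsystem manages one kind of resource, such as cpu, cpuset, or memory.
- hierarchy - associates one or more subsystems with a tree of cgroups. Unlike a cgroup, a hierarchy holds the set of manageable subsystems rather than concrete parameter values.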
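As this shows, cgroups manage resources in a tree structure, much like processes.
Similarity - both are layered: a child process/cgroup inherits from its parent process/cgroup.
Difference - processes form a single-rooted tree (rooted at pid=0), whereas cgroups taken as a whole form a forest of several trees (each rooted at a hierarchy).
The mount directory of a typical hierarchy looks like this: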
/cgroup/
├── blkio <--------------- hierarchy/root cgroup
│ ├── blkio.io_merged <--------------- subsystem parameter
... ...
│ ├── blkio.weight
│ ├── blkio.weight_device
│ ├── cgroup.event_control
│ ├── cgroup.procs
│ ├── lxc <--------------- cgroup
│ │ ├── blkio.io_merged <--------------- subsystem parameter
│ │ ├── blkio.io_queued
... ... ...
│ │ └── tasks <--------------- task list
│ ├── notify_on_release
│ ├── release_agent
│ └── tasks
...
Subsystem list
RHEL/CentOS supports the following subsystems:
- blkio - block I/O quotas >> this subsystem sets limits on input/output access to and from block devices such as physical drives (disk, solid state, USB, etc.).
- cpu - CPU time allocation limits >> this subsystem uses the scheduler to provide cgroup tasks access to the CPU.
- cpuacct - CPU usage reporting >> this subsystem generates automatic reports on CPU resources used by tasks in a cgroup.
- cpuset - CPU binding >> this subsystem assigns individual CPUs (on a multicore system) and memory nodes to tasks in a cgroup.
- devices - device permission limits >> this subsystem allows or denies access to devices by tasks in a cgroup.
- freezer - suspending/resuming a cgroup >> this subsystem suspends or resumes tasks in a cgroup.
- memory - memory limits >> this subsystem sets limits on memory use by tasks in a cgroup, and generates automatic reports on memory resources used by those tasks.
- net_cls - network limiting together with tc >> this subsystem tags network packets with a class identifier (classid) that allows the Linux traffic controller (tc) to identify packets originating from a particular cgroup task.
- net_prio - network interface priority >> this subsystem provides a way to dynamically set the priority of network traffic per network interface.
- ns - resource namespace restriction >> the namespace subsystem.
cgroup rules and operations
Rules
1. A hierarchy can have several subsystems (several subsystems can be attached to a hierarchy at mount time)
A single hierarchy can have one or more subsystems attached to it.
eg.
mount -t cgroup -o cpu,cpuset,memory cpu_and_mem /cgroup/cpu_and_mem
2. A subsystem that is already mounted can only be mounted again on an empty hierarchy (a hierarchy that already has a subsystem attached cannot take a subsystem that is attached to another hierarchy)
Any single subsystem (such as cpu) cannot be attached to more than one hierarchy if one of those hierarchies has a different subsystem attached to it already.
3. Each task can be in exactly one cgroup within a given hierarchy (the same pid cannot appear in the tasks file of more than one cgroup under the same hierarchy)
Each time a new hierarchy is created on the systems, all tasks on the system are initially members of the default cgroup of that hierarchy, which is known as the root cgroup. For any single hierarchy you create, each task on the system can be a member of exactly one cgroup in that hierarchy. A single task may be in multiple cgroups, as long as each of those cgroups is in a different hierarchy. As soon as a task becomes a member of a second cgroup in the same hierarchy, it is removed from the first cgroup in that hierarchy. At no time is a task ever in two different cgroups in the same hierarchy.
4. A child process automatically inherits the cgroup of its parent when it is forked, but can be moved to another cgroup afterwards as needed
Any process (task) on the system which forks itself creates a child task. A child task automatically inherits the cgroup membership of its parent but can be moved to different cgroups as needed. Once forked, the parent and child processes are completely independent.
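5. Other notes
- The only way to restrict a task is to add it to the tasks file of a cgroup
- Several subsystems can be mounted on one hierarchy; different tasks are then limited through the subsystem parameters of different cgroups
- If a hierarchy has too many subsystems, consider restructuring: move subsystems to their own hierarchies; conversely, several hierarchies can be merged into one
- Because a hierarchy may carry only a few subsystems, a task can be limited in just one respect; at the same time, a task can be added to several hierarchies, so that several resources are controlled at once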
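Operations
1. Mounting subsystems
- Using the cgconfig service and its configuration file /etc/cgconfig.conf (mounted automatically when the service starts)
subsystem = /cgroup/hierarchy;
- From the command line
mount -t cgroup -o subsystems name /cgroup/name
To unmount:
umount /cgroup/name
eg. Mount the four subsystems cpuset, cpu, cpuacct, and memory on the directory /cgroup/cpu_and_mem (the hierarchy)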
mount {
cpuset = /cgroup/cpu_and_mem;
cpu = /cgroup/cpu_and_mem;
cpuacct = /cgroup/cpu_and_mem;
memory = /cgroup/cpu_and_mem;
}
or
mount -t cgroup -o remount,cpu,cpuset,cpuacct,memory cpu_and_mem /cgroup/cpu_and_mem
2. Creating/deleting cgroups
3. Permission management
eg.
group daemons {
cpuset {
cpuset.mems = 0;
cpuset.cpus = 0;
}
}
group daemons/sql {
perm {
task {
uid = root;
gid = sqladmin;
} admin {
uid = root;
gid = root;
}
}
cpuset {
cpuset.mems = 0;
cpuset.cpus = 0;
}
}
or
~]$ mkdir -p /cgroup/red/daemons/sql
~]$ chown root:root /cgroup/red/daemons/sql/*
~]$ chown root:sqladmin /cgroup/red/daemons/sql/tasks
~]$ echo 0 > /cgroup/red/daemons/cpuset.mems
~]$ echo 0 > /cgroup/red/daemons/cpuset.cpus
~]$ echo 0 > /cgroup/red/daemons/sql/cpuset.mems
~]$ echo 0 > /cgroup/red/daemons/sql/cpuset.cpus
4. Setting cgroup parameters
- Command line, form 1
cgset -r parameter=value path_to_cgroup
- Command line, form 2
cgset --copy-from path_to_source_cgroup path_to_target_cgroup
- Through the file system
echo value > path_to_cgroup/parameter
eg.
cgset -r cpuset.cpus=0-1 group1
cgset --copy-from group1/ group2/
echo 0-1 > /cgroup/cpuset/group1/cpuset.cpus
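The file-based form can be wrapped in a small helper that refuses to create parameter files that do not exist. This is only a sketch: the function name cg_write_param and the CGROUP_ROOT variable are made up for illustration and are not part of libcgroup. The demo runs against a scratch directory, so it can be tried without root or a mounted hierarchy.

```shell
#!/bin/sh
# Hypothetical helper: write VALUE into <root>/<group>/<parameter>,
# the same effect as `echo value > path_to_cgroup/parameter`.
# CGROUP_ROOT is an assumed variable; it defaults to the mount point
# used in the example above.
cg_write_param() {
    group=$1 param=$2 value=$3
    file="${CGROUP_ROOT:-/cgroup/cpuset}/$group/$param"
    if [ ! -e "$file" ]; then
        # A typo in the parameter name should fail instead of silently
        # creating a stray file.
        echo "no such parameter file: $file" >&2
        return 1
    fi
    printf '%s\n' "$value" > "$file"
}

# Dry run against a scratch directory instead of a real hierarchy.
CGROUP_ROOT=$(mktemp -d)
mkdir -p "$CGROUP_ROOT/group1"
: > "$CGROUP_ROOT/group1/cpuset.cpus"
cg_write_param group1 cpuset.cpus 0-1
cat "$CGROUP_ROOT/group1/cpuset.cpus"
```

On a real system the kernel creates the parameter files when the cgroup directory is made, so only the cg_write_param call itself would be needed.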
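5. Adding tasks
- Add processes from the command line
cgclassify -g subsystems:path_to_cgroup pidlist
- Add a process through the tasks file
echo pid > path_to_cgroup/tasks
- Start a process in a cgroup
cgexec -g subsystems:path_to_cgroup command arguments
- Start a service in a cgroup
echo 'CGROUP_DAEMON="subsystem:control_group"' >> /etc/sysconfig/<service>
- Let the cgrulesengd service classify processes, using rules in the configuration file /etc/cgrules.conf:
user<:command> subsystems control_group
where:
+ all processes of user are placed in control_group for the listed subsystems
+ <:command> is optional and restricts the rule to a specific command
+ user may be written as @group to match a user group instead of a single user
+ * matches everything
+ % means the same as the corresponding field on the previous line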
eg.
cgclassify -g cpu,memory:group1 1701 1138
echo -e "1701\n1138" |tee -a /cgroup/cpu/group1/tasks /cgroup/memory/group1/tasks
cgexec -g cpu:group1 lynx http://www.redhat.com
sh -c "echo \$$ > /cgroup/lab1/group1/tasks && lynx http://www.redhat.com"
Restricting specific services through /etc/cgrules.conf
maria devices /usergroup/staff
maria:ftp devices /usergroup/staff/ftp
@student cpu,memory /usergroup/student/
% memory /test2/
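To make the % shorthand in the rules above easier to read, the sketch below prints each rule with % replaced by the value from the previous line. It assumes, following the description given earlier, that % simply repeats the same field of the preceding rule; the function name expand_cgrules is hypothetical, and cgrulesengd itself may apply further matching logic that is not modelled here.

```shell
#!/bin/sh
# Print cgrules.conf style rules with every "%" field replaced by the
# same field of the previous rule. Blank lines and comment lines are
# skipped. expand_cgrules is a made-up name for illustration.
expand_cgrules() {
    awk '
        NF == 0 || $1 ~ /^#/ { next }
        {
            for (i = 1; i <= 3; i++) {
                if ($i == "%") $i = prev[i]
                prev[i] = $i
            }
            print $1, $2, $3
        }' "$1"
}

# Two of the rules from the example above, written to a scratch file.
rules=$(mktemp)
cat > "$rules" <<'EOF'
@student cpu,memory /usergroup/student/
% memory /test2/
EOF
expand_cgrules "$rules"
```

With the sample file, the second rule is printed as applying to @student, which is how the last two lines of the example above are meant to be read.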
6. Miscellaneous
- cgsnapshot generates the contents of /etc/cgconfig.conf from the current cgroup setup
cgsnapshot [-s] [-b FILE] [-w FILE] [-f FILE] [controller]
-b, --blacklist=FILE Set the blacklist configuration file (default /etc/cgsnapshot_blacklist.conf)
-f, --file=FILE Redirect the output to output_file
-s, --silent Ignore all warnings
-t, --strict Don't show the variables which are not on the whitelist
-w, --whitelist=FILE Set the whitelist configuration file (don't used by default)
- Show which cgroup a process is in
ps -O cgroup
or
cat /proc/<PID>/cgroup
- Show where subsystems are mounted
cat /proc/cgroups
lssubsys -m <subsystems>
- List cgroups
lscgroup
- Show cgroup parameter values
cgget -r parameter list_of_cgroups
cgget -g <controllers>:<path>
- cgclear deletes a hierarchy and all of its cgroups
- Event notification API - currently only supports memory.oom_control
- More
- man 1 cgclassify — the cgclassify command is used to move running tasks to one or more cgroups.
- man 1 cgclear — the cgclear command is used to delete all cgroups in a hierarchy.
- man 5 cgconfig.conf — cgroups are defined in the cgconfig.conf file.
- man 8 cgconfigparser — the cgconfigparser command parses the cgconfig.conf file and mounts hierarchies.
- man 1 cgcreate — the cgcreate command creates new cgroups in hierarchies.
- man 1 cgdelete — the cgdelete command removes specified cgroups.
- man 1 cgexec — the cgexec command runs tasks in specified cgroups.
- man 1 cgget — the cgget command displays cgroup parameters.
- man 1 cgsnapshot — the cgsnapshot command generates a configuration file from existing subsystems.
- man 5 cgred.conf — cgred.conf is the configuration file for the cgred service.
- man 5 cgrules.conf — cgrules.conf contains the rules used for determining when tasks belong to certain cgroups.
- man 8 cgrulesengd — the cgrulesengd service distributes tasks to cgroups.
- man 1 cgset — the cgset command sets parameters for a cgroup.
- man 1 lscgroup — the lscgroup command lists the cgroups in a hierarchy.
- man 1 lssubsys — the lssubsys command lists the hierarchies containing the specified subsystems.
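The output of cat /proc/<PID>/cgroup mentioned above has one line per hierarchy, in the form ID:subsystems:path. The sketch below extracts the path for a single subsystem from such a file; the function name cgroup_path_for and the sample content are made up for illustration.

```shell
#!/bin/sh
# Print the cgroup path that a given subsystem reports in a
# /proc/<PID>/cgroup style file (lines look like "ID:subsystems:path").
# cgroup_path_for is a hypothetical helper name.
cgroup_path_for() {
    subsystem=$1 file=$2
    awk -F: -v want="$subsystem" '
        {
            n = split($2, subs, ",")
            for (i = 1; i <= n; i++) {
                if (subs[i] == want) {
                    # Strip the first two fields so that a path containing
                    # ":" is still printed in full.
                    path = $0
                    sub(/^[^:]*:[^:]*:/, "", path)
                    print path
                    exit
                }
            }
        }' "$file"
}

# Sample content (invented for this demo) instead of a live PID.
sample=$(mktemp)
cat > "$sample" <<'EOF'
4:memory:/lxc/web
3:cpu,cpuacct:/lxc/web
2:cpuset:/
EOF
cgroup_path_for cpuacct "$sample"
cgroup_path_for cpuset "$sample"
```

On a live system the same function could be pointed at /proc/$$/cgroup to inspect the current shell.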
Subsystem configuration
1. blkio - block I/O quotas
- common
- blkio.reset_stats - resets the statistics; write an integer to this file
- blkio.time - time the cgroup had I/O access to each device - device_types:node_numbers milliseconds
- blkio.sectors - number of sectors the cgroup transferred to or from each device - device_types:node_numbers sector_count
- blkio.avg_queue_size - average I/O queue size (requires CONFIG_DEBUG_BLK_CGROUP=y)
- blkio.group_wait_time - total time the cgroup spent waiting (requires CONFIG_DEBUG_BLK_CGROUP=y, in ns)
- blkio.empty_time - total time the cgroup spent without pending I/O requests (requires CONFIG_DEBUG_BLK_CGROUP=y, in ns)
- blkio.idle_time - reports the total time (in nanoseconds, ns) the scheduler spent idling for a cgroup in anticipation of a better request than those requests already in other queues or from other groups.
- blkio.dequeue - number of times the cgroup's I/O requests were dequeued by a device (requires CONFIG_DEBUG_BLK_CGROUP=y) - device_types:node_numbers number
- blkio.io_serviced - number of I/O operations (read, write, sync, or async) the cgroup performed on each device, as counted by the CFQ scheduler - device_types:node_numbers operation number
- blkio.io_service_bytes - number of bytes the cgroup transferred per operation type (read, write, sync, or async) on each device, as counted by the CFQ scheduler - device_types:node_numbers operation bytes
- blkio.io_service_time - time spent servicing the cgroup's I/O operations (read, write, sync, or async) on each device, as counted by the CFQ scheduler (in ns) - device_types:node_numbers operation time
- blkio.io_wait_time - time the cgroup's operations (read, write, sync, or async) spent waiting on each device (in ns) - device_types:node_numbers operation time
- blkio.io_merged - number of BIOS requests merged into I/O requests for the cgroup, per operation type (read, write, sync, or async) - number operation
- blkio.io_queued - number of I/O requests queued for the cgroup, per operation type (read, write, sync, or async) - number operation
- Proportional weight division policy - divides block I/O in proportion to weights
- blkio.weight - relative weight from 100 to 1000; overridden by a device-specific weight in blkio.weight_device
- blkio.weight_device - weight for a specific device - device_types:node_numbers weight
- I/O throttling (upper limit) policy - caps I/O operations
- Upper limit on bytes read/written per second
blkio.throttle.read_bps_device - device_types:node_numbers bytes_per_second
blkio.throttle.write_bps_device - device_types:node_numbers bytes_per_second
- Upper limit on read/write operations per second
blkio.throttle.read_iops_device - device_types:node_numbers operations_per_second
blkio.throttle.write_iops_device - device_types:node_numbers operations_per_second
- Per-operation (read, write, sync, or async) statistics as seen by the throttling policy
blkio.throttle.io_serviced - device_types:node_numbers operation number
blkio.throttle.io_service_bytes - device_types:node_numbers operation bytes
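The *_bps_device files take a device number pair followed by a plain byte count. As a rough sketch, the helpers below build such an entry from a human-readable rate; the names to_bytes and throttle_entry are invented here, and the k/m/g suffixes are interpreted as binary multiples by this script, not by the kernel.

```shell
#!/bin/sh
# Convert a size such as 512k, 10m or 1g (binary multiples) to bytes.
# Plain numbers are passed through unchanged.
to_bytes() {
    case $1 in
        *[kK]) echo $(( ${1%?} * 1024 )) ;;
        *[mM]) echo $(( ${1%?} * 1024 * 1024 )) ;;
        *[gG]) echo $(( ${1%?} * 1024 * 1024 * 1024 )) ;;
        *)     echo "$1" ;;
    esac
}

# Build a "major:minor bytes_per_second" entry for
# blkio.throttle.read_bps_device or blkio.throttle.write_bps_device.
throttle_entry() {
    major=$1 minor=$2 rate=$3
    printf '%s:%s %s\n' "$major" "$minor" "$(to_bytes "$rate")"
}

# Example: limit device 8:0 to 10 MiB per second.
throttle_entry 8 0 10m
```

The resulting line would then be written to the throttle file of the target cgroup, for example with throttle_entry 8 0 10m > /cgroup/blkio/lxc/blkio.throttle.read_bps_device. The device numbers 8:0 are only an example; the real pair for a disk can be read from ls -l on its device node.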
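2. cpu - CPU time quotas
- CFS (Completely Fair Scheduler) policy - upper limit on CPU resources
- cpu.cfs_period_us, cpu.cfs_quota_us - required - used together: the former sets the length of a period (microseconds) and the latter the most CPU time the cgroup may use within one period (microseconds), which caps the tasks' usage relative to a single CPU (setting cfs_quota_us to twice cfs_period_us allows full use of two cores).
- cpu.stat - CPU statistics: nr_periods (number of cfs_period_us intervals elapsed), nr_throttled (number of times tasks in the cgroup were throttled), throttled_time (total time tasks in the cgroup were throttled, in nanoseconds)
- cpu.shares - optional - relative weight for CPU time sharing
- RT (Real-Time scheduler) policy - CPU time for real-time tasks
- cpu.rt_period_us, cpu.rt_runtime_us - used together: in every cpu.rt_period_us (microseconds), the real-time tasks in the cgroup may run for at most cpu.rt_runtime_us (microseconds)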
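The relation between cpu.cfs_period_us and cpu.cfs_quota_us is simple arithmetic, sketched below. The function name cfs_quota_us and the percent-of-one-core convention are choices made for this example; the default period of 100000 microseconds is an assumption that should be checked against cpu.cfs_period_us on the target system.

```shell
#!/bin/sh
# Compute cpu.cfs_quota_us for a CPU allowance given in percent of one
# core (100 = one full core, 200 = two cores) and a period in
# microseconds. The default period of 100000 us is an assumption.
cfs_quota_us() {
    percent=$1 period_us=${2:-100000}
    echo $(( percent * period_us / 100 ))
}

cfs_quota_us 50          # half a core with the assumed default period
cfs_quota_us 200 100000  # two full cores: quota is twice the period
```

The second call reproduces the two-core case described above, where the quota is twice the period.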
3. cpuacct - CPU usage reporting
- cpuacct.usage - total CPU time used by all tasks in the cgroup (nanoseconds)
- cpuacct.stat - CPU time used by all tasks in the cgroup, split into user mode and kernel mode
- cpuacct.usage_percpu - CPU time used by all tasks in the cgroup on each CPU
4. cpuset - CPU binding
- cpuset.cpus - required - CPUs the cgroup may use, e.g. 0-2,16 means the four CPUs 0, 1, 2, and 16
- cpuset.mems - required - memory nodes the cgroup may use
- cpuset.memory_migrate - optional - whether pages are migrated when cpuset.mems changes, default 0
- cpuset.cpu_exclusive - optional - whether the CPUs are exclusive to this cgroup, default 0
- cpuset.mem_exclusive - optional - whether the memory nodes are exclusive to this cgroup, default 0
- cpuset.mem_hardwall - optional - whether memory allocations of tasks in the cgroup are isolated, default 0
- cpuset.memory_pressure - optional - a read-only file that contains a running average of the memory pressure created by the processes in this cpuset
- cpuset.memory_pressure_enabled - optional - switch for cpuset.memory_pressure, default 0
- cpuset.memory_spread_page - optional - contains a flag (0 or 1) that specifies whether file system buffers should be spread evenly across the memory nodes allocated to this cpuset, default 0
- cpuset.memory_spread_slab - optional - contains a flag (0 or 1) that specifies whether kernel slab caches for file input/output operations should be spread evenly across the cpuset, default 0
- cpuset.sched_load_balance - optional - whether the cgroup's CPU load is balanced across the CPUs in the cpuset, default 1
- cpuset.sched_relax_domain_level - optional - the policy used by cpuset.sched_load_balance
- -1 = Use the system default value for load balancing
- 0 = Do not perform immediate load balancing; balance loads only periodically
- 1 = Immediately balance loads across threads on the same core
- 2 = Immediately balance loads across cores in the same package
- 3 = Immediately balance loads across CPUs on the same node or blade
- 4 = Immediately balance loads across several CPUs on architectures with non-uniform memory access (NUMA)
- 5 = Immediately balance loads across all CPUs on architectures with NUMA
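The list syntax accepted by cpuset.cpus mixes single CPUs and ranges. As an illustration, the sketch below expands such a list into individual CPU numbers; the function name expand_cpu_list is made up for this example and does no validation of its input.

```shell
#!/bin/sh
# Expand a cpuset list such as "0-2,16" into individual CPU numbers
# separated by spaces. No validation is done on the input.
expand_cpu_list() {
    out=
    # Split the list on commas, then treat each piece as either a
    # single CPU number or a first-last range.
    for part in $(printf '%s' "$1" | tr ',' ' '); do
        case $part in
            *-*)
                i=${part%-*}
                last=${part#*-}
                while [ "$i" -le "$last" ]; do
                    out="$out $i"
                    i=$((i + 1))
                done
                ;;
            *)  out="$out $part" ;;
        esac
    done
    echo "${out# }"
}

expand_cpu_list 0-2,16
```

For the value 0-2,16 used above this yields the four CPUs 0, 1, 2, and 16.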
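5. devices - device restrictions for a cgroup
- Device blacklist/whitelist
- devices.allow - allow list
- devices.deny - deny list
- Syntax - type device_types:node_numbers access; type - b (block device), c (character device), a (all devices); access - r (read), w (write), m (create)
- devices.list - reports the current list
6. freezer - suspending/resuming a cgroup
- Not available in the root cgroup
- freezer.state - FROZEN (suspended), FREEZING (being suspended), THAWED (resumed)
7. memory - memory limits
- memory.usage_in_bytes - reports current memory usage in bytes
- memory.memsw.usage_in_bytes - reports memory plus swap currently used by processes in the cgroup
- memory.max_usage_in_bytes - reports the maximum memory used by the cgroup
- memory.memsw.max_usage_in_bytes - reports the maximum memory plus swap used
- memory.limit_in_bytes - maximum memory limit; accepts the suffixes k, m, g; -1 removes the limit
- memory.memsw.limit_in_bytes - maximum memory plus swap limit; accepts the suffixes k, m, g; -1 removes the limit
- memory.failcnt - reports the number of times the memory limit was reached
- memory.memsw.failcnt - reports the number of times the memory plus swap limit was reached
- memory.force_empty - when set to 0 while the cgroup has no tasks, frees the cgroup's memory pages
- memory.swappiness - swap tendency, baseline 60; values below 60 make swapping out less likely, values above 60 more likely
- memory.use_hierarchy - whether child groups are affected
- memory.oom_control - 0 means enabled: a process is killed when OOM occurs
- memory.stat - reports memory statistics for the cgroup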
- cache - page cache, including tmpfs (shmem), in bytes
- rss - anonymous and swap cache, not including tmpfs (shmem), in bytes
- mapped_file - size of memory-mapped files, including tmpfs (shmem), in bytes
- pgpgin - number of pages paged into memory
- pgpgout - number of pages paged out of memory
- swap - swap usage, in bytes
- active_anon - anonymous and swap cache on active least-recently-used (LRU) list, including tmpfs (shmem), in bytes
- inactive_anon - anonymous and swap cache on inactive LRU list, including tmpfs (shmem), in bytes
- active_file - file-backed memory on active LRU list, in bytes
- inactive_file - file-backed memory on inactive LRU list, in bytes
- unevictable - memory that cannot be reclaimed, in bytes
- hierarchical_memory_limit - memory limit for the hierarchy that contains the memory cgroup, in bytes
- hierarchical_memsw_limit - memory plus swap limit for the hierarchy that contains the memory cgroup, in bytes
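The usage and limit files listed above can be combined into a quick utilisation figure. The sketch below is one way to do that; the function name mem_usage_percent is invented here, the numbers in the demo are made up, and a scratch directory stands in for a real cgroup directory.

```shell
#!/bin/sh
# Report the memory usage of a cgroup as a whole-number percentage of
# its limit, reading memory.usage_in_bytes and memory.limit_in_bytes
# from the given directory.
mem_usage_percent() {
    dir=$1
    usage=$(cat "$dir/memory.usage_in_bytes")
    limit=$(cat "$dir/memory.limit_in_bytes")
    echo $(( usage * 100 / limit ))
}

# Scratch directory with made-up numbers in place of a real cgroup.
demo=$(mktemp -d)
echo 268435456  > "$demo/memory.usage_in_bytes"   # 256 MiB in use
echo 1073741824 > "$demo/memory.limit_in_bytes"   # 1 GiB limit
mem_usage_percent "$demo"
```

A cgroup without a limit would report a very small percentage here, so the figure is only meaningful once memory.limit_in_bytes has been set.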
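8. net_cls
- net_cls.classid - specifies the tc handle; network control is then done through tc
9. net_prio - network interface priority for tasks
- net_prio.prioidx - a read-only file which contains a unique integer value that the kernel uses as an internal representation of this cgroup.
- net_prio.ifpriomap - priority per network interface - <network_interface> <priority>
10. Other files
- tasks - pids of all processes in the cgroup
- cgroup.event_control - event API
- cgroup.procs - thread group ids
- release_agent (present in the root cgroup only) - script that is run when the cgroup has no tasks left, depending on notify_on_release
- notify_on_release - whether release_agent is run when the cgroup has no tasks left
Summary
- This article summarized how to operate cgroups and listed the configurable parameters in detail, as a basis for better control over resource allocation on a system
- Of the two scenarios for limiting resources, tuning for a particular application allows very fine-grained adjustment, whereas general-purpose resource isolation is more likely to care about the main CPU and memory attributes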