PostgreSQL/Greenplum Kernel Parameter Configuration Manual

Date: 2019-11-11

memory overcommit

vm.overcommit_memory = 2
vm.overcommit_ratio = 95 # See Note 2: https://gpdb.docs.pivotal.io/6-0/install_guide/prep_os.html#topic4__sysctl_conf

Greenplum notes

When vm.overcommit_memory is 2, you specify a value for vm.overcommit_ratio. For information about calculating the value for vm.overcommit_ratio when using resource queue-based resource management, see the Greenplum Database server configuration parameter gp_vmem_protect_limit in the Greenplum Database Reference Guide. If you are using resource group-based resource management, tune the operating system vm.overcommit_ratio as necessary. If your memory utilization is too low, increase the vm.overcommit_ratio value; if your memory or swap usage is too high, decrease the value.

Linux kernel explanation
http://linuxperf.com/?p=102
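Memory overcommit means that the operating system promises processes more memory than is actually available. A conservative operating system does not allow memory overcommit: it hands out only what it has, and once that is gone further requests fail. This wastes memory, because a process usually uses less than it requests. For example, a process may malloc() 200 MB but actually use only 100 MB. In UNIX/Linux, physical pages are allocated at the moment they are used, not at the moment they are requested, so the unused 100 MB is never allocated and simply sits idle. The following point is the key to understanding memory overcommit: commit (or overcommit) applies to memory requests; a memory request is not a memory allocation, and memory is allocated only when it is actually used.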

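Linux allows memory overcommit: it grants whatever is requested and hopes that processes will not actually use that much. But what if they do? Then something like a bank run happens: there is not enough cash (memory). Linux provides the OOM killer (OOM = out-of-memory) to handle this crisis: it picks a process and kills it to free some memory, and keeps killing if that is still not enough. Alternatively, the kernel parameter vm.panic_on_oom makes the kernel panic on OOM, which reboots the system automatically when kernel.panic is set. Both mechanisms are risky: a reboot can interrupt service, and so can killing a process. My own small website ran into this problem; see the earlier post. For this reason, Linux 2.6 and later allow memory overcommit to be disabled through the kernel parameter vm.overcommit_memory.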

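The kernel parameter vm.overcommit_memory accepts three values:

0 – Heuristic overcommit handling. This is the default. Overcommit is allowed, but blatant overcommit is refused, for example a single malloc that asks for more than the system's total memory. "Heuristic" means that the kernel uses an algorithm (explained in detail at the end of the original article) to guess whether a memory request is reasonable, and refuses the request if it decides it is not.
1 – Always overcommit. Overcommit is allowed and no memory request is refused.
2 – Don’t overcommit. Overcommit is disabled.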

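With overcommit disabled (vm.overcommit_memory=2), what exactly counts as overcommit? The kernel keeps a threshold, and when the total amount of requested memory exceeds it, that is overcommit. The threshold is visible in /proc/meminfo: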

# grep -i commit /proc/meminfo
CommitLimit:     5967744 kB
Committed_AS:    5363236 kB

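CommitLimit is the overcommit threshold: if the total amount of requested memory exceeds CommitLimit, that counts as overcommit.
How is this threshold calculated? It is neither the size of physical memory nor the size of free memory. It is set indirectly through the kernel parameter vm.overcommit_ratio or vm.overcommit_kbytes, using the following formula:
CommitLimit = (Physical RAM * vm.overcommit_ratio / 100) + Swap

Notes:
vm.overcommit_ratio is a kernel parameter whose default is 50, meaning 50% of physical memory. If you prefer not to use a ratio, you can specify an absolute size in kilobytes through another kernel parameter, vm.overcommit_kbytes.
If huge pages are in use, they must be subtracted from physical memory, and the formula becomes:
CommitLimit = ([total RAM] – [total huge TLB RAM]) * vm.overcommit_ratio / 100 + swap
See https://access.redhat.com/solutions/665023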

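Committed_AS in /proc/meminfo is the total amount of memory that all processes have requested (requested, not allocated). If Committed_AS exceeds CommitLimit, overcommit has occurred, and the larger the excess, the more severe the overcommit. Put another way, Committed_AS is the amount of physical memory that would be needed to guarantee that OOM (out of memory) never happens.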
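
The CommitLimit formula above is simple enough to try by hand. The following is a minimal sketch, not the kernel's own code: the function name commit_limit_kb and the sample sizes (64 GiB RAM, 8 GiB swap, no huge pages) are assumptions made for illustration, and the kernel does its own rounding in pages, so the result may differ slightly from the CommitLimit line in /proc/meminfo.

```shell
# commit_limit_kb RAM_KB HUGETLB_KB RATIO SWAP_KB
# Prints the CommitLimit in kB using the formula quoted above:
#   (RAM - huge page RAM) * vm.overcommit_ratio / 100 + swap
# The function name is hypothetical; integer division truncates.
commit_limit_kb() {
    echo $(( ($1 - $2) * $3 / 100 + $4 ))
}

# Made-up host: 64 GiB RAM, no huge pages, ratio 95, 8 GiB swap.
commit_limit_kb 67108864 0 95 8388608    # prints 72142028
```

On a real host the result can be compared with the CommitLimit value reported by grep -i commit /proc/meminfo.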

ip port

net.ipv4.ip_local_port_range = 10000 65535

Greenplum notes

To avoid port conflicts between Greenplum Database and other applications when initializing Greenplum Database, do not specify Greenplum Database ports in the range specified by the operating system parameter net.ipv4.ip_local_port_range. For example, if net.ipv4.ip_local_port_range = 10000 65535, you could set the Greenplum Database base port numbers to these values.
PORT_BASE = 6000
MIRROR_PORT_BASE = 7000
For information about the port ranges that are used by Greenplum Database, see gpinitsystem.

Linux kernel explanation
On Linux, there is a sysctl parameter called ip_local_port_range that defines the minimum and maximum port a networking connection can use as its source (local) port. This applies to both TCP and UDP connections.

cat /proc/sys/net/ipv4/ip_local_port_range
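
To make the rule above concrete, here is a small sketch that tests whether a candidate Greenplum base port falls inside the local port range. The function names port_in_range and check_gp_port are hypothetical helpers, not part of Greenplum or of any system tool, and the range is passed in explicitly so the example does not depend on the host it runs on.

```shell
# port_in_range PORT LOW HIGH
# Succeeds (exit status 0) when PORT lies inside [LOW, HIGH].
port_in_range() {
    [ "$1" -ge "$2" ] && [ "$1" -le "$3" ]
}

# check_gp_port PORT LOW HIGH
# Reports whether PORT could collide with ports the kernel hands out.
check_gp_port() {
    if port_in_range "$1" "$2" "$3"; then
        echo "port $1 conflicts with the local port range $2-$3"
    else
        echo "port $1 is outside the local port range $2-$3"
    fi
}

check_gp_port 6000 10000 65535     # PORT_BASE from the example above
check_gp_port 12000 10000 65535    # a port that would conflict
```

On a live system the two bounds could be read with read low high < /proc/sys/net/ipv4/ip_local_port_range and passed to check_gp_port.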

shared memory

# kernel.shmall = _PHYS_PAGES / 2 # See Note 1
kernel.shmall = 4000000000
# kernel.shmmax = kernel.shmall * PAGE_SIZE # See Note 1
kernel.shmmax = 500000000
kernel.shmmni = 4096

View the limits and the current usage:
ipcs -lm
ipcs -u

shmall: This parameter sets the total amount of shared memory pages that can be used system wide. Hence, SHMALL should always be at least ceil(shmmax/PAGE_SIZE).
Total number of pages that shared memory may use:
echo $(expr $(getconf _PHYS_PAGES) / 2)
shmmax: This parameter defines the maximum size in bytes of a single shared memory segment that a Linux process can allocate in its virtual address space.
Maximum size of a single shared memory segment, in bytes:
echo $(expr $(getconf _PHYS_PAGES) / 2 \* $(getconf PAGE_SIZE))
shmmni: This parameter sets the system wide maximum number of shared memory segments.
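
The requirement that SHMALL be at least ceil(shmmax/PAGE_SIZE) can be checked with integer arithmetic. This is only a sketch: min_shmall is a hypothetical helper name, and the 4096-byte page size in the example is an assumption (getconf PAGE_SIZE gives the real value on a given host).

```shell
# min_shmall SHMMAX_BYTES PAGE_SIZE_BYTES
# Prints ceil(SHMMAX / PAGE_SIZE): the smallest kernel.shmall (in pages)
# that can still hold one segment of kernel.shmmax bytes.
min_shmall() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

# shmmax value from the listing above, assumed 4096-byte pages.
min_shmall 500000000 4096    # prints 122071
```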

semaphores

cat /proc/sys/kernel/sem
500 2048000 200 40960

SEMMSL, SEMMNS, SEMOPM, SEMMNI

kernel.sem = 500 2048000 200 40960

SEMMSL
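Meaning: maximum number of semaphores per semaphore set. Setting: at least 250; on systems with a large processes parameter, processes + 10 is recommended.

SEMMNS
Meaning: maximum number of semaphores system wide. Setting: at least 32000; SEMMSL * SEMMNI.

SEMOPM
Meaning: maximum number of semaphore operations allowed per semop system call. Setting: at least 100, or equal to SEMMSL.

SEMMNI
Meaning: maximum number of semaphore sets system wide. Setting: at least 128.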
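
The SEMMNS guideline above (SEMMSL * SEMMNI) is easy to verify for any set of four values. The sketch below uses a hypothetical helper, sem_check, and feeds it the minimum values listed above; it only does the multiplication and comparison and makes no claim about what a particular workload needs.

```shell
# sem_check SEMMSL SEMMNS SEMOPM SEMMNI
# Compares SEMMNS with the product SEMMSL * SEMMNI.
# SEMOPM ($3) is accepted only so the four fields of kernel.sem
# can be passed in their usual order.
sem_check() {
    product=$(( $1 * $4 ))
    if [ "$2" -ge "$product" ]; then
        echo "SEMMNS=$2 covers SEMMSL*SEMMNI=$product"
    else
        echo "SEMMNS=$2 is below SEMMSL*SEMMNI=$product"
    fi
}

# The minimum values from the list above: 250 * 128 = 32000.
sem_check 250 32000 100 128
```

On a live host the current settings could be checked with sem_check $(cat /proc/sys/kernel/sem).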

message queue

kernel.msgmnb = 65536
kernel.msgmax = 65536
kernel.msgmni = 2048

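A message queue provides a way to send a block of data from one process to another. Message queues have kernel persistence.
Each block of data carries a type, and the receiving process can receive blocks with different type values.
Like pipes, message queues have limits: the maximum length of a single message (MSGMAX), the total number of bytes in one queue (MSGMNB), and the total number of queues on the system (MSGMNI).

cat /proc/sys/kernel/msgmax   # maximum length of a single message, e.g. 8192 = 8 KB
cat /proc/sys/kernel/msgmnb   # maximum total bytes in one queue, e.g. 16384 = 16 KB
cat /proc/sys/kernel/msgmni   # maximum number of message queues, e.g. 169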

file cache

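File caching is an important way to improve performance. Read caching is beneficial in the vast majority of cases (programs can read data straight from RAM), whereas write caching is more complicated. The Linux kernel splits a disk write into two steps: it writes to the cache first, then asynchronously flushes the cache to disk at intervals. This speeds up I/O but carries some risk: data that has not yet reached the disk can be lost.

The cache can also fill up, and a large amount of data may be written to disk in one burst, making the system stall. The stall happens because the kernel decides the cache is too large to be flushed asynchronously in time and switches to synchronous writes. (Asynchronous: the process keeps running while the data is written. Synchronous: the writing process is blocked until the write completes.)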

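# With these settings, background flushing starts once dirty data reaches 10%, and writers are not forced into synchronous writes until it reaches 20%.
# The two *_centisecs values are in hundredths of a second: as written, the flusher wakes every 0.03 s and dirty data expires after 0.1 s.
# For a 3-second wakeup and a 10-second expiry, the values would be 300 and 1000.
vm.dirty_expire_centisecs = 10
vm.dirty_writeback_centisecs = 3
vm.dirty_background_ratio = 10
vm.dirty_ratio = 20
vm.dirty_background_bytes = 0
vm.dirty_bytes = 0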

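vm.dirty_background_ratio is the percentage of memory that may be filled with dirty data before background writeback starts. Dirty data is written to disk later; background processes such as pdflush/flush/kdmflush clean it up. For example, with 32 GB of memory, up to 3.2 GB of dirty data can stay in memory; above 3.2 GB the background processes start cleaning it.

vm.dirty_ratio is the absolute limit on dirty data: the percentage of dirty data in memory cannot exceed this value. Once dirty data exceeds this amount, new I/O requests are blocked until dirty data has been written to disk. This is a major cause of I/O stalls, but it is also the safeguard that keeps excessive dirty data out of memory.

vm.dirty_expire_centisecs specifies how long dirty data may live, in hundredths of a second. The kernel default is 3000, i.e. 30 seconds. When pdflush/flush/kdmflush runs, it checks for data older than this limit and writes it to disk asynchronously, since data that stays in memory too long is at risk of being lost.

vm.dirty_writeback_centisecs specifies how often the pdflush/flush/kdmflush processes wake up, also in hundredths of a second.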
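
The units of these parameters are easy to misread, so a small conversion sketch may help. The helper names centisecs_to_secs and dirty_limit_kb are hypothetical. The second function simply takes a percentage of a memory size, as in the 32 GB example above; the kernel applies the ratios to available (free plus reclaimable) memory rather than total RAM, so real thresholds will be somewhat lower.

```shell
# centisecs_to_secs CENTISECS
# Prints the value in seconds (1 centisecond = 0.01 s).
centisecs_to_secs() {
    awk -v cs="$1" 'BEGIN { printf "%.2f\n", cs / 100 }'
}

# dirty_limit_kb MEM_KB RATIO
# Prints RATIO percent of MEM_KB, truncated to a whole kB.
dirty_limit_kb() {
    echo $(( $1 * $2 / 100 ))
}

centisecs_to_secs 3000        # kernel default expiry: prints 30.00
centisecs_to_secs 10          # prints 0.10
dirty_limit_kb 33554432 10    # 10% of 32 GiB: prints 3355443
```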

# 7 dirty pages are waiting to be flushed to disk
# cat /proc/vmstat | egrep "dirty|writeback"
nr_dirty 7
nr_writeback 0
nr_writeback_temp 0

swap

For performance reasons a database server should not rely on swap, so there is no need to configure swap space.

vm.swappiness = 0

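If there is plenty of memory, tell Linux not to use the swap partition much by changing the swappiness value. swappiness=0 means physical memory is used as fully as possible before swap is touched; swappiness=100 means swap is used aggressively and data in memory is moved to swap promptly.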

net

net.ipv4.tcp_syncookies = 1
net.ipv4.conf.default.accept_source_route = 0
net.ipv4.tcp_max_syn_backlog = 4096
net.ipv4.conf.all.arp_filter = 1
net.core.netdev_max_backlog = 10000
net.core.rmem_max = 2097152
net.core.wmem_max = 2097152

min free memory

Reserve 3% of memory as an emergency reserve for networking and the filesystem (vm.min_free_kbytes); do not exceed 5%.

awk 'BEGIN {OFMT = "%.0f";} /MemTotal/ {print $2 * .03;}' /proc/meminfo

ipc resource management

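Query each kind of IPC resource:

$ ipcs -m   # IPC shared memory segments in use
$ ipcs -q   # IPC message queues in use
$ ipcs -s   # IPC semaphores in use

Finding out who holds an IPC resource

Example: check whether the IPC key 51036 is in use.

First convert it to hexadecimal, for instance with a calculator:

51036 -> c75c
If you know it is held by shared memory:

$ ipcs -m | grep c75c
0x0000c75c 40403197   tdea3    666        536870912  2
If you are not sure, search all IPC resources:

$ ipcs | grep c75c
0x0000c75c 40403197   tdea3    666        536870912  2
0x0000c75c 5079070    tdea3    666        4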
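
The decimal-to-hex step above does not need a calculator; printf can do it. A minimal sketch follows, in which the helper names ipc_key_hex and ipc_key_padded are hypothetical. The padded form mirrors the 0x0000c75c layout seen in the ipcs output above.

```shell
# ipc_key_hex KEY
# Prints a decimal IPC key as lowercase hex without a prefix.
ipc_key_hex() {
    printf '%x\n' "$1"
}

# ipc_key_padded KEY
# Prints the key as 0x followed by eight hex digits.
ipc_key_padded() {
    printf '0x%08x\n' "$1"
}

ipc_key_hex 51036       # prints c75c
ipc_key_padded 51036    # prints 0x0000c75c
```

The result can be fed straight into the search, for example ipcs | grep "$(ipc_key_padded 51036)".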

Querying system IPC limits

ipcs -l

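Removing IPC resources

ipcrm -M shmkey   # remove the shared memory segment created with shmkey
ipcrm -m shmid    # remove the shared memory segment identified by shmid
ipcrm -Q msgkey   # remove the message queue created with msgkey
ipcrm -q msqid    # remove the message queue identified by msqid
ipcrm -S semkey   # remove the semaphore set created with semkey
ipcrm -s semid    # remove the semaphore set identified by semid

Remove all IPC resources created by the current user:

# Keep only rows whose owner column ($3) is the current user; this also skips the header lines.
ipcs -q | awk -v u="$(id -un)" '$3 == u { print "ipcrm -q " $2 }' | sh > /dev/null 2>&1
ipcs -m | awk -v u="$(id -un)" '$3 == u { print "ipcrm -m " $2 }' | sh > /dev/null 2>&1
ipcs -s | awk -v u="$(id -un)" '$3 == u { print "ipcrm -s " $2 }' | sh > /dev/null 2>&1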

link

https://gpdb.docs.pivotal.io/6-0/install_guide/prep_os.html#topic_sqj_lt1_nfb

ansible

---
- hosts: gp
  vars:
    version: "6.0.0"
    admin_user: "gp12345678"
    admin_password: "333"
    port_pre: "3001"           
  remote_user: root
  tasks:
  ##
  #! auth
  ##
  - name: add ssh authorized keys for root
    authorized_key:
      user: root
      state: present
      key: "{{ lookup('file', lookup('env','HOME') + '/.ssh/id_rsa.pub') }}"
  ##
  #! user
  ##
  - name: create admin user
    user:
      name: "{{ admin_user }}"
      password: "{{ admin_password | password_hash('sha512', 'iamsalt') }}"

  - name: add ssh authorized keys for admin
    authorized_key:
      user: "{{ admin_user }}"
      state: present
      key: "{{ lookup('file', lookup('env','HOME') + '/.ssh/id_rsa.pub') }}"
  ##    
  #! sysctl
  ##
  - name: backing up sysctl
    copy:
      src: /etc/sysctl.conf
      remote_src: yes
      dest: /tmp/sysctl.conf.bak
      backup: yes
  - name: get shmall 
    shell: echo $(expr $(getconf _PHYS_PAGES) / 2) 
    register: shmall
  - name: get shmmax
    shell: echo $(expr $(getconf _PHYS_PAGES) / 2 \* $(getconf PAGE_SIZE))
    register: shmmax
  - name: get min_free_kbytes
    shell: awk 'BEGIN {OFMT = "%.0f";} /MemTotal/ {print $2 * .03;}' /proc/meminfo
    register: min_free_kbytes
  - name: set shmall
    sysctl:
      name: kernel.shmall
      value: "{{ shmall.stdout }}"
      reload: yes
  - name: set shmmax
    sysctl:
      name: kernel.shmmax
      value: "{{ shmmax.stdout }}"
      reload: yes
  - name: set min_free_kbytes
    sysctl:
      name: vm.min_free_kbytes
      value: "{{ min_free_kbytes.stdout }}"
      reload: yes
  - name: set other sysctl
    sysctl:
       name: "{{ item.key }}"
       value: "{{ item.value }}"
       sysctl_set: yes
       state: present
       reload: yes
       ignoreerrors: yes
    with_dict:
      kernel.shmmni: 4096
      vm.overcommit_memory: 2
      vm.overcommit_ratio: 95
      net.ipv4.ip_local_port_range: 10000 65535
      kernel.sem: 500 2048000 200 40960
      kernel.sysrq: 1
      kernel.core_uses_pid: 1
      kernel.msgmnb: 65536
      kernel.msgmax: 65536
      kernel.msgmni: 2048
      net.ipv4.tcp_syncookies: 1
      net.ipv4.conf.default.accept_source_route: 0
      net.ipv4.tcp_max_syn_backlog: 4096
      net.ipv4.conf.all.arp_filter: 1
      net.core.netdev_max_backlog: 10000
      net.core.rmem_max: 2097152
      net.core.wmem_max: 2097152
      vm.swappiness: 0
      vm.zone_reclaim_mode: 0
      vm.dirty_expire_centisecs: 10
      vm.dirty_writeback_centisecs: 3
      vm.dirty_background_ratio: 10
      vm.dirty_ratio: 20
      vm.dirty_background_bytes: 0
      vm.dirty_bytes: 0 
  ##    
  #! pam limit
  ##
  - name: state PAM limits
    pam_limits:
      domain: '*'
      limit_type: '-'
      limit_item: "{{ item.key }}"
      value: "{{ item.value }}"
    with_dict:
      nofile: 655360
      nproc: 655360
      memlock: unlimited
      core: unlimited

  ##    
  #! install src gp
  ##
  - name: copy package to host
    copy:
      src: "{{ package_path }}"
      dest: /tmp