Kernel Memory Fragmentation Management

Date: 2019-12-02


Huge Pages and Transparent Huge Pages

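Memory is managed in blocks known as pages. A page is 4096 bytes, so 1 MB of memory equals 256 pages and 1 GB equals 262,144 pages. The CPU has a built-in memory management unit (MMU) that holds a list of these pages, with each page referenced through a page table entry.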
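
The page counts above are plain division by the page size. A minimal sketch, assuming 4096-byte pages as stated in the text (the function name pages_for_bytes is made up for illustration):

```shell
# Number of pages needed to hold a given number of bytes, rounded up.
PAGE_SIZE=4096   # assumption: 4 KiB base pages, as stated in the text

pages_for_bytes() {
    # $1 = size in bytes
    echo $(( ($1 + PAGE_SIZE - 1) / PAGE_SIZE ))
}

pages_for_bytes $((1024 * 1024))          # 1 MiB, prints 256
pages_for_bytes $((1024 * 1024 * 1024))   # 1 GiB, prints 262144
```

On a real system, getconf PAGE_SIZE reports the actual page size.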

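There are two ways to let a system manage large amounts of memory:

Increase the number of page table entries in the hardware memory management unit
Increase the page size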

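The first method is expensive, because the hardware memory management unit of a modern processor supports only hundreds or thousands of page table entries. In addition, hardware and memory management algorithms that work well with thousands of pages (megabytes of memory) may not cope well with millions (or even billions) of pages. This leads to performance problems: when a program needs more pages than the memory management unit supports, the system falls back to slower, software-based memory management, which slows down the entire system.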

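Red Hat Enterprise Linux 6 takes the second approach by using huge pages.
Simply put, huge pages are blocks of memory that come in 2 MB and 1 GB sizes. The page tables used by 2 MB pages are suitable for managing multiple gigabytes of memory, whereas 1 GB pages are the best choice for terabytes of memory.
Huge pages must be assigned at boot time. They are also difficult to manage manually and often require code changes in order to be used effectively. For this reason Red Hat Enterprise Linux also implements transparent huge pages (THP). THP is an abstraction layer that automates most aspects of creating, managing, and using huge pages.
THP hides much of the complexity of using huge pages from system administrators and developers. Because the goal of THP is to improve performance, its developers (from both the community and Red Hat) have tested and optimized THP across a wide range of systems, configurations, applications, and workloads. This allows the default THP settings to improve performance on most system configurations.
Note: THP can currently only map anonymous memory regions, such as heap and stack space.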

Some official introductory material on Transparent Huge Pages:

Transparent Huge Pages (THP) are enabled by default in RHEL 6 for all applications. The kernel attempts to allocate hugepages whenever possible and any Linux process will receive 2MB pages if the mmap region is 2MB naturally aligned. The main kernel address space itself is mapped with hugepages, reducing TLB pressure from kernel code. For general information on Hugepages, see: What are Huge Pages and what are the advantages of using them?
The kernel will always attempt to satisfy a memory allocation using hugepages. If no hugepages are available (due to non availability of physically continuous memory for example) the kernel will fall back to the regular 4KB pages. THP are also swappable (unlike hugetlbfs). This is achieved by breaking the huge page to smaller 4KB pages, which are then swapped out normally.
But to use hugepages effectively, the kernel must find physically continuous areas of memory big enough to satisfy the request, and also properly aligned. For this, a khugepaged kernel thread has been added. This thread will occasionally attempt to substitute smaller pages being used currently with a hugepage allocation, thus maximizing THP usage.
In userland, no modifications to the applications are necessary (hence transparent). But there are ways to optimize its use. For applications that want to use hugepages, use of posix_memalign() can also help ensure that large allocations are aligned to huge page (2MB) boundaries.
Also, THP is only enabled for anonymous memory regions. There are plans to add support for tmpfs and page cache. THP tunables are found in the /sys tree under /sys/kernel/mm/redhat_transparent_hugepage.

2: Commands

cat /sys/kernel/mm/transparent_hugepage/enabled (this command also works on other Linux distributions)
[root@getlnx06 ~]# cat /sys/kernel/mm/transparent_hugepage/enabled
  
always madvise [never]

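In this output the value in square brackets is the active setting: [always] means transparent huge pages are enabled, [never] means they are disabled, and [madvise] means they are used only for memory regions that request them with madvise(MADV_HUGEPAGE).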
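
For scripts it can help to pull out just the bracketed value. A small sketch, assuming the file content has the format shown above (thp_active_mode is a made-up helper name):

```shell
# Print the active THP mode, i.e. the word inside the square brackets.
thp_active_mode() {
    # Reads the file contents from stdin, e.g. "always madvise [never]".
    sed -n 's/.*\[\([a-z]*\)].*/\1/p'
}

# Sample contents copied from the output above.
echo 'always madvise [never]' | thp_active_mode    # prints: never

# On a live system (the path can differ between distributions):
# thp_active_mode < /sys/kernel/mm/transparent_hugepage/enabled
```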

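3: If HugePages_Total returns 0, no static huge pages are reserved. This counter covers the explicitly reserved (hugetlbfs) pool, so on its own it does not show whether transparent huge pages are disabled; THP usage appears under AnonHugePages in /proc/meminfo.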

[root@getlnx06 ~]# grep -i HugePages_Total /proc/meminfo 
HugePages_Total: 0

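4: nr_hugepages

cat /proc/sys/vm/nr_hugepages shows the number of huge pages reserved in the static (hugetlbfs) pool. A return value of 0 means none are reserved; it does not indicate whether transparent huge pages are disabled.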
[root@jiangyi01.sqa.zmf /home/ahao.mah]
# cat /proc/sys/vm/nr_hugepages
0

Reserve 2000 huge pages:

[root@jiangyi01.sqa.zmf /home/ahao.mah]
#echo 2000 >  /proc/sys/vm/nr_hugepages

By default, none of the 2000 reserved huge pages are in use:

[root@jiangyi01.sqa.zmf /home/ahao.mah]
#grep -i HugePages_ /proc/meminfo
HugePages_Total:    2000
HugePages_Free:     2000
HugePages_Rsvd:        0
HugePages_Surp:        0
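
To see how much memory such a reservation pins, the counters can be multiplied by the huge page size. This is only a sketch: the Hugepagesize line is not part of the output above and is assumed here to be 2048 kB, and hugepage_pool_summary is a made-up name.

```shell
# Summarise the static huge page pool from /proc/meminfo-style text on stdin.
hugepage_pool_summary() {
    awk '
        /^HugePages_Total:/ { total = $2 }
        /^HugePages_Free:/  { free  = $2 }
        /^Hugepagesize:/    { size_kb = $2 }
        END {
            printf "reserved_kB=%d used_pages=%d\n", total * size_kb, total - free
        }'
}

# Sample input: the counters shown above plus an assumed 2048 kB page size.
hugepage_pool_summary <<'EOF'
HugePages_Total:    2000
HugePages_Free:     2000
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
EOF
# prints: reserved_kB=4096000 used_pages=0

# On a live system:
# hugepage_pool_summary < /proc/meminfo
```

With these numbers, roughly 4 GB is set aside for the pool and none of it is in use yet.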

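How are transparent huge pages used?

Improving KVM performance by configuring transparent huge pages

A detailed introduction to transparent huge pages on Linux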

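Understanding /proc/buddyinfo

/proc/buddyinfo exposes debug information from the Linux buddy system, which manages physical memory.
Linux uses the buddy algorithm to address external fragmentation of physical memory. It groups all free memory into 11 free lists by powers of two, holding blocks of 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, and 1024 pages respectively.
Linux also supports NUMA. On a NUMA system, a node usually consists of a group of CPUs and their local memory, so "Node 0" in buddyinfo is the node ID. The memory under each node is further divided into zones, which is why the memory of Node 0 in the output below is split into the DMA, Normal, and HighMem zones. The numbers that follow give the count of free blocks of each size.
Taking the Normal zone as an example: the second column is 100, meaning the free memory held in blocks of two contiguous pages is 100*2*PAGE_SIZE; the third column is 52, meaning the free memory held in blocks of four contiguous pages is 52*2^2*PAGE_SIZE.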

cat /proc/buddyinfo 
Node 0, zone      DMA     23     15      4      5      2      3      3      2      3      1      0 
Node 0, zone   Normal    149    100     52     33     23      5     32      8     12      2     59 
Node 0, zone  HighMem     11     21     23     49     29     15      8     16     12      2    142

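PAGE_SIZE: typically 4096 on 32-bit machines.
PAGE_SIZE: typically also 4096 on 64-bit x86 machines (the SysRq dump later in this article lists the smallest blocks as 4kB).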
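
The per-column arithmetic described above can be sketched with awk. This assumes 4 kB pages and the column layout shown in the sample output (block counts start at the fifth field); buddy_row_kb is a made-up helper name.

```shell
# Convert one /proc/buddyinfo row into free memory per order.
# Fields 5..15 hold the number of free blocks of order 0..10.
buddy_row_kb() {
    # $1 = zone name; buddyinfo text on stdin; assumes 4 kB pages
    awk -v zone="$1" -v page_kb=4 '
        $4 == zone {
            total = 0
            for (i = 5; i <= NF; i++) {
                order = i - 5
                kb = $i * (2 ^ order) * page_kb
                total += kb
                printf "order %d: %d blocks, %d kB\n", order, $i, kb
            }
            printf "total: %d kB\n", total
        }'
}

# Sample rows copied from the output above.
buddy_row_kb Normal <<'EOF'
Node 0, zone      DMA     23     15      4      5      2      3      3      2      3      1      0
Node 0, zone   Normal    149    100     52     33     23      5     32      8     12      2     59
Node 0, zone  HighMem     11     21     23     49     29     15      8     16     12      2    142
EOF
# the last line printed is: total: 275732 kB
```

For the Normal row this reports 800 kB in two-page blocks (100*2*4 kB) and 832 kB in four-page blocks (52*4*4 kB), matching the calculation in the text.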

[root@jiangyi01.sqa.zmf /home/ahao.mah]
#cat /proc/buddyinfo
Node 0, zone      DMA      1      1      1      0      2      1      1      0      1      1      3
Node 0, zone    DMA32    263   3113   3988   2062   1922   1161    769    639    582      0      0
Node 0, zone   Normal   4033   5345  34166  41732  38549  26725  15639   9526   6372   1410  15443

Explanation from the original documentation

/proc/buddyinfo

This file is used primarily for diagnosing memory fragmentation issues. Using the buddy algorithm, each column represents the number of pages of a certain order (a certain size) that are available at any given time. For example, for zone DMA (direct memory access), there are 90 of 2^0 * PAGE_SIZE chunks of memory. Similarly, there are 6 of 2^1 * PAGE_SIZE chunks, and 2 of 2^2 * PAGE_SIZE chunks of memory available.

The DMA row references the first 16 MB on a system, the HighMem row references all memory greater than 4 GB on a system, and the Normal row references all memory in between.

The following is an example of the output typical of /proc/buddyinfo:

[root@jiangyi01.sqa.zmf /home/ahao.mah]
#cat /proc/buddyinfo
Node 0, zone      DMA      1      1      1      0      2      1      1      0      1      1      3
Node 0, zone    DMA32    197   2692   4007   2065   1922   1161    769    639    582      0      0
Node 0, zone   Normal   3450    214  31052  41870  38553  26732  15642   9527   6373   1406  15443

Each column of numbers represents the number of pages of that order which are available. In the example above, there is 1 chunk of 2 ^ 0 * PAGE_SIZE available in ZONE_DMA, and 41870 chunks of 2 ^ 3 * PAGE_SIZE available in ZONE_NORMAL, etc...

This information can give you a good idea about how fragmented memory is and give you a clue as to how big of an area you can safely allocate.
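
One rough way to turn these columns into a fragmentation indicator is to ask what share of a zone's free pages sits in blocks at or above a given order. The sketch below is my own illustration, not a kernel-provided metric; large_block_share is a made-up name, and order 9 is chosen because 512 pages of 4 kB make one 2 MB huge page.

```shell
# For each zone, print the percentage of free pages held in blocks
# of at least the given order.
large_block_share() {
    # $1 = minimum order; buddyinfo text on stdin
    awk -v min="$1" '
        $3 == "zone" {
            all = 0; big = 0
            for (i = 5; i <= NF; i++) {
                pages = $i * (2 ^ (i - 5))
                all += pages
                if (i - 5 >= min) big += pages
            }
            printf "%s %.1f%%\n", $4, (all > 0 ? 100 * big / all : 0)
        }'
}

# Sample rows copied from the output above.
large_block_share 9 <<'EOF'
Node 0, zone      DMA      1      1      1      0      2      1      1      0      1      1      3
Node 0, zone    DMA32    197   2692   4007   2065   1922   1161    769    639    582      0      0
Node 0, zone   Normal   3450    214  31052  41870  38553  26732  15642   9527   6373   1406  15443
EOF
```

In this sample the DMA32 zone has plenty of free pages but none in blocks of order 9 or 10, so a 2 MB allocation from that zone would need compaction first.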

The following passage is important:

When a Linux system has been running for a while memory fragmentation can increase which depends heavily on the nature of the applications that are running on it. The more processes allocate and free memory, the quicker memory becomes fragmented. And the kernel may not always be able to defragment enough memory for a requested size on time. If that happens, applications may not be able to allocate larger contiguous chunks of memory even though there is enough free memory available. Starting with the 2.6 kernel, i.e. RHEL4 and SLES9, memory management has improved tremendously and memory fragmentation has become less of an issue.

To view memory fragmentation, run: echo m > /proc/sysrq-trigger

To see memory fragmentation you can use the magic SysRq key. Simply execute the following command:

# echo m > /proc/sysrq-trigger

This command will dump current memory information to /var/log/messages.

If echo m > /proc/sysrq-trigger produces no output, SysRq is not enabled. Enable it as follows:

# echo 1 > /proc/sys/kernel/sysrq

Starting with the 2.6 kernel, i.e. RHEL4 and SLES9, you don’t need SysRq to dump memory information. You can simply check /proc/buddyinfo for memory fragmentation.

In other words, from kernel 2.6 onward there is no need to run echo m > /proc/sysrq-trigger to dump memory information; /proc/buddyinfo can be read directly.

[root@jiangyi01.sqa.zmf /home/ahao.mah]
#echo 1 > /proc/sys/kernel/sysrq

[root@jiangyi01.sqa.zmf /home/ahao.mah]
#echo m > /proc/sysrq-trigger

[root@jiangyi01.sqa.zmf /home/ahao.mah]
#grep Normal /var/log/messages | tail -1
Mar  3 17:19:27 jiangyi01.sqa.zmf kernel: Node 0 Normal: 3745*4kB (UEM) 278*8kB (UEM) 31097*16kB (UEM) 41853*32kB (UEM) 38552*64kB (UEM) 26731*128kB (UEM) 15641*256kB (UEM) 9528*512kB (UEM) 6373*1024kB (UEM) 1406*2048kB (UEM) 15443*4096kB (UEMR) = 89285348kB
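
Each count*size term in this log line is one buddy order, and the terms should add up to the total after the equals sign. A small sketch that checks this, assuming the line format shown above (sysrq_zone_check is a made-up name):

```shell
# Sum the "count*sizekB" terms of a SysRq-m zone line and compare
# the result with the total reported after "=".
sysrq_zone_check() {
    awk '{
        sum = 0; reported = 0
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^[0-9]+[*][0-9]+kB$/) {
                split($i, part, /[*]/)
                sum += part[1] * (part[2] + 0)   # "16kB" + 0 evaluates to 16
            } else if ($i == "=") {
                reported = $(i + 1) + 0
            }
        }
        printf "sum=%d reported=%d\n", sum, reported
    }'
}

# The log line from above.
sysrq_zone_check <<'EOF'
Mar  3 17:19:27 jiangyi01.sqa.zmf kernel: Node 0 Normal: 3745*4kB (UEM) 278*8kB (UEM) 31097*16kB (UEM) 41853*32kB (UEM) 38552*64kB (UEM) 26731*128kB (UEM) 15641*256kB (UEM) 9528*512kB (UEM) 6373*1024kB (UEM) 1406*2048kB (UEM) 15443*4096kB (UEMR) = 89285348kB
EOF
# prints: sum=89285348 reported=89285348
```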

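Case study

Background:

A large business system at the company reported that its database server had been crashing constantly (this description is inaccurate, as explained later). Eventually the customer and the operations staff had had enough, and the project manager phoned to ask whether I could help diagnose it. I was due on site the next day to discuss testing requirements for another system, so I agreed to take a look while I was there.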

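Troubleshooting and resolution:

The next day on site, while we were discussing requirements, the operations staff suddenly said that the system had started lagging again.
I logged in to the server and used top to get a rough view of resource usage. The CPUs were almost fully loaded, yet user-mode CPU usage was not high; most of the time was being spent in kernel mode.

At that point I suspected that some low-level call made by the database service had gone wrong, perhaps an infinite loop.
So I took a quick look with perf top.

The largest shares belonged to a spin lock and to compaction_alloc. Memory compaction?
This suggested that some memory operation was leaving many threads waiting to enter a critical section.
To pin down exactly which operation was responsible, I sampled the kernel call stacks: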

perf record -a -g -F 1000 sleep 60

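-a samples all CPUs and -g records call graphs along with the samples; -F 1000 sleep 60 samples at 1000 samples per second for one minute.
After sampling, opening the data with perf report -g shows the following call stack:

The spin lock is clearly caused by memory compaction, and the compaction is triggered by huge pages.
Seeing this reminded me of the Linux THP feature, which as far as I recall was added around kernel 2.6.38.
This feature makes huge pages transparent to users, so they no longer have to configure huge pages themselves.
The kernel automatically treats 512 contiguous normal pages as one huge page.
As the call stack shows, this requires compacting fragmented memory,
so the symptom we saw was spin lock contention caused by page migration during compaction, while the root cause was the THP feature.
With the cause known, the fix is straightforward: disable THP.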

To disable it:

vi /etc/rc.local
Append the following commands to the end of the file:

if test -f /sys/kernel/mm/redhat_transparent_hugepage/enabled; then
   echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled
fi
if test -f /sys/kernel/mm/redhat_transparent_hugepage/defrag; then
   echo never > /sys/kernel/mm/redhat_transparent_hugepage/defrag
fi

Save the file and reboot.
PS: the sysfs path differs between Linux versions, so check which one your system uses.
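
Since the path varies, one option is to try both locations mentioned in this article. The following is only a sketch: disable_thp is a made-up function name, and the sysfs root is passed as a parameter so the logic can be tried against a scratch directory before it is pointed at /sys.

```shell
# Write "never" to the THP knobs under whichever directory exists.
disable_thp() {
    # $1 = sysfs mount point (normally /sys)
    for dir in "$1/kernel/mm/transparent_hugepage" \
               "$1/kernel/mm/redhat_transparent_hugepage"; do
        for knob in enabled defrag; do
            if [ -f "$dir/$knob" ]; then
                echo never > "$dir/$knob"
            fi
        done
    done
}

# Dry run against a scratch directory that mimics the upstream layout.
scratch=$(mktemp -d)
mkdir -p "$scratch/kernel/mm/transparent_hugepage"
echo always > "$scratch/kernel/mm/transparent_hugepage/enabled"
echo always > "$scratch/kernel/mm/transparent_hugepage/defrag"
disable_thp "$scratch"
cat "$scratch/kernel/mm/transparent_hugepage/enabled"   # prints: never
rm -rf "$scratch"

# Real use, as root:
# disable_thp /sys
```

On the real sysfs files the kernel echoes back the full list with the active value in brackets, rather than the bare word written here.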

cat /sys/kernel/mm/redhat_transparent_hugepage/enabled

If the output shows [never] as the bracketed (active) value, THP has been disabled.

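Disabling THP does not fully solve the problem, though.
As mentioned earlier, THP was introduced to spare administrators the work of configuring huge pages.
Having turned THP off, the best practice is to configure huge pages explicitly, sized to the shared memory that the database service needs.
With servers now routinely carrying tens or even hundreds of gigabytes of memory, maintaining the TLB with ordinary 4K pages is a significant overhead.
The huge page configuration procedure differs between databases, and even between versions of the same database. More to the point, this post has already become rather long,
so I will not go into it here; I may cover it in a separate post when time allows.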
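
As a rough starting point for that sizing, the number of huge pages is the shared memory size divided by the huge page size, rounded up. This is a sketch only: hugepages_needed is a made-up name, the 2048 kB default is an assumption, and real databases usually need some headroom on top.

```shell
# Huge pages needed to back a shared memory segment of a given size.
hugepages_needed() {
    # $1 = shared memory size in MB, $2 = huge page size in kB (default 2048)
    size_kb=$(( $1 * 1024 ))
    page_kb=${2:-2048}
    echo $(( (size_kb + page_kb - 1) / page_kb ))
}

hugepages_needed 8192    # 8 GB of shared memory, prints 4096
hugepages_needed 4000    # 4000 MB, prints 2000
```

The result is the kind of value that would be written to /proc/sys/vm/nr_hugepages, as in the earlier example that reserved 2000 pages.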

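Result:

After these two steps, we observed the system for several days and there were no more so-called "crashes".
"Crash" is in quotes because it refers to the project manager's original description of the server crashing, which was itself wrong. I plan to explain this in detail tomorrow: how to ask a question properly.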

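Case 2: a large amount of page cache is released all at once, causing heavy I/O pressure

A possible cause is that a program requests a large contiguous block of memory. Free memory is still available, but no sufficiently large contiguous block remains, which may trigger transparent_hugepage activity. On el7, transparent_hugepage is enabled by default, so it is worth trying never and observing the result.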

#cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never

sudo sh -c 'echo never > /sys/kernel/mm/transparent_hugepage/enabled'
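
One way to observe the effect of the change is to compare the compaction and THP counters in /proc/vmstat before and after. The sketch below filters a few of them; the counter names are the ones I know from mainline kernels and may be missing or named differently on older ones, the sample values are invented for illustration, and thp_compaction_counters is a made-up name.

```shell
# Filter compaction and THP counters from /proc/vmstat-style input on stdin.
thp_compaction_counters() {
    awk '
        $1 == "compact_stall"      ||
        $1 == "compact_fail"       ||
        $1 == "thp_fault_alloc"    ||
        $1 == "thp_fault_fallback" { print $1 "=" $2 }'
}

# Invented sample values, for illustration only.
thp_compaction_counters <<'EOF'
nr_free_pages 123456
compact_stall 42
compact_fail 7
thp_fault_alloc 1000
thp_fault_fallback 13
EOF

# On a live system, take two readings some time apart and compare them:
# thp_compaction_counters < /proc/vmstat
```

If compact_stall keeps climbing while the workload stalls, direct compaction is a likely contributor; if it stays flat after switching to never, the change is having the intended effect.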

REF

http://stackoverflow.com/questions/4863707/how-to-see-linux-view-of-the-ram-in-order-to-determinate-the-fragmentation

http://www.cnblogs.com/itfriend/archive/2011/12/14/2287160.html

http://1152313.blog.51cto.com/1142313/1767927
