How to fix hung_task_timeout_secs / "blocked for more than 120 seconds" errors on Linux

The system stops responding, and /var/log/messages fills with large numbers of errors like "blocked for more than 120 seconds" and ""echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message."

Cause of the problem:

By default, Linux will use up to 40% of available memory as file system cache. When that threshold is exceeded, the file system flushes all cached data to disk, which makes subsequent I/O requests synchronous. Flushing the cache to disk is subject to a default timeout of 120 seconds. The errors above occur when the I/O subsystem is too slow to write all the cached data to disk within those 120 seconds. As the I/O system responds slowly, more and more requests pile up, until eventually all system memory is consumed and the system stops responding.
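As a back-of-the-envelope sketch of why the 120-second window can be missed (the 16 GiB of RAM and 50 MB/s of disk throughput below are made-up numbers, not measurements from any real system):

```python
# Hypothetical figures for illustration only.
ram_gib = 16
cache_fraction = 0.40            # up to 40% of memory used as file cache
disk_mb_s = 50                   # assumed sustained write throughput of the disk

cache_mb = ram_gib * 1024 * cache_fraction
flush_seconds = cache_mb / disk_mb_s
print(f"{cache_mb:.0f} MB of cache needs {flush_seconds:.0f} s to flush")
print("hung task warnings likely" if flush_seconds > 120 else "within the 120 s window")
```

With these assumed numbers the flush takes about 131 seconds, so the kernel's hung-task watchdog would start complaining.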

Solution:

Tune the vm.dirty_ratio and vm.dirty_background_ratio parameters according to your application's behavior. For example, the following settings are recommended:
# sysctl -w vm.dirty_ratio=10
# sysctl -w vm.dirty_background_ratio=5
# sysctl -p

To make the settings permanent, edit /etc/sysctl.conf and add the following two lines:
# vi /etc/sysctl.conf

vm.dirty_background_ratio = 5
vm.dirty_ratio = 10

Reboot the system for the change to take effect.

 

About the cache

File caching is an important way to improve performance. Read caching is beneficial in the vast majority of cases (programs can read data straight from RAM), while write caching is more complicated. The Linux kernel splits a disk write into two steps: the data is first written to the cache, and the cache is then flushed to disk asynchronously at intervals. This speeds up I/O, but it carries some risk: because data is not written to disk immediately, it can be lost.

The cache can also be overwhelmed. Writing a large amount of data to disk at once can make the system stall. The stall happens because the system decides the cache has grown too large to flush asynchronously in time and switches to synchronous writes. (Asynchronous: the process keeps running while the data is being written; synchronous: other processes must wait until the write completes.)

The good news is that you can configure the write cache to match your workload.
Take a look at these parameters:

[root@host ~]# sysctl -a | grep dirty
vm.dirty_background_ratio = 10
vm.dirty_background_bytes = 0
vm.dirty_ratio = 20
vm.dirty_bytes = 0
vm.dirty_writeback_centisecs = 500
vm.dirty_expire_centisecs = 3000

vm.dirty_background_ratio is the percentage of memory that may be filled with "dirty" data. That dirty data will be written to disk later; the pdflush/flush/kdmflush background processes clean it up. For example, on a machine with 32 GB of memory, 3.2 GB of dirty data can sit in RAM; beyond that, the background processes start cleaning it up.

vm.dirty_ratio is the hard limit on dirty data: the percentage of memory holding dirty data may not exceed this value. If it does, new I/O requests are blocked until dirty data has been written to disk. This is a major cause of I/O stalls, but it is also the safeguard that prevents too much dirty data from accumulating in memory.

vm.dirty_expire_centisecs specifies how long dirty data may live in the cache; here it is 30 seconds. When pdflush/flush/kdmflush wake up, they check for data older than this limit and write it to disk asynchronously. After all, data that sits in memory too long also risks being lost.

vm.dirty_writeback_centisecs specifies how often the pdflush/flush/kdmflush processes wake up.
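As a small sketch, the ratios and centisecond values above translate into bytes and seconds like this (the 32 GiB machine is hypothetical, and the parameter values are the example sysctl output above, not read from a live system):

```python
# Convert the example sysctl values into human-friendly units.
# All numbers here are the article's examples, not live system values.
RAM_BYTES = 32 * 1024**3           # hypothetical 32 GiB machine

dirty_background_ratio = 10        # percent
dirty_ratio = 20                   # percent
dirty_writeback_centisecs = 500    # hundredths of a second
dirty_expire_centisecs = 3000

background_bytes = RAM_BYTES * dirty_background_ratio // 100
hard_limit_bytes = RAM_BYTES * dirty_ratio // 100

print(f"background flush starts at {background_bytes / 1024**3:.1f} GiB dirty")
print(f"writers block at           {hard_limit_bytes / 1024**3:.1f} GiB dirty")
print(f"flushers wake every        {dirty_writeback_centisecs / 100:.0f} s")
print(f"dirty data expires after   {dirty_expire_centisecs / 100:.0f} s")
```

So with the defaults shown, background writeback starts at about 3.2 GiB of dirty data, and writers start blocking at about 6.4 GiB.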

You can check how much dirty data is currently in memory like this:

[root@host ~]# cat /proc/vmstat | egrep "dirty|writeback"
nr_dirty 69
nr_writeback 0
nr_writeback_temp 0

This shows that there are 69 pages of dirty data waiting to be written to disk.
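nr_dirty counts memory pages, not bytes. Assuming the common 4 KiB page size (check yours with `getconf PAGE_SIZE`), the 69 pages above amount to:

```python
PAGE_SIZE = 4096   # bytes; common x86-64 page size (verify with `getconf PAGE_SIZE`)
nr_dirty = 69      # value from the /proc/vmstat output above

dirty_bytes = nr_dirty * PAGE_SIZE
print(f"{dirty_bytes} bytes = {dirty_bytes / 1024:.0f} KiB dirty")
```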

 


Scenario 1: Shrinking the cache

You can pick values appropriate to what you are doing.
In some situations we have fast disk subsystems with their own battery-backed NVRAM caches; keeping the data at the operating-system level then becomes relatively risky, so we want the system to write data to disk more promptly.
Add the following two lines to /etc/sysctl.conf and run "sysctl -p":

vm.dirty_background_ratio = 5
vm.dirty_ratio = 10

This is typical for virtual machines. Setting these to 0 is not recommended; a little background I/O helps the performance of some programs.


Scenario 2: Growing the cache

In some scenarios a larger cache is beneficial: for example, when the data is unimportant (losing it would be acceptable) and a program reads and writes the same file repeatedly. Allowing more cache lets more of the reads and writes happen in memory, which improves speed.

vm.dirty_background_ratio = 50
vm.dirty_ratio = 80

Sometimes vm.dirty_expire_centisecs is also raised to let dirty data stay in the cache longer.


Scenario 3: Both at once

Sometimes the system has to absorb sudden bursts of data that could bog down the disk (say, batch jobs that run at the top of every hour).
In that case, allow more dirty data to accumulate in memory and let the background processes drain it to disk asynchronously over time:

vm.dirty_background_ratio = 5
vm.dirty_ratio = 80

With these settings, the background processes start asynchronous cleanup as soon as dirty data reaches 5%, but the system does not force synchronous writes until it reaches 80%. This smooths out the I/O.


You can get more information for tuning from /proc/vmstat, /proc/meminfo, and /proc/sys/vm.

While tuning database performance over the past couple of days, I needed to reduce the impact of the OS file cache on the database, so I looked into ways of shrinking the file system cache. One of them is adjusting the /proc/sys/vm/dirty_background_ratio and /proc/sys/vm/dirty_ratio parameters. I read quite a few blog posts about them but could never work out the difference between the two, until the English post below finally made it clear.

vm.dirty_background_ratio: when dirty pages in the file system cache reach this percentage of system memory (e.g. 5%), the background writeback processes (pdflush/flush/kdmflush) are triggered to flush some of the dirty pages to disk asynchronously.

vm.dirty_ratio: when dirty pages reach this percentage of system memory (e.g. 10%), the system has no choice but to start processing them (there are now so many dirty pages that some must be flushed to avoid data loss); while this happens, many application processes may block because the system has turned to handling file I/O.

I used to believe, wrongly, that the vm.dirty_ratio condition could never be reached, because the vm.dirty_background_ratio threshold would always be hit first. Later I realized my misunderstanding: the vm.dirty_background_ratio threshold is indeed reached first and triggers the flush processes to write back asynchronously, but applications can keep writing during that time. If the applications write more than the flush processes drain, dirty pages will naturally climb to the vm.dirty_ratio limit, at which point the operating system switches to processing dirty pages synchronously and blocks the application processes.
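A minimal toy model of this interplay (the write and flush rates below are made-up numbers, not measurements): dirty pages grow at the write rate, background flushing kicks in at the 5% threshold, and because the assumed write rate exceeds the assumed flush rate, the hard 80% limit is eventually hit and writers block.

```python
# Toy model of the two thresholds. All rates and sizes are hypothetical.
RAM_MB = 32 * 1024
BACKGROUND_LIMIT = RAM_MB * 5 // 100    # async flushing starts here (5%)
HARD_LIMIT = RAM_MB * 80 // 100         # writers block here (80%)

write_rate_mb_s = 500    # applications dirty 500 MB/s (assumed)
flush_rate_mb_s = 200    # flushers drain 200 MB/s once active (assumed)

dirty = 0
flushing = False
for second in range(1, 10_000):
    dirty += write_rate_mb_s
    if flushing:
        dirty = max(0, dirty - flush_rate_mb_s)
    if dirty >= BACKGROUND_LIMIT:
        flushing = True    # background writeback kicks in; apps keep writing
    if dirty >= HARD_LIMIT:
        print(f"hard limit hit after {second} s: writers now block")
        break
```

Since the net growth after flushing starts is 300 MB/s, the model hits the hard limit after 85 seconds; in a real system this is the moment applications stall on synchronous writeback.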
 

The original post follows:

Better Linux Disk Caching & Performance with vm.dirty_ratio & vm.dirty_background_ratio

by BOB PLANKERS on DECEMBER 22, 2013

in BEST PRACTICES, CLOUD, SYSTEM ADMINISTRATION, VIRTUALIZATION

 

This is post #16 in my December 2013 series about Linux Virtual Machine Performance Tuning. For more, please see the tag "Linux VM Performance Tuning."

In previous posts on vm.swappiness and using RAM disks we talked about how the memory on a Linux guest is used for the OS itself (the kernel, buffers, etc.), applications, and also for file cache. File caching is an important performance improvement, and read caching is a clear win in most cases, balanced against applications using the RAM directly. Write caching is trickier. The Linux kernel stages disk writes into cache, and over time asynchronously flushes them to disk. This has a nice effect of speeding disk I/O but it is risky. When data isn’t written to disk there is an increased chance of losing it.

There is also the chance that a lot of I/O will overwhelm the cache, too. Ever written a lot of data to disk all at once, and seen large pauses on the system while it tries to deal with all that data? Those pauses are a result of the cache deciding that there’s too much data to be written asynchronously (as a non-blocking background operation, letting the application process continue), and switches to writing synchronously (blocking and making the process wait until the I/O is committed to disk). Of course, a filesystem also has to preserve write order, so when it starts writing synchronously it first has to destage the cache. Hence the long pause.

The nice thing is that these are controllable options, and based on your workloads & data you can decide how you want to set them up. Let’s take a look:

$ sysctl -a | grep dirty
vm.dirty_background_ratio = 10
vm.dirty_background_bytes = 0
vm.dirty_ratio = 20
vm.dirty_bytes = 0
vm.dirty_writeback_centisecs = 500
vm.dirty_expire_centisecs = 3000

vm.dirty_background_ratio is the percentage of system memory that can be filled with "dirty" pages — memory pages that still need to be written to disk — before the pdflush/flush/kdmflush background processes kick in to write it to disk. My example is 10%, so if my virtual server has 32 GB of memory that's 3.2 GB of data that can be sitting in RAM before something is done.

vm.dirty_ratio is the absolute maximum amount of system memory that can be filled with dirty pages before everything must get committed to disk. When the system gets to this point all new I/O blocks until dirty pages have been written to disk. This is often the source of long I/O pauses, but is a safeguard against too much data being cached unsafely in memory.

vm.dirty_background_bytes and vm.dirty_bytes are another way to specify these parameters. If you set the _bytes version the _ratio version will become 0, and vice-versa.

vm.dirty_expire_centisecs is how long something can be in cache before it needs to be written. In this case it’s 30 seconds. When the pdflush/flush/kdmflush processes kick in they will check to see how old a dirty page is, and if it’s older than this value it’ll be written asynchronously to disk. Since holding a dirty page in memory is unsafe this is also a safeguard against data loss.

vm.dirty_writeback_centisecs is how often the pdflush/flush/kdmflush processes wake up and check to see if work needs to be done.

You can also see statistics on the page cache in /proc/vmstat:

$ cat /proc/vmstat | egrep "dirty|writeback"
nr_dirty 878
nr_writeback 0
nr_writeback_temp 0

In my case I have 878 dirty pages waiting to be written to disk.

Approach 1: Decreasing the Cache

As with most things in the computer world, how you adjust these depends on what you're trying to do. In many cases we have fast disk subsystems with their own big, battery-backed NVRAM caches, so keeping things in the OS page cache is risky. Let's try to send I/O to the array in a more timely fashion and reduce the chance our local OS will, to borrow a phrase from the service industry, be "in the weeds." To do this we lower vm.dirty_background_ratio and vm.dirty_ratio by adding new numbers to /etc/sysctl.conf and reloading with "sysctl -p":

vm.dirty_background_ratio = 5
vm.dirty_ratio = 10

This is a typical approach on virtual machines, as well as Linux-based hypervisors. I wouldn’t suggest setting these parameters to zero, as some background I/O is nice to decouple application performance from short periods of higher latency on your disk array & SAN (「spikes」).

Approach 2: Increasing the Cache

There are scenarios where raising the cache dramatically has positive effects on performance. These situations are where the data contained on a Linux guest isn’t critical and can be lost, and usually where an application is writing to the same files repeatedly or in repeatable bursts. In theory, by allowing more dirty pages to exist in memory you’ll rewrite the same blocks over and over in cache, and just need to do one write every so often to the actual disk. To do this we raise the parameters:

vm.dirty_background_ratio = 50
vm.dirty_ratio = 80

Sometimes folks also increase the vm.dirty_expire_centisecs parameter to allow more time in cache. Beyond the increased risk of data loss, you also run the risk of long I/O pauses if that cache gets full and needs to destage, because on large VMs there will be a lot of data in cache.

Approach 3: Both Ways

There are also scenarios where a system has to deal with infrequent, bursty traffic to slow disk (batch jobs at the top of the hour, midnight, writing to an SD card on a Raspberry Pi, etc.). In that case an approach might be to allow all that write I/O to be deposited in the cache so that the background flush operations can deal with it asynchronously over time:

vm.dirty_background_ratio = 5
vm.dirty_ratio = 80

Here the background processes will start writing right away when it hits that 5% ceiling but the system won’t force synchronous I/O until it gets to 80% full. From there you just size your system RAM and vm.dirty_ratio to be able to consume all the written data. Again, there are tradeoffs with data consistency on disk, which translates into risk to data. Buy a UPS and make sure you can destage cache before the UPS runs out of power. :)
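The "size your system RAM and vm.dirty_ratio to consume all the written data" step can be sketched as a quick sizing check (the 10 GB hourly burst and 64 GiB of RAM are assumed numbers for illustration):

```python
# Hypothetical sizing: can the dirty-page limit absorb a whole burst?
burst_gb = 10        # data written in one burst (assumed)
ram_gib = 64         # system memory (assumed)
dirty_ratio = 80     # percent of RAM allowed to hold dirty pages

dirty_limit_gib = ram_gib * dirty_ratio / 100
ok = dirty_limit_gib >= burst_gb
print(f"dirty limit {dirty_limit_gib:.1f} GiB -> burst fits: {ok}")
```

If the check fails, either the burst must shrink, RAM must grow, or vm.dirty_ratio must rise — otherwise writers will hit the limit mid-burst and block.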

No matter the route you choose you should always be gathering hard data to support your changes and help you determine if you are improving things or making them worse. In this case you can get data from many different places, including the application itself, /proc/vmstat, /proc/meminfo, iostat, vmstat, and many of the things in /proc/sys/vm. Good luck!
