Understanding LGWR, Log File Sync Waits, and Commit Performance Issues
I. Overview:
1. How commit and log file sync work
2. Why log file sync waits take so long
3. How to measure where the problem actually is
II. Causes of log file sync waits
1. By default, committing a transaction waits on log file sync. This covers:
(1) User commits (per-session commit statistics can be viewed in v$sesstat)
(2) DDL - mostly produced by the recursive transaction commits it triggers
(3) Recursive data dictionary DML operations
2. Rollbacks also lead to log file sync waits:
(1) User rollbacks - rollback operations issued by users or by the application
(2) Transaction rollbacks - Oracle-internal rollbacks caused by failed operations, space allocation or ASSM-related issues, long queries cancelled by users, killed sessions, and so on. A quick way to check the related statistics is shown below.
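To see which sessions are actually generating the commit and rollback activity, a query along these lines can be used (a minimal sketch against the standard v$sesstat/v$statname views):
- -- Per-session commit/rollback counters, highest activity first
- SELECT s.sid, n.name, s.value
-   FROM v$sesstat s, v$statname n
-  WHERE s.statistic# = n.statistic#
-    AND n.name IN ('user commits', 'user rollbacks', 'transaction rollbacks')
-    AND s.value > 0
-  ORDER BY s.value DESC;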
The figure below is the flow diagram for commit and log file sync:

Log file sync performance > disk I/O speed

**** Most of the log file sync wait time is actually spent in log file parallel write, just as most of DBWR's time is spent waiting on db file parallel write
**** The rest of the log file sync wait is spent on scheduling delays, IPC latency, and so on

Let's walk through the full flow of a log file sync wait:
1. The foreground process posts LGWR and then goes to sleep
The log file sync wait starts being timed at this point
On Unix platforms this post is implemented with semaphores
2. LGWR is woken up and gets onto the CPU
LGWR issues the I/O request
LGWR goes back to sleep and waits on log file parallel write
3. When the I/O completes at the storage level, the OS wakes the LGWR process up
LGWR gets back onto the CPU
The log file parallel write wait is marked complete, and LGWR posts the relevant information back to the foreground process
4. The foreground process is woken by LGWR, gets onto the CPU, and marks the log file sync wait complete
Measuring LGWR with the snapper script:
- ---------------------------------------------------------------------------------
- SID, USERNAME , TYPE, STATISTIC , DELTA, HDELTA/SEC, %TIME, GRAPH
- ---------------------------------------------------------------------------------
- 1096, (LGWR) , STAT, messages sent , 12 , 12,
- 1096, (LGWR) , STAT, messages received , 10 , 10,
- 1096, (LGWR) , STAT, background timeouts , 1 , 1,
- 1096, (LGWR) , STAT, physical write total IO requests , 40, 40,
- 1096, (LGWR) , STAT, physical write total multi block request, 38, 38,
- 1096, (LGWR) , STAT, physical write total bytes, 2884608 , 2.88M,
- 1096, (LGWR) , STAT, calls to kcmgcs , 20 , 20,
- 1096, (LGWR) , STAT, redo wastage , 4548 , 4.55k,
- 1096, (LGWR) , STAT, redo writes , 10 , 10,
- 1096, (LGWR) , STAT, redo blocks written , 2817 , 2.82k,
- 1096, (LGWR) , STAT, redo write time , 25 , 25,
- 1096, (LGWR) , WAIT, LGWR wait on LNS , 1040575 , 1.04s, 104.1%, |@@@@@@@@@@|
- 1096, (LGWR) , WAIT, log file parallel write , 273837 , 273.84ms, 27.4%,|@@@ |
- 1096, (LGWR) , WAIT, events in waitclass Other , 1035172 , 1.04s , 103.5%,|@@@@@@@@@@|
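For reference, a snapshot like the one above is taken from SQL*Plus with Tanel Poder's snapper script; a plausible invocation (a sketch only, since the exact option syntax depends on the snapper version, and 1096 is the LGWR SID from the output above) is:
- -- 5-second snapshot, 1 iteration, of session 1096 (LGWR)
- SQL> @snapper all 5 1 1096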
LGWR and asynchronous I/O
- oracle@linux01:~$ strace -cp `pgrep -f lgwr`
- Process 12457 attached - interrupt to quit
- ^CProcess 12457 detached
- % time seconds usecs/call calls errors syscall
- ------ ----------- ----------- --------- --------- --------------
- 100.00 0.010000 263 38 3 semtimedop
- 0.00 0.000000 0 213 times
- 0.00 0.000000 0 8 getrusage
- 0.00 0.000000 0 701 gettimeofday
- 0.00 0.000000 0 41 io_getevents
- 0.00 0.000000 0 41 io_submit
- 0.00 0.000000 0 2 semop
- 0.00 0.000000 0 37 semctl
- ------ ----------- ----------- --------- --------- --------------
- 100.00 0.010000 1081 3 total
*** Under asynchronous I/O, io_getevents is the call during which the log file parallel write wait is measured
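Whether the instance is actually set up for asynchronous I/O can also be checked from the database side (a minimal sketch; these are the standard disk_asynch_io and filesystemio_options parameters):
- -- Async I/O related initialization parameters
- SELECT name, value
-   FROM v$parameter
-  WHERE name IN ('disk_asynch_io', 'filesystemio_options');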
Redo and commit related latch tuning
1. Redo allocation latches - as the name implies, the latches that protect space allocation when a process writes redo into the log buffer
2. Redo copy latches - the latches needed while redo is copied from a private memory area into the log buffer. They are held until the redo has been fully copied in, and LGWR cannot write those buffers to disk until the copy completes; while it waits, LGWR reports the LGWR wait for redo copy event. The related tunable parameter is _log_simultaneous_copies. The latch activity can be checked with the query below.
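A quick way to see whether these latches are actually under contention (a minimal sketch against the standard v$latch view):
- -- Gets, misses and sleeps for the redo-related latches
- SELECT name, gets, misses, sleeps, immediate_gets, immediate_misses
-   FROM v$latch
-  WHERE name IN ('redo allocation', 'redo copy', 'redo writing');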
Related wait events:
1.log file sync
2.log file parallel write
3.log file single write
The related statistics can be obtained from v$sesstat and v$sysstat:
(1. redo size 2. redo write time 3. user commits 4. user rollbacks 5. transaction rollbacks)
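At the system level they can be pulled in one go (a minimal sketch; the names are standard v$sysstat statistic names, cumulative since instance startup):
- -- System-wide redo and transaction statistics
- SELECT name, value
-   FROM v$sysstat
-  WHERE name IN ('redo size', 'redo write time', 'user commits',
-                 'user rollbacks', 'transaction rollbacks');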
Now look at a wait event that is not a commit problem: log buffer space. It is mainly caused by a redo log buffer that is too small, by contention while LGWR flushes redo out of the buffer, or by an I/O subsystem that is too slow. In many people's experience a relatively larger log buffer can relieve log file sync, but only up to a point: if your workload writes a large amount of redo per commit and overall I/O has already hit the storage subsystem's limit, enlarging the log buffer will not help. According to many MOS documents, explicitly setting this parameter is still not recommended in 10g.
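The current size can be checked directly in the SGA (a minimal sketch; 'log_buffer' is the component name used in v$sgastat):
- -- Size of the log buffer as allocated in the SGA
- SELECT name, bytes
-   FROM v$sgastat
-  WHERE name = 'log_buffer';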
log file single write:
Single-block redo I/O is mostly used only for reading and writing the log file header block. The main case is a log switch, and the log header also has to be updated when archiving occurs, so both LGWR and ARCH may wait on this event.
Below is a trace file taken while a log switch occurred:
- WAIT #0: nam='log file sequential read' ela= 12607 log#=0
- block#=1
- WAIT #0: nam='log file sequential read' ela= 21225 log#=1
- block#=1
- WAIT #0: nam='control file sequential read' ela= 358 file#=0
- WAIT #0: nam='log file single write' ela= 470 log#=0 block#=1
- WAIT #0: nam='log file single write' ela= 227 log#=1 block#=1
Starting with 10.2.0.3, if a log write takes too long (more than 500 ms) Oracle dumps the details to the LGWR trace file:
- LGWR trace file:
- *** 2012-06-10 11:36:06.759
- Warning: log write time 690ms, size 19KB
- *** 2012-06-10 11:37:23.778
- Warning: log write time 52710ms, size 0KB
- *** 2012-06-10 11:37:27.302
- Warning: log write time 3520ms, size 144KB
Now look at the AWR data from one of our databases:
- Load Profile Per Second Per Transaction Per Exec Per Call
- ~~~~~~~~~~~~ --------------- --------------- ---------- ----------
- Redo size: 15,875,849.0 121,482.8
- Logical reads: 42,403.5 324.5
- Block changes: 34,759.1 266.0
- Physical reads: 46.0 0.4
- Physical writes: 3,417.9 26.2
- User calls: 569.6 4.4
- Parses: 292.3 2.2
- Hard parses: 0.1 0.0
- W/A MB processed: 0.5 0.0
- Logons: 10.7 0.1
- Executes: 552.8 4.2
- Rollbacks: 42.9 0.3
- Transactions: 130.7
- Top 5 Timed Foreground Events
- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- Avg
- wait % DB
- Event Waits Time(s) (ms) time Wait Class
- ------------------------------ ------------ ----------- ------ ------ ----------
- DB CPU 37,301 76.5
- log file sync 1,665,900 7,732 5 15.9 Commit
- db file sequential read 711,221 6,614 9 13.6 User I/O
- buffer busy waits 366,589 440 1 .9 Concurrenc
- gc current multi block request 192,791 230 1 .5 Cluster
This is node 2 of that database. It is not particularly busy, but because of the workload each write is fairly large, and log file sync accounts for 16% of DB time.
Now look at the background process waits:
- Avg
- %Time Total Wait wait Waits % bg
- Event Waits -outs Time (s) (ms) /txn time
- -------------------------- ------------ ----- ---------- ------- -------- ------
- db file parallel write 11,968,325 0 24,481 2 5.7 71.2
- log file parallel write 1,503,192 0 3,863 3 0.7 11.2
*** As shown above, log file sync accounts for 16% of DB time with an average wait of about 5 ms, while log file parallel write averages about 3 ms per write, so the bulk of each log file sync is LGWR's own disk write. This is mainly down to the workload: we have already hit the I/O limit of our storage.
*** This database pushes an average throughput of around 200 MB per second.
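The same comparison can be made on a live system without an AWR report (a minimal sketch against v$system_event; time_waited_micro is in microseconds):
- -- Average wait for the commit-related events since instance startup
- SELECT event,
-        total_waits,
-        ROUND(time_waited_micro / 1000 / NULLIF(total_waits, 0), 2) AS avg_ms
-   FROM v$system_event
-  WHERE event IN ('log file sync', 'log file parallel write');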
Below is part of the data I collected with the lfsdiag.sql script:
- INST_ID EVENT WAIT_TIME_MILLI WAIT_COUNT
- ---------- ---------------------------------------- --------------- ----------
- 1 wait for scn ack 1 4243024
- 1 wait for scn ack 2 728196
- 1 wait for scn ack 4 1133400
- 1 wait for scn ack 8 1157120
- 1 wait for scn ack 16 88333
- 1 wait for scn ack 32 3883
- 1 wait for scn ack 64 429
- 1 wait for scn ack 128 80
- 1 wait for scn ack 256 34
- 1 wait for scn ack 512 48
- 2 wait for scn ack 1 55024800
- 2 wait for scn ack 2 6658764
- 2 wait for scn ack 4 6802492
- 2 wait for scn ack 8 4400949
- 2 wait for scn ack 16 564950
- 2 wait for scn ack 32 21712
- 2 wait for scn ack 64 3190
- 2 wait for scn ack 128 912
- 2 wait for scn ack 256 390
- 2 wait for scn ack 512 508
- 1 log file sync 1 49708644
- 1 log file sync 2 4285471
- 1 log file sync 4 3929029
- 1 log file sync 8 2273533
- 1 log file sync 16 709349
- 1 log file sync 32 257827
- 1 log file sync 64 10464
- 1 log file sync 128 2371
- 1 log file sync 256 1582
- 1 log file sync 512 1979
- 1 log file sync 1024 1200
- 2 log file sync 1 647580137
- 2 log file sync 2 56421028
- 2 log file sync 4 42559988
- 2 log file sync 8 26002340
- 2 log file sync 16 12821558
- 2 log file sync 32 4429073
- 2 log file sync 64 229397
- 2 log file sync 128 42685
- 2 log file sync 256 22693
- 2 log file sync 512 23922
- 2 log file sync 1024 215090
- 1 log file switch completion 1 141
- 1 log file switch completion 2 27
- 1 log file switch completion 4 35
- 1 log file switch completion 8 72
- 1 log file switch completion 16 237
- 1 log file switch completion 32 453
- 1 log file switch completion 64 387
- 1 log file switch completion 128 31
- 2 log file switch completion 1 956
- 2 log file switch completion 2 508
- 2 log file switch completion 4 1005
- 2 log file switch completion 8 1858
- 2 log file switch completion 16 4506
- 2 log file switch completion 32 5569
- 2 log file switch completion 64 6957
- 2 log file switch completion 128 319
- 2 log file switch completion 256 24
- 2 log file switch completion 512 108
- 2 log file switch completion 1024 1
- 1 log file parallel write 1 56713575
- 1 log file parallel write 2 2952904
- 1 log file parallel write 4 1832942
- 1 log file parallel write 8 785097
- 1 log file parallel write 16 386755
- 1 log file parallel write 32 229099
- 1 log file parallel write 64 8552
- 1 log file parallel write 128 1461
- 1 log file parallel write 256 914
- 1 log file parallel write 512 231
- 1 log file parallel write 1024 21
- 1 log file parallel write 2048 3
- 2 log file parallel write 1 708078642
- 2 log file parallel write 2 31616460
- 2 log file parallel write 4 16087368
- 2 log file parallel write 8 5656461
- 2 log file parallel write 16 3121042
- 2 log file parallel write 32 1995505
- 2 log file parallel write 64 44298
- 2 log file parallel write 128 7506
- 2 log file parallel write 256 2582
- 2 log file parallel write 512 536
- 2 log file parallel write 1024 464
- 2 log file parallel write 2048 26
- 2 log file parallel write 4096 0
- 2 log file parallel write 8192 0
- 2 log file parallel write 16384 0
- 2 log file parallel write 32768 0
- 2 log file parallel write 65536 0
- 2 log file parallel write 131072 0
- 2 log file parallel write 262144 0
- 2 log file parallel write 524288 1
- 1 gcs log flush sync 1 4366103
- 1 gcs log flush sync 2 72108
- 1 gcs log flush sync 4 52374
- 1 gcs log flush sync 8 23374
-
- INST_ID EVENT WAIT_TIME_MILLI WAIT_COUNT
- ---------- ---------------------------------------- --------------- ----------
- 1 gcs log flush sync 16 13518
- 1 gcs log flush sync 32 12450
- 1 gcs log flush sync 64 11307
- 1 gcs log flush sync 128 4
- 2 gcs log flush sync 1 9495464
- 2 gcs log flush sync 2 263718
- 2 gcs log flush sync 4 222876
- 2 gcs log flush sync 8 148562
- 2 gcs log flush sync 16 68586
- 2 gcs log flush sync 32 33704
- 2 gcs log flush sync 64 5231
- 2 gcs log flush sync 128 1
- 1 gc current block 2-way 1 30064919
- 1 gc current block 2-way 2 353563
- 1 gc current block 2-way 4 239425
- 1 gc current block 2-way 8 29994
- 1 gc current block 2-way 16 3203
- 1 gc current block 2-way 32 1661
- 1 gc current block 2-way 64 1501
- 1 gc current block 2-way 128 273
- 1 gc current block 2-way 256 153
- 1 gc current block 2-way 512 22
- 1 gc current block 2-way 1024 119
- 2 gc current block 2-way 1 36168617
- 2 gc current block 2-way 2 303236
- 2 gc current block 2-way 4 148934
- 2 gc current block 2-way 8 13304
- 2 gc current block 2-way 16 2140
- 2 gc current block 2-way 32 1635
- 2 gc current block 2-way 64 1114
- 2 gc current block 2-way 128 210
- 2 gc current block 2-way 256 28
- 2 gc current block 2-way 512 12
- 2 gc current block 2-way 1024 12
- 2 gc current block 2-way 2048 3
- 2 gc current block 2-way 4096 2
- 1 gc cr grant 2-way 1 76502000
- 1 gc cr grant 2-way 2 476023
- 1 gc cr grant 2-way 4 564802
- 1 gc cr grant 2-way 8 61560
- 1 gc cr grant 2-way 16 5657
- 1 gc cr grant 2-way 32 3011
- 1 gc cr grant 2-way 64 440
- 1 gc cr grant 2-way 128 217
- 1 gc cr grant 2-way 256 6
- 2 gc cr grant 2-way 1 155966394
- 2 gc cr grant 2-way 2 740788
- 2 gc cr grant 2-way 4 748834
- 2 gc cr grant 2-way 8 59464
- 2 gc cr grant 2-way 16 9889
- 2 gc cr grant 2-way 32 7236
- 2 gc cr grant 2-way 64 937
- 2 gc cr grant 2-way 128 463
- 2 gc cr grant 2-way 256 14
- 2 gc cr grant 2-way 512 5
- 2 gc cr grant 2-way 1024 10
- 2 gc cr grant 2-way 2048 2
- 2 gc cr grant 2-way 4096 4
- 2 gc cr grant 2-way 8192 1
- 1 gc buffer busy 1 34252868
- 1 gc buffer busy 2 18723990
- 1 gc buffer busy 4 9528539
- 1 gc buffer busy 8 4351426
- 1 gc buffer busy 16 3691918
- 1 gc buffer busy 32 755331
- 1 gc buffer busy 64 68712
- 1 gc buffer busy 128 10869
- 1 gc buffer busy 256 2553
- 1 gc buffer busy 512 337
- 1 gc buffer busy 1024 2933
- 2 gc buffer busy 1 7881434
- 2 gc buffer busy 2 2083189
- 2 gc buffer busy 4 1372486
- 2 gc buffer busy 8 1957290
- 2 gc buffer busy 16 1417604
- 2 gc buffer busy 32 448992
- 2 gc buffer busy 64 544446
- 2 gc buffer busy 128 202888
- 2 gc buffer busy 256 58584
- 2 gc buffer busy 512 16470
- 2 gc buffer busy 1024 91266
- 2 gc buffer busy 2048 14
- 1 LGWR wait for redo copy 1 278115
- 1 LGWR wait for redo copy 2 3698
- 1 LGWR wait for redo copy 4 8498
- 1 LGWR wait for redo copy 8 220
- 1 LGWR wait for redo copy 16 6
- 1 LGWR wait for redo copy 32 1
- 2 LGWR wait for redo copy 1 7935371
- 2 LGWR wait for redo copy 2 29915
- 2 LGWR wait for redo copy 4 58179
- 2 LGWR wait for redo copy 8 2472
- 2 LGWR wait for redo copy 16 204
- 2 LGWR wait for redo copy 32 47
This data comes mainly from V$EVENT_HISTOGRAM; it is a useful reference when working out what is actually causing the problem.
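If lfsdiag.sql is not at hand, the core of that output can be approximated with a direct query (a sketch; the event list below is just the subset shown above):
- -- Wait-time distribution of the commit-related events, per RAC instance
- SELECT inst_id, event, wait_time_milli, wait_count
-   FROM gv$event_histogram
-  WHERE event IN ('log file sync', 'log file parallel write',
-                  'log file switch completion', 'LGWR wait for redo copy')
-  ORDER BY inst_id, event, wait_time_milli;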
*** Note that Oracle has a bug where log file sync waits for exactly 1 s, and cursor: pin S wait on X has a bug where it inexplicably waits 10 ms.
How to tune log file sync:
As for the log buffer size, as mentioned above the default is already large enough, but experience (mostly other people's; I have tried it myself with no obvious effect) suggests that increasing it can be considered when waits on the redo allocation latch are frequent. From 9.2 onwards Oracle uses multiple log buffers, which largely relieves the related latch waits (each latch protects its own buffers). 10g added private redo strands, where each redo allocation latch protects its own strand, together with the concept of IMU (in-memory undo), so contention on log buffer related latches is now rarely seen. The number of child latches on an instance can be checked as shown below.
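A minimal sketch, counting the child latches that protect the public log buffer strands and private strands:
- -- One child latch per public log buffer strand / private strand
- SELECT name, COUNT(*) AS child_latches
-   FROM v$latch_children
-  WHERE name IN ('redo allocation', 'redo copy')
-  GROUP BY name;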
Now the related parameters:
10g R2: commit_write
11g: commit_logging and commit_wait
Option | Effect
WAIT (default) | Ensures that the commit returns only after the corresponding redo information is persistent in the online redo log. When the client receives a successful return from this COMMIT statement, the transaction has been committed to durable media. A failure that occurs after a successful write to the log might prevent the success message from returning to the client, in which case the client cannot tell whether or not the transaction committed.
NOWAIT | The commit returns to the client whether or not the write to the redo log has completed. This behavior can increase transaction throughput.
BATCH | The redo information is buffered to the redo log, along with other concurrently executing transactions. When sufficient redo information is collected, a disk write to the redo log is initiated. This behavior is called group commit, as redo information for multiple transactions is written to the log in a single I/O operation.
IMMEDIATE (default) | LGWR writes the transaction's redo information to the log. Because this option forces a disk I/O, it can reduce transaction throughput.
The table above is from the online documentation. If you do not care about the D in ACID (durability), that is, you accept the risk of losing data after an instance crash, you can certainly use NOWAIT; but I have not yet seen a system change this parameter - everyone leaves it at the default.
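For completeness, this is roughly how the behavior would be changed, at the session level or per statement (a sketch only; NOWAIT/BATCH trades durability for throughput, so test it with care):
- -- Session level, using the 10gR2 COMMIT_WRITE parameter
- ALTER SESSION SET commit_write = 'BATCH,NOWAIT';
-
- -- Or per statement
- COMMIT WRITE BATCH NOWAIT;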
In many cases tuning purely at the database level does not get very far, while tuning at the application level works much better - but getting the application team to change code is extremely difficult.
1. PL/SQL, including log file sync caused by stored procedures. Look at the example below:
- SW_SID STATE SW_EVENT SEQ# SEC_IN_WAIT SQL_ID SQL_TEXT
- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- 2962 WAITING log file sync 60440 0 773b3nqmcvwf4 BEGIN P_MS_UPDATE_SENDTASK (:1,:2,:3,:4,:5,:6,:7,:8,:9,:10,:11,:12,:13,:14,:15,:16,:17); END;
- 2962 WAITING log file sync 60440 0 773b3nqmcvwf4 BEGIN P_MS_UPDATE_SENDTASK(:1,:2,:3,:4,:5,:6,:7,:8,:9,:10,:11,:12,:13,:14,:15,:16,:17); END;
For this kind of stored procedure, which mixes updates, inserts and deletes and processes a fairly large volume each time, the only real option is to restructure and spread out the business logic so that each commit covers a reasonable batch of work, roughly as sketched below.
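A minimal sketch of the idea (table and column names here are hypothetical; the point is simply to commit once per batch rather than once per row):
- -- Commit every 1000 rows instead of after every single row
- DECLARE
-   v_count PLS_INTEGER := 0;
- BEGIN
-   FOR r IN (SELECT task_id FROM send_task WHERE status = 'NEW') LOOP
-     UPDATE send_task SET status = 'SENT' WHERE task_id = r.task_id;
-     v_count := v_count + 1;
-     IF MOD(v_count, 1000) = 0 THEN
-       COMMIT;   -- one log file sync per 1000 rows
-     END IF;
-   END LOOP;
-   COMMIT;       -- final commit for the remainder
- END;
- /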
On the CPU side:
Another possibility is to optimize on the CPU side. LGWR consumes a lot of CPU; if the scheduling delay LGWR experiences is severe, LGWR can be raised to a higher OS priority.
On the I/O side:
If your storage I/O is the bottleneck, the log file parallel write event will stand out, so the fix has to come at the storage level: raw devices, ASM, faster storage devices, and so on.
Below is the log file sync curve of the system running the BEGIN...END block above; when this procedure is executed heavily, the average wait time increases, as shown in the figure:
