[Troubleshooting] Queue Wait Case: enq: IV - contention


1.1  Blog Document Structure Diagram

[Figure: blog document structure diagram]

1.2  Preface

1.2.1  Introduction and Notes

Dear technical enthusiasts, after reading this article you will have mastered the following skill, and you may also pick up some other knowledge you did not know before. ~O(∩_∩)O~

Queue wait case: enq: IV - contention (the main topic)

Tips

This article is published simultaneously on itpub (http://blog.itpub.net/26736162), cnblogs (http://www.cnblogs.com/lhrbest), and the WeChat public account (xiaomaimiaolhr).

All code, related software, and materials used in this article, as well as the PDF version of this article, can be downloaded from 小麥苗's cloud drive; the cloud drive address is: http://blog.itpub.net/26736162/viewspace-1624453/

If the code formatting of the web version is garbled, please download the PDF version to read.

In this blog, code output is generally placed in a one-row, one-column table.

If there are any errors or incomplete parts in this article, please point them out via an ITPUB comment or QQ; your feedback is my greatest motivation for writing.

 

 

1.3  Fault Analysis and Resolution

 

1.3.1  Database Environment

 

Item                       source db
-------------------------  ----------------------------------------------
db type                    RAC
db version                 12.1.0.2.0
db storage                 ASM
OS version / kernel        SuSE Linux Enterprise Server (SLES 11), 64-bit

 

1.3.2  AWR Analysis

[Figure: AWR report summary, including the Up Time (hrs) column for each instance]

Here is a brief analysis of Up Time (hrs); the other metrics should already be familiar. The "Up Time (hrs)" column in the table is the number of hours from instance startup to the end of this snapshot. For example, in this AWR report instance 1 started at "2016-08-11 20:51" and the snapshot ended at "2016-12-14 21:00", so "Up Time (hrs)" corresponds to about 125.006 days, i.e. roughly 3000.14 hours, as shown below:

SYS@lhrdb> SELECT trunc(UP_TIME_D,3),  trunc(trunc(UP_TIME_D,3)*24,2) UP_TIME_HRS FROM (SELECT (TO_DATE('2016-12-14 21:00', 'YYYY-MM-DD HH24:MI') - TO_DATE('2016-08-11 20:51', 'YYYY-MM-DD HH24:MI'))  UP_TIME_D FROM DUAL);

 

TRUNC(UP_TIME_D,3) UP_TIME_HRS

------------------ -----------

           125.006     3000.14
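
The same figure can also be read directly from the running instances instead of being derived from the snapshot dates. A minimal sketch against GV$INSTANCE (note that this measures uptime to the current time rather than to the snapshot end):

-- uptime per instance, in days and hours, computed from the startup time
SELECT INST_ID,
       STARTUP_TIME,
       TRUNC(SYSDATE - STARTUP_TIME, 3)        AS UP_TIME_DAYS,
       TRUNC((SYSDATE - STARTUP_TIME) * 24, 2) AS UP_TIME_HRS
  FROM GV$INSTANCE
 ORDER BY INST_ID;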

 

It can be seen that node 1 carries the heavier load, and the ADDM report shows quite a few unusual wait events. Next, look at the wait event section:

[Figure: AWR Top Foreground Events, dominated by enq: IV - contention and DFS lock handle]

It can be seen that the enq: IV - contention and DFS lock handle waits are the most significant. One thing worth noting about the name "enq: IV - contention": in the AWR report there is a single space on each side of the "-", whereas the event name recorded in the database has two spaces after the "-", as shown below:

[Figure: the event name as recorded in the database, with two spaces after the "-"]

So take care with the spacing when searching for this event by name.

[Figure: ASH data for the enq: IV - contention waits, showing the p1 value]

Decoding the p1 value recorded in ASH gives:

SYS@lhrdb> SELECT CHR(BITAND(1230372867, -16777216) / 16777215) ||

  2         CHR(BITAND(1230372867, 16711680) / 65535) "LOCK",

  3         BITAND(1230372867, 65535) "MODE"

  4    FROM DUAL;

 

LO       MODE

-- ----------

IV          3
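
The same decoding can be applied to every enqueue wait sampled in ASH instead of to a single hard-coded p1 value. A minimal sketch against V$ACTIVE_SESSION_HISTORY (no time filter is applied here; add one on SAMPLE_TIME if needed):

-- decode enqueue type and requested mode for all 'enq:%' waits sampled in ASH
SELECT EVENT,
       CHR(BITAND(P1, -16777216) / 16777215) ||
       CHR(BITAND(P1, 16711680) / 65535)        AS LOCK_TYPE,
       BITAND(P1, 65535)                        AS LOCK_MODE,
       COUNT(*)                                 AS SAMPLES
  FROM V$ACTIVE_SESSION_HISTORY
 WHERE EVENT LIKE 'enq:%'
 GROUP BY EVENT,
          CHR(BITAND(P1, -16777216) / 16777215) ||
          CHR(BITAND(P1, 16711680) / 65535),
          BITAND(P1, 65535)
 ORDER BY SAMPLES DESC;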

 

 

1.3.3  Resolving enq: IV - contention

    SELECT *

      FROM V$EVENT_NAME A

     WHERE A.NAME LIKE '%enq: IV -  contention%';

[Figure: V$EVENT_NAME output for enq: IV - contention]

This wait event is new in 12c; compared with 11g, 12c has roughly 500 additional wait events. The problem is covered in the MOS note 12c RAC DDL Performance Issue: High "enq: IV - contention" etc if CPU Count is Different (Doc ID 2028503.1); the relevant part of the note is quoted after the sketch below.
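
The claim about the number of wait events can be checked with a simple count against V$EVENT_NAME on each release; a minimal sketch (grouping by wait class is only for readability):

-- count the wait events defined in the current release; run it on 11g and 12c
-- to see the difference mentioned above
SELECT WAIT_CLASS, COUNT(*) AS EVENT_COUNT
  FROM V$EVENT_NAME
 GROUP BY WAIT_CLASS
 ORDER BY EVENT_COUNT DESC;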


The fix will be included in future PSU, patch exists for certain platform/version.

The workaround is to set the following parameter to the highest value in the cluster and restart:

_ges_server_processes

To find out the highest value, run the following grep on each node:

ps -ef| grep lmd
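
For reference, applying the workaround from the note might look like the following sketch. The value 3 is only an assumed example; it must be replaced with the highest LMD process count actually observed with ps -ef | grep lmd across the nodes, and the instances must be restarted afterwards:

-- hidden parameter: set it to the highest LMD process count in the cluster
-- (3 here is only an assumed example value), then restart the instances
ALTER SYSTEM SET "_ges_server_processes" = 3 SCOPE = SPFILE SID = '*';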

 

This wait event was mainly caused by the cpu_count parameter being inconsistent between the two nodes of the 12c RAC.

From the AWR report it can be seen that node 1 has 48 CPUs while node 2 has 96:

[Figure: AWR report showing 48 CPUs on node 1 and 96 CPUs on node 2]
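
A quick cross-instance check straight from the running instances (a minimal sketch) shows the same discrepancy:

-- compare CPU-related parameters across all instances
SELECT INST_ID, NAME, VALUE
  FROM GV$PARAMETER
 WHERE NAME IN ('cpu_count', 'resource_manager_cpu_allocation')
 ORDER BY INST_ID, NAME;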

The change history of the CPU_COUNT parameter can be seen in dba_hist_parameter:

 

SQL> SHOW PARAMETER CPU  

 

NAME                                 TYPE        VALUE

------------------------------------ ----------- ------------------------------

cpu_count                            integer     96

parallel_threads_per_cpu             integer     2

resource_manager_cpu_allocation      integer     96

 

 

SQL>  select snap_id, INSTANCE_NUMBER,PARAMETER_NAME,VALUE from dba_hist_parameter where PARAMETER_NAME='cpu_count' order by snap_id;

 

   SNAP_ID INSTANCE_NUMBER PARAMETER_NAME                                                   VALUE

---------- --------------- ---------------------------------------------------------------- ------

......

      3368               1 cpu_count                                                        48

      3369               1 cpu_count                                                        48

      3369               2 cpu_count                                                        48

      3370               1 cpu_count                                                        48

      3371               1 cpu_count                                                        48

      3372               1 cpu_count                                                        48

      3373               1 cpu_count                                                        48

      3374               1 cpu_count                                                        48

      3375               2 cpu_count                                                        96

      3375               1 cpu_count                                                        48

      3376               1 cpu_count                                                        48

      3376               2 cpu_count                                                        96

      3377               1 cpu_count                                                        48

      3377               2 cpu_count                                                        96

      3378               2 cpu_count                                                        96

      3378               1 cpu_count                                                        48

      3379               1 cpu_count                                                        48

      3379               2 cpu_count                                                        96

......

 

 

 

The CPU change can also be found by searching the alert log: more alert*|grep -i Cpu

The customer confirmed that they had adjusted the system's CPU resources, which caused CPU_COUNT on node 2 to change automatically and triggered the enq: IV - contention waits.

If the number of CPUs on the host changes, the value of the database cpu_count parameter changes with it after the host is restarted; this parameter is operating-system dependent.

After the host CPU counts were adjusted, the wait event disappeared.
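
To verify the effect after the change, the cumulative statistics for the event can be watched per instance; a minimal sketch:

-- cumulative waits and time waited for the IV enqueue, per instance
SELECT INST_ID,
       EVENT,
       TOTAL_WAITS,
       ROUND(TIME_WAITED_MICRO / 1e6, 2) AS TIME_WAITED_SECONDS
  FROM GV$SYSTEM_EVENT
 WHERE EVENT LIKE 'enq: IV%'
 ORDER BY INST_ID;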

 

1.4  MOS

1.4.1  12c RAC DDL Performance Issue: High "enq: IV - contention" etc if CPU Count is Different (Doc ID 2028503.1)


 

In this Document

  Symptoms
  Cause
  Solution
  References

 

APPLIES TO:

Oracle Database - Enterprise Edition - Version 12.1.0.1 to 12.1.0.2 [Release 12.1]
Information in this document applies to any platform.

SYMPTOMS

12c RAC database seeing high "enq: IV - contention":

From awr report:

Top 10 Foreground Events by Total Wait Time
===========================================
Event                    Waits   Total Wait Time (sec)   Wait Avg(ms)   % DB time   Wait Class
enq: IV - contention    52,914                  1688.4          31.91        42.8   Other
row cache lock          44,865                   896.8          19.99        22.7   Concurrency

 

tkprof of 10046 trace of SQL statement shows the same event in the top:

Event waited on                             Times Waited   Max. Wait   Total Waited
----------------------------------------    ------------   ---------   ------------
enq: IV - contention                                6017        0.32         287.68
row cache lock                                       957        0.20           7.48
library cache lock                                   631        0.13          15.10
library cache pin                                    616        0.11          14.54

 

 

CAUSE

Cluster nodes have different CPU count resulting in different number of LMD processes, on one node it has two while on the other it has three.

The issue is due to the following bug:

BUG 21293056 - PERFORMANCE DEGRADATION OF GRANT STATEMENT AFTER 12C UPGRADE

Which is closed as duplicate of:

BUG 21395269 - ASYMMETRICAL LMDS CONFIGURATION IN A CLUSTER LEADS TO POOR MESSAGE TRANSFER

 

 

SOLUTION

The fix will be included in future PSU, patch exists for certain platform/version.

The workaround is to set the following parameter to the highest value in the cluster and restart:

_ges_server_processes

To find out the highest value, run the following grep on each node:

ps -ef| grep lmd

 

 

About Me

...............................................................................................................................

Author: 小麥苗, focused solely on database technology, with an emphasis on how that technology is applied.

This article is updated simultaneously on itpub (http://blog.itpub.net/26736162), cnblogs (http://www.cnblogs.com/lhrbest), and my personal WeChat public account (xiaomaimiaolhr).

itpub address of this article: http://blog.itpub.net/26736162/viewspace-2131320/

cnblogs address of this article: http://www.cnblogs.com/lhrbest/p/6218042.html

PDF version of this article on 小麥苗's cloud drive: http://blog.itpub.net/26736162/viewspace-1624453/

● QQ group: 230161599     WeChat group: contact me privately

To contact me, add me as a QQ friend (642808185) and state your reason for adding.

Written at Agricultural Bank of China (農行) between 2016-09-01 15:00 and 2016-10-20 19:00.

The content of this article comes from 小麥苗's study notes, with parts compiled from the internet; please forgive any infringement or impropriety.

All rights reserved. You are welcome to share this article; please keep the source when reposting.

...............................................................................................................................

Long-press the image below on your phone, or scan the QR code with the WeChat client, to follow 小麥苗's WeChat public account (xiaomaimiaolhr) and learn the most practical database techniques for free.

[Figure: QR code for the WeChat public account xiaomaimiaolhr]

 
