緩解 SQL Server has encountered 727 occurrence(s) of I/O requests taking longer than 15 seconds

sql server 會記錄IO等待時間超過15 seconds的請求,這時application會有 time out 現象,dba須要判斷是workload,concurrecy 所致仍是sql server配置沒有最優致使?兩者之間的比例各佔多少?sql

1:判斷sql server的配置是否最優化,相對容易,能夠有如下項目檢查:app

  a: 經過datafile/logfile放置在不一樣的drive上,能夠分離random io 和 sequence io 操做,能夠極大的緩解讀寫IO,對於寫IO,只需當即作logfile 的 sequence io便可,至於什麼時候把data page 真正寫入到datafile,dom

    sql server本身把握。根據我本身的觀察,分離datafile/logfile,一個月有個2到3次的 IO stall, 若是都在同一個磁盤,有個8到9次 IO stall 也很常見。ide

  b: 儘可能合理的設置[max server memory (MB)], 通常爲總內存的7/8便可。優化

  c: tempdb最好在單獨的磁盤上,若是不行的話,也要作datafile/logfile分離。spa

 

2:若是出現了問題,那如何判斷是workload致使的呢?第一要作好IO activity的統計工做,第二要通知開發,是不是業務的變化致使IO增多,困難是這個IO stall 到底涉及到哪些表?如何獲得?code

  a: 經過如下的sp, 利用sys.dm_io_virtual_file_stats DMV 記錄每一分鐘sql server IO讀寫操做server

    

USE [master]
GO
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
create proc [dbo].[usp_monitor_virtual_file_latency]
as
begin
 if OBJECT_ID('master.dbo.tbl_virtual_file_latency','U') is null
 begin
  create table master.dbo.tbl_virtual_file_latency
  (
   id int identity(1,1) primary key,
   collect_dt datetime not null,
   [database_name] [nvarchar](128) NULL,
   [file_id] [smallint] NOT NULL,
   [type_desc] [nvarchar](60) NULL,
   [sample_ms] [int] NOT NULL,
   [num_of_reads] [bigint] NOT NULL,
   [num_of_bytes_read] [bigint] NOT NULL,
   [io_stall_read_ms] [bigint] NOT NULL,
   [num_of_writes] [bigint] NOT NULL,
   [num_of_bytes_written] [bigint] NOT NULL,
   [io_stall_write_ms] [bigint] NOT NULL,
   [io_stall] [bigint] NOT NULL,
   [size_on_disk_bytes] [bigint] NOT NULL,
   [file_handle] [varbinary](8) NOT NULL,
   [physical_name] [nvarchar](260) NOT NULL,
   [state_desc] [nvarchar](60) NULL
  )
  create index idx_collect_dt on tbl_virtual_file_latency(collect_dt) with(online=on)
 end
 
 insert into master.dbo.tbl_virtual_file_latency
 SELECT 
    --virtual file latency
  GETDATE(),
  db_name(vfs.[database_id]) as database_name
      ,vfs.[file_id]
      ,mf.[type_desc]
      ,vfs.[sample_ms]
      ,vfs.[num_of_reads]
      ,vfs.[num_of_bytes_read]
      ,vfs.[io_stall_read_ms]
      ,vfs.[num_of_writes]
      ,vfs.[num_of_bytes_written]
      ,vfs.[io_stall_write_ms]
      ,vfs.[io_stall]
      ,vfs.[size_on_disk_bytes]
      ,vfs.[file_handle]
      ,mf.[physical_name]
      ,mf.[state_desc]

FROM sys.dm_io_virtual_file_stats (NULL,NULL) AS [vfs]
JOIN sys.master_files(nolock) AS [mf]
    ON [vfs].[database_id] = [mf].[database_id]
    AND [vfs].[file_id] = [mf].[file_id]
 WHERE DB_NAME(vfs.database_id) not in('master','model','msdb')
 
 
 --delete old data
 
 delete from master.dbo.tbl_virtual_file_latency where collect_dt <DATEADD(MONTH,-2,getdate())
end

 

 

 

  

  b: 經過如下的sp,求每分鐘內每一個數據文件的IO的size和讀寫狀況blog

   

USE [master]
GO
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
create proc [dbo].[usp_calculate_virtual_file_stats_PerMinutes]
@physical_name nvarchar(1000),
@collect_starttime datetime,
@collect_endtime datetime
as
with snap as
 (
 select
  row_number() over (order by id asc) as row_num,
  id
      ,[collect_dt]
      ,[database_name]
      ,[file_id]
      ,[type_desc]
      ,[sample_ms]
      ,[num_of_reads]
      ,[num_of_bytes_read]
      ,[io_stall_read_ms]
      ,[num_of_writes]
      ,[num_of_bytes_written]
      ,[io_stall_write_ms]
      ,[io_stall]
from master.dbo.tbl_virtual_file_latency(nolock) AS [vfs1]
WHERE vfs1.physical_name=@physical_name
and vfs1.collect_dt>=@collect_starttime
and vfs1.collect_dt<=@collect_endtime

)
select cur.collect_dt,
 read_io_num= case when (cur.num_of_reads-pre.num_of_reads)=0 then 0 else cur.num_of_reads-pre.num_of_reads end ,
 write_io_num= case when (cur.num_of_writes-pre.num_of_writes)=0 then 0 else cur.num_of_writes-pre.num_of_writes end ,
 read_latency_MS =case when (cur.num_of_reads-pre.num_of_reads)=0 then 0 else cast((cur.io_stall_read_ms-pre.io_stall_read_ms)*1./(cur.num_of_reads-pre.num_of_reads) as decimal(10,1)) end ,
 write_latency_MS= case when (cur.num_of_writes-pre.num_of_writes)=0 then 0 else cast((cur.io_stall_write_ms-pre.io_stall_write_ms)*1./(cur.num_of_writes-pre.num_of_writes) as decimal(10,1)) end ,
 average_latency_MS=case when (cur.num_of_reads-pre.num_of_reads+cur.num_of_writes-pre.num_of_writes)=0 then 0 else cast((cur.io_stall-pre.io_stall)*1./(cur.num_of_reads-pre.num_of_reads+cur.num_of_writes-pre.num_of_writes) as decimal(10,1)) end,
 averageKB_read=case when (cur.num_of_reads-pre.num_of_reads)=0 then 0 else cast((cur.num_of_bytes_read-pre.num_of_bytes_read)*1./(cur.num_of_reads-pre.num_of_reads)/1024 as decimal(10,1))end ,
 averageKB_write=case when (cur.num_of_writes-pre.num_of_writes)=0 then 0 else cast((cur.num_of_bytes_written-pre.num_of_bytes_written)*1./(cur.num_of_writes-pre.num_of_writes)/1024 as decimal(10,1)) end,
 averageKB_Transfer=case when (cur.num_of_reads-pre.num_of_reads+cur.num_of_writes-pre.num_of_writes)=0 then 0 else cast((cur.num_of_bytes_read-pre.num_of_bytes_read+cur.num_of_bytes_written-pre.num_of_bytes_written)*1./(cur.num_of_reads-pre.num_of_reads+cur.num_of_writes-pre.num_of_writes)/1024 as decimal(10,1)) end
from snap as pre join snap as cur
on pre.row_num=cur.row_num-1

 

 

  

  c: 最後根據歷史數據來看IO讀寫的趨勢ip

    

exec usp_calculate_virtual_file_stats_PerMinutes 
'E:\TestDB\test.mdf',
'2014-12-14 13:49:01.000',
'2014-12-14 15:49:01.000'

 

    

 

3:根據IO stall的信息找出sql server 讀取哪一個表致使了IO Stall.

    sql server log text: 

    SQL Server has encountered 727 occurrence(s) of I/O requests taking longer than 15 seconds to complete  on file [F:\MSSQL\Test_1.ndf:MSSQL_DBCC7] in database [Test] (7). The OS file handle is             0x0000000000001A40. The offset of the latest long I/O is: 0x000023101f0000

 

    

     a: 把偏移量0x000023101f0000轉化爲十進制爲:150594322432

    b: 把該值除以8192即爲page number: 150594322432/8192=18383096

    c:在經過page number求的table name, 再與開發討論IO的問題.

    

select object_name(object_id) as tblname,* 
from sys.partitions (nolock)as sp
where sp.partition_id in (
select au.container_id from 
sys.dm_os_buffer_descriptors(nolock) as obd
join sys.allocation_units(nolock) as au
on obd.allocation_unit_id=au.allocation_unit_id
where obd.page_id=18383096)

 

     

  d: 若是該page number再也不內存中的化,該SQL 是沒有結果的。可使用dbcc page(dbid,fileid,pageid,3)來獲得該表的object_id.

 

結論:IO stall 只可能緩解,是無法避免的,業務量致使的IO永遠在某一點上是超過存儲的容量的,dba能經過各類事前過後方法到找到root cause, 據此提出解決的方法。但願以上的方法能對你們有所啓發。

相關文章
相關標籤/搜索