Causes and fixes for multipath faulty devices in OpenStack (part 1)

Date: 2019-12-14

Tags: openstack, multipath, faulty device, causes, fixes

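Copyright: this article is jointly owned by the author and cnblogs (博客園). Reposting is welcome, but unless the author agrees otherwise this notice must be kept and a link to the original must be placed prominently on the article page. For questions, email: wangxu198709@gmail.com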

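Introduction:

Multipath: this multipathing software is widely used on Linux. Its job is to aggregate the multiple paths that lead to a single block device into one multipath device. It serves two main purposes:

Path redundancy: in Active/Passive mode only half of the paths carry I/O. If anything on the I/O path fails (a switch, a cable, the backend storage, and so on), I/O switches automatically to the standby paths, and the applications above are largely unaware of the failure.

Improved performance: in Active/Active mode all paths can carry I/O (for example in round-robin mode), which can raise I/O throughput or lower latency.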
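The two modes can be illustrated with a toy model. This is only a sketch of the idea, not how dm-multipath is implemented; the class name MultipathDevice and the path names sda/sdb are made up for the example.

```python
class MultipathDevice:
    """Toy model of one multipath device built from several paths."""

    def __init__(self, active_paths, standby_paths=()):
        self.active = list(active_paths)    # paths that normally carry I/O
        self.standby = list(standby_paths)  # used only if all active paths fail
        self.failed = set()
        self._cursor = 0                    # round-robin position

    def fail(self, path):
        """Mark a path as unusable."""
        self.failed.add(path)

    def _usable(self, paths):
        return [p for p in paths if p not in self.failed]

    def next_path(self):
        """Pick the path for the next I/O request."""
        candidates = self._usable(self.active) or self._usable(self.standby)
        if not candidates:
            raise OSError("all paths are down")
        path = candidates[self._cursor % len(candidates)]
        self._cursor += 1
        return path


# Active/Active: both paths share the I/O in round-robin order.
active_active = MultipathDevice(["sda", "sdb"])

# Active/Passive: sdb is only used after sda has failed.
active_passive = MultipathDevice(["sda"], ["sdb"])
```

In the Active/Passive case the caller keeps asking for next_path() and never needs to know that a failover happened, which is the "largely unaware" property described above.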

multipath itself is not the focus of this article. For more background, see: https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/dm_multipath/setup_overview

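Installation and usage:

On Debian/Ubuntu, multipath can be installed with sudo apt-get install multipath-tools; on RedHat/CentOS, with sudo yum install device-mapper-multipath.

multipath.conf: multipath ships default configurations for the mainstream storage arrays and supports many of their built-in features, such as ALUA. Users can of course create /etc/multipath.conf by hand after installation.

Below is the reference configuration for VNX/Unity (from the VNX Cinder driver):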

blacklist {
    # Skip the files under /dev that are definitely not FC/iSCSI devices
    # Different system may need different customization
    devnode "^(ram|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*"
    devnode "^hd[a-z][0-9]*"
    devnode "^cciss!c[0-9]d[0-9]*[p[0-9]*]"

    # Skip LUNZ device from VNX
    device {
        vendor "DGC"
        product "LUNZ"
        }
}

defaults {
    user_friendly_names no
    flush_on_last_del yes
}

devices {
    # Device attributed for EMC CLARiiON and VNX series ALUA
    device {
        vendor "DGC"
        product ".*"
        product_blacklist "LUNZ"
        path_grouping_policy group_by_prio
        path_selector "round-robin 0"
        path_checker emc_clariion
        features "1 queue_if_no_path"
        hardware_handler "1 alua"
        prio alua
        failback immediate
    }
}
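The combination of path_grouping_policy group_by_prio and failback immediate in the configuration above can be pictured roughly as follows. This is a simplified sketch, not multipathd's code: the function names are invented for the example and the priority values 50 and 10 are illustrative stand-ins for "optimized" and "non-optimized" ALUA paths.

```python
from collections import defaultdict


def group_by_prio(paths):
    """Group (name, prio, is_up) tuples into path groups, highest prio first."""
    groups = defaultdict(list)
    for name, prio, is_up in paths:
        groups[prio].append((name, is_up))
    return [(prio, groups[prio]) for prio in sorted(groups, reverse=True)]


def select_group(groups):
    """Return (prio, usable paths) for the best group that has a usable path.

    Re-evaluating this on every path state change approximates
    "failback immediate": as soon as a higher-priority path is back,
    its group is chosen again.
    """
    for prio, members in groups:
        usable = [name for name, is_up in members if is_up]
        if usable:
            return prio, usable
    return None


# Hypothetical path list: two optimized paths, two non-optimized ones.
paths = [("sdb", 50, True), ("sdc", 10, True),
         ("sdd", 50, True), ("sde", 10, True)]
best = select_group(group_by_prio(paths))
```

With all paths up, I/O is spread (round-robin 0) over the paths of the selected group only; the lower-priority group is used when every path in the better group is down.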

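Multipath in OpenStack and how faulty devices arise:

In OpenStack, multipath can be used on Nova and Cinder nodes to provide highly available access to the backend storage. Early on, this code lived separately in the Nova and Cinder projects; over time, to ease maintenance, it was split out into its own project: os-brick

Two important interfaces in os-brick are connect_volume, which connects to a LUN or disk on the storage, and disconnect_volume, which disconnects from a LUN on the storage.

What is a faulty device

When the multipath software on the host finds that a host path is no longer accessible, it shows that path in the faulty state.

For a description of all the states, see: https://en.wikipedia.org/wiki/Linux_DM_Multipath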
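As a rough aid for spotting such devices, the sketch below scans the text printed by multipath -ll and reports the maps whose paths are all faulty. It assumes the output layout shown later in this article; the regular expressions and the function name faulty_maps are my own, and the second map in the sample is a made-up healthy device added for contrast.

```python
import re

# Map header, e.g. "3600...e711 dm-30 DGC,VRAID" (optionally "alias (wwid) dm-N ...").
MAP_RE = re.compile(r"^(?P<name>\S+)\s+(?:\(\S+\)\s+)?(?P<dm>dm-\d+)\s")
# Path line, e.g. "| `- 11:0:0:151 sdef 128:112 failed faulty running".
PATH_RE = re.compile(
    r"(?P<hctl>\d+:\d+:\d+:\d+|#:#:#:#)\s+(?P<dev>\S+)\s+(?P<majmin>\S+)\s+"
    r"(?P<dm_state>\S+)\s+(?P<checker>\S+)\s+(?P<online>\S+)"
)


def faulty_maps(output):
    """Return the names of multipath maps whose paths are all 'faulty'."""
    maps = {}
    current = None
    for line in output.splitlines():
        header = MAP_RE.match(line)
        if header:
            current = header.group("name")
            maps[current] = []
            continue
        path = PATH_RE.search(line)
        if path and current is not None:
            maps[current].append(path.group("checker"))
    return [name for name, states in maps.items()
            if states and all(state == "faulty" for state in states)]


SAMPLE = r"""
3600601601290380036a00936cf13e711 dm-30 DGC,VRAID
size=2.0G features='1 retain_attached_hw_handler' hwhandler='1 alua' wp=rw
|-+- policy='round-robin 0' prio=0 status=active
| `- 11:0:0:151 sdef 128:112 failed faulty running
`-+- policy='round-robin 0' prio=0 status=enabled
  `- 12:0:0:151 sdeg 128:128 failed faulty running
36006016000000000000000000000abcd dm-3 DGC,VRAID
size=1.0G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
`-+- policy='round-robin 0' prio=50 status=active
  `- 13:0:0:7 sdb 8:16 active ready running
"""

print(faulty_maps(SAMPLE))
```

On a real host the input could come from subprocess.run(["multipath", "-ll"], capture_output=True, text=True).stdout, which typically needs root privileges.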

For the os-brick code I chose a fairly early version that is prone to producing faulty devices: https://github.com/openstack/os-brick/blob/liberty-eol/os_brick/initiator/connector.py

1. The main logic of connect_volume is as follows:

    @synchronized('connect_volume')
    def connect_volume(self, connection_properties):
        """Attach the volume to instance_name.
        connection_properties for iSCSI must include:
        target_portal(s) - ip and optional port
        target_iqn(s) - iSCSI Qualified Name
        target_lun(s) - LUN id of the volume
        Note that plural keys may be used when use_multipath=True
        """

        device_info = {'type': 'block'}

        if self.use_multipath:
            # Multipath installed, discovering other targets if available
            try:
                ips_iqns = self._discover_iscsi_portals(connection_properties)
            except Exception:
                raise exception.TargetPortalNotFound(
                    target_portal=connection_properties['target_portal'])

            if not connection_properties.get('target_iqns'):
                # There are two types of iSCSI multipath devices. One which
                # shares the same iqn between multiple portals, and the other
                # which use different iqns on different portals.
                # Try to identify the type by checking the iscsiadm output
                # if the iqn is used by multiple portals. If it is, it's
                # the former, so use the supplied iqn. Otherwise, it's the
                # latter, so try the ip,iqn combinations to find the targets
                # which constitutes the multipath device.
                main_iqn = connection_properties['target_iqn']
                all_portals = set([ip for ip, iqn in ips_iqns])
                match_portals = set([ip for ip, iqn in ips_iqns
                                     if iqn == main_iqn])
                if len(all_portals) == len(match_portals):
                    ips_iqns = zip(all_portals, [main_iqn] * len(all_portals))

            for ip, iqn in ips_iqns:
                props = copy.deepcopy(connection_properties)
                props['target_portal'] = ip
                props['target_iqn'] = iqn
                self._connect_to_iscsi_portal(props)

            self._rescan_iscsi()
            host_devices = self._get_device_path(connection_properties)
        else:
            target_props = connection_properties
            for props in self._iterate_all_targets(connection_properties):
                if self._connect_to_iscsi_portal(props):
                    target_props = props
                    break
                else:
                    LOG.warning(_LW(
                        'Failed to connect to iSCSI portal %(portal)s.'),
                        {'portal': props['target_portal']})

            host_devices = self._get_device_path(target_props)

        # The /dev/disk/by-path/... node is not always present immediately
        # TODO(justinsb): This retry-with-delay is a pattern, move to utils?
        tries = 0
        # Loop until at least 1 path becomes available
        while all(map(lambda x: not os.path.exists(x), host_devices)):
            if tries >= self.device_scan_attempts:
                raise exception.VolumeDeviceNotFound(device=host_devices)

            LOG.warning(_LW("ISCSI volume not yet found at: %(host_devices)s. "
                            "Will rescan & retry.  Try number: %(tries)s."),
                        {'host_devices': host_devices,
                         'tries': tries})

            # The rescan isn't documented as being necessary(?), but it helps
            if self.use_multipath:
                self._rescan_iscsi()
            else:
                if (tries):
                    host_devices = self._get_device_path(target_props)
                self._run_iscsiadm(target_props, ("--rescan",))

            tries = tries + 1
            if all(map(lambda x: not os.path.exists(x), host_devices)):
                time.sleep(tries ** 2)
            else:
                break

        if tries != 0:
            LOG.debug("Found iSCSI node %(host_devices)s "
                      "(after %(tries)s rescans)",
                      {'host_devices': host_devices, 'tries': tries})

        # Choose an accessible host device
        host_device = next(dev for dev in host_devices if os.path.exists(dev))

        if self.use_multipath:
            # We use the multipath device instead of the single path device
            self._rescan_multipath()
            multipath_device = self._get_multipath_device_name(host_device)
            if multipath_device is not None:
                host_device = multipath_device
                LOG.debug("Unable to find multipath device name for "
                          "volume. Only using path %(device)s "
                          "for volume.", {'device': host_device})

        device_info['path'] = host_device
        return device_info

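The important logic, marked in red in the original post, is the part that discovers the block devices on the host (the iSCSI rescan and the multipath rescan).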
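The wait loop in the middle of connect_volume (rescan, then sleep for tries ** 2 seconds) is a generic retry pattern. Stripped of the iSCSI details it looks roughly like the sketch below; the helper name wait_for_any_path and its injectable exists/sleep parameters are mine, added so the loop can be exercised without real devices.

```python
import os
import time


def wait_for_any_path(paths, max_attempts, rescan,
                      exists=os.path.exists, sleep=time.sleep):
    """Wait until at least one of `paths` exists, rescanning in between.

    Returns (first existing path, number of rescans performed).
    Raises LookupError once `max_attempts` rescans have not helped.
    """
    tries = 0
    while not any(exists(p) for p in paths):
        if tries >= max_attempts:
            raise LookupError("no device found at %s" % (paths,))
        rescan()
        tries += 1
        if not any(exists(p) for p in paths):
            # Quadratic backoff as in the listing above: 1s, 4s, 9s, ...
            sleep(tries ** 2)
    return next(p for p in paths if exists(p)), tries
```

The point that matters for the rest of this article is that every pass through this loop triggers a rescan, and a rescan picks up any LUN the array still exposes to the host, not only the one being attached.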

2. The disconnect_volume logic is as follows:

    @synchronized('connect_volume')
    def disconnect_volume(self, connection_properties, device_info):
        """Detach the volume from instance_name.
        connection_properties for iSCSI must include:
        target_portal(s) - IP and optional port
        target_iqn(s) - iSCSI Qualified Name
        target_lun(s) - LUN id of the volume
        """
        if self.use_multipath:
            self._rescan_multipath()
            host_device = multipath_device = None
            host_devices = self._get_device_path(connection_properties)
            # Choose an accessible host device
            for dev in host_devices:
                if os.path.exists(dev):
                    host_device = dev
                    multipath_device = self._get_multipath_device_name(dev)
                    if multipath_device:
                        break
            if not host_device:
                LOG.error(_LE("No accessible volume device: %(host_devices)s"),
                          {'host_devices': host_devices})
                raise exception.VolumeDeviceNotFound(device=host_devices)

            if multipath_device:
                device_realpath = os.path.realpath(host_device)
                self._linuxscsi.remove_multipath_device(device_realpath)
                return self._disconnect_volume_multipath_iscsi(
                    connection_properties, multipath_device)

        # When multiple portals/iqns/luns are specified, we need to remove
        # unused devices created by logging into other LUNs' session.
        for props in self._iterate_all_targets(connection_properties):
            self._disconnect_volume_iscsi(props)

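The code marked in red in the original post removes the host paths that correspond to the LUN, both from the kernel and from the multipath mapper.

3. Race condition analysis

Note that both interfaces take the same lock, named connect_volume (it is in fact a Linux file lock implemented with flock).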
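To make the locking behaviour concrete, here is a minimal stand-in for a named-lock decorator. It is an assumption-laden sketch: it uses an in-process threading.Lock rather than the lock implementation os-brick relies on, and the two decorated functions are empty placeholders that only record when they start and finish.

```python
import functools
import threading
import time
from collections import defaultdict

_locks = defaultdict(threading.Lock)   # one lock per name
_registry_guard = threading.Lock()     # protects the dict itself


def synchronized(name):
    """Serialize every function decorated with the same lock name."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            with _registry_guard:
                lock = _locks[name]
            with lock:
                return func(*args, **kwargs)
        return wrapper
    return decorator


events = []


@synchronized('connect_volume')
def connect_volume(volume):
    events.append(('connect start', volume))
    time.sleep(0.01)                   # pretend to rescan
    events.append(('connect end', volume))


@synchronized('connect_volume')
def disconnect_volume(volume):
    events.append(('disconnect start', volume))
    time.sleep(0.01)                   # pretend to remove devices
    events.append(('disconnect end', volume))
```

Because both functions share one lock name, a connect and a disconnect never interleave on the host. What the lock does not cover is anything that happens on the storage array after disconnect_volume returns.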

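To make it easier to describe how a faulty device arises, I drew a diagram of the relationship between the two interfaces (in the original post: connect_volume on the left; disconnect_volume followed by terminate_connection on the right).

Without concurrency this flow behaves correctly: devices on the host are connected and cleaned up as expected.

However, the logic has an implementation flaw: under high concurrency it produces faulty devices. Consider the following order of execution:

1) disconnect_volume on the right finishes. The device paths for the LUN (visible under /dev/disk/by-path) and the multipath descriptor (visible with multipath -l) have been removed from the host.
2) At this point the connect_volume lock is released and connect_volume on the left starts to run, while terminate_connection on the right has not run yet. In other words, the storage has not yet revoked the host's access to the LUN, so any SCSI rescan on the host will still discover a device for this LUN.
3) connect_volume then proceeds normally. The iSCSI rescan and the multipath rescan both run, so the device that was deleted in step 1) is scanned back in.
4) Finally, terminate_connection on the right completes on the storage and removes the host's access to the LUN. The result is a so-called faulty device, and the multipath output looks like this (both multipath descriptors are faulty):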

$ sudo multipath -ll

3600601601290380036a00936cf13e711 dm-30 DGC,VRAID
size=2.0G features='1 retain_attached_hw_handler' hwhandler='1 alua' wp=rw
|-+- policy='round-robin 0' prio=0 status=active
| `- 11:0:0:151 sdef 128:112 failed faulty running
`-+- policy='round-robin 0' prio=0 status=enabled
  `- 12:0:0:151 sdeg 128:128 failed faulty running

3600601601bd032007c097518e96ae411 dm-2 DGC,VRAID
size=1.0G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
|-+- policy='round-robin 0' prio=0 status=active
  `- #:#:#:# -   #:#   active faulty running
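The four steps can be replayed deterministically with a toy model. Everything here is hypothetical scaffolding (the Array and Host classes, the export method, the LUN numbers 151 and 152); it only mirrors the ordering problem described above and is not os-brick or Cinder code.

```python
class Array:
    """Toy storage array: remembers which LUNs the host may access."""

    def __init__(self):
        self.exported = set()

    def export(self, lun):
        self.exported.add(lun)

    def terminate_connection(self, lun):
        self.exported.discard(lun)


class Host:
    """Toy host: a rescan creates a device for every LUN exported right now."""

    def __init__(self, array):
        self.array = array
        self.devices = set()

    def connect_volume(self, lun):
        # iSCSI rescan + multipath rescan pick up *all* visible LUNs.
        self.devices |= self.array.exported

    def disconnect_volume(self, lun):
        self.devices.discard(lun)

    def faulty_devices(self):
        # Devices that still exist on the host but lost their backing LUN.
        return self.devices - self.array.exported


def run(order):
    array = Array()
    host = Host(array)
    array.export(151)
    host.connect_volume(151)           # volume 151 is attached
    array.export(152)                  # volume 152 is about to be attached
    host.disconnect_volume(151)        # step 1: host-side cleanup of 151
    if order == "race":
        host.connect_volume(152)       # steps 2-3: rescan brings 151 back
        array.terminate_connection(151)  # step 4: array revokes access
    else:
        array.terminate_connection(151)  # array revokes access first
        host.connect_volume(152)       # rescan no longer sees 151
    return host.faulty_devices()


print(run("race"))   # the device for LUN 151 is left behind
print(run("safe"))   # nothing is left behind
```

The only difference between the two runs is whether the array-side terminate_connection happens before or after the next host rescan, which is exactly the window that the connect_volume lock does not protect.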