【DPDK】【Multiprocess】A Pitfall in a DPDK Multi-process Scenario

Date: 2020-01-24


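【Preface】

This problem stayed hidden for nearly three years. In theory, anyone using DPDK multi-process mode can run into it. Whether it actually bites is a matter of luck, and even when nothing breaks, the risk is still there.

Patch: https://patches.dpdk.org/patch/64819/

Discussion of the patch: https://patches.dpdk.org/patch/64526/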

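【Scenario】

Let me first describe how I ran into this problem.

The product lines at my company all use DPDK, to varying degrees, to accelerate network I/O, which I believe is why anyone adopts DPDK in the first place. Several of those product lines also build their business logic on DPDK multi-process mode.

Anyone who has used DPDK knows that it takes NICs away from the regular kernel driver and binds them to its own igb_uio/vfio driver. A side effect is that traditional Unix-style tools such as ifconfig, ip, and ethtool can no longer show the state of a NIC that DPDK has taken over.

For example:

On a regular Linux system, if I want to see why a NIC is dropping packets, its register state, or its features, ethtool alone is enough. With DPDK this no longer works. These tools all fetch their data from the kernel (ethtool reads kernel data through ioctl underneath), but the NIC is now bound to the DPDK driver, so ethtool has nothing left to read.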

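So I wrote a new tool, ethtool-dpdk, specifically for inspecting NIC driver state under DPDK. It is implemented as a secondary process: each time it runs, it attaches to the primary process and picks up the memory shared between primary and secondary. That shared memory includes struct rte_eth_dev_data. Through this structure I can get the PCI BAR and then read register state as "base address + register offset".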
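The "base address + register offset" access can be sketched as below. This is only an illustration: the helper names are hypothetical, and in the self-check an ordinary array stands in for the BAR, whereas the real tool would use the BAR address mapped by the primary process and offsets taken from the NIC datasheet.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: read a 32-bit register located `offset` bytes
 * past the mapped BAR base. `volatile` keeps the compiler from caching
 * or eliding the access, which matters for real device registers.
 */
static inline uint32_t reg_read32(const volatile void *bar_base, size_t offset)
{
    assert(bar_base != NULL);
    assert(offset % sizeof(uint32_t) == 0);   /* 32-bit registers, 4-byte aligned */

    const volatile uint8_t *addr = (const volatile uint8_t *)bar_base + offset;
    return *(const volatile uint32_t *)addr;
}

/* Self-check against a fake BAR: an ordinary array instead of MMIO. */
static uint32_t demo_fake_bar_read(size_t offset)
{
    static uint32_t fake_bar[4] = {
        0x11111111u, 0x22222222u, 0x33333333u, 0x44444444u
    };
    return reg_read32(fake_bar, offset);
}
```

Calling demo_fake_bar_read(8) reads the third 32-bit word of the fake BAR, the same way a real register at offset 8 would be read.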

/**
 * @internal
 * The data part, with no function pointers, associated with each ethernet device.
 *
 * This structure is safe to place in shared memory to be common among different
 * processes in a multi-process configuration. (Note: this structure lives in shared memory.)
 */
struct rte_eth_dev_data {
    char name[RTE_ETH_NAME_MAX_LEN]; /**< Unique identifier name */

    void **rx_queues; /**< Array of pointers to RX queues. */
    void **tx_queues; /**< Array of pointers to TX queues. */
    uint16_t nb_rx_queues; /**< Number of RX queues. */
    uint16_t nb_tx_queues; /**< Number of TX queues. */

    struct rte_eth_dev_sriov sriov; /**< SRIOV data */

    void *dev_private; /**< PMD-specific private data */ // the PCI BAR is stored in here

    struct rte_eth_link dev_link; /**< Link-level information & status */
    struct rte_eth_conf dev_conf; /**< Configuration applied to device. */
    uint16_t mtu; /**< Maximum Transmission Unit. */

    uint32_t min_rx_buf_size; /**< Common rx buffer size handled by all queues */

    uint64_t rx_mbuf_alloc_failed; /**< RX ring mbuf allocation failures. */
    struct ether_addr* mac_addrs; /**< Device Ethernet Link address. */
    uint64_t mac_pool_sel[ETH_NUM_RECEIVE_MAC_ADDR];
    /** bitmap array of associating Ethernet MAC addresses to pools */
    struct ether_addr* hash_mac_addrs;
    /** Device Ethernet MAC addresses of hash filtering. */
    uint8_t port_id; /**< Device [external] port identifier. */
    __extension__
    uint8_t promiscuous : 1, /**< RX promiscuous mode ON(1) / OFF(0). */
        scattered_rx : 1, /**< RX of scattered packets is ON(1) / OFF(0) */
        all_multicast : 1, /**< RX all multicast mode ON(1) / OFF(0). */
        dev_started : 1, /**< Device state: STARTED(1) / STOPPED(0). */
        lro : 1; /**< RX LRO is ON(1) / OFF(0) */
    uint8_t rx_queue_state[RTE_MAX_QUEUES_PER_PORT];
    /** Queues state: STARTED(1) / STOPPED(0) */
    uint8_t tx_queue_state[RTE_MAX_QUEUES_PER_PORT];
    /** Queues state: STARTED(1) / STOPPED(0) */
    uint32_t dev_flags; /**< Capabilities */ // pay attention to this field
    enum rte_kernel_driver kdrv; /**< Kernel driver passthrough */
    int numa_node; /**< NUMA node connection */
    struct rte_vlan_filter_conf vlan_filter_conf;
    /**< VLAN filter configuration. */
};

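Code version:

The code is from DPDK 17.08, but the problem is not limited to 17.08; it exists all the way up to 19.11. This just happens to be the version where I hit it, or put differently, the version where it is easier to hit. Unless stated otherwise, everything below refers to dpdk-17.08.

Code location:

<DPDK root>/lib/librte_ether/rte_ethdev.h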

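【Problem Description】

Then, during one test, I stumbled on a problem. Our device supports NIC hot-plug, but after the ethtool-dpdk tool was started, hot-plug stopped working. When the primary process checked the NIC hot-plug flag, the flag had "disappeared".

Where the flag is defined:

<DPDK root>/lib/librte_ether/rte_ethdev.h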

/** Device supports hotplug detach */
#define RTE_ETH_DEV_DETACHABLE   0x0001 // NIC hot-plug flag
/** Device supports link state interrupt */
#define RTE_ETH_DEV_INTR_LSC     0x0002 // NIC LSC interrupt flag
/** Device is a bonded slave */
#define RTE_ETH_DEV_BONDED_SLAVE 0x0004
/** Device supports device removal interrupt */
#define RTE_ETH_DEV_INTR_RMV     0x0008

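These flags are meant for struct rte_eth_dev_data->dev_flags. As mentioned, rte_eth_dev_data lives in shared memory and is managed by the primary process.

The value of struct rte_eth_dev_data->dev_flags should have been RTE_ETH_DEV_DETACHABLE | RTE_ETH_DEV_INTR_LSC, that is, 0x0001 | 0x0002 = 0x0003.

After running ethtool-dpdk, however, the value became 0x0002: the hot-plug flag RTE_ETH_DEV_DETACHABLE was gone. Since rte_eth_dev_data is in shared memory, it must have been the secondary process, that is, the ethtool-dpdk tool, that changed the shared memory contents.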
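The arithmetic is easy to check. The sketch below copies the flag values from the header excerpt into local constants (the DEMO_ names and the helpers are made up for the illustration) and tests the hot-plug bit before and after the change.

```c
#include <stdbool.h>
#include <stdint.h>

/* Values copied from the rte_ethdev.h excerpt (DPDK 17.08); names are local. */
#define DEMO_DEV_DETACHABLE 0x0001u  /* hot-plug detach supported */
#define DEMO_DEV_INTR_LSC   0x0002u  /* link state change interrupt */

/* Hypothetical helper: true if every bit of `flag` is set in `dev_flags`. */
static bool dev_has_flag(uint32_t dev_flags, uint32_t flag)
{
    return (dev_flags & flag) == flag;
}

/* Value set up by the primary process: 0x0001 | 0x0002 = 0x0003. */
static uint32_t flags_expected(void)
{
    return DEMO_DEV_DETACHABLE | DEMO_DEV_INTR_LSC;
}

/* Value observed after ethtool-dpdk ran: only the LSC bit is left. */
static uint32_t flags_observed(void)
{
    return DEMO_DEV_INTR_LSC;
}
```

With the observed value, dev_has_flag(flags_observed(), DEMO_DEV_DETACHABLE) is false, which is exactly what the primary's hot-plug check ran into.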

Note: if you already know that struct rte_eth_dev_data lives in shared memory, a quick skim of the analysis below should be enough to see what is going on.

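【Analysis】

During initialization of a primary or secondary process, that is, inside rte_eal_init(), DPDK scans the PCI devices and collects their state. If you are not familiar with this part, see my other article "DPDK Initialization: PCI" (《DPDK初始化之PCI》); it is not required for following this one, though.

During initialization, both the primary and the secondary process enter rte_eth_dev_pci_allocate to obtain a struct rte_eth_dev.

First, a look at struct rte_eth_dev: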

struct rte_eth_dev {
    eth_rx_burst_t rx_pkt_burst; /**< Pointer to PMD receive function. */
    eth_tx_burst_t tx_pkt_burst; /**< Pointer to PMD transmit function. */
    eth_tx_prep_t tx_pkt_prepare; /**< Pointer to PMD transmit prepare function. */
    struct rte_eth_dev_data *data; /**< Pointer to device data */ // note this pointer
    const struct eth_dev_ops *dev_ops; /**< Functions exported by PMD */
    struct rte_device *device; /**< Backing device */
    struct rte_intr_handle *intr_handle; /**< Device interrupt handle */
    /** User application callbacks for NIC interrupts */
    struct rte_eth_dev_cb_list link_intr_cbs;
    /**
     * User-supplied functions called from rx_burst to post-process
     * received packets before passing them to the user
     */
    struct rte_eth_rxtx_callback *post_rx_burst_cbs[RTE_MAX_QUEUES_PER_PORT];
    /**
     * User-supplied functions called from tx_burst to pre-process
     * received packets before passing them to the driver for transmission.
     */
    struct rte_eth_rxtx_callback *pre_tx_burst_cbs[RTE_MAX_QUEUES_PER_PORT];
    enum rte_eth_dev_state state; /**< Flag indicating the port state */
} __rte_cache_aligned;

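If you are not familiar with the implementation, here is the short version: struct rte_eth_dev describes a "device". In our scenario you can read it as describing one NIC. Put plainly, it is a management structure, an abstraction of the NIC.

First, the data structure diagram:

[Figure: data structure diagram]

(If this picture makes sense at a glance, skimming the rest of the analysis will be enough to understand the problem.)

What rte_eth_dev_pci_allocate does is obtain this struct rte_eth_dev. Why shift the view from the key rte_eth_dev_data structure to rte_eth_dev_pci_allocate? I will leave that aside for now; just follow along. I prefer to reconstruct the whole scene in order, because starting directly from the context where the problem shows up actually makes the analysis harder.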

static inline struct rte_eth_dev *
rte_eth_dev_pci_allocate(struct rte_pci_device *dev, size_t private_data_size)
{
    struct rte_eth_dev *eth_dev;
    const char *name;

    if (!dev)
        return NULL;

    // step 1. get the device name
    name = dev->device.name;

    // step 2. a primary process calls rte_eth_dev_allocate to "allocate" the rte_eth_dev;
    // a secondary process calls rte_eth_dev_attach_secondary to "obtain" it
    if (rte_eal_process_type() == RTE_PROC_PRIMARY) {
        eth_dev = rte_eth_dev_allocate(name);
        if (!eth_dev)
            return NULL;

        if (private_data_size) {
            eth_dev->data->dev_private = rte_zmalloc_socket(name,
                private_data_size, RTE_CACHE_LINE_SIZE,
                dev->device.numa_node);
            if (!eth_dev->data->dev_private) {
                rte_eth_dev_release_port(eth_dev);
                return NULL;
            }
        }
    } else {
        eth_dev = rte_eth_dev_attach_secondary(name);
        if (!eth_dev)
            return NULL;
    }

    eth_dev->device = &dev->device;
    // step 3. call rte_eth_copy_pci_info to copy PCI information from the PCI device structure
    rte_eth_copy_pci_info(eth_dev, dev);
    return eth_dev;
}

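The corresponding flowchart:

[Figure: flowchart of rte_eth_dev_pci_allocate]

From the logic of rte_eth_dev_pci_allocate we can see two key points:

1. Obtaining the rte_eth_dev structure; primary and secondary just do it in different ways.
2. Calling rte_eth_copy_pci_info to copy information from the structure describing the PCI device into rte_eth_dev, the structure describing the device.

Let's focus on the first key point, obtaining the rte_eth_dev structure. Our scenario is a secondary process, so I will not analyze the code path taken by the primary; look into it yourself if you are interested.

Next, from the secondary process's point of view, we step into rte_eth_dev_attach_secondary and see how the secondary obtains its struct rte_eth_dev and what it does along the way.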

struct rte_eth_dev *
rte_eth_dev_attach_secondary(const char *name)
{
    uint8_t i;
    struct rte_eth_dev *eth_dev;

    // step 1. if the global data pointer rte_eth_dev_data is NULL, set it up
    if (rte_eth_dev_data == NULL)
        rte_eth_dev_data_alloc();

    // step 2. find the index of the rte_eth_dev_data entry whose name matches the device
    for (i = 0; i < RTE_MAX_ETHPORTS; i++) {
        if (strcmp(rte_eth_dev_data[i].name, name) == 0)
            break;
    }
    if (i == RTE_MAX_ETHPORTS) {
        RTE_PMD_DEBUG_TRACE(
            "device %s is not driven by the primary process\n",
            name);
        return NULL;
    }

    // step 3. use that index to call eth_dev_get and obtain the struct rte_eth_dev
    eth_dev = eth_dev_get(i);
    RTE_ASSERT(eth_dev->data->port_id == i);

    return eth_dev;
}

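The corresponding flowchart:

[Figure: flowchart of rte_eth_dev_attach_secondary]

This function also has two important points:

1. It calls rte_eth_dev_data_alloc() to "obtain" the rte_eth_dev_data structure.
2. It calls eth_dev_get to get the matching struct rte_eth_dev.

Let's skip rte_eth_dev_data_alloc() for the moment and come back to it later. First, how does eth_dev_get get hold of the struct rte_eth_dev?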

static struct rte_eth_dev *
eth_dev_get(uint8_t port_id)
{
    struct rte_eth_dev *eth_dev = &rte_eth_devices[port_id];

    eth_dev->data = &rte_eth_dev_data[port_id]; // the data pointer in rte_eth_dev comes from rte_eth_dev_data
    eth_dev->state = RTE_ETH_DEV_ATTACHED;
    TAILQ_INIT(&(eth_dev->link_intr_cbs));

    eth_dev_last_created_port = port_id;

    return eth_dev;
}

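The corresponding flowchart:

[Figure: flowchart of eth_dev_get]

The rte_eth_dev structure comes from the global array rte_eth_devices, which means rte_eth_dev is process-local data, not data in shared memory. The key, however, is the commented line in the code above: the data pointer in rte_eth_dev points at rte_eth_dev_data. And as mentioned, rte_eth_dev_data is "obtained" by calling rte_eth_dev_data_alloc inside rte_eth_dev_attach_secondary. How? Let's go back and look at how rte_eth_dev_data_alloc gets hold of rte_eth_dev_data.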

static void
rte_eth_dev_data_alloc(void)
{
    const unsigned flags = 0;
    const struct rte_memzone *mz;

    // step 1. a primary process reserves a memzone to hold the rte_eth_dev_data array;
    // a secondary process simply looks up the rte_eth_dev_data already in shared memory
    if (rte_eal_process_type() == RTE_PROC_PRIMARY) {
        mz = rte_memzone_reserve(MZ_RTE_ETH_DEV_DATA,
                RTE_MAX_ETHPORTS * sizeof(*rte_eth_dev_data),
                rte_socket_id(), flags);
    } else
        mz = rte_memzone_lookup(MZ_RTE_ETH_DEV_DATA);
    if (mz == NULL)
        rte_panic("Cannot allocate memzone for ethernet port data\n");

    rte_eth_dev_data = mz->addr;
    if (rte_eal_process_type() == RTE_PROC_PRIMARY)
        memset(rte_eth_dev_data, 0,
                RTE_MAX_ETHPORTS * sizeof(*rte_eth_dev_data));
}

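The corresponding flowchart:

[Figure: flowchart of rte_eth_dev_data_alloc]

From this code we learn that struct rte_eth_dev_data lives in shared memory. This is in fact how the secondary process reads NIC registers: it uses this structure to get to the PCI BAR, then computes the address of a specific register as base address + register offset. That is why a secondary process can read NIC register state.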
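The reserve-versus-lookup pattern can be modelled in a few lines. The following is a toy, single-process stand-in, not DPDK's memzone implementation: zone_reserve and zone_lookup are made-up names that play the roles of rte_memzone_reserve and rte_memzone_lookup, and a static table replaces real shared memory. What it shows is the asymmetry: only the "primary" path creates and zeroes the region, while the "secondary" path gets back the very same address.

```c
#include <stddef.h>
#include <string.h>

#define ZONE_MAX      8
#define ZONE_NAME_LEN 32
#define ZONE_BYTES    256

struct zone {
    char name[ZONE_NAME_LEN];
    int in_use;
    unsigned char mem[ZONE_BYTES];
};

/* Stand-in for the shared zone registry. */
static struct zone zone_table[ZONE_MAX];

/* "Secondary" path: find an existing zone by name, never create one. */
static void *zone_lookup(const char *name)
{
    for (size_t i = 0; i < ZONE_MAX; i++) {
        if (zone_table[i].in_use && strcmp(zone_table[i].name, name) == 0)
            return zone_table[i].mem;
    }
    return NULL;
}

/* "Primary" path: create the zone and zero it; fail if it already exists. */
static void *zone_reserve(const char *name)
{
    if (strlen(name) >= ZONE_NAME_LEN || zone_lookup(name) != NULL)
        return NULL;

    for (size_t i = 0; i < ZONE_MAX; i++) {
        if (!zone_table[i].in_use) {
            strcpy(zone_table[i].name, name);
            zone_table[i].in_use = 1;
            memset(zone_table[i].mem, 0, ZONE_BYTES);
            return zone_table[i].mem;
        }
    }
    return NULL; /* table full */
}
```

Because both paths end up with the same address, anything the "secondary" writes through that pointer is a write to the primary's data.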

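From the analysis so far we have at least the following clues:

1. During initialization, both primary and secondary DPDK processes call rte_eth_dev_pci_allocate to get a struct rte_eth_dev.
2. The struct rte_eth_dev comes from the global array rte_eth_devices, which means it is local to each process.
3. Initialization also sets the data pointer in struct rte_eth_dev to point at a struct rte_eth_dev_data.
4. In a secondary process, struct rte_eth_dev_data is initialized by looking up an address in shared memory, which means struct rte_eth_dev_data is in shared memory.

These four clues are only meant to make one thing clear: the secondary holds a structure shared with the primary process, namely struct rte_eth_dev_data, and holding shared memory makes it easy to make mistakes.

The offending code sits at the second key point in rte_eth_dev_pci_allocate: the rte_eth_copy_pci_info function.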

static inline void
rte_eth_copy_pci_info(struct rte_eth_dev *eth_dev,
    struct rte_pci_device *pci_dev)
{
    if ((eth_dev == NULL) || (pci_dev == NULL)) {
        RTE_PMD_DEBUG_TRACE("NULL pointer eth_dev=%p pci_dev=%p\n",
                eth_dev, pci_dev);
        return;
    }

    eth_dev->intr_handle = &pci_dev->intr_handle;

    // problem code: dev_flags, reached through the data pointer, is reset here
    eth_dev->data->dev_flags = 0;
    if (pci_dev->driver->drv_flags & RTE_PCI_DRV_INTR_LSC)
        eth_dev->data->dev_flags |= RTE_ETH_DEV_INTR_LSC;
    if (pci_dev->driver->drv_flags & RTE_PCI_DRV_INTR_RMV)
        eth_dev->data->dev_flags |= RTE_ETH_DEV_INTR_RMV;

    eth_dev->data->kdrv = pci_dev->kdrv;
    eth_dev->data->numa_node = pci_dev->device.numa_node;
}

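The corresponding flowchart:

[Figure: flowchart of rte_eth_copy_pci_info]

As you can see, rte_eth_copy_pci_info writes to data reached through the data pointer of struct rte_eth_dev, and that data is precisely the struct rte_eth_dev_data in shared memory. From the earlier analysis of rte_eth_dev_pci_allocate we also know that both secondary and primary processes enter rte_eth_copy_pci_info. So the following happens:

Having obtained its struct rte_eth_dev, the secondary process strolls into rte_eth_copy_pci_info to copy PCI information and, in passing, resets data reached through the data pointer, namely rte_eth_dev_data.dev_flags. The conditions used to rebuild it are incomplete, so after the reset dev_flags can only be one of four values: 0x0000 (nothing), 0x0002 (RTE_ETH_DEV_INTR_LSC), 0x0008 (RTE_ETH_DEV_INTR_RMV), or 0x000A (RTE_ETH_DEV_INTR_LSC | RTE_ETH_DEV_INTR_RMV). Any flag other than these two never comes back.

Back to our scenario. In DPDK 17.08, struct rte_eth_dev_data.dev_flags was originally RTE_ETH_DEV_DETACHABLE | RTE_ETH_DEV_INTR_LSC, i.e. 0x0003. After the reset logic in rte_eth_copy_pci_info, only RTE_ETH_DEV_INTR_LSC is left: a perfectly good 0x0003 becomes 0x0002, and the primary's NIC hot-plug capability is wiped out for no apparent reason.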
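The whole sequence can be replayed in a small single-process model. Everything here is a simplified stand-in rather than DPDK code: the struct and function names are hypothetical, one static variable plays the shared rte_eth_dev_data entry, and two local structs play the per-process rte_eth_dev entries. The PCI driver capability values are illustrative as well.

```c
#include <stdint.h>

#define MODEL_DEV_DETACHABLE 0x0001u
#define MODEL_DEV_INTR_LSC   0x0002u
#define MODEL_DEV_INTR_RMV   0x0008u

/* Illustrative PCI driver capability bits (not DPDK's actual values). */
#define MODEL_DRV_INTR_LSC   0x1u
#define MODEL_DRV_INTR_RMV   0x2u

struct model_dev_data { uint32_t dev_flags; };          /* "shared memory" */
struct model_dev      { struct model_dev_data *data; }; /* process-local   */

static struct model_dev_data shared_data;  /* one copy, seen by both sides */

/* Same shape as the reset in rte_eth_copy_pci_info: clear, then re-derive. */
static void model_copy_pci_info(struct model_dev *dev, uint32_t drv_flags)
{
    dev->data->dev_flags = 0;
    if (drv_flags & MODEL_DRV_INTR_LSC)
        dev->data->dev_flags |= MODEL_DEV_INTR_LSC;
    if (drv_flags & MODEL_DRV_INTR_RMV)
        dev->data->dev_flags |= MODEL_DEV_INTR_RMV;
}

/* Replays the scenario and returns the flags the primary sees afterwards. */
static uint32_t model_replay_bug(void)
{
    struct model_dev primary_dev   = { .data = &shared_data };
    struct model_dev secondary_dev = { .data = &shared_data };

    /* Primary has finished initialization: hot-plug + LSC. */
    primary_dev.data->dev_flags = MODEL_DEV_DETACHABLE | MODEL_DEV_INTR_LSC;

    /* Secondary attaches and runs the same copy routine (driver has LSC only). */
    model_copy_pci_info(&secondary_dev, MODEL_DRV_INTR_LSC);

    return primary_dev.data->dev_flags;
}
```

The primary never touches the flags after its own initialization, yet it reads back 0x0002: the write came in through the secondary's local struct, whose data pointer aliases the same shared entry.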

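【DPDK 19.11】

In DPDK 19.11 the problem is still there. The function names have not even changed; only the file they live in has moved and the dev_flags values are different, with RTE_ETH_DEV_DETACHABLE now deprecated. But is the problem really just that a missing RTE_ETH_DEV_DETACHABLE flag breaks NIC hot-plug? After the analysis above, I trust you have your own answer.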

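【Conclusion】

In essence, during initialization the secondary process enters rte_eth_copy_pci_info and changes values in the shared struct rte_eth_dev_data on its own. I see two ways of looking at it:

1. From the point of view of the code's intended semantics, the secondary process really does need to enter rte_eth_copy_pci_info and redo the reset (if that is truly the case, I personally do not understand why). The reset simply fails to account for every case, so the state after the reset differs from the state before it.
2. My own view is that the secondary process has no need to enter rte_eth_copy_pci_info and set rte_eth_dev_data->dev_flags again. To me the problem has nothing to do with the flag bits as such: not with whether the NIC supports hot-plug, not with the flag values, and not with whether a flag has been deprecated. I believe a DPDK secondary process should have no right to modify shared-memory data, only to read it, least of all driver-related data in struct rte_eth_dev_data. The secondary process should not enter rte_eth_copy_pci_info at all. And why would rte_eth_dev_data->dev_flags need to be set again? The data comes from shared memory and the primary has already set it. It is as if the primary has cooked the dish and put it on the table; why would the secondary throw it out and cook it again? Judging by the end result, the driver state of the secondary after initialization is the same as the primary's. If consistency is the goal, then for the secondary, not writing to struct rte_eth_dev_data is surely the safest choice.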
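One way to express the "read, never write" rule in C is to hand the secondary side only a pointer-to-const. This is a sketch of the idea with made-up types and names, not something DPDK provides. With such a view, an accidental assignment like view->dev_flags = 0 is rejected by the compiler instead of silently changing the primary's state. It protects only against accidents; a cast can still remove the const.

```c
#include <stdint.h>

struct ro_dev_data {
    uint32_t dev_flags;
    int numa_node;
};

/* Stand-in for the entry that lives in shared memory. */
static struct ro_dev_data ro_shared = { .dev_flags = 0x0003u, .numa_node = 0 };

/* Primary side: full read/write access. */
static struct ro_dev_data *primary_view(void)
{
    return &ro_shared;
}

/* Secondary side: same object, but only reachable through a const pointer. */
static const struct ro_dev_data *secondary_view(void)
{
    return &ro_shared;
}

/* What a secondary-side tool would do: read, never write. */
static uint32_t secondary_read_flags(void)
{
    const struct ro_dev_data *view = secondary_view();
    /* view->dev_flags = 0;   <- would not compile: assignment through const */
    return view->dev_flags;
}
```

The secondary still sees every update made by the primary, because both views point at the same object; it just cannot change it by mistake.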

【Follow-up】

There are two ways to fix this:

Option 1: the most convenient change

static inline void
rte_eth_copy_pci_info(struct rte_eth_dev *eth_dev,
    struct rte_pci_device *pci_dev)
{
    if ((eth_dev == NULL) || (pci_dev == NULL)) {
        RTE_PMD_DEBUG_TRACE("NULL pointer eth_dev=%p pci_dev=%p\n",
                eth_dev, pci_dev);
        return;
    }

    eth_dev->intr_handle = &pci_dev->intr_handle;

    // add an if check: only the primary process may write to struct rte_eth_dev_data
    if (rte_eal_process_type() == RTE_PROC_PRIMARY) {
        eth_dev->data->dev_flags = 0;
        if (pci_dev->driver->drv_flags & RTE_PCI_DRV_INTR_LSC)
            eth_dev->data->dev_flags |= RTE_ETH_DEV_INTR_LSC;
        if (pci_dev->driver->drv_flags & RTE_PCI_DRV_INTR_RMV)
            eth_dev->data->dev_flags |= RTE_ETH_DEV_INTR_RMV;

        eth_dev->data->kdrv = pci_dev->kdrv;
        eth_dev->data->numa_node = pci_dev->device.numa_node;
    }
}
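The effect of the guard can be checked with a small model. The names and the driver capability values are hypothetical, a plain variable stands in for shared memory, and the process type is passed in as a parameter instead of being queried with rte_eal_process_type().

```c
#include <stdbool.h>
#include <stdint.h>

#define FIX_DEV_DETACHABLE 0x0001u
#define FIX_DEV_INTR_LSC   0x0002u
#define FIX_DEV_INTR_RMV   0x0008u
#define FIX_DRV_INTR_LSC   0x1u   /* illustrative driver capability bits */
#define FIX_DRV_INTR_RMV   0x2u

struct fix_dev_data { uint32_t dev_flags; };

/* Guarded version: only the primary may rewrite the shared flags. */
static void fix_copy_pci_info(struct fix_dev_data *data, uint32_t drv_flags,
                              bool is_primary)
{
    if (!is_primary)
        return;               /* secondary: leave shared data untouched */

    data->dev_flags = 0;
    if (drv_flags & FIX_DRV_INTR_LSC)
        data->dev_flags |= FIX_DEV_INTR_LSC;
    if (drv_flags & FIX_DRV_INTR_RMV)
        data->dev_flags |= FIX_DEV_INTR_RMV;
}

/* Secondary attach after the primary set hot-plug + LSC. */
static uint32_t fix_flags_after_secondary_attach(void)
{
    struct fix_dev_data shared = { FIX_DEV_DETACHABLE | FIX_DEV_INTR_LSC };
    fix_copy_pci_info(&shared, FIX_DRV_INTR_LSC, false);
    return shared.dev_flags;
}

/* Primary initialization still rebuilds the flags from the driver bits. */
static uint32_t fix_flags_after_primary_init(void)
{
    struct fix_dev_data shared = { 0xFFFFu };
    fix_copy_pci_info(&shared, FIX_DRV_INTR_LSC | FIX_DRV_INTR_RMV, true);
    return shared.dev_flags;
}
```

In this model the secondary attach leaves 0x0003 in place, while the primary path behaves exactly as before.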

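Option 2:

Stick to my own view and keep the secondary process out of rte_eth_copy_pci_info altogether. This is a much larger and riskier change, because in DPDK rte_eth_copy_pci_info is not only called during initialization. If you are interested, you can look into it yourself; I will not go into it here.

Finally, a patch for this problem has been submitted to the DPDK community and has been accepted.

P.S. Submitting a patch to DPDK takes quite some effort... far more than submitting one inside my company. DPDK has a dedicated guide for contributing patches, https://doc.dpdk.org/guides/contributing/patches.html, and the first time I read it my head was spinning...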
