nvidia-nccl study notes

NCCL overview

NCCL1 vs NCCL2

  • nccl1:
    nccl1 supports only single-node, multi-GPU communication; it has no multi-node support.
    Source repository: https://github.com/NVIDIA/nccl-tests
  • nccl2:
    nccl2 adds multi-node communication on top of nccl1. The inter-node transport protocol is selectable, so data can be exchanged between nodes over IB, TCP, and other protocols.

NCCL2 API

Initialization

  • Id creation

Create one shared Id per communicator clique; it is generated only once and then distributed to every participant in the clique.

  1. ncclResult_t ncclGetUniqueId(ncclUniqueId* uniqueId)
    Generates the Id consumed by the initialization function (ncclCommInitRank). It must be called exactly once in the whole distributed job (from a single place); the resulting Id then has to be distributed to every other task before each of them runs the ncclCommInitRank initialization, which requires this globally shared Id.
    > Generates an Id to be used in ncclCommInitRank. ncclGetUniqueId should be called once and the Id should be distributed to all ranks in the communicator before calling ncclCommInitRank.
  • communicator initialization

Create one communicator per participant in the clique; during communication each participant uses its own communicator.

  1. ncclResult_t ncclCommInitRank(ncclComm_t* comm, int nranks, ncclUniqueId commId, int rank)
    Creates a new communicator (multi-process/multi-thread case). The rank argument must be unique and lie between 0 and nranks-1, and each rank must already be bound to its CUDA device. The call implicitly synchronizes across ranks, so it has to be issued from different processes/threads, or, within a single thread, bracketed by ncclGroupStart/ncclGroupEnd.

Creates a new communicator (multi thread/process version). rank must be between 0 and nranks-1 and unique within a communicator clique. Each rank is associated to a CUDA device, which has to be set before calling ncclCommInitRank. ncclCommInitRank implicitly synchronizes with other ranks, so it must be called by different threads/processes or use ncclGroupStart/ncclGroupEnd.
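
A minimal sketch of this flow. MPI is an assumption introduced here purely as a side channel for distributing the Id; the TCP broadcast used later in these notes would work just as well. One process per GPU on a single node is also assumed.

```cpp
#include <mpi.h>
#include <nccl.h>
#include <cuda_runtime.h>

int main(int argc, char* argv[]) {
  int rank, nranks;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nranks);

  // Create the Id exactly once (rank 0) and copy it to every other rank.
  ncclUniqueId id;
  if (rank == 0) ncclGetUniqueId(&id);
  MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, MPI_COMM_WORLD);

  // Bind a CUDA device before initializing the communicator.
  cudaSetDevice(rank);   // assumption: single node, one GPU per rank
  ncclComm_t comm;
  ncclCommInitRank(&comm, nranks, id, rank);

  // ... collective calls on `comm` go here ...

  ncclCommDestroy(comm);
  MPI_Finalize();
  return 0;
}
```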

  1. ncclResult_t ncclCommInitAll(ncclComm_t* comm, int ndev, const int* devlist)
    Creates all the communicators at once from a single process. comm must be pre-allocated, and the number of devices plus a device list are passed in. This function is convenient for single-node communication and is not used for multi-node communication.
    > Creates a clique of communicators (single process version). This is a convenience function to create a single-process communicator clique. Returns an array of ndev newly initialized communicators in comm. comm should be pre-allocated with size at least ndev*sizeof(ncclComm_t). If devlist is NULL, the first ndev CUDA devices are used. Order of devlist defines user-order of processors within the communicator.
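
For the single-process case, a minimal sketch (the device count of 4 is an arbitrary assumption):

```cpp
#include <nccl.h>

void init_single_process() {
  const int ndev = 4;        // assumption: 4 visible GPUs
  ncclComm_t comms[ndev];    // pre-allocated, at least ndev entries
  // devlist == NULL -> use the first ndev CUDA devices, in order.
  ncclCommInitAll(comms, ndev, NULL);

  // ... per-device collective calls (see the group semantics section below) ...

  for (int i = 0; i < ndev; ++i) ncclCommDestroy(comms[i]);
}
```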

Collective operations

Collective operations must be invoked separately on each communicator in the clique. The call returns as soon as the operation has been enqueued on the CUDA stream. Each collective has to be issued from its own process/thread, or, when a single thread drives several communicators, wrapped in group semantics (ncclGroupStart/ncclGroupEnd). In-place mode (sendbuff == recvbuff) is supported.

Collective communication operations must be called separately for each communicator in a communicator clique. They return when operations have been enqueued on the CUDA stream. Since they may perform inter-CPU synchronization, each call has to be done from a different thread or process, or need to use Group Semantics.

  • Reduce

    1. ncclResult_t ncclReduce(const void* sendbuff, void* recvbuff, size_t count, ncclDataType_t datatype, ncclRedOp_t op, int root, ncclComm_t comm, cudaStream_t stream)
      Reduces the data onto the root (root refers to the root rank, not the root CUDA device).

    Reduces data arrays of length count in sendbuff into recvbuff using op operation. recvbuff may be NULL on all calls except for root device. root is the rank (not the CUDA device) where data will reside after the operation is complete. In-place operation will happen if sendbuff == recvbuff.
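
A hedged one-call sketch; sendbuff, recvbuff, count, rank, comm and stream are assumed to be set up as in the initialization example above, summing float data onto rank 0:

```cpp
// Every rank contributes `count` floats; only the root rank (0 here) needs a
// valid recvbuff, the other ranks may pass NULL.
ncclReduce(sendbuff, rank == 0 ? recvbuff : NULL, count,
           ncclFloat, ncclSum, /*root=*/0, comm, stream);
```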

  • Broadcast

    1. ncclResult_t ncclBcast(void* buff, size_t count, ncclDataType_t datatype, int root, ncclComm_t comm, cudaStream_t stream)
      Broadcasts the data on the root rank to all other ranks.

    Copies count values from root to all other devices. root is the rank (not the CUDA device) where data resides before the operation is started. This operation is implicitly in place.
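
A corresponding sketch; note the single buffer argument, since the operation is implicitly in place (buff is the source on the root and the destination everywhere else). buff, count, comm and stream are assumed from the earlier examples:

```cpp
// Rank 0 holds the source data; after the call, every rank's buff matches it.
ncclBcast(buff, count, ncclFloat, /*root=*/0, comm, stream);
```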

  • All-Reduce

    1. ncclResult_t ncclAllReduce(const void* sendbuff, void* recvbuff, size_t count, ncclDataType_t datatype, ncclRedOp_t op, ncclComm_t comm, cudaStream_t stream)
      Performs the reduce operation across all ranks and leaves the full result on every rank.

    Reduces data arrays of length count in sendbuff using op operation, and leaves identical copies of result on each recvbuff. In-place operation will happen if sendbuff == recvbuff.
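
A sketch of the call, again assuming the buffers, comm and stream from the initialization example; the cudaStreamSynchronize reflects the point made above that collectives only enqueue work:

```cpp
// Every rank ends up with the element-wise sum. Passing the same pointer for
// sendbuff and recvbuff gives the in-place variant.
ncclAllReduce(sendbuff, recvbuff, count, ncclFloat, ncclSum, comm, stream);
cudaStreamSynchronize(stream);  // result is only valid once the stream drains
```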

  • Reduce-Scatter

  • All-Gather

    1. ncclResult_t ncclAllGather(const void* sendbuff, void* recvbuff, size_t sendcount, ncclDataType_t datatype, ncclComm_t comm, cudaStream_t stream)
      Receives data from every other rank and stores it in the local recvbuff; the chunk received from rank i is stored at offset i*sendcount.

    Each device gathers sendcount values from other GPUs into recvbuff, receiving data from rank i at offset i*sendcount. Assumes recvcount is equal to nranks*sendcount, which means that recvbuff should have a size of at least nranks*sendcount elements. In-place operations will happen if sendbuff == recvbuff + rank * sendcount.
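
A sketch of the sizing rule; sendbuff, sendcount, comm and stream are assumed from the earlier examples, and nranks can be obtained with ncclCommCount (described below):

```cpp
// recvbuff must hold nranks * sendcount elements; the chunk contributed by
// rank i lands at element offset i * sendcount.
float* recvbuff;
cudaMalloc((void**)&recvbuff, nranks * sendcount * sizeof(float));
ncclAllGather(sendbuff, recvbuff, sendcount, ncclFloat, comm, stream);
```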

Group operations

When a single thread drives multiple GPUs, group semantics are needed to tie together the calls issued for the different ranks/devices (so that their inter-CPU synchronization cannot deadlock). ncclGroupStart and ncclGroupEnd mark the calls that belong to the same operation: ncclGroupStart begins queuing all subsequent calls, and ncclGroupEnd waits for them all (for collective operations, ncclGroupEnd only guarantees that everything has been enqueued on the CUDA streams, not that the operations have completed). Group semantics can be used both with collective operations and with ncclCommInitRank.

When managing multiple GPUs from a single thread, and since NCCL collective calls may perform inter-CPU synchronization, we need to "group" calls for different ranks/devices into a single call. Grouping NCCL calls as being part of the same collective operation is done using ncclGroupStart and ncclGroupEnd. ncclGroupStart will enqueue all collective calls until the ncclGroupEnd call, which will wait for all calls to be complete. Note that for collective communication, ncclGroupEnd only guarantees that the operations are enqueued on the streams, not that the operation is effectively done. Both collective communication and ncclCommInitRank can be used in conjunction with ncclGroupStart/ncclGroupEnd.

  • ncclResult_t ncclGroupStart()
    Starts a group call; the NCCL calls that follow do not block on inter-CPU synchronization.

    start a group call. All subsequent calls to NCCL may not block due to inter-CPU synchronization.

  • ncclResult_t ncclGroupEnd()
    Ends the group call; blocks until every call issued since ncclGroupStart has completed, then returns.

    End a group call. Wait for all calls since ncclGroupStart to complete before returning.
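
Putting the two calls together, a sketch of one thread driving the communicators created by the ncclCommInitAll example above; ndev, comms, streams, sendbuffs, recvbuffs and count are assumed to be allocated per device:

```cpp
// Without the group wrapper, the first ncclAllReduce could block waiting for
// the other ranks, which this single thread would then never reach.
ncclGroupStart();
for (int i = 0; i < ndev; ++i)
  ncclAllReduce(sendbuffs[i], recvbuffs[i], count, ncclFloat, ncclSum,
                comms[i], streams[i]);
ncclGroupEnd();  // returns once everything is enqueued, not when it has finished

// Completion still has to be observed on each stream.
for (int i = 0; i < ndev; ++i) {
  cudaSetDevice(i);
  cudaStreamSynchronize(streams[i]);
}
```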

Other operations

  • ncclResult_t ncclCommDestroy(ncclComm_t comm)
    Releases the resources held by comm.

    Frees resources associated with communicator object

  • const char* ncclGetErrorString(ncclResult_t result)
    Returns the error message for a result code.

    Returns a human-readable error message.
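
A typical way to use it is a checking macro wrapped around every NCCL call; the macro name NCCLCHECK is our own convention here, not part of the library:

```cpp
#include <stdio.h>
#include <stdlib.h>
#include <nccl.h>

#define NCCLCHECK(cmd) do {                                   \
    ncclResult_t res = (cmd);                                 \
    if (res != ncclSuccess) {                                 \
      fprintf(stderr, "NCCL error %s:%d '%s'\n",              \
              __FILE__, __LINE__, ncclGetErrorString(res));   \
      exit(EXIT_FAILURE);                                     \
    }                                                         \
  } while (0)

// Usage: NCCLCHECK(ncclCommInitRank(&comm, nranks, id, rank));
```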

  • ncclResult_t ncclCommCount(const ncclComm_t comm, int* count)
    Gets the total number of ranks in the communicator clique.

    Gets the number of ranks in the communicator clique.

  • ncclResult_t ncclCommCuDevice(const ncclComm_t comm, int* device)
    Gets the CUDA device associated with the communicator.

    Returns the cuda device number associated with the communicator.

  • ncclResult_t ncclCommUserRank(const ncclComm_t comm, int* rank)
    Gets the rank associated with the communicator.

    Returns the user-ordered "rank" associated with the communicator.
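
A short fragment querying an initialized communicator (comm is assumed to come from one of the initialization examples above):

```cpp
int nranks, device, rank;
ncclCommCount(comm, &nranks);      // total ranks in the clique
ncclCommCuDevice(comm, &device);   // CUDA device bound to this communicator
ncclCommUserRank(comm, &rank);     // this communicator's rank, in user order
printf("rank %d of %d on CUDA device %d\n", rank, nranks, device);
```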

NCCL dynamic scaling

Single-node multi-GPU, multi-thread dynamic scaling

Design idea:
Each thread initializes its own communicator (the ncclId is created in the main thread and is visible to all threads). When a thread's collective (merge) call fails, check whether the failure was caused by another thread having exited.
If it was, every remaining thread re-initializes its own communicator and retries the previous collective (the re-initialization now sees fewer devices, which effectively creates a new communicator).

  • Test results:
    1. Each thread initializing its own communicator works.
    2. If a thread exits during a collective operation, the remaining threads block (the call never returns).
  • Conclusion
    Single-node multi-GPU (multi-thread) dynamic scaling cannot be supported.

Single-/multi-node multi-GPU, multi-process dynamic scaling

Design idea:
Each process initializes its own communicator (during initialization, process 0 broadcasts the ncclId to all processes over TCP). When a process's collective (merge) call fails, check whether the failure was caused by another process having exited. If it was, every remaining process re-initializes its own communicator and retries the previous collective (the re-initialization now sees fewer devices, which effectively creates a new communicator).

  • Test results:
    1. Having the server process (TCP server side) create the ncclId, broadcast it to all worker processes (TCP client side), and then communicate works (the server process itself does not have to take part in the communication).
    2. If a process exits during a collective operation, all remaining processes block (the call never returns), with GPU utilization at 100% and CPU utilization at 100%.
  • Conclusion
    Single-/multi-node multi-GPU (multi-process) dynamic scaling cannot be supported.