Create the unique ID (ncclGetUniqueId): it is generated only once per communicator clique, and the generated ID is distributed to every application in the clique.
Create a communicator for every application in the clique (ncclCommInitRank): during communication, each application must be bound to its own communicator.
Creates a new communicator (multi thread/process version). rank must be between 0 and nranks-1 and unique within a communicator clique. Each rank is associated to a CUDA device, which has to be set before calling ncclCommInitRank. ncclCommInitRank implicitly synchronizes with other ranks, so it must be called by different threads/processes or use ncclGroupStart/ncclGroupEnd.
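A minimal initialization sketch for the single-process, multi-GPU case (error handling omitted; the group calls are explained further below). The rank-to-device mapping here is an assumption: rank i simply uses GPU i.

#include <nccl.h>
#include <cuda_runtime.h>

/* One process driving nDev GPUs: create the unique ID once, then one
 * communicator per rank. Wrapping the loop in ncclGroupStart/ncclGroupEnd
 * is required when a single thread initializes several ranks. */
void init_comms(ncclComm_t* comms, int nDev) {
  ncclUniqueId id;
  ncclGetUniqueId(&id);                      /* created once per clique        */
  ncclGroupStart();
  for (int rank = 0; rank < nDev; rank++) {
    cudaSetDevice(rank);                     /* bind the rank to a CUDA device */
    ncclCommInitRank(&comms[rank], nDev, id, rank);
  }
  ncclGroupEnd();
}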
Collective operations must be invoked separately for each communicator in the clique. The call returns as soon as the operation has been enqueued on the CUDA stream. Each process/thread must issue its collective calls independently, or a single thread must use the group semantics (ncclGroupStart/ncclGroupEnd). The in-place mode (sendbuff == recvbuff) is supported.
Collective communication operations must be called separately for each communicator in a communicator clique. They return when operations have been enqueued on the CUDA stream. Since they may perform inter-CPU synchronization, each call has to be done from a different thread or process, or needs to use Group Semantics.
Reduce
Reduces data arrays of length count in sendbuff into recvbuff using op operation. recvbuff may be NULL on all calls except for root device. root is the rank (not the CUDA device) where data will reside after the operation is complete. In-place operation will happen if sendbuff == recvbuff.
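A hedged call sketch matching the description above; sendbuff, recvbuff, count, comm and stream are assumed to have been set up by the caller.

/* Sum `count` floats from every rank into recvbuff on rank 0 (the root).
 * recvbuff may be NULL on every rank except the root. */
ncclReduce(sendbuff, recvbuff, count, ncclFloat, ncclSum, /*root=*/0, comm, stream);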
Broadcast
Copies count values from root to all other devices. root is the rank (not the CUDA device) where data resides before the operation is started. This operation is implicitly in place.
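Because the operation is in place, there is a single buffer argument; a hedged sketch of the corresponding ncclBcast call, with buff/count/comm/stream provided by the caller.

/* Broadcast `count` floats from rank 0: buff is read on the root and
 * overwritten on every other rank. */
ncclBcast(buff, count, ncclFloat, /*root=*/0, comm, stream);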
All-Reduce
Reduces data arrays of length count in sendbuff using op operation, and leaves identical copies of result on each recvbuff. In-place operation will happen if sendbuff == recvbuff.
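The same call-sketch style for all-reduce (device buffers, comm and stream assumed allocated by the caller).

/* Element-wise sum across all ranks; every rank receives the full result.
 * Passing sendbuff == recvbuff selects the in-place variant. */
ncclAllReduce(sendbuff, recvbuff, count, ncclFloat, ncclSum, comm, stream);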
Reduce-Scatter
All-Gather
Each device gathers sendcount values from other GPUs into recvbuff, receiving data from rank i at offset i*sendcount. Assumes recvcount is equal to nranks*sendcount, which means that recvbuff should have a size of at least nranks*sendcount elements. In-place operations will happen if sendbuff == recvbuff + rank * sendcount.
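A hedged sketch of the corresponding call; recvbuff must hold nranks*sendcount elements, as noted above.

/* Each rank contributes sendcount floats; the block from rank i ends up at
 * offset i*sendcount inside every rank's recvbuff. */
ncclAllGather(sendbuff, recvbuff, sendcount, ncclFloat, comm, stream);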
當在單線程中操做多個GPU時,須要使用組操做進行不一樣ranks/devices間通訊的約束(保證在cpu同步時不衝突)。經過使用ncclGroupStart 和 ncclGroupEnd,保證組內相同操做的進行。ncclGroupStart將全部的操做放入隊列,ncclGroupEnd等待隊列中全部操做的完成(在collective操做中ncclGroupEnd只保證把全部的操做都放到cuda stream中,不等待操做完成)。 組操做能夠在collective操做和ncclCommInitRank中被使用。
When managing multiple GPUs from a single thread, and since NCCL collective calls may perform inter-CPU synchronization, we need to "group" calls for different ranks/devices into a single call. Grouping NCCL calls as being part of the same collective operation is done using ncclGroupStart and ncclGroupEnd. ncclGroupStart will enqueue all collective calls until the ncclGroupEnd call, which will wait for all calls to be complete. Note that for collective communication, ncclGroupEnd only guarantees that the operations are enqueued on the streams, not that the operation is effectively done. Both collective communication and ncclCommInitRank can be used in conjunction with ncclGroupStart/ncclGroupEnd.
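A sketch of the single-thread pattern, reusing the comms array from the initialization sketch above and assuming one stream and one pair of device buffers per GPU.

/* Without the group wrapper, the second ncclAllReduce could block forever
 * waiting for the first one's peers. ncclGroupEnd returns once every call
 * has been enqueued on its stream. */
ncclGroupStart();
for (int i = 0; i < nDev; i++)
  ncclAllReduce(sendbuff[i], recvbuff[i], count, ncclFloat, ncclSum, comms[i], streams[i]);
ncclGroupEnd();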
ncclResult_t ncclGroupStart()
組開始操做,其後的操做不使用cpu同步.
start a group call. All subsequent calls to NCCL may not block due to inter-CPU synchronization.
ncclResult_t ncclGroupEnd()
組結束操做,阻塞到全部從ncclGroupStart開始的操做完成在返回.
End a group call. Wait for all calls since ncclGroupStart to complete before returning.
ncclResult_t ncclCommDestroy(ncclComm_t comm)
Frees the resources held by the communicator comm.
Frees resources associated with the communicator object.
const char* ncclGetErrorString(ncclResult_t result)
Returns the error message string for a result code.
Returns a human-readable error message.
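A common usage pattern (not part of NCCL itself) is to wrap every call in a checking macro that prints ncclGetErrorString on failure.

#include <stdio.h>
#include <stdlib.h>
#include <nccl.h>

#define NCCLCHECK(cmd) do {                                   \
  ncclResult_t r = (cmd);                                     \
  if (r != ncclSuccess) {                                     \
    fprintf(stderr, "NCCL failure %s:%d '%s'\n",              \
            __FILE__, __LINE__, ncclGetErrorString(r));       \
    exit(EXIT_FAILURE);                                       \
  }                                                           \
} while (0)

/* Example: NCCLCHECK(ncclCommCount(comm, &nranks)); */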
ncclResult_t ncclCommCount(const ncclComm_t comm, int* count)
Gets the total number of ranks in the communicator clique.
Gets the number of ranks in the communicator clique.
ncclResult_t ncclCommCuDevice(const ncclComm_t comm, int* device)
Gets the CUDA device associated with this communicator.
Returns the cuda device number associated with the communicator.
ncclResult_t ncclCommUserRank(const ncclComm_t comm, int* rank)
Gets the rank associated with this communicator.
Returns the user-ordered "rank" associated with the communicator.
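Putting the query calls and the cleanup together (comm is an already-initialized communicator; the printf, which needs <stdio.h>, is just for illustration).

int nranks, device, rank;
ncclCommCount(comm, &nranks);       /* total ranks in the clique               */
ncclCommCuDevice(comm, &device);    /* CUDA device bound to this communicator  */
ncclCommUserRank(comm, &rank);      /* this communicator's rank                */
printf("rank %d/%d on CUDA device %d\n", rank, nranks, device);
ncclCommDestroy(comm);              /* release the communicator's resources    */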
Design approach (multi-thread case):
Each thread initializes its own communicator (the ncclUniqueId is created in the main thread and is visible to all threads). When a thread's merge operation fails, check whether the failure was caused by some thread having exited.
If the merge failed because a thread exited, every thread re-initializes its own communicator and repeats the previous merge step (at this re-initialization the number of devices has shrunk, so it amounts to re-creating the communicator).
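A loose sketch of that per-thread retry loop. merge_step and peer_thread_exited are hypothetical placeholders for the author's merge operation and exit detection (they are not NCCL APIs), and shared_id/shared_nranks are assumed to be republished by the main thread before each re-initialization.

/* Hypothetical per-thread logic, only illustrating the
 * re-initialization-on-failure idea described above. */
ncclComm_t comm;
cudaSetDevice(my_device);
ncclCommInitRank(&comm, shared_nranks, shared_id, my_rank);
while (merge_step(comm) != 0) {            /* hypothetical merge operation    */
  if (!peer_thread_exited()) break;        /* failure has some other cause    */
  ncclCommDestroy(comm);                   /* drop the stale communicator     */
  /* fewer devices now: re-init with the id/nranks republished by the main thread */
  ncclCommInitRank(&comm, shared_nranks, shared_id, my_rank);
}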
Design approach (multi-process case):
Each process initializes its own communicator (during initialization, process 0 broadcasts the ncclUniqueId to all processes over TCP). When a process's merge operation fails, check whether the failure was caused by some process having exited. If so, every process re-initializes its own communicator and repeats the previous merge step (at this re-initialization the number of devices has shrunk, so it amounts to re-creating the communicator).
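The ID distribution step is just a byte broadcast. Below is a sketch that uses MPI_Bcast in place of the author's raw-TCP exchange (single-node assumption: rank i uses GPU i; the merge/retry logic is omitted).

#include <mpi.h>
#include <nccl.h>
#include <cuda_runtime.h>

int main(int argc, char* argv[]) {
  MPI_Init(&argc, &argv);
  int myRank, nRanks;
  MPI_Comm_rank(MPI_COMM_WORLD, &myRank);
  MPI_Comm_size(MPI_COMM_WORLD, &nRanks);

  /* Rank 0 creates the unique ID; the raw bytes are broadcast to everyone
   * (MPI_Bcast here stands in for the TCP broadcast described above). */
  ncclUniqueId id;
  if (myRank == 0) ncclGetUniqueId(&id);
  MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, MPI_COMM_WORLD);

  ncclComm_t comm;
  cudaSetDevice(myRank);                   /* single-node assumption: rank i uses GPU i */
  ncclCommInitRank(&comm, nRanks, id, myRank);

  /* ... collectives, and on merge failure the re-initialization described above ... */

  ncclCommDestroy(comm);
  MPI_Finalize();
  return 0;
}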