Introduction to MPI

MPI comes up constantly in distributed systems, so here is a quick pass through the basic usage, written up as notes.

Tutorial

Communicator: a communicator defines a group of processes that can send messages to one another. Each process in the group is assigned a number, called its rank.

Point-to-point communication

The sender first writes the data it wants to send into a buffer; the buffer can be a region of memory pointed to by a pointer of some MPI_Datatype, and that pointer is cast to void * when calling MPI_Send.

MPI_Send(
    void* data,
    int count,
    MPI_Datatype datatype,
    int destination,
    int tag,
    MPI_Comm communicator)
MPI_Recv(
    void* data,
    int count,
    MPI_Datatype datatype,
    int source,
    int tag,
    MPI_Comm communicator,
    MPI_Status* status)

Notes

  • The source of MPI_Recv can be the wildcard MPI_ANY_SOURCE, which accepts a message from any sender; MPI_Send has no such wildcard, so destination must always name a specific rank.
  • If the status from MPI_Recv is not needed, pass MPI_STATUS_IGNORE for it.

If an MPI_Status argument is supplied to MPI_Recv (say one named stat), it gets filled in with information about the message, mainly these three pieces:

  • The rank of the sender: accessed via stat.MPI_SOURCE;
  • The tag of the message: stat.MPI_TAG;
  • The length of the message: this cannot be read directly from a field of stat; it must be retrieved with the following call:
MPI_Get_count(
    MPI_Status* status,
    MPI_Datatype datatype,
    int* count)

The count variable is the total number of datatype elements that were received.
This raises a question: why are these three pieces of information needed?

  • Doesn't MPI_Recv already have a count parameter?
    In fact, the count in MPI_Recv is the maximum number of elements of type datatype to receive, while the count obtained from the MPI_Status is the number of elements actually received.
  • Doesn't MPI_Recv already have a tag parameter?
    In ordinary use tag is a fixed value, but MPI_ANY_TAG can be passed to accept a message with any tag; then the only way to tell which tag a received message actually carried is the information in the status.
  • Similarly, MPI_Recv can specify MPI_ANY_SOURCE to accept a message from any sender; then the only way to tell which sender a message came from is again the status.

Since calling MPI_Recv requires providing a buffer to hold the incoming message, and we often do not know how large the message is before actually receiving it, we can first call MPI_Probe to inspect the pending message, and only then call MPI_Recv to actually receive it.

MPI_Probe(
    int source,
    int tag,
    MPI_Comm comm,
    MPI_Status* status)

Collective communication

MPI_Barrier(MPI_Comm communicator): this is exactly the barrier from the BSP model.

One final note about synchronization: always remember that every collective call you make is synchronized. In other words, if you cannot get all processes to complete MPI_Barrier, you cannot complete any collective call either. If you call MPI_Barrier without making sure that every process also calls it, the program will sit idle. This is confusing for beginners, so watch out for this kind of problem.

MPI_Bcast

MPI_Bcast(
    void* data,
    int count,
    MPI_Datatype datatype,
    int root,
    MPI_Comm communicator)

Sender and receivers alike invoke the very same MPI_Bcast. This differs from point-to-point communication.

Question: what is the purpose of the second MPI_Barrier in compare_bcast.c?

MPI Scatter, Gather, and Allgather

Differences between MPI_Bcast and MPI_Scatter:

  • Bcast sends the same data to the other processes, while Scatter sends different slices of the data to different processes, i.e. each process obtains only a part of the data;
  • As for who receives: logically both operations cover every process in the communicator, root included, but Bcast provides only a single data pointer because the root already holds the data in that buffer, so the same pointer serves as the send buffer on the root and as the receive buffer on every other process. Scatter, by contrast, provides two buffers (send_data and recv_data), precisely because the root must both send out the slices and receive its own slice.
MPI_Scatter(
    void* send_data,
    int send_count,
    MPI_Datatype send_datatype,
    void* recv_data,
    int recv_count,
    MPI_Datatype recv_datatype,
    int root,
    MPI_Comm communicator)

MPI Reduce and Allreduce

MPI_Reduce(
    void* send_data,
    void* recv_data,
    int count,
    MPI_Datatype datatype,
    MPI_Op op,
    int root,
    MPI_Comm communicator)
MPI_Allreduce(
    void* send_data,
    void* recv_data,
    int count,
    MPI_Datatype datatype,
    MPI_Op op,
    MPI_Comm communicator)

MPI_Allreduce relates to MPI_Reduce the way MPI_Allgather relates to MPI_Gather: plain MPI_Gather places the result in a single process, whereas MPI_Allgather returns the result to all processes, so every process can access it. Likewise, MPI_Allreduce makes the reduction result accessible to every process.

Groups and Communicators

The applications so far either talk to one process or talk to all the processes, using only the single default communicator. As programs grow larger, it may be necessary to communicate with only a subset of the processes, so groups are introduced, each group corresponding to a communicator. How, then, are multiple communicators created?

MPI_Comm_split(
    MPI_Comm comm,
    int color,
    int key,
    MPI_Comm* newcomm)

MPI_Comm_split creates new communicators by 「splitting」 a communicator into a group of sub-communicators based on the input values color and key.

The first argument, comm, is the communicator that will be used as the basis for the new communicators. This could be MPI_COMM_WORLD, but it could be any other communicator as well.

The second argument, color, determines to which new communicator each process will belong. All processes which pass in the same value for color are assigned to the same communicator. If the color is MPI_UNDEFINED, that process won’t be included in any of the new communicators.

The third argument, key, determines the ordering (rank) within each new communicator. The process which passes in the smallest value for key will be rank 0, the next smallest will be rank 1, and so on. If there is a tie, the process that had the lower rank in the original communicator will be first.

When you print things out in an MPI program, each process has to send its output back to the place where you launched your MPI job before it can be printed to the screen. This tends to mean that the ordering gets jumbled so you can’t ever assume that just because you print things in a specific rank order, that the output will actually end up in the same order you expect. The output was just rearranged here to look nice.

MPI has a limited number of objects that it can create at a time and not freeing your objects could result in a runtime error if MPI runs out of allocatable objects.

Additional questions

  1. When initializing MPI: MPI_Init or MPI_Init_thread?
int MPI_Init_thread(int *argc, char ***argv, int required, int *provided)
int MPI::Init_thread(int& argc, char**& argv, int required) 
int MPI::Init_thread(int required)
  • argc and argv are optional: in C this is done by passing NULL for them; in C++ it is handled by the two extra overloads of MPI::Init_thread.
  • required: the desired level of thread support. Possible values:

    • MPI_THREAD_SINGLE: Only one thread will execute.
    • MPI_THREAD_FUNNELED: The process may be multi-threaded, but only the main thread will make MPI calls (all MPI calls are "funneled" to the main thread).
    • MPI_THREAD_SERIALIZED: The process may be multi-threaded, and multiple threads may make MPI calls, but only one at a time: MPI calls are not made concurrently from two distinct threads (all MPI calls are "serialized").
    • MPI_THREAD_MULTIPLE: Multiple threads may call MPI, with no restrictions.
  • The call returns in provided information about the actual level of thread support that will be provided by MPI. It can be one of the four values listed above.

Vendors may provide (implementation dependent) means to specify the level(s) of thread support available when the MPI program is started, e.g., with arguments to mpiexec. This will affect the outcome of calls to MPI_INIT and MPI_INIT_THREAD.

Suppose, for example, that an MPI program has been started so that only MPI_THREAD_MULTIPLE is available. Then MPI_INIT_THREAD will return provided = MPI_THREAD_MULTIPLE, irrespective of the value of required; a call to MPI_INIT will also initialize the MPI thread support level to MPI_THREAD_MULTIPLE.

Suppose, on the other hand, that an MPI program has been started so that all four levels of thread support are available. Then, a call to MPI_INIT_THREAD will return provided = required; on the other hand, a call to MPI_INIT will initialize the MPI thread support level to MPI_THREAD_SINGLE. In other words, when required is MPI_THREAD_SINGLE, MPI_Init_thread has the same effect as MPI_Init.

https://www.mcs.anl.gov/research/projects/mpi/mpi-standard/mpi-report-2.0/node165.htm
