[Parallel Computing] Distributed-Memory Programming with MPI (Part 2)

Date: 2019-12-12


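The previous post covered the basics of writing a parallel program with MPI. In its final example, process 0 did all of the work of the global sum while the other processes sat idle. That may be acceptable in some special situation, but it is clearly not the best scheme. For example, what happens if the even-numbered processes share the work of collecting and adding partial sums, as in the tree-structured scheme of the accompanying figure?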
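The tree-structured scheme just described can be simulated serially to see how much work process 0 ends up doing. The sketch below is not MPI code: tree_sum is a hypothetical helper that treats each array slot as one process's partial sum and, at every stage, lets the lower-ranked partner of each pair absorb its neighbour's value.

```c
#include <assert.h>

/* Serial simulation of a tree-structured global sum (hypothetical helper,
 * not an MPI call). vals[r] holds the partial sum of "process" r; on return
 * vals[0] holds the total. Returns the number of values process 0 received. */
int tree_sum(double *vals, int comm_sz)
{
    assert(comm_sz > 0);
    int root_receives = 0;

    /* Stage with stride s: every rank that is a multiple of 2*s
     * receives from rank + s, if that partner exists. */
    for (int stride = 1; stride < comm_sz; stride *= 2) {
        for (int r = 0; r + stride < comm_sz; r += 2 * stride) {
            vals[r] += vals[r + stride];
            if (r == 0)
                root_receives++;
        }
    }
    return root_receives;
}
```

Called with 8 partial sums the function reports 3 receives on process 0, and with 1024 it reports 10, one per stage of the tree.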

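Compared with the earlier figure, the total amount of work is unchanged, but process 0 now performs only 3 receives and 3 additions instead of 7 of each. If all processes start at the same time, the time for the global sum is the time process 0 spends receiving and adding, so the new scheme cuts the total time by more than 50%. With 1024 processes, the original scheme needs 1023 receives and additions on process 0, while the new one needs only 10, an improvement of roughly a factor of 100. Since changing the pattern of sends and receives among processes improves performance this much, we are led to collective communication among a set of processes. MPI already packages these patterns for the long-suffering programmer, who can then stop optimizing communication endlessly and concentrate on the application itself. We start by reworking the integration example from the previous post: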

#include <stdio.h>
#include <mpi.h>

/* Trapezoidal-rule helper from the previous post (defined elsewhere) */
double Trap(double local_a, double local_b, int local_n, double h);

int main(int argc, char* argv[])
{
    int my_rank = 0, comm_sz = 0, n = 1024, local_n = 0;
    double a = 0.0, b = 3.0, h = 0, local_a = 0, local_b = 0;
    double local_int = 0, total_int = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &comm_sz);

    h = (b - a) / n;       /* h is the same for all processes */
    local_n = n / comm_sz; /* So is the number of trapezoids (assumes comm_sz divides n) */

    /* Each process integrates its own sub-interval [local_a, local_b] */
    local_a = a + my_rank*local_n*h;
    local_b = local_a + local_n*h;
    local_int = Trap(local_a, local_b, local_n, h);

    /* Sum every process's local_int into total_int on process 0 */
    MPI_Reduce(&local_int, &total_int, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (my_rank == 0)
    {
        printf("With n = %d trapezoids, our estimate\n", n);
        printf("of the integral from %f to %f = %.15e\n", a, b, total_int);
    }
    MPI_Finalize();
    return 0;
}

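Note that the MPI_Reduce program above no longer calls point-to-point functions such as MPI_Send and MPI_Recv; it calls a single function, MPI_Reduce, instead. Compiled and run, it produces the same result as before.

The reader may well ask: what exactly is MPI_Reduce, and how is it used? To answer that, we need to look more closely at the concept of collective communication.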
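One loose end first: the program calls Trap, which comes from the previous post and is not shown here. A minimal sketch of what such a helper might look like follows. The integrand f(x) = x*x is purely an assumption for illustration, since this post does not say which function is being integrated.

```c
#include <assert.h>

/* Hypothetical integrand; the post does not show which f it uses. */
double f(double x)
{
    return x * x;
}

/* Trapezoidal-rule estimate of the integral of f over [left_endpt, right_endpt],
 * using trap_count trapezoids, each of width base_len. */
double Trap(double left_endpt, double right_endpt, int trap_count, double base_len)
{
    assert(trap_count > 0);

    /* The two end points carry weight 1/2, interior points weight 1 */
    double estimate = (f(left_endpt) + f(right_endpt)) / 2.0;
    for (int i = 1; i < trap_count; i++)
        estimate += f(left_endpt + i * base_len);

    return estimate * base_len;
}
```

With this assumed integrand, the exact integral from 0 to 3 is 9, so the estimate with 1024 trapezoids should land very close to 9.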

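1. Collective communication

In MPI, communication functions that involve all the processes in a communicator are called collective communication. Communication between one process and another, using functions such as MPI_Send and MPI_Recv, is called point-to-point communication. The possible communication relationships between processes can be classified as follows: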

(1) one to one

(2) one to some

(3) one to all

(4) some to one

(5) some to some

(6) some to all

(7) all to one

(8) all to some

(9) all to all

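Having drawn the distinction, how do collective and point-to-point communication differ? Collective communication has the following characteristics:

(1) All processes in the communicator must call the same collective function.

(2) The arguments each process passes to the collective function must be "compatible". For example, every process must pass the same root.

(3) Point-to-point calls are matched by tag and communicator. Collective calls do not use tags; they are matched solely by communicator and by the order in which they are called.

The table below summarizes the collective communication functions in MPI: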

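1.1 Reduction

A reduction collects data from every process, combines it into a single value, and stores that value on the root process. It is like a math teacher collecting exam papers: every student hands in a paper, and the teacher then performs some operation over them (finding the maximum, computing the total, and so on). As shown in the figure:

MPI_Reduce function:

int MPI_Reduce (void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, int root, MPI_Comm comm)

The key parameter of this function is the fifth one, MPI_Op op, which names the reduction operator. The example above used the summation operator, MPI_SUM. The available reduction operators are listed in the table below: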

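Operator	Description	Operator	Description
MPI_MAX	Maximum	MPI_LOR	Logical OR
MPI_MIN	Minimum	MPI_BOR	Bitwise OR
MPI_SUM	Sum	MPI_LXOR	Logical exclusive OR
MPI_PROD	Product	MPI_BXOR	Bitwise exclusive OR
MPI_LAND	Logical AND	MPI_MINLOC	Global minimum together with an index attached to it; can be used to find the rank of the process that holds the minimum
MPI_BAND	Bitwise AND	MPI_MAXLOC	Global maximum together with an index attached to it; can be used to find the rank of the process that holds the maximum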
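MPI_MINLOC and MPI_MAXLOC are the least obvious entries in the table, because they reduce value/index pairs rather than plain values. The serial sketch below imitates what an MPI_MINLOC reduction yields when each process contributes one double together with its own rank. The struct mirrors the pair layout that MPI associates with MPI_DOUBLE_INT; the function minloc is a hypothetical helper, not an MPI call.

```c
#include <assert.h>

/* Value/rank pair, laid out like the pair type MPI_DOUBLE_INT describes */
struct double_int {
    double value;
    int    rank;
};

/* Serial imitation of an MPI_MINLOC reduction: vals[r] is the value
 * contributed by "process" r, and the attached index is r itself. */
struct double_int minloc(const double *vals, int np)
{
    assert(np > 0);
    struct double_int best = { vals[0], 0 };

    for (int r = 1; r < np; r++) {
        /* Strict '<' keeps the lowest rank when several values tie,
         * which is also how MPI_MINLOC resolves ties. */
        if (vals[r] < best.value) {
            best.value = vals[r];
            best.rank  = r;
        }
    }
    return best;
}
```

For the inputs 4.0, 1.5, 7.0, 1.5 the result is the pair (1.5, 1): the minimum value and the lowest rank that holds it.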

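Besides MPI_Reduce, reduction has the following variants:

MPI_Allreduce function

int MPI_Allreduce (void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, MPI_Comm comm)

This function computes the reduction and then distributes the result to every process, so all processes in the communicator end up knowing the result. That is why it has no root parameter. A similar diagram for distributing the result of a sum is shown below: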

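MPI_Reduce_scatter function

int MPI_Reduce_scatter (void *sendbuf, void *recvbuf, int *recvcnts, MPI_Datatype datatype, MPI_Op op, MPI_Comm comm)

Reduce-scatter. The effect is that of a reduction followed by a scatter of the reduced result: the element-wise reduction is split into segments, and process i receives the segment of recvcnts[i] elements.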
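The data flow of a reduce-scatter is easier to see in a serial imitation. In the sketch below, which is not MPI code, reduce_scatter_sum is a hypothetical helper using summation as the operator. The array send stores the sendbuf of each of np processes one after another, reduced receives the element-wise sums, and offsets[r] records where the segment destined for process r begins.

```c
#include <assert.h>

/* Serial imitation of MPI_Reduce_scatter with a sum operator (hypothetical helper).
 * send:     np rows of length total = recvcnts[0] + ... + recvcnts[np-1];
 *           row r plays the role of process r's sendbuf.
 * reduced:  output, the element-wise sum of all rows (length total).
 * offsets:  output, offsets[r] is where process r's segment of
 *           recvcnts[r] elements starts inside reduced. */
void reduce_scatter_sum(const int *send, int *reduced, int *offsets,
                        const int *recvcnts, int np)
{
    assert(np > 0);

    /* Segment boundaries: a running sum of the receive counts */
    int total = 0;
    for (int r = 0; r < np; r++) {
        offsets[r] = total;
        total += recvcnts[r];
    }

    /* The "reduce" part: combine element k of every process's buffer */
    for (int k = 0; k < total; k++) {
        int sum = 0;
        for (int r = 0; r < np; r++)
            sum += send[r * total + k];
        reduced[k] = sum;
    }
}
```

With two processes holding {1, 2, 3} and {10, 20, 30} and recvcnts = {1, 2}, the reduced vector is {11, 22, 33}; process 0 would receive {11} and process 1 would receive {22, 33}.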

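MPI_Scan function

int MPI_Scan (void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, MPI_Comm comm)

Prefix reduction (also called scan). It resembles the all-reduce MPI_Allreduce, except that each process obtains a partial result: process i receives the reduction of the values from processes 0 through i, inclusive.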
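As a quick illustration of the prefix semantics, the following serial sketch computes what each process would hold after an MPI_Scan with a sum operator and count = 1. The function scan_sum is a hypothetical helper, not an MPI call; send[r] stands for the value contributed by process r and recv[r] for the value it gets back.

```c
#include <assert.h>

/* Serial imitation of MPI_Scan with a sum operator (hypothetical helper):
 * recv[r] = send[0] + send[1] + ... + send[r]. */
void scan_sum(const int *send, int *recv, int np)
{
    assert(np > 0);
    int running = 0;
    for (int r = 0; r < np; r++) {
        running += send[r];   /* include process r's own contribution */
        recv[r] = running;
    }
}
```

For contributions 1, 2, 3, 4 the four processes end up with 1, 3, 6 and 10. Only the last process holds the full sum that MPI_Allreduce would give to everyone.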

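1.2 Data movement - broadcast

A collective communication in which data belonging to one process is sent to all processes in the communicator is called a broadcast. As shown in the figure:

MPI_Bcast function:

int MPI_Bcast (void *buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm)

The process with rank root in communicator comm (the root process) sends the contents of its buffer to all other processes in the communicator. The parameters buffer, count and datatype have the same meaning as in the point-to-point functions such as MPI_Send and MPI_Recv.

Here is a concrete example: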

#include <cstdio>
#include <mpi.h>

// Minimal declaration so the example compiles; the original class definition is not shown
class blog3 {
public:
    void TestForMPI_Bcast(int argc, char* argv[]);
};

void blog3::TestForMPI_Bcast(int argc, char* argv[])
{
    int rankID, totalNumTasks;

    MPI_Init(&argc, &argv);
    MPI_Barrier(MPI_COMM_WORLD);          // start timing from a common point
    double elapsed_time = -MPI_Wtime();

    MPI_Comm_rank(MPI_COMM_WORLD, &rankID);
    MPI_Comm_size(MPI_COMM_WORLD, &totalNumTasks);

    int sendRecvBuf[3] = { 0, 0, 0 };

    // Only the root fills the buffer; everyone else receives it
    if (!rankID) {
        sendRecvBuf[0] = 3;
        sendRecvBuf[1] = 6;
        sendRecvBuf[2] = 9;
    }

    int count = 3;
    int root = 0;
    // Every process calls MPI_Bcast; on return, all of them hold the root's data
    MPI_Bcast(sendRecvBuf, count, MPI_INT, root, MPI_COMM_WORLD);

    printf("my rankID = %d, sendRecvBuf = {%d, %d, %d}\n", rankID, sendRecvBuf[0], sendRecvBuf[1], sendRecvBuf[2]);

    elapsed_time += MPI_Wtime();
    if (!rankID) {
        printf("total elapsed time = %10.6f\n", elapsed_time);
    }

    MPI_Finalize();
}

int main(int argc, char* argv[])
{
    blog3 test;
    test.TestForMPI_Bcast(argc, argv);
    return 0;
}

The output is:

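1.3 Data movement - scatter

Numerical software often has to add two vectors. Suppose each vector has 10,000 components and there are 10 processes. We can simply give each process a block of local_n = 1,000 components. There are several ways to split the data (block partitioning, cyclic partitioning, and block-cyclic partitioning). Sending the pieces of the data to the processes so that they can compute in parallel is called a scatter.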
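The three partitioning schemes named above differ only in which process owns a given component. The following sketch spells out the owner of global index i under each scheme. The function names are hypothetical, and the block version assumes for simplicity that the number of processes divides the vector length evenly.

```c
#include <assert.h>

/* Block partition: process r owns the contiguous indices
 * r*(n/np) ... (r+1)*(n/np) - 1. Assumes np divides n. */
int block_owner(int i, int n, int np)
{
    assert(np > 0 && n % np == 0);
    return i / (n / np);
}

/* Cyclic partition: indices are dealt out one at a time, round-robin. */
int cyclic_owner(int i, int np)
{
    assert(np > 0);
    return i % np;
}

/* Block-cyclic partition: blocks of `block` consecutive indices
 * are dealt out round-robin. */
int block_cyclic_owner(int i, int block, int np)
{
    assert(np > 0 && block > 0);
    return (i / block) % np;
}
```

With 10,000 components and 10 processes, component 2500 belongs to process 2 under the block scheme, to process 0 under the cyclic scheme, and to process 5 under a block-cyclic scheme with blocks of 100. MPI_Scatter on a contiguous buffer corresponds to the block scheme.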

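MPI_Scatter function:

int MPI_Scatter (void *sendbuf, int sendcnt, MPI_Datatype sendtype, void *recvbuf, int recvcnt, MPI_Datatype recvtype, int root, MPI_Comm comm)

Scatters blocks of equal length. The root process splits its sendbuf into np contiguous blocks and sends them, in rank order, to the recvbuf of every process in comm (including the root itself), where np is the number of processes in comm. sendcnt and sendtype give the size and type of each block in sendbuf; recvcnt and recvtype give the size and type of recvbuf. The parameters sendbuf, sendcnt and sendtype are significant only at the root. Note in particular that, at the root, sendcnt is the amount of data sent to each process, not the total sent to all processes. Consequently, when recvtype equals sendtype, recvcnt should equal sendcnt.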

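MPI_Scatterv function:

int MPI_Scatterv (void *sendbuf, int *sendcnts, int *displs, MPI_Datatype sendtype, void *recvbuf, int recvcnt, MPI_Datatype recvtype, int root, MPI_Comm comm)

Scatters blocks of different lengths. Like MPI_Scatter, but the blocks in sendbuf may differ in length and may be laid out in any order. sendbuf, sendtype, sendcnts and displs are significant only at the root. The arrays sendcnts and displs each have as many elements as there are processes in comm; they give the length and the displacement of the data sent to each process, both in units of sendtype.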
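A common reason to reach for MPI_Scatterv is a vector whose length is not a multiple of the number of processes. The sketch below shows one way the root might fill the sendcnts and displs arrays in that case, giving the first n % np processes one extra element each. The helper name block_counts_displs is hypothetical, and this particular split is only one reasonable choice.

```c
#include <assert.h>

/* Fill counts[] and displs[] (each of length np) for splitting n elements
 * into np nearly equal contiguous blocks (hypothetical helper).
 * The first n % np blocks get one extra element. */
void block_counts_displs(int n, int np, int *counts, int *displs)
{
    assert(np > 0 && n >= 0);
    int base   = n / np;   /* minimum block length */
    int extra  = n % np;   /* number of blocks that get one more */
    int offset = 0;

    for (int r = 0; r < np; r++) {
        counts[r] = base + (r < extra ? 1 : 0);
        displs[r] = offset;        /* start of block r, in elements */
        offset += counts[r];
    }
}
```

For n = 10 and np = 4 this gives counts {3, 3, 2, 2} and displacements {0, 3, 6, 8}. The root would pass these arrays as sendcnts and displs, and each process would pass its own count as recvcnt. The same arrays serve as recvcnts and displs for the variable-length gather functions.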

Let's look at an example:

#include <cstdio>
#include <mpi.h>

#define SIZE 4

// Minimal declaration so the example compiles; the original class definition is not shown
class blog3 {
public:
    void TestForMPI_Scatter(int argc, char* argv[]);
};

void blog3::TestForMPI_Scatter(int argc, char* argv[])
{
    int totalNumTasks, rankID;

    // 16 floats stored contiguously, row by row; only the root's copy is actually sent
    float sendBuf[SIZE][SIZE] = {
        { 1.0,   2.0,    3.0,    4.0 },
        { 5.0,   6.0,    7.0,    8.0 },
        { 9.0,   10.0,   11.0,   12.0 },
        { 13.0,  14.0,   15.0,   16.0 }
    };

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rankID);
    MPI_Comm_size(MPI_COMM_WORLD, &totalNumTasks);

    if (totalNumTasks == SIZE) {
        // 4 processes: each one receives a full row of 4 floats
        int source = 0;
        int sendCount = SIZE;
        int recvCount = SIZE;
        float recvBuf[SIZE];
        // scatter data from the source process to all processes in MPI_COMM_WORLD
        MPI_Scatter(sendBuf, sendCount, MPI_FLOAT,
            recvBuf, recvCount, MPI_FLOAT, source, MPI_COMM_WORLD);

        printf("my rankID = %d, receive Results: %f %f %f %f, total = %f\n",
            rankID, recvBuf[0], recvBuf[1], recvBuf[2], recvBuf[3],
            recvBuf[0] + recvBuf[1] + recvBuf[2] + recvBuf[3]);
    }
    else if (totalNumTasks == 2 * SIZE) {
        // 8 processes: 16 floats / 8 = 2 floats (half a row) each
        int source = 0;
        int sendCount = 2;
        int recvCount = 2;
        float recvBuf[2];

        MPI_Scatter(sendBuf, sendCount, MPI_FLOAT,
            recvBuf, recvCount, MPI_FLOAT, source, MPI_COMM_WORLD);

        printf("my rankID = %d, receive result: %f %f, total = %f\n",
            rankID, recvBuf[0], recvBuf[1], recvBuf[0] + recvBuf[1]);
    }
    else if (rankID == 0) {
        // Print the usage hint once rather than from every process
        printf("Please specify -n %d or -n %d\n", SIZE, 2 * SIZE);
    }

    MPI_Finalize();
}

int main(int argc, char* argv[])
{
    blog3 test;

    test.TestForMPI_Scatter(argc, argv);

    return 0;
}

The output is:

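1.4 Data movement - gather

MPI_Gather function:

int MPI_Gather (void *sendbuf, int sendcnt, MPI_Datatype sendtype, void *recvbuf, int recvcnt, MPI_Datatype recvtype, int root, MPI_Comm comm)

Gathers blocks of equal length. With root as the root process, every process (including the root itself) sends the block in its sendbuf to the root, which stores the blocks in recvbuf in rank order. The types and lengths of the data sent and received must match, that is, the send and receive datatypes must have the same type signature. The parameters recvbuf, recvcnt and recvtype are significant only at the root. Note in particular that, at the root, recvcnt is the amount of data received from each process, not the total received from all processes. Consequently, when sendtype equals recvtype, sendcnt should equal recvcnt.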

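MPI_Allgather function:

int MPI_Allgather (void *sendbuf, int sendcnt, MPI_Datatype sendtype, void *recvbuf, int recvcnt, MPI_Datatype recvtype, MPI_Comm comm)

MPI_Allgather is like MPI_Gather, except that every process gathers the data into its own recvbuf, hence the name all-gather. It is equivalent to calling MPI_Gather once for each process in comm as root, or to a single gather with any process as root followed by a broadcast of the gathered data.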

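MPI_Gatherv function:

int MPI_Gatherv (void *sendbuf, int sendcnt, MPI_Datatype sendtype, void *recvbuf, int *recvcnts, int *displs, MPI_Datatype recvtype, int root, MPI_Comm comm)

Gathers blocks of different lengths. Like MPI_Gather, but each process may send a block of a different length, and the root may place the blocks anywhere in recvbuf. recvbuf, recvtype, recvcnts and displs are significant only at the root. The arrays recvcnts and displs each have as many elements as there are processes; they specify the length of the block received from each process and its displacement in recvbuf, both in units of recvtype.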

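MPI_Allgatherv function:

int MPI_Allgatherv (void *sendbuf, int sendcnt, MPI_Datatype sendtype, void *recvbuf, int *recvcnts, int *displs, MPI_Datatype recvtype, MPI_Comm comm)

All-gather of blocks of different lengths. The parameters are similar to those of MPI_Gatherv. It is equivalent to calling MPI_Gatherv once for each process in comm as root, or to a single MPI_Gatherv with any process as root followed by a broadcast of the gathered data.

Example: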

#include <cstdio>
#include <cstdlib>
#include <mpi.h>

// Minimal declaration so the example compiles; the original class definition is not shown
class blog3 {
public:
    void TestForMPI_Gather(int argc, char* argv[]);
};

void blog3::TestForMPI_Gather(int argc, char* argv[])
{
    int rankID, totalNumTasks;

    MPI_Init(&argc, &argv);
    MPI_Barrier(MPI_COMM_WORLD);          // start timing from a common point
    double elapsed_time = -MPI_Wtime();

    MPI_Comm_rank(MPI_COMM_WORLD, &rankID);
    MPI_Comm_size(MPI_COMM_WORLD, &totalNumTasks);

    // One slot per process; only the root's buffer is actually filled
    int* gatherBuf = (int *)malloc(sizeof(int) * totalNumTasks);
    if (gatherBuf == NULL) {
        fprintf(stderr, "malloc error!\n");
        MPI_Abort(MPI_COMM_WORLD, 1);     // terminate every process in the communicator
    }

    int sendBuf = rankID; // each process sends its own rankID

    int sendCount = 1;
    int recvCount = 1;                    // count received from EACH process, not the total
    int root = 0;
    MPI_Gather(&sendBuf, sendCount, MPI_INT, gatherBuf, recvCount, MPI_INT, root, MPI_COMM_WORLD);

    elapsed_time += MPI_Wtime();
    if (!rankID) {
        int i;
        for (i = 0; i < totalNumTasks; i++) {
            printf("gatherBuf[%d] = %d, ", i, gatherBuf[i]);
        }
        putchar('\n');
        printf("total elapsed time = %10.6f\n", elapsed_time);
    }

    free(gatherBuf);
    MPI_Finalize();
}

int main(int argc, char* argv[])
{
    blog3 test;

    test.TestForMPI_Gather(argc, argv);

    return 0;
}

The output is:

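1.5 Data movement - other

MPI_Alltoall function:

int MPI_Alltoall (void *sendbuf, int sendcnt, MPI_Datatype sendtype, void *recvbuf, int recvcnt, MPI_Datatype recvtype, MPI_Comm comm)

All-to-all exchange of equal-length blocks: process i sends block j of its sendbuf to position i in the recvbuf of process j, for i, j = 0, ..., np-1 (np is the number of processes in comm). Both sendbuf and recvbuf consist of np contiguous blocks, whose length and type are sendcnt/sendtype and recvcnt/recvtype respectively. The operation amounts to a transpose of the data across processes. For example, if a two-dimensional array is stored across the processes in blocks of rows, calling this function readily redistributes it into blocks of columns.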
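The transpose-like rule "block j of process i goes to position i of process j" can be checked with a serial imitation. In this sketch, which is not MPI code, alltoall_sim is a hypothetical helper: send and recv each hold the buffers of all np processes back to back, so process i's buffer starts at element i * np * cnt.

```c
#include <assert.h>

/* Serial imitation of MPI_Alltoall for int data (hypothetical helper).
 * send and recv each contain np buffers of np blocks of cnt ints,
 * stored one process after another. */
void alltoall_sim(const int *send, int *recv, int np, int cnt)
{
    assert(np > 0 && cnt > 0);
    for (int i = 0; i < np; i++) {          /* sending process */
        for (int j = 0; j < np; j++) {      /* receiving process */
            for (int k = 0; k < cnt; k++) {
                /* block j of process i -> block i of process j */
                recv[(j * np + i) * cnt + k] = send[(i * np + j) * cnt + k];
            }
        }
    }
}
```

With three processes holding {0, 1, 2}, {10, 11, 12} and {20, 21, 22} and cnt = 1, the receive buffers become {0, 10, 20}, {1, 11, 21} and {2, 12, 22}: the rows of the 3 x 3 matrix have turned into its columns.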

MPI_Alltoallv function:

int MPI_Alltoallv (void *sendbuf, int *sendcnts, int *sdispls, MPI_Datatype sendtype, void *recvbuf, int *recvcnts, int *rdispls, MPI_Datatype recvtype, MPI_Comm comm)

All-to-all exchange of blocks of different lengths. Like MPI_Alltoall, but the blocks may differ in length and need not be stored contiguously. For the meaning of the individual parameters, see MPI_Alltoall, MPI_Scatterv and MPI_Gatherv.