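As one of the few ECE courses related to computer science, this one was a must-take for me. I rarely do parallel programming, though, so both OpenMP and CUDA were hard going, and the OpenMP programming in the first assignment already made the difference from sequential code clear. Still, now that single-core performance has more or less hit its ceiling, parallel programming is arguably a basic skill for any programmer, and OpenMP is a very good place to start: simple, easy to understand, and quick to pay off. So let's start the journey here.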
Hello OpenMP
    #include <omp.h>
    #include <iostream>
    using namespace std;

    int main(){
        #pragma omp parallel for
        for (int i = 0; i < 10; ++i)
        {
            cout << i;
        }
        cout << endl;
        return 0;
    }
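The #pragma omp preprocessor directive marks code that uses OpenMP.
#pragma omp parallel for tells the compiler to run the for loop below on multiple threads. The number of threads is chosen at run time, by default usually one per available CPU core, so on a dual-core system the parallel region is executed by two threads by default.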
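Function prototype / Purpose
int omp_get_num_procs(void) / returns the number of processors available to the program
int omp_get_num_threads(void) / returns the number of threads in the current team (1 outside a parallel region)
int omp_get_thread_num(void) / returns the ID of the calling thread within its team, from 0 to (number of threads - 1)
void omp_set_num_threads(int num_threads) / sets the number of threads used by subsequent parallel regions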
    #include <iostream>
    #include <omp.h>
    using namespace std;

    int main(){
        cout << "CPU number: " << omp_get_num_procs() << endl;

        cout << "Parallel area 1: " << endl;

        #pragma omp parallel // the braces below enclose the parallel region
        {
            cout << "Num of threads is: " << omp_get_num_threads();
            cout << "; This thread ID is " << omp_get_thread_num() << endl;
        }

        cout << "Parallel area 2:" << endl;
        omp_set_num_threads(4); // request 4 threads for the following parallel region
        #pragma omp parallel // the braces below enclose the parallel region
        {
            cout << "Num of threads is: " << omp_get_num_threads();
            cout << "; This thread ID is " << omp_get_thread_num() << endl;
        }

        return 0;
    }
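When a loop is parallelized, several threads execute it at the same time, so the order of the iterations is not deterministic. If the iterations have no data dependence on each other, the basic #pragma omp parallel for directive is enough. Two statements S1 and S2 can depend on each other in two ways:
1. S1 accesses memory location L in one iteration and S2 accesses the same location in a later iteration. This is a loop-carried dependence.
2. S1 and S2 access the same memory location L within the same iteration, with S1 executing before S2. This is a loop-independent dependence.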
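To make the two cases concrete, here is a minimal sketch. The function names add_elementwise and prefix_sum are made up for illustration. The first loop has independent iterations and can take #pragma omp parallel for directly; the second reads the value written by the previous iteration (a loop-carried dependence), so it is left sequential.

```cpp
#include <cassert>
#include <vector>

// Independent iterations: c[i] depends only on a[i] and b[i],
// so the loop can be parallelized as it stands.
std::vector<int> add_elementwise(const std::vector<int>& a,
                                 const std::vector<int>& b) {
    assert(a.size() == b.size());
    const int n = static_cast<int>(a.size());
    std::vector<int> c(n);
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        c[i] = a[i] + b[i];
    }
    return c;
}

// Loop-carried dependence: iteration i reads out[i - 1], which is
// written by iteration i - 1. Putting "parallel for" on this loop
// would be a data race, so it stays sequential.
std::vector<int> prefix_sum(const std::vector<int>& in) {
    const int n = static_cast<int>(in.size());
    std::vector<int> out(n);
    for (int i = 0; i < n; ++i) {
        out[i] = in[i] + (i > 0 ? out[i - 1] : 0);
    }
    return out;
}
```

Both functions give the same result whether or not the compiler has OpenMP enabled; only the first one gets faster with more threads.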
Ways to declare a parallel for loop
    #include <iostream>
    #include <omp.h>
    using namespace std;

    int main(){
        // Form 1: a for directive inside a parallel region
        #pragma omp parallel
        {
            #pragma omp for
            for (int i = 0; i < 10; ++i){
                cout << i << endl;
            }
        }

        // Form 2: the combined parallel for directive
        #pragma omp parallel for
        for (int j = 0; j < 10; ++j){
            cout << j << endl;
        }
        return 0;
    }
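Constraints on for-loop parallelization
1. The loop variable must be an integer. OpenMP 2.x (for example the implementation built into MSVC) requires a signed integer, so for (unsigned int i = 0; i < 10; ++i){} fails to compile there; OpenMP 3.0 and later also accept unsigned integers.
2. The comparison operator must be <, <=, > or >=. For example, for (int i = 0; i != 10; ++i){} fails to compile (OpenMP 5.0 and later additionally allow != under some restrictions).
3. The third expression of the for loop must add or subtract an integer, and that value must be loop invariant. The accepted forms are ++i, i++, --i, i--, i += inc, i -= inc, i = i + inc, i = inc + i and i = i - inc. Something like i *= 2 is rejected.
4. If the comparison is < or <=, the loop variable may only increase, and vice versa. For example, for (int i = 0; i < 10; --i) is not a valid parallel loop.
5. The loop must have a single entry and a single exit: no statement inside the loop may jump out of it, with the exception of exit. Exceptions must also be handled inside the loop body. For example, a break or goto that leaves the loop body will fail to compile.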
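As a rough illustration of loops that satisfy these constraints, here is a small sketch with one increasing loop that uses a stride and one decreasing loop. The function names mark_even_indices and reversed are my own, not part of OpenMP.

```cpp
#include <cassert>
#include <vector>

// Increasing loop with a loop-invariant stride: "i < n" paired with "i += 2".
std::vector<int> mark_even_indices(int n) {
    assert(n >= 0);
    std::vector<int> flags(n, 0);
    #pragma omp parallel for
    for (int i = 0; i < n; i += 2) {
        flags[i] = 1;
    }
    return flags;
}

// Decreasing loop: "i >= 0" paired with "--i".
std::vector<int> reversed(const std::vector<int>& in) {
    const int n = static_cast<int>(in.size());
    std::vector<int> out(n);
    #pragma omp parallel for
    for (int i = n - 1; i >= 0; --i) {
        out[n - 1 - i] = in[i];
    }
    return out;
}
```

In both loops the trip count can be computed before the loop starts, which is what lets the runtime divide the iterations among threads.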
Basic for-loop parallelization example
    #include <iostream>
    #include <omp.h>

    int main(){
        int a[10] = {1}; // only a[0] is 1, the remaining elements are 0
        int b[10] = {2}; // only b[0] is 2, the remaining elements are 0
        int c[10] = {0};

        #pragma omp parallel
        {
            #pragma omp for
            for (int i = 0; i < 10; ++i){
                // c[i] depends only on a[i] and b[i]
                c[i] = a[i] + b[i];
            }
        }

        return 0;
    }
Nested for-loop parallelization example
    #include <omp.h>

    int main(){
        int a[10][5] = {1};
        int b[10][5] = {2};
        int c[10][5] = {3};

        #pragma omp parallel
        {
            #pragma omp for
            for (int i = 0; i < 10; ++i){
                for (int j = 0; j < 5; ++j){
                    // c[i][j] depends only on a[i][j] and b[i][j]
                    c[i][j] = a[i][j] + b[i][j];
                }
            }
        }

        return 0;
    }

Only the outer loop is divided among the threads. On a dual-core CPU, with two threads and the default schedule (static in most implementations), the first thread typically executes:
    for (int i = 0; i < 5; ++i){
        for (int j = 0; j < 5; ++j){
            // c[i][j] depends only on a[i][j] and b[i][j]
            c[i][j] = a[i][j] + b[i][j];
        }
    }

and the second thread executes:
    for (int i = 5; i < 10; ++i){
        for (int j = 0; j < 5; ++j){
            // c[i][j] depends only on a[i][j] and b[i][j]
            c[i][j] = a[i][j] + b[i][j];
        }
    }
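Inside a parallel region, the following variables are private to each thread:
1. Variables defined inside the parallel region
2. The loop variable of a loop that is divided among the threads
3. Variables listed in a private, firstprivate, lastprivate or reduction clause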
    #include <iostream>
    #include <omp.h>
    using namespace std;

    int main(){
        int share_a = 0; // shared variable
        int share_to_private_b = 1; // becomes private inside the loop construct because of the private clause

        #pragma omp parallel
        {
            int private_c = 2; // defined inside the parallel region, so private

            #pragma omp for private(share_to_private_b)
            for (int i = 0; i < 10; ++i) // the loop variable is private; with two threads, one runs 0~4 and the other 5~9
                cout << i << endl;

        }

        return 0;
    }
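Clause / Purpose
private(val1, val2, ...) / the variables are private in the parallel region, i.e. every thread has its own (uninitialized) copy
firstprivate(val1, val2, ...) / like private, but each thread's copy is initialized with the value the variable had before the region
lastprivate(val1, val2, ...) / like private, but the private value from the sequentially last loop iteration is copied back to the original variable
shared(val1, val2, ...) / declares the variables as shared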
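The difference between firstprivate and lastprivate is easier to see in a small example. This is only a sketch; add_offset and last_square are illustrative names I picked.

```cpp
#include <cassert>
#include <vector>

// firstprivate: every thread gets its own copy of `offset`,
// initialized with the value the caller passed in.
std::vector<int> add_offset(const std::vector<int>& in, int offset) {
    const int n = static_cast<int>(in.size());
    std::vector<int> out(n);
    #pragma omp parallel for firstprivate(offset)
    for (int i = 0; i < n; ++i) {
        out[i] = in[i] + offset;
    }
    return out;
}

// lastprivate: after the loop, `last` holds the value assigned in the
// sequentially last iteration (i == n - 1), whichever thread ran it.
int last_square(int n) {
    assert(n > 0);
    int last = -1;
    #pragma omp parallel for lastprivate(last)
    for (int i = 0; i < n; ++i) {
        last = i * i;
    }
    return last;
}
```

With a plain private(offset) the copies inside the loop would start uninitialized, and with a plain private(last) the value computed in the loop would not be visible after it.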
Using reduction
    #include <iostream>
    #include <stdio.h>
    #include <omp.h>
    using namespace std;

    int main(){
        int sum = 0;
        cout << "Before: " << sum << endl;

        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < 10; ++i){
            sum = sum + i;
            printf("%d\n", sum); // prints the running value of this thread's private copy
        }

        cout << "After: " << sum << endl;

        return 0;
    }
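Here sum is a shared variable. With reduction(+: sum), each thread accumulates into its own private copy of sum, and the per-thread copies are added together at the end. The reduction clause therefore does two things:
1. It guarantees that the updates of sum behave atomically, because each thread only touches its own copy.
2. It combines the results of the threads using the operator named in the clause. Taking addition as an example:
Suppose sum is initially 10. Inside the region declared with reduction(+: sum), each thread's copy of sum starts at 0 (as the specification requires). When the parallel work ends, the original value 10 and the sums computed by all threads are added together.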
reduction(operator: var1, var2, …)
    Operator   Data type                 Default initial value
    +          integer, floating point   0
    -          integer, floating point   0
    *          integer, floating point   1
    &          integer                   all bits set to 1
    |          integer                   0
    ^          integer                   0
    &&         integer                   1
    ||         integer                   0
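A quick sketch of how the initial values play out in practice, assuming nothing beyond the rules above (the function names are hypothetical). With +, the private copies start at 0 and the value the variable had before the loop is added in at the end; with *, the private copies start at 1.

```cpp
#include <cassert>

// Adds 0 + 1 + ... + (n - 1) on top of `initial`.
// Each thread's private `sum` starts at 0; `initial` is folded in
// once when the per-thread results are combined.
int sum_with_initial(int initial, int n) {
    assert(n >= 0);
    int sum = initial;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; ++i) {
        sum += i;
    }
    return sum;
}

// Multiplies 1 * 2 * ... * n. Each thread's private `prod` starts at 1.
long long product_up_to(int n) {
    assert(n >= 0);
    long long prod = 1;
    #pragma omp parallel for reduction(*:prod)
    for (int i = 1; i <= n; ++i) {
        prod *= i;
    }
    return prod;
}
```

For example, sum_with_initial(10, 10) returns 55 (10 plus 0 through 9) no matter how many threads take part.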
Thread synchronization: atomic
    #pragma omp atomic
    x <op>= expr;   // <op> is one of + * - / & ^ | << >>, e.g. x <<= 1; or x *= 2;

or

    #pragma omp atomic
    x++;            // or x--, --x, ++x
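atomic can protect only two kinds of statements:
1. Increment and decrement operations
2. x <op>= expr, with one of the operators listed above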
    #include <iostream>
    #include <omp.h>
    using namespace std;

    int main(){
        int sum = 0;
        cout << "Before: " << sum << endl;

        #pragma omp parallel for
        for (int i = 0; i < 20000; ++i){
            #pragma omp atomic
            sum++;
        }
        cout << "Atomic-After: " << sum << endl;

        sum = 0;
        #pragma omp parallel for
        for (int i = 0; i < 20000; ++i){
            sum++; // deliberately unprotected: this is a data race
        }
        cout << "None-atomic-After: " << sum << endl;
        return 0;
    }
The first loop always prints 20000. Without the #pragma omp atomic directive (the second loop), the updates race with each other and the printed value is unpredictable.
Thread synchronization: critical
    #pragma omp critical [(name)]   // [] means the name is optional
    {
        // code block that only one thread at a time may execute
    }
    #include <iostream>
    #include <omp.h>
    using namespace std;

    int main(){
        int sum = 0;
        cout << "Before: " << sum << endl;

        #pragma omp parallel for
        for (int i = 0; i < 100; ++i){
            #pragma omp critical(a)
            {
                sum = sum + i;
                sum = sum + i * 2;
            }
        }

        cout << "After: " << sum << endl;

        return 0;
    }
The difference between critical and atomic is that atomic only applies to the two kinds of operations described in the previous section and protects a single statement, whereas critical can protect a whole block of code.
For a simple increment to a shared variable, atomic and critical are semantically equivalent, but atomic allows the compiler more opportunities for optimisation (using hardware instructions, for example).
In other cases, there are differences. If incrementing array elements (e.g. a[i]++ ), atomic allows different threads to update different elements of the array concurrently whereas critical does not. If there is a more complicated expression on the RHS (e.g. a+=foo() ) then the evaluation of foo() is protected from concurrent execution with critical but not with atomic.
Using a critical section is a legitimate way of implementing atomics inside the compiler/runtime, but most current OpenMP compilers do a better job than this.
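The array case mentioned above can be sketched as follows. The helper names histogram and index_of_max are hypothetical, and the input values of histogram are assumed to be non-negative. The histogram uses atomic so that threads updating different bins do not block each other; the maximum search updates two variables together, which atomic cannot express, so it uses a named critical section.

```cpp
#include <cassert>
#include <vector>

// Counts how many values fall into each bin (value % bins).
// Assumes all values are non-negative.
std::vector<int> histogram(const std::vector<int>& values, int bins) {
    assert(bins > 0);
    std::vector<int> counts(bins, 0);
    int* c = counts.data();
    const int n = static_cast<int>(values.size());
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        const int b = values[i] % bins;
        // Only the single update of c[b] is protected; updates to
        // different bins can proceed concurrently.
        #pragma omp atomic
        c[b]++;
    }
    return counts;
}

// Returns the index of the largest value. On ties, which index wins
// depends on thread timing.
int index_of_max(const std::vector<int>& values) {
    assert(!values.empty());
    const int n = static_cast<int>(values.size());
    int best = values[0];
    int best_index = 0;
    #pragma omp parallel for
    for (int i = 1; i < n; ++i) {
        // The comparison and both assignments must happen as one unit,
        // so a critical section is needed here.
        #pragma omp critical(best_update)
        {
            if (values[i] > best) {
                best = values[i];
                best_index = i;
            }
        }
    }
    return best_index;
}
```

Entering a critical section on every iteration serializes the loop, so index_of_max is here to show the syntax, not as a fast way to find a maximum.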
1. Implicit barrier
2. nowait removes the implicit barrier
    #pragma omp for nowait   // nowait cannot be used with #pragma omp parallel for

or

    #pragma omp single nowait
    #include <stdio.h>
    #include <omp.h>

    int main(){
        #pragma omp parallel
        {
            #pragma omp for nowait
            for (int i = 0; i < 20; ++i){
                printf("%d+\n", i);
            }

            #pragma omp for
            for (int j = 0; j < 10; ++j){
                printf("%d-\n", j);
            }

            for (int j = 0; j < 10; ++j){
                printf("%dx\n", j);
            }
        }
        return 0;
    }
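As soon as one of the two threads finishes its share of the first for loop, it moves on without waiting, so the + lines of the first loop and the - lines of the second loop are interleaved in the output.
By contrast, the third for loop only starts after both threads have finished the second one, so the - and x lines are never interleaved. In other words, a for loop declared with #pragma omp for has an implicit barrier at its end by default. (The third loop has no #pragma omp for, so every thread runs all 10 of its iterations.)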
3. Explicit barrier: #pragma omp barrier
    #include <stdio.h>
    #include <omp.h>

    int main(){
        #pragma omp parallel
        {
            for (int i = 0; i < 100; ++i){
                printf("%d+\n", i);
            }
            #pragma omp barrier
            for (int j = 0; j < 10; ++j){
                printf("%d-\n", j);
            }
        }
        return 0;
    }
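The two threads (the actual number depends on the CPU) each run the whole first for loop, because it is not a worksharing loop. Only when every thread has finished it and they have synchronized at the barrier do they go on to the second for loop.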
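Barriers matter most when a later phase reads what other threads wrote in an earlier phase. The following is a minimal sketch of that pattern (neighbor_sums is a made-up name): the first loop drops its implicit barrier with nowait, so an explicit barrier is put back before the second loop reads the results.

```cpp
#include <cassert>
#include <vector>

// Phase 1 fills squares[i] = i * i.
// Phase 2 computes out[i] = squares[i] + squares[(i + 1) % n],
// which may read an element written by a different thread.
std::vector<int> neighbor_sums(int n) {
    assert(n >= 1);
    std::vector<int> squares(n), out(n);
    #pragma omp parallel
    {
        #pragma omp for nowait
        for (int i = 0; i < n; ++i) {
            squares[i] = i * i;
        }

        // nowait removed the implicit barrier of the loop above, so all
        // threads must meet here before anyone reads `squares`.
        #pragma omp barrier

        #pragma omp for
        for (int i = 0; i < n; ++i) {
            out[i] = squares[i] + squares[(i + 1) % n];
        }
    }
    return out;
}
```

Dropping both the nowait and the explicit barrier would give the same behavior; the pair is written out here only to show what the implicit barrier does for you.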
4. master: #pragma omp master declares that the following block is executed by the master thread only
    #include <stdio.h>
    #include <omp.h>

    int main(){
        #pragma omp parallel
        {
            #pragma omp master
            {
                for (int j = 0; j < 10; ++j){
                    printf("%d-\n", j);
                }
            }

            printf("This will be shown two or more times\n");
        }
        return 0;
    }
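On entering the parallel region, two (or more) threads are created. The master thread runs the for loop, while the other threads skip it and go straight to the print statement after it, since master has no implied barrier. The master thread then executes the same print statement once it has finished the loop.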
5. sections: assigns different blocks of code to different threads
    #include <stdio.h>
    #include <omp.h>

    int main(){
        #pragma omp parallel sections // the region is split into several sections that run in parallel with each other
        {
            #pragma omp section // first section, executed entirely by one thread
            for (int i = 0; i < 5; ++i){
                printf("%d+\n", i);
            }

            #pragma omp section // second section, executed entirely by one thread
            for (int j = 0; j < 5; ++j){
                printf("%d-\n", j);
            }
        }
        return 0;
    }
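Sections are handy when two unrelated computations can run side by side. As a small sketch (the Stats struct and sum_and_max are names I made up), one section sums a vector while the other finds its maximum; each section writes to a different variable, so no extra synchronization is needed.

```cpp
#include <cassert>
#include <vector>

struct Stats {
    long long sum;
    int max;
};

Stats sum_and_max(const std::vector<int>& v) {
    assert(!v.empty());
    long long sum = 0;
    int max = v[0];
    #pragma omp parallel sections
    {
        #pragma omp section
        {
            // Runs on one thread: only touches `sum`.
            for (int x : v) sum += x;
        }
        #pragma omp section
        {
            // Runs on (possibly) another thread: only touches `max`.
            for (int x : v) {
                if (x > max) max = x;
            }
        }
    }
    return Stats{sum, max};
}
```

The implicit barrier at the end of the sections construct guarantees that both results are complete before the function returns.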
schedule clause: determines which iterations are executed by each thread

STATIC
    The iteration space is broken into chunks of approximately (number of iterations)/(number of threads). Then these chunks are assigned to the threads in a round-robin fashion.
STATIC, CHUNK
    The iteration space is broken into chunks of size N (the given chunk value). Then these chunks are assigned to the threads in a round-robin fashion.
Characteristics of static schedules
    Low overhead
    Good locality (usually)
    Can have load imbalance problems
DYNAMIC[,chunk]
    Threads dynamically grab chunks of N iterations until all iterations have been executed. If no chunk is specified, N = 1.
GUIDED[,chunk]
    Variant of dynamic. The size of the chunks decreases as the threads grab iterations, but it is at least of size N. If no chunk is specified, N = 1.
Characteristics of dynamic schedules
    Higher overhead
    Not very good locality (usually)
    Can solve imbalance problems
AUTO
    The implementation is allowed to do whatever it wishes. (Do not expect much of it as of now.)
RUNTIME
    The decision is delayed until the program is run, through the run-sched-var ICV. It can be set with:
    The OMP_SCHEDULE environment variable
    The omp_set_schedule() API call
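To see a schedule in action, the sketch below records which thread ran each iteration under schedule(static, chunk). The function name owners_static_chunk is hypothetical. The _OPENMP guard is there so the code also builds when OpenMP is switched off, in which case every iteration is reported as thread 0.

```cpp
#include <cassert>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Returns, for each of the n iterations, the ID of the thread that ran it.
// `threads` is only a request; the runtime may provide fewer threads.
std::vector<int> owners_static_chunk(int n, int chunk, int threads) {
    assert(n >= 0 && chunk > 0 && threads > 0);
    std::vector<int> owner(n, 0);
    #pragma omp parallel for schedule(static, chunk) num_threads(threads)
    for (int i = 0; i < n; ++i) {
#ifdef _OPENMP
        owner[i] = omp_get_thread_num();
#endif
    }
    return owner;
}
```

If two threads are actually granted, owners_static_chunk(8, 2, 2) gives 0 0 1 1 0 0 1 1: chunks of two iterations handed out round-robin. Changing the clause to schedule(dynamic, chunk) makes the assignment depend on timing, so the result can differ from run to run.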