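Mining tumor big data often means processing text files with more than ten billion lines, and these files frequently run to hundreds of GB. When the file structure is simple and uniform, sed and awk handle them conveniently and quickly. Sometimes, though, the processing logic is more complex, and in those cases I usually turn to Java. A Java program runs on a single thread unless you explicitly start more, so on the lab's multi-core servers most cores sit idle by default. Making effective use of every core helps a great deal here, which is why I recommend multithreading to process tasks concurrently (in parallel) and multiply the computation speed.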
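Here is an example of parallel computation. It is deliberately simple: it accumulates three numbers and prints the results. We run it once single-threaded and once multithreaded. The single-threaded version executes the three loops sequentially, while the multithreaded version starts three threads that run in parallel (the server has more than three CPUs, so this is true parallelism rather than mere concurrency).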
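Whether three threads really run in parallel depends on how many cores the JVM can see. As a minimal sketch, the check below uses the standard `Runtime.availableProcessors()` call; the class name `CoreCheck` and the helper `canRunInParallel` are hypothetical names introduced for illustration and are not part of the original programs.

```java
class CoreCheck {
    // Hypothetical helper: true when each of `threads` workers can get its own core.
    static boolean canRunInParallel(int threads, int cores) {
        return threads <= cores;
    }

    public static void main(String[] args) {
        // Number of logical processors available to this JVM.
        int cores = Runtime.getRuntime().availableProcessors();
        int threads = 3; // the example in this post starts three worker threads

        System.out.println("available processors: " + cores);
        if (canRunInParallel(threads, cores)) {
            System.out.println(threads + " threads can run in parallel");
        } else {
            System.out.println(threads + " threads will be time-sliced on " + cores + " core(s)");
        }
    }
}
```

If the reported count is smaller than the number of threads, the threads still make progress, but they share cores and the speedup will be smaller.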
First, the single-threaded version:
public class Nothreading {
    public static void main(String[] args) {
        long startTime = System.currentTimeMillis();
        int sum_i = 0;
        int sum_j = 0;
        int sum_k = 0;

        for (int i = 0; i < 10000; i++) {
            sum_i += 1;
            /* Busy work to lengthen the run time; the loops below do the same. */
            for (int a = 0; a < 100000; a++) {
                String s = "To cost some time";
                String[] ss = s.split(" ");
            }
        }

        for (int j = 0; j < 10000; j++) {
            sum_j += 2;
            for (int a = 0; a < 100000; a++) {
                String s = "To cost some time";
                String[] ss = s.split(" ");
            }
        }

        for (int k = 0; k < 10000; k++) {
            sum_k += 3;
            for (int a = 0; a < 100000; a++) {
                String s = "To cost some time";
                String[] ss = s.split(" ");
            }
        }

        long endTime = System.currentTimeMillis();
        System.out.println(sum_i + "\t" + sum_j + "\t" + sum_k);
        System.out.println("run time:" + (endTime - startTime) + "ms");
    }
}
Output:
10000	20000	30000
run time:663587ms
The screenshot shows CPU utilization while this program was running: only one CPU reaches 100%.
Now the multithreaded version:
/* Each counter is used by exactly one worker thread, so synchronized is not
   strictly required here; it is kept from the original code. */
class Count_i {
    public int sum_i = 0;

    public synchronized void count() {
        for (int i = 0; i < 10000; i++) {
            sum_i += 1;
            /* Busy work to lengthen the run time; the classes below do the same. */
            for (int a = 0; a < 100000; a++) {
                String s = "To cost some time";
                String[] ss = s.split(" ");
            }
        }
    }
}

class Count_j {
    public int sum_j = 0;

    public synchronized void count() {
        for (int j = 0; j < 10000; j++) {
            sum_j += 2;
            for (int a = 0; a < 100000; a++) {
                String s = "To cost some time";
                String[] ss = s.split(" ");
            }
        }
    }
}

class Count_k {
    public int sum_k = 0;

    public synchronized void count() {
        for (int k = 0; k < 10000; k++) {
            sum_k += 3;
            for (int a = 0; a < 100000; a++) {
                String s = "To cost some time";
                String[] ss = s.split(" ");
            }
        }
    }
}

/* One Thread subclass per counter; run() simply delegates to count(). */
class Mul_thread_i extends Thread {
    public Count_i c_i;

    public Mul_thread_i(Count_i acc) {
        this.c_i = acc;
    }

    public void run() {
        c_i.count();
    }
}

class Mul_thread_j extends Thread {
    public Count_j c_j;

    public Mul_thread_j(Count_j acc) {
        this.c_j = acc;
    }

    public void run() {
        c_j.count();
    }
}

class Mul_thread_k extends Thread {
    public Count_k c_k;

    public Mul_thread_k(Count_k acc) {
        this.c_k = acc;
    }

    public void run() {
        c_k.count();
    }
}

public class Threethreading_save {
    public static void main(String[] args) throws InterruptedException {
        long startTime = System.currentTimeMillis();

        Count_i ci = new Count_i();
        Count_j cj = new Count_j();
        Count_k ck = new Count_k();

        Mul_thread_i aa = new Mul_thread_i(ci);
        Mul_thread_j bb = new Mul_thread_j(cj);
        Mul_thread_k cc = new Mul_thread_k(ck);

        /* Start all three workers, then wait for each of them to finish. */
        aa.start();
        bb.start();
        cc.start();
        aa.join();
        bb.join();
        cc.join();

        System.out.println(ci.sum_i);
        System.out.println(cj.sum_j);
        System.out.println(ck.sum_k);

        long endTime = System.currentTimeMillis();
        System.out.println("run time:" + (endTime - startTime) + "ms");
    }
}
Output:
10000
20000
30000
run time:221227ms
CPU status: three CPUs reach 100% utilization.
For comparison, the CPU status when idle:
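To sum up: when the workload is large and the machine has multiple CPUs, you can split the work sensibly into several parts, start a thread for each part so they compute simultaneously, and hand the results back to the main thread for further processing once all subtasks have finished. In the example above, three threads cut the run time from 663587 ms to 221227 ms, roughly a threefold speedup. Thread safety is of course something to watch out for; for lack of time, I will cover it in detail in a later post.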
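The split, compute, and combine pattern described here can also be sketched with the standard `java.util.concurrent` thread pool instead of hand-written `Thread` subclasses. This is only an illustrative alternative, not the approach used in the programs above; the class `PoolSketch` and the helpers `accumulate` and `runAll` are hypothetical names, and the busy-work loop is omitted so that the sketch finishes quickly.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class PoolSketch {
    // Hypothetical subtask: add `step` to a local sum `rounds` times.
    static int accumulate(int step, int rounds) {
        int sum = 0;
        for (int i = 0; i < rounds; i++) {
            sum += step;
        }
        return sum;
    }

    // Submit one subtask per step value and collect the results in order.
    static List<Integer> runAll(int[] steps, int rounds) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(steps.length);
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            for (int step : steps) {
                Callable<Integer> task = () -> accumulate(step, rounds);
                futures.add(pool.submit(task));
            }
            List<Integer> results = new ArrayList<>();
            for (Future<Integer> f : futures) {
                results.add(f.get()); // blocks until that subtask has finished
            }
            return results;
        } finally {
            pool.shutdown(); // let the worker threads exit
        }
    }

    public static void main(String[] args) throws Exception {
        // Main thread combines the results once all subtasks are done.
        System.out.println(runAll(new int[] {1, 2, 3}, 10000)); // prints [10000, 20000, 30000]
    }
}
```

Because each subtask keeps its sum in a local variable and returns it through a `Future`, no state is shared between threads, which sidesteps the thread-safety question for this simple case.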