Processing Big Data with Shell Scripts, Part 1: Summary of Methods

Date: 2019-11-17


Method 1

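Processing a large file (on the order of a million lines or more) in a single process is slow. One remedy is to take the line number modulo N in awk and divide the file into N interleaved shards, one background job per shard, so that a multi-core CPU is actually put to use.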

 
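for ((i = 0; i < 5; i++)); do
    # Each background job handles the lines whose number is congruent to i modulo 5.
    awk -v n=5 -v i="$i" 'NR % n == i' query_ctx.20k |
        wc -l > "output_$i" 2> "err_$i" &
done
wait    # block until all five background jobs have finished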
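
The loop above can be wrapped into a function that also combines the per-shard results. The following is a minimal sketch, not the author's code: the function name `count_lines_sharded`, the temporary directory, and the default of one job per online CPU are assumptions made for illustration, and `wc -l` stands in for whatever per-line work is really needed.

```shell
#!/usr/bin/env bash
# count_lines_sharded FILE [N]: count the lines of FILE with N background awk jobs.
# N defaults to the number of online CPUs (an illustrative default, not from the article).
count_lines_sharded() {
    local file=$1
    local n=${2:-$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 4)}
    local dir i c total=0
    dir=$(mktemp -d) || return 1
    for ((i = 0; i < n; i++)); do
        # Shard i receives the lines whose number is congruent to i modulo n.
        awk -v n="$n" -v i="$i" 'NR % n == i' "$file" | wc -l > "$dir/output_$i" &
    done
    wait    # every shard must finish before the outputs are combined
    for ((i = 0; i < n; i++)); do
        read -r c < "$dir/output_$i"
        total=$((total + c))
    done
    rm -rf "$dir"
    echo "$total"
}
```

A call such as `count_lines_sharded query_ctx.20k 5` prints the total line count. Note that every awk job still reads the whole file, so the approach pays off mainly when the work done per line costs more than reading the line.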

Method 2

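Alternatively, the large file can be divided up front, either with split or by hashing a key in each line.
The drawback is that the large file has to be preprocessed, and this partitioning step runs in a single process, so it costs a fair amount of time and resources itself.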

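infile=$1
opdir=querys
opfile=res
s=$(date "+%s")    # start time in epoch seconds (not used again in this excerpt)
mkdir -p "$opdir"
while IFS= read -r line
do
    # ./awk_c and ./tools/default are the author's own helpers. Judging by their use,
    # the first extracts the IMEI key from the line and the second maps it to one of 1000 buckets.
    imei=$(./awk_c "$line")
    no=$(./tools/default "$imei" 1000)
    printf '%s\n' "$line" >> "$opdir/$opfile-$no"
done < "$infile"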
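
The split variant mentioned above is not shown in the original, so here is a minimal sketch of it. The function name `split_into_chunks` and the `chunk-` prefix are hypothetical; only `split -l` and `wc -l`, both standard utilities, are relied on. Unlike the key-hashing loop, this cuts the file into runs of consecutive lines and does not keep lines with the same key together.

```shell
#!/usr/bin/env bash
# split_into_chunks FILE N OUTDIR: cut FILE into at most N pieces of consecutive lines.
split_into_chunks() {
    local file=$1 n=$2 outdir=$3 total per_chunk
    mkdir -p "$outdir" || return 1
    total=$(wc -l < "$file")
    # Ceiling division, so that no more than n chunks are produced.
    per_chunk=$(( (total + n - 1) / n ))
    [ "$per_chunk" -gt 0 ] || per_chunk=1
    # split names the pieces chunk-aa, chunk-ab, and so on.
    split -l "$per_chunk" "$file" "$outdir/chunk-"
}
```

After `split_into_chunks query_ctx.20k 5 querys`, each `querys/chunk-*` file can be handed to its own background job, followed by a single `wait`.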

Method 3

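This method extends Method 2. Once the file has been preprocessed, a shell script can start several processes that work through the pieces in parallel. To keep the parallel processes from garbling each other's output, you can either use a lock or give each piece its own output file name. The example below makes clever use of mv: the move acts as a mutex, because a process only handles a file if its mv succeeds. This makes adding processes flexible. As long as the machine has resources to spare, another copy of the script can be started at any time without causing errors in the output.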

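output=hier_res
input=dbscan_res
prefix1=tmp-    # marks a piece that some process is currently working on
prefix2=res-    # marks a finished result
for file in "$input"/res*
do
    [ -e "$file" ] || continue    # the glob matched nothing
    tmp=${file#*-}                # the part of the name after the first "-"
    ofile1=${prefix1}${tmp}
    ofile2=${prefix2}${tmp}
    # Skip pieces that are already claimed (tmp-) or finished (res-).
    if [ ! -f "$output/$ofile1" ] && [ ! -f "$output/$ofile2" ]; then
        touch "$output/aaa_$tmp"
        # If another process has already moved aaa_$tmp away, this mv fails and the piece is skipped.
        if mv "$output/aaa_$tmp" "$output/$ofile1"
        then
            echo "dealing $file"
            # Append to hier.err so that one piece does not erase the errors of another.
            python hcluster.py < "$file" 1> "$output/$ofile1" 2>> hier.err
            mv "$output/$ofile1" "$output/$ofile2"
        fi
    fi
done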
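
One caveat with the touch-then-mv claim: a rename replaces an existing target, so if two processes both pass the existence check and the second one runs touch after the first one's mv, both moves can succeed. The window is small, but where it matters, a commonly used alternative is to claim each piece with mkdir, which on a local filesystem either creates the directory or fails because it already exists. The sketch below swaps in that technique and is not the author's script; the names `claim`, `process_all`, and the `locks` subdirectory are hypothetical.

```shell
#!/usr/bin/env bash
# claim NAME LOCKDIR: succeeds for only one caller per NAME.
# mkdir fails if the directory already exists, so two processes
# cannot both succeed for the same name.
claim() {
    mkdir "$2/$1.lock" 2>/dev/null
}

# process_all INPUT_DIR OUTPUT_DIR CMD...: run CMD on every unclaimed res-* file,
# feeding the file on stdin and storing stdout under the same name in OUTPUT_DIR.
process_all() {
    local input=$1 output=$2 file name
    shift 2
    mkdir -p "$output/locks" || return 1
    for file in "$input"/res-*; do
        [ -e "$file" ] || continue    # the glob matched nothing
        name=${file##*/}
        if claim "$name" "$output/locks"; then
            # Write under a tmp- name first, so a finished file is never half-written.
            "$@" < "$file" > "$output/tmp-$name" &&
                mv "$output/tmp-$name" "$output/$name"
        fi
    done
}
```

Starting `process_all dbscan_res hier_res python hcluster.py &` several times gives several workers that never pick the same piece. A piece whose command fails keeps its lock directory and is not retried until that directory is removed by hand.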