Data Download Work Notes, Part 3: Scripts

I wrote three scripts in total. This was my first time writing shell scripts, and they are quite clumsy; looking back after finishing, the quality really is not great.

Script 1: addLinkFiles.sh

There is an xmlfile.xml file in the current directory, which has to be edited by hand. Tags in the file mark out two directory levels, for example:

<class>Life-sciences</class>

<dataset>uniprot</dataset>

<location>http://www.example.com/example.nt.gz</location>

I did not use a real XML tree structure, because parsing one in shell is far too painful. Instead, relative position expresses the parent-child relationship, so strictly speaking this is not an XML file at all, just a hybrid. The three lines above say that the class contains a dataset and the dataset contains a download link, corresponding to the directory structure "Life-sciences/uniprot/link", where the link file stores the links. On startup, addLinkFiles.sh checks this file, then checks under ./download/data/ whether the file linkpath = ${class}/${dataset}/link exists; if not, it creates the directory and the file, appends the download link to the link file, and appends ${linkpath} to ${modifiedLinkFile}. Every ${interval} seconds the script checks the modification time of ${xmlfile}; if it has changed, new locations have been added, so the script compares the directories against the xml file's contents and creates directories and files for any newly added entries.
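Stripped of the directory bookkeeping, the per-line tag-and-value extraction that addLinkFiles.sh performs can be sketched in isolation like this (the sample line is hypothetical and mirrors the format shown above):

```shell
# Hypothetical sample line in the pseudo-xml format.
line='<class>Life-sciences</class>'

# Tag name: split on '<', '>' or space; the second field is the tag.
tag=$(echo "$line" | awk -F '<|>| ' '{print $2}')

# Tag value: split on the literal opening and closing tags.
value=$(echo "$line" | awk -F "<$tag>|</$tag>" '{print $2}')

echo "$tag = $value"   # prints: class = Life-sciences
```

Because the field separator is a regex, the leading '<' produces an empty first field, which is why the tag lands in field 2.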

#!/bin/bash
#*********************************************************
#addLinkFiles.sh
#Keep checking the $xmlfile.
#The $xmlfile should have only 3 tags: class, name, location.
#
#last edited 2013.09.03 by Lyuxd.
#
#*********************************************************
#******************
#----init----------
#******************
interval=10
rootDir=${PWD}
dataDir=$rootDir"/data"
logDir=$rootDir"/log"
link="link"
log="add.log"
modifiedLinkFile="modifiedLinkFile"
xmlfile="xmlfile.xml"
level1="class"
level2="name"
level3="location"
currentClass=$rootDir
currentDataSet=$rootDir
xmlLastMT=0
cd $rootDir
#****************************************
#------Create Data, Log Directories------
#****************************************
if [ ! -d "$dataDir" ];then
    mkdir "$dataDir"
fi
if [ ! -d "$logDir" ];then
    mkdir "$logDir"
fi
#****************************************
#------Parsing the xmlfile---------------
#****************************************
if [ ! -f "$xmlfile" ];then
    echo "`date "+%Y.%m.%d-%H:%M:%S--ERROR: "`No xmlfile found. exit." >> "$logDir/$log"
    exit 1
fi
#Check the modified-time of xmlfile every $interval sec.
#If the modified-time changed, parse xmlfile.
while true
do
    xmlMT=$(stat -c %Y "$xmlfile")
    if [ "$xmlLastMT" -lt "$xmlMT" ];then
        xmlLastMT=$xmlMT
        echo "`date "+%Y.%m.%d-%H:%M:%S--INFO: "`parsing $xmlfile..." >> "$logDir/$log"
        while read line
        do
            #Skip empty lines.
            if [ "$line"x != x ];then
                tag=$(echo $line | awk -F "<|>| " '{print $2}')
                #Check if the "class" directory exists. If not, create it.
                if [ "$tag"x = "$level1"x ]; then
                    currentClass=$(echo $line | awk -F "<$tag>|</$tag>" '{print $2}')
                    currentDataSet=$rootDir
                    if [ ! -z "$currentClass" ] && [ ! -d "$dataDir/$currentClass" ]; then
                        mkdir "$dataDir/$currentClass"
                        echo "`date "+%Y.%m.%d-%H:%M:%S--INFO: "`mkdir $dataDir/$currentClass" >> "$logDir/$log"
                    fi
                #Check if the "name" directory exists. If not, create it.
                elif [ "$tag"x = "$level2"x ] && [ "$currentClass" != "$rootDir" ]; then
                    currentDataSet=$(echo $line | awk -F "<$tag>|</$tag>" '{print $2}')
                    if [ ! -z "$currentClass" ] && [ ! -z "$currentDataSet" ] && [ ! -d "$dataDir/$currentClass/$currentDataSet" ]; then
                        mkdir "$dataDir/$currentClass/$currentDataSet"
                        echo "`date "+%Y.%m.%d-%H:%M:%S--INFO: "`mkdir $dataDir/$currentClass/$currentDataSet" >> "$logDir/$log"
                    fi
                #Check if the "link" file exists. If not, create it.
                elif [ "$tag"x = "$level3"x ] && [ ! -z "$currentClass" ] && [ ! -z "$currentDataSet" ] && [ "$currentDataSet" != "$rootDir" ] && [ -d "$dataDir/$currentClass/$currentDataSet" ]; then
                    if [ ! -f "$dataDir/$currentClass/$currentDataSet/$link" ]; then
                        touch "$dataDir/$currentClass/$currentDataSet/$link"
                        echo "`date "+%Y.%m.%d-%H:%M:%S--INFO: "`Create link file : $dataDir/$currentClass/$currentDataSet/$link" >> "$logDir/$log"
                    fi
                    newRecord=$(echo $line | awk -F "<$tag>|</$tag>" '{print $2}')
                    ifexist=$(grep "$newRecord" "$dataDir/$currentClass/$currentDataSet/$link")
                    if [ ! -z "$newRecord" ] && [ -z "$ifexist" ]; then
                        #No identical record exists yet.
                        echo "$newRecord" >> "$dataDir/$currentClass/$currentDataSet/$link"
                        echo "`date "+%Y.%m.%d-%H:%M:%S--INFO: "`Add new link $newRecord to $dataDir/$currentClass/$currentDataSet/$link" >> "$logDir/$log"
                        echo "$dataDir/$currentClass/$currentDataSet/$link" >> "$logDir/modifiedLinkFile.tmp"
                    fi
                else
                    echo "`date "+%Y.%m.%d-%H:%M:%S--ERROR: "`Failed to process $line" >> "$logDir/$log"
                fi
            fi
        done <$xmlfile
        #******************************
        #modifiedLinkFile.tmp contains the paths that were modified in the last loop.
        #Deduplicate modifiedLinkFile.tmp --> modifiedLinkFile
        #******************************
        if [ -f "$logDir/modifiedLinkFile.tmp" ]; then
            cat "$logDir/modifiedLinkFile.tmp" | awk '!a[$0]++{"date \"+%Y%m%d%H%M%S\""|getline time; print time,$0}' >> "$logDir/$modifiedLinkFile"
            rm "$logDir/modifiedLinkFile.tmp"
        else
            touch "$logDir/$modifiedLinkFile"
        fi
    fi
    sleep $interval
done
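The `!a[$0]++` idiom used above to deduplicate modifiedLinkFile.tmp deserves a note: the array lookup is zero (false) only the first time a line is seen, so only first occurrences pass through. A standalone illustration:

```shell
# Each distinct line is printed only on its first occurrence;
# !a[$0]++ is true exactly once per unique line.
printf 'path/a\npath/b\npath/a\n' | awk '!a[$0]++'
# prints:
# path/a
# path/b
```

The version in the script additionally pipes `date` through `getline` inside the action block to stamp each surviving line with the current time.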

Script 2: checkmodifiedLinkFiles.sh

The modifiedlinkfile is populated by Script 1 above. Script 2 checks the modification time of modifiedlinkfile every interval seconds; if it has changed, the file was modified by Script 1, which means new download links were added to xmlfile and the corresponding directories were created. Script 2 then takes each record out of modifiedlinkfile (a record is the absolute path of a newly created link file) and calls Script 3, monitor.sh, to run the actual download.
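Both the restart logic and the new-task loop below cap concurrency by counting running monitor.sh processes before launching another. That wait loop, in isolation, looks like this (wait_for_slot is an illustrative name; `pgrep -c monitor.sh` would be a more robust way to count than ps piped through grep):

```shell
maxWgetProcess=5   # same cap as in the script below

# Block until fewer than $maxWgetProcess monitor.sh processes are running.
wait_for_slot() {
    while [ "$(ps -A | grep -c 'monitor.sh')" -ge "$maxWgetProcess" ]; do
        sleep 20
    done
}

wait_for_slot    # returns immediately when below the cap
```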

#!/bin/bash
#*************************************************
#This script reads in modifiedLinkFile, 
#for every record calling monitor.sh.
#monitor.sh /home/class/name "wget -c -i link -b"
#
#last edited 2013.09.10 by lyuxd.
#
#*************************************************



interval=10
rootDir=${PWD}
dataDir=$rootDir"/data"
logDir=$rootDir"/log"
failedqueue="$logDir/failedQueue"
runningTask="$logDir/runningTask"
modifiedLinkFile="$logDir/modifiedLinkFile"
modifiedLinkFileMT="$logDir/modifiedLinkFile.MT"
log=$logDir"/check.log"
maxWgetProcess=5
echo "`date "+%Y.%m.%d-%H:%M:%S--INFO: "`check is running...">>$log
#*****************************************
#-----------restart interrupted tasks-----
#*****************************************
if [ -f "$runningTask" ]; then
   while read line
   do
    counterWgetProcess=$(ps -A|grep -c "monitor.sh")
    while [ $counterWgetProcess -ge $maxWgetProcess ]
    do
        sleep 20
        counterWgetProcess=$(ps -A|grep -c "monitor.sh")
    done
    echo "`date "+%Y.%m.%d-%H:%M:%S--INFO: "`Call ./monitor for $line." >> $log
    nohup "./monitor.sh" "$line" "wget -nd -c -i link -b" >> /dev/null &
    sleep 1
   done <$runningTask 
fi 


#*********************************
#------------failedQueue-----
#*********************************
#if [ -f "$failedqueue" ] && [ `ls -l "$failedqueue"|awk '{print $5}'` -gt "0" ];then
#    line=($(awk '{print $0}' $failedqueue))
#    echo ${line[1]}
#    :>"$failedqueue"
#    for ((i=0;i<${#line[@]};i++))
#    do
#    counterWgetProcess=$(ps -A|grep -c "monitor.sh")
#        while [ $counterWgetProcess -ge $maxWgetProcess ]
#        do
#            sleep 20
#            counterWgetProcess=$(ps -A|grep -c "monitor.sh")
#        done
#        echo "./monitor.sh" "${line[i]}" "wget -nd -c -i link -b"
#        "./monitor.sh" "${line[i]}" "wget -nd -c -i link -b" >> /dev/null &
#ex "$failedqueue" <<EOF
#1d
#wq
#EOF
#    done
#fi
#***************************************************
#------------check new task in modifiedLinkFile-----
#***************************************************
if [ ! -f "$modifiedLinkFile" ];then
    echo "`date "+%Y.%m.%d-%H:%M:%S--"`No modifiedLinkFile found. checkmodifiedLinkFiles.sh exit 1." >> $log
    exit 1
fi
if [ ! -f "$modifiedLinkFileMT" ];then
    echo "0" > "$modifiedLinkFileMT"
fi
while true
do

newMT=$(stat -c %Y "$modifiedLinkFile")
oldMT=$(cat "$modifiedLinkFileMT")

if [ "$newMT" != "$oldMT" ]; then
while read line
do    
    if [ -n "$line" ]; then
        counterWgetProcess=$(ps -A|grep -c "monitor.sh")
        while [ $counterWgetProcess -ge $maxWgetProcess ]
        do
            #echo "waiting 20sec"
            sleep 20
            counterWgetProcess=$(ps -A|grep -c "monitor.sh")
        done
            newLink=$(echo $line |awk '{print $2}')
            
            linkfileName=$(echo $newLink |awk -F "/" '{print $NF}')
            downloadDir=$(echo $newLink|awk -F "$linkfileName" '{print $1}')
            echo "`date "+%Y.%m.%d-%H:%M:%S--INFO: "`Call ./monitor for $downloadDir." >> $log
            "./monitor.sh" "$downloadDir" "wget -nd -c -i $linkfileName -b" >> /dev/null &
            sleep 1
    fi
done <$modifiedLinkFile
: > $modifiedLinkFile
stat -c %Y "$modifiedLinkFile" > "$modifiedLinkFileMT"
#else
    #echo "nothing to do"
fi
sleep $interval
done

Script 3: monitor.sh

This script is mainly called by Script 2 and runs the actual download. Before downloading, it creates a wgetlog directory under e.g. Life-sciences/uniprot, which holds the wget download log. While downloading, monitor.sh keeps checking the size of the log file at 10-second intervals; as soon as the size is unchanged between two consecutive checks, it looks at the last three lines of the log. If it finds a keyword such as FINISH or failed there, it stops the download and sends a notification by mail. If no keyword turns up in those three lines, it assumes the network is at fault (the download speed dropped to 0, so the log stopped growing), re-checks the log size after another interval, and repeats this up to maxtrytimes times in total; if the log still has not grown, it reports the error by mail.

#!/bin/bash
#*********************************************************
#Monitor a download directory.
#One monitor.sh process is started for one download task.
#If some url in $downloadDir/link can't be reached, monitor
#will log "WARNING". If the download failed, log "ERROR". If
#finished, log "FINISH".
#mail to $mailAddress.
#
#Last edited 2013.09.04 by Lyuxd. 
#
#*********************************************************


#every $interval sec check the size of wgetlog.
interval=30

#if size of wgetlog stay the same, try $maxtrytimes to check
maxtrytimes=5


downloadDir=$1
command=$2

rootDir=${PWD}
dataDir=$rootDir"/data"
logDir=$rootDir"/log"
log=$logDir"/monitor.log"
wgetlogDir="$downloadDir/wgetlog"
wgetlogname="`date +%Y%m%d%H%M%S`-wgetlog"
wgetlog="$wgetlogDir/$wgetlogname"
failedqueue="$logDir/failedQueue"
runningTask="$logDir/runningTask"
mailAddress="15822834587@139.com"
lastERROR="e"
addtoBoolean=0


cd $downloadDir
sleep 1
counterMail=0


echo "`date "+%Y.%m.%d-%H:%M:%S--"`Monitor for directory: ${PWD}.">> $log
whereAmI=$(echo ${PWD} | awk -F "/" '{print $NF}')
if [ ! -d $wgetlogDir ]; then
mkdir $wgetlogDir
fi
# Put the current task into runningTask in case of power-off. When checkmodifiedLinkFiles.sh starts up, runningTask is checked for interrupted tasks, which are then restarted by checkmodifiedLinkFiles.sh.
isexist=$(grep "$downloadDir" "$runningTask" 2>/dev/null)
if [ -z "$isexist" ];then
echo $downloadDir >> $runningTask
fi

#Begin downloading.
$command -b -o "$wgetlog" &


#Check the size of the logfile every $interval seconds.
#Continue checking until the size is the same as in the
#last check, then wait another $interval-long period and
#try again (try $maxtrytimes times in total).
#read in wgetlog to find if there is
#something not right.
#Mail to $mailAddress.
trytimesRemain=$maxtrytimes
logoldsize=0
sleep 10
lognewsize=$(ls -l "$wgetlog" | awk '{print $5}')
while [ ! -z "$lognewsize" ] && [ "$trytimesRemain" -gt 0 ]
do


# If the log's size stays unchanged for $interval*$maxtrytimes
# seconds, look for "FINISH" in the log.
#
    if [ "$lognewsize" -eq "$logoldsize" ];then
        message=$(tail -n3 "$wgetlog")
        level=$(echo $message|grep "FINISH")
        if [ -z "$level" ];then
            trytimesRemain=`expr $trytimesRemain - 1`
            echo "`date "+%Y.%m.%d-%H:%M:%S--"`WARNING: $downloadDir Download speed 0.0 KB/s. MaxTryTimes=$maxtrytimes. Try(`expr $maxtrytimes - $trytimesRemain`). ">> $log
        else
            break
        fi
    else
        trytimesRemain=$maxtrytimes
    fi


    ERROR=$(tail -n250 "$wgetlog" | grep "ERROR\|failed")
    if [ ! -z "$ERROR" ] && [ "$ERROR" != "$lastERROR" ] && [ "$counterMail" -lt 5 ]
        then
        echo "`date "+%Y.%m.%d-%H:%M:%S--"`WARNING: $downloadDir $ERROR. mail to $mailAddress.">> $log
        echo -e "${PWD}\n$ERROR\n"|mutt -s "Wget Running State : WARNING in $whereAmI" $mailAddress
        counterMail=$((counterMail+1))
        lastERROR=$ERROR
        addtoBoolean=1
    fi
    logoldsize=$lognewsize
    sleep $interval
    lognewsize=$(ls -l "$wgetlog" | awk '{print $5}')
done

if [ ! -z "$level" ]
    then
    echo "`date "+%Y.%m.%d-%H:%M:%S--"`FINISH: $message. mail to $mailAddress.">> $log
    echo -e "`date '+%Y-%m-%d +%H:%M:%S'`\n${PWD}\n$message\n"|mutt -s "Wget Report : FINISH $whereAmI--RUNNING $(ps -A|grep -c wget)" $mailAddress
    counterMail=$((counterMail+1))
else
    echo "`date "+%Y.%m.%d-%H:%M:%S--"`ERROR: $message. mail to $mailAddress.">> $log
    echo -e "`date '+%Y-%m-%d +%H:%M:%S'`\n${PWD}\n$message\n"|mutt -s "Wget Report : ERROR in $whereAmI" $mailAddress
    addtoBoolean=1
    counterMail=$((counterMail+1))
fi

if [ "$addtoBoolean" -eq "1" ];then
echo "$downloadDir" >> "$failedqueue"
fi


#Remove the interrupted task from runningTask.
sed -i "/$whereAmI/d" "$runningTask"
echo "`date "+%Y.%m.%d-%H:%M:%S--"`$downloadDir Monitor ending.">> $log
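Reduced to its core, the stall check above compares the log size across two polls and only then inspects the tail for keywords. A standalone sketch (check_growth is an illustrative name, and stat -c %s stands in for the ls -l parsing used in the script):

```shell
# Report whether a wget log is still growing, finished/failed, or stalled.
# $1: path to the log file, $2: poll interval in seconds.
check_growth() {
    oldsize=$(stat -c %s "$1")
    sleep "$2"
    newsize=$(stat -c %s "$1")
    if [ "$newsize" -eq "$oldsize" ]; then
        # No growth: look for a terminal keyword in the last lines.
        if tail -n3 "$1" | grep -q 'FINISH\|failed'; then
            echo "finished-or-failed"
        else
            echo "stalled"
        fi
    else
        echo "growing"
    fi
}
```

In monitor.sh the "stalled" outcome is retried up to $maxtrytimes times before the error is mailed out.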

Summary: This was my first time writing shell scripts, and nearly every round of edits introduced plenty of errors. The quality of the scripts is poor, but fortunately the three scripts are not too tightly coupled and the division of labor is fairly clear, which made things a lot easier. My work computer is on the education network while the data downloads run over a China Unicom PPPoE dial-up connection, so ssh access is slow. Although all the routine work has been reduced to maintaining one xml file (well, strictly speaking it is not an xml file at all, just tagged text), waiting three or four seconds for every character typed over ssh is unbearable. So the next step is to rewrite Script 1's job in Java and manage the xml file through a web page.
