Data Download Work Notes, Part 3: Scripts

I wrote three scripts in total. This was my first time writing shell scripts, and they are quite clumsy; looking back after finishing, the quality really is not great.

Script 1: addLinkFiles.sh

There is an xmlfile.xml file in the current directory, which has to be edited by hand. Tags in the file mark out two directory levels, for example:

<class>Life-sciences</class>

<dataset>uniprot</dataset>

<location>http://www.example.com/example.nt.gz</location>

I did not use a real XML tree structure, because parsing one in shell is far too painful. Instead, relative position expresses the parent-child relationship, so strictly speaking this is not an XML file at all, just a hybrid. The three lines above say that the class contains a dataset and the dataset contains a download link, corresponding to the directory structure "Life-sciences/uniprot/link", where the link file stores the links. On startup, addLinkFiles.sh checks this file, then checks under ./download/data/ whether the file linkpath = ${class}/${dataset}/link exists; if not, it creates the directory and the file, appends the download link to the link file, and appends ${linkpath} to ${modifiedLinkFile}. Every ${interval} seconds the script checks the modification time of ${xmlfile}; if it has changed, new locations have been added, so the script compares the directories against the xml file's contents and creates directories and files for any newly added entries.
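Stripped of the directory bookkeeping, the per-line tag-and-value extraction that addLinkFiles.sh performs can be sketched in isolation like this (the sample line is hypothetical and mirrors the format shown above):

```shell
# Hypothetical sample line in the pseudo-xml format.
line='<class>Life-sciences</class>'

# Tag name: split on '<', '>' or space; the second field is the tag.
tag=$(echo "$line" | awk -F '<|>| ' '{print $2}')

# Tag value: split on the literal opening and closing tags.
value=$(echo "$line" | awk -F "<$tag>|</$tag>" '{print $2}')

echo "$tag = $value"   # prints: class = Life-sciences
```

Because the field separator is a regex, the leading '<' produces an empty first field, which is why the tag lands in field 2.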

#!/bin/bash
#*********************************************************
#addLinkFiles.sh
#Keep checking the $xmlfile.
#The $xmlfile should have only 3 tags: class, name, location.
#
#last edited 2013.09.03 by Lyuxd.
#
#*********************************************************
#******************
#----init----------
#******************
interval=10
rootDir=${PWD}
dataDir=$rootDir"/data"
logDir=$rootDir"/log"
link="link"
log="add.log"
modifiedLinkFile="modifiedLinkFile"
xmlfile="xmlfile.xml"
level1="class"
level2="name"
level3="location"
currentClass=$rootDir
currentDataSet=$rootDir
xmlLastMT=0
cd $rootDir
#****************************************
#------Create Data, Log Directories------
#****************************************
if [ ! -d "$dataDir" ];then
    mkdir "$dataDir"
fi
if [ ! -d "$logDir" ];then
    mkdir "$logDir"
fi
#****************************************
#------Parsing the xmlfile---------------
#****************************************
if [ ! -f "$xmlfile" ];then
    echo "`date "+%Y.%m.%d-%H:%M:%S--ERROR: "`No xmlfile found. exit." >> "$logDir/$log"
    exit 1
fi
#Check the modified-time of xmlfile every $interval sec.
#If the modified-time changed, parse xmlfile.
while true
do
    xmlMT=$(stat -c %Y "$xmlfile")
    if [ "$xmlLastMT" -lt "$xmlMT" ];then
        xmlLastMT=$xmlMT
        echo "`date "+%Y.%m.%d-%H:%M:%S--INFO: "`parsing $xmlfile..." >> "$logDir/$log"
        while read line
        do
            #Skip empty lines.
            if [ "$line"x != x ];then
                tag=$(echo $line | awk -F "<|>| " '{print $2}')
                #Check if the "class" directory exists. If not, create it.
                if [ "$tag"x = "$level1"x ]; then
                    currentClass=$(echo $line | awk -F "<$tag>|</$tag>" '{print $2}')
                    currentDataSet=$rootDir
                    if [ ! -z "$currentClass" ] && [ ! -d "$dataDir/$currentClass" ]; then
                        mkdir "$dataDir/$currentClass"
                        echo "`date "+%Y.%m.%d-%H:%M:%S--INFO: "`mkdir $dataDir/$currentClass" >> "$logDir/$log"
                    fi
                #Check if the "name" directory exists. If not, create it.
                elif [ "$tag"x = "$level2"x ] && [ "$currentClass" != "$rootDir" ]; then
                    currentDataSet=$(echo $line | awk -F "<$tag>|</$tag>" '{print $2}')
                    if [ ! -z "$currentClass" ] && [ ! -z "$currentDataSet" ] && [ ! -d "$dataDir/$currentClass/$currentDataSet" ]; then
                        mkdir "$dataDir/$currentClass/$currentDataSet"
                        echo "`date "+%Y.%m.%d-%H:%M:%S--INFO: "`mkdir $dataDir/$currentClass/$currentDataSet" >> "$logDir/$log"
                    fi
                #Check if the "link" file exists. If not, create it.
                elif [ "$tag"x = "$level3"x ] && [ ! -z "$currentClass" ] && [ ! -z "$currentDataSet" ] && [ "$currentDataSet" != "$rootDir" ] && [ -d "$dataDir/$currentClass/$currentDataSet" ]; then
                    if [ ! -f "$dataDir/$currentClass/$currentDataSet/$link" ]; then
                        touch "$dataDir/$currentClass/$currentDataSet/$link"
                        echo "`date "+%Y.%m.%d-%H:%M:%S--INFO: "`Create link file : $dataDir/$currentClass/$currentDataSet/$link" >> "$logDir/$log"
                    fi
                    newRecord=$(echo $line | awk -F "<$tag>|</$tag>" '{print $2}')
                    ifexist=$(grep "$newRecord" "$dataDir/$currentClass/$currentDataSet/$link")
                    if [ ! -z "$newRecord" ] && [ -z "$ifexist" ]; then
                        #No identical record exists yet.
                        echo "$newRecord" >> "$dataDir/$currentClass/$currentDataSet/$link"
                        echo "`date "+%Y.%m.%d-%H:%M:%S--INFO: "`Add new link $newRecord to $dataDir/$currentClass/$currentDataSet/$link" >> "$logDir/$log"
                        echo "$dataDir/$currentClass/$currentDataSet/$link" >> "$logDir/modifiedLinkFile.tmp"
                    fi
                else
                    echo "`date "+%Y.%m.%d-%H:%M:%S--ERROR: "`Failed to process $line" >> "$logDir/$log"
                fi
            fi
        done <$xmlfile
        #******************************
        #modifiedLinkFile.tmp contains the paths that were modified in the last loop.
        #Deduplicate modifiedLinkFile.tmp --> modifiedLinkFile
        #******************************
        if [ -f "$logDir/modifiedLinkFile.tmp" ]; then
            cat "$logDir/modifiedLinkFile.tmp" | awk '!a[$0]++{"date \"+%Y%m%d%H%M%S\""|getline time; print time,$0}' >> "$logDir/$modifiedLinkFile"
            rm "$logDir/modifiedLinkFile.tmp"
        else
            touch "$logDir/$modifiedLinkFile"
        fi
    fi
    sleep $interval
done
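The `!a[$0]++` idiom used above to deduplicate modifiedLinkFile.tmp deserves a note: the array lookup is zero (false) only the first time a line is seen, so only first occurrences pass through. A standalone illustration:

```shell
# Each distinct line is printed only on its first occurrence;
# !a[$0]++ is true exactly once per unique line.
printf 'path/a\npath/b\npath/a\n' | awk '!a[$0]++'
# prints:
# path/a
# path/b
```

The version in the script additionally pipes `date` through `getline` inside the action block to stamp each surviving line with the current time.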

Script 2: checkmodifiedLinkFiles.sh

The modifiedlinkfile is populated by Script 1 above. Script 2 checks the modification time of modifiedlinkfile every interval seconds; if it has changed, the file was modified by Script 1, which means new download links were added to xmlfile and the corresponding directories were created. Script 2 then takes each record out of modifiedlinkfile (a record is the absolute path of a newly created link file) and calls Script 3, monitor.sh, to run the actual download.
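Both the restart logic and the new-task loop below cap concurrency by counting running monitor.sh processes before launching another. That wait loop, in isolation, looks like this (wait_for_slot is an illustrative name; `pgrep -c monitor.sh` would be a more robust way to count than ps piped through grep):

```shell
maxWgetProcess=5   # same cap as in the script below

# Block until fewer than $maxWgetProcess monitor.sh processes are running.
wait_for_slot() {
    while [ "$(ps -A | grep -c 'monitor.sh')" -ge "$maxWgetProcess" ]; do
        sleep 20
    done
}

wait_for_slot    # returns immediately when below the cap
```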

#!/bin/bash
#*************************************************
#This script reads in modifiedLinkFile, 
#for every record calling monitor.sh.
#monitor.sh /home/class/name "wget -c -i link -b"
#
#last edited 2013.09.10 by lyuxd.
#
#*************************************************



interval=10
rootDir=${PWD}
dataDir=$rootDir"/data"
logDir=$rootDir"/log"
failedqueue="$logDir/failedQueue"
runningTask="$logDir/runningTask"
modifiedLinkFile="$logDir/modifiedLinkFile"
modifiedLinkFileMT="$logDir/modifiedLinkFile.MT"
log=$logDir"/check.log"
maxWgetProcess=5
echo "`date "+%Y.%m.%d-%H:%M:%S--INFO: "`check is running...">>$log
#*****************************************
#-----------restart interrupted tasks-----
#*****************************************
if [ -f "$runningTask" ]; then
   while read line
   do
    counterWgetProcess=$(ps -A|grep -c "monitor.sh")
    while [ $counterWgetProcess -ge $maxWgetProcess ]
    do
        sleep 20
        counterWgetProcess=$(ps -A|grep -c "monitor.sh")
    done
    echo "`date "+%Y.%m.%d-%H:%M:%S--INFO: "`Call ./monitor for $line." >> $log
    nohup "./monitor.sh" "$line" "wget -nd -c -i link -b" >> /dev/null &
    sleep 1
   done <$runningTask 
fi 


#*********************************
#------------failedQueue-----
#*********************************
#if [ -f "$failedqueue" ] && [ `ls -l "$failedqueue"|awk '{print $5}'` -gt "0" ];then
#    line=($(awk '{print $0}' $failedqueue))
#    echo ${line[1]}
#    :>"$failedqueue"
#    for ((i=0;i<${#line[@]};i++))
#    do
#    counterWgetProcess=$(ps -A|grep -c "monitor.sh")
#        while [ $counterWgetProcess -ge $maxWgetProcess ]
#        do
#            sleep 20
#            counterWgetProcess=$(ps -A|grep -c "monitor.sh")
#        done
#        echo "./monitor.sh" "${line[i]}" "wget -nd -c -i link -b"
#        "./monitor.sh" "${line[i]}" "wget -nd -c -i link -b" >> /dev/null &
#ex "$failedqueue" <<EOF
#1d
#wq
#EOF
#    done
#fi
#***************************************************
#------------check new task in modifiedLinkFile-----
#***************************************************
if [ ! -f "$modifiedLinkFile" ];then
    echo "`date "+%Y.%m.%d-%H:%M:%S--"`No modifiedLinkFile found. checkmodifiedLinkFiles.sh exit 1." >> $log
    exit 1
fi
if [ ! -f "$modifiedLinkFileMT" ];then
    echo "0" > "$modifiedLinkFileMT"
fi
while true
do

newMT=$(stat -c %Y "$modifiedLinkFile")
oldMT=$(cat "$modifiedLinkFileMT")

if [ "$newMT" != "$oldMT" ]; then
while read line
do    
    if [ -n "$line" ]; then
        counterWgetProcess=$(ps -A|grep -c "monitor.sh")
        while [ $counterWgetProcess -ge $maxWgetProcess ]
        do
            #echo "waiting 20sec"
            sleep 20
            counterWgetProcess=$(ps -A|grep -c "monitor.sh")
        done
            newLink=$(echo $line |awk '{print $2}')
            
            linkfileName=$(echo $newLink |awk -F "/" '{print $NF}')
            downloadDir=$(echo $newLink|awk -F "$linkfileName" '{print $1}')
            echo "`date "+%Y.%m.%d-%H:%M:%S--INFO: "`Call ./monitor for $downloadDir." >> $log
            "./monitor.sh" "$downloadDir" "wget -nd -c -i $linkfileName -b" >> /dev/null &
            sleep 1
    fi
done <$modifiedLinkFile
: > $modifiedLinkFile
stat -c %Y "$modifiedLinkFile" > "$modifiedLinkFileMT"
#else
    #echo "nothing to do"
fi
sleep $interval
done

Script 3: monitor.sh

This script is mainly called by Script 2 and runs the actual download. Before downloading, it creates a wgetlog directory under e.g. Life-sciences/uniprot, which holds the wget download log. While downloading, monitor.sh keeps checking the size of the log file at 10-second intervals; as soon as the size is unchanged between two consecutive checks, it looks at the last three lines of the log. If it finds a keyword such as FINISH or failed there, it stops the download and sends a notification by mail. If no keyword turns up in those three lines, it assumes the network is at fault (the download speed dropped to 0, so the log stopped growing), re-checks the log size after another interval, and repeats this up to maxtrytimes times in total; if the log still has not grown, it reports the error by mail.

#!/bin/bash
#*********************************************************
#Monitor a download directory.
#One monitor.sh process is started for one download task.
#If some url in $downloadDir/link can't be reached, monitor
#will log "WARNING". If the download failed, log "ERROR". If
#finished, log "FINISH".
#mail to $mailAddress.
#
#Last edited 2013.09.04 by Lyuxd. 
#
#*********************************************************


#every $interval sec check the size of wgetlog.
interval=30

#if size of wgetlog stay the same, try $maxtrytimes to check
maxtrytimes=5


downloadDir=$1
command=$2

rootDir=${PWD}
dataDir=$rootDir"/data"
logDir=$rootDir"/log"
log=$logDir"/monitor.log"
wgetlogDir="$downloadDir/wgetlog"
wgetlogname="`date +%Y%m%d%H%M%S`-wgetlog"
wgetlog="$wgetlogDir/$wgetlogname"
failedqueue="$logDir/failedQueue"
runningTask="$logDir/runningTask"
mailAddress="15822834587@139.com"
lastERROR="e"
addtoBoolean=0


cd $downloadDir
sleep 1
counterMail=0


echo "`date "+%Y.%m.%d-%H:%M:%S--"`Monitor for directory: ${PWD}.">> $log
whereAmI=$(echo ${PWD} | awk -F "/" '{print $NF}')
if [ ! -d $wgetlogDir ]; then
mkdir $wgetlogDir
fi
# Put the current task into runningTask in case of power-off. When checkmodifiedLinkFiles.sh starts up, runningTask is checked for interrupted tasks, which are then restarted by checkmodifiedLinkFiles.sh.
isexist=$(grep "$downloadDir" "$runningTask" 2>/dev/null)
if [ -z "$isexist" ];then
echo $downloadDir >> $runningTask
fi

#Begin downloading.
$command -b -o "$wgetlog" &


#Check the size of the logfile every $interval seconds.
#Continue checking until the size is the same as in the
#last check, then wait another $interval-long period and
#try again (try $maxtrytimes times in total).
#read in wgetlog to find if there is
#something not right.
#Mail to $mailAddress.
trytimesRemain=$maxtrytimes
logoldsize=0
sleep 10
lognewsize=$(ls -l "$wgetlog" | awk '{print $5}')
while [ ! -z "$lognewsize" ] && [ "$trytimesRemain" -gt 0 ]
do


# If the log's size stays unchanged for $interval*$maxtrytimes
# seconds, look for "FINISH" in the log.
#
    if [ "$lognewsize" -eq "$logoldsize" ];then
        message=$(tail -n3 "$wgetlog")
        level=$(echo $message|grep "FINISH")
        if [ -z "$level" ];then
            trytimesRemain=`expr $trytimesRemain - 1`
            echo "`date "+%Y.%m.%d-%H:%M:%S--"`WARNING: $downloadDir Download speed 0.0 KB/s. MaxTryTimes=$maxtrytimes. Try(`expr $maxtrytimes - $trytimesRemain`). ">> $log
        else
            break
        fi
    else
        trytimesRemain=$maxtrytimes
    fi


    ERROR=$(tail -n250 "$wgetlog" | grep "ERROR\|failed")
    if [ ! -z "$ERROR" ] && [ "$ERROR" != "$lastERROR" ] && [ "$counterMail" -lt 5 ]
        then
        echo "`date "+%Y.%m.%d-%H:%M:%S--"`WARNING: $downloadDir $ERROR. mail to $mailAddress.">> $log
        echo -e "${PWD}\n$ERROR\n"|mutt -s "Wget Running State : WARNING in $whereAmI" $mailAddress
        counterMail=$((counterMail+1))
        lastERROR=$ERROR
        addtoBoolean=1
    fi
    logoldsize=$lognewsize
    sleep $interval
    lognewsize=$(ls -l "$wgetlog" | awk '{print $5}')
done

if [ ! -z "$level" ]
    then
    echo "`date "+%Y.%m.%d-%H:%M:%S--"`FINISH: $message. mail to $mailAddress.">> $log
    echo -e "`date '+%Y-%m-%d +%H:%M:%S'`\n${PWD}\n$message\n"|mutt -s "Wget Report : FINISH $whereAmI--RUNNING $(ps -A|grep -c wget)" $mailAddress
    counterMail=$((counterMail+1))
else
    echo "`date "+%Y.%m.%d-%H:%M:%S--"`ERROR: $message. mail to $mailAddress.">> $log
    echo -e "`date '+%Y-%m-%d +%H:%M:%S'`\n${PWD}\n$message\n"|mutt -s "Wget Report : ERROR in $whereAmI" $mailAddress
    addtoBoolean=1
    counterMail=$((counterMail+1))
fi

if [ "$addtoBoolean" -eq "1" ];then
echo "$downloadDir" >> "$failedqueue"
fi


#Remove the interrupted task from runningTask.
sed -i "/$whereAmI/d" "$runningTask"
echo "`date "+%Y.%m.%d-%H:%M:%S--"`$downloadDir Monitor ending.">> $log
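Reduced to its core, the stall check above compares the log size across two polls and only then inspects the tail for keywords. A standalone sketch (check_growth is an illustrative name, and stat -c %s stands in for the ls -l parsing used in the script):

```shell
# Report whether a wget log is still growing, finished/failed, or stalled.
# $1: path to the log file, $2: poll interval in seconds.
check_growth() {
    oldsize=$(stat -c %s "$1")
    sleep "$2"
    newsize=$(stat -c %s "$1")
    if [ "$newsize" -eq "$oldsize" ]; then
        # No growth: look for a terminal keyword in the last lines.
        if tail -n3 "$1" | grep -q 'FINISH\|failed'; then
            echo "finished-or-failed"
        else
            echo "stalled"
        fi
    else
        echo "growing"
    fi
}
```

In monitor.sh the "stalled" outcome is retried up to $maxtrytimes times before the error is mailed out.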

Summary: This was my first time writing shell scripts, and nearly every round of edits introduced plenty of errors. The quality of the scripts is poor, but fortunately the three scripts are not too tightly coupled and the division of labor is fairly clear, which made things a lot easier. My work computer is on the education network while the data downloads run over a China Unicom PPPoE dial-up connection, so ssh access is slow. Although all the routine work has been reduced to maintaining one xml file (well, strictly speaking it is not an xml file at all, just tagged text), waiting three or four seconds for every character typed over ssh is unbearable. So the next step is to rewrite Script 1's job in Java and manage the xml file through a web page.
