wget has no option to limit the size of a downloaded file, so if your URL list contains a very large file, the download will inevitably drag on. Instead, curl is used to fetch the file's HTTP headers and parse out the Content-Length, which gives the size of the file before downloading it. If that size exceeds a preset threshold, the file is skipped.
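For example, the header-only request at the core of this check looks roughly like the following (the URL here is just a placeholder):

    curl -I -s "http://example.com/file.zip" | grep -i Content-Length
    # prints something like: Content-Length: 1048576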
Of course, the current bash script does nothing special for URLs that have no Content-Length header, so downloading large files still cannot be avoided entirely. One idea for extending it, sketched below:
While the download is running, poll the file's size repeatedly, and kill the download process once it exceeds a given threshold.
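A minimal sketch of that idea, assuming GNU coreutils (stat -c%s) and a single download; url, limitsize and outfile are placeholder names for illustration, not variables from the script below:

    url="http://example.com/big.iso"   # placeholder URL
    limitsize=10485760                 # 10 MB threshold
    outfile="big.iso"

    # start the download in the background and remember its PID
    wget -q -O "$outfile" "$url" &
    pid=$!

    # poll while the wget process is still alive
    while kill -0 "$pid" 2>/dev/null
    do
        # stat -c%s prints the current size of the partial file in bytes
        size=$(stat -c%s "$outfile" 2>/dev/null || echo 0)
        if [ "$size" -gt "$limitsize" ]
        then
            echo "size $size exceeds $limitsize bytes, killing download..."
            kill "$pid"
            rm -f "$outfile"
            break
        fi
        sleep 1
    done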
So the current version, shown below, still leaves room for improvement:
#!/bin/bash
# download.sh: download the URLs in a list, skipping any file whose
# Content-Length exceeds a given threshold.

if [ $# -eq 4 ]
then
    echo "start downloading..."
    urllist=$1
    limitsize=$2
    outfolder=$3
    logfolder=$4
    echo "url list file:$urllist"
    echo "limited file size:$limitsize bytes"
    echo "output folder:$outfolder"
    echo "log folder:$logfolder"
else
    echo "usage: ./download.sh <url list> <limited file size> <output folder> <log folder>..."
    exit 1
fi

if [ -d "$outfolder" ]
then
    echo "$outfolder exists..."
else
    echo "make $outfolder..."
    mkdir "$outfolder"
fi

if [ -d "$logfolder" ]
then
    echo "$logfolder exists..."
else
    echo "make $logfolder..."
    mkdir "$logfolder"
fi

cat "$urllist" | while read url
do
    echo "downloading:$url"
    # fetch only the headers and extract the Content-Length value;
    # tr strips the trailing carriage return (octal 15) from the header line
    len=$(curl -I -s "$url" | grep -i Content-Length | cut -d' ' -f2 | tr -d '\15')
    if [ ! -z "$len" ]
    then
        echo "length:$len bytes"
        if [ "$len" -gt "$limitsize" ]
        then
            echo "$url is greater than $limitsize bytes, can't be downloaded."
        else
            echo "$url is smaller than $limitsize bytes, can be downloaded."
            # build a log file name by stripping characters that are illegal in file names
            filename=$(echo "$url" | tr -d ':/?\|*<>')
            wget -P "$outfolder" -x -t 3 --save-headers --connect-timeout=10 --read-timeout=10 --level=1 "$url" -o "$logfolder/$filename.txt"
        fi
    else
        # no Content-Length header: download anyway (see the note above)
        echo "$url file size is unknown."
        filename=$(echo "$url" | tr -d ':/?\|*<>')
        wget -P "$outfolder" -x -t 3 --save-headers --connect-timeout=10 --read-timeout=10 --level=1 "$url" -o "$logfolder/$filename.txt"
    fi
done
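For reference, an example invocation (the file and folder names here are illustrative): limit each file to 10 MB, save downloads under output/ and per-URL wget logs under logs/:

    ./download.sh urls.txt 10485760 output logs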