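My company needs to cache a web page from our server onto a router, so that users visiting the page are served the router's cached copy directly. I'm not sure what the point of this requirement is, but I'll do my best to implement it.
wget is a web retrieval tool for Unix and Unix-like systems, and once I got familiar with it I found it can do far more than that. This post, though, only covers how to fetch a given URL together with the related content beneath it (HTML, JS, CSS, images) and convert the absolute paths in that content into relative ones. I found plenty of articles online about using wget to grab pages and their image resources, but none of them was practical; every attempt I made following them ended in failure.
Here is the content of the file produced by wget -h > ./help_wget.txt: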
GNU Wget 1.16, a non-interactive network retriever.
Usage: wget [OPTION]... [URL]...

Mandatory arguments to long options are mandatory for short options too.

Startup:
  -V,  --version                   display the version of Wget and exit.
  -h,  --help                      print this help.
  -b,  --background                go to background after startup.
  -e,  --execute=COMMAND           execute a `.wgetrc'-style command.

Logging and input file:
  -o,  --output-file=FILE          log messages to FILE.
  -a,  --append-output=FILE        append messages to FILE.
  -q,  --quiet                     quiet (no output).
  -v,  --verbose                   be verbose (this is the default).
  -nv, --no-verbose                turn off verboseness, without being quiet.
       --report-speed=TYPE         Output bandwidth as TYPE.  TYPE can be bits.
  -i,  --input-file=FILE           download URLs found in local or external FILE.
  -F,  --force-html                treat input file as HTML.
  -B,  --base=URL                  resolves HTML input-file links (-i -F)
                                   relative to URL.
       --config=FILE               Specify config file to use.
       --no-config                 Do not read any config file.

Download:
  -t,  --tries=NUMBER              set number of retries to NUMBER (0 unlimits).
       --retry-connrefused         retry even if connection is refused.
  -O,  --output-document=FILE      write documents to FILE.
  -nc, --no-clobber                skip downloads that would download to
                                   existing files (overwriting them).
  -c,  --continue                  resume getting a partially-downloaded file.
       --start-pos=OFFSET          start downloading from zero-based position OFFSET.
       --progress=TYPE             select progress gauge type.
       --show-progress             display the progress bar in any verbosity mode.
  -N,  --timestamping              don't re-retrieve files unless newer than local.
       --no-use-server-timestamps  don't set the local file's timestamp by
                                   the one on the server.
  -S,  --server-response           print server response.
       --spider                    don't download anything.
  -T,  --timeout=SECONDS           set all timeout values to SECONDS.
       --dns-timeout=SECS          set the DNS lookup timeout to SECS.
       --connect-timeout=SECS      set the connect timeout to SECS.
       --read-timeout=SECS         set the read timeout to SECS.
  -w,  --wait=SECONDS              wait SECONDS between retrievals.
       --waitretry=SECONDS         wait 1..SECONDS between retries of a retrieval.
       --random-wait               wait from 0.5*WAIT...1.5*WAIT secs between retrievals.
       --no-proxy                  explicitly turn off proxy.
  -Q,  --quota=NUMBER              set retrieval quota to NUMBER.
       --bind-address=ADDRESS      bind to ADDRESS (hostname or IP) on local host.
       --limit-rate=RATE           limit download rate to RATE.
       --no-dns-cache              disable caching DNS lookups.
       --restrict-file-names=OS    restrict chars in file names to ones OS allows.
       --ignore-case               ignore case when matching files/directories.
  -4,  --inet4-only                connect only to IPv4 addresses.
  -6,  --inet6-only                connect only to IPv6 addresses.
       --prefer-family=FAMILY      connect first to addresses of specified family,
                                   one of IPv6, IPv4, or none.
       --user=USER                 set both ftp and http user to USER.
       --password=PASS             set both ftp and http password to PASS.
       --ask-password              prompt for passwords.
       --no-iri                    turn off IRI support.
       --local-encoding=ENC        use ENC as the local encoding for IRIs.
       --remote-encoding=ENC       use ENC as the default remote encoding.
       --unlink                    remove file before clobber.

Directories:
  -nd, --no-directories            don't create directories.
  -x,  --force-directories         force creation of directories.
  -nH, --no-host-directories       don't create host directories.
       --protocol-directories      use protocol name in directories.
  -P,  --directory-prefix=PREFIX   save files to PREFIX/...
       --cut-dirs=NUMBER           ignore NUMBER remote directory components.

HTTP options:
       --http-user=USER            set http user to USER.
       --http-password=PASS        set http password to PASS.
       --no-cache                  disallow server-cached data.
       --default-page=NAME         Change the default page name (normally
                                   this is `index.html'.).
  -E,  --adjust-extension          save HTML/CSS documents with proper extensions.
       --ignore-length             ignore `Content-Length' header field.
       --header=STRING             insert STRING among the headers.
       --max-redirect              maximum redirections allowed per page.
       --proxy-user=USER           set USER as proxy username.
       --proxy-password=PASS       set PASS as proxy password.
       --referer=URL               include `Referer: URL' header in HTTP request.
       --save-headers              save the HTTP headers to file.
  -U,  --user-agent=AGENT          identify as AGENT instead of Wget/VERSION.
       --no-http-keep-alive        disable HTTP keep-alive (persistent connections).
       --no-cookies                don't use cookies.
       --load-cookies=FILE         load cookies from FILE before session.
       --save-cookies=FILE         save cookies to FILE after session.
       --keep-session-cookies      load and save session (non-permanent) cookies.
       --post-data=STRING          use the POST method; send STRING as the data.
       --post-file=FILE            use the POST method; send contents of FILE.
       --method=HTTPMethod         use method "HTTPMethod" in the request.
       --body-data=STRING          Send STRING as data. --method MUST be set.
       --body-file=FILE            Send contents of FILE. --method MUST be set.
       --content-disposition       honor the Content-Disposition header when
                                   choosing local file names (EXPERIMENTAL).
       --content-on-error          output the received content on server errors.
       --auth-no-challenge         send Basic HTTP authentication information
                                   without first waiting for the server's
                                   challenge.

HTTPS (SSL/TLS) options:
       --secure-protocol=PR        choose secure protocol, one of auto, SSLv2,
                                   SSLv3, TLSv1 and PFS.
       --https-only                only follow secure HTTPS links
       --no-check-certificate      don't validate the server's certificate.
       --certificate=FILE          client certificate file.
       --certificate-type=TYPE     client certificate type, PEM or DER.
       --private-key=FILE          private key file.
       --private-key-type=TYPE     private key type, PEM or DER.
       --ca-certificate=FILE       file with the bundle of CA's.
       --ca-directory=DIR          directory where hash list of CA's is stored.
       --random-file=FILE          file with random data for seeding the SSL PRNG.
       --egd-file=FILE             file naming the EGD socket with random data.

FTP options:
       --ftp-user=USER             set ftp user to USER.
       --ftp-password=PASS         set ftp password to PASS.
       --no-remove-listing         don't remove `.listing' files.
       --no-glob                   turn off FTP file name globbing.
       --no-passive-ftp            disable the "passive" transfer mode.
       --preserve-permissions      preserve remote file permissions.
       --retr-symlinks             when recursing, get linked-to files (not dir).

WARC options:
       --warc-file=FILENAME        save request/response data to a .warc.gz file.
       --warc-header=STRING        insert STRING into the warcinfo record.
       --warc-max-size=NUMBER      set maximum size of WARC files to NUMBER.
       --warc-cdx                  write CDX index files.
       --warc-dedup=FILENAME       do not store records listed in this CDX file.
       --no-warc-compression       do not compress WARC files with GZIP.
       --no-warc-digests           do not calculate SHA1 digests.
       --no-warc-keep-log          do not store the log file in a WARC record.
       --warc-tempdir=DIRECTORY    location for temporary files created by the
                                   WARC writer.

Recursive download:
  -r,  --recursive                 specify recursive download.
  -l,  --level=NUMBER              maximum recursion depth (inf or 0 for infinite).
       --delete-after              delete files locally after downloading them.
  -k,  --convert-links             make links in downloaded HTML or CSS point to
                                   local files.
       --backups=N                 before writing file X, rotate up to N backup files.
  -K,  --backup-converted          before converting file X, back up as X.orig.
  -m,  --mirror                    shortcut for -N -r -l inf --no-remove-listing.
  -p,  --page-requisites           get all images, etc. needed to display HTML page.
       --strict-comments           turn on strict (SGML) handling of HTML comments.

Recursive accept/reject:
  -A,  --accept=LIST               comma-separated list of accepted extensions.
  -R,  --reject=LIST               comma-separated list of rejected extensions.
       --accept-regex=REGEX        regex matching accepted URLs.
       --reject-regex=REGEX        regex matching rejected URLs.
       --regex-type=TYPE           regex type (posix).
  -D,  --domains=LIST              comma-separated list of accepted domains.
       --exclude-domains=LIST      comma-separated list of rejected domains.
       --follow-ftp                follow FTP links from HTML documents.
       --follow-tags=LIST          comma-separated list of followed HTML tags.
       --ignore-tags=LIST          comma-separated list of ignored HTML tags.
  -H,  --span-hosts                go to foreign hosts when recursive.
  -L,  --relative                  follow relative links only.
  -I,  --include-directories=LIST  list of allowed directories.
       --trust-server-names        use the name specified by the redirection
                                   url last component.
  -X,  --exclude-directories=LIST  list of excluded directories.
  -np, --no-parent                 don't ascend to the parent directory.

Mail bug reports and suggestions to <bug-wget@gnu.org>.
Going by wget's help text, I tried the following command:
wget -r -np -pk -nH -P ./download http://www.baidu.com
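-r download everything recursively
-np only download content under the given URL; never ascend to its parent directory
-p download every resource the page needs in order to display, including images and CSS
-k convert the links in downloaded files so they point to the local copies through relative paths (this one matters: when a user opens the page, the related resources must be looked up locally)
-nH stop wget from creating a directory named after the host (without it, this command would store the downloaded content under ./download/www.baidu.com/)
-P the directory to download into, here the download folder under the current directory; if it does not exist, wget creates it for you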
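For readability, the short options above can be spelled out with the long names listed in the help text. The following is only a sketch: print_long_form is a hypothetical helper that prints the command instead of running it, so nothing is downloaded.

```shell
#!/bin/sh
# Hypothetical helper (not part of wget): prints the long-option form of
# "wget -r -np -pk -nH -P DIR URL" without executing it.
print_long_form() {
    dest=$1   # download directory, same role as -P
    url=$2    # start URL
    printf 'wget --recursive --no-parent --page-requisites --convert-links --no-host-directories --directory-prefix=%s %s\n' "$dest" "$url"
}

print_long_form ./download http://www.baidu.com
```

Copying the printed line into a terminal runs the same download as the short form.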
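These options all fit the requirement, but the result was a surprise: it is not as simple as we imagined, and wget did not give us what we wanted.
If you run this command, you will find that only index.html and robots.txt end up in the download folder; the images that index.html needs were not downloaded.
The paths in the <img> tags were not replaced with relative paths either; it looks as if only the "http:" string was stripped.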
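To check a downloaded page without opening it in a browser, you can list its src attributes and see whether any still point at the network. This is a rough sketch: list_src and has_remote_src are hypothetical helpers, the sample HTML is made up for the demo, and it assumes a grep that supports -o (GNU, BSD and BusyBox grep do).

```shell
#!/bin/sh
# Hypothetical helper: print every src="..." attribute found in an HTML file.
list_src() {
    grep -o 'src="[^"]*"' "$1"
}

# Hypothetical helper: succeed if any src starts with http://, https:// or //.
has_remote_src() {
    list_src "$1" | grep -Eq '^src="(https?:)?//'
}

# Demo on a small made-up sample file rather than a real download.
sample=$(mktemp)
printf '%s\n' '<img src="//www.baidu.com/img/bd_logo1.png">' '<img src="img/a.png">' > "$sample"
list_src "$sample"
if has_remote_src "$sample"; then
    echo "some src attributes still point at the network"
fi
rm -f "$sample"
```

For a real run, pass ./download/index.html instead of the sample file.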
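As for why this happens, read on.
Since the command above does not work, it is time to get creative. Let's write a shell script named wget_cc with the following content: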
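#!/bin/sh
# Usage: ./wget_cc <download-dir> <url>
# The directory is kept in DEST_DIR rather than PATH: assigning to PATH
# would overwrite the shell's command search path.
DEST_DIR="$1"
URL="$2"
echo "download url: $URL"
echo "download dir: $DEST_DIR"
if /usr/bin/wget -e robots=off -w 1 -xq -np -nH -pk -m -t 1 -P "$DEST_DIR" "$URL"; then
    echo "download finished"
else
    echo "wget exited with status $?" >&2
fi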
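A few more options were added here, so let me explain them:
-e is used as '-e command'.
It executes an extra .wgetrc-style command. Just as vim keeps its configuration in .vimrc, wget keeps its configuration in .wgetrc, which means wget applies the settings in .wgetrc before it runs. For a typical .wgetrc file and the list of available commands, see: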
http://www.gnu.org/software/wget/manual/html_node/Sample-Wgetrc.html
http://www.gnu.org/software/wget/manual/html_node/Wgetrc-Commands.html
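With -e you can specify extra configuration commands without editing .wgetrc. To specify several of them, write -e command1 -e command2 ... -e commandN. These commands are executed after all the commands in .wgetrc, so they override the same settings there.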
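As a small illustration of chaining several -e options, the sketch below turns each wgetrc-style command into its own -e. e_options is a hypothetical helper that only prints the resulting command line; robots, tries and wait are wgetrc commands described in the Wgetrc-Commands page linked above.

```shell
#!/bin/sh
# Hypothetical helper: build "wget -e cmd1 -e cmd2 ..." from its arguments
# and print it instead of running it.
e_options() {
    opts=""
    for rc in "$@"; do
        opts="$opts -e $rc"
    done
    printf 'wget%s\n' "$opts"
}

e_options robots=off tries=1 wait=1
# prints: wget -e robots=off -e tries=1 -e wait=1
```

Here tries=1 and wait=1 have the same effect as the -t 1 and -w 1 options used in the script.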
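robots=off is used here because, by default, wget follows the site's robots.txt during recursive downloads. If robots.txt contains User-agent: * followed by Disallow: /, wget cannot mirror the site or download its directories.
That is why the images and other resources could not be downloaded earlier: the host you are crawling forbids spiders, and the -e robots=off option makes wget ignore that restriction.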
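Before reaching for robots=off, you can check whether a site's robots.txt really blocks everything. The sketch below is a deliberately rough parse, not a full robots.txt implementation: blocks_everyone is a hypothetical helper, and it ignores comments, CRLF line endings and groups that list several user agents.

```shell
#!/bin/sh
# Hypothetical helper: succeed if FILE contains a "User-agent: *" group
# that has a "Disallow: /" rule.
blocks_everyone() {
    awk '
        tolower($1) == "user-agent:" { star = ($2 == "*") }
        star && tolower($1) == "disallow:" && $2 == "/" { found = 1 }
        END { exit (found ? 0 : 1) }
    ' "$1"
}

# Demo on a made-up robots.txt written to a temporary file.
robots=$(mktemp)
printf '%s\n' 'User-agent: *' 'Disallow: /' > "$robots"
if blocks_everyone "$robots"; then
    echo "robots.txt disallows everything; recursion needs -e robots=off"
fi
rm -f "$robots"
```

For a real site, run it on the robots.txt that wget saved into the download folder in the first attempt.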
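-x create the directory structure that corresponds to the mirrored site
-q quiet download, that is, no download messages are shown; if you want to see which resource wget is currently fetching, remove this option
-m turn on the options suited to mirroring, such as recursion into subdirectories with unlimited depth (per the help text it is a shortcut for -N -r -l inf --no-remove-listing)
-t times number of tries for each resource (-t 1 means a failed download is not retried)
-w seconds wait time between resource requests (to ease the load on the server)
For anything else you don't understand, go dig through the documentation.
Once the script is written, save and exit, then run: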
chmod 744 wget_cc
Now let's run the script!
./wget_cc ./download http://www.baidu.com
OK, now look at the src attribute in the <img> tag again:
src="img/bd_logo1.png"
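Sure enough, it has been replaced with a relative path. Job done! If you found this helpful, please give it a like!
This is Freestyletime@foxmail.com; feel free to get in touch.
This is my original work; please credit the source when reposting.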