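The crawler scripts floating around online no longer work. Thanks to their authors all the same, but I'm lazy and don't like rewriting tutorials other people have already written, so I'm only posting my changes here. For how to use it, go read the original tutorial.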
Here is my modified version, which crawls properly:
#!/usr/bin/env python
#coding=utf-8
#
# Openwrt Package Grabber
#
# Copyright (C) 2016 sohobloo.me
#

import urllib2
import re
import os
import time

# the url of package list page, end with "/"
baseurl = 'https://downloads.openwrt.org/snapshots/trunk/ramips/mt7620/packages/'
# which directory to save all the packages, end with "/"
# (named "timestamp" so the time module is not shadowed)
timestamp = time.strftime("%Y%m%d%H%M%S", time.localtime())
savedir = './' + timestamp + '/'

# capture href values, skipping links that start with "?"
pattern = r'<a href="([^\?].*?)">'

cnt = 0

def fetch(url, path = ''):
    # cnt is a module-level counter; without this declaration
    # "cnt += 1" raises UnboundLocalError
    global cnt
    if not os.path.exists(savedir + path):
        os.makedirs(savedir + path)
    print 'fetching package list from ' + url + path
    content = urllib2.urlopen(url + path, timeout=15).read()
    items = re.findall(pattern, content)
    for item in items:
        if item == '../':
            # skip the link back to the parent directory
            continue
        elif item.endswith('/'):
            # subdirectory: recurse into it
            fetch(url, path + item)
        else:
            cnt += 1
            print 'downloading item %d: '%(cnt) + path + item
            if os.path.isfile(savedir + path + item):
                print 'file exists, ignored.'
            else:
                rfile = urllib2.urlopen(baseurl + path + item)
                with open(savedir + path + item, "wb") as code:
                    code.write(rfile.read())

fetch(baseurl)
print 'done!'
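Changes:
1. Added a root directory level named after the current timestamp
2. Changed the regex to filter out invalid addresses (those starting with a question mark)
3. Switched to crawling the directory structure recursively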
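These three changes can be tried out without touching the network. Below is a minimal sketch, written in Python 3 syntax (the script itself is Python 2). It assumes a made-up in-memory listing, `FAKE_PAGES`, in place of the real download pages. `list_files` and `timestamped_root` are hypothetical helper names for illustration only and are not part of the script. Only the regex and the timestamp format are taken from it.

```python
import re
import time

# Same pattern as the script: capture href values that do not start with "?".
PATTERN = r'<a href="([^\?].*?)">'

# Made-up directory listings keyed by relative path (an assumption for this demo).
FAKE_PAGES = {
    '': '<a href="?C=N;O=D">Name</a><a href="../">../</a>'
        '<a href="base/">base/</a><a href="Packages.gz">Packages.gz</a>',
    'base/': '<a href="?C=M;O=A">Last modified</a><a href="../">../</a>'
             '<a href="foo.ipk">foo.ipk</a>',
}


def list_files(path=''):
    """Recursively collect file paths, descending into entries ending in '/'."""
    found = []
    for item in re.findall(PATTERN, FAKE_PAGES[path]):
        if item == '../':
            # link back to the parent directory, skip it
            continue
        elif item.endswith('/'):
            # subdirectory: recurse with the extended relative path
            found.extend(list_files(path + item))
        else:
            found.append(path + item)
    return found


def timestamped_root(now=None):
    """Build the save directory name in the script's %Y%m%d%H%M%S format."""
    return './' + time.strftime('%Y%m%d%H%M%S', now or time.localtime()) + '/'


files = list_files()
print(files)               # ['base/foo.ipk', 'Packages.gz']
print(timestamped_root())  # depends on the current time
```

The sort links that start with `?` never make it into the result, and `base/` is walked before the top-level file simply because it appears first in the sample listing.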
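Also, I'm happy my Python knowledge has finally come in handy. Yay.
I tried to update the screenshot but the upload failed. Cnblogs looks like it's on its last legs.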