import os
import urllib

def downloadXml(isExists, filedir, filename):
    if not isExists:
        os.mkdir(filedir)
    local = os.path.join(filedir, filename)
    urllib.urlretrieve(url, local)   # url is defined elsewhere in the script
It failed with the following error:
Traceback (most recent call last):
File "C:\Users\william\Desktop\nova xml\New folder\download_xml.py", line 95, in <module>
downloadXml(isExists,filedir,filename)
File "C:\Users\william\Desktop\nova xml\New folder\download_xml.py", line 80, in downloadXml
urllib.urlretrieve(url,local)
File "E:\Python27\lib\urllib.py", line 98, in urlretrieve
return opener.retrieve(url, filename, reporthook, data)
File "E:\Python27\lib\urllib.py", line 245, in retrieve
fp = self.open(url, data)
File "E:\Python27\lib\urllib.py", line 213, in open
return getattr(self, name)(url)
File "E:\Python27\lib\urllib.py", line 350, in open_http
h.endheaders(data)
File "E:\Python27\lib\httplib.py", line 1053, in endheaders
self._send_output(message_body)
File "E:\Python27\lib\httplib.py", line 897, in _send_output
self.send(msg)
File "E:\Python27\lib\httplib.py", line 859, in send
self.connect()
File "E:\Python27\lib\httplib.py", line 836, in connect
self.timeout, self.source_address)
File "E:\Python27\lib\socket.py", line 575, in create_connection
raise err
IOError: [Errno socket error] [Errno 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond
I googled for an answer with the search: urlretrieve Errno 10060
An answer at https://segmentfault.com/q/1010000004386726 explains: accessing a site too frequently can be treated as a DoS attack, and sites that apply rate limiting will stop responding for a while. You can catch the exception, sleep for a while, and retry, or do exponential backoff based on the number of retries.
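For reference, a minimal sketch of that exponential-backoff idea, in the same Python 2 style as the rest of this post (the helper name, delay schedule, and retry cap are illustrative assumptions, not part of the original script):

import time
import urllib

def retrieve_with_backoff(xml_url, local, max_tries=5):
    # Illustrative helper: double the wait (1s, 2s, 4s, 8s) after each
    # failed attempt, giving up after max_tries attempts.
    for attempt in range(max_tries):
        try:
            urllib.urlretrieve(xml_url, local)
            return
        except IOError:
            if attempt == max_tries - 1:
                raise                  # out of retries, let the caller see the error
            time.sleep(2 ** attempt)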
A simple fix came to mind: add a delay between downloads. The code became:
import time   # needed for the delay below

def downloadXml(isExists, filedir, filename):
    if not isExists:
        os.mkdir(filedir)
    local = os.path.join(filedir, filename)
    time.sleep(1)                    # wait 1s between downloads
    urllib.urlretrieve(url, local)
I ran it. Previously the timeout hit at around the 80th record; this time it got through more than 2,300 records before, unfortunately, timing out again.
Lengthening the delay, say from 1s to 5s, would probably avoid the error, but it wastes far too much time: the 5s cost is paid even when nothing goes wrong. Better to delay and retry only after an error actually occurs.
So:
def downloadXml(isExists, filedir, filename):
    if not isExists:
        os.makedirs(filedir)
    local = os.path.join(filedir, filename)
    try:
        urllib.urlretrieve(url, local)
    except Exception as e:
        time.sleep(5)                # back off, then retry once
        urllib.urlretrieve(url, local)
With this version, the script could get stuck on a particular record and never move past it, so I changed it to retry any one record at most 10 times.
def downloadXml(flag_exists, file_dir, file_name, xml_url, cur_try=0):
    if not flag_exists:
        os.makedirs(file_dir)
    local = os.path.join(file_dir, file_name)
    try:
        urllib.urlretrieve(xml_url, local)
    except Exception as e:
        print e
        total_try = 10
        if cur_try < total_try:
            time.sleep(15)
            # carry the retry count into the recursive call so the
            # 10-attempt cap actually takes effect
            return downloadXml(flag_exists, file_dir, file_name, xml_url, cur_try + 1)
        else:
            raise Exception(e)
After this change the errors indeed went away and the run completed. But then I noticed a problem: when a download ultimately fails, the URL that failed is never recorded. So I added a step that writes failed URLs to a local text file, so they can be reviewed and re-run by hand later.
def downloadXml(flag_exists, file_dir, file_name, xml_url, cur_try=0):
    if not flag_exists:
        os.makedirs(file_dir)
    local = os.path.join(file_dir, file_name)
    try:
        urllib.urlretrieve(xml_url, local)
    except Exception as e:
        print 'the first error: ', e
        total_try = 10
        if cur_try < total_try:
            time.sleep(15)
            return downloadXml(flag_exists, file_dir, file_name, xml_url, cur_try + 1)
        else:
            print 'the last error: '
            # test_dir is defined elsewhere in the script
            with open(test_dir + 'error_url.txt', 'a') as f:
                f.write(xml_url + '\n')   # one failed URL per line
            raise Exception(e)
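For completeness, a hypothetical driver loop showing how downloadXml might be invoked over a batch of records; the directory, URL list, and test_dir value below are all illustrative assumptions, not taken from the original script:

# Hypothetical usage sketch -- the real script derives xml_url and
# file_name from its own data source (not shown in the post).
test_dir = 'C:/nova_xml/'                        # assumed; ends with a separator
file_dir = os.path.join(test_dir, 'xml')
xml_urls = ['http://example.com/nova/%d.xml' % i for i in range(1, 4)]

for xml_url in xml_urls:
    file_name = xml_url.rsplit('/', 1)[-1]       # e.g. '1.xml'
    downloadXml(os.path.exists(file_dir), file_dir, file_name, xml_url)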
Somewhat anticlimactically, this run produced no failed URLs at all; perhaps the site's traffic was simply light at the time.