The text and images in this article come from the internet and are for learning and exchange only, not for any commercial use. Copyright belongs to the original author; if there is any issue, please contact us promptly so we can handle it.
Author: 肖豪
Even though I had found a resource site to download from, every time I had to open the browser, type in the URL, find the show, and then click the link before I could download. Over time the process felt tedious, and sometimes the site's links wouldn't even open, which was a bit of a hassle.

So I wrote a crawler that grabs every show link on the site and saves them to text files. Whenever I want a show, I just open the file and copy its link into Thunder (迅雷) to download. The implementation code follows.
import re
import sys
import threading
import time

import requests


class Archives(object):

    def save_links(self, url):
        try:
            data = requests.get(url, timeout=3)
            content = data.text
            link_pat = r'"(ed2k://\|file\|[^"]+?\.(S\d+)(E\d+)[^"]+?1024X\d{3}[^"]+?)"'
            name_pat = re.compile(r'<h2 class="entry_title">(.*?)</h2>', re.S)
            links = set(re.findall(link_pat, content))
            name = re.findall(name_pat, content)
            links_dict = {}
            count = len(links)
        except Exception:
            return
        for i in links:
            # Index each episode by season and episode number (SxxEyy)
            links_dict[int(i[1][1:3]) * 100 + int(i[2][1:3])] = i
        try:
            with open(name[0].replace('/', ' ') + '.txt', 'w') as f:
                print(name[0])
                # Write the links in season + episode order
                for i in sorted(links_dict.keys()):
                    f.write(links_dict[i][0] + '\n')
            print("Get links ... ", name[0], count)
        except Exception:
            pass

    def get_urls(self):
        try:
            for i in range(2015, 25000):
                base_url = 'http://cn163.net/archives/'
                url = base_url + str(i) + '/'
                if requests.get(url).status_code == 404:
                    continue
                else:
                    self.save_links(url)
        except Exception:
            pass

    def main(self):
        # Pass the method itself; writing target=self.get_urls() would call it
        # immediately in the main thread instead of in the new thread
        thread1 = threading.Thread(target=self.get_urls)
        thread1.start()
        thread1.join()


if __name__ == '__main__':
    start = time.time()
    a = Archives()
    a.main()
    end = time.time()
    print(end - start)
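To see how the parsing in save_links works without hitting the network, here is a minimal sketch that runs the same two regexes and the same season-times-100-plus-episode sort key against a made-up HTML fragment (the fragment and show name are invented; only the regex patterns come from the crawler above):

```python
import re

# Invented HTML fragment mimicking a cn163.net archive page; the real
# crawler gets this content from requests.get(url).text.
html = '''
<h2 class="entry_title">Some Show 第二季</h2>
<a href="ed2k://|file|Some.Show.S02E02.1024X576.mkv|123|ABC|/">ep2</a>
<a href="ed2k://|file|Some.Show.S02E01.1024X576.mkv|456|DEF|/">ep1</a>
'''

link_pat = r'"(ed2k://\|file\|[^"]+?\.(S\d+)(E\d+)[^"]+?1024X\d{3}[^"]+?)"'
name_pat = re.compile(r'<h2 class="entry_title">(.*?)</h2>', re.S)

# Each match is a tuple: (full ed2k link, season tag, episode tag)
links = set(re.findall(link_pat, html))
name = re.findall(name_pat, html)

# Season * 100 + episode collapses SxxEyy into one sortable integer
# (e.g. S02E01 -> 201), so episodes come out in broadcast order.
links_dict = {int(s[1:3]) * 100 + int(e[1:3]): link for link, s, e in links}
ordered = [links_dict[k] for k in sorted(links_dict)]

print(name[0])
print(ordered)
```

This also makes the sort trick's limit visible: it assumes two-digit season and episode numbers, which is fine for this site's naming scheme.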