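I really am a complete beginner, and my Python road is still a long one. I'm starting here today and keeping this as a note to myself. 2018-04-05
I spent an afternoon learning to write a crawler for novels. Since I have no real foundation, I filled in whatever I didn't know as I went. It was a bumpy ride, but the script finally runs. I'm putting the code here for now and will ask more experienced people to help with the remaining problem later.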
# -*- coding: utf-8 -*-
# @Time    : 2018/4/5 13:46
# @Author  : ELEVEN
# @File    : crawerl--小說網.py
# @Software: PyCharm
import requests
import re
import time
import os

header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:59.0) Gecko/20100101 Firefox/59.0'
}


def get_type_list(i):
    """Return (novel title, novel url) pairs from page 1 of category i."""
    url = 'http://www.quanshuwang.com/list/{}_1.html'.format(i)
    html = requests.get(url, headers=header)
    html.encoding = 'gbk'  # the site serves GBK-encoded pages
    html = html.text
    novel_list = re.findall(
        r'<a target="_blank" title="(.*?)" href="(.*?)" class="clearfix stitle">',
        html, re.S)
    return novel_list


def get_chapter_list(type_url):
    """Return (chapter url, chapter name) pairs for one novel."""
    html = requests.get(type_url, headers=header)
    html.encoding = 'gbk'
    html = html.text
    # The novel's intro page links to a separate page listing all chapters
    novel_chapter_html = re.findall(
        r'<img src="/kukuku/images/only2.png" class="leftso png_bg"><a href="(.*?)" class="l mr11">',
        html, re.S)[0]
    # Send the same headers here as for every other request
    html = requests.get(novel_chapter_html, headers=header)
    html.encoding = 'gbk'
    html = html.text
    novel_chapter = re.findall(
        r'<li><a href="(http://www.quanshuwang.com/book/.*?)" title=".*?">(.*?)</a></li>',
        html, re.S)
    return novel_chapter


def get_chapter_info(chapter_url):
    """Return the raw chapter text (still containing HTML markup)."""
    html = requests.get(chapter_url, headers=header)
    html.encoding = 'gbk'
    html = html.text
    chapter_info = re.findall(
        r'<div class="mainContenr".*?</script>(.*?)<script.*?</script></div>',
        html, re.S)[0]
    return chapter_info


if __name__ == '__main__':
    # Category id on the site -> category name (also used as the folder name)
    sort_dict = {
        1: '玄幻魔法',   # fantasy and magic
        2: '武俠修真',   # wuxia and cultivation
        3: '純愛耽美',   # romance / danmei
        4: '都市言情',   # urban romance
        5: '職場校園',   # workplace and campus
        6: '穿越重生',   # time travel and rebirth
        7: '歷史軍事',   # history and military
        8: '網遊動漫',   # online games and anime
        9: '恐怖靈異',   # horror and supernatural
        10: '科幻小說',  # science fiction
        11: '美文名著'   # essays and classics
    }
    try:
        if not os.path.exists('全書網'):
            os.mkdir('全書網')
        for sort_id, sort_name in sort_dict.items():
            if not os.path.exists('%s/%s' % ('全書網', sort_name)):
                os.mkdir('%s/%s' % ('全書網', sort_name))
            for type_name, type_url in get_type_list(sort_id):
                # Appending [::-1] to the chapter list would iterate it in reverse
                for chapter_url, chapter_name in get_chapter_list(type_url):
                    # One text file per novel; chapters are appended in order
                    with open('%s/%s/%s.txt' % ('全書網', sort_name, type_name),
                              'a', encoding='utf-8') as f:
                        print('Saving...', chapter_name)
                        f.write('\n' + chapter_name + '\n')
                        f.write(get_chapter_info(chapter_url))
    except OSError as reason:
        # requests' exceptions derive from OSError, so network errors land here too
        print('wrong')
        print('The cause of the problem is %s' % str(reason))
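To make the regex step easier to follow, here is a small sketch that runs the chapter-list pattern from get_chapter_list on a made-up HTML snippet. The two `<li>` lines and their paths are invented for illustration and are not copied from the site. The point is only that re.findall with two capture groups returns a list of (url, name) tuples, which is why the main loop can unpack them directly.

```python
import re

# Invented sample markup, shaped like the chapter list the script expects
sample_html = '''
<li><a href="http://www.quanshuwang.com/book/0/1/100.html" title="Chapter 1, 3000 words">Chapter 1</a></li>
<li><a href="http://www.quanshuwang.com/book/0/1/101.html" title="Chapter 2, 2800 words">Chapter 2</a></li>
'''

# Same pattern as in get_chapter_list: group 1 is the url, group 2 the link text
pattern = r'<li><a href="(http://www.quanshuwang.com/book/.*?)" title=".*?">(.*?)</a></li>'
chapters = re.findall(pattern, sample_html, re.S)

for chapter_url, chapter_name in chapters:
    print(chapter_url, chapter_name)
```

If the pattern matches nothing, findall returns an empty list, so indexing the result with [0] (as the script does in two places) would raise IndexError in that case.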
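Unresolved problems:
1. Cause of the problem: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response',))
My own analysis: probably because I keep hitting the server over and over, it decides I'm a bot and blocks me (anti-crawling). I did change the request headers as well, but it still fails after crawling a few novels.
Result: not solved.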
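One thing that might be worth trying for the error above is to retry a failed request after a growing pause instead of letting the whole run stop. The following is only a sketch under assumptions, not a confirmed fix: fetch_with_retry and its parameters (fetch, retries, base_delay, sleep) are names invented here, and if the site is actively blocking the crawler, retries alone may not help.

```python
import random
import time


def fetch_with_retry(fetch, url, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url); on OSError wait and try again, up to `retries` extra times.

    All names here are hypothetical. requests' exceptions derive from OSError,
    so a dropped connection raised by requests.get would be caught as well.
    """
    for attempt in range(retries + 1):
        try:
            return fetch(url)
        except OSError:
            if attempt == retries:
                raise  # out of attempts: let the caller see the error
            # Exponential backoff (1x, 2x, 4x base_delay) plus up to 0.5 s of jitter
            sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.5))


# Possible use inside the crawler (assumes the `header` dict from the script):
# html = fetch_with_retry(
#     lambda u: requests.get(u, headers=header, timeout=10), chapter_url)
```

Adding a short time.sleep between chapters in the main loop could also lower the request rate. Whether either measure is enough for this site is untested.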