Scraping a Novel from a Website with Python (Part 2)

Date: 2020-10-23


Using regular expressions

The re.compile function

The compile function compiles a regular expression into a pattern object (Pattern), which is then used through methods such as match() and search().

Syntax:

 re.compile(pattern[, flags])

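Parameters:

pattern : a regular expression given as a string
flags : optional; the matching mode, such as ignoring case or multiline mode. The available flags are:
1. re.I ignore case
2. re.L make the special character classes \w, \W, \b, \B, \s, \S depend on the current locale (in Python 3 this flag is only allowed with bytes patterns)
3. re.M multiline mode
4. re.S make . match any character, including a newline (by default . does not match a newline)
5. re.U make the special character classes \w, \W, \b, \B, \d, \D, \s, \S depend on the Unicode character properties database (already the default for str patterns in Python 3)
6. re.X verbose mode: whitespace and anything after # in the pattern are ignored, which makes long patterns easier to read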
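To make the flags concrete, here is a minimal sketch that combines re.I with re.M and compares a pattern with and without re.S. The sample strings are made up for illustration and do not come from the target site.

```python
import re

# Hypothetical sample text, not taken from the target site.
text = "Chapter ONE\nchapter two\nCHAPTER three"

# re.I ignores case; re.M lets ^ and $ match at every line boundary.
pattern = re.compile(r"^chapter (\w+)$", re.I | re.M)
titles = pattern.findall(text)
print(titles)  # ['ONE', 'two', 'three']

# Without re.S the dot stops at a newline; with re.S it crosses it.
html = "<p>line one\nline two</p>"
no_dotall = re.compile(r"<p>(.+)</p>").findall(html)
with_dotall = re.compile(r"<p>(.+)</p>", re.S).findall(html)
print(no_dotall)    # []
print(with_dotall)  # ['line one\nline two']
```

Flags are combined with the bitwise OR operator, as in re.I | re.M above.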

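findall

Finds every substring of the string that the regular expression matches and returns them as a list. If nothing matches, an empty list is returned.

Note: match and search return at most one match, whereas findall returns all of them.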
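The difference between the three methods can be seen in a small sketch on a made-up string (the string and variable names are only illustrative):

```python
import re

pattern = re.compile(r"\d+")
text = "ch 12, ch 34, ch 56"  # hypothetical sample string

first_at_start = pattern.match(text)   # only tries to match at index 0
first_anywhere = pattern.search(text)  # first match anywhere in the string
all_matches = pattern.findall(text)    # every non-overlapping match

print(first_at_start)          # None
print(first_anywhere.group())  # 12
print(all_matches)             # ['12', '34', '56']
```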

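Syntax:

 findall(string[, pos[, endpos]])

Parameters:

string : the string to search.
pos : optional; the index in the string at which the search starts. Defaults to 0.
endpos : optional; the index in the string at which the search ends. Defaults to the length of the string.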
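A short sketch of how pos and endpos narrow the search window, followed by what findall returns when the pattern contains more than one group. Both sample strings and the `link` pattern are assumptions made up for the example.

```python
import re

pattern = re.compile(r"\d+")
text = "a1 b22 c333 d4444"  # hypothetical sample string

everything = pattern.findall(text)
from_pos = pattern.findall(text, 3)    # start searching at index 3
window = pattern.findall(text, 3, 10)  # search as if the string ended at index 10

print(everything)  # ['1', '22', '333', '4444']
print(from_pos)    # ['22', '333', '4444']
print(window)      # ['22', '33']

# With two or more groups, findall returns one tuple per match.
link = re.compile(r'<a href="(.+?\.html)">(.+?)</a>')
row = '<a href="1.html">One</a><a href="2.html">Two</a>'
pairs = link.findall(row)
print(pairs)       # [('1.html', 'One'), ('2.html', 'Two')]
```

The tuple form matters whenever a pattern captures both a link and its text: each result is then indexed by group position.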

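Python crawler for a novel site: downloading a novel (regular expressions)

Approach:

Find the index page of the novel you want to download and inspect the page source (for example: https://www.kanunu8.com/files/old/2011/2447.html).
Work out what you need to extract. Start with the URLs: only the trailing part changes from chapter to chapter, so first collect each chapter's relative path and then join it with the base URL to form the chapter URL.
Fetch the content of each chapter and clean it up.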
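The URL-building step can be tried offline before touching the network. The sketch below is only an illustration: `sample_html` is a made-up fragment imitating a table-of-contents row, the `link` pattern is a non-greedy variant written for this example, and urljoin from the standard library is used here instead of plain string concatenation.

```python
import re
from urllib.parse import urljoin

# Base URL of the table-of-contents page.
base_url = 'https://www.kanunu8.com/book4/10509/'

# Hypothetical fragment imitating the table-of-contents markup.
sample_html = (
    '<td><a href="184612.html">Chapter 1</a></td>\n'
    '<td width="25%"><a href="184613.html">Chapter 2</a></td>\n'
)

# (?: ... ) is a non-capturing group, so findall yields (href, title) pairs.
link = re.compile(r'<td(?: width="25%")?><a href="(.+?\.html)">(.+?)</a></td>')

chapters = [(title, urljoin(base_url, href))
            for href, title in link.findall(sample_html)]

for title, chapter_url in chapters:
    print(title, chapter_url)
```

urljoin resolves a relative path against the base the same way a browser would, so it also copes with hrefs that start with / or ../.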

Source code

 import re
 import requests
 
 # The site to scrape
 url = 'https://www.kanunu8.com/book4/10509/'
 
 # Fetch the raw bytes first, then decode them
 txt = requests.get(url).content.decode('gbk')
 # .content is raw bytes, e.g.   ---n<head>\r\n<title>\xd6\xd0\xb9\xfa\xba\xcf\xbb\xef\xc8\xcb
 # print(txt)
 
 m1 = re.compile(r'<td colspan="4" align="center"><strong>(.+)</strong>')
 # print(m1.findall(txt))
 
 m2 = re.compile(r'<td( width="25%")?><a href="(.+\.html)">(.+)</a></td>')
 print(m2.findall(txt))
 
 # Get the table of contents and the relative path of each chapter
 raw = m2.findall(txt)
 
 sanguo = []
 for i in raw:
     print([i[2],url+i[1]])
     # ['第五章 成功之母', 'https://www.kanunu8.com/book4/10509/184616.html']
     # Build the full URL for each chapter
     sanguo.append([i[2],url+i[1]])
 
 print("*"*100)
 print(sanguo)
 # [['第一章 夢的起源', 'https://www.kanunu8.com/book4/10509/184612.html'], ['第二章 偶像兄弟', 'https://www.kanunu8.com/book4/10509/184613.html'], ['第三章 戀愛必修', 'https://www.kanunu8.com/book4/10509/184614.html'], ['第四章 愛的代價', 'https://www.kanunu8.com/book4/10509/184615.html'], ['第五章 成功之母', 'https://www.kanunu8.com/book4/10509/184616.html'], ['第六章 命運轉折', 'https://www.kanunu8.com/book4/10509/184617.html'], ['第七章 被迫下海', 'https://www.kanunu8.com/book4/10509/184618.html'], ['第八章 漸行漸遠', 'https://www.kanunu8.com/book4/10509/184619.html'], ['第九章 三箭合一', 'https://www.kanunu8.com/book4/10509/184620.html'], ['第十章 夢想起航', 'https://www.kanunu8.com/book4/10509/184621.html'], ['第十一章 領航夢想', 'https://www.kanunu8.com/book4/10509/184622.html'], ['第十二章 平地波瀾', 'https://www.kanunu8.com/book4/10509/184623.html'], ['第十三章 新的招牌', 'https://www.kanunu8.com/book4/10509/184624.html'], ['第十四章 神的弱點', 'https://www.kanunu8.com/book4/10509/184625.html'], ['第十五章 裂隙初現', 'https://www.kanunu8.com/book4/10509/184626.html'], ['第十六章 上市之爭', 'https://www.kanunu8.com/book4/10509/184627.html'], ['第十七章 夢想巔峯', 'https://www.kanunu8.com/book4/10509/184628.html'], ['第十八章 乾綱專斷', 'https://www.kanunu8.com/book4/10509/184629.html'], ['第十九章 一劍穿心', 'https://www.kanunu8.com/book4/10509/184630.html'], ['第二十章 渡盡劫波', 'https://www.kanunu8.com/book4/10509/184631.html'], ['尾\u3000聲', 'https://www.kanunu8.com/book4/10509/184632.html']]
 
 
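 # Match the body text of each chapter
 # The body of each chapter sits inside a <p> tag
 m3 = re.compile(r'<p>(.+)</p>',re.S)
 
 # <br /> tags in the text are replaced with an empty string
 m4 = re.compile(r'<br />')
 
 # The runs of &nbsp; are removed as well
 m5 = re.compile(r'&nbsp;&nbsp;&nbsp;&nbsp;')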
 
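 # Create (or append to) the output file 中國合夥人1.txt; write it as UTF-8
 with open('中國合夥人1.txt','a',encoding='utf-8') as f:
     for i in sanguo:
         # i[1] is the chapter URL
         i_url = i[1]
         print("Downloading ---> %s" % i[0])
         # Fetch the chapter page as raw bytes, then decode it
         r_nr = requests.get(i_url).content.decode('gbk')
         # Match the body text: the part wrapped in <p>
         n_nr = m3.findall(r_nr)
         print(n_nr)
         # Skip the chapter if no <p> block was found, to avoid an IndexError
         if not n_nr:
             continue
         # Remove <br />. Unlike replace(), sub() accepts a regular expression
         n = m4.sub('',n_nr[0])
         # Remove the &nbsp; runs as well
         n2 = m5.sub('',n)
         n2 = n2.replace('\n','')
         # Write to the file
         # i[0] is the chapter title
         f.write('\n'+i[0]+'\n')
         f.write(n2)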
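The cleanup step can also be pulled into a small function and checked without any network access. This is a sketch under assumptions, not the method used above: `clean_chapter`, `BR_TAG`, `PARAGRAPH` and the sample page are hypothetical, the paragraph pattern is non-greedy so it stops at the first closing tag, and html.unescape from the standard library replaces the hand-written &nbsp; pattern.

```python
import html
import re

BR_TAG = re.compile(r'<br\s*/?>', re.I)        # <br>, <br/>, <br />
PARAGRAPH = re.compile(r'<p>(.+?)</p>', re.S)  # non-greedy: stop at the first </p>


def clean_chapter(page: str) -> str:
    """Return the text of the first <p>...</p> block, or '' if there is none."""
    found = PARAGRAPH.search(page)
    if found is None:
        return ''
    body = BR_TAG.sub('', found.group(1))
    body = html.unescape(body)       # turns &nbsp; into '\xa0', &amp; into '&', ...
    body = body.replace('\xa0', '')  # drop the non-breaking spaces used as indentation
    return body.replace('\r', '').replace('\n', '')


# Hypothetical page fragment imitating a chapter page.
sample = ('<p>&nbsp;&nbsp;&nbsp;&nbsp;First line.<br />\r\n'
          '&nbsp;&nbsp;&nbsp;&nbsp;Second line.<br />\r\n</p><p>footer</p>')
cleaned = clean_chapter(sample)
print(cleaned)  # First line.Second line.
```

Returning an empty string for a page without a paragraph lets the caller decide whether to skip the chapter or log it, rather than failing on an index lookup.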