After seeing this Python scraper code, Java wept and C# went silent

Ha, it's actually very simple: a handful of lines of code scrapes an entire novel from the web. No teasing, let's get right to it.

First install the packages we need: requests and BeautifulSoup4.

Run the following in the console:

pip install requests

pip install BeautifulSoup4

If they don't install correctly, check your environment variables. Configuring environment variables won't be covered here; there are plenty of articles on the topic.

Once both install commands have finished, run pip list.

You can see that both packages installed successfully.
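
As a quick sanity check, you can also confirm the install from Python itself (a minimal sketch, not one of the original steps):

# if these imports run without error, both packages are installed and usable
import requests
from bs4 import BeautifulSoup

print(requests.__version__)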

Alright, let's get straight into the code.

Our goal is to scrape every chapter of the novel at this link: https://book.qidian.com/info/1013646681#Catalog

We open the page and inspect the elements with Chrome DevTools to look at each chapter's HTML attributes. We find that the parent of all the chapters is the <ul class="cf"> element, and each chapter's link and title live in an <a> tag under a child <li>.

So the first thing to do is extract the links to all the chapters.

# requests handles the network requests
import requests


chapter = requests.get("https://book.qidian.com/info/1013646681#Catalog")
print(chapter.text)


The page came back fine; next we pull the elements we need out of it.

# requests handles the network requests
import requests
# BeautifulSoup parses the HTML
from bs4 import BeautifulSoup


chapter = requests.get("https://book.qidian.com/info/1013646681#Catalog")

ul_bs = BeautifulSoup(chapter.text, "html.parser")
# grab the <ul> tags whose class is "cf"
ul = ul_bs.find_all("ul", class_="cf")
print(ul)


The <ul> was captured as well. Next we walk the <a> tags under the <ul> to collect every chapter's name and link.

# requests handles the network requests
import requests
# BeautifulSoup parses the HTML
from bs4 import BeautifulSoup


chapter = requests.get("https://book.qidian.com/info/1013646681#Catalog")

ul_bs = BeautifulSoup(chapter.text, "html.parser")
# grab the <ul> tags whose class is "cf"
ul = ul_bs.find_all("ul", class_="cf")
# find every <a> tag under the first matching <ul>
a_bs = ul[0].find_all("a")
# walk the <a> tags, reading each one's href attribute and text
for a in a_bs:
    href = a.get("href")
    text = a.get_text()
    print(href)
    print(text)
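
As an aside, BeautifulSoup's CSS-selector API can collect the same links in a single call; here's a sketch assuming the same page structure:

# "ul.cf li a" matches every chapter <a> under the catalog <ul>
for a in ul_bs.select("ul.cf li a"):
    print(a.get("href"), a.get_text())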

OK, all the chapter links are sorted. Let's go see what a chapter detail page looks like, then work out a concrete plan for scraping it.

Open a chapter and inspect it with Chrome DevTools. The chapter title is kept in <h3 class="j_chapterName">, and the body text is kept in <div class="read-content j_readContent">.

We need to extract the content from these two tags.


# requests handles the network requests
import requests
# BeautifulSoup parses the HTML
from bs4 import BeautifulSoup


chapter = requests.get("https://book.qidian.com/info/1013646681#Catalog")

ul_bs = BeautifulSoup(chapter.text, "html.parser")
# grab the <ul> tags whose class is "cf"
ul = ul_bs.find_all("ul", class_="cf")
# find every <a> tag under the first matching <ul>
a_bs = ul[0].find_all("a")

# the hrefs are protocol-relative, so prepend the scheme
detail = requests.get("https:" + a_bs[0].get("href"))
text_bs = BeautifulSoup(detail.text, "html.parser")
text = text_bs.find_all("div", class_="read-content j_readContent")
print(text)

The body page scraped without a hitch. The code above only demos the first chapter, but since debugging shows a single chapter can be scraped successfully, the next step is simply to loop over all the links and extract each one in turn.
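
The per-chapter work could also be factored into a small helper; the sketch below is a hypothetical refactor (fetch_chapter is my own name), reusing the same selectors found above:

def fetch_chapter(url):
    # fetch one chapter page and return its title and body <div>
    detail = requests.get(url)
    d_bs = BeautifulSoup(detail.text, "html.parser")
    name = d_bs.find("h3", class_="j_chapterName").get_text()
    content = d_bs.find("div", class_="read-content j_readContent")
    return name, content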

# requests handles the network requests
import requests
# BeautifulSoup parses the HTML
from bs4 import BeautifulSoup


chapter = requests.get("https://book.qidian.com/info/1013646681#Catalog")

ul_bs = BeautifulSoup(chapter.text, "html.parser")
# grab the <ul> tags whose class is "cf"
ul = ul_bs.find_all("ul", class_="cf")
# find every <a> tag under the first matching <ul>
a_bs = ul[0].find_all("a")

# walk every chapter link and scrape its page
for a in a_bs:
    detail = requests.get("https:" + a.get("href"))
    d_bs = BeautifulSoup(detail.text, "html.parser")
    # body text
    content = d_bs.find_all("div", class_="read-content j_readContent")
    # chapter title
    name = d_bs.find_all("h3", class_="j_chapterName")[0].get_text()

As the inspector showed, every <p> tag in the body is one paragraph, so the extracted content is littered with <p> tags we don't want. Next we strip them out.

But stripping the <p> tags leaves the text without any paragraph breaks, which makes for an unpleasant read. The fix is simple: append a line break at the end of every paragraph.

# requests handles the network requests
import requests
# BeautifulSoup parses the HTML
from bs4 import BeautifulSoup


chapter = requests.get("https://book.qidian.com/info/1013646681#Catalog")

ul_bs = BeautifulSoup(chapter.text, "html.parser")
# grab the <ul> tags whose class is "cf"
ul = ul_bs.find_all("ul", class_="cf")
# find every <a> tag under the first matching <ul>
a_bs = ul[0].find_all("a")

# walk every chapter link and scrape its page
for a in a_bs:
    detail = requests.get("https:" + a.get("href"))
    d_bs = BeautifulSoup(detail.text, "html.parser")
    # body text
    content = d_bs.find_all("div", class_="read-content j_readContent")
    # chapter title
    name = d_bs.find_all("h3", class_="j_chapterName")[0].get_text()

    txt = ""
    # pull the text out of each <p>, appending a line break per paragraph
    for p in content[0].find_all("p"):
        txt = txt + p.get_text() + "\r\n"
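
Incidentally, BeautifulSoup can insert the separator for you: get_text accepts a separator argument, so the loop above could shrink to a one-liner (a sketch, reusing the content variable from the loop):

# place "\r\n" between the text of the child tags, i.e. one per <p>
txt = content[0].get_text(separator="\r\n")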


With the <p> tags gone, all the hard work is done. All that's left is to save each chapter as a txt file, named after the chapter.
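
One caveat: chapter titles can contain characters that Windows forbids in file names (?, :, and so on), so it's worth scrubbing them before building the path. The helper below is a hypothetical addition, not part of the original code:

import re

def safe_filename(name):
    # replace characters Windows forbids in file names with underscores
    return re.sub(r'[\\/:*?"<>|]', "_", name)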

# requests handles the network requests
import requests
# BeautifulSoup parses the HTML
from bs4 import BeautifulSoup

def create_txt(path, txt):
    # write the chapter text to a UTF-8 file; "with" closes the handle even on error
    try:
        with open(path, 'w', encoding='utf-8') as fd:
            fd.write(txt)
    except OSError as e:
        print("error:", e)


chapter = requests.get("https://book.qidian.com/info/1013646681#Catalog")

ul_bs = BeautifulSoup(chapter.text, "html.parser")
# grab the <ul> tags whose class is "cf"
ul = ul_bs.find_all("ul", class_="cf")
# find every <a> tag under the first matching <ul>
a_bs = ul[0].find_all("a")

# walk every chapter link and scrape its page
for a in a_bs:
    detail = requests.get("https:" + a.get("href"))
    d_bs = BeautifulSoup(detail.text, "html.parser")
    # body text
    content = d_bs.find_all("div", class_="read-content j_readContent")
    # chapter title
    name = d_bs.find_all("h3", class_="j_chapterName")[0].get_text()

    # name each txt file after its chapter
    path = 'F:\\test\\'
    path = path + name + ".txt"

    txt = ""
    # pull the text out of each <p>, appending a line break per paragraph
    for p in content[0].find_all("p"):
        txt = txt + p.get_text() + "\r\n"

    create_txt(path, txt)
    print(path + " saved successfully")


Chapters scraped, files saved: done. That's all it took, just these few lines of code.
