Python crawler, step by step: from scraping one chapter of a novel to scraping every novel on a site

Date: 2019-11-10

Preface

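The text and images in this article come from the internet and are for study and exchange only; they serve no commercial purpose. Copyright belongs to the original authors. If there is any problem, please contact us promptly so we can deal with it.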

PS: If you need Python learning materials, you can follow the link below to get them yourself.

http://note.youdao.com/noteshare?id=3054cce4add8a909e784ad934f956ce

Many good novels can only be read online and cannot be downloaded. This article shows how to scrape all the novels on a site.

Key points:

requests
xpath
Approach to scraping every novel on a site

Development environment:

Version: anaconda5.2.0 (python3.6.5)
Editor: pycharm

Third-party libraries:

requests
parsel

Analyzing the web page

Target site:

Using the developer tools
- network
- element

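Scraping one chapter

Using the requests library (requesting the page data)
Wrapping the request step in a function
Using css selectors (parsing the page data)
Working with files (persisting the data)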

# -*- coding: utf-8 -*-
"""Scrape one chapter of a novel."""
import requests
import parsel

# Request the page
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36'
}

response = requests.get('http://www.shuquge.com/txt/8659/2324752.html', headers=headers)
# Let requests guess the encoding from the page content
response.encoding = response.apparent_encoding
html = response.text
print(html)


# Extract the content from the page
sel = parsel.Selector(html)

title = sel.css('.content h1::text').extract_first()
contents = sel.css('#content::text').extract()
# Strip leading and trailing whitespace from each text fragment
contents2 = []
for content in contents:
    contents2.append(content.strip())

print(contents)
print(contents2)

print("\n".join(contents2))

# Write the content to a text file
with open(title+'.txt', mode='w', encoding='utf-8') as f:
    f.write("\n".join(contents2))
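
The step list mentions wrapping the request step, which the script above does not do yet. A minimal sketch of what that could look like, assuming a hypothetical helper named `fetch_html` that receives a `requests.Session` from the caller. The helper name, the shortened User-Agent and the 10 second timeout are illustrative choices, not part of the original script.

```python
# Shortened User-Agent for illustration; the full string from the script works the same way.
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'
}


def fetch_html(url, session, timeout=10):
    """Return the decoded HTML of `url` (hypothetical helper).

    `session` is expected to be a requests.Session, or any object
    with a compatible .get() method.
    """
    response = session.get(url, headers=HEADERS, timeout=timeout)
    # Raise an exception on 4xx/5xx instead of parsing an error page.
    response.raise_for_status()
    # Same approach as the script: let requests guess the page encoding.
    response.encoding = response.apparent_encoding
    return response.text
```

With a real session this could be called as `fetch_html('http://www.shuquge.com/txt/8659/2324752.html', requests.Session())`. Passing the session in keeps the helper easy to try out without network access.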

Scraping one book

Refactoring the crawler

To scrape many chapters, the crudest approach is to use a plain for loop.
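
As a sketch of that crude approach: the chapter URLs used in this article end in 2324752.html, 2324753.html and 2324754.html, so a loop could simply count upward. This assumes chapter ids are consecutive, which is not guaranteed, and `chapter_urls` is a hypothetical helper.

```python
def chapter_urls(base_url, first_id, count):
    """Yield `count` chapter URLs, assuming chapter ids are consecutive."""
    for chapter_id in range(first_id, first_id + count):
        yield '{}{}.html'.format(base_url, chapter_id)


for url in chapter_urls('http://www.shuquge.com/txt/8659/', 2324752, 3):
    # In the crawler, this is where the chapter would be downloaded.
    print(url)
```

The guess breaks as soon as the ids skip a number, which is why the index page is the more reliable source of chapter URLs.
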
Scraping the index page

To scrape every chapter, all that is needed is the URL of each chapter.

import requests
import parsel

"""Fetch the page source"""

# Send the request with a browser-like User-Agent
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36'
}


def download_one_chapter(target_url):
    """Download one chapter and save it as '<title>.txt'."""
    # target_url is the chapter to request, for example
    # 'http://www.shuquge.com/txt/8659/2324753.html'
    # response is the object wrapping what the server returned
    # (in PyCharm, Ctrl + left click on a name jumps to its definition)
    response = requests.get(target_url, headers=headers)

    # Decode using the encoding requests detects from the content
    response.encoding = response.apparent_encoding

    # response.text is the page content as a string
    html = response.text

    # Extract information from the page source:
    # parsel turns the string into a Selector object (scrapy uses the same selectors)
    sel = parsel.Selector(html)

    # ::text is a parsel extension to CSS that selects the text of the matched tag
    # extract_first() returns the first match
    title = sel.css('.content h1::text').extract_first()
    # extract() returns every match as a list
    contents = sel.css('#content::text').extract()
    print(title)
    print(contents)

    # Data cleaning: strip whitespace from both ends of each fragment.
    # The list comprehension replaces an explicit loop with append().
    contents1 = [content.strip() for content in contents]
    print(contents1)
    # Join the list into a single string
    text = '\n'.join(contents1)
    print(text)

    # Save the chapter; the with block closes the file automatically
    with open(title + '.txt', mode='w', encoding='utf-8') as file:
        # write() only accepts strings
        file.write(title + '\n')
        file.write(text)


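# Given the URL of a book's index page, return its chapter links
def get_book_links(book_url):
    # Send the same browser-like headers as for chapter pages
    response = requests.get(book_url, headers=headers)
    response.encoding = response.apparent_encoding
    html = response.text
    sel = parsel.Selector(html)
    # ::attr(href) selects the href attribute of each chapter link
    links = sel.css('dd a::attr(href)').extract()
    return links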


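# Download one whole book
def get_one_book(book_url):
    links = get_book_links(book_url)
    # Chapter links are relative to the index page, so build the prefix
    # from book_url instead of hard-coding one book's path
    base_url = book_url.rsplit('/', 1)[0] + '/'
    for link in links:
        print(base_url + link)
        download_one_chapter(base_url + link)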



if __name__ == '__main__':
    # target_url = 'http://www.shuquge.com/txt/8659/2324754.html'
    # # keyword argument vs. positional argument
    # download_one_chapter(target_url=target_url)
    # To download a different novel, just change this URL
    book_url = 'http://www.shuquge.com/txt/8659/index.html'
    get_one_book(book_url)
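
The hrefs taken from the index page are joined to a book prefix before they are requested. A sketch of doing that step with the standard library's `urllib.parse.urljoin`, which resolves a link against the URL of the page it came from. `absolute_links` is a hypothetical helper and the sample hrefs are made up for illustration.

```python
from urllib.parse import urljoin


def absolute_links(page_url, hrefs):
    """Resolve each href against the URL of the page it was found on."""
    return [urljoin(page_url, href) for href in hrefs]


book_url = 'http://www.shuquge.com/txt/8659/index.html'
# Made-up sample of what get_book_links() might return.
links = absolute_links(book_url, ['2324752.html', '2324753.html'])
print(links)
```

`urljoin` also copes with hrefs that start with `/` or that are already full URLs, so the same call works whichever form the site uses.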

Scraping the whole site

Scraping the index page

To scrape every novel, all that is needed is the index page of each book.
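
The article stops at the idea, so here is only a rough sketch of how it might be wired together. It assumes that every book index URL has the form `/txt/<number>/index.html`, which is a guess based on the single book URL used above, and that the caller supplies a function returning all hrefs on a page (for example via `sel.css('a::attr(href)').extract()`). The names `book_index_urls`, `crawl_site` and `get_links` are hypothetical, and which page of the site lists all books is not stated in the original.

```python
import re
from urllib.parse import urljoin

# Assumed URL pattern of a book index page (inferred from one example only).
BOOK_INDEX = re.compile(r'/txt/\d+/index\.html$')


def book_index_urls(page_url, hrefs):
    """Keep the hrefs that look like book index pages, without duplicates."""
    urls = []
    for href in hrefs:
        url = urljoin(page_url, href)
        if BOOK_INDEX.search(url) and url not in urls:
            urls.append(url)
    return urls


def crawl_site(site_url, get_links, get_one_book):
    """Find every book index linked from `site_url` and download each book."""
    book_urls = book_index_urls(site_url, get_links(site_url))
    for book_url in book_urls:
        get_one_book(book_url)
    return book_urls


# Offline demo with stand-ins for the real functions (made-up hrefs).
fake_hrefs = ['/txt/8659/index.html', '/txt/8659/index.html', '/about.html']
crawl_site('http://www.shuquge.com/', lambda url: fake_hrefs, print)
```

In the real crawler, `get_one_book` from the previous section would be passed as the last argument, and a pause between requests would be a sensible addition.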
