Python Web Scraping in Practice (1): Using requests and BeautifulSoup

Date: 2019-11-17


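Python Basics

I previously wrote 《Python 3 極簡教程.pdf》 (a minimalist Python 3 tutorial), a quick start for readers who already have some programming background. After working through that series you should be able to write simple APIs on your own and build small projects without trouble.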

requests

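requests is a Python HTTP library, comparable to Retrofit on Android. Its features include Keep-Alive and connection pooling, cookie persistence, automatic content decompression, HTTP proxies, SSL verification, connection timeouts, and Sessions. At the time of writing it supported both Python 2 and Python 3. GitHub: github.com/requests/re… .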

Installation

Mac:

pip3 install requests

Windows:

pip install requests

Sending requests

HTTP request methods include GET, POST, PUT, and DELETE.

import requests

# GET request
response = requests.get('http://127.0.0.1:1024/developer/api/v1.0/all')

# POST request
response = requests.post('http://127.0.0.1:1024/developer/api/v1.0/insert')

# PUT request
response = requests.put('http://127.0.0.1:1024/developer/api/v1.0/update')

# DELETE request
response = requests.delete('http://127.0.0.1:1024/developer/api/v1.0/delete')

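A request returns a Response object, which wraps the data the server sends back over HTTP. Its main elements include the status code, reason phrase, response headers, response URL, response encoding, and response body.

# status code
print(response.status_code)

# response URL
print(response.url)

# reason phrase
print(response.reason)

# response body parsed as JSON
print(response.json())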

Custom request headers

To add HTTP headers to a request, pass a dict to the headers keyword argument.

header = {'Application-Id': '19869a66c6',
          'Content-Type': 'application/json'
          }
response = requests.get('http://127.0.0.1:1024/developer/api/v1.0/all/', headers=header)
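
When the same header has to go out with every request, a Session can carry it. The following is a minimal sketch, assuming the Application-Id value from the example above is meant to be sent on every call; it only prepares the request and prints the merged headers, so nothing is sent over the network.

```python
import requests

session = requests.Session()
# Headers set on the session are merged into every request made through it.
session.headers.update({'Application-Id': '19869a66c6'})

# Per-request headers are combined with the session-level ones.
request = requests.Request(
    'GET',
    'http://127.0.0.1:1024/developer/api/v1.0/all/',
    headers={'Content-Type': 'application/json'},
)
prepared = session.prepare_request(request)

# Inspect what would be sent, without sending anything.
for name, value in prepared.headers.items():
    print('%s: %s' % (name, value))
```

The printed list also contains the defaults that requests adds on its own, such as User-Agent.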

Building query parameters

To pass data in the URL's query string, for example http://127.0.0.1:1024/developer/api/v1.0/all?key1=value1&key2=value2 , Requests lets you supply the parameters as a dictionary of strings via the params keyword argument.

payload = {'key1': 'value1', 'key2': 'value2'}
response = requests.get("http://127.0.0.1:1024/developer/api/v1.0/all", params=payload)

You can also pass a list as a value:

payload = {'key1': 'value1', 'key2': ['value2', 'value3']}
response = requests.get("http://127.0.0.1:1024/developer/api/v1.0/all", params=payload)

# response URL
print(response.url)  # prints: http://127.0.0.1:1024/developer/api/v1.0/all?key1=value1&key2=value2&key2=value3
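
If no server is listening on the local address, the encoded URL can still be checked by preparing the request without sending it. This is a small sketch using the same payload as above.

```python
import requests

payload = {'key1': 'value1', 'key2': ['value2', 'value3']}

# Build the request object and prepare it; nothing is sent over the network.
request = requests.Request(
    'GET',
    'http://127.0.0.1:1024/developer/api/v1.0/all',
    params=payload,
)
prepared = request.prepare()

# A list value is expanded into one key=value pair per item.
print(prepared.url)
```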

POST request data

If the server expects form-encoded data, pass it with the data keyword argument.

payload = {'key1': 'value1', 'key2': 'value2'}
response = requests.post("http://127.0.0.1:1024/developer/api/v1.0/insert", data=payload)

If the server expects a JSON string, use the json keyword argument instead. In both cases the value can be passed as a dictionary.

obj = {
    "article_title": "小公務員之死2"
}
response = requests.post('http://127.0.0.1:1024/developer/api/v1.0/insert', json=obj)
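
The difference between data and json shows up in the request body and the Content-Type header. The sketch below prepares both kinds of request without sending them; the article_title value is a placeholder chosen for illustration.

```python
import json

import requests

url = 'http://127.0.0.1:1024/developer/api/v1.0/insert'

# data=: the dict is form-encoded.
form_request = requests.Request(
    'POST', url, data={'key1': 'value1', 'key2': 'value2'}).prepare()
print(form_request.headers['Content-Type'])
print(form_request.body)

# json=: the dict is serialized to a JSON document.
obj = {'article_title': 'placeholder title'}
json_request = requests.Request('POST', url, json=obj).prepare()
print(json_request.headers['Content-Type'])
print(json.loads(json_request.body))
```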

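Response content

Requests automatically decodes content from the server. Most Unicode charsets are decoded seamlessly. After a request is sent, Requests makes an educated guess about the response encoding based on the HTTP headers.

# response content
# decoded body as str (text is a property, not a method)
# print(response.text)
# body parsed as JSON
print(response.json())
# raw body as bytes (content is a property, not a method)
# print(response.content)
# raw socket response; requires stream=True on the original request
# response = requests.get('http://127.0.0.1:1024/developer/api/v1.0/all', stream=True)
# print(response.raw)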
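
To try these attributes without the article's backend, one option is a throwaway local server. The sketch below is an assumption-heavy stand-in: DemoHandler, fetch_demo, the /all path, and the JSON payload are all made up for illustration, and the server comes from the standard library rather than from the article's API.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in for the article's local API."""

    def do_GET(self):
        body = json.dumps({'article_title': 'demo'}).encode('utf-8')
        self.send_response(200)
        self.send_header('Content-Type', 'application/json; charset=utf-8')
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


def fetch_demo():
    # Port 0 lets the operating system pick a free port.
    server = HTTPServer(('127.0.0.1', 0), DemoHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        with requests.Session() as session:
            session.trust_env = False  # ignore proxy environment variables
            url = 'http://127.0.0.1:%d/all' % server.server_port
            return session.get(url, timeout=5)
    finally:
        server.shutdown()
        server.server_close()


response = fetch_demo()
print(response.status_code)  # numeric status code
print(response.encoding)     # encoding taken from the Content-Type header
print(response.text)         # body decoded to str
print(response.content)      # body as raw bytes
print(response.json())       # body parsed as JSON
```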

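Timeouts

Unless a timeout value is set explicitly, requests does not time out on its own. If the server never responds, the whole program blocks and cannot handle other requests.

response = requests.get('http://127.0.0.1:1024/developer/api/v1.0/all', timeout=5)  # timeout in seconds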
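
When the timeout expires, requests raises an exception instead of returning a Response, so a scraper usually wants to catch it. The sketch below assumes a hypothetical helper named fetch_with_timeout and simulates an unresponsive server with a local socket that accepts connections at the TCP level but never replies.

```python
import socket

import requests


def fetch_with_timeout(url, timeout):
    """Return the Response, or None if the request timed out."""
    try:
        with requests.Session() as session:
            session.trust_env = False  # ignore proxy environment variables
            return session.get(url, timeout=timeout)
    except requests.exceptions.Timeout as exc:
        print('request timed out: %s' % type(exc).__name__)
        return None


# A listening socket that never answers stands in for a stalled server.
silent = socket.socket()
silent.bind(('127.0.0.1', 0))
silent.listen(1)
try:
    url = 'http://127.0.0.1:%d/' % silent.getsockname()[1]
    # A tuple sets the connect timeout and the read timeout separately.
    result = fetch_with_timeout(url, timeout=(1, 0.5))
finally:
    silent.close()

print(result)
```

Catching requests.exceptions.Timeout covers both connect and read timeouts; other network failures raise different exceptions and are not handled here.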

Proxy settings

If you hit a site too frequently, the server may well block you. requests supports routing traffic through proxies.

# proxies
proxies = {
    'http': 'http://127.0.0.1:1024',
    'https': 'http://127.0.0.1:4000',
}
response = requests.get('http://127.0.0.1:1024/developer/api/v1.0/all', proxies=proxies)

BeautifulSoup

BeautifulSoup is a Python HTML parsing library, comparable to jsoup in Java.

Installation

BeautifulSoup 3 is no longer developed, so use BeautifulSoup 4 directly.

Mac:

pip3 install beautifulsoup4

Windows:

pip install beautifulsoup4

Installing a parser

I use html5lib, which is implemented in pure Python.

Mac:

pip3 install html5lib

Windows:

pip install html5lib

Basic usage

BeautifulSoup converts a complex HTML document into a tree structure in which every node is a Python object.

Parsing

from bs4 import BeautifulSoup

def get_html_data():
    html_doc = """ <html> <head> <title>WuXiaolong</title> </head> <body> <p>分享 Android 技術，也關注 Python 等熱門技術。</p> <p>寫博客的初衷：總結經驗，記錄本身的成長。</p> <p>你必須足夠的努力，才能看起來絕不費力！專一！精緻！ </p> <p class="Blog"><a href="http://wuxiaolong.me/">WuXiaolong's blog</a></p> <p class="WeChat"><a href="https://open.weixin.qq.com/qr/code?username=MrWuXiaolong">公衆號：吳小龍同窗</a> </p> <p class="GitHub"><a href="http://example.com/tillie" class="sister" id="link3">GitHub</a></p> </body> </html> """
    soup = BeautifulSoup(html_doc, "html5lib")
    # return the parsed tree so the snippets that follow can use it
    return soup


soup = get_html_data()

tag

tag = soup.head
print(tag)  # <head><title>WuXiaolong</title></head>
print(tag.name)  # head
print(tag.title)  # <title>WuXiaolong</title>
print(soup.p)  # <p>分享 Android 技術，也關注 Python 等熱門技術。</p>
print(soup.a['href'])  # prints the href attribute of the a tag: http://wuxiaolong.me/

Note: when several tags match, attribute-style access returns the first one, as with the p tag here.
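
A tag also exposes its attributes and its position in the tree. The sketch below uses one line from the sample document and, as an assumption made to avoid the html5lib dependency, the parser built into the standard library ("html.parser").

```python
from bs4 import BeautifulSoup

snippet = ('<p class="GitHub"><a href="http://example.com/tillie" '
           'class="sister" id="link3">GitHub</a></p>')
soup = BeautifulSoup(snippet, "html.parser")

link = soup.a
print(link.attrs)         # all attributes as a dict; class is a list of values
print(link.get('title'))  # get() returns None for a missing attribute
print(link.parent.name)   # name of the enclosing tag
print(soup.h1)            # None, because the snippet has no h1 tag
```

Indexing with link['title'] would raise KeyError for a missing attribute, so get() is the safer choice when a page's markup may vary.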

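Searching

print(soup.find('p'))  # <p>分享 Android 技術，也關注 Python 等熱門技術。</p>

find also returns the first matching tag by default, and returns None when nothing matches. To target a specific node, such as the WeChat official account entry here, filter on an attribute such as the tag's class:

# class is a Python keyword, so the argument is spelled class_.
print(soup.find('p', class_="WeChat"))
# <p class="WeChat"><a href="https://open.weixin.qq.com/qr/code?username=MrWuXiaolong">公衆號：吳小龍同窗</a> </p>

Find all p tags:

for p in soup.find_all('p'):
    print(p.string)  # None when the tag has more than one child node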
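
Because .string is None for a tag with several children, get_text() is often more convenient when only the visible text matters. The following sketch works on a shortened copy of the sample document with placeholder text, and assumes the built-in "html.parser" so that it runs without html5lib.

```python
from bs4 import BeautifulSoup

html_doc = """
<html><body>
<p class="Blog"><a href="http://wuxiaolong.me/">WuXiaolong's blog</a> </p>
<p class="GitHub"><a href="http://example.com/tillie" class="sister" id="link3">GitHub</a></p>
</body></html>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# The first p holds a link plus trailing whitespace, so .string is None,
# while get_text() still returns the visible text.
blog = soup.find('p', class_="Blog")
print(blog.string)
print(blog.get_text(strip=True))

# Collect the text and target of every link.
links = [(a.get_text(strip=True), a['href']) for a in soup.find_all('a')]
print(links)

# CSS selectors are an alternative to find / find_all.
github = soup.select_one('p.GitHub a')
print(github['id'])

# find returns None when nothing matches.
missing = soup.find('p', class_="NoSuchClass")
print(missing)
```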

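In practice

A while ago a user reported that my personal app had stopped working. I no longer maintain the app, but I still want to keep it running at a minimum. Most people know that its data is scraped (see 《手把手教你作我的app》). One advantage of scraped data is that I don't have to manage the data myself. The downside is that when the source site goes down or its HTML nodes change, my parsing fails and the app has no data. This report made me consider scraping the site's data into my own store. I happened to be teaching myself Python and wanted the practice, so I got started. The original plan was to scrape with Python, insert into a local MySQL database, write an API with Flask, call it from Android with Retrofit, and then push the data into bmob with the bmob SDK. That was cumbersome and did not seem workable. I later learned that bmob offers a RESTful API, which solves the main problem: the Python scraper can insert the data directly. This article demonstrates inserting into a local database; with bmob you would call its RESTful API to insert the data instead.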

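Choosing a site

The demo site I chose is meiriyiwen.com/random . Each request returns a different article, which is convenient: I only need to request the page on a schedule, parse the data I need, and insert it into the database.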

Creating the database

I created it with Navicat Premium; you can of course use the command line instead.

Creating the table

Create the article table with pymysql. The table needs the fields id, article_title, article_author, and article_content. The code is below and only needs to run once.

import pymysql


def create_table():
    # open the connection
    db = pymysql.connect(host='localhost',
                         user='root',
                         password='root',
                         db='python3learn')
    # SQL statement that creates the article table
    sql = '''create table if not exists article ( id int NOT NULL AUTO_INCREMENT, article_title text, article_author text, article_content text, PRIMARY KEY (`id`) )'''
    # create a cursor object with cursor()
    cursor = db.cursor()
    try:
        # execute the SQL statement
        cursor.execute(sql)
        # commit the transaction
        db.commit()
        print('create table success')
    except Exception as e:  # roll back if an error occurs
        db.rollback()
        print(e)

    finally:
        # close the cursor
        cursor.close()
        # close the database connection
        db.close()


if __name__ == '__main__':
    create_table()

Parsing the site

First fetch the page with requests, then use BeautifulSoup to extract the nodes you need.

import requests
from bs4 import BeautifulSoup


def get_html_data():
    # GET request; the timeout keeps an unresponsive server from blocking forever
    response = requests.get('https://meiriyiwen.com/random', timeout=10)

    soup = BeautifulSoup(response.content, "html5lib")
    article = soup.find("div", id='article_show')
    article_title = article.h1.string
    print('article_title=%s' % article_title)
    article_author = article.find('p', class_="article_author").string
    print('article_author=%s' % article_author)
    article_contents = article.find('div', class_="article_text").find_all('p')
    article_content = ''
    for content in article_contents:
        article_content = article_content + str(content)
    # print once, after all paragraphs have been joined
    print('article_content=%s' % article_content)
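
Since the main risk described earlier is the site's HTML changing, it can help to separate parsing from fetching and to return None when an expected node is missing. The sketch below assumes a hypothetical helper named parse_article and a made-up SAMPLE page that only mimics the node names used above. It uses the built-in "html.parser" so that it runs without html5lib.

```python
from bs4 import BeautifulSoup

# Made-up page that imitates the structure the scraper expects.
SAMPLE = """
<div id="article_show">
  <h1>Sample title</h1>
  <p class="article_author"><span>Sample author</span></p>
  <div class="article_text"><p>First.</p><p>Second.</p></div>
</div>
"""


def parse_article(html):
    """Return the article fields as a dict, or None if a node is missing."""
    soup = BeautifulSoup(html, "html.parser")
    article = soup.find("div", id="article_show")
    if article is None:
        return None
    title = article.find("h1")
    author = article.find("p", class_="article_author")
    text = article.find("div", class_="article_text")
    if title is None or author is None or text is None:
        return None
    return {
        "article_title": title.get_text(strip=True),
        "article_author": author.get_text(strip=True),
        # keep the paragraph markup, as the original loop does
        "article_content": "".join(str(p) for p in text.find_all("p")),
    }


parsed = parse_article(SAMPLE)
print(parsed)
print(parse_article("<html><body></body></html>"))  # page without the article node
```

Keeping the parser free of network calls also makes it easy to test against saved copies of the page.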

Inserting into the database

There is a filter here: assuming article titles on this site are unique, a row is not inserted if one with the same title already exists.

import pymysql


def insert_table(article_title, article_author, article_content):
    # open the connection
    db = pymysql.connect(host='localhost',
                         user='root',
                         password='root',
                         db='python3learn',
                         charset="utf8")
    # query used for the duplicate check, and the insert statement
    query_sql = 'select * from article where article_title=%s'
    sql = 'insert into article (article_title,article_author,article_content) values (%s, %s, %s)'
    # create a cursor object with cursor()
    cursor = db.cursor()
    try:
        query_value = (article_title,)
        # look for an existing row with the same title
        cursor.execute(query_sql, query_value)
        results = cursor.fetchall()
        if len(results) == 0:
            value = (article_title, article_author, article_content)
            cursor.execute(sql, value)
            # commit the transaction
            db.commit()
            print('--------------《%s》 insert table success-------------' % article_title)
            return True
        else:
            print('--------------《%s》 already exists-------------' % article_title)
            return False

    except Exception as e:  # roll back if an error occurs
        db.rollback()
        print(e)
        return False

    finally:
        # close the cursor
        cursor.close()
        # close the database connection
        db.close()
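
The check-then-insert pattern above takes two statements. An alternative is to let the database enforce uniqueness. The sketch below swaps in the standard library's sqlite3 module instead of MySQL and pymysql so that it runs without a server. The UNIQUE constraint, the INSERT OR IGNORE statement, and the helper name insert_article are assumptions made for illustration, not the article's method, and the SQL is SQLite syntax.

```python
import sqlite3


def insert_article(conn, title, author, content):
    """Insert a row unless the title already exists; return True if inserted."""
    cursor = conn.execute(
        "INSERT OR IGNORE INTO article "
        "(article_title, article_author, article_content) VALUES (?, ?, ?)",
        (title, author, content),
    )
    conn.commit()
    # rowcount is 0 when the UNIQUE constraint caused the insert to be skipped
    return cursor.rowcount == 1


conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute(
    "CREATE TABLE IF NOT EXISTS article ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "article_title TEXT UNIQUE, "
    "article_author TEXT, "
    "article_content TEXT)"
)

first = insert_article(conn, "Sample title", "Sample author", "<p>First.</p>")
second = insert_article(conn, "Sample title", "Sample author", "<p>First.</p>")
count = conn.execute("SELECT COUNT(*) FROM article").fetchone()[0]
print(first, second, count)
conn.close()
```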

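Scheduling

I set up a timer so the scraper runs again at a fixed interval.

import sched
import time


# Initialize the scheduler class from the sched module.
# The first argument is a function that returns a timestamp; the second is a function that blocks until the scheduled time arrives.
schedule = sched.scheduler(time.time, time.sleep)


# function invoked periodically by the scheduler
def print_time(inc):
    # to do something
    print('to do something')
    schedule.enter(inc, 0, print_time, (inc,))


# default interval: 60 s
def start(inc=60):
    # enter takes four arguments: the delay, the priority (orders events scheduled for the same time),
    # the function to call, and a tuple of arguments for that function
    schedule.enter(0, 0, print_time, (inc,))
    schedule.run()


if __name__ == '__main__':
    # print every 5 s
    start(5)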
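
The loop above reschedules itself forever. To see the rescheduling pattern finish quickly, the sketch below replaces the real clock with a simulated one and stops after a fixed number of runs. FakeClock and run_periodically are hypothetical names introduced for illustration.

```python
import sched


class FakeClock:
    """Simulated clock: sleeping advances the time instantly."""

    def __init__(self):
        self.now = 0.0

    def time(self):
        return self.now

    def sleep(self, seconds):
        self.now += seconds


def run_periodically(scheduler, interval, job, remaining):
    """Run job, then schedule the next run until the runs are used up."""
    job()
    if remaining > 1:
        scheduler.enter(interval, 0, run_periodically,
                        (scheduler, interval, job, remaining - 1))


clock = FakeClock()
scheduler = sched.scheduler(clock.time, clock.sleep)

ticks = []  # simulated times at which the job ran
scheduler.enter(0, 0, run_periodically,
                (scheduler, 5, lambda: ticks.append(clock.time()), 3))
scheduler.run()
print(ticks)  # [0.0, 5.0, 10.0]
```

With time.time and time.sleep passed to sched.scheduler instead, the same function would wait for the real interval between runs.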

Complete code

import pymysql
import requests
from bs4 import BeautifulSoup
import sched
import time


def create_table():
    # open the connection
    db = pymysql.connect(host='localhost',
                         user='root',
                         password='root',
                         db='python3learn')
    # SQL statement that creates the article table
    sql = '''create table if not exists article ( id int NOT NULL AUTO_INCREMENT, article_title text, article_author text, article_content text, PRIMARY KEY (`id`) )'''
    # create a cursor object with cursor()
    cursor = db.cursor()
    try:
        # execute the SQL statement
        cursor.execute(sql)
        # commit the transaction
        db.commit()
        print('create table success')
    except Exception as e:  # roll back if an error occurs
        db.rollback()
        print(e)

    finally:
        # close the cursor
        cursor.close()
        # close the database connection
        db.close()


def insert_table(article_title, article_author, article_content):
    # open the connection
    db = pymysql.connect(host='localhost',
                         user='root',
                         password='root',
                         db='python3learn',
                         charset="utf8")
    # query used for the duplicate check, and the insert statement
    query_sql = 'select * from article where article_title=%s'
    sql = 'insert into article (article_title,article_author,article_content) values (%s, %s, %s)'
    # create a cursor object with cursor()
    cursor = db.cursor()
    try:
        query_value = (article_title,)
        # look for an existing row with the same title
        cursor.execute(query_sql, query_value)
        results = cursor.fetchall()
        if len(results) == 0:
            value = (article_title, article_author, article_content)
            cursor.execute(sql, value)
            # commit the transaction
            db.commit()
            print('--------------《%s》 insert table success-------------' % article_title)
            return True
        else:
            print('--------------《%s》 already exists-------------' % article_title)
            return False

    except Exception as e:  # roll back if an error occurs
        db.rollback()
        print(e)
        return False

    finally:
        # close the cursor
        cursor.close()
        # close the database connection
        db.close()


def get_html_data():
    # GET request; the timeout keeps an unresponsive server from blocking forever
    response = requests.get('https://meiriyiwen.com/random', timeout=10)

    soup = BeautifulSoup(response.content, "html5lib")
    article = soup.find("div", id='article_show')
    article_title = article.h1.string
    print('article_title=%s' % article_title)
    article_author = article.find('p', class_="article_author").string
    print('article_author=%s' % article_author)
    article_contents = article.find('div', class_="article_text").find_all('p')
    article_content = ''
    for content in article_contents:
        article_content = article_content + str(content)
    # print once, after all paragraphs have been joined
    print('article_content=%s' % article_content)

    # insert into the database
    insert_table(article_title, article_author, article_content)


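# Initialize the scheduler class from the sched module.
# The first argument is a function that returns a timestamp; the second is a function that blocks until the scheduled time arrives.
schedule = sched.scheduler(time.time, time.sleep)


# function invoked periodically by the scheduler
def print_time(inc):
    get_html_data()
    schedule.enter(inc, 0, print_time, (inc,))


# default interval: 60 s
def start(inc=60):
    # enter takes four arguments: the delay, the priority (orders events scheduled for the same time),
    # the function to call, and a tuple of arguments for that function
    schedule.enter(0, 0, print_time, (inc,))
    schedule.run()


if __name__ == '__main__':
    # scrape every 5 minutes
    start(60*5)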