Big Data and Cloud Computing Study Notes: Web Scraping with Python

Date: 2020-01-13


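This article introduces the basic principles of web scraping:

How to request information from a web server with Python
How to do basic processing of the server's response
How to interact with websites in an automated way
How to build a crawler that can move across domains, collect information, and store it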

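Learning path

How a crawler works

A crawler is an automated data-collection tool: you tell it which data to collect, hand it a URL, and it fetches the data on its own. Under the hood, the crawler sends an HTTP request to the target server, the server returns a response, and the crawler client extracts data from that response, then cleans and stores it.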
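The extract, clean, and store steps described above can be sketched offline. This is only an illustration under stated assumptions: `SAMPLE_HTML` stands in for a response body that a real crawler would download, and the names `LinkExtractor`, `extract`, `clean`, and `store` are hypothetical helpers, not part of any library.

```python
import json
from html.parser import HTMLParser

# Hypothetical response body; a real crawler would get this from an HTTP request.
SAMPLE_HTML = (
    "<html><body><h1> Hello </h1>"
    "<a href='/a'>A</a><a href='/b'> B </a></body></html>"
)


class LinkExtractor(HTMLParser):
    """Collects the href and text of every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append({"href": self._href, "text": "".join(self._text)})
            self._href = None


def extract(html):
    """Extraction step: pull raw records out of the response body."""
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links


def clean(records):
    """Cleaning step: strip surrounding whitespace from the link text."""
    return [{"href": r["href"], "text": r["text"].strip()} for r in records]


def store(records):
    """Storage step: serialize one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


records = clean(extract(SAMPLE_HTML))
output = store(records)
print(output)
```

In a real crawler the string returned by `store` would be written to a file or a database instead of printed.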

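The screenshot below comes from the Juejin booklet 基於 Python 實現微信公衆號爬蟲 (Building a WeChat Official Account Crawler with Python), which summarizes the book 《圖解HTTP》.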

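Basic Python modules for crawling

urllib

Python offers many ways to make HTTP requests, and third-party libraries add richer features, so you never have to start from raw socket communication. The simplest starting point is Python's built-in urllib module, which can request a URL directly.
Let's warm up by writing a test crawler: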

from urllib.request import urlopen  # import urlopen from the request module of Python's urllib library
html = urlopen("http://jxdxsw.com/")  # urlopen opens a remote object over the network so it can be read
print(html.read())

Save this code as `scrapetest.py` and run the following command in a terminal:

python3 scrapetest.py

This prints the complete HTML of the home page at http://jxdxsw.com/.

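Note from 鯨魚 (the author):
In Python 3.x, urllib is split into submodules:

urllib.request
urllib.parse
urllib.error
urllib.robotparser

urllib is part of the Python standard library. It provides functions to:

request data from the network
handle cookies
change metadata such as request headers and the user agent

See the official Python documentation for more.

Standard example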

import ssl

from urllib.request import Request
from urllib.request import urlopen

# Skips certificate verification; acceptable for a demo only
context = ssl._create_unverified_context()

# HTTP request
request = Request(url="http://jxdxsw.com",
                  method="GET",
                  headers={"Host": "jxdxsw.com"},
                  data=None)
# HTTP response (the original passed context=content, an undefined name at this point)
response = urlopen(request, context=context)
headers = response.info()   # response headers
content = response.read()   # response body
code = response.getcode()   # status code

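The request body `data` is empty here because nothing is submitted to the server, so you can also leave it out.
urlopen connects to the target server, sends the HTTP request, and returns a response object (an http.client.HTTPResponse for HTTP URLs)
that exposes the response headers, the response body, and the status code.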
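Before anything is sent, a Request object can be inspected locally, which makes the parts of an HTTP request (method, URL, headers, body) concrete. The sketch below does no network I/O; the User-Agent value is an arbitrary placeholder.

```python
from urllib.request import Request

# Build the request without sending it
request = Request(
    url="http://jxdxsw.com/",
    method="GET",
    headers={"User-Agent": "Mozilla/5.0"},  # placeholder value
)

print(request.get_method())      # HTTP method that will be used
print(request.full_url)          # URL exactly as passed in
print(request.host)              # host parsed out of the URL
print(request.header_items())    # headers as (name, value) pairs
print(request.data)              # request body; None means nothing is submitted
```

Note that urllib normalizes header names with `str.capitalize()`, so the header is stored as `User-agent`.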

requests

Python's built-in urllib module is fairly low-level and takes a lot of code, so for a simple crawler consider Requests.

http://python-requests.org

quickstart

Install requests

pip3 install requests

GET request

>>> import requests
>>> r = requests.get("https://httpbin.org/ip")
>>> r  # the response object
<Response [200]>
>>> r.status_code  # response status code
200

>>> r.content  # response body as bytes
b'{\n  "origin": "183.237.232.123"\n}\n'

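POST request

>>> r = requests.post('http://httpbin.org/post', data = {'key':'value'})

Custom request headers

Anti-crawler checks on the server often inspect the User-Agent request header to decide whether a request comes from a real browser, so with Requests we usually set a UA that makes the request look like it came from a browser.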

>>> url = 'https://httpbin.org/headers'
>>> headers = {'user-agent': 'Mozilla/5.0'}
>>> r = requests.get(url, headers=headers)

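Passing parameters

URLs often carry a long query string. To keep the code readable, requests lets you pass those values separately through the params argument instead of appending them to the URL by hand. For example, to request http://httpbin.org/get?key=val: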

>>> url = "http://httpbin.org/get"
>>> r = requests.get(url, params={"key":"val"})
>>> r.url
'http://httpbin.org/get?key=val'
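What happens to such parameters can be reproduced with the standard library alone. The sketch below uses `urllib.parse` rather than requests itself, so it only illustrates the idea of building a query string from a dict; the second parameter `q` is a made-up example.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

base_url = "http://httpbin.org/get"
params = {"key": "val", "q": "web scraping"}  # "q" is a hypothetical extra parameter

# Encode the dict as a query string and append it to the base URL
full_url = base_url + "?" + urlencode(params)
print(full_url)

# Going the other way: recover the parameters from a URL
query = parse_qs(urlsplit(full_url).query)
print(query)
```

Keeping parameters in a dict also means values with spaces or non-ASCII characters are escaped for you.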

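Setting cookies

Cookies are the credentials a web browser uses to stay logged in to a site. Although cookies are sent as part of the request headers, requests lets you pull them out and pass them through the cookies argument.

>>> s = requests.get('http://httpbin.org/cookies', cookies={'from-my': 'browser'})
>>> s.text
'{\n  "cookies": {\n    "from-my": "browser"\n  }\n}\n'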

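Setting a timeout

When a server responds very slowly and you don't want to wait indefinitely, pass timeout to set a limit in seconds. If the connection cannot be established, or the server sends no data, within that time, the request is aborted with a requests.exceptions.Timeout exception.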

r = requests.get('https://google.com', timeout=5)

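Setting a proxy

Sending too many requests in a short period makes it easy for the server to flag you as a crawler, so a proxy IP is often used to hide the client's real IP.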

import requests

proxies = {
    'http': 'http://127.0.0.1:1080',
    'https': 'http://127.0.0.1:1080',
}

r = requests.get('http://www.kuaidaili.com/free/', proxies=proxies, timeout=2)

Session

To keep a logged-in (session) state with the server without passing cookies on every request, use a session. A Session object exposes the same API as the requests module.

import requests

s = requests.Session()
s.cookies = requests.utils.cookiejar_from_dict({"a":"c"})

r = s.get('http://httpbin.org/cookies')
print(r.text)
# '{"cookies": {"a": "c"}}'

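Example

Use Requests to build a simple crawler that fetches the follower list of a Zhihu column.
Take the column 一塊兒學習爬蟲 (Learning Crawlers Together) as the example and open its follower list.
Use Chrome to find the request URL that loads the follower list:
https://zhuanlan.zhihu.com/ap...

Then use Requests to imitate the browser and send that request to the server: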

import json

import requests


class SimpleCrawler:
    init_url = "https://zhuanlan.zhihu.com/api/columns/pythoneer/followers"
    page_size = 20
    total = 7000  # total number of followers

    def crawl(self):
        # A User-Agent is required; otherwise the Zhihu server treats the request as invalid
        headers = {
            "Host": "zhuanlan.zhihu.com",
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                          "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36",
        }
        # Page through the list with limit/offset in a plain loop
        # (the original mixed a while loop with a recursive call)
        for offset in range(0, self.total, self.page_size):
            params = {"limit": self.page_size, "offset": offset}
            response = requests.get(self.init_url, headers=headers, params=params, timeout=10)
            print(response.url)
            self.parse(response.json())

    def parse(self, data):
        # Append each follower to the file as one JSON object per line
        with open("followers.json", "a", encoding="utf-8") as f:
            for item in data:
                f.write(json.dumps(item, ensure_ascii=False))
                f.write('\n')

if __name__ == '__main__':
    SimpleCrawler().crawl()

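This is about the simplest single-threaded crawler for a Zhihu column's follower list built on Requests. Requests is very flexible: headers, query parameters, and cookies can all be passed straight to the request method, and when the response body is JSON, calling json() on the response returns the decoded Python object.

See the python-requests documentation.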

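BeautifulSoup

BeautifulSoup is not part of the Python standard library and has to be installed separately.
See its installation guide for details.
鯨魚 uses Ubuntu, so the following commands are enough:

sudo apt-get install python-bs4
sudo apt-get install python3-pip  # install the Python package manager
pip3 install beautifulsoup4

Parsing a page with BeautifulSoup gives you a BeautifulSoup object, which can print the document with standard indentation: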

from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://jxdxsw.com/")
bsobj = BeautifulSoup(html.read(), "html.parser")  # name the parser explicitly to avoid a warning
print(bsobj.prettify())

print("-----------------------------我是分割線---------------------------")
print(bsobj.title)

print("-----------------------------我是分割線---------------------------")
print(bsobj.find_all('a'))

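Exception handling

html = urlopen("http://jxdxsw.com/")

Two kinds of failure are likely on this line:

The page does not exist on the server (or an error occurred while retrieving it)
The server does not exist

In the first case the server returns an HTTP error such as "404 Page Not Found" or "500 Internal Server Error". In all such cases urlopen raises an HTTPError, which can be handled like this:

from urllib.error import HTTPError
from urllib.request import urlopen

try:
    html = urlopen("http://jxdxsw.com/")
except HTTPError as e:
    print(e)
    # return None, stop the program, or fall back to another plan
else:
    # The program continues. If the except block above already returned or
    # broke out, the else clause is unnecessary and this code never runs.
    print(html.getcode())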

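In the second case the server cannot be reached at all (http://jxdxsw.com/ is down, or the URL is mistyped). urlopen does not return None here; it raises a URLError, the parent class of HTTPError, so catch it as well:

from urllib.error import URLError
from urllib.request import urlopen

try:
    html = urlopen("http://jxdxsw.com/")
except URLError as e:
    print("The server could not be reached:", e.reason)
else:
    # the program continues
    print(html.getcode())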
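Both failure modes can be wrapped in one helper so that calling code only has to check for None. This is a minimal sketch, not a complete solution: the function name `fetch` and its `timeout` default are assumptions for illustration, and other errors (for example a read that times out midway) are not handled.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def fetch(url, timeout=10):
    """Return the response body as bytes, or None if the request failed."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.read()
    except HTTPError as e:
        # The server answered, but with an error status such as 404 or 500
        print("HTTP error:", e.code)
    except URLError as e:
        # No usable answer at all: unknown host, refused connection, missing resource
        print("Could not reach the server:", e.reason)
    return None


# Usage sketch (performs a real request, so it is left commented out):
# body = fetch("http://jxdxsw.com/")
# if body is None:
#     print("URL is not found")
```

HTTPError has to be caught before URLError because it is a subclass of URLError; with the order reversed, the first handler would swallow both.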

Example 2

Let's build a crawler for http://www.pythonscraping.com...
On this page, the characters' dialogue is shown in red and the characters' names in green.
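The source does not include the code for this example, so here is a hedged sketch of the idea. It assumes the names are wrapped in `<span class="green">` and the dialogue in `<span class="red">`, uses a made-up HTML fragment instead of the real page, and relies on the standard library's `HTMLParser` so that it runs without extra packages. With BeautifulSoup the same selection would be `soup.find_all("span", {"class": "green"})`.

```python
from html.parser import HTMLParser

# Made-up fragment that mimics the assumed markup of the page
SAMPLE = (
    '<p><span class="red">Good evening, and welcome.</span> said '
    '<span class="green">Anna Pavlovna</span> to '
    '<span class="green">Prince Vasili</span>.</p>'
)


class SpanCollector(HTMLParser):
    """Collects the text of every <span> whose class equals css_class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.found = []
        self._buffer = None  # list of text chunks while inside a matching span

    def handle_starttag(self, tag, attrs):
        if tag == "span" and dict(attrs).get("class") == self.css_class:
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._buffer is not None:
            self.found.append("".join(self._buffer).strip())
            self._buffer = None


collector = SpanCollector("green")
collector.feed(SAMPLE)
names = collector.found
print(names)
```

Passing `"red"` instead of `"green"` collects the dialogue in the same way.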

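Example 3

Scraping Tripadvisor with Requests + BeautifulSoup

How the server and the local client exchange data --> the basic principle of a crawler
Methods and approach for parsing a real web page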

from bs4 import BeautifulSoup
import requests

url = 'https://www.tripadvisor.cn/Attractions-g294220-Activities-Nanjing_Jiangsu.html'
urls = ['https://www.tripadvisor.cn/Attractions-g294220-Activities-oa{}-Nanjing_Jiangsu.html#FILTERED_LIST'.format(str(i)) for i in range(30,800,30)]

def get_attraction(url, data=None):
    wb_data = requests.get(url)
    soup = BeautifulSoup(wb_data.text, 'html.parser')
    # print(soup)
    # With older BeautifulSoup versions, CSS selectors that pick child elements must use nth-of-type instead of nth-child
    #titles = soup.select('#taplc_attraction_coverpage_attraction_0 > div:nth-of-type(1) > div > div > div.shelf_item_container > div:nth-of-type(1) > div.poi > div > div.item.name > a')
    titles = soup.select('a.poiTitle')
    # imgs = soup.select('img.photo_image')
    imgs = soup.select('img[width="200"]')

    # 把信息轉入字典
    for title, img in zip(titles,imgs):
        data = {
            'title': title.get_text(),
            'img': img.get('src'),
        }
        print(data)


for single_url in urls:
    get_attraction(single_url)

{'title': '夫子廟景區', 'img': 'https://cc.ddcdn.com/img2/x.gif'}
{'title': '南京夫子廟休閒街', 'img': 'https://cc.ddcdn.com/img2/x.gif'}
{'title': '南京1912街區', 'img': 'https://cc.ddcdn.com/img2/x.gif'}
{'title': '棲霞寺', 'img': 'https://cc.ddcdn.com/img2/x.gif'}
{'title': '夫子廟大成殿', 'img': 'https://cc.ddcdn.com/img2/x.gif'}
{'title': '南京毗盧寺', 'img': 'https://cc.ddcdn.com/img2/x.gif'}

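Attentive readers will notice that every image address is the same URL. The real image addresses are not in the page's initial DOM; they are injected later by JavaScript. This also works as an anti-scraping measure. One way around it:

Scrape the mobile version of the site (as long as its anti-scraping is not strict)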

from bs4 import BeautifulSoup
import requests
headers = {
    'User-Agent':'Mozilla/5.0 (iPhone; CPU iPhone OS 10_3 like Mac OS X) AppleWebKit/602.1.50 (KHTML, like Gecko) CriOS/56.0.2924.75 Mobile/14E5239e Safari/602.1'
}
url = 'https://www.tripadvisor.cn/Attractions-g294220-Activities-Nanjing_Jiangsu.html'
mb_data = requests.get(url,headers=headers)
soup = BeautifulSoup(mb_data.text,'html.parser')
imgs = soup.select('div.thumb.thumbLLR.soThumb > img')
for img in imgs:
    print(img.get('src'))

Example 4

Fetching asynchronously loaded data

from bs4 import BeautifulSoup
import requests
import time

url = 'https://knewone.com/discover?page='

def get_page(url, data=None):
    wb_data = requests.get(url)
    soup = BeautifulSoup(wb_data.text, 'html.parser')
    imgs = soup.select('a.cover-inner > img')
    titles = soup.select('section.content > h4 > a')
    links = soup.select('section.content > h4 > a')

    for img, title, link in zip(imgs, titles, links):
        data = {
            'img': img.get('src'),
            'title': title.get('title'),
            'link': link.get('href')
        }
        print(data)

def get_more_pages(start, end):
    for one in range(start, end):
        get_page(url+ str(one))
        time.sleep(2)

get_more_pages(1,10)

scrapy

http://scrapy-chs.readthedocs...

pip install Scrapy
scrapy startproject tutorial

Learning examples
https://github.com/scrapy/quo...

Regular expressions

import re
line = 'jwxddxsw33'
if line == "jxdxsw33":
    print("yep")
else:
    print("no")

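# ^ anchors the match at the start of the string
regex_str = "^j.*"
if re.match(regex_str, line):
    print("yes")
# $ anchors the match at the end of the string
regex_str1 = "^j.*3$"
if re.match(regex_str1, line):
    print("yes")

# "." matches exactly one character, so this pattern does not match line
regex_str1 = "^j.3$"
if re.match(regex_str1, line):
    print("yes")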
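# Greedy matching: the leading .* takes as much as it can, so the group starts at the last d
regex_str2 = ".*(d.*w).*"
match_obj = re.match(regex_str2, line)
if match_obj:
    print(match_obj.group(1))
# Non-greedy matching
# The ? makes the leading .* stop at the first d
regex_str3 = ".*?(d.*w).*"
match_obj = re.match(regex_str3, line)
if match_obj:
    print(match_obj.group(1))
# * means 0 or more times; + means 1 or more times
# ? after a quantifier switches it to non-greedy mode
# .+ therefore means any character, repeated at least once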
line1 = 'jxxxxxxdxsssssswwwwjjjww123'
regex_str3 = ".*(w.+w).*"
match_obj = re.match(regex_str3, line1)
if match_obj:
    print(match_obj.group(1))
# {2} fixes how many times the preceding item repeats; {2,} means 2 or more; {2,5} means at least 2 and at most 5
line2 = 'jxxxxxxdxsssssswwaawwjjjww123'
regex_str3 = ".*(w.{3}w).*"
match_obj = re.match(regex_str3, line2)
if match_obj:
    print(match_obj.group(1))

line2 = 'jxxxxxxdxsssssswwaawwjjjww123'
regex_str3 = ".*(w.{2}w).*"
match_obj = re.match(regex_str3, line2)
if match_obj:
    print(match_obj.group(1))

line2 = 'jxxxxxxdxsssssswbwaawwjjjww123'
regex_str3 = ".*(w.{5,}w).*"
match_obj = re.match(regex_str3, line2)
if match_obj:
    print(match_obj.group(1))

# | means "or"

line3 = 'jx123'
regex_str4 = "((jx|jxjx)123)"
match_obj = re.match(regex_str4, line3)
if match_obj:
    print(match_obj.group(1))
    print(match_obj.group(2))
# [] matches any one of the characters inside the brackets
line4 = 'ixdxsw123'
regex_str4 = "([hijk]xdxsw123)"
match_obj = re.match(regex_str4, line4)
if match_obj:
    print(match_obj.group(1))
# [0-9]{9}: any digit from 0 to 9, repeated 9 times (9 digits)
line5 = '15955224326'
regex_str5 = "(1[234567][0-9]{9})"
match_obj = re.match(regex_str5, line5)
if match_obj:
    print(match_obj.group(1))
# [^1]{9}: any character except 1, repeated 9 times
line6 = '15955224326'
regex_str6 = "(1[234567][^1]{9})"
match_obj = re.match(regex_str6, line6)
if match_obj:
    print(match_obj.group(1))

# [.*]: inside brackets, . and * are literal characters
line7 = '1.*59224326'
regex_str7 = "(1[.*][^1]{9})"
match_obj = re.match(regex_str7, line7)
if match_obj:
    print(match_obj.group(1))

# \s matches a whitespace character
line8 = '你 好'
regex_str8 = "(你\s好)"
match_obj = re.match(regex_str8, line8)
if match_obj:
    print(match_obj.group(1))

# \S matches any non-whitespace character
line9 = '你真好'
regex_str9 = "(你\S好)"
match_obj = re.match(regex_str9, line9)
if match_obj:
    print(match_obj.group(1))

# \w matches a word character: unlike ., it means [A-Za-z0-9_] (plus other Unicode letters and digits for str patterns in Python 3)
line9 = '你adsfs好'
regex_str9 = "(你\w\w\w\w\w好)"
match_obj = re.match(regex_str9, line9)
if match_obj:
    print(match_obj.group(1))

line10 = '你adsf_好'
regex_str10 = "(你\w\w\w\w\w好)"
match_obj = re.match(regex_str10, line10)
if match_obj:
    print(match_obj.group(1))
# \W (uppercase) matches any character that is not a word character
line11 = '你 好'
regex_str11 = "(你\W好)"
match_obj = re.match(regex_str11, line11)
if match_obj:
    print(match_obj.group(1))

# Unicode range [\u4E00-\u9FA5] matches Chinese characters
line12= "鏡心的小樹屋"
regex_str12= "([\u4E00-\u9FA5]+)"
match_obj = re.match(regex_str12,line12)
if match_obj:
    print(match_obj.group(1))

print("-----貪婪匹配狀況----")
line13 = 'reading in 鏡心的小樹屋'
regex_str13 = ".*([\u4E00-\u9FA5]+樹屋)"
match_obj = re.match(regex_str13, line13)
if match_obj:
    print(match_obj.group(1))

print("----取消貪婪匹配狀況----")
line13 = 'reading in 鏡心的小樹屋'
regex_str13 = ".*?([\u4E00-\u9FA5]+樹屋)"
match_obj = re.match(regex_str13, line13)
if match_obj:
    print(match_obj.group(1))

# \d matches a digit
line14 = 'XXX出生於2011年'
regex_str14 = ".*(\d{4})年"
match_obj = re.match(regex_str14, line14)
if match_obj:
    print(match_obj.group(1))

regex_str15 = ".*?(\d+)年"
match_obj = re.match(regex_str15, line14)
if match_obj:
    print(match_obj.group(1))
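The difference between greedy and non-greedy prefixes matters most when a number is being captured. The sketch below uses a made-up string of the kind a "favorites" counter might contain; with a greedy `.*` in front, `(\d+)` is left with only the last digit.

```python
import re

text = "收藏 128 次"  # hypothetical counter text

# Greedy: .* swallows as much as possible and gives back only one digit
greedy = re.match(r".*(\d+)", text)
# Non-greedy: .*? stops as early as possible, so (\d+) gets the whole number
lazy = re.match(r".*?(\d+)", text)

print(greedy.group(1))  # 8
print(lazy.group(1))    # 128
```

Raw strings (`r"..."`) are used here so that `\d` reaches the regex engine without Python treating the backslash as a string escape.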

Example 1

Matching several ways of writing a birth date
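The source gives no code for this heading, so the following is only a sketch. The accepted formats (`2001年6月1日`, `2001/6/1`, `2001-06-01`, `2001年6月`, `2001-06`) and the names `BIRTHDAY_RE` and `parse_birthday` are assumptions chosen for illustration.

```python
import re

# Year, then a separator (年, / or -), then month, then an optional day part
BIRTHDAY_RE = re.compile(r"(\d{4})[年/-](\d{1,2})(?:[月/-](\d{1,2})日?)?")


def parse_birthday(text):
    """Return (year, month, day) found in text; day is None when absent."""
    m = BIRTHDAY_RE.search(text)
    if not m:
        return None
    year, month, day = m.groups()
    return (int(year), int(month), int(day) if day else None)


samples = [
    "XXX出生於2001年6月1日",
    "XXX出生於2001/6/1",
    "XXX出生於2001-06-01",
    "XXX出生於2001年6月",
    "XXX出生於2001-06",
]
for s in samples:
    print(s, "=>", parse_birthday(s))
```

The pattern does not check that the month and day are valid calendar values; that would need a separate step such as `datetime.date(year, month, day)`.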

Matching email addresses

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

###
# Try writing a regular expression that validates email addresses. Version 1 should accept addresses like:
# someone@gmail.com
# bill.gates@microsoft.com
###

import re
addr = 'someone@gmail.com'
addr2 = 'bill.gates@microsoft.com'
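def is_valid_email(addr):
    # Local part and domain must each be non-empty, and the whole string must match
    # (the original class [a-aA-Z.] was a typo for [a-zA-Z.])
    return re.match(r'^[a-zA-Z_.]+@[a-zA-Z.]+$', addr) is not None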

print(is_valid_email(addr))
print(is_valid_email(addr2))

# Version 2 extracts the name from an address that carries one:
# <Tom Paris> tom@voyager.org => Tom Paris
# bob@example.com => bob

addr3 = '<Tom Paris> tom@voyager.org'
addr4 = 'bob@example.com'

def name_of_email(addr):
    r=re.compile(r'^(<?)([\w\s]*)(>?)([\w\s]*)@([\w.]*)$')
    if not r.match(addr):
        return None
    else:
        m = r.match(addr)
        return m.group(2)

print(name_of_email(addr3))
print(name_of_email(addr4))

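Depth-first and breadth-first traversal

Depth-first (implemented with recursion): follow one path all the way down, then backtrack.
Breadth-first (implemented with a queue): traverse level by level, visiting all children before any grandchildren.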
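Both traversal orders can be sketched on a small link tree. The structure `TREE` below is a made-up example in which each page maps to the pages it links to; the function names are likewise illustrative.

```python
from collections import deque

# Hypothetical site structure: page -> pages it links to
TREE = {
    "A": ["B", "C"],
    "B": ["D", "E"],
    "C": ["F"],
    "D": [],
    "E": [],
    "F": [],
}


def depth_first(node, visited=None):
    """Recursive depth-first traversal: go as deep as possible, then backtrack."""
    if visited is None:
        visited = []
    visited.append(node)
    for child in TREE[node]:
        if child not in visited:
            depth_first(child, visited)
    return visited


def breadth_first(root):
    """Queue-based breadth-first traversal: finish one level before the next."""
    visited = [root]
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for child in TREE[node]:
            if child not in visited:
                visited.append(child)
                queue.append(child)
    return visited


print(depth_first("A"))    # ['A', 'B', 'D', 'E', 'C', 'F']
print(breadth_first("A"))  # ['A', 'B', 'C', 'D', 'E', 'F']
```

The `visited` check is what keeps a crawler from looping forever when pages link back to each other.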

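--> For these basic algorithms, see 鯨魚's earlier articles:
Data Structures and Algorithms: Binary Tree Algorithms
Data Structures and Algorithms: Graphs and Graph Algorithms (1)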

Common URL deduplication strategies
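The source lists no strategies under this heading. One simple approach, sketched here as an assumption rather than the author's method, is to keep a set of fixed-length MD5 fingerprints of the URLs already seen, which uses less memory than storing long URLs themselves. The helper name `get_md5` mirrors the commented-out import in the spider code later in this article, but its body is a guess; `SeenUrls` is a hypothetical class.

```python
import hashlib


def get_md5(url):
    """Return the 32-character hex MD5 digest of a URL."""
    if isinstance(url, str):
        url = url.encode("utf-8")
    return hashlib.md5(url).hexdigest()


class SeenUrls:
    """Remembers URL fingerprints so each URL is crawled only once."""

    def __init__(self):
        self._fingerprints = set()

    def add_if_new(self, url):
        """Record the URL and return True if it has not been seen before."""
        fingerprint = get_md5(url)
        if fingerprint in self._fingerprints:
            return False
        self._fingerprints.add(fingerprint)
        return True


seen = SeenUrls()
for url in ["http://blog.jobbole.com/110287/",
            "http://blog.jobbole.com/all-posts/",
            "http://blog.jobbole.com/110287/"]:
    print(url, "new" if seen.add_if_new(url) else "duplicate")
```

For very large crawls, a Bloom filter trades a small false-positive rate for much lower memory use than a set.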

Example 1

Scraping Duitang images with scrapy

scrapy startproject duitang

This generates a project folder:

.
├── duitang // the project's Python module; your code goes here
│   ├── __init__.py
│   ├── items.py // the project's items file
│   ├── middlewares.py
│   ├── pipelines.py // the project's pipelines file
│   ├── __pycache__
│   ├── settings.py // the project's settings file
│   └── spiders // directory holding the spider code
│       ├── __init__.py
│       └── __pycache__
└── scrapy.cfg // the project's configuration file

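Next, create the spider, the file that holds the actual scraping logic. scrapy ships a command-line tool for this (scrapy genspider); cd into the generated project folder and run it there.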

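Example 2

Scrapy crawler framework tutorial (2) -- scraping the Douban Movie Top 250
豆瓣美女 (Douban Meinv)

Example 3

Using Scrapy to crawl detailed profiles of all Zhihu users and store them in MongoDB (with video and source code)

Example 4

Building a search engine with a distributed Scrapy crawler (2): crawling all articles from 伯樂在線 (Jobbole)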

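Debugging

Debugging Spiders

You can use

scrapy shell  url

to debug.

extract(): serializes the matched nodes to unicode strings and returns them as a list.

Note the use of contains here.

So the spider under spiders/ can be written like this: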

# //ArticleSpider/ArticleSpider/spiders/jobbole.py
# -*- coding: utf-8 -*-
import re
import scrapy



class JobboleSpider(scrapy.Spider):
    name = 'jobbole'
    allowed_domains = ['blog.jobbole.com']
    start_urls = ['http://blog.jobbole.com/110287/']

    def parse(self, response):
        #提取文章的具體字段((xpath方式實現))
        title = response.xpath("//div[@class='entry-header']/h1/text()").extract_first("")
        create_date = response.xpath("//p[@class='entry-meta-hide-on-mobile']/text()").extract()[0].strip().replace(".","").strip()
        praise_nums = response.xpath("//span[contains(@class,'vote-post-up')]/h10/text()").extract()[0]

        fav_nums = response.xpath("//span[contains(@class,'bookmark-btn')]/text()").extract()[0]
        # Non-greedy .*? so that (\d+) captures the whole number, not just its last digit
        match_re = re.match(r".*?(\d+).*", fav_nums)
        if match_re:
            fav_nums = match_re.group(1)

        comment_nums = response.xpath("//a[@href='#article-comment']/span/text()").extract()[0]
        match_re = re.match(r".*?(\d+).*", comment_nums)
        if match_re:
            comment_nums = match_re.group(1)

        content = response.xpath("//div[@class='entry']").extract()[0]

        tag_list= response.xpath("//p[@class='entry-meta-hide-on-mobile']/a/text()").extract()
        # 去掉以評論結尾的字段
        tag_list = [element for element in tag_list if not element.strip().endswith("評論")]
        tags = ",".join(tag_list)
        print(tags)#職場,面試
        # print(create_date)
        pass

Run the spider and debug it:

scrapy crawl jobbole

Extracting the next-page URL

# -*- coding: utf-8 -*-
import re
import scrapy
from scrapy.http import Request
from urllib import parse

class JobboleSpider(scrapy.Spider):
    name = 'jobbole'
    allowed_domains = ['blog.jobbole.com']
    start_urls = ['http://blog.jobbole.com/all-posts/']

    def parse(self, response):
        """
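        1. Collect the article URLs on the list page and hand them to scrapy to download, then parse the fields in the callback
        2. Get the next-page URL and hand it to scrapy to download; once downloaded it is passed back to parse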
        """

        #解析列表頁中的全部url 並交給scrapy下載後進行解析
        post_nodes = response.css("#archive .floated-thumb .post-thumb a")
        for post_node in post_nodes:
            # 獲取封面圖url
            # response.url + post_node
            # image_url = post_node.css("img::attr(src)").extract_first("")
            post_url = post_node.css("::attr(href)").extract_first("")
            url = parse.urljoin(response.url, post_url)
            request = Request(url, callback= self.parse_detail)
            yield request

        #提取下一頁並交給scrapy進行下載
        next_url = response.css(".next.page-numbers::attr(href)").extract_first()
        if next_url:
            yield Request(url=parse.urljoin(response.url, next_url), callback=self.parse)
    def parse_detail(self, response):
        print("--------")
        #提取文章的具體字段((xpath方式實現))
        title = response.xpath("//div[@class='entry-header']/h1/text()").extract_first("")
        create_date = response.xpath("//p[@class='entry-meta-hide-on-mobile']/text()").extract()[0].strip().replace(".","").strip()
        praise_nums = response.xpath("//span[contains(@class,'vote-post-up')]/h10/text()").extract()[0]

        fav_nums = response.xpath("//span[contains(@class,'bookmark-btn')]/text()").extract()[0]
        # Non-greedy .*? so that (\d+) captures the whole number, not just its last digit
        match_re = re.match(r".*?(\d+).*", fav_nums)
        if match_re:
            fav_nums = match_re.group(1)

        comment_nums = response.xpath("//a[@href='#article-comment']/span/text()").extract()[0]
        match_re = re.match(r".*?(\d+).*", comment_nums)
        if match_re:
            comment_nums = match_re.group(1)

        content = response.xpath("//div[@class='entry']").extract()[0]

        tag_list= response.xpath("//p[@class='entry-meta-hide-on-mobile']/a/text()").extract()
        # 去掉以評論結尾的字段
        tag_list = [element for element in tag_list if not element.strip().endswith("評論")]
        tags = ",".join(tag_list)
        print(tags)#職場,面試
        # print(create_date)
        pass

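Set a breakpoint at the end of the code and debug: all the target values have been extracted.

Configuring items.py

Items give the extracted data a structured, serializable form.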

#//ArticleSpider/ArticleSpider/items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class JobBoleArticleItem(scrapy.Item):  # name must match the import in the spider
    # define the fields for your item here like:
    # name = scrapy.Field()

    title = scrapy.Field()
    create_date = scrapy.Field()  # the spider assigns article_item["create_date"]
    url = scrapy.Field()
    url_object_id = scrapy.Field()
    front_image_url = scrapy.Field()
    front_image_path = scrapy.Field()
    praise_nums = scrapy.Field()
    comment_nums = scrapy.Field()
    fav_nums = scrapy.Field()
    tags = scrapy.Field()
    content = scrapy.Field()

Instantiating the item and filling in values

# -*- coding: utf-8 -*-
import re
import scrapy
from scrapy.http import Request
from urllib import parse

from ArticleSpider.items import JobBoleArticleItem
# from ArticleSpider.utils.common import get_md5

class JobboleSpider(scrapy.Spider):
    name = 'jobbole'
    allowed_domains = ['blog.jobbole.com']
    start_urls = ['http://blog.jobbole.com/all-posts/']

    def parse(self, response):
        """
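        1. Collect the article URLs on the list page and hand them to scrapy to download, then parse the fields in the callback
        2. Get the next-page URL and hand it to scrapy to download; once downloaded it is passed back to parse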
        """

        #解析列表頁中的全部url 並交給scrapy下載後進行解析
        post_nodes = response.css("#archive .floated-thumb .post-thumb a")
        for post_node in post_nodes:
            # 獲取封面圖url
            image_url = post_node.css("img::attr(src)").extract_first("")
            post_url = post_node.css("::attr(href)").extract_first("")
            url = parse.urljoin(response.url, post_url)
            # post_url 是咱們每一頁的具體的文章url。
            # 下面這個request是文章詳情頁面. 使用回調函數每下載完一篇就callback進行這一篇的具體解析。
            # 咱們如今獲取到的是完整的地址能夠直接進行調用。若是不是完整地址: 根據response.url + post_url
            # def urljoin(base, url)完成url的拼接
            request = Request(url,meta={"front_image_url": image_url}, callback= self.parse_detail)
            yield request

        #提取下一頁並交給scrapy進行下載
        next_url = response.css(".next.page-numbers::attr(href)").extract_first()
        if next_url:
            yield Request(url=parse.urljoin(response.url, next_url), callback=self.parse)

    def parse_detail(self, response):
        # 實例化item
        article_item = JobBoleArticleItem()

        print("經過item loader 加載item")
        # 經過item loader 加載item
        front_image_url = response.meta.get("front_image_url","") #文章封面圖


        #提取文章的具體字段((xpath方式實現))
        title = response.xpath("//div[@class='entry-header']/h1/text()").extract_first("")
        create_date = response.xpath("//p[@class='entry-meta-hide-on-mobile']/text()").extract()[0].strip().replace(".","").strip()
        praise_nums = response.xpath("//span[contains(@class,'vote-post-up')]/h10/text()").extract()[0]

        fav_nums = response.xpath("//span[contains(@class,'bookmark-btn')]/text()").extract()[0]
        # Non-greedy .*? so that (\d+) captures the whole number, not just its last digit
        match_re = re.match(r".*?(\d+).*", fav_nums)
        if match_re:
            fav_nums = int(match_re.group(1))
        else:
            fav_nums = 0
        comment_nums = response.xpath("//a[@href='#article-comment']/span/text()").extract()[0]
        match_re = re.match(r".*?(\d+).*", comment_nums)
        if match_re:
            comment_nums = int(match_re.group(1))
        else:
            comment_nums = 0
        content = response.xpath("//div[@class='entry']").extract()[0]

        tag_list= response.xpath("//p[@class='entry-meta-hide-on-mobile']/a/text()").extract()
        # 去掉以評論結尾的字段
        tag_list = [element for element in tag_list if not element.strip().endswith("評論")]
        tags = ",".join(tag_list)

        # 爲實例化後的對象填充值
        # article_item["url_object_id"] = get_md5(response.url)
        article_item["title"] = title
        article_item["url"] = response.url
        article_item["create_date"] = create_date
        article_item["front_image_url"] = [front_image_url]
        article_item["praise_nums"] = praise_nums
        article_item["comment_nums"] = comment_nums
        article_item["fav_nums"] = fav_nums
        article_item["tags"] = tags
        article_item["content"] = content
        #print(tags)#職場,面試

        ## 已經填充好了值調用yield傳輸至pipeline
        yield article_item

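items.py effectively defines how the data is serialized. For items to reach a pipeline, the pipeline has to be registered in settings.py; pipelines mainly handle data storage.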

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'ArticleSpider.pipelines.ArticlespiderPipeline': 300,
   
}

Set a couple of breakpoints in pipelines.py and debug: the values in item are exactly what we extracted earlier and want to store.

Configuring pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import codecs
import json
from scrapy.pipelines.images import ImagesPipeline
from scrapy.exporters import JsonItemExporter

class ArticlespiderPipeline(object):
    def process_item(self, item, spider):
        return item

class JsonWithEncodingPipeline(object):
    # Custom export to a JSON file
    def __init__(self):
        self.file = codecs.open('article.json', 'w', encoding="utf-8")
    def process_item(self, item, spider):
        lines = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(lines)
        return item
    def close_spider(self, spider):
        # Scrapy calls close_spider on item pipelines when the spider finishes
        self.file.close()

class JsonExporterPipeline(object):
    # Export a JSON file with scrapy's built-in JsonItemExporter
    def __init__(self):
        self.file = open('articleexport.json', 'wb')
        self.exporter = JsonItemExporter(self.file, encoding="utf-8", ensure_ascii=False)
        self.exporter.start_exporting()
    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.file.close()
    def process_item(self, item, spider):
        self.exporter.export_item(item=item)
        return item
# Image-handling pipeline
class ArticleImagePipeline(ImagesPipeline):
    def item_completed(self, results, item, info):
        # results is a list of (success, value) tuples; value has a "path" key only on success
        for ok, value in results:
            if ok:
                item["front_image_path"] = value["path"]
        return item

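Storing data in MySQL

Designing the table

# Required on Ubuntu; otherwise the install below reports an error
sudo apt-get install libmysqlclient-dev
# Required on CentOS; otherwise the install below reports an error
sudo yum install python-devel mysql-devel
pip3 install -i https://pypi.douban.com/simple/ mysqlclient

You may also hit this problem during installation:
Fix: a single command resolves "mysql_config not found"
pipeline.py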

import pymysql
class MysqlPipeline(object):
    def __init__(self):
        # Open a database connection; if the data is UTF-8, specify the charset
        self.conn = pymysql.connect(host='127.0.0.1', user='root', password='wyc2016',
                                    database='article_spider', charset='utf8', use_unicode=True)
        self.cursor = self.conn.cursor()  # get a cursor

    def process_item(self, item, spider):
        insert_sql = """INSERT INTO jobboleArticle(title, url, create_date, fav_nums) VALUES(%s, %s, %s, %s )"""
        try:
            self.cursor.execute(insert_sql, (item["title"], item["url"], item["create_date"], item["fav_nums"]))
            self.conn.commit()
        except Exception as e:
            self.conn.rollback()
        return item

    def close_spider(self, spider):
        # Close the connection once, when the spider finishes, not after every item
        self.conn.close()

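In the author's run only 3 rows were stored. The pipeline above writes synchronously, and the crawler parses pages faster than rows can be inserted into the database, so the writes become a bottleneck.

Let's write it asynchronously instead: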

import pymysql
from twisted.enterprise import adbapi

class MysqlTwistedPipeline(object):
    def __init__(self, dbpool):
        self.dbpool = dbpool

    @classmethod
    def from_settings(cls, settings):
        dbparams = dict(
            host = settings["MYSQL_HOST"],
            database = settings["MYSQL_DBNAME"],
            user = settings["MYSQL_USER"],
            password = settings["MYSQL_PASSWORD"],
            charset = 'utf8',
            cursorclass = pymysql.cursors.DictCursor,
            use_unicode = True
        )
        dbpool = adbapi.ConnectionPool("pymysql", **dbparams)
        return cls(dbpool)

    def process_item(self, item, spider):
        # Use twisted to run the MySQL insert asynchronously
        query = self.dbpool.runInteraction(self.do_insert, item)
        query.addErrback(self.handle_error, item, spider)  # handle errors (the Deferred method is addErrback)
        return item

    def handle_error(self, failure, item, spider):
        # Handle errors raised by the asynchronous insert
        print(failure)

    def do_insert(self, cursor, item):
        # Run the actual insert query
        insert_sql = """INSERT INTO jobboleArticle(title, url, create_date, fav_nums) VALUES(%s, %s, %s, %s )"""
        cursor.execute(insert_sql, (item["title"], item["url"], item["create_date"], item["fav_nums"]))