Crawler - Scrapy (Part 2): Scraping Qiushibaike Jokes, Single Page

1. Modifying the Scrapy settings file

The settings file is settings.py in the project root; change the configuration options below.

a. Set robots.txt compliance to False, otherwise you can hardly scrape anything:

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

b. Set a User-Agent, otherwise most sites cannot be scraped:

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:83.0) Gecko/20100101 Firefox/83.0'

c. Enable the default request headers so the requests look more like a browser's; you can adjust these to match your own browser:

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
   'Accept': 'application/json, text/javascript, */*; q=0.01',
   'Accept-Language': 'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2',
}

d. Set a download delay. This is important: overly frequent requests not only get you banned easily, they can also crash smaller sites, because Scrapy itself is asynchronous and concurrent. Skipping this hurts others and yourself. (See also the AutoThrottle sketch after this list.)

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 3

e. Log level. Set this to your own preference; I set it to WARNING so the output stays clean, which suits the learning stage where we run the program over and over:

LOG_LEVEL= 'WARNING'
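
The comment in item d also points at AutoThrottle, which adapts the delay to server load instead of using a fixed value. These are standard Scrapy settings; the values below are only illustrative:

# Enable AutoThrottle so Scrapy adjusts the delay dynamically
AUTOTHROTTLE_ENABLED = True
# Initial download delay in seconds
AUTOTHROTTLE_START_DELAY = 3
# Maximum delay to back off to when the server responds slowly
AUTOTHROTTLE_MAX_DELAY = 10
# Average number of requests to send in parallel to each remote server
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0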

 

 

2. Scraping and parsing the jokes

First open Qiushibaike in a browser and pick 段子 (text jokes) from the navigation bar; these are plain text, so we will leave images and videos alone for now.

Paste the URL in and test it first; the fetch succeeds.
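
One quick way to test the URL is Scrapy's interactive shell (the 200 below is the expected success status, not captured output):

scrapy shell https://www.qiushibaike.com/text/
>>> response.status
200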

Go back to the browser, view the page source, collapse the nodes, and study the HTML structure; each joke sits in a div whose class contains 'article block', so the joke list is easy to locate.

OK, write an XPath to get the joke list; use the same approach to find the positions of the author and the content, and locate them with XPath.
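
Before writing the spider, you can sanity-check the expressions in the same Scrapy shell session; these are the exact XPaths used in the code below:

>>> response.xpath("//div[contains(@class, 'article block')]")
>>> response.xpath("//div[contains(@class, 'article block')]//div[@class='content']/span/text()").extract()

With the XPaths confirmed, the spider looks like this: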

import scrapy


class Qiubai1Spider(scrapy.Spider):
    name = 'qiubai1'
    allowed_domains = ['qiushibaike.com']
    start_urls = ['https://www.qiushibaike.com/text/']

    def parse(self, response):
        # get the list of joke blocks
        joke_list = response.xpath("//div[contains(@class, 'article block')]")
        for joke in joke_list:
            author = joke.xpath("./div/a[2]/h2/text()").extract_first()
            print(author)
            content = joke.xpath(".//div[@class='content']/span/text()").extract()
            print(content)

Run it and see the result:
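
The spider name is the name attribute of the class, so from the project root:

scrapy crawl qiubai1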

 

You can already see the list of authors and jokes.


3. Collecting the data in an item

Put the scraped results into an item. As Scrapy's generic data container, items both make structured data easier to manage and make it convenient to hand the data to a pipeline for storage and processing. (A minimal pipeline sketch follows.)
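
We don't store anything yet in this article, but as a sketch of the item-to-pipeline hand-off, a minimal pipelines.py that appends each item to a JSON Lines file could look like this (the class name JokesPipeline is made up for illustration):

import json

class JokesPipeline:

    def open_spider(self, spider):
        # called once when the spider starts
        self.file = open('jokes.jl', 'w', encoding='utf-8')

    def close_spider(self, spider):
        # called once when the spider finishes
        self.file.close()

    def process_item(self, item, spider):
        # write one JSON object per line; ensure_ascii=False keeps Chinese readable
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item

To activate it, register it in settings.py (the module path assumes the project is named scpy1, as in the imports below):

ITEM_PIPELINES = {
    'scpy1.pipelines.JokesPipeline': 300,
}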

We directly modify the items.py that was auto-generated when the project was created:

import scrapy


class Scpy1Item(scrapy.Item):

    # one Field per piece of data the spider extracts
    author = scrapy.Field()
    content = scrapy.Field()

 

Then modify the spider code to pass the data to the item:

import scrapy
import re
from scpy1.items import Scpy1Item

class Qiubai1Spider(scrapy.Spider):
    name = 'qiubai1'
    allowed_domains = ['qiushibaike.com']
    start_urls = ['https://www.qiushibaike.com/text/']

    def parse(self, response):
        # get the list of joke blocks
        joke_list = response.xpath("//div[contains(@class, 'article block')]")

        for joke in joke_list:
            # parse the author and the content
            author = joke.xpath("./div/a[2]/h2/text()").extract_first()
            content = joke.xpath(".//div[@class='content']/span/text()").extract()
            # pack the data into the item; extract_first() can return None
            # when the node is missing, so guard before re.sub
            item = Scpy1Item()
            item['author'] = re.sub("[\n]", "", author or "")
            item['content'] = re.sub("[\n]", "", ','.join(content))
            yield item
            # print the item to check the result
            print(item)

 

Run it and see the result:
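
Since the spider now yields items, you can also use Scrapy's built-in feed export to write them straight to a file (jokes.json is just an example filename):

scrapy crawl qiubai1 -o jokes.json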
