1. Modifying the Scrapy settings file
The configuration file is settings.py in the project root; change the following settings.
a. Set robots.txt compliance to False, otherwise you can hardly crawl anything.
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
b. Set a User-Agent, otherwise most sites cannot be crawled.
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:83.0) Gecko/20100101 Firefox/83.0'
c. Enable the default request headers so the requests look more like a browser; you can tweak these to match your own browser.
# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'application/json, text/javascript, */*; q=0.01',
    'Accept-Language': 'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2',
}
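As a side note, these defaults apply to every request, but a single request can still override them through the headers argument of scrapy.Request (a real Scrapy parameter; per-request headers take precedence over DEFAULT_REQUEST_HEADERS). A minimal, hypothetical sketch:

import scrapy


class HeaderDemoSpider(scrapy.Spider):
    # Hypothetical spider, only to illustrate per-request headers
    name = 'header_demo'

    def start_requests(self):
        # Headers passed here win over DEFAULT_REQUEST_HEADERS
        yield scrapy.Request(
            'https://www.qiushibaike.com/text/',
            headers={'Accept': 'text/html'},
            callback=self.parse,
        )

    def parse(self, response):
        self.logger.info('fetched %s', response.url)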
d. Set a download delay. This one is important: overly frequent requests not only get you banned easily, they can also crash smaller sites. Since Scrapy itself is asynchronous and concurrent, skipping this setting hurts both you and the site.
# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 3
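The comment above already points at AutoThrottle. We will not use it in this walkthrough, but if you would rather let Scrapy adapt the delay to the server's responsiveness instead of a fixed 3 seconds, these are the real AutoThrottle settings (the values here are just examples):

# Let Scrapy adjust the delay based on observed latency
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 3           # initial delay in seconds
AUTOTHROTTLE_MAX_DELAY = 10            # upper bound for the delay
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # average concurrent requests per site

Note that by default Scrapy already multiplies DOWNLOAD_DELAY by a random factor between 0.5 and 1.5 (RANDOMIZE_DOWNLOAD_DELAY), which makes the request pattern look less mechanical.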
e. Log level. Set this to your own taste; I set it to WARNING, which keeps the output clean when the program runs. That suits the learning stage, where we run the program frequently.
LOG_LEVEL = 'WARNING'
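If you want the quiet console but still need a full record for debugging, Scrapy can also write the log to a file via the LOG_FILE setting (a real setting; the filename is my own example):

LOG_LEVEL = 'WARNING'
LOG_FILE = 'qiubai.log'  # example name; output then goes to this file instead of the console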
2. Crawling and parsing the jokes
First open Qiushibaike in a browser and pick the 段子 (text jokes) section from the navigation bar. These are plain text, so we will leave images and videos for later.
Paste the URL into the spider and run a quick test; it succeeds.
Back in the browser, view the page source of a joke, collapse the nodes, and study the HTML structure; it is easy to spot where the joke list lives.
OK, write an XPath to fetch the joke list. In the same way, find the positions of the author and the content and locate them with XPath.
import scrapy


class Qiubai1Spider(scrapy.Spider):
    name = 'qiubai1'
    allowed_domains = ['qiushibaike.com']
    start_urls = ['https://www.qiushibaike.com/text/']

    def parse(self, response):
        # Get the joke list
        joke_list = response.xpath("//div[contains(@class, 'article block')]")
        for joke in joke_list:
            author = joke.xpath("./div/a[2]/h2/text()").extract_first()
            print(author)
            content = joke.xpath(".//div[@class='content']/span/text()").extract()
            print(content)
Run it and check the result.
We can already see the authors and the list of jokes.
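If you want to sanity-check such XPath expressions without hitting the site on every run, Scrapy's Selector can parse an HTML string directly. A small sketch, where the sample HTML is made up to mimic the structure we saw in the page source:

from scrapy.selector import Selector

# Made-up HTML that imitates Qiushibaike's joke markup, for offline testing
sample = """
<div class="article block untagged">
  <div class="author clearfix">
    <a href="#"><img src="avatar.jpg"/></a>
    <a href="#"><h2>some_author</h2></a>
  </div>
  <div class="content"><span>A joke used to test the XPath.</span></div>
</div>
"""

sel = Selector(text=sample)
for joke in sel.xpath("//div[contains(@class, 'article block')]"):
    print(joke.xpath("./div/a[2]/h2/text()").extract_first())
    print(joke.xpath(".//div[@class='content']/span/text()").extract())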
3. Receiving the data with Item
Put the crawl results into an Item. As the general-purpose data container of the Scrapy framework, an Item both makes the data easier to manage and structure, and makes it convenient to hand the data to pipelines for storage and processing.
We directly modify the items.py that was generated automatically when the project was created.
import scrapy


class Scpy1Item(scrapy.Item):
    author = scrapy.Field()
    content = scrapy.Field()
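Items behave much like dicts, with one safety net: assigning to a field that was not declared raises a KeyError, which catches typos early. A quick illustration (the behavior is standard Scrapy; the values are made up):

from scpy1.items import Scpy1Item

item = Scpy1Item(author='some_author')
item['content'] = 'a joke'
print(item['author'], item['content'])
# item['tag'] = 'funny'  # would raise KeyError: Scpy1Item does not support field: tag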
Then modify the spider code to pass the data to the item.
import scrapy
import re

from scpy1.items import Scpy1Item


class Qiubai1Spider(scrapy.Spider):
    name = 'qiubai1'
    allowed_domains = ['qiushibaike.com']
    start_urls = ['https://www.qiushibaike.com/text/']

    def parse(self, response):
        # Get the joke list
        joke_list = response.xpath("//div[contains(@class, 'article block')]")
        for joke in joke_list:
            # Parse the author and the content
            author = joke.xpath("./div/a[2]/h2/text()").extract_first()
            content = joke.xpath(".//div[@class='content']/span/text()").extract()
            # Pack the data into the item, stripping newlines
            item = Scpy1Item()
            item['author'] = re.sub("[\n]", "", author)
            item['content'] = re.sub("[\n]", "", ','.join(content))
            yield item
            # Print the item to check the result
            print(item)
Run it and check the result.
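At this point the items are yielded but not stored anywhere. As a preview of the pipeline interaction mentioned at the start of section 3, here is a hedged sketch of a minimal pipeline that appends every item to a JSON Lines file; the class name and the filename are my own, while pipelines.py, the open_spider/process_item/close_spider hooks, and the ITEM_PIPELINES setting are standard Scrapy:

# pipelines.py
import json


class JokeJsonLinesPipeline:
    def open_spider(self, spider):
        self.file = open('jokes.jl', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # dict(item) turns the Scrapy item into a plain serializable dict
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item

    def close_spider(self, spider):
        self.file.close()

# settings.py -- enable the pipeline (the number is its execution order)
# ITEM_PIPELINES = {'scpy1.pipelines.JokeJsonLinesPipeline': 300}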