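Scrapy is a popular web crawling framework. Starting with this post, I will record my whole process of learning Scrapy under Python 3.6, so that I can extend and revisit the notes later.
This post covers installing Scrapy, creating a project, and trying out the basic commands.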
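Install Scrapy with pip. The installation may fail because dependencies are missing; download the required packages one by one as the error messages indicate, and make sure each download matches your system type and Python version.
The packages I installed, in order, were: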
pip install pywin32-223-cp36-cp36m-win32.whl
pip install Twisted-17.9.0-cp36-cp36m-win32.whl
pip install scrapy
Unofficial Windows Binaries for Python Extension Packages: https://www.lfd.uci.edu/~gohlke/pythonlibs/
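Since the wheel file names above encode the Python version (cp36) and the platform (win32), it can help to check what the local interpreter reports before downloading. The following is only a small sketch: the function name interpreter_tags is made up for this example, and the platform tag it returns only makes sense on Windows.

```python
import struct
import sys


def interpreter_tags():
    """Return (python_tag, bits, windows_platform_tag) for the running interpreter.

    interpreter_tags is a hypothetical helper written for this post.
    """
    # CPython 3.6 gives "cp36", which is the tag seen in the wheel file names.
    python_tag = f"cp{sys.version_info.major}{sys.version_info.minor}"
    # Pointer size in bytes * 8 tells whether the interpreter is 32-bit or 64-bit.
    bits = struct.calcsize("P") * 8
    # Windows wheels use "win32" for 32-bit builds and "win_amd64" for 64-bit builds.
    windows_platform_tag = "win_amd64" if bits == 64 else "win32"
    return python_tag, bits, windows_platform_tag


python_tag, bits, platform_tag = interpreter_tags()
print(f"Python tag: {python_tag}, {bits}-bit, Windows wheel suffix: {platform_tag}")
```

On the 32-bit Python 3.6 installation used in this post, the wheel to pick would be the one whose name contains both cp36 and win32.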
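Once Scrapy is installed, open cmd, change to the directory where you want to store the Scrapy project, and create a new project with the startproject command: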
D:\>scrapy startproject scraptest
New Scrapy project 'scraptest', using template directory 'c:\\python36-32\\lib\\site-packages\\scrapy\\templates\\project', created in:
    D:\scraptest

You can start your first spider with:
    cd scraptest
    scrapy genspider example example.com
The corresponding project directory tree is generated under D:\scraptest\:
scraptest/
    scrapy.cfg
    scraptest/
        __init__.py
        items.py          # defines the models for the scraped fields
        pipelines.py
        settings.py       # defines settings such as the user agent and crawl delay
        middlewares.py
        __pycache__/
        spiders/
            __pycache__/
            __init__.py
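As an illustration of the kind of values settings.py holds, here is a short excerpt. The setting names are standard Scrapy settings, but the values are assumptions chosen for this example and are not taken from the generated file.

```python
# scraptest/settings.py (illustrative excerpt; all values are assumptions)

BOT_NAME = 'scraptest'

# User agent string sent with every request (hypothetical value).
USER_AGENT = 'scraptest (+http://www.example.com)'

# Wait this many seconds between requests to the same site.
DOWNLOAD_DELAY = 5

# Send at most one request at a time to each domain.
CONCURRENT_REQUESTS_PER_DOMAIN = 1

# Respect the rules in the site's robots.txt.
ROBOTSTXT_OBEY = True
```

A polite crawl delay and a single concurrent request per domain keep the load on the target site low while experimenting.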
Use the genspider command, passing the spider module name, the domain, and an optional template argument:
D:\scraptest>scrapy genspider country example.webscraping.com
Created spider 'country' using template 'basic' in module:
  scraptest.spiders.country
This creates country.py in the D:\scraptest\scraptest\spiders directory:
# -*- coding: utf-8 -*-
import scrapy


class CountrySpider(scrapy.Spider):
    name = 'country'
    allowed_domains = ['example.webscraping.com']
    start_urls = ['http://example.webscraping.com/']

    def parse(self, response):
        pass
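1. name is the spider's name and must be set; according to the source code, an empty value raises ValueError.
2. start_urls lists the pages to crawl.
3. The parse method must keep its name: the Scrapy source code designates parse as the default callback for the requests generated from start_urls.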
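The three points above can be pictured with a toy model. This is not Scrapy's code: MiniSpider, run and the fetch argument are hypothetical names invented for this sketch, and real requests are reduced to (url, callback) pairs. It only mimics the two behaviours described, namely the ValueError for a missing name and the fallback to parse when a request carries no callback.

```python
class MiniSpider:
    """Toy stand-in for a spider base class (hypothetical, not Scrapy's API)."""

    name = None
    start_urls = ()

    def __init__(self):
        # A spider without a name is rejected at construction time.
        if not self.name:
            raise ValueError(f"{type(self).__name__} must have a name")

    def start_requests(self):
        # Each request is modelled as (url, callback); None means "use the default".
        for url in self.start_urls:
            yield url, None

    def parse(self, response):
        raise NotImplementedError("subclasses must define parse")


def run(spider, fetch):
    """Fetch every start URL and hand the response to its callback."""
    results = []
    for url, callback in spider.start_requests():
        # No explicit callback: fall back to the method named parse.
        handler = callback or spider.parse
        results.append(handler(fetch(url)))
    return results


class CountrySpider(MiniSpider):
    name = 'country'
    start_urls = ('http://example.webscraping.com/',)

    def parse(self, response):
        return f"parsed {response}"


class NamelessSpider(MiniSpider):
    pass


# fetch is faked with a lambda so that the sketch needs no network access.
print(run(CountrySpider(), lambda url: f"<page {url}>"))

try:
    NamelessSpider()
except ValueError as error:
    print(error)
```

In this model, renaming parse in CountrySpider would leave only the base-class parse, which raises NotImplementedError; that is the reason the default callback name has to stay unchanged when start_urls is used as-is.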
# -*- coding: utf-8 -*-
import scrapy
from lxml import etree


class CountrySpider(scrapy.Spider):
    name = 'country'
    allowed_domains = ['example.webscraping.com']
    start_urls = ['http://example.webscraping.com/places/default/view/Afghanistan-1']

    # This method name must not change, because parse is the name of the
    # default callback in the Scrapy source code.
    def parse(self, response):
        # Parse the page and print the text of every value cell in the table.
        tree = etree.HTML(response.text)
        for node in tree.xpath('//tr/td[@class="w2p_fw"]'):
            print(node.text)
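Run the spider with the crawl command. The log level can be set with -s LOG_LEVEL=DEBUG or -s LOG_LEVEL=ERROR; the run below uses --nolog to suppress log output altogether.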
D:\scraptest>scrapy crawl country --nolog
None
647,500 square kilometres
29,121,286
AF
Afghanistan
Kabul
None
.af
AFN
Afghani
93
None
None
fa-AF,ps,uz-AF,tk
None
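The None entries in this output are most likely cells that are empty or whose content sits inside a child element such as an image or a link: an element's text attribute only holds the text that appears before its first child. The sketch below reproduces the effect with the standard library's xml.etree.ElementTree, whose text attribute behaves the same way as lxml's. The sample markup is a hypothetical fragment written for this example, not a copy of the real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical table fragment: an image cell, a plain text cell,
# a link cell and an empty cell.
SAMPLE = """
<table>
  <tr><td class="w2p_fl">Flag:</td><td class="w2p_fw"><img src="flag.png" /></td></tr>
  <tr><td class="w2p_fl">Area:</td><td class="w2p_fw">647,500 square kilometres</td></tr>
  <tr><td class="w2p_fl">Continent:</td><td class="w2p_fw"><a href="/continent/AS">AS</a></td></tr>
  <tr><td class="w2p_fl">Postal code format:</td><td class="w2p_fw"></td></tr>
</table>
"""


def cell_texts(markup):
    """Return (direct_text, full_text) for every value cell in the table."""
    root = ET.fromstring(markup.strip())
    rows = []
    for cell in root.findall(".//tr/td[@class='w2p_fw']"):
        direct = cell.text                        # text before the first child, or None
        full = "".join(cell.itertext()).strip()   # text of the cell and all descendants
        rows.append((direct, full))
    return rows


for direct, full in cell_texts(SAMPLE):
    print(repr(direct), '|', repr(full))
```

With lxml, the same idea applies inside parse: joining node.itertext() instead of printing node.text would also show text nested in links, while image cells and empty cells would still produce an empty string.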