Scrapy Crawler -- 01

Date: 2019-11-13

Tags: scrapy, crawler | Category: Python

Scrapy is a fast, high-level screen-scraping and web-crawling framework written in Python, used to crawl websites and extract structured data from their pages.

--from wikiweb

Put plainly, it is a Python-based crawler framework.

Installation:

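ubuntu 14.04
python2.7 (python3 was not supported at the time of writing; this is not the author being lazy: Twisted, the framework Scrapy depends on, had not yet been fully ported to python3)
pip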

sudo pip2 install scrapy

Note: pip3 can also install scrapy, but some supporting libraries are missing, so the result is unusable. Just stick with python2.
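A quick way to confirm that the install is usable from a given interpreter is to try importing the package under that interpreter. The following is a minimal sketch, not part of Scrapy itself; the helper name `is_importable` is made up for illustration, and it runs under both python2.7 and python3.

```python
from __future__ import print_function


def is_importable(package_name):
    """Return True if this interpreter can import the named package.

    Hypothetical helper for illustration; it only attempts the import.
    """
    try:
        __import__(package_name)
    except ImportError:
        return False
    return True


if __name__ == "__main__":
    # Run this with the same interpreter that pip installed into.
    print("scrapy importable:", is_importable("scrapy"))
```

If this prints False under python2 after the install command above, the package most likely went into a different interpreter's site-packages.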

Usage:

1. Create a new project

scrapy startproject tutorial

This creates the following directory structure:

tutorial/
    scrapy.cfg
    tutorial/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            ..

The official documentation explains these as follows:

scrapy.cfg: the project configuration file
tutorial/: the project’s python module, you’ll later import your code from here. (The customizable part of the project, i.e. the Python package that holds your own code.)
tutorial/items.py: the project’s items file. (In practice, this defines the structure of the data to be scraped.)
tutorial/pipelines.py: the project’s pipelines file. (This is where you define how scraped data is exported. A scrapy-mongodb pipeline is available through pip and can export scraped data directly to MongoDB.)
tutorial/settings.py: the project’s settings file.
tutorial/spiders/: a directory where you’ll later put your spiders. (The directory that holds the spiders, which are generally what fetch the web pages.)
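To make the role of pipelines.py more concrete, here is a minimal sketch of an item pipeline. An item pipeline is an ordinary Python class with a `process_item(self, item, spider)` method, so the sketch needs no Scrapy import. The class name `TitleCleanupPipeline` and the `title` field are assumptions made up for this example; they do not come from the generated project.

```python
class TitleCleanupPipeline(object):
    """Hypothetical pipeline: trims whitespace from a 'title' field."""

    def __init__(self):
        # Count how many items have passed through this pipeline.
        self.seen = 0

    def process_item(self, item, spider):
        # Scrapy calls this once per scraped item; the returned item
        # is handed on to the next enabled pipeline.
        item["title"] = item.get("title", "").strip()
        self.seen += 1
        return item


if __name__ == "__main__":
    # Exercise the pipeline by hand with a plain dict standing in for an item.
    pipeline = TitleCleanupPipeline()
    print(pipeline.process_item({"title": "  Hello Scrapy  "}, spider=None))
```

For Scrapy to actually call such a class, it would have to be listed in the ITEM_PIPELINES setting in settings.py, for example under the path tutorial.pipelines.TitleCleanupPipeline with an integer priority.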

To be continued...
