Scrapy Usage Guide

Create a Scrapy project:

scrapy startproject <project_name>

cd into the project directory

scrapy genspider <spider_name> www.baidu.com  (the target site's domain)

 

Then create the spider file following the prompts (the official test site is http://quotes.toscrape.com/).
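
For example, using the project name hello (the same name that appears in the settings examples later) and the official test site, the full sequence would look like this:

scrapy startproject hello
cd hello
scrapy genspider quotes quotes.toscrape.com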

 

Create a launcher script

from scrapy.cmdline import execute
execute(['scrapy','crawl','quotes'])

Here quotes is the spider name; create this file in the root directory of the Scrapy project.
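
For reference, a minimal sketch of what the generated quotes spider (spiders/quotes.py) might look like; the exact boilerplate depends on your Scrapy version:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # the selector examples below run inside this method
        pass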

 

CSS selectors:

response.css('.text::text').extract()

This extracts the text of every element that has class='text' and returns a list.

 

response.css('.text::text').extract_first()

This takes only the first match and returns a str.

print(response.css("div span::attr(class)").extract())

This extracts an attribute value (here, the class attribute of every span inside a div), again as a list.
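
Put together, a sketch of a parse method that uses these CSS selectors to yield items, assuming the quotes.toscrape.com markup where each quote sits in a div with class quote (the field names text and tags are illustrative):

    def parse(self, response):
        # iterate over each quote block on the page and pull out fields
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('.text::text').extract_first(),
                'tags': quote.css('.tag::text').extract(),
            }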

 

XPath selectors:

url = response.url+response.xpath('/html/body/div/div[2]/div[1]/div[1]/div/a[1]/@href').extract_first()

Usage is basically the same as standard XPath. Here we extract a relative href and concatenate it with the site's base URL.

print(response.xpath("//a[@class='tag']/text()").extract())

Gets the text inside every <a> link that has class='tag'.

print(response.url)
print(response.status)

Print the URL of the request and the status code of the response.
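
As a side note, instead of string concatenation, Scrapy responses provide urljoin, which resolves relative paths correctly; a small sketch (the XPath here is illustrative):

href = response.xpath('//a[@class="tag"][1]/@href').extract_first()
url = response.urljoin(href)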

 

Saving the output, e.g. as JSON:

scrapy crawl quotes -o quotes.json

JSON Lines output:

scrapy crawl quotes -o quotes.jl

scrapy crawl quotes -o quotes.csv

scrapy crawl quotes -o quotes.xml

 

scrapy crawl quotes -o quotes.pickle

scrapy crawl quotes -o quotes.marshal

scrapy crawl quotes -o ftp://user:pass@ftp.example.com/path/to/quotes.csv
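
In newer Scrapy versions (2.1+), the same export can also be configured in settings.py via the FEEDS setting instead of the -o flag; a minimal sketch:

FEEDS = {
    'quotes.json': {'format': 'json'},
}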

 

Operations in pipelines.py

from scrapy.exceptions import DropItem

class HelloPipeline(object):
    def __init__(self):
        self.limit = 50

    def process_item(self, item, spider):
        if item['name']:
            # truncate overly long names and append an ellipsis
            if len(item['name']) > self.limit:
                item['name'] = item['name'][:self.limit].rstrip() + '...'
            return item
        else:
            # DropItem must be raised (not returned) to discard the item
            raise DropItem('Missing name in %s' % item)
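
The pipeline above indexes item['name'], so the project's items.py needs a matching field; a minimal sketch (the class name HelloItem is hypothetical):

import scrapy

class HelloItem(scrapy.Item):
    # the only field the HelloPipeline example above relies on
    name = scrapy.Field()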

A pipeline that writes each item to MongoDB:

import pymongo

class MongoPipline(object):
    def __init__(self, mongo_url, mongo_db):
        self.mongo_url = mongo_url
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # pull the connection settings out of settings.py
        return cls(
            mongo_url=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DB'),
        )

    def open_spider(self, spider):
        print(self.mongo_url, self.mongo_db)
        self.client = pymongo.MongoClient(self.mongo_url)
        self.db = self.client[self.mongo_db]

    def process_item(self, item, spider):
        # insert() is deprecated in recent pymongo releases; use insert_one()
        self.db['name'].insert_one(dict(item))
        print(item)
        return item

    def close_spider(self, spider):
        self.client.close()

Remember to enable the pipelines in settings.py (lower numbers run earlier):

ITEM_PIPELINES = {
   'hello.pipelines.HelloPipeline': 300,
   'hello.pipelines.MongoPipline': 400,
}
MONGO_URI = '127.0.0.1'
MONGO_DB = 'hello'

 

 

Downloader Middleware

Core methods:

process_request(self, request, spider)

Return None: Scrapy keeps processing this request until a response is returned; this is the usual place to modify the request.

Return a Response: that response is returned directly.

Return a Request: the returned request is put back into the scheduler queue and treated as a brand-new request.

Raise IgnoreRequest: the exception is raised and the process_exception methods are called in turn.

 

 

process_response(self, request, response, spider)

Return a Request: the returned request is put back into the scheduler queue and treated as a new request.

Return a Response: the response continues through processing until it reaches the spider.

process_exception(self, request, exception, spider)

Raise IgnoreRequest: the exception is raised and the process_exception methods are called in turn.

 

Example: override a downloader middleware to add a User-Agent to every request and change every returned status code to 201.

In settings.py:

DOWNLOADER_MIDDLEWARES = {
   'dingdian.middlewares.AgantMiddleware': 543,
}

 

In middlewares.py:

import random

class AgantMiddleware(object):
    def __init__(self):
        # pool of User-Agent strings to choose from
        self.user_agent = ['Mozilla/5.0 (Windows NT 10.0; WOW64; rv:58.0) Gecko/20100101 Firefox/58.0']

    def process_request(self, request, spider):
        # attach a random User-Agent; returning None lets the request continue
        request.headers['User-Agent'] = random.choice(self.user_agent)
        print(request.headers)

    def process_response(self, request, response, spider):
        # demo only: rewrite every response's status code to 201
        response.status = 201
        return response

 

 

 

Two ways to issue requests in Scrapy

The first:

import scrapy

yield scrapy.Request(begin_url,self.first)

 

The second:

from scrapy.http import Request

yield Request(url,self.first,meta={'thename':pic_name[0]})
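
The value passed through meta can be read back in the callback via response.meta; a minimal sketch of the first callback referenced above:

def first(self, response):
    # whatever was attached to the request via meta comes back on the response
    name = response.meta['thename']
    print(name)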

 

 

Sending POST requests:

from scrapy import FormRequest  # the request class Scrapy provides for form submissions / logins

formdata = {
    'username': 'wangshang',
    'password': 'a706486'
}
yield FormRequest(
    url='http://172.16.10.119:8080/bwie/login.do',
    formdata=formdata,
    callback=self.after_login,
)
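
When the login form is embedded in an HTML page, FormRequest.from_response can pre-fill hidden form fields from that page; a hedged sketch, where the success check in after_login is an assumption to adapt to the target site:

def parse(self, response):
    yield FormRequest.from_response(
        response,
        formdata={'username': 'wangshang', 'password': 'a706486'},
        callback=self.after_login,
    )

def after_login(self, response):
    # hypothetical check; inspect whatever the site actually returns
    if response.status == 200:
        self.logger.info('login request completed')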

Adding a proxy IP (and headers) in a downloader middleware

 

class UserAgentMiddleware(object):
    def __init__(self):
        self.user_agent = ['Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.117 Safari/537.36','Mozilla/5.0 (Windows NT 10.0; WOW64; rv:58.0) Gecko/20100101 Firefox/58.0']

    def process_request(self, request, spider):
        # route the request through a proxy by setting request.meta['proxy']
        request.meta['proxy'] = 'http://' + '175.42.123.111:33995'
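
Like the earlier example, this middleware also has to be registered in settings.py; the module path below assumes it lives in the hello project's middlewares.py:

DOWNLOADER_MIDDLEWARES = {
   'hello.middlewares.UserAgentMiddleware': 543,
}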
