How to POST a request-payload style request with a Python crawler
1. Background
Recently, while crawling a certain site, I found that it POSTs its data in the request payload format, which is different from the POST format I usually ran into before (form data). Submitting the data as form data simply would not succeed.
So I looked up the difference between the two: http://xiaobaoqiu.github.io/blog/2014/09/04/form-data-vs-request-payload/. The following is largely carried over from that post (will take it down immediately if it infringes…).
1.1. The difference between Form Data and Request Payload in an HTTP request
There are two common ways of passing parameters in an AJAX POST request: form data and request payload.
1.1.1. Form data
For a GET request, the parameters go directly in the URL, in the form key1=value1&key2=value2, for example:
http://news.baidu.com/ns?word=NBA&tn=news&from=news&cl=2&rn=20&ct=1
For a POST request, the form parameters are in the request body, also in the key1=value1&key2=value2 form. This can be seen in Chrome's developer tools, as follows:
RequestURL:http://127.0.0.1:8080/test/test.do
Request Method:POST
Status Code:200 OK
Request Headers
Accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Accept-Encoding:gzip,deflate,sdch
Accept-Language:zh-CN,zh;q=0.8,en;q=0.6
AlexaToolbar-ALX_NS_PH:AlexaToolbar/alxg-3.2
Cache-Control:max-age=0
Connection:keep-alive
Content-Length:25
Content-Type:application/x-www-form-urlencoded
Cookie:JSESSIONID=74AC93F9F572980B6FC10474CD8EDD8D
Host:127.0.0.1:8080
Origin:http://127.0.0.1:8080
Referer:http://127.0.0.1:8080/test/index.jsp
User-Agent:Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.149 Safari/537.36
Form Data
name:mikan
address:street
Response Headers
Content-Length:2
Date:Sun, 11 May 2014 11:05:33 GMT
Server:Apache-Coyote/1.1
Note that the Content-Type of this POST request is application/x-www-form-urlencoded (the default), and the parameters are in the request body, i.e. the Form Data shown above.
Front-end code for submitting the data:
xhr.setRequestHeader("Content-type","application/x-www-form-urlencoded");
xhr.send("name=foo&value=bar");
Back-end code for receiving the submitted data: in a servlet, the form parameters can be read with request.getParameter(name).
/**
* Get a parameter from the HttpServletRequest
*
* @param request
* @param name
* @return
*/
protected String getParameterValue(HttpServletRequest request, String name) {
return StringUtils.trimToEmpty(request.getParameter(name));
}
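From the crawler's point of view, the same form-data POST is easy to reproduce with the requests module. A minimal sketch, where the URL and field names simply mirror the capture above:

import requests

formData = {
    'name': 'mikan',
    'address': 'street',
}
# Passing a dict through data= makes requests urlencode the body (name=mikan&address=street)
# and set Content-Type to application/x-www-form-urlencoded automatically
res = requests.post('http://127.0.0.1:8080/test/test.do', data=formData)
print(res.status_code)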
1.1.2. Request payload
With a raw AJAX POST request, the request looks as follows in Chrome's developer tools; the key point is that the parameters now appear under Request Payload:
Remote Address:192.168.234.240:80
Request URL:http://tuanbeta3.XXX.com/qimage/upload.htm
Request Method:POST
Status Code:200 OK
Request Headers
Accept:application/json, text/javascript, */*; q=0.01
Accept-Encoding:gzip,deflate,sdch
Accept-Language:zh-CN,zh;q=0.8,en;q=0.6
Connection:keep-alive
Content-Length:151
Content-Type:application/json;charset=UTF-8
Cookie:JSESSIONID=E08388788943A651924CA0A10C7ACAD0
Host:tuanbeta3.XXX.com
Origin:http://tuanbeta3.XXX.com
Referer:http://tuanbeta3.XXX.com/qimage/customerlist.htm?menu=19
User-Agent:Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.114 Safari/537.36
X-Requested-With:XMLHttpRequest
Request Payload
[{widthEncode:NNNcaXN, heightEncode:NNNN5NN, displayUrl:201409/03/66I5P266rtT86oKq6,…}]
Response Headers
Connection:keep-alive
Content-Encoding:gzip
Content-Type:application/json;charset=UTF-8
Date:Thu, 04 Sep 2014 06:49:44 GMT
Server:nginx/1.4.7
Transfer-Encoding:chunked
Vary:Accept-Encoding
Note that the request's Content-Type is application/json;charset=UTF-8, and the form parameters are in the Request Payload.
Back-end code for reading the data (using org.apache.commons.io.IOUtils):
/**
* Read the payload data from the request
*
* @param request
* @return
* @throws IOException
*/
private String getRequestPayload(HttpServletRequest request) throws IOException {
return IOUtils.toString(request.getReader());
}
1.1.3. The difference between the two
If a request's Content-Type is set to application/x-www-form-urlencoded, the POST request is treated as an HTTP POST form request, and the request body appears as a standard querystring of key-value pairs joined by &. This is the default for HTML forms, so it used to be the more common style.
Other kinds of POST request put the parameters in the Request Payload (nowadays usually as JSON, which is easier to read), with the request's Content-Type set to application/json;charset=UTF-8 or left unspecified.
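To make the difference concrete from the crawler's side, here is a minimal sketch (the URL http://example.com/api is just a placeholder) showing the body and Content-Type that requests produces in each case:

import json
import requests

payload = {'key1': 'value1', 'key2': 'value2'}

# Form data: a dict passed via data= is urlencoded and
# Content-Type defaults to application/x-www-form-urlencoded
formReq = requests.Request('POST', 'http://example.com/api', data=payload).prepare()
print(formReq.headers['Content-Type'], formReq.body)
# application/x-www-form-urlencoded  key1=value1&key2=value2

# Request payload: a JSON string passed via data= (or the json= shortcut);
# with data= the Content-Type has to be set explicitly
jsonReq = requests.Request('POST', 'http://example.com/api',
                           data=json.dumps(payload),
                           headers={'Content-Type': 'application/json'}).prepare()
print(jsonReq.headers['Content-Type'], jsonReq.body)
# application/json  {"key1": "value1", "key2": "value2"}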
2. Environment
Python 3.6.1
OS: Win7
IDE: PyCharm
requests 2.14.2
scrapy 1.4.0
3. POSTing a payload request with the requests module
import json
import requests
import datetime
postUrl = 'https://sellercentral.amazon.com/fba/profitabilitycalculator/getafnfee?profitcalcToken=en2kXFaY81m513NydhTZ9sdb6hoj3D'
# payload data
payloadData = {
'afnPriceStr': 10,
'currency':'USD',
'productInfoMapping': {
'asin': 'B072JW3Z6L',
'dimensionUnit': 'inches',
}
}
# request headers
payloadHeader = {
'Host': 'sellercentral.amazon.com',
'Content-Type': 'application/json',
}
# download timeout
timeOut = 25
# proxy
proxy = "183.12.50.118:8080"
proxies = {
"http": proxy,
"https": proxy,
}
# Serialize the payload dict to a JSON string and send it as the request body
dumpJsonData = json.dumps(payloadData)
print(f"dumpJsonData = {dumpJsonData}")
res = requests.post(postUrl, data=dumpJsonData, headers=payloadHeader, timeout=timeOut, proxies=proxies, allow_redirects=True)
# Passing the dict directly via the json= parameter works as well
# res = requests.post(postUrl, json=payloadData, headers=payloadHeader)
print(f"responseTime = {datetime.datetime.now()}, statusCode = {res.status_code}, res text = {res.text}")
4. POSTing a payload request in scrapy
Here is the bad news: scrapy currently does not support this kind of payload request out of the box. On top of that, scrapy is quite strict about formdata requests; see this article for the details: https://blog.csdn.net/zwq912318834/article/details/78292536
4.1. Analyzing the scrapy source code
See the annotated source below:
# File: E:\Miniconda\Lib\site-packages\scrapy\http\request\form.py
class FormRequest(Request):
    def __init__(self, *args, **kwargs):
        formdata = kwargs.pop('formdata', None)
        if formdata and kwargs.get('method') is None:
            kwargs['method'] = 'POST'
        super(FormRequest, self).__init__(*args, **kwargs)
        if formdata:
            items = formdata.items() if isinstance(formdata, dict) else formdata
            querystr = _urlencode(items, self.encoding)
            # The Content-Type is hard-coded here: whenever data is submitted via POST,
            # it is set to the form data type (application/x-www-form-urlencoded).
            # Even if this line were changed, nothing downstream serializes the data as JSON.
            if self.method == 'POST':
                self.headers.setdefault(b'Content-Type', b'application/x-www-form-urlencoded')
                self._set_body(querystr)
            else:
                self._set_url(self.url + ('&' if '?' in self.url else '?') + querystr)
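To see why the formdata path cannot produce what the server wants, compare what urlencoding the nested payload dict yields with the JSON string the site expects. A minimal sketch, using the standard-library urlencode in place of scrapy's private _urlencode helper:

import json
from urllib.parse import urlencode

payloadData = {
    'afnPriceStr': 10,
    'currency': 'USD',
    'productInfoMapping': {'asin': 'B072JW3Z6L', 'dimensionUnit': 'inches'},
}
# Roughly what FormRequest would put in the body: a querystring,
# with the nested dict collapsed into its percent-encoded repr()
print(urlencode(payloadData))
# afnPriceStr=10&currency=USD&productInfoMapping=%7B%27asin%27...
# What the server actually expects in the request payload:
print(json.dumps(payloadData))
# {"afnPriceStr": 10, "currency": "USD", "productInfoMapping": {...}}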
4.2. Approach: embed the requests module inside scrapy
[Screenshot: analyzing the request]
[Screenshot: the query result returned]
Step 1: build the request in the spider, carrying all of the parameters and the necessary information.
# File: mySpider.py
payloadData = {}
payloadData['afnPriceStr'] = 0
payloadData['currency'] = asinInfo['currencyCodeHidden']
payloadData['futureFeeDate'] = asinInfo['futureFeeDateHidden']
payloadData['hasFutureFee'] = False
payloadData['hasTaxPage'] = True
payloadData['marketPlaceId'] = asinInfo['marketplaceIdHidden']
payloadData['mfnPriceStr'] = 0
payloadData['mfnShippingPriceStr'] = 0
payloadData['productInfoMapping'] = {}
payloadData['productInfoMapping']['asin'] = dataFieldJson['asin']
payloadData['productInfoMapping']['binding'] = dataFieldJson['binding']
payloadData['productInfoMapping']['dimensionUnit'] = dataFieldJson['dimensionUnit']
payloadData['productInfoMapping']['dimensionUnitString'] = dataFieldJson['dimensionUnitString']
payloadData['productInfoMapping']['encryptedMarketplaceId'] = dataFieldJson['encryptedMarketplaceId']
payloadData['productInfoMapping']['gl'] = dataFieldJson['gl']
payloadData['productInfoMapping']['height'] = dataFieldJson['height']
payloadData['productInfoMapping']['imageUrl'] = dataFieldJson['imageUrl']
payloadData['productInfoMapping']['isAsinLimits'] = dataFieldJson['isAsinLimits']
payloadData['productInfoMapping']['isWhiteGloveRequired'] = dataFieldJson['isWhiteGloveRequired']
payloadData['productInfoMapping']['length'] = dataFieldJson['length']
payloadData['productInfoMapping']['link'] = dataFieldJson['link']
payloadData['productInfoMapping']['originalUrl'] = dataFieldJson['originalUrl']
payloadData['productInfoMapping']['productGroup'] = dataFieldJson['productGroup']
payloadData['productInfoMapping']['subCategory'] = dataFieldJson['subCategory']
payloadData['productInfoMapping']['thumbStringUrl'] = dataFieldJson['thumbStringUrl']
payloadData['productInfoMapping']['title'] = dataFieldJson['title']
payloadData['productInfoMapping']['weight'] = dataFieldJson['weight']
payloadData['productInfoMapping']['weightUnit'] = dataFieldJson['weightUnit']
payloadData['productInfoMapping']['weightUnitString'] = dataFieldJson['weightUnitString']
payloadData['productInfoMapping']['width'] = dataFieldJson['width']
# https://sellercentral.amazon.com/fba/profitabilitycalculator/getafnfee?profitcalcToken=en2kXFaY81m513NydhTZ9sdb6hoj3D
postUrl = f"https://sellercentral.amazon.com/fba/profitabilitycalculator/getafnfee?profitcalcToken={asinInfo['tokenValue']}"
payloadHeader = {
'Host': 'sellercentral.amazon.com',
'Content-Type': 'application/json',
}
# scrapy source: self.headers.setdefault(b'Content-Type', b'application/x-www-form-urlencoded')
print(f"payloadData = {payloadData}")
# This Request is not actually meant to be downloaded by scrapy: built this way it cannot succeed and would come back as a 404.
# It only carries the query parameters along; the download middleware sends it with the requests module, and 'payloadFlag' marks this kind of request.
yield Request(url = postUrl,
headers = payloadHeader,
meta = {'payloadFlag': True, 'payloadData': payloadData, 'headers': payloadHeader, 'asinInfo': asinInfo},
callback = self.parseAsinSearchFinallyRes,
errback = self.error,
dont_filter = True
)
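The Request above references an errback self.error that the original code does not show; a hypothetical minimal version, assuming it only needs to log the failure, could look like this:

# Hypothetical sketch of the errback referenced above (not part of the original article)
def error(self, failure):
    request = failure.request
    self.logger.error(f"request failed, url = {request.url}, reason = {repr(failure)}")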
Step 2: in the downloader middleware, handle this request with the requests module.
# File: middlewares.py
import json
import datetime
import requests
from scrapy.http import HtmlResponse

class PayLoadRequestMiddleware:
    def process_request(self, request, spider):
        # Requests flagged as payload requests are handled here instead of by scrapy's own downloader
        if request.meta.get('payloadFlag', False):
            print("PayLoadRequestMiddleware enter")
            postUrl = request.url
            headers = request.meta.get('headers', {})
            payloadData = request.meta.get('payloadData', {})
            proxy = request.meta['proxy']
            proxies = {
                "http": proxy,
                "https": proxy,
            }
            timeOut = request.meta.get('download_timeout', 25)
            # dont_redirect=True means redirects must NOT be followed, hence the negation
            allow_redirects = not request.meta.get('dont_redirect', False)
            dumpJsonData = json.dumps(payloadData)
            print(f"dumpJsonData = {dumpJsonData}")
            # Note: this turns out to be a synchronous, blocking call, which badly hurts crawl speed
            res = requests.post(postUrl, data=dumpJsonData, headers=headers, timeout=timeOut, proxies=proxies, allow_redirects=allow_redirects)
            # res = requests.post(postUrl, json=payloadData, headers=headers)
            print(f"responseTime = {datetime.datetime.now()}, res text = {res.text}, statusCode = {res.status_code}")
            if 200 <= res.status_code < 300:
                # Returning a Response hands it straight to the callback; scrapy will not download this request again
                return HtmlResponse(url=request.url,
                                    body=res.content,
                                    request=request,
                                    # ideally this should match the page's actual encoding
                                    encoding='utf-8',
                                    status=200)
            else:
                print(f"requests-mode download failed, statusCode = {res.status_code}")
                return HtmlResponse(url=request.url, status=500, request=request)
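For this middleware to take effect it also has to be enabled in the project settings; a sketch, where the module path myProject.middlewares and the priority 543 are assumptions to adjust for your own project:

# File: settings.py
DOWNLOADER_MIDDLEWARES = {
    'myProject.middlewares.PayLoadRequestMiddleware': 543,  # hypothetical project path
}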
4.3. Remaining issues
What makes scrapy powerful is its high concurrency. As everyone knows, because of Python's GIL, Python cannot gain performance from multithreading. But scrapy can at least overlap downloading with parsing: while waiting on downloads it parses data, schedules requests, and so on, all thanks to its asynchronous architecture.
However, using the requests module inside the middleware to download pages is a synchronous process, so the crawl blocks right there, dragging down the efficiency of the whole crawler.
So you have to pick the approach that suits the specifics of your project. This also leads to a new topic: the two crawl orders scrapy offers, depth-first and breadth-first. How do you make the most of scrapy's concurrency? And how do you get the data back as reliably as possible when the environment is unstable?
Depth-first versus breadth-first is configured in the settings.
# File: settings.py
# When DEPTH_PRIORITY (default 0) is set to a positive value, scrapy's scheduler queue switches from LIFO to FIFO,
# so the crawl order changes from DFO (depth-first) to BFO (breadth-first)
DEPTH_PRIORITY = 1  # breadth-first (this can pile up a huge number of pending requests, occupying a lot of memory, and the final data only arrives in the last batch of the crawl)
Depth-first: DEPTH_PRIORITY = 0
Breadth-first: DEPTH_PRIORITY = 1
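For reference, the scrapy FAQ describes a fuller breadth-first setup that, besides DEPTH_PRIORITY, also switches the scheduler queues to FIFO; a sketch:

# File: settings.py
# breadth-first (BFO) crawl order as suggested in the scrapy FAQ
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'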