Cookie matching policies in Apache HttpComponents

1 Introduction

While scraping web pages with clj-http in Clojure, I needed to send a custom cookie, but it was never sent. This post looks at how HttpClient handles cookies.

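clj-http wraps the Apache HttpClient library. It automatically handles the state (cookies) that the server sends back with a response, but adding my own cookie to a request kept failing. After a lot of trial and error I worked out the details of how HttpClient processes cookies. For HTTP state management in HttpClient, see the official HttpClient tutorial.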

2 Example

Fetch search results from Bing. The code is as follows:

The deps.edn file:

{
 :deps {
        org.clojure/clojure {:mvn/version "1.10.0"},
        clj-http {:mvn/version "3.9.1"}, ; http
        reaver {:mvn/version "0.1.2"}  ; jsoup, html parser
        }
}

The actual code:

(require '[clj-http.client :as client])
(require '[clj-http.cookies :as cookies])
(require '[reaver :as html])

(def ua "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:33.0) Gecko/20120101 Firefox/33.0")
(def cs (cookies/cookie-store))

(def http-header {:headers {"user-agent" ua
                            "accept-charset" "utf-8"}
                  :proxy-host "localhost" ;local proxy, for testing
                  :proxy-port 8080
                  :cookie-store cs
                  :insecure? true})

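(defn search
  "Search for the keyword kw; returns [{:url :title :desc} ...]
  kw is the search keyword
  opt is a map of extra options:
  :option extra parameters for the HTTP request
  :max-page maximum number of pages to fetch, default 3"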
  [kw & [opt]]
  (let [base-url "https://www.bing.com"
        option (get opt :option {})
        max-page (get opt :max-page 3)]
    (loop [page 1
           url (str base-url "/search?q=" kw) ;kw is not URL-encoded here; for demonstration only
           result []]
      (if (> page max-page)
        result
        (let [doc (-> (client/get url
                                  (merge http-header option))
                      :body
                      html/parse)
              entrys (html/extract-from doc "li.b_algo"
                                        [:url :title :desc]
                                        "h2 > a" (html/attr "href")
                                        "h2 > a" html/text
                                        "div > p" html/text)
              r (apply conj result entrys)]
          (if-let [next-path (-> (html/select doc "a.sb_pagn" )
                                 (html/attr "href"))]
            (recur (inc page)
                   (str base-url next-path)
                   r)
            r))))))

(def googles (search "google" {:max-page 3}))
(count googles)
;; => 30
(first googles)
;; => {:url "http://www.google.cn/", :title "Google", :desc "2016-12-8 · google.com.hk 請收藏咱們的網址"}
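
The `search` function above concatenates `kw` into the URL without encoding it, which is fine for a single ASCII word but breaks for keywords containing spaces or `&`. A minimal sketch of one way to encode the keyword, using the JDK's `java.net.URLEncoder`; `search-url` is a hypothetical helper name, not part of the original code:

```clojure
(import 'java.net.URLEncoder)

(defn search-url
  "Builds a search URL with kw encoded as a query value.
  Hypothetical helper for illustration only."
  [base-url kw]
  ;; URLEncoder performs form encoding: spaces become "+",
  ;; reserved characters such as "&" become %XX escapes.
  (str base-url "/search?q=" (URLEncoder/encode kw "UTF-8")))

(search-url "https://www.bing.com" "clojure http client")
;; => "https://www.bing.com/search?q=clojure+http+client"
```

Inside `search`, the initial `url` binding could then be `(search-url base-url kw)`.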

Three pages yield 30 results, so each page returns 10. To get 50 per page (the maximum offered in Bing's settings), set the cookie "SRCHHPGUSR" to the value "NRSLT=50".

Now add the cookie:

(cookies/clear-cookies cs) ;clear the cookies left by earlier requests

(def usr-cookie (cookies/to-basic-client-cookie
                 ["SRCHHPGUSR" {
                                :discard false
                                :domain ".bing.com",
                                :path "/",
                                :value "NRSLT=50"
                                :expires (java.util.Date. 9000 1 1)
                                }]))
(cookies/add-cookie cs usr-cookie)

(cookies/get-cookies cs)
;; => {"SRCHHPGUSR" {:discard false, :domain ".bing.com", :expires #inst "10900-01-31T16:00:00.000-00:00", :path "/", :secure false, :value "NRSLT=50", :version 0}}

(def googles (search "google" {:max-page 3}))
(count googles)
;; => 30

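The cookie did not take effect, and the proxy shows that the first request went out without the added cookie. The reason is that HttpClient's default cookie specification does not match the cookie domain .bing.com against the host www.bing.com.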

For the API used below, see the CookieOrigin API documentation and the cookie package documentation:

(import org.apache.http.cookie.CookieOrigin)
(import (org.apache.http.impl.cookie DefaultCookieSpec
                                     RFC6265LaxSpec
                                     RFC6265StrictSpec
                                     RFC2965Spec
                                     RFC2109Spec
                                     NetscapeDraftSpec
                                     IgnoreSpec
                                     BasicClientCookie2))

(def bing-co (CookieOrigin. "www.bing.com" 80 "/" false))

(def match-spec #(.match %1 usr-cookie bing-co))
(def default-spec (DefaultCookieSpec.))
(match-spec default-spec)
;; => false
(def rfc6265-lax-spec (RFC6265LaxSpec.))
(match-spec rfc6265-lax-spec)
;; => false
(def rfc6265-strict-spec (RFC6265StrictSpec.))
(match-spec rfc6265-strict-spec)
;; => false
(def rfc2965-spec (RFC2965Spec.))
(match-spec rfc2965-spec)
;; => true
(def rfc2109-spec (RFC2109Spec.))
(match-spec rfc2109-spec)
;; => true
(def netscape-spec (NetscapeDraftSpec.))
(match-spec netscape-spec)
;; => true

;; Of the specs tested, only the rfc2* specs and netscape match .bing.com by default
;; Now set the domain attribute on the cookie
(str usr-cookie)
;; => "[version: 0][name: SRCHHPGUSR][value: NRSLT=50][domain: .bing.com][path: /][expiry: Mon Feb 01 00:00:00 CST 10900]"
(.setAttribute usr-cookie BasicClientCookie2/DOMAIN_ATTR "true")
(str usr-cookie)
;; => "[version: 0][name: SRCHHPGUSR][value: NRSLT=50][domain: .bing.com][path: /][expiry: Mon Feb 01 00:00:00 CST 10900]"
;; The printed form does not show that the attribute was set; the difference only appears when matching:
(.getAttribute usr-cookie "domain") ; the value of DOMAIN_ATTR is "domain"
;; => "true"

(def match-spec #(.match %1 usr-cookie bing-co))
(match-spec default-spec)
;; => true
(match-spec rfc6265-lax-spec)
;; => true
(match-spec rfc6265-strict-spec)
;; => true
(match-spec rfc2965-spec)
;; => true
(match-spec rfc2109-spec)
;; => true
(match-spec netscape-spec)
;; => true

;; Check the search results again
(cookies/clear-cookies cs)
(cookies/add-cookie cs usr-cookie)
(def googles (search "google" {:max-page 3}))
(count googles)
;; => 78
;; Not every page has 50 results, but the important point is that after .setAttribute the cookie takes effect

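The underlying reason is a rule in RFC 6265: a cookie that carries no Domain attribute is host-only and is returned only to the exact host it came from, so the RFC 6265 specs appear to apply suffix matching only when the cookie has the domain attribute set. See the Stack Overflow answer for details.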
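
The behaviour observed in the tests can be modelled roughly as follows. This is a simplified sketch, not HttpClient's actual code: it assumes that a cookie without the domain attribute is treated as host-only (exact host match), while a cookie with the attribute gets suffix matching. The function names and the `:domain-attr?` key are hypothetical and exist only for this illustration.

```clojure
(require '[clojure.string :as string])

(defn domain-match?
  "Suffix match: host equals domain, or host ends with \".\" + domain."
  [domain host]
  (or (= host domain)
      (string/ends-with? host (str "." domain))))

(defn cookie-matches-host?
  "Simplified model of the matching seen above. cookie is a plain map
  {:domain \"...\" :domain-attr? true|false}; :domain must not be nil."
  [{:keys [domain domain-attr?]} host]
  (let [domain (-> domain string/lower-case (string/replace #"^\." ""))
        host   (string/lower-case host)]
    (if domain-attr?
      (domain-match? domain host) ; domain cookie: subdomains match too
      (= host domain))))          ; host-only cookie: exact match only

(cookie-matches-host? {:domain ".bing.com" :domain-attr? false} "www.bing.com")
;; => false
(cookie-matches-host? {:domain ".bing.com" :domain-attr? true} "www.bing.com")
;; => true
```

Under this model the cookie added earlier fails to match www.bing.com until the attribute is set, which is consistent with the results of the RFC 6265 specs before and after `.setAttribute`.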

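As the tests above suggest, the same result can be achieved in clj-http by using the netscape cookie-policy. The standard policy corresponds to rfc6265-lax, which does not match by default either: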

(def usr-cookie (cookies/to-basic-client-cookie
                 ["SRCHHPGUSR" {
                                :discard false
                                :domain ".bing.com",
                                :path "/",
                                :value "NRSLT=50"
                                :expires (java.util.Date. 9000 1 1)
                                }]))

(cookies/clear-cookies cs)
(cookies/add-cookie cs usr-cookie)
(def googles (search "google" {:max-page 3
                               :option {:cookie-policy :netscape}}))
(count googles)
;; => 78

However, this approach is not recommended; using .setAttribute is the better option.
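
The `.setAttribute` approach could be wrapped in a small helper so the attribute is never forgotten. This is a sketch under assumptions: `domain-cookie` and `bing-cookie` are hypothetical names, and storing the domain string as the attribute value (instead of "true" as in the test above) is a choice made here, on the assumption that matching depends only on the attribute being present. It needs clj-http on the classpath, as in the deps.edn above.

```clojure
(require '[clj-http.cookies :as cookies])
(import 'org.apache.http.cookie.ClientCookie)

(defn domain-cookie
  "Builds a cookie with clj-http and also sets its \"domain\" attribute,
  so the RFC 6265 specs treat it as a domain cookie.
  Hypothetical helper for illustration only."
  [cookie-name {:keys [domain] :as attrs}]
  (doto (cookies/to-basic-client-cookie [cookie-name attrs])
    (.setAttribute ClientCookie/DOMAIN_ATTR domain)))

(def bing-cookie
  (domain-cookie "SRCHHPGUSR" {:discard false
                               :domain ".bing.com"
                               :path "/"
                               :value "NRSLT=50"
                               :expires (java.util.Date. 9000 1 1)}))

(.containsAttribute bing-cookie ClientCookie/DOMAIN_ATTR)
;; => true
```

The cookie can then be added with `(cookies/add-cookie cs bing-cookie)` without changing the cookie policy.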

Author: ntestoc

Created: 2019-01-25 Fri 20:18
