A Pitfall Encountered When Retrieving Cookies with HttpClient

Original article: http://www.fullstackyang.com/... When reposting, please credit this blog's address or the segmentfault address. Thank you!

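When scraping web pages with HttpClient, we often need to keep the cookies sent back by the server so that later requests that depend on them can be issued. In most cases the built-in cookie policy is enough, and the cookies can be read directly.
The short snippet below requests http://www.baidu.com and prints the cookies it receives: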

@Test
public void getCookie(){
    CloseableHttpClient httpClient = HttpClients.createDefault();
    HttpGet get=new HttpGet("http://www.baidu.com");
    HttpClientContext context = HttpClientContext.create();
    try {
        CloseableHttpResponse response = httpClient.execute(get, context);
        try{
            System.out.println(">>>>>>headers:");
            Arrays.stream(response.getAllHeaders()).forEach(System.out::println);
            System.out.println(">>>>>>cookies:");
            context.getCookieStore().getCookies().forEach(System.out::println);
        }
        finally {
            response.close();
        }
    } catch (IOException e) {
        e.printStackTrace();
    }finally {
        try {
            httpClient.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

Output:

>>>>>>headers:
Server: bfe/1.0.8.18
Date: Tue, 12 Sep 2017 06:19:06 GMT
Content-Type: text/html
Last-Modified: Mon, 23 Jan 2017 13:28:24 GMT
Transfer-Encoding: chunked
Connection: Keep-Alive
Cache-Control: private, no-cache, no-store, proxy-revalidate, no-transform
Pragma: no-cache
Set-Cookie: BDORZ=27315; max-age=86400; domain=.baidu.com; path=/
>>>>>>cookies:
[version: 0][name: BDORZ][value: 27315][domain: baidu.com][path: /][expiry: null]

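Some sites, however, return cookies that do not fully follow the specification. In the example below, the printed headers show that the cookie's Expires attribute is a Unix timestamp rather than a standard date string. As a result, HttpClient fails to process the cookie, nothing ends up in the cookie store, and a warning is logged: "Invalid 'expires' attribute: 1505204523"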

警告: Invalid cookie header: "Set-Cookie: yd_cookie=90236a64-8650-494b332a285dbd886e5981965fc4a93f023d; Expires=1505204523; Path=/; HttpOnly". Invalid 'expires' attribute: 1505204523
>>>>>>headers:
Date: Tue, 12 Sep 2017 06:22:03 GMT
Content-Type: text/html
Connection: keep-alive
Set-Cookie: yd_cookie=90236a64-8650-494b332a285dbd886e5981965fc4a93f023d; Expires=1505204523; Path=/; HttpOnly
Cache-Control: no-cache, no-store
Server: WAF/2.4-12.1
>>>>>>cookies:

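We could rebuild a cookie by hand from the header data, and many people do exactly that, but it is not an elegant approach. So how should this be solved? There is little material online, so the official documentation is the place to start. Section 3.4 of the official documentation, "Custom cookie policy", explains that a custom cookie policy is allowed: implement the CookieSpec interface, and use a CookieSpecProvider to create the policy instance and register it with HttpClient. The key clue is the CookieSpec interface, so let's look at its source: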

public interface CookieSpec {
……
    /**
      * Parse the {@code "Set-Cookie"} Header into an array of Cookies.
      *
      * <p>This method will not perform the validation of the resultant
      * {@link Cookie}s</p>
      *
      * @see #validate
      *
      * @param header the {@code Set-Cookie} received from the server
      * @param origin details of the cookie origin
      * @return an array of {@code Cookie}s parsed from the header
      * @throws MalformedCookieException if an exception occurs during parsing
      */
    List<Cookie> parse(Header header, CookieOrigin origin) throws MalformedCookieException;
……
}

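The source contains a parse method, and its Javadoc makes clear that this is the method that turns a Set-Cookie header into Cookie objects. The natural next step is to look at HttpClient's default implementation, DefaultCookieSpec; its source is omitted here for brevity. DefaultCookieSpec mainly determines which cookie specification the header follows and then delegates to the corresponding implementation. A cookie like the one above is ultimately parsed by a NetscapeDraftSpec instance, and the NetscapeDraftSpec source defines the default expires date pattern as "EEE, dd-MMM-yy HH:mm:ss z":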

public class NetscapeDraftSpec extends CookieSpecBase {

    protected static final String EXPIRES_PATTERN = "EEE, dd-MMM-yy HH:mm:ss z";

    /** Default constructor */
    public NetscapeDraftSpec(final String[] datepatterns) {
        super(new BasicPathHandler(),
                new NetscapeDomainHandler(),
                new BasicSecureHandler(),
                new BasicCommentHandler(),
                new BasicExpiresHandler(
                        datepatterns != null ? datepatterns.clone() : new String[]{EXPIRES_PATTERN}));
    }

    NetscapeDraftSpec(final CommonCookieAttributeHandler... handlers) {
        super(handlers);
    }

    public NetscapeDraftSpec() {
        this((String[]) null);
    }
……
}

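At this point the picture is fairly clear: we only need to convert the cookie's expires value into the expected date format and then hand the header to the default parser.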
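The following is a minimal sketch of that conversion using only JDK classes, so it can be run without HttpClient on the classpath. It assumes that java.text.SimpleDateFormat with Locale.US and the GMT time zone behaves comparably to HttpClient's own date handling; the class name ExpiresFormatDemo and its methods are hypothetical names chosen for illustration.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

class ExpiresFormatDemo {
    // Same pattern string as the EXPIRES_PATTERN constant quoted above
    static final String PATTERN = "EEE, dd-MMM-yy HH:mm:ss z";

    // SimpleDateFormat is not thread-safe, so a fresh instance is created per call
    static SimpleDateFormat newFormat() {
        SimpleDateFormat fmt = new SimpleDateFormat(PATTERN, Locale.US);
        fmt.setTimeZone(TimeZone.getTimeZone("GMT"));
        return fmt;
    }

    // Returns true if the text can be parsed with the Netscape-style pattern
    static boolean parses(String text) {
        try {
            newFormat().parse(text);
            return true;
        } catch (ParseException e) {
            return false;
        }
    }

    // Converts Unix epoch seconds into a Netscape-style date string
    static String toNetscapeDate(long epochSeconds) {
        return newFormat().format(new Date(epochSeconds * 1000L));
    }

    public static void main(String[] args) {
        String raw = "1505204523"; // the Expires value from the warning above
        String converted = toNetscapeDate(Long.parseLong(raw));
        System.out.println(raw + " parses: " + parses(raw));
        System.out.println(converted + " parses: " + parses(converted));
    }
}
```

The raw timestamp is rejected because the pattern expects a day name first, while the converted value (Tue, 12-Sep-17 08:22:03 GMT for this timestamp) is accepted.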

Solution:

  1. Define a custom CookieSpec class that extends DefaultCookieSpec
  2. Override the parse method
  3. Convert the cookie's expires value into the correct date format
  4. Call the default parse method

The implementation is shown below (the URL is not disclosed and has been redacted):

public class TestHttpClient {
    
    String url = sth; // placeholder: the real URL has been redacted

    class MyCookieSpec extends DefaultCookieSpec {
        @Override
        public List<Cookie> parse(Header header, CookieOrigin cookieOrigin) throws MalformedCookieException {
            String value = header.getValue();
            String prefix = "Expires=";
            if (value.contains(prefix)) {
                // Take the text after "Expires=", up to the next ';' if there is one
                String expires = value.substring(value.indexOf(prefix) + prefix.length());
                int end = expires.indexOf(';');
                if (end >= 0) {
                    expires = expires.substring(0, end);
                }
                // Only rewrite a 10-digit Unix timestamp (seconds); leave any other value untouched
                if (expires.matches("\\d{10}")) {
                    String date = DateUtils.formatDate(new Date(Long.parseLong(expires) * 1000L), "EEE, dd-MMM-yy HH:mm:ss z");
                    value = value.replace(prefix + expires, prefix + date);
                }
            }
            // Hand the (possibly rewritten) header to the default parser
            header = new BasicHeader(header.getName(), value);
            return super.parse(header, cookieOrigin);
        }
    }

    @Test
    public void getCookie() {

        CloseableHttpClient httpClient = HttpClients.createDefault();

        Registry<CookieSpecProvider> cookieSpecProviderRegistry = RegistryBuilder.<CookieSpecProvider>create()
                .register("myCookieSpec", context -> new MyCookieSpec()).build(); // register the custom CookieSpec

        HttpClientContext context = HttpClientContext.create();
        context.setCookieSpecRegistry(cookieSpecProviderRegistry);

        HttpGet get = new HttpGet(url);
        get.setConfig(RequestConfig.custom().setCookieSpec("myCookieSpec").build());

        try {
            CloseableHttpResponse response = httpClient.execute(get, context);
            try{
                System.out.println(">>>>>>headers:");
                Arrays.stream(response.getAllHeaders()).forEach(System.out::println);
                System.out.println(">>>>>>cookies:");
                context.getCookieStore().getCookies().forEach(System.out::println);
            }
            finally {
                response.close();
            }
        } catch (IOException e) {
            e.printStackTrace();
        }finally {
            try {
                httpClient.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}

Run it again and the correct result is printed. Perfect!

>>>>>>headers:
Date: Tue, 12 Sep 2017 07:24:10 GMT
Content-Type: text/html
Connection: keep-alive
Set-Cookie: yd_cookie=9f521fc5-0248-4ab3ee650ca50b1c7abb1cd2526b830e620f; Expires=1505208250; Path=/; HttpOnly
Cache-Control: no-cache, no-store
Server: WAF/2.4-12.1
>>>>>>cookies:
[version: 0][name: yd_cookie][value: 9f521fc5-0248-4ab3ee650ca50b1c7abb1cd2526b830e620f][domain: www.sth.com][path: /][expiry: Tue Sep 12 17:24:10 CST 2017]
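
The header rewrite at the heart of this fix can also be checked without any network access. The sketch below uses only JDK classes; rewriteExpires and ExpiresRewriteDemo are hypothetical names, not part of HttpClient, and the regular expression is an assumption about which inputs are worth handling (a 10-digit Expires value, in any letter case, either followed by ';' or at the end of the header).

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ExpiresRewriteDemo {
    // Matches "expires=<10 digits>" followed by ';' or end of input, ignoring case
    private static final Pattern EPOCH_EXPIRES =
            Pattern.compile("(expires=)(\\d{10})(?=\\s*(;|$))", Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: rewrites an epoch-seconds Expires attribute into a date string
    static String rewriteExpires(String setCookieValue) {
        Matcher m = EPOCH_EXPIRES.matcher(setCookieValue);
        if (!m.find()) {
            return setCookieValue; // nothing to fix, return the value unchanged
        }
        SimpleDateFormat fmt = new SimpleDateFormat("EEE, dd-MMM-yy HH:mm:ss z", Locale.US);
        fmt.setTimeZone(TimeZone.getTimeZone("GMT"));
        String date = fmt.format(new Date(Long.parseLong(m.group(2)) * 1000L));
        // Replace only the digits, keeping the rest of the header as it was
        return setCookieValue.substring(0, m.start(2)) + date + setCookieValue.substring(m.end(2));
    }

    public static void main(String[] args) {
        // Timestamp in the middle of the header, as in the response above
        System.out.println(rewriteExpires("yd_cookie=abc; Expires=1505208250; Path=/; HttpOnly"));
        // Timestamp as the last attribute, with no trailing ';'
        System.out.println(rewriteExpires("yd_cookie=abc; Path=/; Expires=1505208250"));
        // A header without an Expires attribute is left alone
        System.out.println(rewriteExpires("BDORZ=27315; max-age=86400; domain=.baidu.com; path=/"));
    }
}
```

The formatted date is in GMT (09:24:10 for 1505208250), which corresponds to the 17:24:10 CST expiry printed above.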