Using HttpClient 4.3.4 to log in automatically and scrape China Unicom user information and billing data (GET/POST/Cookie)

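I. What is HttpClient?

HTTP is probably the most widely used and most important protocol on the Internet today, and more and more Java applications need to access network resources directly over HTTP. Although the java.net package in the JDK already provides basic HTTP functionality, for most applications what the JDK itself offers is not rich or flexible enough. HttpClient is a subproject of Apache Jakarta Commons that provides an efficient, up-to-date, feature-rich client-side toolkit for the HTTP protocol, and it supports the latest versions and recommendations of HTTP. HttpClient is already used in many projects; for example, Cactus and HTMLUnit, two other well-known open source projects at Apache Jakarta, both use HttpClient. The latest version of HttpClient is currently 4.3.4 (2014-06-22).

----- quoted from Baidu Baike

Put simply, HttpClient is an Apache library, shipped as a jar, that wraps HTTP. The 4.x line is published under Apache HttpComponents, which is why the Maven groupId used below is org.apache.httpcomponents.

The rest of this post uses GET and POST requests to log in to the China Unicom website and scrape a user's basic account information and billing data.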

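II. Create a new Maven project named httpclient

My environment is JDK 1.7 + IntelliJ IDEA 13.0 + Ubuntu 12.04 + Maven + HttpClient 4.3.4. First, create a Maven project:

As shown in the figure, choose the quickstart archetype.

Then click Next through the remaining steps.

Once the project is created, it looks like the figure below:

 

Double-click pom.xml and add the required dependency: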

    <dependency>
        <groupId>org.apache.httpcomponents</groupId>
        <artifactId>httpclient</artifactId>
        <version>4.3.4</version>
    </dependency>

Maven downloads the other jars that HttpClient depends on automatically. The jars actually needed are shown in the figure below:

 

III. Log in to China Unicom and scrape data

1. Simulate login with GET and scrape monthly billing data

China Unicom offers two login methods:

 

 

 

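The two figures above differ in that one login form requires a captcha and the other does not. We start with the login that does not require a captcha.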

package com.amos;

import org.apache.http.Header;
import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.util.EntityUtils;

import java.io.File;
import java.io.FileOutputStream;
import java.io.InputStream;

/**
 * @author amosli
 * Logs in to China Unicom and scrapes data
 */

public class LoginChinaUnicom {
    /**
     * @param args
     * @throws Exception
     */
    public static void main(String[] args) throws Exception {

        String name = "YOUR_MOBILE_NUMBER";   // China Unicom mobile number
        String pwd = "YOUR_SERVICE_PASSWORD"; // mobile service password

        String url = "https://uac.10010.com/portal/Service/MallLogin?callback=jQuery17202691898950318097_1403425938090&redirectURL=http%3A%2F%2Fwww.10010.com&userName=" + name + "&password=" + pwd + "&pwdType=01&productType=01&redirectType=01&rememberMe=1";

        // DefaultHttpClient is deprecated since HttpClient 4.3 but still available in 4.3.4
        HttpClient httpClient = new DefaultHttpClient();
        HttpGet httpGet = new HttpGet(url);
        HttpResponse loginResponse = httpClient.execute(httpGet);

        if (loginResponse.getStatusLine().getStatusCode() == 200) {
            for (Header head : loginResponse.getAllHeaders()) {
                System.out.println(head);
            }
            HttpEntity loginEntity = loginResponse.getEntity();
            String loginEntityContent = EntityUtils.toString(loginEntity);
            System.out.println("Login status: " + loginEntityContent);
            // If login succeeded
            if (loginEntityContent.contains("resultCode:\"0000\"")) {

                // Billing months to fetch
                String months[] = new String[]{"201401", "201402", "201403", "201404", "201405"};

                for (String month : months) {
                    String billurl = "http://iservice.10010.com/ehallService/static/historyBiil/execute/YH102010002/QUERY_YH102010002.processData/QueryYH102010002_Data/" + month + "/undefined";
                    HttpPost httpPost = new HttpPost(billurl);
                    HttpResponse billresponse = httpClient.execute(httpPost);
                    if (billresponse.getStatusLine().getStatusCode() == 200) {
                        saveToLocal(billresponse.getEntity(), "chinaunicom.bill." + month + ".2.html");
                    } else {
                        // Release the connection before the next request
                        EntityUtils.consume(billresponse.getEntity());
                    }
                }
            }
        }
        }

    }

First find the login URL and the parameters it expects. The mobile number and service password are not provided here.
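The login URL in the listing above concatenates the name and password directly into the query string, which breaks as soon as a value contains a character such as `&` or a space. Below is a minimal sketch of building the query string with each value URL-encoded, using only JDK classes. The class name QueryStringBuilder and the method build are hypothetical, introduced only for illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of the original post or of HttpClient.
class QueryStringBuilder {

    /** Appends the parameters to baseUrl, URL-encoding every key and value as UTF-8. */
    static String build(String baseUrl, Map<String, String> params) {
        StringBuilder sb = new StringBuilder(baseUrl);
        char separator = baseUrl.contains("?") ? '&' : '?';
        try {
            for (Map.Entry<String, String> entry : params.entrySet()) {
                sb.append(separator)
                  .append(URLEncoder.encode(entry.getKey(), "UTF-8"))
                  .append('=')
                  .append(URLEncoder.encode(entry.getValue(), "UTF-8"));
                separator = '&';
            }
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required on every Java platform, so this should not happen
            throw new IllegalStateException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the parameters in insertion order
        Map<String, String> params = new LinkedHashMap<String, String>();
        params.put("redirectURL", "http://www.10010.com");
        params.put("userName", "YOUR_MOBILE_NUMBER");
        params.put("pwdType", "01");
        System.out.println(build("https://uac.10010.com/portal/Service/MallLogin", params));
    }
}
```

HttpClient also ships a URIBuilder class in org.apache.http.client.utils that serves the same purpose if you prefer to stay inside the library.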

Create a DefaultHttpClient and send the login request with GET. If login succeeds, the resultCode in the response body is 0000.
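The listing above tests for success with a plain substring match on `resultCode:"0000"`. A slightly more tolerant check can extract the code with a regular expression, as in the sketch below. The sample response bodies are assumptions modelled on that substring check, not captured server output, and the failure code shown is made up. LoginResultParser is a hypothetical name.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the response shape is assumed from the substring check in the post.
class LoginResultParser {

    // Matches resultCode:"...", allowing optional whitespace around the colon
    private static final Pattern RESULT_CODE =
            Pattern.compile("resultCode\\s*:\\s*\"([^\"]*)\"");

    /** Returns the resultCode found in the body, or null if there is none. */
    static String resultCode(String body) {
        Matcher matcher = RESULT_CODE.matcher(body);
        return matcher.find() ? matcher.group(1) : null;
    }

    /** Login is treated as successful only when resultCode is exactly 0000. */
    static boolean isSuccess(String body) {
        return "0000".equals(resultCode(body));
    }

    public static void main(String[] args) {
        // Assumed JSONP-style bodies, for illustration only
        String ok = "jQuery123({resultCode:\"0000\",redirectURL:\"http://www.10010.com\"});";
        String failed = "jQuery123({resultCode: \"9999\"});";
        System.out.println(isSuccess(ok));
        System.out.println(isSuccess(failed));
    }
}
```

Pulling the code out as a string also makes it easy to log the actual value when login fails.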

Then request each month's bill with HttpPost and write the response to a local file:

    /**
     * Write the response body to a local file.
     *
     * @param httpEntity response entity whose content is saved
     * @param filename   name of the file to create in the output directory
     */
    public static void saveToLocal(HttpEntity httpEntity, String filename) {

        File dir = new File("/home/amosli/workspace/chinaunicom/");
        if (!dir.isDirectory()) {
            dir.mkdirs(); // also creates missing parent directories
        }

        File file = new File(dir, filename);
        // try-with-resources (JDK 7+) closes both streams even if copying fails;
        // FileOutputStream creates the file if it does not exist yet
        try (InputStream inputStream = httpEntity.getContent();
             FileOutputStream fileOutputStream = new FileOutputStream(file)) {
            byte[] bytes = new byte[1024];
            int length;
            // read() returns -1 at end of stream
            while ((length = inputStream.read(bytes)) != -1) {
                fileOutputStream.write(bytes, 0, length);
            }
        } catch (Exception e) {
            e.printStackTrace();
        }

    }
}

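If you only want to print the response, you can use EntityUtils.toString(HttpEntity entity). The source of its two-argument overload, which also accepts a default charset, is as follows: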

 public static String toString(
            final HttpEntity entity, final Charset defaultCharset) throws IOException, ParseException {
        Args.notNull(entity, "Entity");
        final InputStream instream = entity.getContent();
        if (instream == null) {
            return null;
        }
        try {
            Args.check(entity.getContentLength() <= Integer.MAX_VALUE,
                    "HTTP entity too large to be buffered in memory");
            int i = (int)entity.getContentLength();
            if (i < 0) {
                i = 4096;
            }
            Charset charset = null;
            try {
                final ContentType contentType = ContentType.get(entity);
                if (contentType != null) {
                    charset = contentType.getCharset();
                }
            } catch (final UnsupportedCharsetException ex) {
                throw new UnsupportedEncodingException(ex.getMessage());
            }
            if (charset == null) {
                charset = defaultCharset;
            }
            if (charset == null) {
                charset = HTTP.DEF_CONTENT_CHARSET;
            }
            final Reader reader = new InputStreamReader(instream, charset);
            final CharArrayBuffer buffer = new CharArrayBuffer(i);
            final char[] tmp = new char[1024];
            int l;
            while((l = reader.read(tmp)) != -1) {
                buffer.append(tmp, 0, l);
            }
            return buffer.toString();
        } finally {
            instream.close();
        }
    }

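As you can see, the implementation is fairly easy to follow. The charset declared by the entity is tried first, then the default charset passed in (if any), and finally HTTP.DEF_CONTENT_CHARSET, so specifying an encoding is optional.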
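The charset fallback order in the source above can be sketched with JDK classes alone. This is an illustration, not the library's code: it parses the Content-Type header value by hand instead of using ContentType.get, and it hard-codes ISO-8859-1 as the last resort, which is the value of HTTP.DEF_CONTENT_CHARSET in HttpClient 4.x. CharsetFallback and its methods are hypothetical names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical illustration of the fallback order; not HttpClient code.
class CharsetFallback {

    /** Header charset first, then the caller's default, then ISO-8859-1. */
    static Charset resolve(String contentType, Charset defaultCharset) {
        Charset charset = fromContentType(contentType);
        if (charset == null) {
            charset = defaultCharset;
        }
        if (charset == null) {
            charset = StandardCharsets.ISO_8859_1;
        }
        return charset;
    }

    /** Extracts the charset parameter from a Content-Type value, or returns null. */
    static Charset fromContentType(String contentType) {
        if (contentType == null) {
            return null;
        }
        for (String part : contentType.split(";")) {
            String trimmed = part.trim();
            if (trimmed.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = trimmed.substring("charset=".length()).trim().replace("\"", "");
                // Throws an unchecked exception for unknown or malformed charset names
                return Charset.forName(name);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(resolve("text/html; charset=UTF-8", null));
        System.out.println(resolve("text/html", StandardCharsets.US_ASCII));
        System.out.println(resolve(null, null));
    }
}
```

In practice this means a response without a declared charset is decoded as ISO-8859-1 unless you pass a default, so for Chinese pages it is safer to call the overload with an explicit charset.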

2. Log in with a captcha and scrape basic account information

package com.amos;

import org.apache.http.HttpResponse;
import org.apache.http.client.CookieStore;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.cookie.Cookie;
import org.apache.http.impl.client.BasicCookieStore;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

import java.io.BufferedReader;
import java.io.InputStreamReader;

/**
 * Created by amosli on 14-6-22.
 */
public class LoginWithCaptcha {

    public static void main(String args[]) throws Exception {

        // URL that generates the captcha image
        String createCaptchaUrl = "http://uac.10010.com/portal/Service/CreateImage";

        String name = "YOUR_MOBILE_NUMBER";   // China Unicom mobile number
        String pwd = "YOUR_SERVICE_PASSWORD"; // mobile service password

        // Custom cookies can be added to this store if needed
        CookieStore cookieStore = new BasicCookieStore();

        // A single client handles every request, so cookies set while fetching and
        // verifying the captcha are also sent with the login request
        CloseableHttpClient httpClient = HttpClients.custom()
                .setDefaultCookieStore(cookieStore)
                .build();

        // Fetch the captcha
        HttpGet captchaHttpGet = new HttpGet(createCaptchaUrl);
        HttpResponse captchaResponse = httpClient.execute(captchaHttpGet);

        if (captchaResponse.getStatusLine().getStatusCode() == 200) {
            // Save the captcha image locally
            LoginChinaUnicom.saveToLocal(captchaResponse.getEntity(), "chinaunicom.captcha." + System.currentTimeMillis());
        } else {
            // Release the connection before the next request
            EntityUtils.consume(captchaResponse.getEntity());
        }

        // Enter the captcha by hand and verify it
        // (a Scanner over System.in would work as well)
        BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(System.in));
        String captcha;
        String uvc = null;
        boolean verified;

        do {
            System.out.println("Enter the captcha:");
            captcha = bufferedReader.readLine();

            String verifyCaptchaUrl = "http://uac.10010.com/portal/Service/CtaIdyChk?verifyCode=" + captcha + "&verifyType=1";
            HttpGet verifyCaptchaGet = new HttpGet(verifyCaptchaUrl);
            HttpResponse verifyResponse = httpClient.execute(verifyCaptchaGet);
            verified = EntityUtils.toString(verifyResponse.getEntity()).contains("true");

            // The uvc value is carried in the "uacverifykey" cookie
            for (Cookie cookie : cookieStore.getCookies()) {
                System.out.println(cookie.getName() + ":" + cookie.getValue());
                if (cookie.getName().equals("uacverifykey")) {
                    uvc = cookie.getValue();
                }
            }
        } while (!verified);

        // Log in
        String loginurl = "https://uac.10010.com/portal/Service/MallLogin?userName=" + name + "&password=" + pwd + "&pwdType=01&productType=01&verifyCode=" + captcha + "&redirectType=03&uvc=" + uvc;
        HttpGet loginGet = new HttpGet(loginurl);
        CloseableHttpResponse loginResponse = httpClient.execute(loginGet);
        System.out.print("loginResponse:" + EntityUtils.toString(loginResponse.getEntity()));

        // Scrape the basic account information
        HttpPost basicHttpPost = new HttpPost("http://iservice.10010.com/ehallService/static/acctBalance/execute/YH102010005/QUERY_AcctBalance.processData/Result");
        LoginChinaUnicom.saveToLocal(httpClient.execute(basicHttpPost).getEntity(), "chinaunicom.basic.html");

        httpClient.close();
    }

}

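There are two difficulties here: the captcha and the uvc code.

The captcha is written to a local file and then typed in by hand, so it is relatively easy to handle.

The uvc code matters a great deal, and it lives in a cookie. I searched online for a long time without finding how to work with cookies in HttpClient, and only found the answer by reading its source: the cookie store's getCookies() method exposes them.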
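The lookup itself is just a scan of the cookie list for a given name. The sketch below shows the idea with the JDK's own java.net.HttpCookie instead of HttpClient's Cookie interface, so that it runs without any dependency; with HttpClient the loop body is the same, iterating over the cookies returned by getCookies(). CookieLookup is a hypothetical name and the cookie values are placeholders.

```java
import java.net.HttpCookie;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper using JDK cookies as a stand-in for HttpClient's Cookie.
class CookieLookup {

    /** Returns the value of the first cookie with the given name, or null if absent. */
    static String findCookieValue(List<HttpCookie> cookies, String name) {
        for (HttpCookie cookie : cookies) {
            if (cookie.getName().equals(name)) {
                return cookie.getValue();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<HttpCookie> cookies = new ArrayList<HttpCookie>();
        // Placeholder values; a real session would receive these from the server
        cookies.addAll(HttpCookie.parse("JSESSIONID=abc123; Path=/"));
        cookies.addAll(HttpCookie.parse("uacverifykey=uvc456; Path=/"));
        System.out.println(findCookieValue(cookies, "uacverifykey"));
    }
}
```

Returning null for a missing cookie lets the caller decide whether to retry the captcha step or abort.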

 

3. Results

Billing data (returned as JSON, so it may not be easy to read):

  

4. Source code for this post

https://github.com/amosli/crawl/tree/httpclient
