Web Development - Selenium Automation & Crawlers

Automatically Crawling Orders from Taobao

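The target is the Taobao member login page. The crawlers I wrote before all logged in either through a framework or by fetching cookies from the login page and injecting them into later requests. Taobao's anti-crawling measures make its cookies hard to work out, though: many of them are computed by JavaScript, so I had to read through the page source, which only left me with a headache...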

First attempt

(1) Login

Fetching the login page with a Jsoup GET request successfully returns cookies:

            /**
             * Initialize the Taobao login page
             */
            Response firstLoginInitResp = Jsoup.connect("https://login.taobao.com/member/login.jhtml?redirectURL=http%3A%2F%2Ftrade.taobao.com%2Ftrade%2Fitemlist%2Flist_export_order.htm")
                    .header("Host", "login.taobao.com")
                    .header("Connection", "keep-alive")
                    .header("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8")
                    .header("Upgrade-Insecure-Requests", "1")
                    .userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.93 Safari/537.36")
                    .header("Accept-Encoding", "gzip, deflate, sdch")
                    .header("Accept-Language", "zh-CN,zh;q=0.8")
                    .execute();
            Map<String, String> firstLoginInitCookies = firstLoginInitResp.cookies();
            System.out.println("code: "+firstLoginInitResp.statusCode()+", msg: "+firstLoginInitResp.statusMessage()+", cookies returned by the first Taobao login request: "+firstLoginInitCookies.toString());
_tb_token_=e71873665bdae
t=7770a28456dfcad8106b11406e3bc765
cookie2=17c4314a2a5b448f59aa038202b96019
v=0

Once the page has loaded, JS dynamically adds two more cookies:

l=
isg=

Finally, re-inject the cookies and send a request body to the login page (so that the JS sets its cookies dynamically again):

Response secondLoginInitResp = Jsoup.connect("https://login.taobao.com/member/login.jhtml?redirectURL=http%3A%2F%2Ftrade.taobao.com%2Ftrade%2Fitemlist%2Flist_export_order.htm%3Fpage_no%3D1")
                    .header("Host", "login.taobao.com")
                    .header("Connection", "keep-alive")
                    .header("Content-Length", secondLoginInitData.length()+"")
                    .header("Cache-Control", "max-age=0")
                    .header("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8")
                    .header("Origin", "https://login.taobao.com")
                    .header("Upgrade-Insecure-Requests", "1")
                    .userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.93 Safari/537.36")
                    .header("Content-Type", "application/x-www-form-urlencoded")
                    .referrer("https://login.taobao.com/member/login.jhtml?redirectURL=http%3A%2F%2Ftrade.taobao.com%2Ftrade%2Fitemlist%2Flist_export_order.htm%3Fpage_no%3D1")
                    .header("Accept-Encoding", "gzip, deflate")
                    .header("Accept-Language", "zh-CN,zh;q=0.8")
                    .cookies(firstLoginInitCookies)
                    .data(secondLoginInitMap)
                    .execute();
            Map<String, String> secondLoginInitCookies = secondLoginInitResp.cookies();
            System.out.println("code: "+secondLoginInitResp.statusCode()+", msg: "+secondLoginInitResp.statusMessage()+", cookies returned by the second Taobao login request: "+secondLoginInitCookies.toString());

The cookies returned were empty. I'll spare you the details... I had to try a different approach.

Second attempt

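This time I use the Selenium automation framework to log in automatically, then take the browser's cookies and inject them into the requests to finish the crawl.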
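Handing the browser's cookies to an HTTP client means turning them into a shape the client accepts. Below is a minimal sketch, assuming the cookies arrive as a single "name=value; name2=value2; " string. The class name CookieHeader and its two methods are hypothetical helpers of mine, not part of Selenium or Jsoup. The resulting map could then be passed to Jsoup's Connection.cookies(Map), or the formatted string sent as a raw Cookie header.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: converts between a Cookie header string and a map. */
final class CookieHeader {

    private CookieHeader() {}

    /** Splits "name=value; name2=value2; " into an ordered name -> value map. */
    static Map<String, String> parse(String header) {
        Map<String, String> cookies = new LinkedHashMap<>();
        if (header == null) {
            return cookies;
        }
        for (String pair : header.split(";")) {
            String trimmed = pair.trim();
            int eq = trimmed.indexOf('=');
            if (eq <= 0) {
                continue; // skip empty fragments and entries without a name
            }
            // Split on the first '=' only: cookie values may contain '=' themselves.
            cookies.put(trimmed.substring(0, eq), trimmed.substring(eq + 1));
        }
        return cookies;
    }

    /** Joins the map back into a single Cookie request-header value. */
    static String format(Map<String, String> cookies) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : cookies.entrySet()) {
            if (sb.length() > 0) {
                sb.append("; ");
            }
            sb.append(e.getKey()).append('=').append(e.getValue());
        }
        return sb.toString();
    }
}
```

Keeping the cookies in a map rather than one long string makes it easier to inspect individual values (such as _tb_token_) when a request is rejected.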

Because the automated login runs in a real browser, make sure your Firefox, Chrome, or IE version matches your Selenium version (I used Firefox 24 and Selenium 2.40).

import java.util.Map;
import java.util.Set;
import java.util.concurrent.TimeUnit;

import org.jsoup.Connection.Response;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.openqa.selenium.By;
import org.openqa.selenium.Cookie;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.firefox.FirefoxDriver;

import common.DateUtil;
import common.FileUtil;
import common.Log;

/**
 * Taobao crawler
 * @author Alex
 * @date 2017-03-22
 */
public class TaobaoCrawler extends Log{
    
    public String login(String username, String password){
        logger.info("Starting Firefox browser...");
        try {
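            WebDriver webDriver = new FirefoxDriver(); // create the Firefox driver (Chrome and IE need a separately downloaded driver program plus a browser plugin, and their versions must match as well, which is more hassle; look up the version compatibility table)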
            
            webDriver.manage().timeouts().implicitlyWait(10, TimeUnit.SECONDS); 
            webDriver.get("https://login.taobao.com/member/login.jhtml?redirectURL=http://trade.taobao.com/trade/itemlist/list_export_order.htm?page_no=1");
            
            WebElement passLoginEle= webDriver.findElement(By.xpath("//a[@class='forget-pwd J_Quick2Static' and @target='_blank' and @href='']")); // the "password login" link
            logger.info("Password login link displayed: "+passLoginEle.isDisplayed());
            passLoginEle.click(); // show the username/password form (simulates a click so the hidden view becomes visible)
            
            webDriver.findElement(By.id("TPL_username_1")).clear();
            webDriver.findElement(By.id("TPL_username_1")).sendKeys(username); // enter the username
            webDriver.findElement(By.id("TPL_password_1")).clear();
            webDriver.findElement(By.id("TPL_password_1")).sendKeys(password); // enter the password
            webDriver.findElement(By.id("J_SubmitStatic")).click(); // click the login button
            webDriver.switchTo().defaultContent();
            
            try {
                while (true) { // keep polling: once the current URL is no longer the login page URL, the browser has redirected
                    Thread.sleep(500L);
                    if (!webDriver.getCurrentUrl().startsWith("https://login.taobao.com/member/login.jhtml")) {
                        break;
                    }
                }
            } catch (InterruptedException e) {
                e.printStackTrace();
            }

            // Collect the cookies. Leaving the loop above is treated as a successful login; that check is admittedly loose and could be tightened.
            StringBuilder cookieStr = new StringBuilder();
            Set<Cookie> cookies = webDriver.manage().getCookies();
            for (Cookie cookie : cookies) {
                cookieStr.append(cookie.getName() + "=" + cookie.getValue() + "; ");
            }
            logger.info("Account "+username+": login succeeded");
            return cookieStr.toString();
        } catch (Exception e) {
            // TODO: handle exception
            e.printStackTrace();
            logger.info("Account "+username+": login failed, possibly blocked by a CAPTCHA");
            logger.error(e.getMessage());
            return null;
        }
    }
    
    public String getOrderUrl(String cookie){
        try {
            Response orderResp = Jsoup.connect("https://trade.taobao.com/trade/itemlist/list_export_order.htm?page_no=1")
                    .header("Host", "trade.taobao.com") // must match the host being requested
                    .header("Connection", "keep-alive")
                    .header("Cache-Control", "max-age=0")
                    .header("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8")
                    .header("Origin", "https://login.taobao.com")
                    .header("Upgrade-Insecure-Requests", "1")
                    .userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.93 Safari/537.36")
                    .header("Content-Type", "application/x-www-form-urlencoded")
                    .header("Accept-Encoding", "gzip, deflate")
                    .header("Accept-Language", "zh-CN,zh;q=0.8")
                    .header("Cookie", cookie) // send the "name=value; ..." string as the raw Cookie header; .cookie("Cookie", ...) would create a single cookie named "Cookie"
                    .execute();
            logger.info("Order page response code: "+orderResp.statusCode()+", msg: "+orderResp.statusMessage());
            
            Document doc = orderResp.parse();
            Element orderEle = doc.getElementsByAttributeValue("title", "下載訂單報表").get(0); // take the first match
            
            String orderUrl = orderEle.attr("href");
            logger.info("Order download URL: "+orderUrl);
            return orderUrl;
        } catch (Exception e) {
            // TODO: handle exception
            e.printStackTrace();
            logger.error(e.getMessage());
            return null;
        }
    }
    
    public static void main(String[] args){
        TaobaoCrawler crawler = new TaobaoCrawler();
        Map<String, String> map = FileUtil.propToMap();
        String cookie = crawler.login(map.get("username"), map.get("password"));
        String orderUrl = crawler.getOrderUrl(cookie);
    }

}
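
The redirect check in login() loops forever if the browser never leaves the login URL, for example when a CAPTCHA appears. Here is a minimal sketch of a bounded version, assuming a hypothetical helper Poller.waitUntil (my own name, not a Selenium API) that gives up after a timeout. It returns false on timeout or interruption so that the caller can treat either as a failed login.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical helper: polls a condition until it holds or a timeout expires. */
final class Poller {

    private Poller() {}

    /**
     * Checks the condition immediately, then every pollMillis, for at most
     * roughly timeoutMillis. Returns true as soon as the condition holds,
     * false if the timeout expires or the thread is interrupted.
     */
    static boolean waitUntil(BooleanSupplier condition, long timeoutMillis, long pollMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false; // timed out
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                return false;
            }
        }
    }
}
```

Inside login(), the while (true) block could then become something like Poller.waitUntil(() -> !webDriver.getCurrentUrl().startsWith("https://login.taobao.com/member/login.jhtml"), 60000L, 500L), with a false result handled as a failed login. The 60-second limit is an arbitrary choice of mine.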

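Ordinary CAPTCHAs can be handled, but when the site verifies the user by making them drag a slider, it becomes very hard to get past. If you have time, please give it a try and share your suggestions...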

That's all for now. I'm not much of a writer, so please bear with me.
