Building a simple crawler with HttpClient, HtmlUnit, and Selenium to scrape page data

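In the course of a project you inevitably run into odd or unusual requirements that mean scraping pages, your own or someone else's, to get the data you need.

(In other words, a simple crawler.) There are many ways to fetch a page; the common ones are: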

 

1, HttpClient

    import java.io.IOException;
    import org.apache.http.HttpEntity;
    import org.apache.http.ParseException;
    import org.apache.http.client.ClientProtocolException;
    import org.apache.http.client.methods.CloseableHttpResponse;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;
    import org.apache.http.util.EntityUtils;
    import org.junit.Test;

    @Test
    public void crawSignHtmlTest() {
        CloseableHttpClient httpclient = HttpClients.createDefault();
        try {
            // Create the GET request
            HttpGet httpget = new HttpGet("http://127.0.0.1:8080/index.html?companyName=testCompany");

            // Browser-like headers. "sdch" is dropped from Accept-Encoding because
            // HttpClient cannot decode it; the stray ")" at the end of the User-Agent is removed.
            httpget.setHeader("Accept", "text/html, */*; q=0.01");
            httpget.setHeader("Accept-Encoding", "gzip, deflate");
            httpget.setHeader("Accept-Language", "zh-CN,zh;q=0.8");
            httpget.setHeader("Connection", "keep-alive");
            httpget.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.124 Safari/537.36");

            //System.out.println("executing request " + httpget.getURI());
            // Execute the GET request
            CloseableHttpResponse response = httpclient.execute(httpget);
            try {
                // Response entity
                HttpEntity entity = response.getEntity();
                // Response status
                System.out.println(response.getStatusLine());
                if (entity != null) {
                    // Response content length
                    //System.out.println("response length: " + entity.getContentLength());
                    // Response content
                    System.out.println("response content: ");
                    System.out.println(EntityUtils.toString(entity));
                }
            } finally {
                response.close();
            }
        } catch (ClientProtocolException e) {
            e.printStackTrace();
        } catch (ParseException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            // Close the client and release its resources
            try {
                httpclient.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

 

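What HttpClient returns is the raw source of the static index.html page. If the page contains JavaScript that has to run, those scripts have not been executed in the fetched result.

To get the HTML as it stands after the JavaScript has rendered it, fetch the page with HtmlUnit instead.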
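The limitation described above can be reproduced without any external site. The following is a minimal sketch, not the setup used in this post: it swaps in the JDK's built-in `java.net.http.HttpClient` (Java 11+) in place of Apache HttpClient and serves a made-up page from a throwaway local `HttpServer`. The class name `StaticFetchDemo`, the `PAGE` markup, and the `fetchRaw` helper are all assumptions introduced for illustration.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

class StaticFetchDemo {
    // Hypothetical page: in a real browser the script replaces "placeholder" with "rendered".
    static final String PAGE =
        "<html><body><div id=\"out\">placeholder</div>"
        + "<script>document.getElementById('out').textContent='rendered';</script>"
        + "</body></html>";

    /** Serves PAGE from a local server and returns the body a plain HTTP GET receives. */
    static String fetchRaw() throws IOException, InterruptedException {
        // Port 0 lets the OS pick a free port.
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        server.createContext("/", exchange -> {
            byte[] body = PAGE.getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().add("Content-Type", "text/html; charset=utf-8");
            exchange.sendResponseHeaders(200, body.length);
            try (var out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start();
        try {
            HttpClient client = HttpClient.newHttpClient();
            HttpRequest request = HttpRequest.newBuilder(
                URI.create("http://127.0.0.1:" + server.getAddress().getPort() + "/")).build();
            return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
        } finally {
            server.stop(0);
        }
    }

    public static void main(String[] args) throws Exception {
        String html = fetchRaw();
        // The div still holds its original text: the script was delivered but never run.
        System.out.println(html.contains(">placeholder<")); // prints "true"
    }
}
```

A plain HTTP client only transfers bytes, so the script arrives as text inside the response and the div keeps its initial content. Anything that depends on the script's effect needs a client that actually executes JavaScript.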

 

2, HtmlUnit

Add the HtmlUnit jar to the project and call it as follows; the result is the page source after the JavaScript has run.

    import com.gargoylesoftware.htmlunit.BrowserVersion;
    import com.gargoylesoftware.htmlunit.WebClient;
    import com.gargoylesoftware.htmlunit.html.HtmlPage;
    import org.junit.Test;

    @Test
    public void htmlUnitSignTest() throws Exception {
        // try-with-resources closes the WebClient even if the fetch throws
        try (WebClient wc = new WebClient(BrowserVersion.CHROME)) {
            wc.setJavaScriptTimeout(5000);
            wc.getOptions().setUseInsecureSSL(true);                 // accept any host, whether or not its certificate is valid
            wc.getOptions().setJavaScriptEnabled(true);              // enable JavaScript
            wc.getOptions().setCssEnabled(false);                    // disable CSS support
            wc.getOptions().setThrowExceptionOnScriptError(false);   // do not throw on JavaScript errors
            wc.getOptions().setTimeout(100000);                      // connection timeout, in milliseconds
            wc.getOptions().setDoNotTrackEnabled(false);
            wc.getOptions().setActiveXNative(true);

            wc.addRequestHeader("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3");
            wc.addRequestHeader("Accept-Encoding", "gzip, deflate, br");
            wc.addRequestHeader("Accept-Language", "zh-CN,zh;q=0.9");
            wc.addRequestHeader("Connection", "keep-alive");
            wc.addRequestHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.108 Safari/537.36");

            //HtmlPage htmlpage = wc.getPage("http://127.0.0.1:8081/demo.html?companyName=testCompany");
            HtmlPage htmlpage = wc.getPage("http://127.0.0.1:8081/sign.html?companyName=testCompany&p=1");
            String res = htmlpage.asXml();
            // Process the rendered source
            System.out.println(res);

            // HtmlForm form = htmlpage.getFormByName("f");
            // HtmlButton button = form.getButtonByName("btnDomName"); // the submit button
            // HtmlPage nextPage = button.click();
            // System.out.println("Waiting 2 seconds");
            // Thread.sleep(2000);
            // System.out.println(nextPage.asText());
        }
    }

 

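HtmlUnit builds a simulated browser with new WebClient(), runs the JavaScript in the HTML it fetched, and returns the HTML as it stands after the scripts have executed.

In some special cases, however, such as grabbing the base64 data of an image drawn on a canvas, the data came out wrong and did not match what a real browser produces for the same page (a nasty pitfall that cost me a lot of time).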

 

3, Selenium

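Add the Selenium jar to the project and download ChromeDriver.exe as well; this approach also gives you the page as it stands after the JavaScript has run.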

 

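    import java.io.File;
    import java.io.IOException;
    import org.apache.commons.io.FileUtils;
    import org.openqa.selenium.By;
    import org.openqa.selenium.WebDriver;
    import org.openqa.selenium.WebElement;
    import org.openqa.selenium.chrome.ChromeDriver;
    import org.openqa.selenium.chrome.ChromeOptions;

    public static void main(String[] args) throws IOException {

        System.setProperty("webdriver.chrome.driver", "/srv/chromedriver.exe"); // path to the chromedriver binary
        ChromeOptions options = new ChromeOptions();
        options.addArguments("--headless");

        // Pass the options so that --headless takes effect;
        // use new ChromeDriver() instead to watch a visible browser window.
        WebDriver driver = new ChromeDriver(options);
        try {
            String url = "http://127.0.0.1:8080/index.html?companyName=testCompany";
            driver.get(url); // open the page

            // Information about the current page
            System.out.println("Title:" + driver.getTitle());
            System.out.println("currentUrl:" + driver.getCurrentUrl());

            // findElement(By.id(...)) replaces ChromeDriver.findElementById, which newer Selenium versions no longer provide
            WebElement imgDom = driver.findElement(By.id("imgDom"));
            System.out.println(imgDom.getText());

            //String imgBase64 = URLDecoder.decode(imgDom.getText(), "UTF-8");
            //imgBase64 = imgBase64.substring(imgBase64.indexOf(",") + 1);
            // Base64Util is a project-local helper (not shown here)
            byte[] fromBASE64ToByte = Base64Util.getFromBASE64ToByte(imgDom.getText());
            FileUtils.writeByteArrayToFile(new File("/srv/charter44.png"), fromBASE64ToByte);
        } finally {
            driver.quit(); // close the browser and stop the chromedriver process
        }
    }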

 

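Selenium also gives you a browser to script, created here with new ChromeDriver() (WebDriver itself is an interface and cannot be instantiated directly). Because it drives a real Chrome rather than a simulation, it not only returns the HTML after the JavaScript has run, it also handles the canvas case that HtmlUnit got wrong. Thumbs up for Selenium!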
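The Selenium example depends on a Base64Util helper that is not shown in this post. As a rough stand-in, here is a minimal sketch using the JDK's `java.util.Base64`, assuming the element text is either a bare Base64 string or a data URL such as `data:image/png;base64,...`. The class name `DataUrlDecoder` and its `decode` method are hypothetical and are not the author's helper.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

class DataUrlDecoder {
    /** Strips an optional "data:...;base64," prefix and decodes the remaining Base64 payload. */
    static byte[] decode(String dataUrl) {
        String trimmed = dataUrl.trim();
        int comma = trimmed.indexOf(',');
        // Only treat the text before the comma as a prefix when it really is a data URL.
        String payload = (trimmed.startsWith("data:") && comma >= 0)
                ? trimmed.substring(comma + 1)
                : trimmed;
        return Base64.getDecoder().decode(payload);
    }

    public static void main(String[] args) {
        // Text payload used here so the result is easy to check by eye.
        byte[] bytes = decode("data:text/plain;base64,aGVsbG8=");
        System.out.println(new String(bytes, StandardCharsets.UTF_8)); // prints "hello"
    }
}
```

With an image payload, the returned bytes can be written straight to a file, as the FileUtils.writeByteArrayToFile call in the example does. `Base64.getDecoder()` rejects line breaks and URL-safe characters, so a payload in one of those forms would need `getMimeDecoder()` or `getUrlDecoder()` instead.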
