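In the course of a project you will sometimes run into odd or special requirements that call for scraping pages, your own or someone else's, to get the data you need.
(In other words, a simple crawler.) There are many ways to fetch a page. The commonly used ones are: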
1. HttpClient
import java.io.IOException;
import java.nio.charset.StandardCharsets;

import org.apache.http.HttpEntity;
import org.apache.http.ParseException;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
import org.junit.Test;

public class HttpClientCrawlTest {

    @Test
    public void crawSignHtmlTest() {
        // try-with-resources closes the client and the response, releasing the connection
        try (CloseableHttpClient httpclient = HttpClients.createDefault()) {
            // Create the GET request
            HttpGet httpget = new HttpGet("http://127.0.0.1:8080/index.html?companyName=testCompany");

            httpget.setHeader("Accept", "text/html, */*; q=0.01");
            httpget.setHeader("Accept-Encoding", "gzip, deflate, sdch");
            httpget.setHeader("Accept-Language", "zh-CN,zh;q=0.8");
            httpget.setHeader("Connection", "keep-alive");
            httpget.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.124 Safari/537.36");

            // System.out.println("executing request " + httpget.getURI());
            // Execute the GET request
            try (CloseableHttpResponse response = httpclient.execute(httpget)) {
                // Response entity
                HttpEntity entity = response.getEntity();
                // Response status
                System.out.println(response.getStatusLine());
                if (entity != null) {
                    // System.out.println("response length: " + entity.getContentLength());
                    // Response body; UTF-8 applies only when the response declares no charset
                    System.out.println("response content: ");
                    System.out.println(EntityUtils.toString(entity, StandardCharsets.UTF_8));
                }
            }
        } catch (IOException | ParseException e) {
            // ClientProtocolException is a subclass of IOException, so it is covered here
            e.printStackTrace();
        }
    }
}
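What HttpClient fetches is the static source of index.html. If the page contains JavaScript that needs to run, the fetched page will not have had that JavaScript executed.
To get the HTML source as it looks after JavaScript rendering, you can use HtmlUnit.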
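To make the point concrete, here is a minimal sketch that needs only the JDK (Java 11+). It swaps in the built-in java.net.http.HttpClient for Apache HttpClient and serves a made-up page from an in-process com.sun.net.httpserver.HttpServer, so nothing external has to be running. The class name, the page content and the fetchDemoPage helper are all assumptions made for illustration, not part of the original setup.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

class StaticFetchDemo {
    // Hypothetical page: in a browser the script would replace "pending" with "rendered".
    static final String PAGE =
        "<html><body><div id=\"out\">pending</div>"
        + "<script>document.getElementById('out').textContent='rendered';</script>"
        + "</body></html>";

    // Plain HTTP GET: returns the response body exactly as the server sent it.
    static String fetch(String url) throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        return HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.ofString(StandardCharsets.UTF_8))
            .body();
    }

    // Serves PAGE on a free local port, fetches it once, then stops the server.
    static String fetchDemoPage() {
        HttpServer server = null;
        try {
            server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
            server.createContext("/index.html", exchange -> {
                byte[] bytes = PAGE.getBytes(StandardCharsets.UTF_8);
                exchange.getResponseHeaders().add("Content-Type", "text/html; charset=utf-8");
                exchange.sendResponseHeaders(200, bytes.length);
                try (OutputStream out = exchange.getResponseBody()) {
                    out.write(bytes);
                }
            });
            server.start();
            int port = server.getAddress().getPort();
            return fetch("http://127.0.0.1:" + port + "/index.html");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            if (server != null) {
                server.stop(0);
            }
        }
    }

    public static void main(String[] args) {
        String body = fetchDemoPage();
        // The script tag arrives as text and the div still holds its original content.
        System.out.println("script tag present: " + body.contains("<script>"));
        System.out.println("div still pending: " + body.contains(">pending<"));
    }
}
```

The fetched body is just the bytes the server sent: the script is still there as text and the div it would have updated is unchanged, which is the limitation described above.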
2. HtmlUnit
Add the HtmlUnit jar and call it to get the source after the JavaScript has run.
import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import org.junit.Test;

public class HtmlUnitCrawlTest {

    @Test
    public void htmlUnitSignTest() throws Exception {
        // try-with-resources closes the WebClient when the test finishes
        try (WebClient wc = new WebClient(BrowserVersion.CHROME)) {
            wc.setJavaScriptTimeout(5000);
            wc.getOptions().setUseInsecureSSL(true); // accept any host, whether or not its certificate is valid
            wc.getOptions().setJavaScriptEnabled(true); // enable JavaScript
            wc.getOptions().setCssEnabled(false); // disable CSS support
            wc.getOptions().setThrowExceptionOnScriptError(false); // do not throw on JavaScript errors
            wc.getOptions().setTimeout(100000); // connection timeout in milliseconds
            wc.getOptions().setDoNotTrackEnabled(false);
            wc.getOptions().setActiveXNative(true);

            wc.addRequestHeader("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3");
            wc.addRequestHeader("Accept-Encoding", "gzip, deflate, br");
            wc.addRequestHeader("Accept-Language", "zh-CN,zh;q=0.9");
            wc.addRequestHeader("Connection", "keep-alive");
            wc.addRequestHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.108 Safari/537.36");

            // HtmlPage htmlpage = wc.getPage("http://127.0.0.1:8081/demo.html?companyName=testCompany");
            HtmlPage htmlpage = wc.getPage("http://127.0.0.1:8081/sign.html?companyName=testCompany&p=1");
            String res = htmlpage.asXml();
            // Process the source
            System.out.println(res);

            // HtmlForm form = htmlpage.getFormByName("f");
            // HtmlButton button = form.getButtonByName("btnDomName"); // the submit button
            // HtmlPage nextPage = button.click();
            // System.out.println("Waiting 2 seconds");
            // Thread.sleep(2000);
            // System.out.println(nextPage.asText());
        }
    }
}
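HtmlUnit builds a browser simulator through new WebClient(), runs the JavaScript in the fetched HTML source, and returns the HTML source as it stands after that JavaScript has executed.
In some special scenarios, though, such as fetching the base64 data drawn by a canvas, the data came out wrong and did not match what a real browser produces. (A huge pitfall; I wasted a lot of time on it.)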
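One quick way to notice this kind of problem is to decode the scraped string and check whether it at least starts like a PNG. The sketch below uses only java.util.Base64 and assumes the canvas was exported with toDataURL() in its default PNG format, so the string is either a data:image/png;base64,... URL or the bare base64 payload. The class and method names are made up for illustration.

```java
import java.util.Arrays;
import java.util.Base64;

class CanvasDataUrl {
    // The first eight bytes of every PNG file.
    private static final byte[] PNG_SIGNATURE =
        {(byte) 0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A};

    // Decodes a full data URL or a bare base64 string into raw bytes.
    static byte[] decode(String dataUrlOrBase64) {
        String text = dataUrlOrBase64.trim();
        int comma = text.indexOf(',');
        // Drop the "data:<type>;base64," prefix when there is one.
        String payload = text.startsWith("data:") && comma >= 0
            ? text.substring(comma + 1)
            : text;
        // The MIME decoder tolerates line breaks inside the payload.
        return Base64.getMimeDecoder().decode(payload);
    }

    // True when the bytes begin with the PNG signature.
    static boolean looksLikePng(byte[] bytes) {
        return bytes.length >= PNG_SIGNATURE.length
            && Arrays.equals(Arrays.copyOf(bytes, PNG_SIGNATURE.length), PNG_SIGNATURE);
    }

    public static void main(String[] args) {
        String dataUrl = "data:image/png;base64,"
            + Base64.getEncoder().encodeToString(PNG_SIGNATURE);
        System.out.println(looksLikePng(decode(dataUrl)));      // signature only, but it matches
        System.out.println(looksLikePng(decode("aGVsbG8=")));   // "hello" is not a PNG
    }
}
```

This only checks the file header. It will not tell you whether the drawing itself matches what the browser rendered, so comparing the decoded bytes against a browser export is still the real test.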
3. Selenium
Add the Selenium jar and also download chromedriver.exe. Calling it likewise returns the source after the JavaScript has run.
import java.io.File;
import java.io.IOException;
import java.util.Base64;

import org.apache.commons.io.FileUtils;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;

public class SeleniumCrawl {
    public static void main(String[] args) throws IOException {
        // Path to the chromedriver executable
        System.setProperty("webdriver.chrome.driver", "/srv/chromedriver.exe");
        ChromeOptions options = new ChromeOptions();
        options.addArguments("--headless");
        // Pass the options in; new ChromeDriver() without them opens a visible window
        WebDriver driver = new ChromeDriver(options);
        try {
            driver.get("http://127.0.0.1:8080/index.html?companyName=testCompany"); // open the page

            // Information about the current page
            System.out.println("Title:" + driver.getTitle());
            System.out.println("currentUrl:" + driver.getCurrentUrl());

            WebElement imgDom = driver.findElement(By.id("imgDom"));
            String text = imgDom.getText();
            System.out.println(text);

            // Drop a "data:image/png;base64," prefix if there is one, then decode.
            // Assumption: java.util.Base64 replaces the Base64Util helper, whose source is not shown.
            String imgBase64 = text.substring(text.indexOf(",") + 1);
            byte[] imageBytes = Base64.getMimeDecoder().decode(imgBase64);
            FileUtils.writeByteArrayToFile(new File("/srv/charter44.png"), imageBytes);
        } finally {
            driver.quit(); // close the browser and stop the chromedriver process
        }
    }
}
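Selenium also works through a browser, created here with new ChromeDriver() behind the WebDriver interface. It runs the JavaScript in the fetched HTML and returns the resulting source, and the poor canvas support seen with HtmlUnit is not a problem here either: the canvas data comes out correct. Thumbs up for Selenium!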