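Over the past few days I have been preparing to scrape data from a website with a program (I won't say which one), to cut down on manual work and save some effort. The stack: Java, Selenium and ChromeDriver, running on Ubuntu 16.04.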
<% for(var i=0; i < loop_times; i++) { %>
<% var items = rider_list.slice(i * num_per_line, (i+1) * num_per_line); %>
<tr>
<% for (var j=0; j < items.length; j++) { %>
<%
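Fetching the raw HTML like this yields no data, because what the page displays is the result of the browser rendering it. So the downloaded source has to be loaded in a browser that executes its JavaScript and renders the page on the client side, and only then can the data be parsed with XPath.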
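To make the "parse with XPath after rendering" step concrete, here is a minimal sketch that uses only the JDK's javax.xml.xpath. The two markup strings are made-up stand-ins (assumptions, not the real site's markup): one for the DOM after the scripts have run, one for the same table before rendering. It only works here because both snippets are well-formed XML; in the actual scraper, Selenium's By.xpath plays this role against the live DOM.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class RenderedXPathSketch {

    // Hypothetical stand-in for the table after the browser has run the page's scripts.
    static final String RENDERED =
        "<table class=\"skillTb\"><tr>"
        + "<td><h5>skill-a</h5><p>3</p></td>"
        + "<td><h5>skill-b</h5><p>5</p></td>"
        + "</tr></table>";

    // Hypothetical stand-in for the same table before rendering: a row with no data cells.
    static final String UNRENDERED = "<table class=\"skillTb\"><tr></tr></table>";

    // Evaluates an XPath expression against well-formed markup and returns the text of each match.
    static List<String> cellTexts(String markup, String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                expression, new InputSource(new StringReader(markup)), XPathConstants.NODESET);
            List<String> out = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                out.add(nodes.item(i).getTextContent());
            }
            return out;
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(cellTexts(RENDERED, "//td/h5"));   // prints [skill-a, skill-b]
        System.out.println(cellTexts(UNRENDERED, "//td/h5")); // prints []
    }
}
```

The same expression finds nothing in the unrendered markup, which is exactly the problem with scraping the raw source.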
WebDriver supports several browsers; Chrome, driven through ChromeDriver, is the one used here.
ChromeDriver can be downloaded from https://sites.google.com/a/chromium.org/chromedriver/downloads (check which Chrome versions each release supports). I used the latest release, 2.37:
Latest Release: ChromeDriver 2.37 Supports Chrome v64-66
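Since a mismatch between Chrome and ChromeDriver is an easy mistake to make, a small check like the following can help. It is only a sketch: it assumes the release note has the "Supports Chrome vMIN-MAX" shape shown above, and that the Chrome version is passed in as a plain dotted string (for example, the number printed by `google-chrome --version`). The class and method names are mine.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DriverCompatSketch {

    // Matches the "Supports Chrome v64-66" part of a ChromeDriver release note.
    private static final Pattern RANGE = Pattern.compile("Supports Chrome v(\\d+)-(\\d+)");

    // Returns true when the major version of chromeVersion lies inside the supported range.
    static boolean supports(String releaseNote, String chromeVersion) {
        Matcher m = RANGE.matcher(releaseNote);
        if (!m.find()) {
            throw new IllegalArgumentException("no version range in: " + releaseNote);
        }
        int min = Integer.parseInt(m.group(1));
        int max = Integer.parseInt(m.group(2));
        int major = Integer.parseInt(chromeVersion.trim().split("\\.")[0]);
        return major >= min && major <= max;
    }

    public static void main(String[] args) {
        String note = "Latest Release: ChromeDriver 2.37 Supports Chrome v64-66";
        System.out.println(supports(note, "65.0.3325.181")); // prints true
        System.out.println(supports(note, "63.0.3239.132")); // prints false
    }
}
```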
$ wget -q -O - https://dl-ssl.google.com/linux/linux_signing_key.pub | sudo apt-key add -
$ sudo sh -c 'echo "deb [arch=amd64] http://dl.google.com/linux/chrome/deb/ stable main" >> /etc/apt/sources.list.d/google.list'
$ sudo apt-get update
$ sudo apt-get install google-chrome-stable
root@iZj6c1imv6wpn7tfmm7nusZ:/work/fantasy# ./chromedriver
Starting ChromeDriver 2.37.544315 (730aa6a5fdba159ac9f4c1e8cbc59bf1b5ce12b7) on port 9515
Only local connections are allowed.
There are a few points to watch out for here:
ChromeOptions options = new ChromeOptions();
options.addArguments("--headless");
options.addArguments("--disable-gpu");
options.addArguments("--no-sandbox");
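These options need to be set. Without them Chrome typically fails to start on a server that has no display, especially when it is launched as root as it is here.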
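The post sets all three flags unconditionally, which is fine on a server. If the same code also has to run on a desktop, one possible variation is to pick the flags from the environment. The sketch below is my own assumption about how the flags map to conditions (no display means headless, root means no sandbox), not something the original code does; the helper name is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class ChromeArgsSketch {

    // Chooses Chrome command-line flags for the current environment.
    // Assumed mapping: no display -> headless mode; running as root -> sandbox disabled.
    static List<String> chromeArgs(boolean hasDisplay, String userName) {
        List<String> args = new ArrayList<>();
        if (!hasDisplay) {
            args.add("--headless");
            args.add("--disable-gpu");
        }
        if ("root".equals(userName)) {
            args.add("--no-sandbox");
        }
        return args;
    }

    public static void main(String[] args) {
        // On Linux, a missing DISPLAY variable usually means there is no X server to draw on.
        boolean hasDisplay = System.getenv("DISPLAY") != null;
        System.out.println(chromeArgs(hasDisplay, System.getProperty("user.name")));
    }
}
```

The resulting list can be handed to Selenium in one call with options.addArguments(list), since ChromeOptions accepts a List<String> as well as individual strings.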
The Java code that scrapes and parses the page:
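import java.math.BigDecimal;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;

private WebDriver webDriver;

public XXXSpider() {
    String driver = System.getProperty("webdriver.chrome.driver");
    if (driver == null) {
        // Fall back to a local path when the system property is not set.
        logger.info("driver property is not set");
        System.setProperty("webdriver.chrome.driver", "/Users/chengpanwang/Downloads/chromedriver");
    } else {
        logger.info("driver: {}", driver);
    }
}

public BigDecimal pageDetail(String url) {
    logger.info("detail page: {}", url);
    ........
    try {
        ChromeOptions options = new ChromeOptions();
        options.addArguments("--headless");
        options.addArguments("--disable-gpu");
        options.addArguments("--no-sandbox");
        webDriver = new ChromeDriver(options);
        // Load the page in Chrome so that its scripts run and the DOM is rendered.
        webDriver.get(url);
        WebElement webElement = webDriver.findElement(By.xpath("/html"));
        WebElement roleSkill = webElement.findElement(By.id("role_skill"));
        logger.info(roleSkill.getText());
        logger.info("selecting the skill tab");
        roleSkill.click();
        // Each cell of the skill table holds a level (<p>) and a name (<h5>).
        WebElement skillTb = webElement.findElement(By.className("skillTb"));
        for (WebElement item : skillTb.findElements(By.tagName("td"))) {
            String level = item.findElement(By.tagName("p")).getText();
            String h5 = item.findElement(By.tagName("h5")).getText();
            .... business-specific code
        }
    } catch (Exception e) {
        logger.error("", e);
    } finally {
        // quit() shuts down the browser and the driver process even after an exception.
        if (webDriver != null) {
            webDriver.quit();
        }
    }
    return price;
}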
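The constructor above falls back to a path on my own machine, which will not exist anywhere else. A slightly more portable lookup could be sketched as follows. The order of precedence, the CHROMEDRIVER_PATH environment variable and the ./chromedriver fallback are all assumptions made for illustration; only the webdriver.chrome.driver property name comes from the code above.

```java
import java.util.Map;

final class DriverPathSketch {

    static final String PROPERTY = "webdriver.chrome.driver";
    static final String ENV_VAR = "CHROMEDRIVER_PATH"; // hypothetical variable name
    static final String FALLBACK = "./chromedriver";   // assumed: binary in the working directory

    // Picks the driver path: explicit system property first, then the environment, then the fallback.
    static String resolve(String propertyValue, Map<String, String> env) {
        if (propertyValue != null && !propertyValue.isEmpty()) {
            return propertyValue;
        }
        String fromEnv = env.get(ENV_VAR);
        if (fromEnv != null && !fromEnv.isEmpty()) {
            return fromEnv;
        }
        return FALLBACK;
    }

    public static void main(String[] args) {
        String path = resolve(System.getProperty(PROPERTY), System.getenv());
        // Selenium reads this property when a ChromeDriver is created.
        System.setProperty(PROPERTY, path);
        System.out.println("driver: " + path);
    }
}
```

Passing the property value and the environment in as arguments keeps the lookup easy to test without touching the real process environment.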