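【Preface】
The US-China trade war has been a hot topic lately, so I tried crawling some of the comments about it on Zhihu.
Difficulties: Zhihu's cookies have recently become much more complex, so I log in directly with an account and password instead. Zhihu's front end has also moved to a React stack, which makes selecting page elements considerably harder.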
【Result】
Log in with account and password -> simulate scrolling to load more content -> fetch the answer elements and print them
【Code】
import java.util.List;

import org.openqa.selenium.By;
import org.openqa.selenium.Keys;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.firefox.FirefoxDriver;
import org.openqa.selenium.interactions.Actions;

public class TradeWar {

    public static void main(String[] args) throws InterruptedException {
        System.setProperty("webdriver.gecko.driver", "C:\\code\\selenium\\geckodriver.exe");
        WebDriver driver = new FirefoxDriver();
        Actions action = new Actions(driver);

        // Open the Zhihu sign-in page
        driver.get("https://www.zhihu.com/#signin");
        driverWait(driver, 2000);

        // Enter the account and password ("account" and "password" are placeholders)
        driver.findElement(By.xpath("html/body/div[1]/div/main/div/div/div/div[2]/div[2]/span")).click();
        driver.findElement(By.xpath("html/body/div[1]/div/main/div/div/div/div[2]/div[1]/form/div[1]/div[2]/div[1]/input")).sendKeys("account");
        driver.findElement(By.xpath("html/body/div[1]/div/main/div/div/div/div[2]/div[1]/form/div[2]/div/div[1]/input")).sendKeys("password");
        driver.findElement(By.xpath("html/body/div[1]/div/main/div/div/div/div[2]/div[1]/form/button")).click();

        driver.get("https://www.zhihu.com/topic/20177825/top-answers");

        // Scroll down to load enough content; the count can be raised to 10000+ if needed
        for (int i = 0; i < 100; i++) {
            Thread.sleep(100);
            action.sendKeys(Keys.ARROW_DOWN).perform();
        }

        // Fetch the content and print it
        System.out.println("Start printing");
        List<WebElement> answers = driver.findElements(By.cssSelector("a[target='_blank']"));
        for (WebElement element : answers) {
            String answer = element.getText();
            System.out.println("【Answer】" + answer + "\n");
        }
    }

    // Sleep for the given number of milliseconds by waiting on the driver's monitor
    public static void driverWait(WebDriver driver, long time) {
        try {
            synchronized (driver) {
                System.out.println("begin wait() ThreadName=" + Thread.currentThread().getName());
                driver.wait(time);
                System.out.println("  end wait() ThreadName=" + Thread.currentThread().getName());
            }
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }
}
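The code above relies on fixed pauses (`driverWait(driver, 2000)` and `Thread.sleep(100)`), which can be too short on a slow connection and wastefully long on a fast one. One alternative is to poll a condition until it holds or a timeout expires. The sketch below shows that idea using only the standard library; `Poller` and `waitUntil` are hypothetical names that do not appear in the original post or in Selenium, and the counter in `main` merely stands in for a real page check.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BooleanSupplier;

/** Hypothetical helper for illustration; not part of Selenium or the original code. */
final class Poller {

    private Poller() {}

    /**
     * Re-checks the condition every intervalMillis until it returns true
     * or timeoutMillis has elapsed. Returns true if the condition was met in time,
     * false on timeout or if the thread was interrupted while sleeping.
     */
    static boolean waitUntil(BooleanSupplier condition, long timeoutMillis, long intervalMillis) {
        long start = System.nanoTime();
        long timeoutNanos = timeoutMillis * 1_000_000L;
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - start >= timeoutNanos) {
                return false;
            }
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                // Restore the interrupt flag and give up waiting
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    public static void main(String[] args) {
        // Stand-in for a page check: this condition becomes true on the third poll.
        AtomicInteger polls = new AtomicInteger();
        boolean ready = waitUntil(() -> polls.incrementAndGet() >= 3, 2000, 50);
        System.out.println("ready=" + ready + " after " + polls.get() + " polls");
    }
}
```

In the crawler, the condition could be something like `() -> !driver.findElements(By.cssSelector("a[target='_blank']")).isEmpty()`. Selenium also ships its own `WebDriverWait` class for this purpose, which is probably the better choice in real code.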
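【Comparison with Before】
1. The cookies I used to get carried no expiry time. Now they look like the line below, and cookie-based login no longer works; the front end is also still being changed.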
_zap,469c025b-7e65-4f9f-a00c-75f4cdf7e2ee,.zhihu.com,/,Mon Apr 13 19:28:59 CST 2020
2. Previously the class names below could be used to fetch page elements; now none of them match anything.
// Fetch the questions and answers
List<WebElement> questions = driver.findElements(By.className("question_link"));
List<WebElement> answers = driver.findElements(By.className("zm-item-rich-text"));
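For the saved cookie line shown under item 1, a small parser makes the difference between the old and new formats easy to check. This is only a sketch under assumptions: it reads the sample as five comma-separated fields (name, value, domain, path, expiry), treats a four-field line as the older format without a time, and assumes no field contains a comma. `SavedCookie` is a hypothetical class, not part of Selenium.

```java
/** Hypothetical holder for one saved cookie line; not a Selenium class. */
final class SavedCookie {
    final String name;
    final String value;
    final String domain;
    final String path;
    final String expiry; // empty string when the line carries no expiry field

    private SavedCookie(String name, String value, String domain, String path, String expiry) {
        this.name = name;
        this.value = value;
        this.domain = domain;
        this.path = path;
        this.expiry = expiry;
    }

    /** Splits "name,value,domain,path[,expiry]" into its fields. */
    static SavedCookie parse(String line) {
        // Limit 5 keeps everything after the fourth comma together as the expiry text
        String[] parts = line.split(",", 5);
        if (parts.length < 4) {
            throw new IllegalArgumentException("expected at least 4 fields: " + line);
        }
        String expiry = parts.length == 5 ? parts[4].trim() : "";
        return new SavedCookie(parts[0].trim(), parts[1].trim(),
                parts[2].trim(), parts[3].trim(), expiry);
    }

    boolean hasExpiry() {
        return !expiry.isEmpty();
    }

    public static void main(String[] args) {
        // The sample line from item 1
        SavedCookie withTime = parse(
                "_zap,469c025b-7e65-4f9f-a00c-75f4cdf7e2ee,.zhihu.com,/,Mon Apr 13 19:28:59 CST 2020");
        // Assumed shape of the older format, without the time field
        SavedCookie withoutTime = parse("_zap,469c025b-7e65-4f9f-a00c-75f4cdf7e2ee,.zhihu.com,/");

        System.out.println(withTime.name + " hasExpiry=" + withTime.hasExpiry());
        System.out.println(withoutTime.name + " hasExpiry=" + withoutTime.hasExpiry());
    }
}
```

The expiry is kept as plain text on purpose: the `CST` abbreviation is ambiguous between several time zones, so parsing it into a date would require deciding which zone the line was written in.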