Scraping Hot Comments on the China-US Trade War from Zhihu

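【Preface】
The China-US trade war has been a hot topic lately, so I tried scraping some of the comments about it on Zhihu.
Difficulties: Zhihu's cookies have recently become much more complicated, so I log in directly with an account and password instead. Zhihu's front end has also switched to the React stack, which makes selecting page elements considerably harder.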
【Result】
Log in with account and password -- simulate scrolling to load more content -- extract the answer elements and print them

【Code】

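import java.util.List;

import org.openqa.selenium.By;
import org.openqa.selenium.Keys;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.firefox.FirefoxDriver;
import org.openqa.selenium.interactions.Actions;

public class TradeWar {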

    public static void main(String[] args) throws InterruptedException {
        System.setProperty("webdriver.gecko.driver", "C:\\code\\selenium\\geckodriver.exe");
        WebDriver driver = new FirefoxDriver();
        Actions action = new Actions(driver);
        // Open the Zhihu sign-in page and give it time to render
        driver.get("https://www.zhihu.com/#signin");
        driverWait(driver, 2000);
        
        // Enter the account and password (replace the placeholders with real credentials)
        driver.findElement(By.xpath("html/body/div[1]/div/main/div/div/div/div[2]/div[2]/span")).click();
        driver.findElement(By.xpath("html/body/div[1]/div/main/div/div/div/div[2]/div[1]/form/div[1]/div[2]/div[1]/input")).sendKeys("your-account");
        driver.findElement(By.xpath("html/body/div[1]/div/main/div/div/div/div[2]/div[1]/form/div[2]/div/div[1]/input")).sendKeys("your-password");
        driver.findElement(By.xpath("html/body/div[1]/div/main/div/div/div/div[2]/div[1]/form/button")).click();
        // Give the login request time to finish before navigating away
        driverWait(driver, 2000);

        // Open the topic's top-answers page
        driver.get("https://www.zhihu.com/topic/20177825/top-answers");
        
        // Scroll down to load enough content; the loop count can be raised to 10000+ if needed
        for (int i = 0; i < 100; i++) {
            Thread.sleep(100);
            action.sendKeys(Keys.ARROW_DOWN).perform();
        }
        
        // Collect the content and print it
        System.out.println("Start printing");
        List<WebElement> answers = driver.findElements(By.cssSelector("a[target='_blank']"));
        for (int i = 0; i < answers.size(); i++) {
            String answer = answers.get(i).getText();
            System.out.println("[Answer] " + answer + "\n");
        }
    }
    
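    // Pause the current thread for the given number of milliseconds.
    // The driver parameter is unused but kept so existing call sites compile unchanged.
    public static void driverWait(WebDriver driver, long time) {
        try {
            System.out.println("begin wait() ThreadName="
                    + Thread.currentThread().getName());
            // Thread.sleep avoids the spurious wakeups that Object.wait allows
            Thread.sleep(time);
            System.out.println("  end wait() ThreadName="
                    + Thread.currentThread().getName());
        } catch (InterruptedException e) {
            // Restore the interrupt flag so callers can still observe it
            Thread.currentThread().interrupt();
        }
    }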
}
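
The selector `a[target='_blank']` matches every link that opens in a new tab, so the printed list will likely contain blank strings and repeated titles alongside the real content. A minimal sketch of a clean-up step follows. `AnswerCleaner` and `clean` are hypothetical names that are not part of the program above, and the sample strings stand in for the values returned by `getText()`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of the original program.
final class AnswerCleaner {

    // Trims each string, drops null or blank entries, and removes
    // duplicates while keeping the first-seen order.
    static List<String> clean(List<String> rawTexts) {
        Set<String> seen = new LinkedHashSet<>();
        for (String text : rawTexts) {
            if (text == null) {
                continue;
            }
            String trimmed = text.trim();
            if (!trimmed.isEmpty()) {
                seen.add(trimmed);
            }
        }
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        // Sample strings standing in for WebElement.getText() results.
        List<String> raw = Arrays.asList(
                "  First answer ", "", "First answer", "Second answer", "   ");
        for (String answer : clean(raw)) {
            System.out.println("[Answer] " + answer);
        }
    }
}
```

In the scraper, the texts from the `answers` loop could be collected into a list first and passed through `clean` before printing.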

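【Comparison with Before】
1. The cookies I used to get carried no expiry time. Now they look like the line below, and logging in with cookies no longer works. Zhihu seems to still be changing its front end.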

_zap,469c025b-7e65-4f9f-a00c-75f4cdf7e2ee,.zhihu.com,/,Mon Apr 13 19:28:59 CST 2020

2. The class names below used to be enough to locate the page elements. Now that the front end uses React, none of them match anything.

// Get the questions and answers
List<WebElement> questions = driver.findElements(By.className("question_link"));
List<WebElement> answers = driver.findElements(By.className("zm-item-rich-text"));
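
The cookie line under item 1 appears to be a comma-separated dump of name, value, domain, path, and expiry, with the expiry in the text form that `java.util.Date.toString()` produces. As a rough sketch, such a line could be split back into fields like this. `CookieLine` and its field names are hypothetical, and the code assumes that the cookie value itself contains no comma.

```java
// Hypothetical holder for one dumped cookie line; the field names are assumptions.
final class CookieLine {
    final String name;
    final String value;
    final String domain;
    final String path;
    final String expiry; // raw text; empty when the line carries no expiry

    private CookieLine(String name, String value, String domain,
                       String path, String expiry) {
        this.name = name;
        this.value = value;
        this.domain = domain;
        this.path = path;
        this.expiry = expiry;
    }

    // Parses "name,value,domain,path[,expiry]".
    static CookieLine parse(String line) {
        // A limit of 5 leaves the expiry text in one piece.
        String[] parts = line.split(",", 5);
        if (parts.length < 4) {
            throw new IllegalArgumentException(
                    "expected at least 4 comma-separated fields: " + line);
        }
        String expiry = parts.length == 5 ? parts[4].trim() : "";
        return new CookieLine(parts[0].trim(), parts[1].trim(),
                parts[2].trim(), parts[3].trim(), expiry);
    }

    boolean hasExpiry() {
        return !expiry.isEmpty();
    }

    public static void main(String[] args) {
        CookieLine cookie = parse(
                "_zap,469c025b-7e65-4f9f-a00c-75f4cdf7e2ee,.zhihu.com,/,Mon Apr 13 19:28:59 CST 2020");
        System.out.println("name=" + cookie.name + ", domain=" + cookie.domain
                + ", path=" + cookie.path + ", hasExpiry=" + cookie.hasExpiry());
    }
}
```

The expiry is kept as raw text on purpose: the abbreviation `CST` is ambiguous between several time zones, so turning it into a date would need an explicit zone choice.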