[Crawler] Scraping the Douban Books TOP250

Date: 2019-11-29


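Locating elements with XPath

There are several ways to locate elements with XPath.

// selects nodes in the document, starting from the current node, that match the selection, no matter where they are.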

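#!/usr/bin/env python
# coding: utf-8
# First import webdriver from selenium, then use it to open the installed Chrome browser.
from selenium import webdriver
from selenium.webdriver.common.by import By

# Open the Chrome browser
browser = webdriver.Chrome()
# Visit Douban
browser.get('https://book.douban.com/top250?icn=index-book250-all')

# Get the page title
# (current Selenium 4 releases no longer provide find_element_by_xpath;
#  find_element(By.XPATH, ...) is the equivalent call)
title = browser.find_element(By.XPATH, "//div[@id='content']//h1").text
# Print the title
print(title)
# Get the list of element objects holding the book info on the current page, 25 in total
book_list = browser.find_elements(By.XPATH, "//tr[@class='item']")
for ele in book_list:
    print(ele.text + "\n")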

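Because the page holds many entries, note that the call needs the plural form: find_elements_by_xpath (find_elements(By.XPATH, ...) in current Selenium), which returns a list, rather than the singular form, which returns only one element.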
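To try out the `//` selector and the single-versus-list distinction without opening a browser, here is a minimal sketch using Python's standard-library `xml.etree.ElementTree` on a small made-up snippet. The `SAMPLE` markup and the book names are invented for illustration and are not Douban's real page; ElementTree only supports a subset of XPath and needs the relative form `.//` when searching from an element.

```python
import xml.etree.ElementTree as ET

# Made-up markup that loosely mimics the structure targeted above (not the real page).
SAMPLE = """
<html><body>
<div id="content">
  <h1>Sample Book List</h1>
  <table><tr class="item"><td>Book A</td></tr></table>
  <table><tr class="item"><td>Book B</td></tr></table>
</div>
</body></html>
"""

root = ET.fromstring(SAMPLE)

# find() returns the first match only: the h1 anywhere below div#content.
heading = root.find(".//div[@id='content']//h1")
print(heading.text)

# findall() returns every match as a list, however deeply each row is nested.
rows = root.findall(".//tr[@class='item']")
titles = [row.find("td").text for row in rows]
print(titles)

# With no match, find() gives None and findall() gives an empty list.
missing_one = root.find(".//tr[@class='missing']")
missing_all = root.findall(".//tr[@class='missing']")
print(missing_one, missing_all)
```

Selenium behaves a little differently on a miss: the singular call raises NoSuchElementException, while the plural call returns an empty list.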

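Paging function

Locate the 後頁 (next page) element.

Use find_element_by_class_name (find_element(By.CLASS_NAME, ...) in current Selenium) to locate this element.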

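#!/usr/bin/env python
# coding: utf-8
# First import webdriver from selenium, then use it to open the installed Chrome browser.
from selenium import webdriver
from selenium.webdriver.common.by import By
import time

# Open the Chrome browser
browser = webdriver.Chrome()
# Visit Douban
browser.get('https://book.douban.com/top250?icn=index-book250-all')

for i in range(10):
    # Get the page title
    title = browser.find_element(By.XPATH, "//div[@id='content']//h1").text
    # Print the title
    print(title)
    # Get the list of element objects holding the book info on the current page, 25 in total
    book_list = browser.find_elements(By.XPATH, "//tr[@class='item']")
    for ele in book_list:
        print(ele.text + "\n")
    # Print the current page number
    print("------------Page %s------------" % (i + 1))

    # Go to the next page (click() returns None, so there is nothing to store)
    browser.find_element(By.CLASS_NAME, "next").click()
    # Give the next page time to load
    time.sleep(5)
    print("\n")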

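time is a module in Python's standard library.

The time.sleep() function suspends execution of the calling thread; the 5 in the code above means execution is delayed by 5 seconds.

The delay is there because pages take time to load. Imagine clicking "next page" and immediately trying to locate elements: if they have not finished loading at that moment, the program is likely to raise an error.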
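A fixed 5-second sleep is simple, but it waits the full time even when the page is ready sooner. As a rough sketch of the alternative idea, polling until a condition holds, here is a small pure-Python helper. The names `wait_until` and `rows_loaded` are hypothetical and made up for this illustration; `rows_loaded` just simulates a page that becomes ready on the third check. Selenium ships its own version of this pattern as WebDriverWait.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds have passed.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f seconds" % timeout)
        time.sleep(poll_interval)


# Stand-in for "have the book rows loaded yet?": ready on the third check.
attempts = []


def rows_loaded():
    attempts.append(1)
    return ["row"] * 25 if len(attempts) >= 3 else []


rows = wait_until(rows_loaded, timeout=5.0, poll_interval=0.1)
print(len(rows), "rows after", len(attempts), "checks")
```

With a real browser the condition could be something like a lambda that calls find_elements for the book rows, since that call returns an empty (falsy) list while nothing matches. Whether that alone is enough right after a click depends on how the page navigates, so treat it as a starting point rather than a drop-in replacement for the sleep.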
