Today I put together a hands-on scraper for JD.com product information, following the same approach as the previous post: https://www.cnblogs.com/cany/p/10897618.html
Open https://www.jd.com/. You can search without logging in, unlike Taobao, which is why I haven't tried Taobao yet.
Press F12 and locate the search box and the search button:
input_box = WAIT.until(EC.presence_of_element_located((By.XPATH, '//*[@id="key"]')))  # avoid shadowing the builtin input()
submit = WAIT.until(EC.element_to_be_clickable((By.XPATH, '//*[@id="search"]/div/div[2]/button')))
input_box.send_keys(goods)
submit.click()
Next we want the results ranked by sales volume, which means clicking this sort tab (an onclick event).
It turns out click() still fails here, because this is a JS jump, so the click has to go through the code below instead:
submit_js = WAIT.until(EC.element_to_be_clickable((By.XPATH, '//*[@id="J_filter"]/div[1]/div[1]/a[2]')))
browser.execute_script("$(arguments[0]).click()", submit_js)
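This relies on the page's own jQuery. If jQuery weren't available, a plain-JavaScript click would do the same job (an alternative, not what the post uses):

browser.execute_script("arguments[0].click();", submit_js)  # native JS click, no jQuery dependency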
Next, as before, check whether the element below has loaded.
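In the full code this element is the goods list container, waited on with an explicit wait:

WAIT.until(EC.presence_of_element_located((By.CSS_SELECTOR, '#J_goodsList > ul')))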
Now to analyzing each listing. I won't go into detail on how the data inside is extracted; the lookups are sketched below.
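For reference, here are the per-item lookups used by save_data() in the full code (item is one gl-i-wrap node from BeautifulSoup):

goods_name = item.find(class_='p-name').find('em').text
goods_link = 'https:' + item.find(class_='p-img').find('a').get('href')
goods_evaluate = item.find(class_='p-commit').text
goods_store = item.find(class_='curr-shop').text
goods_money = item.find(class_='p-price').find('i').text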
At this point the scrape may come back incomplete, because JD loads listings dynamically; we need to simulate dragging the page to the bottom:
browser.execute_script("window.scrollTo(0,document.body.scrollHeight)")
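One jump straight to the bottom sometimes triggers only a single batch of lazy loading. A stepped scroll can give each batch time to render; this is a sketch with made-up step sizes, not what the post uses:

# Hypothetical stepped scroll; 10000 and 1000 are guesses to tune per page
for y in range(0, 10000, 1000):
    browser.execute_script("window.scrollTo(0, arguments[0]);", y)
    time.sleep(0.5)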
Looping through like this, every value gets appended to the goods_data list. There is no guarantee an attribute exists on every item, though; a missing one raises AttributeError. I've already run into it, hence the exception handling!
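In pattern form, the guard wraps each batch of lookups; a minimal sketch keeping only the name field:

for item in soup.find_all(class_='gl-i-wrap'):
    try:
        goods_name = item.find(class_='p-name').find('em').text  # raises AttributeError if the tag is missing
        goods_data.append(goods_name)
    except AttributeError:
        pass  # skip the item rather than crash the whole run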
Then, once a page is finished, move on to the next, and write a check that we have really jumped to the intended page:
WAIT.until(EC.text_to_be_present_in_element((By.CSS_SELECTOR,'#J_bottomPage > span.p-num > a.curr'),str(page_num)))
Then grab each page's current source, parse it, extract the content into the goods_data list, and finally write it all out to an xls file!
Tip: the sleep durations in the code depend on your situation. Too short and the data comes back incomplete, though a fast connection can compensate; at least that's how it has behaved in my testing so far!
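If the fixed sleeps prove flaky, one alternative (a sketch, not part of the original code; the helper name is made up) is to poll until the number of listed items stops growing:

def wait_for_items_to_settle(timeout=10, poll=0.5):
    # Hypothetical helper: poll the goods list until its item count stops changing
    deadline = time.time() + timeout
    last = -1
    while time.time() < deadline:
        count = len(browser.find_elements(By.CSS_SELECTOR, '#J_goodsList > ul > li'))
        if count == last:
            return count  # stable; lazy loading has likely finished
        last = count
        time.sleep(poll)
    return last  # timed out; return whatever we saw last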
The full code:
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
import xlwt
import time

goods = input('Enter the product name to scrape: ')
goods_data = []
browser = webdriver.PhantomJS()  # headless browser; deprecated in newer Selenium releases
WAIT = WebDriverWait(browser, 10)
browser.set_window_size(1000, 600)


def search(goods):
    """Open JD, search for the product, and switch to the sales-volume sort."""
    try:
        print('Starting automated scrape of JD product info......')
        browser.get('https://www.jd.com/')
        input_box = WAIT.until(EC.presence_of_element_located((By.XPATH, '//*[@id="key"]')))  # avoid shadowing the builtin input()
        submit = WAIT.until(EC.element_to_be_clickable((By.XPATH, '//*[@id="search"]/div/div[2]/button')))
        input_box.send_keys(goods)
        submit.click()
        # The sort tab is a JS jump, so trigger the click via script rather than .click()
        submit_js = WAIT.until(EC.element_to_be_clickable((By.XPATH, '//*[@id="J_filter"]/div[1]/div[1]/a[2]')))
        browser.execute_script("$(arguments[0]).click()", submit_js)
        time.sleep(1)
        get_source()
    except TimeoutException:
        return search(goods)  # retry on timeout


def get_source():
    """Scroll to the bottom to trigger lazy loading, then parse the page source."""
    browser.execute_script("window.scrollTo(0,document.body.scrollHeight)")
    time.sleep(1)
    WAIT.until(EC.presence_of_element_located((By.CSS_SELECTOR, '#J_goodsList > ul')))
    html = browser.page_source
    soup = BeautifulSoup(html, 'lxml')
    save_data(soup)


def save_data(soup):
    """Extract name/link/reviews/shop/price for each listing into goods_data."""
    for item in soup.find_all(class_='gl-i-wrap'):
        try:
            goods_name = item.find(class_='p-name').find('em').text
            goods_link = 'https:' + item.find(class_='p-img').find('a').get('href')
            goods_evaluate = item.find(class_='p-commit').text
            goods_store = item.find(class_='curr-shop').text
            goods_money = item.find(class_='p-price').find('i').text
            print('Scraping: ' + goods_name)
            goods_data.append([goods_name, goods_link, goods_evaluate, goods_store, goods_money])
        except AttributeError:
            pass  # skip items missing any field


def next_page(page_num):
    """Click the next-page button and confirm the page number actually changed."""
    try:
        print('Fetching next page of data')
        next_btn = WAIT.until(EC.element_to_be_clickable((By.CSS_SELECTOR, '#J_bottomPage > span.p-num > a.pn-next')))
        next_btn.click()
        WAIT.until(EC.text_to_be_present_in_element((By.CSS_SELECTOR, '#J_bottomPage > span.p-num > a.curr'), str(page_num)))
        get_source()
    except TimeoutException:
        browser.refresh()
        return next_page(page_num)  # reload and retry the jump


def save_to_excel():
    """Write goods_data to an .xls workbook named after the search term."""
    book = xlwt.Workbook(encoding='utf-8', style_compression=0)
    sheet = book.add_sheet(goods, cell_overwrite_ok=True)
    sheet.col(0).width = 256 * 80
    sheet.col(1).width = 256 * 40
    sheet.col(2).width = 256 * 20
    sheet.col(3).width = 256 * 25
    sheet.col(4).width = 256 * 20
    headers = ['Product name', 'Product link', 'Review count', 'Shop', 'Price']
    for col, header in enumerate(headers):
        sheet.write(0, col, header)
    for n, item in enumerate(goods_data, start=1):  # enumerate avoids the O(n^2) index() lookup
        for col, value in enumerate(item):
            sheet.write(n, col, value)
    book.save(goods + '.xls')


def main():
    try:
        search(goods)
        for i in range(2, 11):  # pages 2 through 10
            next_page(i)
        print('-' * 50)
        print('Scraping finished, writing xls.....')
        save_to_excel()
        print('Write succeeded!!!')
    finally:
        browser.quit()  # quit() closes all windows; a separate close() is redundant


if __name__ == '__main__':
    main()