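The goal is to scrape product details from the search results page https://www.aliexpress.com/wholesale?SearchText=cartoon+case&d=y&origin=n&catId=0&initiative_id=SB_20200523214041. Because the page loads its content asynchronously, Selenium is needed to simulate a browser and collect the product URLs. Locating page elements directly through Selenium is slow, however, so it is combined with re or BeautifulSoup to speed up the crawl.
Log in through Selenium; after a successful login the driver session holds the login cookie.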
import time

from selenium import webdriver
from selenium.webdriver.common.by import By


def login(username, password, driver):
    driver.get('https://login.aliexpress.com/')
    driver.maximize_window()
    # Fill in the account name and password fields
    name = driver.find_element(By.ID, 'fm-login-id')
    name.send_keys(username)
    name1 = driver.find_element(By.ID, 'fm-login-password')
    name1.send_keys(password)
    submit = driver.find_element(By.CLASS_NAME, 'fm-submit')
    time.sleep(1)  # brief pause before submitting
    submit.click()
    return driver


browser = webdriver.Chrome()
browser = login('Wheabion1944@dayrep.com', 'ab123456', browser)
browser.get('https://www.aliexpress.com/wholesale?trafficChannel=main&d=y&SearchText=cartoon+case&ltype=wholesale&SortType=default&page=')
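This site is lax about user verification: registering with an email address requires no confirmation, so a throwaway address from http://www.fakemailgenerator.com/ can be used to sign up.
In practice, the later crawling runs were done without logging in, and ten pages were scraped without hitting any anti-scraping measures.
The problem to solve at this stage is that the page is loaded asynchronously via AJAX. It does not load all of its data when opened; as the user scrolls down, the page fetches new data and renders it, so requests cannot read the full page source in a single call. The approach is to use Selenium to simulate scrolling so that all the data on one page gets loaded, and for the time being to keep using Selenium to locate the elements.
Opening the target page after logging in triggers an ad popup, which has to be closed first: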
def close_win(browser):
    time.sleep(10)  # give the popup time to appear
    try:
        closewindow = browser.find_element(By.CLASS_NAME, 'next-dialog-close')
        # Click via JavaScript, which bypasses Selenium's interactability checks
        browser.execute_script("arguments[0].click();", closewindow)
    except Exception as e:
        print(f"searchKey: no popup found on page 1. e = {e}")
    return browser
Simulate scrolling and collect the URLs of all products on one page:
def get_products(browser):
    wait = WebDriverWait(browser, 1)
    for i in range(30):
        # Scroll down 230 px at a time so that more products get rendered
        browser.execute_script('window.scrollBy(0,230)')
        time.sleep(1)
        products = wait.until(EC.presence_of_all_elements_located((By.CLASS_NAME, "product-info")))
        if len(products) >= 60:  # stop once 60 products are present
            break
        print(len(products))
    products = browser.find_elements(By.CLASS_NAME, 'product-info')
    return products
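Later a senior classmate pointed out that this is more trouble than necessary. Although the product entries on the search page are only rendered by JS as the page is scrolled, the product data is already written into the HTML document and can be read like this: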
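import json
import re

from selenium import webdriver

url = 'https://www.aliexpress.com/wholesale?trafficChannel=main&d=y&SearchText=cartoon+case&ltype=wholesale&SortType=default&page='
driver = webdriver.Chrome()
driver.get(url)
# The product data sits in the page as a `window.runParams = {...}` assignment;
# take the last occurrence and parse it as JSON
info = re.findall(r'window\.runParams = (\{.*\})', driver.page_source)[-1]
infos = json.loads(info)
items = infos['items']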
The individual fields can then be matched one at a time.
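As a minimal sketch of this approach, the snippet below pulls the `window.runParams` object out of a page source string and decodes it with `json`, without needing a browser. The sample HTML and the `extract_run_params` helper name are made up for illustration; the real page's JSON has many more fields, and only the `items` key is taken from the code above.

```python
import json
import re


def extract_run_params(page_source):
    """Return the last `window.runParams = {...}` object in the page, or None."""
    # Greedy match within one line: from the first "{" to the last "}" on that line.
    matches = re.findall(r'window\.runParams = (\{.*\})', page_source)
    if not matches:
        return None
    return json.loads(matches[-1])


# Hypothetical page fragment, shaped like the assignment described above.
sample_html = (
    '<script>\n'
    'window.runParams = {};\n'
    'window.runParams = {"items": [{"productId": 1, "title": "cartoon case"}]};\n'
    '</script>'
)

params = extract_run_params(sample_html)
items = params["items"] if params else []
print(len(items))
```

Returning `None` when nothing matches avoids the `IndexError` that indexing `[-1]` on an empty result would raise, for example when the site serves a verification page instead of search results.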
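The issue in this part is that there are many pages to scrape, and continuing to locate elements through Selenium would make the crawler very slow. In addition, the data on the product detail page does not appear to be returned asynchronously. The solution is to open the detail page with Selenium, grab the entire page source, and then match the elements with regular expressions: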
def get_pro_info(product):
    url = product.find_element(By.CLASS_NAME, 'item-title').get_attribute('href')
    driver = webdriver.Chrome()
    driver.get(url)
    page = driver.page_source
    driver.close()
    # SKU attributes come from the JSON embedded in the page source
    material = re.findall(r'"skuAttr":".*?#(.*?);', page)
    color = re.findall(r'skuAttr":".*?#.*?#(.*?)"', page)
    stock = re.findall(r'skuAttr":".*?"availQuantity":(.*?),', page)
    price = re.findall(r'skuAttr":".*?"actSkuCalPrice":"(.*?)"', page)
    # Images, color titles and the video come from the rendered HTML
    pics = re.findall(r'<div class="sku-property-image"><img class="" src="(.*?)"', page)
    titles = re.findall(r'<img class="" src=".*?" title="(.*?)">', page)
    video = re.findall(r'id="item-video" src="(.*?)"', page)
    return material, color, stock, price, pics, titles, video
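To see how these lazy patterns behave, the sketch below runs a few of them against a small string. The fragment is invented to imitate two SKU entries and is only an assumption about the layout; the real page source is much larger and may be structured differently.

```python
import re

# Invented fragment imitating two SKU entries in the page's embedded JSON.
page = (
    '{"skuAttr":"200000828:1#Silicone;14:173#Blue","skuId":101,'
    '"skuVal":{"availQuantity":50,"skuCalPrice":"1.99"}},'
    '{"skuAttr":"200000828:1#Silicone;14:193#Pink","skuId":102,'
    '"skuVal":{"availQuantity":7,"skuCalPrice":"2.49"}}'
)

# Lazy quantifiers (.*?) stop at the first delimiter, giving one match per SKU.
material = re.findall(r'"skuAttr":".*?#(.*?);', page)
color = re.findall(r'"skuAttr":".*?#.*?#(.*?)"', page)
stock = re.findall(r'"skuAttr":".*?"availQuantity":(.*?),', page)
skuID = re.findall(r'"skuId":(.*?),', page)

# The lists line up by index, so zip() rebuilds one record per SKU.
for record in zip(skuID, material, color, stock):
    print(record)
```

Every captured value is a string, including the stock count, and the lists only line up when each pattern matches exactly once per SKU.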
The scraped data has to be stored in a database, so MySQL is used here. The database crawl and the table SKU were both created in advance:
conn = pymysql.connect(host='localhost', user='root', password='ab226690',db='crawl')
mycursor = conn.cursor()
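The list data is written in a loop. One pitfall is that pymysql's placeholders in an insert are not the same thing as Python string formatting: every parameter is matched with %s, and numeric columns need no integer or float specifier: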
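# Write to the SKU table
# Note the placeholders: every column uses %s, even the numeric ones
sql = "INSERT INTO SKU(skuID,material,color,stock,price, url) VALUES (%s,%s,%s,%s,%s,%s)"
for i in range(len(skuID)):
    if titles:
        params = (skuID[i], material[i], color[i], stock[i], price[i], url)
    else:
        params = (skuID[i], material[i], ' ', stock[i], price[i], url)
    try:
        mycursor.execute(sql, params)
        conn.commit()
    except pymysql.err.IntegrityError:
        # Raised on a duplicate primary key; the intent is to skip that record.
        # When this was first run the error still escaped the except clause, and
        # the lazy workaround was to drop the primary key, which is not a sound fix.
        # The clause only works with pymysql's own IntegrityError: the class of the
        # same name in sqlalchemy.exc is never raised by pymysql.
        conn.rollback()
        continue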
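The skip-on-duplicate pattern can be tried without a MySQL server. The sketch below uses Python's built-in `sqlite3` as a stand-in for pymysql. This is an assumption made for illustration: sqlite3 uses `?` placeholders rather than `%s`, and the table and rows are invented. The point that carries over is to catch the `IntegrityError` class belonging to the driver that actually executes the statement.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sku (skuID INTEGER PRIMARY KEY, color TEXT)")

rows = [(1, "red"), (2, "blue"), (1, "green")]  # the third row repeats key 1
skipped = []
for row in rows:
    try:
        conn.execute("INSERT INTO sku (skuID, color) VALUES (?, ?)", row)
        conn.commit()
    except sqlite3.IntegrityError:
        # Duplicate primary key: undo the failed statement and move on.
        conn.rollback()
        skipped.append(row)

stored = conn.execute("SELECT skuID, color FROM sku ORDER BY skuID").fetchall()
print(stored, skipped)
conn.close()
```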
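Another problem with the write step is that re returns an empty list when nothing matches, which makes the MySQL insert fail with an error. Each value therefore has to be checked before writing, and an empty list replaced with a suitable default: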
sql = "INSERT INTO product(url, product_name, rating, reviews, video, shipping) VALUES (%s,%s,%s,%s,%s,%s)"
# Replace empty match lists with defaults before inserting
if not rating:
    rating = '0.0'
if not review:
    review = '0'
if not video:
    video = ' '
if not shipping:
    shipping = '0.0'
params = (url, pro_name, rating, review, video, shipping)
mycursor.execute(sql, params)
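The repeated emptiness checks above can be folded into one small helper. This is only a sketch, and `first_or_default` is a hypothetical name that does not appear in the original script. It also unwraps the single-element list that `re.findall` returns, whereas the code above passes the list itself to the database.

```python
import re


def first_or_default(matches, default):
    """Return the first regex match, or `default` when the list is empty."""
    return matches[0] if matches else default


page = '<span itemprop="ratingValue">4.8</span>'  # made-up fragment

rating = first_or_default(re.findall(r'itemprop="ratingValue">(.*?)</span>', page), '0.0')
video = first_or_default(re.findall(r'id="item-video" src="(.*?)"', page), ' ')
print(rating, repr(video))
```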
Commit and close the database connection:
conn.commit()
conn.close()
Besides switching to re matching after loading a page with Selenium, one more way to speed up the crawler turned up:
browser = webdriver.Chrome()
browser.get(source_url)
browser = close_win(browser)
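Repeatedly instantiating and closing the browser driver like this is time-consuming, so the site should be accessed through as few browser windows as possible.
In this task only two webdrivers are instantiated: one visits the product listing pages and the other visits the product detail pages. The trick is simply not to close these two drivers after creating them, and to keep calling get on them for each new page. The original code initialized a new webdriver for every page it opened; after this change the running time dropped by half.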
def scratch_page(source_url):
    browser = webdriver.Chrome()
    browser.get(source_url)
    browser.maximize_window()
    browser = close_win(browser)
    pros = get_products(browser)
    # Browser for the product detail pages
    browser2 = webdriver.Chrome()
    error_file = open('ERROR.txt', 'a+', encoding='utf8')
    for pro in pros:
        # get_pro_info is the earlier version, slightly modified to take a driver
        (url, pro_name, skuID, material, color, stock, price, pics,
         titles, video, rating, shipping, review) = get_pro_info(pro, browser2)
        if len(skuID) != len(color):
            # Log products whose SKU and color lists do not line up, then skip them
            error_file.write('url:' + url + '\n')
            continue
        save_data_to_sql(url, pro_name, skuID, material, color, stock, price,
                         pics, titles, video, rating, shipping, review)
    error_file.close()
    browser.close()
    browser2.close()
from selenium import webdriver
import time
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import re
import pymysql
from pymysql.err import IntegrityError  # raised by pymysql on a duplicate primary key


def login(username, password, driver):
    driver.get('https://login.aliexpress.com/')
    driver.maximize_window()
    name = driver.find_element(By.ID, 'fm-login-id')
    name.send_keys(username)
    name1 = driver.find_element(By.ID, 'fm-login-password')
    name1.send_keys(password)
    submit = driver.find_element(By.CLASS_NAME, 'fm-submit')
    time.sleep(1)
    submit.click()
    return driver


def close_win(browser):
    time.sleep(5)
    try:
        closewindow = browser.find_element(By.CLASS_NAME, 'next-dialog-close')
        closewindow.click()
    except Exception as e:
        print(f"searchKey: no popup found on page 1. e = {e}")
    return browser


def get_products(browser):
    wait = WebDriverWait(browser, 1)
    for i in range(30):
        browser.execute_script('window.scrollBy(0,230)')
        time.sleep(1)
        products = wait.until(EC.presence_of_all_elements_located((By.CLASS_NAME, "product-info")))
        if len(products) >= 60:
            break
    products = browser.find_elements(By.CLASS_NAME, 'product-info')
    return products


def get_pro_info(product, driver):
    url = product.find_element(By.CLASS_NAME, 'item-title').get_attribute('href')
    driver.get(url)
    time.sleep(0.5)
    page = driver.page_source
    material = re.findall(r'"skuAttr":".*?#(.*?);', page)
    color = re.findall(r'"skuAttr":".*?#.*?#(.*?)"', page)
    stock = re.findall(r'"skuAttr":".*?"availQuantity":(.*?),', page)
    price = re.findall(r'"skuAttr":".*?"skuCalPrice":"(.*?)"', page)
    pics = re.findall(r'<div class="sku-property-image"><img class="" src="(.*?)"', page)
    titles = re.findall(r'<img class="" src=".*?" title="(.*?)">', page)
    video = re.findall(r'id="item-video" src="(.*?)"', page)
    skuID = re.findall(r'"skuId":(.*?),', page)
    pro_name = re.findall(r'"product-title-text">(.*?)</h1>', page)
    rating = re.findall(r'itemprop="ratingValue">(.*?)</span>', page)
    shipping = re.findall(r'<span class="bold">(.*?) ', page)
    review = re.findall(r'"reviewCount">(.*?) Reviews</span>', page)
    # When a product has no color options, the page source is structured
    # differently and has to be matched again
    if not titles:
        material = re.findall(r'"skuAttr":".*?#(.*?)"', page)
        color = []
        pics = re.findall(r'"imagePathList":\["(.*?)",', page)
    return url, pro_name, skuID, material, color, stock, price, pics, titles, video, rating, shipping, review


def save_data_to_sql(url, pro_name, skuID, material, color, stock, price, pics, titles, video, rating, shipping, review):
    url = re.findall(r'/item/(.*?)\.html', url)
    conn = pymysql.connect(host='localhost', user='root', password='ab226690', db='crawl')
    mycursor = conn.cursor()
    # Write to the SKU table
    sql = "INSERT INTO SKU(skuID,material,color,stock,price, url) VALUES (%s,%s,%s,%s,%s,%s)"
    for i in range(len(skuID)):
        if titles:
            params = (skuID[i], material[i], color[i], stock[i], price[i], url)
        else:
            params = (skuID[i], material[i], ' ', stock[i], price[i], url)
        try:
            mycursor.execute(sql, params)
            conn.commit()
        except IntegrityError:
            # Duplicate primary key: skip this record
            conn.rollback()
            continue
    # Write to the image table
    sql = "INSERT INTO image(url, color, img) VALUES (%s,%s,%s)"
    if titles:
        for i in range(len(titles)):
            params = (url, titles[i], pics[i])
            try:
                mycursor.execute(sql, params)
                conn.commit()
            except IntegrityError:
                conn.rollback()
                continue
    else:
        params = (url, ' ', pics)
        try:
            mycursor.execute(sql, params)
            conn.commit()
        except IntegrityError:
            conn.rollback()
    # Write to the product table, replacing empty match lists with defaults
    sql = "INSERT INTO product(url, product_name, rating, reviews, video, shipping) VALUES (%s,%s,%s,%s,%s,%s)"
    if not rating:
        rating = '0.0'
    if not review:
        review = '0'
    if not video:
        video = ' '
    if not shipping:
        shipping = '0.0'
    params = (url, pro_name, rating, review, video, shipping)
    mycursor.execute(sql, params)
    conn.commit()
    conn.close()


def scratch_page(source_url):
    browser = webdriver.Chrome()
    browser.get(source_url)
    browser.maximize_window()
    browser = close_win(browser)
    pros = get_products(browser)
    # Browser for the product detail pages
    browser2 = webdriver.Chrome()
    error_file = open('ERROR.txt', 'a+', encoding='utf8')
    for pro in pros:
        (url, pro_name, skuID, material, color, stock, price, pics,
         titles, video, rating, shipping, review) = get_pro_info(pro, browser2)
        if len(skuID) != len(color):
            error_file.write('url:' + url + '\n')
            continue
        save_data_to_sql(url, pro_name, skuID, material, color, stock, price,
                         pics, titles, video, rating, shipping, review)
    error_file.close()
    browser.close()
    browser2.close()


url = 'https://www.aliexpress.com/wholesale?trafficChannel=main&d=y&SearchText=cartoon+case&ltype=wholesale&SortType=default&page='
for p in range(1, 11):
    url_ = url + str(p)
    start_time = time.time()
    scratch_page(url_)
    end_time = time.time()
    print('Scraped page ' + str(p))
    # Elapsed time is end minus start
    print('Page ' + str(p) + ' took: ' + str(end_time - start_time) + 's')