Selenium WebDriver drives the browser through the browser's own driver program (the web driver). Compared with parsing pages directly with the requests library it is less efficient; the main reason for using webdriver this time is that requests cannot parse this site's HTML.
Since webdriver operates the browser through that browser's driver program, you need the matching driver, and the driver version must correspond to the locally installed browser version. Download addresses for the common browser drivers are listed below:
Browser | Driver download address |
---|---|
Chrome (chromedriver.exe) | http://npm.taobao.org/mirrors/chromedriver/ |
Firefox (geckodriver.exe) | https://github.com/mozilla/geckodriver/releases |
Edge | https://developer.microsoft.com/en-us/microsoft-edge/tools/webdriver |
Safari | https://webkit.org/blog/6900/webdriver-support-in-safari-10/ |
This article uses Google's Chrome browser.
For the specific Chrome + webdriver configuration and usage, see http://www.javashuo.com/article/p-vjsjeeti-nh.html
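As a quick reference, here is a minimal sketch of starting Chrome through Selenium. The chromedriver path and the target URL are assumptions, not values from this article; point them at the driver you downloaded (matching your Chrome version) and at the site you want to scrape.

```python
from selenium import webdriver

# assumption: chromedriver.exe downloaded from the mirror above and saved at this path
driver_path = r'C:\tools\chromedriver.exe'

wd = webdriver.Chrome(executable_path=driver_path)  # Selenium 3 style, as used in the rest of this article
wd.implicitly_wait(10)          # wait up to 10 s for elements to appear
wd.get('https://example.com/')  # placeholder URL; replace with the target site
print(wd.title)
wd.quit()
```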
Opening the site shows that every province page shares the same URL and every school page shares the same URL, so constructing a URL for each province and each school by pattern and parsing it with requests is not possible.
So I turned to webdriver to simulate manual operation: load the current page, then locate the target data cells with XPath.
First, use the Chrome developer console to copy the XPath of a target data cell.
By adjusting the XPath by hand, it is easy to see that the province XPath follows this pattern:
```python
for province_id in range(1, 33):
    province_xpath = '//*[@id="div1"]/div/div[%s]/a' % province_id
```
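A quick way to sanity-check the pattern is to loop over the 32 province indexes and print each link's text. This sketch assumes `wd` is a webdriver instance that has already opened the province list page:

```python
# assumption: wd is a webdriver already on the province list page
for province_id in range(1, 33):
    province_xpath = '//*[@id="div1"]/div/div[%s]/a' % province_id
    print(province_id, wd.find_element_by_xpath(province_xpath).text)
```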
The same method gives the XPath for each school; I'll skip the screenshots and go straight to the result:
```python
# school_id is the row number of a school within the current province
# schid_xpath, province_xpath, schcode_xpath, school_xpath, subpage_xpath and schhome_xpath
# correspond to the columns: No., region, school code, school name, subject requirements, school homepage
schid_xpath = '//*[@id="div4"]/table/tbody/tr[%s]/td[1]/a' % school_id
province_xpath = '//*[@id="div4"]/table/tbody/tr[%s]/td[2]/a' % school_id
schcode_xpath = '//*[@id="div4"]/table/tbody/tr[%s]/td[3]/a' % school_id
school_xpath = '//*[@id="div4"]/table/tbody/tr[%s]/td[4]/a' % school_id
subpage_xpath = '//*[@id="div4"]/table/tbody/tr[%s]/td[5]/a' % school_id
schhome_xpath = '//*[@id="div4"]/table/tbody/tr[%s]/td[6]/a' % school_id
```
The same method gives the XPath for the major information; the result:
```python
# major_id is the row number of a major within a school, from 1 to the last major
# i from 1 to 4 corresponds to the columns: No., level, major name, subject requirements
for i in range(1, 5):
    major_xpath = '//*[@id="ccc"]/div/table/tbody/tr[%s]/td[%s]' % (major_id, i)
```
```python
def traverse_province(wd, conn):
    """
    Loop over the provinces.
    :return:
    """
    for province_id in range(1, 33):
        province_xpath = '//*[@id="div1"]/div/div[%s]/a' % province_id
        wd.find_element_by_xpath(province_xpath).click()  # click into the province
        time.sleep(1)
        traverse_school(wd, conn)  # traverse the schools in this province
    wd.quit()
    conn.close()
```
```python
def traverse_school(wd, conn):
    """
    Traverse the school information.
    :return:
    """
    school_id = 1
    while True:
        school_info = []
        try:
            # collect the school's fields
            for i in [1, 2, 3, 4, 6]:
                school_xpath = '//*[@id="div4"]/table/tbody/tr[%s]/td[%s]' % (school_id, i)
                text = wd.find_element_by_xpath(school_xpath).text
                school_info.append(text)
            # open the school's sub-page (subject requirements)
            wd.find_element_by_xpath('//*[@id="div4"]/table/tbody/tr[%s]/td[5]/a' % school_id).click()
            wd.switch_to.window(wd.window_handles[-1])  # switch to the newest window
            traverse_major(school_info, wd, conn)       # traverse the majors
            wd.close()                                  # close the current window
            wd.switch_to.window(wd.window_handles[-1])  # switch back to the remaining window
            school_id += 1
        except Exception:
            break
    conn.commit()  # commit once per province, after all of its schools
```
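The `while True` / `except` pattern above simply probes row after row until an XPath lookup fails. An alternative, sketched below under the assumption that `wd` is currently showing a province's school table, is to count the rows up front with `find_elements_by_xpath` and loop over a known range:

```python
# assumption: wd is currently on a province page showing the school table
rows = wd.find_elements_by_xpath('//*[@id="div4"]/table/tbody/tr')
print('schools in this province:', len(rows))
for school_id in range(1, len(rows) + 1):
    name = wd.find_element_by_xpath('//*[@id="div4"]/table/tbody/tr[%s]/td[4]/a' % school_id).text
    print(school_id, name)
```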
```python
def traverse_major(school_info, wd, conn):
    """
    Traverse the major information and write it out together with the school information.
    :param school_info: school information passed in by the caller
    :return:
    """
    major_id = 1
    cursor = conn.cursor()
    while True:
        major_info = []
        try:
            for i in range(1, 5):
                major_xpath = '//*[@id="ccc"]/div/table/tbody/tr[%s]/td[%s]' % (major_id, i)
                text = wd.find_element_by_xpath(major_xpath).text
                major_info.append(text)
            print(school_info + major_info)
            # write to MySQL
            insert_sql = '''
                insert into sdzk_data
                (school_id, province, school_code, school_name, school_home,
                 major_id, cc, major_name, subject_ask)
                values ('%s','%s','%s','%s','%s','%s','%s','%s','%s')
            ''' % (school_info[0], school_info[1], school_info[2], school_info[3], school_info[4],
                   major_info[0], major_info[1], major_info[2], major_info[3])
            cursor.execute(insert_sql)
            major_id += 1
        except Exception:
            break
    cursor.close()  # a fresh cursor is opened for every school
```
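The insert above builds the SQL statement with `%` string formatting, which breaks if any field contains a single quote and is open to injection. A sketch of the same insert using pymysql's parameter binding instead (an alternative, not what the original code does):

```python
insert_sql = '''
    insert into sdzk_data
    (school_id, province, school_code, school_name, school_home,
     major_id, cc, major_name, subject_ask)
    values (%s, %s, %s, %s, %s, %s, %s, %s, %s)
'''
# pymysql escapes the values itself when they are passed as a separate sequence
cursor.execute(insert_sql, school_info + major_info)
```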
```python
def connect_mysql(config):
    """
    Connect to the database and create the table; if the table already exists, drop it first.
    :param config: MySQL connection settings
    :return: a connected pymysql Connection object
    """
    create_sql = '''
        CREATE TABLE IF NOT EXISTS sdzk_data
        (school_id int(3), province varchar(20), school_code varchar(5),
         school_name varchar(50), school_home varchar(100), major_id int(3),
         cc varchar(5), major_name varchar(100), subject_ask varchar(50))
    '''
    conn = pymysql.connect(**config)
    cursor = conn.cursor()
    # check whether the table exists; if so drop it, then create it
    cursor.execute('''show tables like "sdzk_data"''')
    if cursor.fetchall():
        cursor.execute('''drop table sdzk_data''')
    cursor.execute(create_sql)
    cursor.close()
    return conn
```
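For completeness, a sketch of how these pieces might be wired together. The MySQL credentials, chromedriver path, and start URL below are placeholders I have assumed for illustration, not values from the original article:

```python
import time

import pymysql
from selenium import webdriver

if __name__ == '__main__':
    # placeholder connection settings; adjust to your own MySQL instance
    config = {
        'host': '127.0.0.1',
        'user': 'root',
        'password': '******',
        'database': 'spider',
        'charset': 'utf8mb4',
    }
    conn = connect_mysql(config)

    # placeholder driver path and start URL
    wd = webdriver.Chrome(executable_path=r'C:\tools\chromedriver.exe')
    wd.get('https://example.com/')  # replace with the subject-requirement query page
    time.sleep(1)

    traverse_province(wd, conn)  # traverse_province closes both wd and conn when it finishes
```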