I have recently been thinking about looking for a new job. Browsing a recruitment site, I found so many listings that it quickly became overwhelming, so I wrote a crawler to scrape the key information of the positions that match my criteria.
1. Basic principle
On 51job (前程無憂) I entered my search criteria: the position is Big Data Development Engineer and the location is Wuhan, which returned 4 pages of results:
Each page holds roughly 50 job entries, but the list page only shows the job title, the company name, part of the work location, the salary, and the posting date. For job hunting, I also want to see:
The company's exact address: if it is too far from home, commuting will eat up a lot of time.
The work-experience requirement: to judge whether my own experience meets it.
The number of postings published by the same company: to judge whether a listing is fake; many companies that post fake jobs publish large numbers of near-identical ads (see the counting sketch after this list).
In the end, the fields I chose to scrape are: job title, company name, experience requirement, detailed company address, salary, and the URL of the job-detail page.
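As an aside, the duplicate-posting check can be done on the scraped results themselves. Below is a minimal sketch (not part of the crawler) that counts postings per company from the tuple list built later in the code; the threshold of 5 is an arbitrary assumption of mine:

from collections import Counter

def count_postings_per_company(job_list):
    # job_list items are (job_name, money, company_name, jinyan, address, url)
    counts = Counter(record[2] for record in job_list)
    for company, n in counts.most_common():
        if n >= 5:  # arbitrary threshold for "suspiciously many postings"
            print(company, n)
    return counts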
This project uses the urllib and BeautifulSoup packages to fetch and parse the HTML. The processing steps are as follows:
Step 1: collect the URLs of the job-detail pages and store them in a set for the next step to iterate over.
Step 2: iterate over the detail-page URLs in the set, use BeautifulSoup's select() CSS selectors to extract the fields we need, and pack each record into a tuple appended to a list.
Step 3: save the values in the list to MySQL.
Step 4: export the data to Excel for analysis.
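The Excel export in step 4 is not shown in the code below; as a hedged sketch, one way to do it is to read the MySQL table into pandas and write an .xlsx file. pandas, SQLAlchemy, and openpyxl are my own assumptions and are not used by the crawler itself; the connection details and table name mirror the ones in the code:

# Sketch of step 4 (not from the original code): dump the MySQL table to Excel.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("mysql+mysqldb://root:root@192.168.72.11/test?charset=utf8")
df = pd.read_sql("SELECT * FROM `51job`", engine)
df.to_excel("51job.xlsx", index=False)  # writes .xlsx via the openpyxl engine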
2. Code implementation
import urllib.request
import MySQLdb
from bs4 import BeautifulSoup


def get_Url_Set(url):
    """Collect the job-detail page URLs from one search-result page."""
    index_page = urllib.request.urlopen(url).read().decode('gbk')
    soup = BeautifulSoup(index_page, features='html.parser')
    a_list = soup.select("a[href]")
    url_set = set()
    for item in a_list:
        href = item["href"]
        # Keep only links that point to Wuhan job-detail pages
        if "wuhan" in href and "https" in href:
            url_set.add(href)
    return url_set


def get_infomation(url_set):
    """Visit every detail page and extract the fields we need."""
    job_list = list()
    for item_url in url_set:
        print(item_url)
        index_page = urllib.request.urlopen(item_url).read().decode('gbk')
        soup = BeautifulSoup(index_page, features='html.parser')
        # Job title
        job_names = soup.select("h1[title]")
        job_name = job_names[0]["title"]
        # Salary
        moneys = soup.select("div.cn strong")
        money = moneys[0].string
        if money is None:
            money = "未標明工資"  # salary not specified
        # Company name
        company_names = soup.select("a.catn")
        company_name = company_names[0]["title"]
        # Experience requirement
        jinyans = soup.select("p.msg")
        list1 = jinyans[0]["title"].split("|")
        jinyan = list1[1].strip()
        # Work address
        address_list = soup.select("p.fp")
        for item in address_list:
            if item.span.string == "上班地址:":
                address = item.contents[2]
        print(job_name + " " + money + " " + company_name + " " + jinyan + " " + address + " " + item_url)
        job_list.append((job_name, money, company_name, jinyan, address, item_url))
    return job_list


def save_jobinfo(job_list):
    """Batch-insert the scraped records into MySQL."""
    db = MySQLdb.connect("192.168.72.11", "root", "root", "test", charset="utf8")
    cursor = db.cursor()
    ar_sql = ("INSERT INTO `test`.`51job` "
              "(`job_name`,`job_money`,`company_name`,`jinyan`,`company_address`,`url`) "
              "VALUES(%s,%s,%s,%s,%s,%s)")
    cursor.executemany(ar_sql, job_list)
    db.commit()
    cursor.close()
    db.close()


if __name__ == '__main__':
    url_1 = "https://search.51job.com/list/180200,000000,0000,00,9,99,%25E5%25A4%25A7%25E6%2595%25B0%25E6%258D%25AE%25E5%25BC%2580%25E5%258F%2591%25E5%25B7%25A5%25E7%25A8%258B%25E5%25B8%2588,2,1.html?lang=c&stype=1&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&lonlat=0%2C0&radius=-1&ord_field=0&confirmdate=9&fromType=&dibiaoid=0&address=&line=&specialarea=00&from=&welfare="
    url_2 = "https://search.51job.com/list/180200,000000,0000,00,9,99,%25E5%25A4%25A7%25E6%2595%25B0%25E6%258D%25AE%25E5%25BC%2580%25E5%258F%2591%25E5%25B7%25A5%25E7%25A8%258B%25E5%25B8%2588,2,2.html?lang=c&stype=1&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&lonlat=0%2C0&radius=-1&ord_field=0&confirmdate=9&fromType=&dibiaoid=0&address=&line=&specialarea=00&from=&welfare="
    url_3 = "https://search.51job.com/list/180200,000000,0000,00,9,99,%25E5%25A4%25A7%25E6%2595%25B0%25E6%258D%25AE%25E5%25BC%2580%25E5%258F%2591%25E5%25B7%25A5%25E7%25A8%258B%25E5%25B8%2588,2,3.html?lang=c&stype=1&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&lonlat=0%2C0&radius=-1&ord_field=0&confirmdate=9&fromType=&dibiaoid=0&address=&line=&specialarea=00&from=&welfare="
    url_4 = "https://search.51job.com/list/180200,000000,0000,00,9,99,%25E5%25A4%25A7%25E6%2595%25B0%25E6%258D%25AE%25E5%25BC%2580%25E5%258F%2591%25E5%25B7%25A5%25E7%25A8%258B%25E5%25B8%2588,2,4.html?lang=c&stype=1&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&lonlat=0%2C0&radius=-1&ord_field=0&confirmdate=9&fromType=&dibiaoid=0&address=&line=&specialarea=00&from=&welfare="

    # Crawl each search-result page in turn
    url_list = [url_1, url_2, url_3, url_4]
    for item in url_list:
        url_Set = get_Url_Set(item)
        jobinfo_list = get_infomation(url_Set)
        save_jobinfo(jobinfo_list)
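The INSERT statement above assumes that a `test`.`51job` table already exists. Here is a hedged sketch of creating a matching table; the column names are taken from the INSERT statement, while the VARCHAR lengths are assumptions of mine:

import MySQLdb

# Create the target table once before the first crawl (column types are assumed).
db = MySQLdb.connect("192.168.72.11", "root", "root", "test", charset="utf8")
cursor = db.cursor()
cursor.execute("""
    CREATE TABLE IF NOT EXISTS `51job` (
        `job_name`        VARCHAR(255),
        `job_money`       VARCHAR(64),
        `company_name`    VARCHAR(255),
        `jinyan`          VARCHAR(64),
        `company_address` VARCHAR(255),
        `url`             VARCHAR(512)
    ) DEFAULT CHARSET = utf8
""")
db.commit()
cursor.close()
db.close()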
三、結果展現
導出到excel中,以下:
4. Summary
Since time was short, the code still has many shortcomings; quite a few values are hard-coded, so it is not very flexible or reusable. It is for reference only.
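One small improvement to the hard-coded parts: the four search URLs above differ only in the page number before ".html", so they can be generated in a loop. A sketch of that idea, with the keyword, city code and query string copied from the URLs above (not re-verified against the live site):

# Build the search-result URLs instead of hard-coding url_1 ... url_4.
# "{page}" replaces the trailing page number in ",2,<page>.html".
BASE_URL = (
    "https://search.51job.com/list/180200,000000,0000,00,9,99,"
    "%25E5%25A4%25A7%25E6%2595%25B0%25E6%258D%25AE%25E5%25BC%2580%25E5%258F%2591"
    "%25E5%25B7%25A5%25E7%25A8%258B%25E5%25B8%2588,2,{page}.html"
    "?lang=c&stype=1&postchannel=0000&workyear=99&cotype=99&degreefrom=99"
    "&jobterm=99&companysize=99&lonlat=0%2C0&radius=-1&ord_field=0"
    "&confirmdate=9&fromType=&dibiaoid=0&address=&line=&specialarea=00&from=&welfare="
)

url_list = [BASE_URL.format(page=p) for p in range(1, 5)]  # pages 1 to 4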