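Disclaimer:
1) This is for my own study only; if anything here infringes on your rights, tell me and I will remove it promptly.
2) I do not want to mislead anyone; if you find an error, please point it out.
3) Companion video: http://www.bilibili.com/video/BV1aC4y1a7nR?share_medium=android&share_source=copy_link&bbid=XY1C2901EE0D25CCEC5E23A673F2026B36BEF&ts=1592703866866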
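Goals:
1. Crawl programming-language job postings on Lagou for 1) salary, 2) city, 3) years of experience, 4) education requirement;
2. Save these four fields to MySQL;
3. Visualize the four fields;
4. Finally, polish the web page with pyecharts + bootstrap.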
Skills needed:
1. Python networking basics (requests, XPath syntax, etc.);
2. MySQL + pymysql syntax basics;
3. pyecharts basics;
4. bootstrap basics;
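Project flow and logic:
Overall direction: first crawl one category of information and visualize it. Walking through the whole pipeline once matters most; extend it afterwards.
1. Go to the following location:
-------> Refresh the page and find the request URL: <--------
-------> Analyze the request and its parameters: <--------
-------> Because the URL is a POST request, we need to submit parameters; scroll down to see them: <-------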
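The POST parameters described above can be sketched as a small helper. This is only a minimal sketch, not the author's code: the function name `build_form_data` is hypothetical, and treating `first` as `"true"` only on page 1 is an assumption about the site's behavior that this article does not confirm. The field names `first`, `pn`, and `kd` are the ones used in the request code later in the article.

```python
def build_form_data(page, keyword="python"):
    """Build the form body for one page of the listing POST request.

    Assumption: "first" is "true" only for the first page.
    """
    return {
        "first": "true" if page == 1 else "false",
        "pn": page,      # page number
        "kd": keyword,   # search keyword
    }


if __name__ == "__main__":
    # Show the form body for the first three pages.
    for page in range(1, 4):
        print(build_form_data(page))
```

Keeping the form body in one function means the page number and keyword can change without touching the request code.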
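2. Handling the anti-crawling mechanisms
1. The steps above deal with ------> Lagou's ajax request method
2. Handling the timestamp hidden in the cookies: ------> use a session to keep the conversation alive ----- refresh the cookies in real time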
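    import requests

    # Function that fetches fresh cookies
    # start_url = "https://www.lagou.com/jobs/list_python?#labelWords=&fromSearch=true&suginput="
    def cookieRequest(start_url):
        # Visit the search page inside a session so the server sets fresh cookies.
        # `headers` is the module-level dict defined in the main block.
        r = requests.Session()
        r.get(url=start_url, headers=headers, timeout=3)
        return r.cookies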
3. Building the pipeline
1. Build the main function:
    import time
    import requests

    if __name__ == '__main__':
        # Initial URL, used to obtain the cookies
        start_url = "https://www.lagou.com/jobs/list_python?labelWords=&fromSearch=true&suginput="
        # URL of the simulated ajax request
        post_url = "https://www.lagou.com/jobs/positionAjax.json?"
        # headers
        headers = {
            "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36",
            "accept": "application/json, text/javascript, */*; q=0.01",
            "accept-encoding": "gzip, deflate, br",
            "accept-language": "zh-CN,zh;q=0.9,en;q=0.8",
            "referer": "https://www.lagou.com/jobs/list_python?labelWords=&fromSearch=true&suginput=",
        }
        # Dynamic cookies
        cookies = cookieRequest(start_url)
        time.sleep(1)
        # Page to request (only the first page is fetched in this excerpt)
        page = 1
        # Exception handling
        try:
            data = {
                "first": "true",
                "pn": page,
                "kd": "python",
            }
            textInformation(post_url, data, cookies)
            time.sleep(7)
            print('------------Page %s crawled successfully, moving on to the next page--------------' % page)
        except requests.exceptions.ConnectionError:
            print("Connection refused")
2. Build the listing-page function
    import json
    import time
    import requests

    def textInformation(post_url, data, cookies):
        # `headers` is the module-level dict defined in the main block.
        response = requests.post(post_url, headers=headers, data=data, cookies=cookies, timeout=3).text
        div1 = json.loads(response)
        # Job postings on this page
        position_data = div1["content"]["positionResult"]["result"]
        n = 1
        for result in position_data:
            infor = {
                "positionName": result["positionName"],

                "companyFullName": result["companyFullName"],
                "companySize": result["companySize"],
                "industryField": result["industryField"],
                "financeStage": result["financeStage"],

                "firstType": result["firstType"],
                "secondType": result["secondType"],
                "thirdType": result["thirdType"],

                "positionLables": result["positionLables"],

                "createTime": result["createTime"],

                "city": result["city"],
                "district": result["district"],
                "businessZones": result["businessZones"],

                "salary": result["salary"],
                "workYear": result["workYear"],
                "jobNature": result["jobNature"],
                "education": result["education"],

                "positionAdvantage": result["positionAdvantage"]
            }

            print(infor)
            time.sleep(5)
            print('----------Record %s written-------' % n)
            n += 1
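The goals name four fields to store and visualize: salary, city, years of experience, and education. One way to pull just those out of each record could look like the sketch below. The helper name `extract_core_fields` and the sample record are hypothetical; the key names are the ones read in the listing-page function above.

```python
# Keys for the four target fields, as they appear in each job record.
TARGET_FIELDS = ("salary", "city", "workYear", "education")


def extract_core_fields(result):
    """Return only the four target fields from one job record.

    A missing key becomes None instead of raising KeyError.
    """
    return {field: result.get(field) for field in TARGET_FIELDS}


if __name__ == "__main__":
    # Made-up record, for illustration only.
    sample = {
        "positionName": "Python developer",
        "salary": "15k-25k",
        "city": "Beijing",
        "workYear": "3-5 years",
        "education": "Bachelor",
    }
    print(extract_core_fields(sample))
```

Using `dict.get` rather than indexing keeps one incomplete record from stopping the whole page.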
3. Fetch the show_id separately for each category (used by the detail pages):
https://www.lagou.com/jobs/4254613.html?show=0977e2e185564709bebd04fe72a34c9f
    import json
    import requests

    show_id = []

    def getShowId(post_url, headers, cookies):
        data = {
            "first": "true",
            "pn": 1,
            "kd": "python",
        }
        response = requests.post(post_url, headers=headers, data=data, cookies=cookies).text
        div1 = json.loads(response)
        # Job postings on this page
        position_data = div1["content"]["positionResult"]["result"]
        # show_id used by the detail pages
        position_show_id = div1['content']['showId']
        show_id.append(position_show_id)
        # return position_show_id
4. Detail-page information
    import requests
    from lxml import etree

    def detailinformation(detail_id, show_id):
        get_url = "https://www.lagou.com/jobs/{}.html?show={}".format(detail_id, show_id)
        # time.sleep(2)
        # Fetch the detail page (`headers` is the module-level dict from the main block)
        response = requests.get(get_url, headers=headers, timeout=5).text
        # print(response)
        html = etree.HTML(response)
        div1 = html.xpath("//div[@class='job-detail']/p/text()")
        # Job description: strip non-breaking spaces to clean the data
        position_list = [i.replace(u'\xa0', u'') for i in div1]
        # print(position_list)
        return position_list
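The article does not show where `detail_id` comes from. A possible way to combine the listing response with the show_id is sketched below. It assumes each job record carries its numeric id under a `positionId` key, which is not confirmed anywhere in this article, so the key is left as a parameter. The function name `build_detail_urls` is hypothetical.

```python
def build_detail_urls(payload, id_key="positionId"):
    """Build detail-page URLs from one parsed listing response.

    Assumption: every job record stores its numeric id under `id_key`.
    """
    content = payload["content"]
    show_id = content["showId"]
    urls = []
    for result in content["positionResult"]["result"]:
        urls.append(
            "https://www.lagou.com/jobs/{}.html?show={}".format(result[id_key], show_id)
        )
    return urls


if __name__ == "__main__":
    # Made-up payload with the same nesting as the listing response.
    fake = {
        "content": {
            "showId": "abc",
            "positionResult": {"result": [{"positionId": 1}, {"positionId": 2}]},
        }
    }
    for url in build_detail_urls(fake):
        print(url)
```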
The complete code is on GitHub:
https://github.com/xbhog/studyProject
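4. Unresolved or unfinished issues
- When detail pages are saved to MySQL, some records have no data, possibly because of network jitter or overly frequent requests
- No multithreading
- The scrapy framework is not used
- No class methods are used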
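For the first issue (detail pages that come back empty), one possible mitigation is to retry with a growing delay. This is a minimal sketch, not part of the author's code: `fetch_with_retry` is a hypothetical helper, and whether retries actually fix the empty records depends on the real cause, which the article only guesses at.

```python
import time


def fetch_with_retry(fetch, retries=3, delay=2.0):
    """Call fetch() until it returns a non-empty result or retries run out.

    fetch is any zero-argument callable, for example
    lambda: detailinformation(detail_id, show_id).
    Returns an empty list if every attempt fails or comes back empty.
    """
    for attempt in range(1, retries + 1):
        try:
            result = fetch()
        except Exception as exc:  # broad catch keeps the sketch short
            print("attempt %d failed: %s" % (attempt, exc))
            result = None
        if result:
            return result
        if attempt < retries:
            # Wait longer after each failed attempt to ease the request rate.
            time.sleep(delay * attempt)
    return []
```

A call such as `fetch_with_retry(lambda: detailinformation(detail_id, show_id))` would wrap the detail-page function without changing it.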
------> Next installment <---------
Data storage: ---- storage environment: Ubuntu
- MySQL storage
- csv storage
Data storage link: http://www.javashuo.com/article/p-orkpwslx-ns.html