Lagou Project (1): Project Workflow + Data Extraction

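Disclaimer:

1) This is for my personal study only. If anything here infringes on your rights, let me know and I will delete it promptly.

2) I do not want to mislead anyone. If you spot a mistake, please point it out.

3) Companion video for this article: http://www.bilibili.com/video/BV1aC4y1a7nR?share_medium=android&share_source=copy_link&bbid=XY1C2901EE0D25CCEC5E23A673F2026B36BEF&ts=1592703866866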

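Goals:

1. Scrape four fields from the programming-language job postings on Lagou: 1) salary, 2) city, 3) years of work experience, 4) education requirement;

2. Save these four parts to `mysql`;

3. Visualize the four parts;

4. Finally, polish the resulting web page with `pyecharts+bootstrap`.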

Skills required:

1. Python networking basics (`requests`, `xpath` syntax, etc.);

2. `MySQL + pymysql` syntax basics;

3. `pyecharts` basics;

4. bootstrap basics.

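Project workflow and logic:

Overall direction: first scrape one category of information and visualize it. Walking through the whole pipeline once matters most; extend it afterwards.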

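1. Go to the following location:

Refresh the page and find the request `url`.

Analyze the request and its parameters.

Because the `url` is a POST request, we need to submit parameters; scroll down to find them.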
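
The form parameters submitted with the POST request can be sketched as a small helper. This is only a minimal sketch: the name `build_form_data` is hypothetical, the keys mirror the `first`, `pn` and `kd` parameters used in the code later in this article, and switching `first` to "false" after page 1 is an assumption, since the article only ever requests page 1.

```python
def build_form_data(page, keyword="python"):
    """Build the form body for the positionAjax.json POST request.

    Hypothetical helper; the keys mirror the `first`, `pn` and `kd`
    parameters used in the article's own code.
    """
    return {
        # Assumption: only the first page is flagged with first=true.
        "first": "true" if page == 1 else "false",
        "pn": page,      # page number
        "kd": keyword,   # search keyword
    }
```

For example, `build_form_data(1)` gives the same dictionary that the main function below submits.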

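2. Getting around the anti-scraping measures

1. The steps above deal with Lagou's `ajax` request method.

2. Handling the timestamp hidden in the cookies: use a session to keep the conversation alive and refresh the cookies in real time.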

    import requests

    # Function that fetches fresh cookies
    # start_url = "https://www.lagou.com/jobs/list_python?labelWords=&fromSearch=true&suginput="
    def cookieRequest(start_url):
        session = requests.Session()
        # `headers` is the module-level dict defined in the main block
        session.get(url=start_url, headers=headers, timeout=3)
        return session.cookies

3. Building the pipeline

1. Build the main function:

    import time
    import requests

    if __name__ == '__main__':
        # Initial URL, used only to obtain cookies
        start_url = "https://www.lagou.com/jobs/list_python?labelWords=&fromSearch=true&suginput="
        # URL of the ajax request to simulate
        post_url = "https://www.lagou.com/jobs/positionAjax.json?"
        # Request headers
        headers = {
            "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36",
            "accept": "application/json, text/javascript, */*; q=0.01",
            "accept-encoding": "gzip, deflate, br",
            "accept-language": "zh-CN,zh;q=0.9,en;q=0.8",
            "referer": "https://www.lagou.com/jobs/list_python?labelWords=&fromSearch=true&suginput=",
        }
        # Dynamic cookies
        cookies = cookieRequest(start_url)
        time.sleep(1)
        # Exception handling
        try:
            data = {
                "first": "true",
                "pn": 1,  # page number
                "kd": "python",
            }
            textInformation(post_url, data, cookies)
            time.sleep(7)
            print('------------Page %s scraped successfully, moving on to the next page--------------' % data["pn"])
        except requests.exceptions.ConnectionError:
            print("Connection refused")
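
The main function above requests a single page, although its success message mentions moving on to the next one. A paging loop that refreshes the cookies before every page might look like the sketch below. The name `crawl_pages` and its two callable parameters are hypothetical; in this project they would wrap `cookieRequest` and `textInformation`.

```python
import time


def crawl_pages(fetch_cookies, fetch_page, pages, delay=7.0, sleep=time.sleep):
    """Crawl several listing pages, refreshing cookies before each one.

    fetch_cookies() and fetch_page(page, cookies) are caller-supplied
    callables (hypothetical names). Returns the page numbers that were
    fetched without a network error.
    """
    done = []
    for page in pages:
        try:
            cookies = fetch_cookies()   # fresh cookies for every page
            fetch_page(page, cookies)
        except OSError as exc:          # requests' exceptions derive from OSError
            print("page %s failed: %s" % (page, exc))
            continue
        done.append(page)
        print("page %s done" % page)
        sleep(delay)                    # pause between pages
    return done
```

Passing the two steps in as callables keeps the loop independent of the network, so it can be tried out with stub functions before pointing it at the real site.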

2. Build the listing-page function

    import json
    import time
    import requests

    def textInformation(post_url, data, cookies):
        response = requests.post(post_url, headers=headers, data=data, cookies=cookies, timeout=3).text
        div1 = json.loads(response)
        # Job postings on this page
        position_data = div1["content"]["positionResult"]["result"]
        n = 1
        for result in position_data:
            infor = {
                "positionName": result["positionName"],

                "companyFullName": result["companyFullName"],
                "companySize": result["companySize"],
                "industryField": result["industryField"],
                "financeStage": result["financeStage"],

                "firstType": result["firstType"],
                "secondType": result["secondType"],
                "thirdType": result["thirdType"],

                "positionLables": result["positionLables"],

                "createTime": result["createTime"],

                "city": result["city"],
                "district": result["district"],
                "businessZones": result["businessZones"],

                "salary": result["salary"],
                "workYear": result["workYear"],
                "jobNature": result["jobNature"],
                "education": result["education"],

                "positionAdvantage": result["positionAdvantage"],
            }

            print(infor)
            time.sleep(5)
            print('----------Record %s written-------' % n)
            n += 1
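
Since the project goal is to visualize only salary, city, work years and education, each record can be trimmed down to those four fields before storage. The sketch below is illustrative only: `parse_salary` and `pick_target_fields` are hypothetical names, and the salary format "15k-25k" is an assumption about what the `salary` field contains.

```python
import re


def parse_salary(text):
    """Return (low, high) in thousands from a string like '15k-25k'.

    Returns None when the text does not match that assumed pattern.
    """
    match = re.fullmatch(r"\s*(\d+)[kK]\s*-\s*(\d+)[kK]\s*", text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def pick_target_fields(result):
    """Keep only the four fields the project visualizes."""
    return {
        "salary": parse_salary(result.get("salary") or ""),
        "city": result.get("city"),
        "workYear": result.get("workYear"),
        "education": result.get("education"),
    }
```

Turning the salary range into two integers makes it easier to compute averages or histograms later in the visualization step.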

3. Fetch the show_id for each category separately (used by the detail pages):

https://www.lagou.com/jobs/4254613.html?show=0977e2e185564709bebd04fe72a34c9f

    import json
    import requests

    show_id = []

    def getShowId(post_url, headers, cookies):
        data = {
            "first": "true",
            "pn": 1,
            "kd": "python",
        }
        response = requests.post(post_url, headers=headers, data=data, cookies=cookies).text
        div1 = json.loads(response)
        # Job postings on this page
        position_data = div1["content"]["positionResult"]["result"]
        # show_id used by the detail pages
        position_show_id = div1['content']['showId']
        show_id.append(position_show_id)
        # return position_show_id

4. Detail-page information

    import requests
    from lxml import etree

    def detailinformation(detail_id, show_id):
        get_url = "https://www.lagou.com/jobs/{}.html?show={}".format(detail_id, show_id)
        # time.sleep(2)
        # Fetch the detail page
        response = requests.get(get_url, headers=headers, timeout=5).text
        # print(response)
        html = etree.HTML(response)
        div1 = html.xpath("//div[@class='job-detail']/p/text()")
        # Job description: strip non-breaking spaces to clean the data
        position_list = [i.replace(u'\xa0', u'') for i in div1]
        # print(position_list)
        return position_list
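
To connect the listing response with the detail pages, the detail URLs can be built from one parsed response. This is a sketch under a stated assumption: each entry in `result` is assumed to carry a `positionId` key, which the article's code does not show. The helper name `detail_urls` is hypothetical; `showId` and the URL pattern come from the code above.

```python
def detail_urls(payload):
    """Build detail-page URLs from one parsed positionAjax.json response.

    Assumption: each entry in `result` has a `positionId` key. Entries
    without it are skipped.
    """
    content = payload["content"]
    show_id = content["showId"]
    template = "https://www.lagou.com/jobs/{}.html?show={}"
    return [
        template.format(item["positionId"], show_id)
        for item in content["positionResult"]["result"]
        if "positionId" in item
    ]
```

Each URL has the same shape as the example link in section 3.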

The complete code is on `GitHub`:

https://github.com/xbhog/studyProject

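4. Unresolved / unfinished issues

When detail pages are saved to mysql, some records come back with no data, possibly because of network jitter or requests being sent too frequently.

Multithreading is not used
The scrapy framework is not used
Classes and methods are not used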
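
One possible way to reduce the empty detail-page records mentioned above is to retry with a growing pause. This is not part of the original project, just a minimal sketch: `fetch_with_retry` is a hypothetical name, and `fetch` stands for any zero-argument callable, for example a lambda wrapping `detailinformation`.

```python
import time


def fetch_with_retry(fetch, retries=3, base_delay=2.0, sleep=time.sleep):
    """Call fetch() until it returns a non-empty result or retries run out.

    The wait doubles after each failed attempt. Returns the last
    result, which may still be empty.
    """
    result = []
    for attempt in range(retries):
        try:
            result = fetch()
        except OSError as exc:   # requests' exceptions derive from OSError
            print("attempt %d failed: %s" % (attempt + 1, exc))
            result = []
        if result:
            return result
        if attempt < retries - 1:
            sleep(base_delay * 2 ** attempt)
    return result
```

A call might look like `fetch_with_retry(lambda: detailinformation(detail_id, show_id))`. Whether this actually fixes the missing data depends on the real cause, which the article leaves open.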

------>Next installment<---------

Data storage (storage environment: ubuntu)

Mysql storage
csv storage

Data storage link: http://www.javashuo.com/article/p-orkpwslx-ns.html
