The following article is from 菜J學Python, by 小小明.
Free online video tutorials on Python web scraping, data analysis, website development, and more:
https://space.bilibili.com/523606542
Today we are going to scrape ByteDance's job listings.
Open the browser's developer tools and visit:
https://jobs.bytedance.com/experienced/position?keywords=&category=&location=&project=&type=&job_hot_flag=&current=1&limit=10
This page fires off a lot of requests; the posts endpoint is the one that returns the JSON data we need:
Inspecting the response headers, we find an important parameter, csrf:
This tells us that ByteDance's site performs CSRF validation; we will come back to how to obtain this CSRF token shortly.
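You can see the check in action with a quick probe (a hedged sketch; we assume the server rejects tokenless requests rather than returning job data):

import requests

# Hypothetical probe: call the posts API without any CSRF token.
# Because of the CSRF check, we expect a rejection instead of job data.
r = requests.post("https://jobs.bytedance.com/api/v1/search/job/posts")
print(r.status_code, r.text[:100])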
Looking at the request parameters:
To make the actual scraping convenient, we first need to organize the parameters above into a dict that Python can work with. Copying and pasting them directly would leave many places needing quotes, but we can solve that programmatically.
First, define a helper function to do the conversion:
import re

def warp_heareder(s):
    """Print a copied key-value block as a Python dict literal."""
    print("{")
    for line in s.splitlines():
        # Split only on the first ": " so values containing colons survive
        k, v = line.split(": ", 1)
        # Quote keys/values that contain letters; leave pure numbers bare
        if re.search("[a-zA-Z]", k):
            k = f'"{k}"'
        if re.search("[a-zA-Z]", v):
            v = f'"{v}"'
        print(f"    {k}: {v},")
    print("}")
Processing the request headers:
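For example, feeding it a few lines copied straight out of the developer tools (the values below are placeholders, not the site's real capture) prints a ready-to-paste dict:

raw = """Host: jobs.bytedance.com
Accept: application/json, text/plain, */*
portal_entrance: 1"""
warp_heareder(raw)

which prints:

{
    "Host": "jobs.bytedance.com",
    "Accept": "application/json, text/plain, */*",
    "portal_entrance": 1,
}

Note that the letterless value 1 is left unquoted, so it stays an integer. The same trick works for the Form Data block of the POST request.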
Processing the POST request data the same way:
Next, where does the CSRF token come from? First, clear the cookies:
Then refresh the page and watch the captured network requests:
After some digging, we finally find a response with a Set-Cookie header, and the cookie it sets includes the CSRF value. That means we can use this endpoint to obtain the CSRF token.
Use a requests session so the cookies set by the response headers are stored automatically:
import requests

session = requests.session()
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36',
    'Origin': 'https://jobs.bytedance.com',
    'Referer': 'https://jobs.bytedance.com/experienced/position?keywords=&category=&location=&project=&type=&job_hot_flag=&current=1&limit=10'
}
data = {
    "portal_entrance": 1
}
url = "https://jobs.bytedance.com/api/v1/csrf/token"
r = session.post(url, headers=headers, data=data)
r
Result:
<Response [200]>
Check the cookies we received:
cookies = session.cookies.get_dict()
cookies
Result:
{'atsx-csrf-token': 'RDTEznQqdr3O3h9PjRdWjfkSRW79K_G16g85FrXNxm0%3D'}
Clearly this token is URL-encoded relative to the value we actually need (note the %3D, which is '=' encoded), so URL-decode it:
from urllib.parse import unquote

unquote(cookies['atsx-csrf-token'])
Result:
'RDTEznQqdr3O3h9PjRdWjfkSRW79K_G16g85FrXNxm0='
With the token in hand, we can call the posts API directly:
import requests
import json

headers = {
    "Accept": "application/json, text/plain, */*",
    "Host": "jobs.bytedance.com",
    "Origin": "https://jobs.bytedance.com",
    "Referer": "https://jobs.bytedance.com/experienced/position?keywords=&category=&location=&project=&type=&job_hot_flag=&current=1&limit=10",
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36",
    "x-csrf-token": unquote(cookies['atsx-csrf-token']),
}
data = {
    "job_category_id_list": [],
    "keyword": "",
    "limit": 10,
    "location_code_list": [],
    "offset": 0,
    "portal_entrance": 1,
    "portal_type": 2,
    "recruitment_id_list": [],
    "subject_id_list": []
}
url = "https://jobs.bytedance.com/api/v1/search/job/posts"
r = session.post(url, headers=headers, data=json.dumps(data))
r
Result:
<Response [200]>
The status code is 200, which means we have passed the CSRF check. Now let's look at the structure of the data:
r.json()
Result:
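The full payload is fairly large. Instead of dumping all of it, here is a minimal sketch for exploring it level by level (only data and job_post_list are fields the code below actually relies on):

resp = r.json()
print(resp.keys())                               # top-level keys
print(resp['data'].keys())                       # should include job_post_list
print(resp['data']['job_post_list'][0].keys())   # fields of a single posting

The entries of job_post_list are what we load into a pandas DataFrame: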
import pandas as pd

df = pd.DataFrame(r.json()['data']['job_post_list'])
df.head(3)
Result:
Next we extract the fields we need from each column:
# Pull the city name out of the nested city_info dict
df.city_info = df.city_info.str['name']
# Recruit type: keep the parent category's name
df.recruit_type = df.recruit_type.str['parent'].str['name']
# Job category: "parent-child" when a parent category exists
tmp = []
for x in df.job_category.values:
    if x['parent']:
        tmp.append(f"{x['parent']['name']}-{x['name']}")
    else:
        tmp.append(x['name'])
df.job_category = tmp
# publish_time is a millisecond timestamp; convert to datetime
df.publish_time = df.publish_time.apply(lambda x: pd.Timestamp(x, unit="ms"))
df.head(2)
Result:
Then drop a few columns that are clearly of no use:
df.drop(columns=['sub_title', 'job_hot_flag', 'job_subject'], inplace=True)
df.head()
Result:
With the tests above working, we can put together the complete scraping script:
import requests
import json  # missing from the original snippet but required below
from urllib.parse import unquote
import pandas as pd
import time
import os

session = requests.session()
page = 1500  # postings to fetch per request

# Step 1: hit the csrf/token endpoint so the session picks up the cookie
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36',
    'Origin': 'https://jobs.bytedance.com',
    'Referer': f'https://jobs.bytedance.com/experienced/position?keywords=&category=&location=&project=&type=&job_hot_flag=&current=1&limit={page}'
}
data = {
    "portal_entrance": 1
}
url = "https://jobs.bytedance.com/api/v1/csrf/token"
r = session.post(url, headers=headers, data=data)
cookies = session.cookies.get_dict()

# Step 2: page through the posts API with the decoded token
url = "https://jobs.bytedance.com/api/v1/search/job/posts"
headers["x-csrf-token"] = unquote(cookies["atsx-csrf-token"])
data = {
    "job_category_id_list": [],
    "keyword": "",
    "limit": page,
    "location_code_list": [],
    "offset": 0,
    "portal_entrance": 1,
    "portal_type": 2,
    "recruitment_id_list": [],
    "subject_id_list": []
}
for i in range(11):
    print(f"Fetching page {i}")
    data["offset"] = i * page
    r = None
    while not r:
        try:
            r = session.post(url, headers=headers,
                             data=json.dumps(data), timeout=3)
        except Exception as e:
            print("Request timed out! Waiting 5s:", e)
            time.sleep(5)
    df = pd.DataFrame(r.json()['data']['job_post_list'])
    if df.shape[0] == 0:
        print("Scraping complete!")
        break
    # Same column cleanup as in the tests above
    df.city_info = df.city_info.str['name']
    df.recruit_type = df.recruit_type.str['parent'].str['name']
    tmp = []
    for x in df.job_category.values:
        if x['parent']:
            tmp.append(f"{x['parent']['name']}-{x['name']}")
        else:
            tmp.append(x['name'])
    df.job_category = tmp
    df.publish_time = df.publish_time.apply(
        lambda x: pd.Timestamp(x, unit="ms"))
    df.drop(columns=['sub_title', 'job_hot_flag', 'job_subject'], inplace=True)
    # Append to the CSV, writing the header only if the file doesn't exist yet
    df.to_csv("bytedance_jobs.csv", mode="a",
              header=not os.path.exists("bytedance_jobs.csv"), index=False)
    print(",".join(df.title.head(10)))

# Deduplicate the accumulated results
df = pd.read_csv("bytedance_jobs.csv")
df.drop_duplicates(inplace=True)
df.to_csv("bytedance_jobs.csv", index=False)
print("Scraped", df.shape[0], "unique rows")
Result:
It took only 7.3 seconds to scrape more than 10,000 ByteDance job postings.
Let's read the file back and take a look:
import pandas as pd

df = pd.read_csv("bytedance_jobs.csv")
df
Result:
There are more than 10,000 job postings in total.
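As a quick sanity check (an assumed follow-up, not part of the original write-up), you can count postings per city, since city_info now holds plain city names:

print(df['city_info'].value_counts().head(10))  # top 10 cities by posting count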