Python Web Scraping for Beginners: Scraping Python Job Postings with requests

Date: 2019-11-09

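Goal of the scraper

Use the requests and BeautifulSoup4 libraries to scrape Python-related job postings from Lagou (拉勾網).

Scraping tools

The Requests library sends the HTTP request; BeautifulSoup then parses the returned HTML document so the job information can be extracted.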
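As a minimal sketch of the parse-and-extract half of that pipeline, the snippet below hands BeautifulSoup a small inline HTML string instead of a live response. The markup and the variable names are made up for illustration and are not Lagou's actual page structure.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, standing in for the text of an HTTP response.
sample_html = """
<ul>
  <li><h3>Python Developer</h3><span class="money">15k-25k</span><em>Beijing</em></li>
  <li><h3>Data Engineer</h3><span class="money">20k-30k</span><em>Shanghai</em></li>
</ul>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# find_all() returns every matching tag in document order.
titles = [tag.get_text(strip=True) for tag in soup.find_all("h3")]
salaries = [tag.get_text(strip=True) for tag in soup.find_all("span", class_="money")]

print(titles)
print(salaries)
```

Working against a fixed string like this makes it possible to try out selectors before pointing the code at a real page.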

Scraping process

1. Request URL

https://www.lagou.com/zhaopin/Python/

2. Content to scrape

(1) Job title

(2) Salary

(3) Company location
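One way to keep those three fields together is a small record type. This is only a sketch: the class name `JobPosting`, its field names, and the sample values are illustrative assumptions and do not come from the original code, which stores each field in a separate list.

```python
from dataclasses import dataclass


@dataclass
class JobPosting:
    """One scraped row; the class and field names are illustrative."""
    title: str     # (1) job title
    salary: str    # (2) salary, kept as text because it is usually a range
    location: str  # (3) company location


# Made-up sample values, not real scraped data.
example = JobPosting(title="Python Developer", salary="15k-25k", location="Beijing")
print(example)
```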

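3. Inspect the HTML

Open Lagou in Firefox, log in, and press F12 to open the developer tools.

The panel now shows the HTML source of the page.

Next, find the markup that holds the job information, for example the job title.

In the upper-left corner of the developer tools there is an arrow icon. Click it, then click a job title on the page, and the corresponding markup is highlighted.

Once you know which markup to target, you also need the request headers.

Click the "Network" tab, then click a GET request. The User-Agent value at the bottom of its request headers is the one the scraper will send.

(The steps differ somewhat in Chrome and other browsers.)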
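To check how a copied User-Agent ends up on the outgoing request, the sketch below builds the request with `requests.Request(...).prepare()` but does not send it, so it runs without network access. The User-Agent string is the one used later in this article; replace it with the value from your own browser.

```python
import requests

url = "https://www.lagou.com/zhaopin/Python/"
headers = {
    # Paste the User-Agent value copied from the browser's developer tools here.
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537.36",
}

# Build the request without sending it, so the headers can be inspected offline.
prepared = requests.Request("GET", url, headers=headers).prepare()

print(prepared.method, prepared.url)
print(prepared.headers["User-Agent"])
```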

With those steps done, BeautifulSoup4 can be used to extract the text from the page.

Sending the request with requests

import io
import sys

import requests
from bs4 import BeautifulSoup

# Re-wrap stdout as GB18030 so Chinese text prints on a console that uses that encoding
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='gb18030')

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537.36',
}
# Send the GET request with the custom request headers
r = requests.get('https://www.lagou.com/zhaopin/Python/', headers=headers)
r.encoding = r.apparent_encoding  # use the encoding detected from the response body
result = r.text
bs = BeautifulSoup(result, 'html.parser')  # create a BeautifulSoup object
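The request above has no timeout and does not check the status code, so a slow or failing response goes unnoticed until parsing. A slightly more defensive variant might look like the sketch below. `fetch_html` is a hypothetical helper, not part of the original code, and its `get` parameter exists only so the function can be tried with a stand-in instead of a real network call.

```python
import requests


def fetch_html(url, headers, timeout=10, get=requests.get):
    """Hypothetical helper: return the decoded page text or raise on failure.

    `get` defaults to requests.get; it is a parameter only so the function
    can be exercised with a stand-in during testing.
    """
    response = get(url, headers=headers, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx
    response.encoding = response.apparent_encoding  # detected from the body
    return response.text
```

With this helper, `result = fetch_html('https://www.lagou.com/zhaopin/Python/', headers)` would stand in for the `requests.get` call and the two lines that follow it.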

Extracting the page data with BeautifulSoup

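# Collect each field into its own list.
# get_text() also works when a tag has nested children, where .string would be None.
names = [h3.get_text(strip=True) for h3 in bs.find_all('h3')]  # text of every <h3> tag
salaries = [span.get_text(strip=True)
            for span in bs.find_all('span', attrs={'class': 'money'})]
places = [em.get_text(strip=True) for em in bs.find_all('em')]

print("Job title:", "Salary:", "Location:")
# zip() stops at the shortest list, so a row is printed only while all three fields exist
for name, salary, place in zip(names, salaries, places):
    print(name, salary, place)
print()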
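The three lists above are collected independently and paired by position, so a single extra `h3` or `em` elsewhere on the page shifts every row after it. A sketch of one alternative is to loop over each posting's container and read the three fields inside it, then write the rows out as CSV. The sample markup, the `job-item` class, and the helper `text_or_empty` are assumptions for illustration; inspect the real page to find the actual container element.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical markup: one <li class="job-item"> per posting.
sample_html = """
<ul>
  <li class="job-item"><h3>Python Developer</h3><span class="money">15k-25k</span><em>Beijing</em></li>
  <li class="job-item"><h3>Data Engineer</h3><span class="money">20k-30k</span><em>Shanghai</em></li>
  <li class="job-item"><h3>Intern</h3><em>Shenzhen</em></li>
</ul>
<h3>Unrelated heading</h3>
"""


def text_or_empty(tag):
    """Return the tag's stripped text, or '' when the tag was not found."""
    return tag.get_text(strip=True) if tag is not None else ""


soup = BeautifulSoup(sample_html, "html.parser")

rows = []
for item in soup.find_all("li", class_="job-item"):
    # Searching inside the container keeps the fields of one posting together.
    rows.append({
        "title": text_or_empty(item.find("h3")),
        "salary": text_or_empty(item.find("span", class_="money")),
        "location": text_or_empty(item.find("em")),
    })

# Write to an in-memory buffer; pass an open file instead to save to disk.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["title", "salary", "location"])
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

The posting without a salary still produces a row with an empty salary field, and the unrelated heading outside the list is ignored.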