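I have been learning data analysis recently, so I tried scraping the job postings on these two sites to use as analysis material. I ran into quite a few pitfalls along the way, so I am writing them down here.
For the framework I chose Scrapy, since it is fairly simple, and created two spider files, one for each site.
Let's start with BOSS直聘:
I found plenty of BOSS直聘 examples online and assumed it would be easy, just a matter of faking a request header... but once I got into it, it turned out to be nothing like that.
As usual, first define the data to collect in items.py: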
    import scrapy


    class PositionViewItem(scrapy.Item):
        # define the fields for your item here like:
        name: scrapy.Field = scrapy.Field()  # job title
        salary: scrapy.Field = scrapy.Field()  # salary
        education: scrapy.Field = scrapy.Field()  # education
        experience: scrapy.Field = scrapy.Field()  # experience
        jobjd: scrapy.Field = scrapy.Field()  # job ID
        district: scrapy.Field = scrapy.Field()  # district
        category: scrapy.Field = scrapy.Field()  # industry category
        scale: scrapy.Field = scrapy.Field()  # company size
        corporation: scrapy.Field = scrapy.Field()  # company name
        url: scrapy.Field = scrapy.Field()  # job posting URL
        createtime: scrapy.Field = scrapy.Field()  # publication time
        posistiondemand: scrapy.Field = scrapy.Field()  # job responsibilities
        cortype: scrapy.Field = scrapy.Field()  # company type
The above defines the item. With the required values planned out, each one is simply declared as a plain scrapy.Field() for now.
    # Attributes defined in the spider class body (requires: from typing import Dict)
    name: str = 'DA'
    # Start URL: the search results page reached after entering BOSS直聘,
    # with the search parameters set to data analysis, nationwide.
    url: str = 'https://www.zhipin.com/c100010000/?query=%E6%95%B0%E6%8D%AE&page=10'
    # Cookies
    cookies: Dict = {
        "__zp_stoken__": "bf79ElaZ4z7IK5JruWAX5j256l7CJf3k7Ag2A9mrsSPN%2FnLgjChK0LguCrB%2FtIEFMKdnysNhr4ilqIicjeHkCsCpBQ%3D%3D"
    }
    # Request headers
    headers: Dict = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:69.0) Gecko/20100101 Firefox/69.0',
        'Referer': 'https://www.zhipin.com/web/common/security-check.html?seed=6gkgYHovIokVntQcwXUH9KW3%2FbEZsqfeaoCctIp1rE8%3D&name=f2d51032&ts=1571623520634&callbackUrl=%2Fjob_detail%2F%3Fquery%3D%25E6%2595%25B0%25E6%258D%25AE%25E5%2588%2586%25E6%259E%2590%26city%3D100010000%26industry%3D%26position%3D&srcReferer=https%3A%2F%2Fwww.zhipin.com%2Fjob_detail%2F%3Fquery%3D%25E6%2595%25B0%25E6%258D%25AE%25E5%2588%2586%25E6%259E%2590%26city%3D100010000%26industry%3D%26position%3D'
    }
With the common parameters set, define a start_requests method that issues the first request to the start URL:
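    # Requires: from typing import Iterator; from scrapy import Request
    def start_requests(self) -> Iterator[Request]:
        # Yield the first request; with no callback given, the default callback is used.
        # Arguments: the url defined earlier, the request headers, and the cookies.
        yield Request(self.url, headers=self.headers, cookies=self.cookies)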
The default callback in Scrapy is parse, so define a parse method to receive the response and then parse it directly with XPath:
    # Requires: from typing import Iterator; from urllib.parse import urljoin;
    #           from scrapy import Request, selector
    def parse(self, response) -> Iterator:
        if response.status == 200:
            PositionInfos: selector.SelectorList = response.selector.xpath(r'//div[@class="job-primary"]')
            for positioninfo in PositionInfos:
                pvi = PositionViewItem()
                pvi['name'] = ''.join(positioninfo.xpath(r'div[@class="info-primary"]/h3[@class="name"]/a/div[@class="job-title"]/text()').extract())
                pvi['salary'] = ''.join(positioninfo.xpath(r'div[@class="info-primary"]/h3[@class="name"]/a/span[@class="red"]/text()').extract())
                pvi['education'] = ''.join(positioninfo.xpath(r'div[@class="info-primary"]/p/text()').extract()[2])
                pvi['experience'] = ''.join(positioninfo.xpath(r'div[@class="info-primary"]/p/text()').extract()[1])
                pvi['district'] = ''.join(positioninfo.xpath(r'div[@class="info-primary"]/p/text()').extract()[0])
                pvi['corporation'] = ''.join(positioninfo.xpath(r'div[@class="info-company"]/div[@class="company-text"]/h3[@class="name"]/a/text()').extract())
                pvi['category'] = ''.join(positioninfo.xpath(r'div[@class="info-company"]/div[@class="company-text"]/p/text()').extract()[0])
                # The company line sometimes has only two text parts, so fall back to index 1.
                try:
                    pvi['scale'] = ''.join(positioninfo.xpath(r'div[@class="info-company"]/div[@class="company-text"]/p/text()').extract()[2])
                except IndexError:
                    pvi['scale'] = ''.join(positioninfo.xpath(r'div[@class="info-company"]/div[@class="company-text"]/p/text()').extract()[1])
                pvi['url'] = ''.join(positioninfo.xpath(r'div[@class="info-primary"]/h3[@class="name"]/a/@href').extract())
                yield pvi
            # Follow the "next page" link, if there is one.
            nexturl = response.selector.xpath(r'//a[@ka="page-next"]/@href').extract()
            if nexturl:
                nexturl = urljoin(self.url, ''.join(nexturl))
                print(nexturl)
                yield Request(nexturl, headers=self.headers, cookies=self.cookies, callback=self.parse)
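The .extract() call after an XPath selector returns a list containing every element the selector matched. If nothing matches, the list is empty, so a statement that indexes into it (such as [0]) raises an IndexError instead of yielding an empty value!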
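To avoid an IndexError when a list returned by .extract() is shorter than expected (the scale fallback in parse is one such case), the indexing can be wrapped in a small helper. This is only a minimal sketch: the name pick is hypothetical and not part of Scrapy, and plain lists with made-up values stand in for real .extract() results.

```python
from typing import List, Sequence


def pick(values: Sequence[str], *indexes: int, default: str = '') -> str:
    """Return the first values[i] that exists, trying the indexes in order."""
    for i in indexes:
        if -len(values) <= i < len(values):
            return values[i]
    return default


# Plain lists standing in for what .extract() might return (made-up sample data).
three_parts: List[str] = ['Internet', 'Series B', '100-499 employees']
two_parts: List[str] = ['Internet', '100-499 employees']
nothing: List[str] = []

print(pick(three_parts, 2, 1))    # index 2 exists, so it is used
print(pick(two_parts, 2, 1))      # index 2 is missing, falls back to index 1
print(repr(pick(nothing, 2, 1)))  # nothing matched, returns the default
```

With such a helper, the try/except around scale could be written as a single call, at the cost of hiding the difference between "field missing" and "field empty".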
yield pvi hands the populated item over to the pipelines, so that the collected data can be processed there.
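As an illustration of what a pipeline can do with each yielded item, here is a minimal sketch. The class name StripWhitespacePipeline is hypothetical and does not come from the original project; the process_item(self, item, spider) method is the hook Scrapy calls on a pipeline. A plain dict with made-up values stands in for PositionViewItem so the snippet runs without Scrapy.

```python
class StripWhitespacePipeline:
    """Hypothetical pipeline: trims whitespace from every string field."""

    def process_item(self, item, spider):
        for key in list(item.keys()):
            value = item[key]
            if isinstance(value, str):
                item[key] = value.strip()
        # Returning the item passes it on to the next pipeline, if any.
        return item


# A plain dict stands in for PositionViewItem; items support the same
# dict-style access for their declared fields.
sample = {'name': '  Data Analyst \n', 'salary': '15-25K ', 'url': '/job_detail/example.html'}
cleaned = StripWhitespacePipeline().process_item(sample, spider=None)
print(cleaned)
```

In a real project the class would live in pipelines.py and be enabled through the ITEM_PIPELINES setting in settings.py.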
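Once nexturl = response.selector.xpath(r'//a[@ka="page-next"]/@href').extract() has fetched the link to the next page, use urljoin from urllib.parse to merge it with the source URL, because the scraped link is not a complete URL but something like
/c101010100/?query=%E6%95%B0%E6%8D%AE%E5%88%86%E6%9E%90&page=2 and therefore has to be merged with urljoin. The merge rule is as follows: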
    url = 'http://ip/'
    path = 'api/user/login'
    urljoin(url, path)  # the joined path is 'http://ip/api/user/login'
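The same rule applied to the URLs from this spider can be checked with the standard library alone. The sketch below uses the start URL and the sample next-page link quoted above; the two extra calls with 'http://ip/api' are illustrative additions showing how a trailing slash on the base changes the result.

```python
from urllib.parse import urljoin

base = 'https://www.zhipin.com/c100010000/?query=%E6%95%B0%E6%8D%AE&page=10'
next_href = '/c101010100/?query=%E6%95%B0%E6%8D%AE%E5%88%86%E6%9E%90&page=2'

# A link starting with "/" replaces the whole path and query of the base URL.
next_url = urljoin(base, next_href)
print(next_url)

# A relative path is resolved against the base up to its last "/",
# so a trailing slash on the base changes the result.
with_slash = urljoin('http://ip/api/', 'user/login')
without_slash = urljoin('http://ip/api', 'user/login')
print(with_slash)
print(without_slash)
```

Because the scraped href starts with "/", only the scheme and host of the base URL survive the join.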
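I thought that would be the end of it and ran the spider with scrapy crawl followed by the spider name, only to find that no data came back: the request was 302-redirected straight to a security-check page.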
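One way to notice this situation in code is to inspect the URL a request ended up at. The following is a minimal sketch using only the standard library; the function names is_security_check and callback_target are hypothetical, and the sample URL is a shortened, made-up one in the same shape as the Referer used in the spider.

```python
from typing import Optional
from urllib.parse import parse_qs, urlparse


def is_security_check(url: str) -> bool:
    """True if the URL path looks like the security-check page."""
    return 'security-check' in urlparse(url).path


def callback_target(url: str) -> Optional[str]:
    """Return the decoded callbackUrl query parameter, if present."""
    values = parse_qs(urlparse(url).query).get('callbackUrl')
    return values[0] if values else None


# Shortened, made-up example in the shape of the Referer used in the spider.
redirected = ('https://www.zhipin.com/web/common/security-check.html'
              '?seed=abc&name=f2d51032&ts=1571623520634'
              '&callbackUrl=%2Fjob_detail%2F%3Fquery%3Ddata%26city%3D100010000')

print(is_security_check(redirected))
print(callback_target(redirected))
```

Inside a spider, a check like this could be applied to response.url in parse. To see the 302 response itself rather than the page it leads to, Scrapy offers the dont_redirect request meta key together with handle_httpstatus_list.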
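Open Fiddler to inspect the request flow:
As you can see, it captures the entire query process: the address is first requested directly, then the request is redirected to the security-check page, and then it switches back to the returned page. It looks fine, but a closer look shows that the __zp_stoken__ value in the cookies has changed:
So the picture is clear: the security-check call presumably writes back a token, and subsequent requests are judged against this latest token. From a quick look, the token appears to be encrypted and written back by a piece of JavaScript. Someone on Zhihu has written up a way to decrypt it, but I don't know front-end development well, so I gave up...
The link to that article: https://zhuanlan.zhihu.com/p/83235220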
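The change in the token can also be checked programmatically by comparing the cookie that was sent with the one that comes back. This is a minimal sketch with the standard library: the helper name cookie_value is hypothetical, the header strings are made up (real tokens are long opaque strings), and the cookie name __zp_stoken__ is taken from the cookies dict in the spider.

```python
from http.cookies import SimpleCookie
from typing import Optional


def cookie_value(header: str, name: str) -> Optional[str]:
    """Extract one cookie's value from a Cookie or Set-Cookie header string."""
    jar = SimpleCookie()
    jar.load(header)
    return jar[name].value if name in jar else None


# Made-up header values for illustration only.
sent = '__zp_stoken__=oldtoken123'
received = '__zp_stoken__=newtoken456; Path=/; Domain=.zhipin.com'

old = cookie_value(sent, '__zp_stoken__')
new = cookie_value(received, '__zp_stoken__')
print(old, new, old != new)
```

This only detects that the token changed. It does not reproduce how the new value is computed, which is the JavaScript step the linked article covers.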
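This token can only be obtained by refreshing the page manually, and it usually survives just a few requests before it expires and has to be fetched again. Then again, even by hand you can only get through about 10 pages, since nothing further is shown without logging in, so it does not matter much.
I later tried simulating a browser with Selenium, and that failed as well.
All in all, this was not very successful, and I don't recommend it for now...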