I've been learning Python web scraping for a while, and a stretch of vacation with nothing to do was a good chance to write up one scraping run.
1. Development tools
PyCharm 2017
Python 2.7.10
requests
pymongo
2. Scraping goals
(1) Scrape job postings related to Python.
(2) Lagou only displays 30 pages of search results, 15 postings per page, so scraping them all yields 450 records.
(3) Store the results in MongoDB.
3. Results
4. Scraping process
(1) Open Lagou in a browser (https://www.lagou.com), search for python, and capture the traffic with the developer tools open to find the Request URL that actually returns the data. After some digging, the URL turns out to be https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false&isSchoolJob=0, and it returns data in JSON format.
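Before wiring anything into a class, a quick probe can confirm the endpoint behaves as captured. This is a minimal sketch, assuming the form fields first/pn/kd and the content -> positionResult -> result path still match what the packet capture showed:

import requests

url = 'https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false&isSchoolJob=0'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.108 Safari/537.36',
    'Referer': 'https://www.lagou.com/jobs/list_python?px=default&city=%E5%85%A8%E5%9B%BD',
    'X-Requested-With': 'XMLHttpRequest',
}
# first marks the first search, pn is the page number, kd is the search keyword
resp = requests.post(url, data={'first': 'true', 'pn': 1, 'kd': 'python'}, headers=headers)
positions = resp.json()['content']['positionResult']['result']
print('%s postings on page 1' % len(positions))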
Based on the above analysis, we can write the initialization code: set the request URL and the request headers. The headers can be copied from the captured request. When crawling with a program, one important job is to impersonate a browser; otherwise it is hard to get data out of a site with anti-scraping measures. Of course, setting the right headers is only one part of dealing with anti-scraping.
def __init__(self):
    self.headers = {}
    self.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.108 Safari/537.36'
    self.headers['Host'] = 'www.lagou.com'
    self.headers['Referer'] = 'https://www.lagou.com/jobs/list_python?px=default&city=%E5%85%A8%E5%9B%BD'
    self.headers['X-Anit-Forge-Code'] = '0'
    # requests drops a header whose value is None, so this one is not actually sent
    self.headers['X-Anit-Forge-Token'] = None
    self.headers['X-Requested-With'] = 'XMLHttpRequest'
    self.url = 'https://www.lagou.com/jobs/positionAjax.json?px=default&needAddtionalResult=false&isSchoolJob=0'
(2) Each returned record carries quite a lot of fields; we keep only the ones we want: salary, city, years of experience, position, job description, education, full company name, posting time, and so on.
result = {}
result['positionName'] = position['positionName']
result['createTime'] = position['createTime']
result['secondType'] = position['secondType']
result['city'] = position['city']
result['workYear'] = position['workYear']
result['education'] = position['education']
result['companyFullName'] = position['companyFullName']
result['financeStage'] = position['financeStage']
result['jobNature'] = position['jobNature']
result['salary'] = position['salary']
result['jobdescriptions'] = Lagou.get_detail(position['positionId'])
result['positionLables'] = position['positionLables']
result['positionId'] = position['positionId']
One point worth spelling out: the job description lives on the detail page. Observation shows that detail pages are named after the positionId (as positionId.html), so to open a detail page and grab the description we first need the positionId. You can see this in the line result['jobdescriptions'] = Lagou.get_detail(position['positionId']) above.
The code for fetching the list pages and the detail pages is as follows:
def get_data(self, page):
    """Fetch one page of search results from the Ajax endpoint."""
    while 1:
        try:
            form_data = {
                'first': 'true',
                'pn': page,      # page number
                'kd': 'python'   # search keyword
            }
            result = requests.post(self.url, data=form_data, headers=self.headers)
            return result.json()['content']['positionResult']['result']
        except:
            # on any failure (network error, unexpected JSON), wait a second and retry
            time.sleep(1)

def get_detail(self, positionId):
    """Fetch the detail page named <positionId>.html and strip the HTML tags."""
    url = 'https://www.lagou.com/jobs/%s.html' % positionId
    text = requests.get(url, headers=self.headers).text
    jobdescriptions = re.findall(r'<h3 class="description">(.*?)</div>', text, re.S)
    jobdescriptions = [re.sub(r'</?h3>|</?div>|</?p>|</?br>|^\s+|</?li>', '', j) for j in jobdescriptions]
    return jobdescriptions
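With get_data and get_detail in place, a driver loop can walk all 30 pages. The sketch below is hypothetical: the class name Lagou and the parse_position helper (standing in for the field-extraction code shown earlier) are assumptions; the actual driver is in the GitHub repo linked at the end.

import time

def run():
    spider = Lagou()                   # hypothetical class holding the methods above
    records = []
    for page in range(1, 31):          # Lagou only exposes 30 result pages
        for position in spider.get_data(page):
            # parse_position is a hypothetical wrapper around the
            # result-dict construction shown earlier
            records.append(spider.parse_position(position))
        time.sleep(1)                  # pause between pages to stay polite
    return records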
The MongoDB part is written as a separate module that the main code calls:
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import pymongo

class MongodbOPT:
    def __init__(self, host, port, db, passwd, collection):
        self.host = host
        self.port = str(port)
        self.db = db
        self.passwd = passwd
        self.collection = collection

    def getClient(self):
        """Return a MongoDB client object."""
        # the username here is the same as the database name
        conn = 'mongodb://' + self.db + ':' + self.passwd + '@' + self.host + ':' + self.port + '/' + self.db
        return pymongo.MongoClient(conn)

    def getDB(self):
        """Return a handle to the database."""
        client = self.getClient()
        db = client['%s' % (self.db)]
        return db

    def insertData(self, bsonData):
        """Insert data (a single document or a list of documents)."""
        if self.db:
            db = self.getDB()
            collections = self.collection
            if isinstance(bsonData, list):
                result = db.get_collection(collections).insert_many(bsonData)
                return result.inserted_ids
            return db.get_collection(collections).insert_one(bsonData).inserted_id
        else:
            return None

    def findAll(self, **kwargs):
        if self.db:
            collections = self.collection
            db = self.getDB()
            def findAllDataQuery(self, dataLimit=0, dataSkip=0, dataQuery=None, dataSortQuery=None, dataProjection=None):
                return db.get_collection(collections).find(filter=dataQuery, projection=dataProjection, skip=dataSkip,
                                                           limit=dataLimit, sort=dataSortQuery)
            return findAllDataQuery(self, **kwargs)

    def updateData(self, oldData=None, **kwargs):
        if self.db:
            collections = self.collection
            db = self.getDB()
            def updateOne(self, oneOldData=None, oneUpdate=None, oneUpsert=False):  # update a single document
                result = db.get_collection(collections).update_one(filter=oneOldData, update=oneUpdate,
                                                                   upsert=oneUpsert)
                return result.matched_count
            def updateMany(self, manyOldData, manyUpdate=None, manUpsert=False):  # update all matching documents
                result = db.get_collection(collections).update_many(filter=manyOldData, update=manyUpdate,
                                                                    upsert=manUpsert)
                return result.matched_count
            if oldData:
                oneup = kwargs.get("oneUpdate", "")
                manyup = kwargs.get("manyUpdate", "")
                if oneup:
                    return updateOne(self, oldData, **kwargs)
                elif manyup:
                    return updateMany(self, oldData, **kwargs)
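A minimal usage sketch for the module (host, port, password, and collection name below are placeholders, and records stands for the list of result dicts built during scraping):

mongo = MongodbOPT(host='127.0.0.1', port=27017, db='lagou',
                   passwd='yourpassword', collection='python_jobs')
# insert all scraped postings in one batch
ids = mongo.insertData(records)
print('inserted %s documents' % len(ids))
# read back the five newest postings
for doc in mongo.findAll(dataLimit=5, dataSortQuery=[('createTime', -1)]):
    print(doc['positionName'] + ' ' + doc['salary'])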
The complete code is on GitHub; if you need it, head over to https://github.com/Eivll0m/PythonSpider/tree/master/lagou