This Scrapy framework took me quite a while, but hard work pays off: I now more or less understand its whole flow. Below is my scraping code; I won't go into much detail.
You'll have to dig up references and digest it yourself; chew on it long enough and it clicks.
The framework really is very good: fast and thorough. Last time, using Requests, I only scraped 200-odd postings; this time it was nearly 800. Very nice!
You don't actually need to understand the internals; knowing how to use the framework is enough, since I'm not going to be a crawler engineer anyway. If you want the internals, go read Scrapy's source code.
Below is the code from the Scrapy project, starting with the file in the spiders folder.
Settings:
Nothing else needs to be changed.
ITEMS:
```python
import scrapy
from scrapy.item import Item, Field


class Lagou2Item(scrapy.Item):
    name = Field()
    location = Field()
    position = Field()
    exprience = Field()
    money = Field()
```
The spider code:
```python
import json

import scrapy
from scrapy.http import FormRequest

from lagou2.items import Lagou2Item


class LgSpider(scrapy.Spider):
    name = 'lg'
    #allowed_domains = ['www.lagou.com']
    start_urls = ['http://www.lagou.com/']
    custom_settings = {
        "DEFAULT_REQUEST_HEADERS": {
            'Accept': 'application/json, text/javascript, */*; q=0.01',
            'Accept-Encoding': 'gzip, deflate, br',
            'Accept-Language': 'zh-CN,zh;q=0.9',
            'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
            'Host': 'www.lagou.com',
            'Origin': 'https://www.lagou.com',
            'Referer': 'https://www.lagou.com/jobs/list_%E6%95%B0%E6%8D%AE%E5%88%86%E6%9E%90?px=default&city=%E5%85%A8%E5%9B%BD',
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36',
            'X-Anit-Forge-Code': '0',
            'X-Anit-Forge-Token': 'None',
            'X-Requested-With': 'XMLHttpRequest'
        },
        "ITEM_PIPELINES": {
            'lagou2.pipelines.LagouPipeline': 300
        }
    }

    def start_requests(self):
        # change the city parameter to switch cities ('全國' = nationwide)
        url = "https://www.lagou.com/jobs/positionAjax.json?px=default&needAddtionalResult=false&city=全國"
        requests = []
        for i in range(1, 60):
            # change the kd parameter to switch keywords ('數據分析' = data analysis)
            formdata = {'first': 'false', 'pn': str(i), 'kd': '數據分析'}
            request = FormRequest(url, callback=self.parse_model, formdata=formdata)
            requests.append(request)
            print(request)
        return requests

    def parse_model(self, response):
        print(response.body.decode())
        json_body = json.loads(response.body.decode())
        results = json_body['content']['positionResult']['result']
        items = []
        for result in results:
            item = Lagou2Item()
            item['name'] = result['companyFullName']
            item['location'] = result['city']
            item['position'] = result['positionName']
            item['exprience'] = result['workYear']
            item['money'] = result['salary']
            items.append(item)
        return items
```
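The Ajax endpoint returns JSON nested as `content` → `positionResult` → `result`, which is the path `parse_model` walks. Here is a minimal standalone sketch of that extraction, using a made-up sample payload (the keys match what the spider reads; the values are invented for illustration):

```python
import json

# A made-up sample of the positionAjax.json payload; only the keys
# that parse_model reads are present, and the values are invented.
sample_body = json.dumps({
    "content": {"positionResult": {"result": [
        {"companyFullName": "Example Co.", "city": "Beijing",
         "positionName": "Data Analyst", "workYear": "1-3 years",
         "salary": "15k-25k"},
    ]}}
})


def extract_items(body):
    # mirror parse_model: walk content -> positionResult -> result
    results = json.loads(body)["content"]["positionResult"]["result"]
    return [{"name": r["companyFullName"], "location": r["city"],
             "position": r["positionName"], "exprience": r["workYear"],
             "money": r["salary"]} for r in results]


items = extract_items(sample_body)
```

Testing the extraction against a fixed sample like this is handy when the real site changes its markup or blocks you.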
PIPELINES:
```python
import json
import codecs

from openpyxl import Workbook


class LagouPipeline(object):
    def __init__(self):
        self.workbook = Workbook()
        self.ws = self.workbook.active
        # header row: company, location, position, experience, salary
        self.ws.append(['公司名稱', '工做地點', '職位名稱', '經驗要求', '薪資待遇'])
        #self.file = codecs.open('lagou2.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # pull each field out of the item in column order
        line = [item['name'], item['location'], item['position'],
                item['exprience'], item['money']]
        self.ws.append(line)
        self.workbook.save('lagou2.xlsx')  # save the xlsx file
        #line = json.dumps(dict(item), ensure_ascii=False) + "\n"
        #self.file.write(line)
        return item

    # only needed if the commented-out JSON file output above is enabled,
    # otherwise self.file does not exist and this would raise
    #def spider_closed(self, spider):
    #    self.file.close()
```
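If you would rather skip the openpyxl dependency, the same row-building logic can write CSV with only the standard library. This is a hedged sketch, not part of the original project; the function names and the sample item are mine:

```python
import csv
import io

# same column order the pipeline uses for its spreadsheet
HEADER = ["company", "location", "position", "experience", "salary"]


def item_to_row(item):
    # flatten one scraped item (a dict-like object) into a row
    return [item["name"], item["location"], item["position"],
            item["exprience"], item["money"]]


def write_items(fileobj, items):
    writer = csv.writer(fileobj)
    writer.writerow(HEADER)
    for item in items:
        writer.writerow(item_to_row(item))


# demo with an invented item
buf = io.StringIO()
write_items(buf, [{"name": "Example Co.", "location": "Beijing",
                   "position": "Data Analyst", "exprience": "1-3 years",
                   "money": "15k-25k"}])
rows = buf.getvalue().splitlines()
```

Unlike the xlsx pipeline, which re-saves the whole workbook on every item, a CSV writer appends one row at a time, which scales better for long crawls.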
Results:
Now let's do the visualization, working in a Jupyter notebook:
```python
import re

import numpy as np
import pandas as pd               # dataframe operations
import matplotlib as mpl          # font configuration
import matplotlib.pyplot as plt   # plotting
from pyecharts import Geo         # geographic chart
import xlrd
```
```python
mpl.rcParams['font.sans-serif'] = ['SimHei']  # needed so Chinese labels render on the axes
# plotting style
plt.rcParams['axes.labelsize'] = 8.
plt.rcParams['xtick.labelsize'] = 12.
plt.rcParams['ytick.labelsize'] = 12.
plt.rcParams['legend.fontsize'] = 10.
plt.rcParams['figure.figsize'] = [8., 8.]
```
```python
# if this errors, try utf8; the path must not contain Chinese characters
data = pd.read_excel(r'E:\scrapyanne\lagou2\lagou2\spiders\lagou2.xlsx', encoding='gbk')
```
```python
data['經驗要求'].value_counts().plot(kind='barh')  # horizontal bar chart of experience requirements
plt.show()  # display the figure
```
```python
# explode must have one entry per slice (32 cities here)
data['工做地點'].value_counts().plot(kind='pie', autopct='%1.2f%%',
                                     explode=np.linspace(0, 1.5, 32))
plt.show()  # display the figure
```
```python
# the lambda splits the salary string on 'k'/'K' and keeps only the leading
# number; since salaries are listed as e.g. '15k-25k', we multiply by 1000
data2 = list(map(lambda x: (data['工做地點'][x],
                            eval(re.split('k|K', data['薪資待遇'][x])[0]) * 1000),
                 range(len(data))))
```
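The same parsing step as a standalone function, so it can be tested against a few sample strings. The function name is mine, and I substitute `float()` for the `eval()` used above, which handles values like `'1.5k'` without evaluating arbitrary text:

```python
import re


def salary_lower_bound(salary):
    # '15k-25k' -> ['15', '-25', ''] -> 15 -> 15000 yuan
    return int(float(re.split("k|K", salary)[0]) * 1000)
```

Taking only the lower bound of the range is a simplification; the midpoint of the range would be another reasonable choice.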
```python
# wrap data2 in a DataFrame
data3 = pd.DataFrame(data2)
data3
```
```python
# convert into the (name, value) pairs Geo needs, again with a lambda:
# group data3 by city (column 0), average the salary (column 1),
# then pair each city with its mean
data4 = list(map(lambda x: (data3.groupby(0).mean()[1].index[x],
                            data3.groupby(0).mean()[1].values[x]),
                 range(len(data3.groupby(0)))))
```
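The `groupby(0).mean()` step boils down to averaging the salary values per city. A plain-Python sketch of the same aggregation, with invented sample pairs, makes the logic explicit:

```python
from collections import defaultdict


def average_by_city(pairs):
    # pairs: (city, salary) tuples, like data2 above
    sums = defaultdict(lambda: [0.0, 0])
    for city, salary in pairs:
        sums[city][0] += salary
        sums[city][1] += 1
    return {city: total / count for city, (total, count) in sums.items()}


averages = average_by_city([("Beijing", 20000), ("Beijing", 10000),
                            ("Hangzhou", 12000)])
```

The pandas version also recomputes `data3.groupby(0).mean()` inside the lambda on every iteration; computing it once and reusing it would be noticeably faster on larger data.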
```python
# Geo(title, subtitle, title_color, title_pos, width, height, background_color)
geo = Geo("全國數據分析工資分佈", "製作:風吹白楊的安妮",
          title_color="#fff", title_pos="center",
          width=1200, height=600, background_color='#404a59')
# cast() splits the (name, value) pairs into attributes and values,
# e.g. Beijing -> 15000, Hangzhou -> 10000
attr, value = geo.cast(data4)
# first argument is the series name (left empty here); then the attributes
# and their values; visual_range rescales the salary range onto 0-300;
# maptype='china' is easy to forget but required (my map kept failing even
# after installing the map package until I added it); type='heatmap' picks
# the chart type; visual_text_color sets the map text to white;
# symbol_size sets the marker size; is_visualmap=True enables the visual map
geo.add("", attr, value, visual_range=[0, 300], maptype='china', type='heatmap',
        visual_text_color="#fff", symbol_size=15, is_visualmap=True)
geo
```
All done!
Some features, like education requirements, I didn't scrape. Not doing it, I'm exhausted!
And I figured out that pretty pie chart too!