Scraping Zhilian Zhaopin (智聯招聘) Job Postings with Python for Career Planning

Date: 2019-11-09


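Last semester I wrote a crawler that scrapes job postings from Zhilian Zhaopin for a presentation in my lab.

The workflow breaks down into five steps: scraping the postings, structuring the data, storing it in a database, segmenting the descriptions and counting the required skills, and visualizing the results.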

1. Scraping the data

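import re
import requests

job = "通訊工程師"  # example: scrape postings for "communications engineer"
leibie = '1'  # category code, stored in the `leibie` column later
url_job = []  # collected posting URLs

# Regex that pulls each posting's URL out of a search result page
pattern = re.compile('ssidkey=y&amp;ss=201&amp;ff=03" href="(.*?)" target="_blank"', re.S)

for page in range(99):
    x = str(page)  # page number used in the URL
    p = str(page + 1)  # 1-based page number for the progress message
    print("Fetching page " + p + "...\n")
    # Search URL; this one is an example, adjust it to your own needs
    url = "http://sou.zhaopin.com/jobs/searchresult.ashx?in=210500%3B160400%3B160000%3B160500%3B160200%3B300100%3B160100%3B160600&jl=上海%2B杭州%2B北京%2B廣州%2B深圳&kw=" + job + "&p=" + x + "&isadv=0"
    r = requests.post(url)  # send the request
    data = r.text
    tmp_job = re.findall(pattern, data)
    url_job.extend(tmp_job)  # add to the queue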

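The code above takes "communications engineer" (通訊工程師) postings in Shanghai, Hangzhou, Beijing, Guangzhou, and Shenzhen as an example and collects the URL of every posting that the Zhilian Zhaopin search returns.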
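The two pieces of that loop, building the search URL and pulling posting links out of the returned HTML, can be sketched as separate functions so each is easy to check without touching the network. This is a minimal sketch, not the original script: the function names `build_search_url` and `extract_job_urls` are made up for illustration, and the sample link in the usage note is hypothetical. The endpoint, query parameters, and regex are the ones from the code above.

```python
import re
from urllib.parse import urlencode

SEARCH_ENDPOINT = "http://sou.zhaopin.com/jobs/searchresult.ashx"

# Same pattern as in the article: captures the href of each posting link
JOB_LINK = re.compile(
    'ssidkey=y&amp;ss=201&amp;ff=03" href="(.*?)" target="_blank"', re.S
)


def build_search_url(keyword, page, cities, industries):
    """Assemble the search URL; urlencode escapes ';', '+' and non-ASCII text."""
    params = {
        "in": ";".join(industries),  # industry codes, ';'-separated
        "jl": "+".join(cities),      # city names, '+'-separated
        "kw": keyword,               # job keyword
        "p": page,                   # page number
        "isadv": 0,
    }
    return SEARCH_ENDPOINT + "?" + urlencode(params)


def extract_job_urls(html):
    """Return every posting URL found in one search result page."""
    return JOB_LINK.findall(html)
```

For example, `build_search_url("通訊工程師", 0, ["上海", "杭州"], ["210500", "160400"])` produces a URL whose `in` parameter reads `210500%3B160400`, matching the hand-written form used above.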

(Example) A posting URL on Zhilian Zhaopin, as shown in the figure below:

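2. Structuring the data

With the URLs in hand, request each posting page through its URL and use XPath or regular expressions to turn the page into structured fields. The code is as follows: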

from lxml import etree

# Common XPath prefix of the list that holds salary, location, experience, etc.
info = '//div[@class="terminalpage clearfix"]/div[@class="terminalpage-left"]/ul[@class="terminal-ul clearfix"]'

for x in url_job:
    print(x)
    d = requests.post(x)  # request the posting page
    zhiwei = d.text  # page source
    selector = etree.HTML(zhiwei)  # parse the HTML
    name = selector.xpath('//div[@class="inner-left fl"]/h1/text()')  # job title
    mone = selector.xpath(info + '/li[1]/strong/text()')  # monthly salary
    adress = selector.xpath(info + '/li[2]/strong/a/text()')  # work location
    exp = selector.xpath(info + '/li[5]/strong/text()')  # required experience
    education = selector.xpath(info + '/li[6]/strong/text()')  # minimum education
    zhiweileibie = selector.xpath(info + '/li[8]/strong/a/text()')  # job category

    # The job description sits between two marker comments in the page source
    match = re.compile('<!-- SWSStringCutStart -->(.*?)<!-- SWSStringCutEnd -->', re.S)
    description = re.findall(match, zhiwei)
    try:  # a missing field (for example, past the last posting) skips this URL
        des = filter_tags(description[0])  # filter_tags is explained below
        des = des.replace('&nbsp;', '').strip()
        # to_str is a helper from the original project; it is not shown in this article
        name = to_str(name[0])
        mone = to_str(mone[0])
        adress = to_str(adress[0])
        exp = to_str(exp[0])
        education = to_str(education[0])
        zhiweileibie = to_str(zhiweileibie[0])
        des = to_str(des)
    except Exception as e:
        continue
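The loop above indexes `[0]` on every XPath result and relies on the `try`/`except` to skip postings where a field is missing. One way to make that rule explicit is a small helper that returns the first non-empty result, plus a function that rejects incomplete records. This is only a sketch: `first_text` and `build_record` are hypothetical names, and the helper works on plain lists of strings, which is the shape `selector.xpath(...)` returns for `text()` queries.

```python
REQUIRED_FIELDS = ("name", "mone", "adress", "exp", "education", "zhiweileibie")


def first_text(results, default=""):
    """Return the first non-blank result, stripped, or `default` if there is none."""
    for item in results:
        text = str(item).strip()
        if text:
            return text
    return default


def build_record(raw_fields):
    """Turn {field: xpath_results} into {field: str}; None if any field is empty."""
    record = {key: first_text(raw_fields.get(key, [])) for key in REQUIRED_FIELDS}
    if not all(record.values()):
        return None  # incomplete posting: the caller can skip it
    return record
```

Unlike a bare `name[0]`, the helper also skips whitespace-only results, so a stray blank text node does not end up stored as the field value.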

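The code above calls filter_tags. This function strips CDATA sections, scripts, styles, HTML tags, and comments from the markup and collapses blank lines; the entity-replacement step is commented out in this version. Its code is as follows: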

def filter_tags(htmlstr):
    # Patterns for the pieces to strip (raw strings keep the backslashes intact)
    re_cdata = re.compile(r'//<!\[CDATA\[[^>]*//\]\]>', re.I)  # CDATA
    re_script = re.compile(r'<\s*script[^>]*>[^<]*<\s*/\s*script\s*>', re.I)  # script
    re_style = re.compile(r'<\s*style[^>]*>[^<]*<\s*/\s*style\s*>', re.I)  # style
    re_br = re.compile(r'<br\s*?/?>')  # line breaks
    re_h = re.compile(r'</?\w+[^>]*>')  # HTML tags
    re_comment = re.compile(r'<!--[^>]*-->')  # HTML comments
    s = re_cdata.sub('', htmlstr)  # remove CDATA
    s = re_script.sub('', s)  # remove scripts
    s = re_style.sub('', s)  # remove styles
    # s = re_br.sub('\n', s)  # convert <br> to newlines
    s = re_h.sub('', s)  # remove HTML tags
    s = re_comment.sub('', s)  # remove HTML comments
    # Collapse runs of newlines into one
    blank_line = re.compile(r'\n+')
    s = blank_line.sub('\n', s)
    # s = replaceCharEntity(s)  # replace entities
    return s
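Because the `replaceCharEntity` call is commented out, entities such as `&amp;` survive in the output, and the calling code only removes `&nbsp;` by hand. A possible way to fill that gap, assuming Python 3, is the standard library's `html.unescape`. The function name `replace_entities` below is made up for this sketch and is not the author's `replaceCharEntity`.

```python
import html
import re


def replace_entities(text):
    """Decode HTML entities in tag-free text and tidy the whitespace."""
    text = html.unescape(text)               # "&amp;" -> "&", "&nbsp;" -> "\xa0"
    text = text.replace("\xa0", " ")         # treat non-breaking spaces as plain spaces
    text = re.sub(r"[ \t]+", " ", text)      # collapse runs of spaces and tabs
    text = re.sub(r"\n\s*\n+", "\n", text)   # drop blank lines
    return text.strip()
```

It is meant to run on the output of `filter_tags`, after the tags are gone, so that a decoded `&lt;` is not mistaken for the start of a tag.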

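3. Storing the data in the database

The code above has already cleaned the messy data into the fields defined in the database table, so all that remains is to store each structured record in the database inside the loop.

The code is as follows: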

import pymysql

# Open the connection once, before the loop
conn = pymysql.connect(host='127.0.0.1', user='root', passwd='××××××', db='zhiye_data', port=3306, charset='utf8')
cursor = conn.cursor()

# Parameterized statement: the driver quotes and escapes each value,
# so a quote character inside a description cannot break the SQL
sql = ('INSERT INTO `main_data_3` (`name`,`mone`,`adress`,`exp`,`education`,'
       '`zhiweileibie`,`description`,`leibie`,`company_range`,`company_kind`) '
       "VALUES (%s,%s,%s,%s,%s,%s,%s,%s,'a','b');")

    # Inside the `for x in url_job:` loop of step 2, after the fields are cleaned:
    try:
        cursor.execute(sql, (name, mone, adress, exp, education, zhiweileibie, des, leibie))
        conn.commit()
        print(cursor.rowcount)  # number of rows inserted
    except Exception as e:
        print(e)

# After the loop finishes
cursor.close()
conn.close()
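The article does not show the table definition, so the following is only a sketch of what the storage step could look like end to end. It uses the standard library's `sqlite3` as a stand-in for MySQL so that it runs without a server; the column types (all `TEXT`), the `id` column, and the function names `create_table` and `save_record` are assumptions, and the two placeholder columns `company_range` and `company_kind` are left out. Note that `sqlite3` uses `?` placeholders, whereas pymysql uses `%s`.

```python
import sqlite3

COLUMNS = ("name", "mone", "adress", "exp", "education",
           "zhiweileibie", "description", "leibie")


def create_table(conn):
    """Create the table if needed; every column is TEXT in this sketch."""
    column_defs = ", ".join(f'"{col}" TEXT' for col in COLUMNS)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS main_data_3 "
        f"(id INTEGER PRIMARY KEY AUTOINCREMENT, {column_defs})"
    )


def save_record(conn, record):
    """Insert one record with bound parameters; returns the inserted row count."""
    names = ", ".join(f'"{col}"' for col in COLUMNS)
    placeholders = ", ".join("?" for _ in COLUMNS)
    sql = f"INSERT INTO main_data_3 ({names}) VALUES ({placeholders})"
    cur = conn.execute(sql, tuple(record[col] for col in COLUMNS))
    conn.commit()
    return cur.rowcount
```

Binding the values instead of concatenating them into the SQL string means a description containing quotes or semicolons is stored verbatim rather than breaking the statement.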

A sample of the data stored in the database is shown in the figure below:

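4. Data statistics

The first step is to segment the job descriptions into words and count them, so that the skills the job requires can be read off the counts.

For the segmentation I started with the word-segmentation service of SAE (Sina App Engine). Sample code (PHP) follows, for reference only: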

    public function get()
    {
        $h = D('hotword');
        $data = $h->get_des(); // load the stored job descriptions

        foreach ($data as $k => $v) {
            // Send each description to the SAE segmentation service
            $content = POST("http://segment.sae.sina.com.cn/urlclient.php?encoding=UTF-8&word_tag=1","context=".$v['description']);
            $text = json_decode($content,true);
            if (empty($text[0]['word_tag'])) {
                exit; // stop when the service returns no tagged words
            }
            $sta = $h->hotword_save($text); // save the segmented words
            dump($sta);
        }
    }
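The counting half of this step can be sketched locally in Python. The sketch below is not the SAE-based approach above and does not segment Chinese text at all: it only counts Latin-script terms such as protocol and tool names (for example `LTE` or `TCP/IP`), which often carry the skill requirements in these postings. The function name `count_skill_terms` and the token pattern are assumptions made for illustration.

```python
import re
from collections import Counter

# A term starts with an ASCII letter or digit and may contain + # . / -
TOKEN = re.compile(r"[A-Za-z0-9][A-Za-z0-9+#./-]*")
HAS_LETTER = re.compile(r"[A-Za-z]")


def count_skill_terms(descriptions, top_n=10):
    """Count in how many descriptions each Latin-script term appears."""
    counts = Counter()
    for text in descriptions:
        terms = set()  # count each term once per posting
        for m in TOKEN.finditer(text):
            term = m.group(0).rstrip(".-/").upper()
            if HAS_LETTER.search(term):  # skip pure numbers such as "3-5"
                terms.add(term)
        counts.update(terms)
    return counts.most_common(top_n)
```

Counting each term once per posting keeps one long, repetitive ad from dominating the ranking; to rank Chinese skill words as well, the descriptions would first need to go through a real segmenter such as the service used above.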