What skills do NLP engineers on the job market actually need?

Date: 2019-11-29

Tags: job market, natural language processing, engineer

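Taking a search for "nlp" on Boss Zhipin (https://www.zhipin.com/) as an example, we scrape the listings and explore what the market wants from NLP talent.

The code is at https://github.com/sdu2011/nlp. With minor changes you can use it to see which skills are required in your own region and for your target position.

We use NLP positions in the Nanjing area as the example.

Scrape the job list: get the employer information and the URL of each job detail page.
Scrape each job detail page, parse it, then segment the text, count words, extract keywords, and so on.
Visualize: present the results graphically with seaborn, wordcloud, etc.

Data scraping and cleaning

I won't say much about this part. You mainly need some web-scraping knowledge and familiarity with the HTML parsing library BeautifulSoup. This step involves a lot of "dirty work": analyzing the format of the various HTML tags, stripping spaces, extracting the information under each tag, and similar data cleaning.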

import bs4
import requests
from bs4 import BeautifulSoup

# Search results page for "nlp" jobs (the Nanjing search used in this article)
current_url = "https://www.zhipin.com/job_detail/?query=nlp&scity=101190100&industry=&position="
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36',
    'Connection': 'keep-alive',
}
r = requests.get(current_url, headers=headers)

def parse_job_detail(url):
    """Return (requirements, responsibilities) text lines from a job detail page."""
    r = requests.get(url, headers=headers)
    bs = BeautifulSoup(r.text, "html.parser")
    h3 = bs.find("h3", string="職位描述")
    if h3 is None:
        # The page has no job description heading; nothing to parse
        return ([], [])
    div_tag = h3.find_next_sibling("div")
    requirements = []      # job requirements
    responsibilities = []  # job responsibilities
    require_flag = False
    responsibility_flag = False
    for c in div_tag.children:
        # Only bare text nodes are inspected; tags are skipped
        if isinstance(c, bs4.element.NavigableString):
            str_no_space = c.string.replace(" ", "")

            # Use a substring test rather than ==: the whitespace after the
            # full-width colon "：" is not removed by replace(" ", "")
            if "任職要求" in str_no_space:
                responsibility_flag = False
                require_flag = True
                continue

            if "崗位職責" in str_no_space:
                require_flag = False
                responsibility_flag = True
                continue

            # Collect the line under whichever heading was seen last
            if require_flag:
                requirements.append(str_no_space)
            if responsibility_flag:
                responsibilities.append(str_no_space)

    return (requirements, responsibilities)

#parse_job_detail("https://www.zhipin.com/job_detail/84e81e27c933269e1Xxz3dq1E1I~.html?ka=search_list_1")
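
The comment in parse_job_detail notes that replace(" ", "") leaves some whitespace behind after the full-width colon. One possible way around that, sketched here with only the standard library, is to strip every Unicode whitespace character with a regular expression before comparing. The helper names strip_all_whitespace and section_of are made up for this illustration and are not part of the original repository.

```python
import re


def strip_all_whitespace(text):
    """Remove every Unicode whitespace character, including U+3000 and U+00A0."""
    return re.sub(r"\s+", "", text)


def section_of(line):
    """Return 'requirements', 'responsibilities', or None for a non-heading line."""
    # Drop whitespace first, then any trailing ASCII or full-width colon
    cleaned = strip_all_whitespace(line).rstrip(":：")
    if cleaned == "任職要求":
        return "requirements"
    if cleaned == "崗位職責":
        return "responsibilities"
    return None


print(section_of("任職要求：\u3000"))   # requirements
print(section_of(" 崗位職責:\xa0"))     # responsibilities
print(section_of("熟悉 Python"))        # None
```

With the whitespace fully removed, an exact comparison becomes safe, so a requirement line that merely mentions one of the heading words is no longer mistaken for a heading.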

from urllib.parse import urljoin

def get_jobs_info(url):
    """Scrape one search-results page into a list of per-job field lists."""
    r = requests.get(url, headers=headers)
    bs = BeautifulSoup(r.text, "html.parser")
    jobs = []
    for job in bs.find_all("div", class_="job-primary"):
        one_job = []
        for child in job.descendants:
            # Skip text nodes and non-div tags
            if not isinstance(child, bs4.element.Tag) or child.name != 'div':
                continue
            # .get() avoids a KeyError on divs that have no class attribute
            css_class = child.get('class')
            if css_class == ['info-primary']:
                detail_url = urljoin("https://www.zhipin.com/", child.h3.a.get('href'))
                jobdetails = parse_job_detail(detail_url)
                one_job.append(child.h3.div.text)   # title
                one_job.append(child.h3.span.text)  # salary
                one_job.append(jobdetails[0])       # requirements
                one_job.append(jobdetails[1])       # responsibilities
                # Children 0, 2 and 4 of <p>: district, experience, education
                for index, c in enumerate(child.p):
                    if index in (0, 2, 4):
                        one_job.append(c)
            elif css_class == ['info-company']:
                # Children 0, 2 and 4 of <p>: industry,
                # funding stage (round A/B/C/D or listed), company size
                for index, c in enumerate(child.p):
                    if index in (0, 2, 4):
                        one_job.append(c)
        jobs.append(one_job)

    return jobs

jobs_info = get_jobs_info("https://www.zhipin.com/job_detail/?query=nlp&scity=101190100&industry=&position=")

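With the data scraped, we can move on to exploring it.

Data exploration

This part requires some familiarity with pandas and seaborn.

Now we can use seaborn to visualize the data.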

# Fix Chinese font rendering in seaborn plots
import seaborn as sns
from matplotlib.font_manager import FontProperties
myfont = FontProperties(fname=r'C:\Windows\Fonts\simhei.ttf', size=14)
sns.set(font=myfont.get_name())

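Education

First, the education requirements. (It pains me to say this: why did I give up on graduate school back then? I could slap myself! It is the direct reason that nearly half of these positions are out of reach before I even get through the door.)

As the chart shows, NLP engineer positions set a fairly high bar for education: nearly 40% of listings ask for a master's degree. In practice the figure is closer to 50%, because some positions are labeled "bachelor's" on the search page while the job details go on to require a master's.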
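
The gap between the search-page label and the detail text can be checked mechanically. The following is a minimal sketch, assuming each listing is available as a (label, requirement lines) pair; the function names, the hint words, and the toy listings are illustrative only and are not taken from the scraped data.

```python
from collections import Counter

# Words that signal a master's requirement in the detail text (an assumption)
MASTER_HINTS = ("碩士", "研究生")


def effective_education(label, requirement_lines):
    """Upgrade a bachelor's label when the detail text asks for a master's."""
    text = "".join(requirement_lines)
    if label == "本科" and any(hint in text for hint in MASTER_HINTS):
        return "碩士"
    return label


def education_shares(jobs):
    """Map each education level to its share of listings (0..1)."""
    counts = Counter(effective_education(label, req) for label, req in jobs)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {level: n / total for level, n in counts.items()}


# Toy listings for illustration only
jobs = [
    ("碩士", ["熟悉深度學習"]),
    ("本科", ["要求碩士及以上學歷"]),  # labeled bachelor's, details say master's
    ("本科", ["熟悉python"]),
    ("大專", ["有相關經驗"]),
]
print(education_shares(jobs))
```

On the toy input, one of the two bachelor's listings is reclassified, which moves the master's share from a quarter to a half. A keyword match like this will also catch phrases such as "master's preferred", so the result is an upper estimate.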

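So, if you want to work in NLP and can do a graduate degree, do it. Do it. Do it. Important things are worth saying three times.

Experience

About three years of experience is the most sought after. This matches common sense: NLP only took off in recent years along with deep learning, so there aren't many practitioners with long track records. And it isn't just NLP; in other roles too, 3/5/8 years of experience is the sweet spot, because by then you are a proficient hand at that level.

Education + experience

A similar conclusion: at every education level, demand is highest for around three years of experience.

Company size

First, companies with more than 1,000 employees hire relatively more, which is easy to understand. More surprising is that small and mid-sized companies with 100-499 employees also have relatively many openings. This is probably because the rise of deep learning in recent years has produced many Series A/B startups with AI-related business.

Checking the guess above: among companies with 100-499 employees, Series A/B companies are indeed the most common.

Location

Excluding listings that don't name a specific district, the Tiexinqiao area of Yuhuatai has the most demand, because that is "the center of the universe": a large number of tech companies and programmers are concentrated around Software Avenue. Jiangning District also has relatively many positions.

Salary

Most listings fall in the 15-30k range.

Let's also look at the mean monthly salary. Here we add a new column, "平均月薪" (average monthly salary).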

# Process the monthly salary data
def f(s):
    """Return the midpoint of a salary range such as "17k-18k"."""
    bounds = s.replace('k', '').split('-')
    tmp = [int(e) for e in bounds]
    return sum(tmp) / len(tmp)

df["平均月薪"] = df["月薪"].apply(f)

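The listings around 4k are basically internships. Seen this way, the average is 22k, which looks tempting, doesn't it? But when an employer says 15-30k, it usually means 15k... So let's process the data once more, this time taking the lower bound, and plot it.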

def min_salary(s):
    """Return the lower bound of a salary range such as "17k-18k"."""
    bounds = s.replace('k', '').split('-')
    return int(bounds[0])

df["最低月薪"] = df["月薪"].apply(min_salary)
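
Both helpers above expect strings shaped exactly like "17k-18k". As a hedged sketch, a single regular expression could tolerate a few variations. The extra formats below (an uppercase "K", a trailing bonus suffix such as "·13薪", a non-numeric "面議") are assumptions about what a listing might contain, not formats confirmed in the scraped data, and parse_salary_range and salary_stats are hypothetical names.

```python
import re

# Two numeric bounds separated by "-", with an optional k/K after the first
# bound and a required k/K after the second
_RANGE = re.compile(r"(\d+)\s*[kK]?\s*-\s*(\d+)\s*[kK]")


def parse_salary_range(text):
    """Return (low, high) in thousands, or None when no range is found."""
    match = _RANGE.search(text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def salary_stats(text):
    """Return (minimum, midpoint) of a salary range, or None if unparseable."""
    bounds = parse_salary_range(text)
    if bounds is None:
        return None
    low, high = bounds
    return low, (low + high) / 2


for sample in ["17k-18k", "15K-30K", "15-30K·13薪", "面議"]:
    print(sample, salary_stats(sample))
```

Returning None for unparseable strings lets the caller decide whether to drop those rows, instead of the whole apply call failing on one unexpected value.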

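Next, the relationship between education and monthly salary at companies of different sizes. Small companies have more bachelor's-level positions; at large companies the share of master's-level positions rises.

What do NLP engineers need to know?

The earlier code already scraped the job requirements and job responsibilities; now we use jieba to segment them into words.

Remove stopwords.
Add custom words to the dictionary. For example, we want '機器學習' (machine learning) to be treated as one word rather than as the two words '機器' and '學習'.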

# Load the HIT stopword list, then add a few custom entries
with open("./詞表/哈工大停用詞表.txt", encoding='utf-8') as stopword_file:
    stopword_list = [line.strip() for line in stopword_file]
self_defined_list = ['1','2','3','4','5','6','以上學歷','關於','\n']
stopword_list.extend(self_defined_list)
print(stopword_list)

import jieba
import jieba.analyse

def add_self_defined_words():
    # Keep these terms as single tokens instead of letting jieba split them
    jieba.add_word('機器學習')
    jieba.add_word('深度學習')

def get_words(serie):
    """Segment a column of text lists; return (filtered tokens, top-20 keywords)."""
    # Each cell holds a list of lines, so join it into one string first.
    # Filtering with re.sub(r'[^\u4e00-\u9fa5]', '', s) is deliberately avoided:
    # it would also drop non-Chinese words (https://github.com/fxsjy/jieba/issues/528)
    clean_contents = [''.join(s) for s in serie]
    text = ''.join(clean_contents)

    add_self_defined_words()
    word_list = [word for word in jieba.cut(text, cut_all=False) if word not in stopword_list]
    print(word_list)

    tags = jieba.analyse.extract_tags(text, topK=20)
    print(tags)

    return word_list, tags

require_word_list, require_tags = get_words(df["任職要求"])
responsibility_word_list, responsibility_tags = get_words(df["崗位職責"])

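jieba's top-k keyword extraction is based on TF-IDF, not raw term frequency, so treat it as a reference only. Here we actually care more about term frequency.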
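
Since raw frequency is what matters here, a plain counter over the segmented tokens is enough. This is a minimal sketch using only the standard library; top_terms is a hypothetical helper, and the token list is a toy example rather than output from the scraped listings.

```python
from collections import Counter


def top_terms(words, stopwords=(), k=10):
    """Return the k most frequent words that are not stopwords."""
    stop = set(stopwords)
    # Skip whitespace-only tokens and stopwords before counting
    counts = Counter(w for w in words if w.strip() and w not in stop)
    return counts.most_common(k)


# Toy token list for illustration only
tokens = ["經驗", "python", "深度學習", "經驗", "的", "python", "經驗"]
print(top_terms(tokens, stopwords=["的"], k=2))  # [('經驗', 3), ('python', 2)]
```

Unlike TF-IDF, this ranking does not discount words that are common everywhere, which is why the stopword filter matters more here.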

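Experience is clearly very important, whether it is relevant work experience or research experience.

Likewise, let's look at the job responsibilities.

To explore the frequency of specific words, I wrote the function count_specific_word. It accounts for equivalent words: Python and python, for example, mean the same thing.

Framework names are usually in English, such as tensorflow or hadoop, so I wrote the function get_englishword_list to collect these English words. The results are as follows:

We can see that tensorflow is the most commonly required deep learning framework.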

import numpy as np
import pandas as pd

# Assumption: the token list from get_words, wrapped in a Series
require_word_series = pd.Series(require_word_list)

# Count how many tokens equal any of the given words
def count_specific_word(serie, word_list):
    index = np.zeros(len(serie), dtype=bool)
    for w in word_list:
        # Element-wise OR of two boolean arrays; plain `or` does not work here
        index = np.logical_or(index, (serie == w).to_numpy())

    return int(index.sum())

print(count_specific_word(require_word_series,["TensorFlow"]))
print(count_specific_word(require_word_series,["經驗"]))
print(count_specific_word(require_word_series,["研究生","碩士"]))
print(count_specific_word(require_word_series,["深度學習","機器學習"]))
print(count_specific_word(require_word_series,["python","Python"]))

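import re
import pandas as pd

# Strip the Chinese characters, keeping English terms such as framework names
def get_englishword_list():
    l = []
    for w in require_word_list:
        w = re.sub(r'[\u4e00-\u9fa5\n]', '', w)
        if w:
            # Lowercase so that "Python" and "python" are counted together
            l.append(w.lower())

    print(l)
    return l

enw_serie = pd.Series(get_englishword_list())
enw_serie.value_counts()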
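
Lowercasing handles Python versus python, but not variants that are spelled differently. One possible extension, sketched below under the assumption that a small hand-written alias table is acceptable, maps each variant to a canonical name before counting. The ALIASES entries and the function names are illustrative, not part of the original code.

```python
from collections import Counter

# Hypothetical alias table: each variant maps to one canonical name
ALIASES = {"tf": "tensorflow", "研究生": "碩士"}


def canonical(word):
    """Lowercase a token and replace it with its canonical name, if any."""
    lowered = word.strip().lower()
    return ALIASES.get(lowered, lowered)


def count_canonical(words):
    """Count tokens after alias normalization, ignoring blank tokens."""
    return Counter(canonical(w) for w in words if w.strip())


# Toy token list for illustration only
words = ["Python", "python", "TensorFlow", "tf", "研究生", "碩士", "\n"]
counts = count_canonical(words)
print(counts["python"], counts["tensorflow"], counts["碩士"])  # 2 2 2
```

This does the same job as passing several spellings to count_specific_word, but the equivalences live in one table instead of being repeated at every call.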

For a more intuitive impression, we draw the keywords as a word cloud.

import matplotlib.pyplot as plt
from wordcloud import WordCloud

def picture(wordlist):
    # A font with Chinese glyphs; raw string keeps the backslashes literal
    font = r'C:\Windows\Fonts\simhei.ttf'
    wc = WordCloud(background_color="white", font_path=font, max_words=2000)
    wc.generate(" ".join(wordlist))

    plt.figure()
    plt.imshow(wc, interpolation="bilinear")
    plt.axis("off")
    plt.show()

picture(require_word_list)
picture(responsibility_word_list)

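The word cloud drawn from the two groups of words combined looks like this:

Summary: current NLP engineer openings almost all require work experience or research experience. (In practice this means that if you did not gain NLP experience as a student, the job is basically out of reach. Once you are working, the lack of relevant experience makes it hard to break into the field, and without breaking into the field it is hard to build experience. An endless loop.) You need to know python/java and the like, you need to understand deep learning, and ideally you can use a framework such as tensorflow. The specific NLP skills involved include data mining, text segmentation and classification, entity extraction, and knowledge graph construction.

Finally, if any readers are in the Nanjing area, I'd appreciate a referral....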