02-NLP-03-LDA主題模型應用

時間 2019-11-13

標籤 nlp lda 主題模型應用简体版

原文原文鏈接

LDA模型應用：一眼看穿希拉里的郵件

咱們拿到希拉里泄露的郵件，跑一把LDA，看看她平時都在聊什麼。html

但願經過這樣一個LDA模型將她全部的郵件進行分類，從而只須要從這些類中取出。python

利用gensim中包含的LDA模型。正則表達式

首先，導入咱們須要的一些庫數組

In [1]：

import numpy as np
import pandas as pd
import re

而後，把希婆的郵件讀取進來。app

這裏咱們用pandas。不熟悉pandas的朋友，能夠用python標準庫csvide

pandas比csv檔次更高。函數

In [2]:

df = pd.read_csv("../input/HillaryEmails.csv")
# 原郵件數據中有不少Nan的值，直接扔了。
df = df[['Id','ExtractedBodyText']].dropna()

文本預處理：

上過我其餘NLP課程的同窗都知道，文本預處理這個東西，對NLP是很重要的。oop

咱們這裏，針對郵件內容，寫一組正則表達式：學習

（不熟悉正則表達式的同窗，直接百度關鍵詞，能夠看到一大張Regex規則表）ui

In [3]:

將全部-，*等特殊的分隔符都換成空格，目的是爲了可以在後續直接所有利用空格劃分開。

也能夠自行添加一個正則表達式去掉全部帶點的。去掉全部沒有意義的標點符號（與虛詞同樣）對語義的影響不大。

 
      def clean_email_text(text):
    text = text.replace('\n'," ") #新行，咱們是不須要的
    text = re.sub(r"-", " ", text) #把 "-" 的兩個單詞，分開。（好比：july-edu ==> july edu）
    text = re.sub(r"\d+/\d+/\d+", "", text) #日期，對主體模型沒什麼意義，直接變空，去掉
    text = re.sub(r"[0-2]?[0-9]:[0-6][0-9]", "", text) #時間，沒意義
    text = re.sub(r"[\w]+@[\.\w]+", "", text) #郵件地址，沒意義
    text = re.sub(r"/[a-zA-Z]*[:\//\]*[A-Za-z0-9\-_]+\.+[A-Za-z0-9\.\/%&=\?\-_]+/i", "", text) 
    #通用的網址，沒意義：https://www.jcnvdk.com
    pure_text = ''
    # 以防還有其餘特殊字符（數字）等等，咱們直接把他們loop一遍，過濾掉。保證文章中該有的都是統一的形式
    for letter in text:
        # 只留下字母和空格
        if letter.isalpha() or letter==' ':
            pure_text += letter
    # 再把那些去除特殊字符後落單的單詞，直接排除。
    # 咱們就只剩下有意義的單詞了。
    text = ' '.join(word for word in pure_text.split() if len(word)>1)
    return text 
     

好的，如今咱們新建一個colum，並把咱們的方法跑一遍：

In [4]:

 
      docs = df['ExtractedBodyText']
docs = docs.apply(lambda s: clean_email_text(s))     #調用上面自定義的預處理函數

好，來看看長相：

In [5]:

docs.head(1).values

Out[5]:

array([ 'Thursday March PM Latest How Syria is aiding Qaddafi and more Sid hrc memo syria aiding libya docx hrc memo syria aiding libya docx March For Hillary'], dtype=object)

咱們直接把全部的郵件內容拿出來。

In [6]:

doclist = docs.values

LDA模型構建：注意預處理對數據的處理應該和模型應該一致。

好，咱們用Gensim來作一次模型構建

首先，咱們得把咱們剛剛整出來的一大波文本數據：二維數組，每一個元素都是一條字符串數據

[[一條郵件字符串]，[另外一條郵件字符串], ...]

轉化成Gensim承認的語料庫形式：二維數組，每一個字符串元素須要採用分詞技術（如jieba）將他們拆分紅一個個詞

[[一，條，郵件，在，這裏],[第，二，條，郵件，在，這裏],[今天，天氣，腫麼，樣],...]

引入庫：

In [7]:

#corpora語料模型, models本LDA模型, similarities測量類似度的模型
from gensim import corpora, models, similarities
import gensim

爲了免去講解安裝NLTK（主要處理英文處理）等等的麻煩，我這裏直接手寫一下 中止詞(相似中文中的虛詞：的地得等)列表 ：

這些詞在不一樣語境中指代意義徹底不一樣，可是在不一樣主題中的出現機率是幾乎一致的。因此要去除，不然對模型的準確性有影響

In [8]:

 
      stoplist = ['very', 'ourselves', 'am', 'doesn', 'through', 'me', 'against', 'up', 'just', 'her', 'ours', 
            'couldn', 'because', 'is', 'isn', 'it', 'only', 'in', 'such', 'too', 'mustn', 'under', 'their', 
            'if', 'to', 'my', 'himself', 'after', 'why', 'while', 'can', 'each', 'itself', 'his', 'all', 'once', 
            'herself', 'more', 'our', 'they', 'hasn', 'on', 'ma', 'them', 'its', 'where', 'did', 'll', 'you', 
            'didn', 'nor', 'as', 'now', 'before', 'those', 'yours', 'from', 'who', 'was', 'm', 'been', 'will', 
            'into', 'same', 'how', 'some', 'of', 'out', 'with', 's', 'being', 't', 'mightn', 'she', 'again', 'be', 
            'by', 'shan', 'have', 'yourselves', 'needn', 'and', 'are', 'o', 'these', 'further', 'most', 'yourself', 
            'having', 'aren', 'here', 'he', 'were', 'but', 'this', 'myself', 'own', 'we', 'so', 'i', 'does', 'both', 
            'when', 'between', 'd', 'had', 'the', 'y', 'has', 'down', 'off', 'than', 'haven', 'whom', 'wouldn', 
            'should', 've', 'over', 'themselves', 'few', 'then', 'hadn', 'what', 'until', 'won', 'no', 'about', 
            'any', 'that', 'for', 'shouldn', 'don', 'do', 'there', 'doing', 'an', 'or', 'ain', 'hers', 'wasn', 
            'weren', 'above', 'a', 'at', 'your', 'theirs', 'below', 'other', 'not', 're', 'him', 'during', 'which']
  
     

人工分詞：

這裏，英文的分詞，直接就是對着空白處分割就能夠了。

中文的分詞稍微複雜點兒，具體能夠百度：CoreNLP, HaNLP, 結巴分詞，等等

分詞的意義在於，把咱們的長長的字符串原文本，轉化成有意義的小元素：

In [9]:

 
      texts = [[word for word in doc.lower().split() if word not in stoplist] for doc in doclist] 
     

這時候，咱們的texts就是咱們須要的樣子了：

In [10]: 每一個單詞都是一個元素，一塊兒組成一個列表

texts[0]

Out[10]:

['thursday',
 'march',
 'pm',
 'latest',
 'syria',
 'aiding',
 'qaddafi',
 'sid',
 'hrc',
 'memo',
 'syria',
 'aiding',
 'libya',
 'docx',
 'hrc',
 'memo',
 'syria',
 'aiding',
 'libya',
 'docx',
 'march',
 'hillary']

創建語料庫

用tokenlize（標記化）的方法，把每一個單詞用一個數字index指代，並把咱們的原文本變成一條長長的數組：

經過將全部文本texts塞入字典中，獲得一個對照表dictionary。而後將其轉化爲數字。

In [11]:

 
      dictionary = corpora.Dictionary(texts)   #記得在調用的時候傳給id2word，從而在輸出的時候才知道編號和輸出之間的對應關係。設定好須要分紅多少個topic
corpus = [dictionary.doc2bow(text) for text in texts]  #corpus變成詞袋的表達式形式

給大家看一眼：

In [12]:

corpus[13]

# 小括號中第一項表明單詞標記，第二項表示單詞出現次數

Out[12]:

[(36, 1), (505, 1), (506, 1), (507, 1), (508, 1)]

這個列表告訴咱們，第14（從0開始是第一）個郵件中，一共6個有意義的單詞（通過咱們的文本預處理，並去除了中止詞後）

其中，36號單詞出現1次，505號單詞出現1次，以此類推。。。

接着，咱們終於能夠創建模型了：

In [13]:

 
      lda = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=20)  #得出訓練好的模型 
     

咱們能夠看到，第10號分類，其中最常出現的單詞是：

In [14]:

lda.print_topic(10, topn=5)

Out[14]:

'0.007*kurdistan + 0.006*email + 0.006*see + 0.005*us + 0.005*right'

咱們把全部的主題打印出來看看。因爲這是一個非監督的學習，因此實際上是沒法預知最後出來的結果是什麼的。

In [15]:

lda.print_topics(num_topics=20, num_words=5)

Out[15]:

[(0, '0.010*nuclear + 0.006*us + 0.005*american + 0.005*iran + 0.005*also'),
 (1,
  '0.019*labour + 0.016*dialogue + 0.015*doc + 0.015*strategic + 0.014*press'),
 (2, '0.007*mr + 0.007*us + 0.006*would + 0.006*new + 0.004*israel'),
 (3,
  '0.013*israel + 0.011*israeli + 0.011*settlements + 0.007*settlement + 0.006*one'),
 (4, '0.012*us + 0.007*diplomacy + 0.006*state + 0.005*know + 0.005*would'),
 (5, '0.045*call + 0.021*yes + 0.020*thx + 0.010*ops + 0.009*also'),
 (6,
  '0.012*obama + 0.009*percent + 0.008*republican + 0.007*republicans + 0.006*president'),
 (7,
  '0.069*pm + 0.036*office + 0.027*secretarys + 0.021*meeting + 0.020*room'),
 (8, '0.008*would + 0.006*party + 0.006*new + 0.005*said + 0.005*us'),
 (9, '0.007*us + 0.006*would + 0.005*state + 0.005*new + 0.005*netanyahu'),
 (10, '0.007*kurdistan + 0.006*email + 0.006*see + 0.005*us + 0.005*right'),
 (11,
  '0.007*health + 0.007*haitian + 0.006*people + 0.005*would + 0.005*plan'),
 (12, '0.012*see + 0.009*like + 0.009*back + 0.008*im + 0.008*would'),
 (13, '0.009*new + 0.007*fyi + 0.006*draft + 0.006*speech + 0.005*also'),
 (14,
  '0.006*military + 0.006*afghanistan + 0.005*security + 0.005*said + 0.005*government'),
 (15, '0.033*ok + 0.028*pls + 0.023*print + 0.014*call + 0.011*pis'),
 (16,
  '0.015*state + 0.008*sounds + 0.007*us + 0.006*department + 0.005*sorry'),
 (17, '0.053*fyi + 0.012*richards + 0.006*us + 0.005*like + 0.004*defenses'),
 (18, '0.043*pm + 0.021*fw + 0.018*cheryl + 0.015*mills + 0.014*huma'),
 (19, '0.012*clips + 0.007*read + 0.006*tomorrow + 0.006*see + 0.005*send')]

接下來：

經過

lda.get_document_topics(bow)

或者

lda.get_term_topics(word_id)

兩個方法，咱們能夠把新鮮的文本/單詞，分類成20個主題中的一個。

可是注意，咱們這裏的文本和單詞，都必須得通過一樣步驟的文本預處理+詞袋化，也就是說，變成數字表示每一個單詞的形式。

做業：

我這裏有希拉里twitter上的幾條(每一空行是單獨的一條)：

To all the little girls watching...never doubt that you are valuable and powerful & deserving of every chance & opportunity in the world.

I was greeted by this heartwarming display on the corner of my street today. Thank you to all of you who did this. Happy Thanksgiving. -H

Hoping everyone has a safe & Happy Thanksgiving today, & quality time with family & friends. -H

Scripture tells us: Let us not grow weary in doing good, for in due season, we shall reap, if we do not lose heart.

Let us have faith in each other. Let us not grow weary. Let us not lose heart. For there are more seasons to come and...more work to do

We have still have not shattered that highest and hardest glass ceiling. But some day, someone will

To Barack and Michelle Obama, our country owes you an enormous debt of gratitude. We thank you for your graceful, determined leadership

Our constitutional democracy demands our participation, not just every four years, but all the time

You represent the best of America, and being your candidate has been one of the greatest honors of my life

Last night I congratulated Donald Trump and offered to work with him on behalf of our country

Already voted? That's great! Now help Hillary win by signing up to make calls now

It's Election Day! Millions of Americans have cast their votes for Hillary—join them and confirm where you vote

We don’t want to shrink the vision of this country. We want to keep expanding it

We have a chance to elect a 45th president who will build on our progress, who will finish the job

I love our country, and I believe in our people, and I will never, ever quit on you. No matter what

各位同窗請使用訓練好的LDA模型，判斷每句話各自屬於哪一個topic

相關標籤/搜索

每日一句

每一个你不满意的现在，都有一个你没有努力的曾经。