Pointwise Mutual Information (PMI) is a metric for measuring the correlation between two events, such as two words.

From probability theory we know that if x and y are independent, then p(x, y) = p(x)p(y). The more correlated the two are, the larger p(x, y) is relative to p(x)p(y). PMI quantifies exactly this ratio:

$$\mathrm{PMI}(x;y)=\log\frac{p(x,y)}{p(x)\,p(y)}=\log\frac{p(x\mid y)}{p(x)}=\log\frac{p(y\mid x)}{p(y)}$$

The last form may be the easiest to read: the conditional probability p(x|y) of x appearing given that y appears, divided by the base probability p(x) of x appearing on its own, naturally expresses how strongly x is related to y.
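To make the ratio concrete, here is a minimal sketch in plain Python (the function name and the probabilities are ours, chosen purely for illustration), using log base 2 as is common in NLP. Note that independent events come out to a PMI of exactly 0:

```python
import math

def pmi(p_xy, p_x, p_y):
    """Pointwise mutual information, log base 2."""
    return math.log2(p_xy / (p_x * p_y))

# Independent events: p(x, y) = p(x) * p(y)  ->  PMI = 0
print(pmi(0.01, 0.1, 0.1))   # 0.0
# Positively correlated: p(x, y) > p(x) * p(y)  ->  PMI > 0
print(pmi(0.05, 0.1, 0.1))   # ~2.32
```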
Example:
Take an example from natural language processing: suppose we want to measure the polarity of the word like (positive or negative sentiment). We can pre-select some clearly positive words, such as good, and then compute the PMI between like and good:

$$\mathrm{PMI}(like;good)=\log\frac{p(like,good)}{p(like)\,p(good)}$$

where p(like, good) is the probability that like and good co-occur (for example, appear in the same sentence or within a fixed window), and p(like) and p(good) are the probabilities of each word appearing on its own. The larger the PMI, the more often like shows up together with a positive word, which suggests a positive polarity.
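As a quick, purely illustrative calculation (the counts below are made up, not taken from any real corpus): suppose that in 1,000 sentences, like appears in 100, good in 50, and the two appear together in 20:

```python
import math

# Hypothetical corpus statistics (illustrative only)
n_sentences = 1000
p_like = 100 / n_sentences        # p(like) = 0.1
p_good = 50 / n_sentences         # p(good) = 0.05
p_like_good = 20 / n_sentences    # p(like, good) = 0.02

# PMI(like; good) = log2( p(like, good) / (p(like) * p(good)) )
print(math.log2(p_like_good / (p_like * p_good)))  # 2.0
```

A PMI of 2 means like and good co-occur four times more often than chance would predict, which is evidence for a positive polarity.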
Here is a PMI implementation found on Stack Overflow:
```python
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder
from nltk.tokenize import word_tokenize

text = "this is a foo bar bar black sheep foo bar bar black sheep foo bar bar black sheep shep bar bar black sentence"
words = word_tokenize(text)

bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(words)
for row in finder.score_ngrams(bigram_measures.pmi):
    print(row)
```
```
(('is', 'a'), 4.523561956057013)
(('this', 'is'), 4.523561956057013)
(('a', 'foo'), 2.938599455335857)
(('sheep', 'shep'), 2.938599455335857)
(('black', 'sheep'), 2.5235619560570135)
(('black', 'sentence'), 2.523561956057013)
(('sheep', 'foo'), 2.3536369546147005)
(('bar', 'black'), 1.523561956057013)
(('foo', 'bar'), 1.523561956057013)
(('shep', 'bar'), 1.523561956057013)
(('bar', 'bar'), 0.5235619560570131)
```
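These scores are easy to sanity-check by hand. NLTK's bigram PMI is $\log_2\frac{c(x,y)\,N}{c(x)\,c(y)}$, where $c(\cdot)$ are counts and N is the total number of tokens. The sample text has 23 tokens, and this and is each appear exactly once, always together:

```python
import math

# NLTK bigram PMI: log2( count(x, y) * N / (count(x) * count(y)) )
n_tokens = 23          # tokens in the sample sentence
c_this, c_is = 1, 1    # unigram counts
c_this_is = 1          # bigram count of ('this', 'is')

print(math.log2(c_this_is * n_tokens / (c_this * c_is)))  # 4.523561956057013
```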
Now let's write a complete program that implements the following:

- Read data from txt, csv, xls, or xlsx files (for csv/excel data, the text is stored in a given column)
- Tokenize the text; lowercase and stem English words; remove stopwords
- Compute the PMI of every word pair under a bigram model
Complete code
```python
import re
import csv
import jieba
import nltk
import pandas as pd
from nltk.corpus import stopwords
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder


def chinese(text):
    """Process Chinese data and save the PMI scores to "中文pmi計算.csv"."""
    # Keep only Chinese characters, then segment with jieba
    content = ''.join(re.findall(r'[\u4e00-\u9fa5]+', text))
    words = [w for w in jieba.cut(content) if len(w) > 1]
    bigram_measures = BigramAssocMeasures()
    finder = BigramCollocationFinder.from_words(words)
    with open('中文pmi計算.csv', 'a+', encoding='gbk', newline='') as csvf:
        writer = csv.writer(csvf)
        writer.writerow(('word1', 'word2', 'pmi_score'))
        for row in finder.score_ngrams(bigram_measures.pmi):
            data = (*row[0], row[1])
            try:
                writer.writerow(data)
            except UnicodeEncodeError:
                # Skip rows that cannot be encoded as gbk
                pass


def english(text):
    """Process English data and save the PMI scores to "english_pmi_computer.csv"."""
    stopwordss = set(stopwords.words('english'))
    stemmer = nltk.stem.snowball.SnowballStemmer('english')
    tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')
    words = tokenizer.tokenize(text)
    words = [w for w in words if not w.isnumeric()]   # drop pure numbers
    words = [w.lower() for w in words]                # lowercase
    words = [stemmer.stem(w) for w in words]          # stem
    words = [w for w in words if w not in stopwordss] # remove stopwords
    bigram_measures = BigramAssocMeasures()
    finder = BigramCollocationFinder.from_words(words)
    with open('english_pmi_computer.csv', 'a+', encoding='gbk', newline='') as csvf:
        writer = csv.writer(csvf)
        writer.writerow(('word1', 'word2', 'pmi_score'))
        for row in finder.score_ngrams(bigram_measures.pmi):
            data = (*row[0], row[1])
            try:
                writer.writerow(data)
            except UnicodeEncodeError:
                pass


def pmi_score(file, lang, column='數據列'):
    """Compute PMI scores for a text file.

    :param file: path to the raw text data file
    :param lang: language of the data, either 'chinese' or 'english'
    :param column: for csv/excel files, the name of the column holding the text
    """
    # Read the raw text
    text = ''
    if 'csv' in file:
        df = pd.read_csv(file)
        for _, row in df.iterrows():
            text += row[column]
    elif ('xlsx' in file) or ('xls' in file):
        df = pd.read_excel(file)
        for _, row in df.iterrows():
            text += row[column]
    else:
        text = open(file, encoding='utf-8').read()
    # Dispatch to the handler for the given language
    globals()[lang](text)


# Compute PMI
pmi_score(file='test.txt', lang='chinese')
```
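For csv or excel input you must also pass the column that holds the text. A hypothetical call (the file name and column name below are placeholders, not from the original post) might look like this:

```python
# Hypothetical: an English-language Excel file whose text sits in a column named "content"
pmi_score(file='reviews.xlsx', lang='english', column='content')
```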
test.txt contains the introductions of 4,000+ Zhihu Live sessions. (The original post showed a screenshot of part of the PMI output here.)
The PMI scores are printed from largest to smallest. As you can see, the larger the PMI, the more strongly the two words belong together.

Scrolling to the combinations at the very end, the PMI turns negative, i.e. p(x, y) < p(x)p(y): those word pairs co-occur less often than chance would predict, so there is little real association between them.