Pointwise Mutual Information (PMI) is a metric for measuring the correlation between two events, such as two words.

From probability theory we know that if x and y are independent, then p(x, y) = p(x)p(y). The more correlated the two are, the larger p(x, y) is relative to p(x)p(y). PMI quantifies exactly this ratio:

$$\mathrm{PMI}(x;y)=\log\frac{p(x,y)}{p(x)\,p(y)}=\log\frac{p(x\mid y)}{p(x)}=\log\frac{p(y\mid x)}{p(y)}$$

The last form may be the easiest to read: the conditional probability p(x|y) of x appearing given that y appears, divided by the base probability p(x) of x appearing on its own, naturally expresses how strongly x is related to y.
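To make the ratio concrete, here is a minimal sketch in plain Python (the function name and the probabilities are ours, chosen purely for illustration), using log base 2 as is common in NLP. Note that independent events come out to a PMI of exactly 0:

```python
import math

def pmi(p_xy, p_x, p_y):
    """Pointwise mutual information, log base 2."""
    return math.log2(p_xy / (p_x * p_y))

# Independent events: p(x, y) = p(x) * p(y)  ->  PMI = 0
print(pmi(0.01, 0.1, 0.1))   # 0.0
# Positively correlated: p(x, y) > p(x) * p(y)  ->  PMI > 0
print(pmi(0.05, 0.1, 0.1))   # ~2.32
```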
Example:
Take an example from natural language processing: suppose we want to measure the polarity of the word like (positive or negative sentiment). We can pre-select some clearly positive words, such as good, and then compute the PMI between like and good:

$$\mathrm{PMI}(like;good)=\log\frac{p(like,good)}{p(like)\,p(good)}$$

where p(like, good) is the probability that like and good co-occur (for example, appear in the same sentence or within a fixed window), and p(like) and p(good) are the probabilities of each word appearing on its own. The larger the PMI, the more often like shows up together with a positive word, which suggests a positive polarity.
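As a quick, purely illustrative calculation (the counts below are made up, not taken from any real corpus): suppose that in 1,000 sentences, like appears in 100, good in 50, and the two appear together in 20:

```python
import math

# Hypothetical corpus statistics (illustrative only)
n_sentences = 1000
p_like = 100 / n_sentences        # p(like) = 0.1
p_good = 50 / n_sentences         # p(good) = 0.05
p_like_good = 20 / n_sentences    # p(like, good) = 0.02

# PMI(like; good) = log2( p(like, good) / (p(like) * p(good)) )
print(math.log2(p_like_good / (p_like * p_good)))  # 2.0
```

A PMI of 2 means like and good co-occur four times more often than chance would predict, which is evidence for a positive polarity.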
Here is a PMI implementation found on Stack Overflow:
```python
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder
from nltk.tokenize import word_tokenize

text = "this is a foo bar bar black sheep foo bar bar black sheep foo bar bar black sheep shep bar bar black sentence"
words = word_tokenize(text)

bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(words)
for row in finder.score_ngrams(bigram_measures.pmi):
    print(row)
```
```
(('is', 'a'), 4.523561956057013)
(('this', 'is'), 4.523561956057013)
(('a', 'foo'), 2.938599455335857)
(('sheep', 'shep'), 2.938599455335857)
(('black', 'sheep'), 2.5235619560570135)
(('black', 'sentence'), 2.523561956057013)
(('sheep', 'foo'), 2.3536369546147005)
(('bar', 'black'), 1.523561956057013)
(('foo', 'bar'), 1.523561956057013)
(('shep', 'bar'), 1.523561956057013)
(('bar', 'bar'), 0.5235619560570131)
```
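These scores are easy to sanity-check by hand. NLTK's bigram PMI is $\log_2\frac{c(x,y)\,N}{c(x)\,c(y)}$, where $c(\cdot)$ are counts and N is the total number of tokens. The sample text has 23 tokens, and this and is each appear exactly once, always together:

```python
import math

# NLTK bigram PMI: log2( count(x, y) * N / (count(x) * count(y)) )
n_tokens = 23          # tokens in the sample sentence
c_this, c_is = 1, 1    # unigram counts
c_this_is = 1          # bigram count of ('this', 'is')

print(math.log2(c_this_is * n_tokens / (c_this * c_is)))  # 4.523561956057013
```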
Now let's write a complete program that implements the following:

- Read data from txt, csv, xls, or xlsx files (for csv/excel data, the text is stored in a given column)
- Tokenize the text; lowercase and stem English words; remove stopwords
- Compute the PMI of every word pair under a bigram model
Complete code
```python
import re
import csv
import jieba
import nltk
import pandas as pd
from nltk.corpus import stopwords
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder


def chinese(text):
    """Process Chinese data and save the PMI scores to "中文pmi計算.csv"."""
    # Keep only Chinese characters, then segment with jieba
    content = ''.join(re.findall(r'[\u4e00-\u9fa5]+', text))
    words = [w for w in jieba.cut(content) if len(w) > 1]
    bigram_measures = BigramAssocMeasures()
    finder = BigramCollocationFinder.from_words(words)
    with open('中文pmi計算.csv', 'a+', encoding='gbk', newline='') as csvf:
        writer = csv.writer(csvf)
        writer.writerow(('word1', 'word2', 'pmi_score'))
        for row in finder.score_ngrams(bigram_measures.pmi):
            data = (*row[0], row[1])
            try:
                writer.writerow(data)
            except UnicodeEncodeError:
                # Skip rows that cannot be encoded as gbk
                pass


def english(text):
    """Process English data and save the PMI scores to "english_pmi_computer.csv"."""
    stopwordss = set(stopwords.words('english'))
    stemmer = nltk.stem.snowball.SnowballStemmer('english')
    tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')
    words = tokenizer.tokenize(text)
    words = [w for w in words if not w.isnumeric()]   # drop pure numbers
    words = [w.lower() for w in words]                # lowercase
    words = [stemmer.stem(w) for w in words]          # stem
    words = [w for w in words if w not in stopwordss] # remove stopwords
    bigram_measures = BigramAssocMeasures()
    finder = BigramCollocationFinder.from_words(words)
    with open('english_pmi_computer.csv', 'a+', encoding='gbk', newline='') as csvf:
        writer = csv.writer(csvf)
        writer.writerow(('word1', 'word2', 'pmi_score'))
        for row in finder.score_ngrams(bigram_measures.pmi):
            data = (*row[0], row[1])
            try:
                writer.writerow(data)
            except UnicodeEncodeError:
                pass


def pmi_score(file, lang, column='數據列'):
    """Compute PMI scores for a text file.

    :param file: path to the raw text data file
    :param lang: language of the data, either 'chinese' or 'english'
    :param column: for csv/excel files, the name of the column holding the text
    """
    # Read the raw text
    text = ''
    if 'csv' in file:
        df = pd.read_csv(file)
        for _, row in df.iterrows():
            text += row[column]
    elif ('xlsx' in file) or ('xls' in file):
        df = pd.read_excel(file)
        for _, row in df.iterrows():
            text += row[column]
    else:
        text = open(file, encoding='utf-8').read()
    # Dispatch to the handler for the given language
    globals()[lang](text)


# Compute PMI
pmi_score(file='test.txt', lang='chinese')
```
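For csv or excel input you must also pass the column that holds the text. A hypothetical call (the file name and column name below are placeholders, not from the original post) might look like this:

```python
# Hypothetical: an English-language Excel file whose text sits in a column named "content"
pmi_score(file='reviews.xlsx', lang='english', column='content')
```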
test.txt contains the introductions of 4,000+ Zhihu Live sessions. (The original post showed a screenshot of part of the PMI output here.)
The PMI scores are printed from largest to smallest. As you can see, the larger the PMI, the more strongly the two words belong together.

Scrolling to the combinations at the very end, the PMI turns negative, i.e. p(x, y) < p(x)p(y): those word pairs co-occur less often than chance would predict, so there is little real association between them.