Exploring Word Similarity on the Wikipedia Corpus

I previously wrote "Word2Vec Experiments on Chinese and English Wikipedia Corpora" (《中英文維基百科語料上的Word2Vec實驗》), and recently many readers have been asking questions in the comments under that article. Some of my recent work also involves Word2Vec, so I did some homework: I went back over the Word2Vec material, tried gensim's updated interfaces, and googled English and Chinese resources for "wikipedia word2vec" / "維基百科 word2vec". Most of what I found takes the same route as that earlier article: extract the Wikipedia corpus with gensim's preprocessing script gensim.corpora.WikiCorpus, store one article per line of text, and then train a word vector model with gensim's Word2Vec module. Here I'd like to offer an alternative way to process the Wikipedia corpus, train a word vector model, and compute word similarity. As for Word2Vec itself, if your English is decent, I recommend starting your reading with this article: Getting started with Word2Vec.
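For comparison, that older WikiCorpus route looks roughly like this; a minimal sketch, assuming a 2017-era gensim, with placeholder file names:

# Sketch of the gensim.corpora.WikiCorpus route (file names are placeholders).
from gensim.corpora import WikiCorpus

wiki = WikiCorpus('enwiki-latest-pages-articles.xml.bz2',
                  lemmatize=False, dictionary={})
with open('wiki_texts.txt', 'w') as output:
    # get_texts() yields one article at a time as a list of tokens.
    for text in wiki.get_texts():
        output.write(' '.join(text) + '\n')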

This time we'll take only the English Wikipedia corpus as our example. The first step, as before, is to download the latest Wikipedia XML dump: in the list of the latest English dumps at https://dumps.wikimedia.org/enwiki/latest/ , find "enwiki-latest-pages-articles.xml.bz2" and download it. This full English Wikipedia dump was packaged around April 4, 2017 and is roughly 13 GB; downloading it at home with wget over a 100 Mbps China Telecom broadband connection took me about 3 hours, which is not bad.

The next step is to process this compressed XML dump of English Wikipedia. This time we use WikiExtractor:

WikiExtractor.py is a Python script that extracts and cleans text from a Wikipedia database dump.
The tool is written in Python and requires Python 2.7 or Python 3.3+ but no additional library.

WikiExtractor is a Python script dedicated to extracting and cleaning text from Wikipedia dump data. It supports Python 2.7 or Python 3.3+ with no additional dependencies, and it is very easy to install and use:

Installation:
git clone https://github.com/attardi/wikiextractor.git
cd wikiextractor/
sudo python setup.py install

Usage:
WikiExtractor.py -o enwiki enwiki-latest-pages-articles.xml.bz2

......
INFO: 53665431  Pampapaul
INFO: 53665433  Charles Frederick Zimpel
INFO: Finished 11-process extraction of 5375019 articles in 8363.5s (642.7 art/s)

This whole process took a bit over 2 hours and extracted roughly 5.37 million articles. For my machine's configuration, see "Notes on Building a Deep Learning Box" (《深度學習主機攢機小記》).

The extracted files are split up in order and stored under a number of subdirectories:

Each subdirectory in turn holds a number of files named wiki_num (wiki_00, wiki_01, ...), each around 1 MB; this size can be controlled with the -b parameter:

-b n[KMG], --bytes n[KMG] maximum bytes per output file (default 1M)

Let's take a look at the contents of wiki_00:


<doc id="12" url="https://en.wikipedia.org/wiki?curid=12" title="Anarchism">
Anarchism

Anarchism is a political philosophy that advocates self-governed societies based on voluntary institutions. These are often described as stateless societies, although several authors have defined them more specifically as institutions based on non-hierarchical free associations. Anarchism holds the state to be undesirable, unnecessary, and harmful.
...
Criticisms of anarchism include moral criticisms and pragmatic criticisms. Anarchism is often evaluated as unfeasible or utopian by its critics.
</doc>
<doc id="25" url="https://en.wikipedia.org/wiki?curid=25" title="Autism">
Autism

Autism is a neurodevelopmental disorder characterized by impaired social interaction, verbal and non-verbal communication, and restricted and repetitive behavior. Parents usually notice signs in the first two years of their child's life. These signs often develop gradually, though some children with autism reach their developmental milestones at a normal pace and then regress. The diagnostic criteria require that symptoms become apparent in early childhood, typically before age three.
...
</doc>
...
Each wiki_num file in turn holds a number of docs, and each doc is wrapped in tags carrying its id, url, title and so on, which makes them easy to tell apart.
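If you need the docs individually rather than line by line, here is a minimal sketch for walking the extracted tree and splitting on those tags (the iter_docs helper is hypothetical, not part of WikiExtractor; 'enwiki' is the output directory from above):

import os
import re

# Matches the opening tag WikiExtractor writes for each article.
DOC_OPEN = re.compile(r'<doc id="(\d+)" url="([^"]+)" title="([^"]+)">')

def iter_docs(dirname):
    # Yield (id, url, title, text) for every doc under dirname.
    for root, dirs, files in os.walk(dirname):
        for filename in files:
            doc_id = url = title = None
            lines = []
            for line in open(os.path.join(root, filename)):
                m = DOC_OPEN.match(line)
                if m:
                    doc_id, url, title = m.groups()
                    lines = []
                elif line.startswith('</doc>'):
                    yield doc_id, url, title, ''.join(lines)
                else:
                    lines.append(line)

for doc_id, url, title, text in iter_docs('enwiki'):
    print doc_id, title  # Python 2, as in the rest of this post
    break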

Here we process the English Wikipedia data following the "memory-friendly iterator" style from the gensim author's word2vec tutorial. The code is below and has also been pushed to github: train_word2vec_with_gensim.py

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Author: Pan Yang (panyangnlp@gmail.com)
# Copyright 2017 @ Yu Zhen

import gensim
import logging
import multiprocessing
import os
import re
import sys

from pattern.en import tokenize
from time import time

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',
                    level=logging.INFO)


def cleanhtml(raw_html):
    """Strip HTML tags, e.g. the <doc> wrappers WikiExtractor adds."""
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr, ' ', raw_html)
    return cleantext


class MySentences(object):
    """Memory-friendly iterator: streams sentences from every file
    under dirname instead of loading the whole corpus into memory."""

    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        for root, dirs, files in os.walk(self.dirname):
            for filename in files:
                file_path = os.path.join(root, filename)
                for line in open(file_path):
                    sline = line.strip()
                    if sline == "":
                        continue
                    rline = cleanhtml(sline)
                    # pattern's tokenize() returns sentence strings
                    # with tokens separated by spaces.
                    tokenized_line = ' '.join(tokenize(rline))
                    # Keep lowercased, purely alphabetic tokens only.
                    is_alpha_word_line = [word for word in
                                          tokenized_line.lower().split()
                                          if word.isalpha()]
                    yield is_alpha_word_line


if __name__ == '__main__':
    if len(sys.argv) != 2:
        print "Please use: python train_word2vec_with_gensim.py data_path"
        sys.exit(1)
    data_path = sys.argv[1]
    begin = time()

    sentences = MySentences(data_path)
    model = gensim.models.Word2Vec(sentences,
                                   size=200,
                                   window=10,
                                   min_count=10,
                                   workers=multiprocessing.cpu_count())
    model.save("data/model/word2vec_gensim")
    model.wv.save_word2vec_format("data/model/word2vec_org",
                                  "data/model/vocabulary",
                                  binary=False)

    end = time()
    print "Total processing time: %d seconds" % (end - begin)

Note that the word tokenization here uses the English tokenize module from pattern. You could also use nltk's word_tokenize module with a small change (a sketch of the nltk variant follows the training log below), though nltk doesn't handle the word tokenization of some sentence-final words very well. We set the vector size to 200, the window to 10, and the minimum count to 10, and the isalpha() check filters out punctuation and non-English tokens. Now we can train the English Wikipedia Word2Vec model with this script:
python train_word2vec_with_gensim.py enwiki

2017-04-22 14:31:04,703 : INFO : collecting all words and their counts
2017-04-22 14:31:04,704 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2017-04-22 14:31:06,442 : INFO : PROGRESS: at sentence #10000, processed 480546 words, keeping 33925 word types
2017-04-22 14:31:08,104 : INFO : PROGRESS: at sentence #20000, processed 983240 words, keeping 51765 word types
2017-04-22 14:31:09,685 : INFO : PROGRESS: at sentence #30000, processed 1455218 words, keeping 64982 word types
2017-04-22 14:31:11,349 : INFO : PROGRESS: at sentence #40000, processed 1957479 words, keeping 76112 word types
......
2017-04-23 02:50:59,844 : INFO : worker thread finished; awaiting finish of 2 more threads
2017-04-23 02:50:59,844 : INFO : worker thread finished; awaiting finish of 1 more threads
2017-04-23 02:50:59,854 : INFO : worker thread finished; awaiting finish of 0 more threads
2017-04-23 02:50:59,854 : INFO : training on 8903084745 raw words (6742578791 effective words) took 37805.2s, 178351 effective words/s
2017-04-23 02:50:59,855 : INFO : saving Word2Vec object under data/model/word2vec_gensim, separately None                                                       
2017-04-23 02:50:59,855 : INFO : not storing attribute syn0norm                 
2017-04-23 02:50:59,855 : INFO : storing np array 'syn0' to data/model/word2vec_gensim.wv.syn0.npy                                                              
2017-04-23 02:51:00,241 : INFO : storing np array 'syn1neg' to data/model/word2vec_gensim.syn1neg.npy                                                           
2017-04-23 02:51:00,574 : INFO : not storing attribute cum_table                
2017-04-23 02:51:13,886 : INFO : saved data/model/word2vec_gensim               
2017-04-23 02:51:13,886 : INFO : storing vocabulary in data/model/vocabulary    
2017-04-23 02:51:17,480 : INFO : storing 868777x200 projection weights into data/model/word2vec_org                                                             
Total processing time: 44476 seconds
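By the way, as mentioned above, pattern's tokenizer can be swapped out for nltk's with only a small change to MySentences. A minimal sketch, assuming nltk is installed and its "punkt" sentence tokenizer models have been downloaded:

# Hypothetical drop-in replacement for pattern.en.tokenize using nltk
# (run nltk.download('punkt') once beforehand).
from nltk.tokenize import sent_tokenize, word_tokenize

def tokenize(text):
    # pattern's tokenize() returns a list of sentence strings with
    # tokens separated by spaces; mimic that so the rest of
    # MySentences stays unchanged.
    return [' '.join(word_tokenize(sent)) for sent in sent_tokenize(text)]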

This training run took a bit over 12 hours in total, and the trained files are stored under data/model:

Let's test this English Wikipedia Word2Vec model:

textminer@textminer:/opt/wiki/data$ ipython
Python 2.7.12 (default, Nov 19 2016, 06:48:10) 
Type "copyright", "credits" or "license" for more information.
 
IPython 2.4.1 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.
 
In [1]: from gensim.models import Word2Vec
 
In [2]: en_wiki_word2vec_model = Word2Vec.load('data/model/word2vec_gensim')

First, let's look at the most similar words (word similarity) for a few terms:

word:

In [3]: en_wiki_word2vec_model.most_similar('word')
Out[3]: 
[('phrase', 0.8129693269729614),
 ('meaning', 0.7311851978302002),
 ('words', 0.7010501623153687),
 ('adjective', 0.6805518865585327),
 ('noun', 0.6461974382400513),
 ('suffix', 0.6440576314926147),
 ('verb', 0.6319557428359985),
 ('loanword', 0.6262609958648682),
 ('proverb', 0.6240501403808594),
 ('pronunciation', 0.6105246543884277)]

similarity:

In [4]: en_wiki_word2vec_model.most_similar('similarity')
Out[4]: 
[('similarities', 0.8517599701881409),
 ('resemblance', 0.786037266254425),
 ('resemblances', 0.7496883869171143),
 ('affinities', 0.6571112275123596),
 ('differences', 0.6465682983398438),
 ('dissimilarities', 0.6212711930274963),
 ('correlation', 0.6071442365646362),
 ('dissimilarity', 0.6062943935394287),
 ('variation', 0.5970577001571655),
 ('difference', 0.5928016901016235)]

nlp:

In [5]: en_wiki_word2vec_model.most_similar('nlp')
Out[5]: 
[('neurolinguistic', 0.6698148250579834),
 ('psycholinguistic', 0.6388964056968689),
 ('connectionism', 0.6027182936668396),
 ('semantics', 0.5866401195526123),
 ('connectionist', 0.5865628719329834),
 ('bandler', 0.5837364196777344),
 ('phonics', 0.5733655691146851),
 ('psycholinguistics', 0.5613113641738892),
 ('bootstrapping', 0.559638261795044),
 ('psychometrics', 0.5555593967437744)]

learn:

In [6]: en_wiki_word2vec_model.most_similar('learn')
Out[6]: 
[('teach', 0.7533557415008545),
 ('understand', 0.71148681640625),
 ('discover', 0.6749690771102905),
 ('learned', 0.6599283218383789),
 ('realize', 0.6390970349311829),
 ('find', 0.6308424472808838),
 ('know', 0.6171890497207642),
 ('tell', 0.6146825551986694),
 ('inform', 0.6008728742599487),
 ('instruct', 0.5998791456222534)]

man:

In [7]: en_wiki_word2vec_model.most_similar('man')
Out[7]: 
[('woman', 0.7243080735206604),
 ('boy', 0.7029494047164917),
 ('girl', 0.6441491842269897),
 ('stranger', 0.63275545835495),
 ('drunkard', 0.6136815547943115),
 ('gentleman', 0.6122575998306274),
 ('lover', 0.6108279228210449),
 ('thief', 0.609005331993103),
 ('beggar', 0.6083744764328003),
 ('person', 0.597919225692749)]

Now let's look at a few other related methods:

In [8]: en_wiki_word2vec_model.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)
Out[8]: [('queen', 0.7752252817153931)]
 
In [9]: en_wiki_word2vec_model.similarity('woman', 'man')
Out[9]: 0.72430799548282099
 
In [10]: en_wiki_word2vec_model.doesnt_match("breakfast cereal dinner lunch".split())
Out[10]: 'cereal'
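A note on what In [8] is doing: most_similar(positive=['woman', 'king'], negative=['man']) looks for the word whose vector is closest by cosine similarity to vector('king') - vector('man') + vector('woman'), which is why 'queen' comes out on top. Also, since the script above saved a plain-text word2vec format copy, the vectors can be loaded without the full gensim model. A minimal sketch, assuming the gensim 1.0+ KeyedVectors API and the paths used above:

from gensim.models import KeyedVectors

# Load the C word2vec text format written by save_word2vec_format().
word_vectors = KeyedVectors.load_word2vec_format('data/model/word2vec_org',
                                                 binary=False)
print word_vectors.most_similar('wikipedia', topn=3)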

I've tidied up the code for this article together with the code from "Word2Vec Experiments on Chinese and English Wikipedia Corpora" and set up a Wikipedia_Word2vec project on github; interested readers can take a look.

Note: this is an original article. If you repost it, please credit the source and keep the link to "我愛自然語言處理" (52nlp): http://www.52nlp.cn

Permalink: Exploring Word Similarity on the Wikipedia Corpus http://www.52nlp.cn/?p=9454
