Word2Vec in Practice with Gensim

Word2Vec

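Word2Vec in Practice with Gensim is part of the author's Data Science and Machine Learning in Practice handbook for programmers; the code is in gensim.ipynb. Recommended background reading: Python Syntax Quick Reference and Machine Learning Environment Setup, and the Scikit-Learn Cheat Sheet.

Creating the Model

Gensim's Word2Vec model expects a list of tokenized sentences as input, that is, a two-dimensional array. Here we use a built-in Python list for now, although it consumes a large amount of RAM when the input dataset is large. Gensim itself only requires an iterable that yields the sentences in order, so in practice we can use a custom iterable that streams the corpus and keeps only a single sentence in memory at a time.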

# Import word2vec
from gensim.models import word2vec

# Configure logging
import logging

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

# Load the dataset
raw_sentences = ["the quick brown fox jumps over the lazy dogs", "yoyoyo you go home now to sleep"]

# Tokenize each raw sentence into a list of words
sentences = [s.split() for s in raw_sentences]

# Build the model
model = word2vec.Word2Vec(sentences, min_count=1)

# Compare the similarity of two words
model.similarity('dogs', 'you')

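Calling Word2Vec to create the model actually iterates over the data twice: the first pass counts word frequencies to build the internal dictionary tree structure, and the second pass trains the neural network. The two steps can also be run separately, which gives manual control over non-repeatable streams (for example, streaming data from Kafka):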

import gensim

model = gensim.models.Word2Vec(iter=1)  # an empty model, no training yet
model.build_vocab(some_sentences)  # can be a non-repeatable, 1-pass generator
model.train(other_sentences)  # can be a non-repeatable, 1-pass generator
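
Why the two passes matter for streams can be seen without Gensim at all. The following is a minimal pure-Python sketch; `two_pass`, `one_shot`, and `Restartable` are hypothetical names chosen for illustration and are not part of Gensim. It mimics a consumer that scans its input twice and shows that a generator is already exhausted by the second scan, while an object that defines its own `__iter__` starts over each time.

```python
from collections import Counter


def two_pass(corpus):
    """Mimic a two-pass consumer: count words, then scan the sentences again."""
    vocab = Counter(word for sentence in corpus for word in sentence)  # pass 1
    sentences_seen = sum(1 for _ in corpus)                            # pass 2
    return vocab, sentences_seen


def one_shot(lines):
    """A generator: it can be consumed only once."""
    for line in lines:
        yield line.split()


class Restartable(object):
    """An iterable whose __iter__ begins a fresh scan on every call."""

    def __init__(self, lines):
        self.lines = lines

    def __iter__(self):
        for line in self.lines:
            yield line.split()


lines = ["the quick brown fox", "the lazy dogs"]

vocab_gen, seen_gen = two_pass(one_shot(lines))
vocab_it, seen_it = two_pass(Restartable(lines))

print(seen_gen)  # 0: the generator was used up by the first pass
print(seen_it)   # 2: the iterable restarted for the second pass
```

This is the situation the split build_vocab / train calls are meant for: each call consumes its own stream once, so neither stream has to be repeatable.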

Word2Vec Parameters

  • min_count

model = Word2Vec(sentences, min_count=10)  # default value is 5

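Corpora of different sizes call for different frequency thresholds. In a larger corpus, for instance, we want to ignore words that appear only once or twice, and the min_count parameter controls this. Reasonable values generally lie between 0 and 100.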
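
The effect of a frequency threshold can be sketched in a few lines of plain Python. `prune_vocab` is a hypothetical helper written for this illustration; it only approximates the idea behind min_count and is not how Gensim builds its vocabulary internally.

```python
from collections import Counter


def prune_vocab(sentences, min_count):
    """Keep only the words that occur at least min_count times in the corpus."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word: n for word, n in counts.items() if n >= min_count}


corpus = [
    "the quick brown fox".split(),
    "the lazy dog".split(),
    "the quick dog".split(),
]

kept = prune_vocab(corpus, min_count=2)
print(sorted(kept.items()))  # [('dog', 2), ('quick', 2), ('the', 3)]
```

With min_count=1 every word survives, which is why the two-sentence toy example earlier in this article has to pass min_count=1 explicitly.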

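  • size

The size parameter sets the dimensionality of the word vectors, which is the size of the neural network's hidden layer rather than a number of layers; the default in Word2Vec is 100. A larger size requires more training data but can also improve overall accuracy; reasonable values range from 10 to a few hundred.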

model = Word2Vec(sentences, size=200)  # default value is 100
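
A larger size also costs memory, because each word gets one vector of that length. The arithmetic below is a rough sketch under stated assumptions: `embedding_bytes` is a hypothetical helper, the vocabulary size of 71290 is taken from the save log shown later in this article, and 4 bytes per value assumes float32 vectors. It counts a single weight matrix only; a trained model holds more than one.

```python
def embedding_bytes(vocab_size, size, bytes_per_value=4):
    """Bytes needed for one vocab_size x size matrix of float32 values."""
    return vocab_size * size * bytes_per_value


print(embedding_bytes(71290, 200))  # 57032000

# The cost grows linearly with size.
for size in (100, 200, 300):
    mib = embedding_bytes(71290, size) / (1024.0 * 1024.0)
    print(size, round(mib, 1))
```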

  • workers

The workers parameter sets the number of threads used for parallel training, but it only takes effect when Cython is installed:

model = Word2Vec(sentences, workers=4) # default = 1 worker = no parallelization

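External Corpora

In real training scenarios we usually train on a larger corpus. Taking Word2Vec's official text8 corpus as an example, we only need to change the corpus source passed to the model: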

sentences = word2vec.Text8Corpus('text8')
model = word2vec.Word2Vec(sentences, size=200)

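The sentences in this corpus are already tokenized, so they can be used directly. I hit an error the first time I used this class, so the Gensim source is reproduced here; it is also a handy reference for writing custom readers for other corpora: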

class Text8Corpus(object):
    """Iterate over sentences from the "text8" corpus, unzipped from http://mattmahoney.net/dc/text8.zip ."""
    def __init__(self, fname, max_sentence_length=MAX_WORDS_IN_BATCH):
        self.fname = fname
        self.max_sentence_length = max_sentence_length

    def __iter__(self):
        # the entire corpus is one gigantic line -- there are no sentence marks at all
        # so just split the sequence of tokens arbitrarily: 1 sentence = 1000 tokens
        sentence, rest = [], b''
        with utils.smart_open(self.fname) as fin:
            while True:
                text = rest + fin.read(8192)  # avoid loading the entire file (=1 line) into RAM
                if text == rest:  # EOF
                    words = utils.to_unicode(text).split()
                    sentence.extend(words)  # return the last chunk of words, too (may be shorter/longer)
                    if sentence:
                        yield sentence
                    break
                last_token = text.rfind(b' ')  # last token may have been split in two... keep for next iteration
                words, rest = (utils.to_unicode(text[:last_token]).split(),
                               text[last_token:].strip()) if last_token >= 0 else ([], text)
                sentence.extend(words)
                while len(sentence) >= self.max_sentence_length:
                    yield sentence[:self.max_sentence_length]
                    sentence = sentence[self.max_sentence_length:]
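
Stripped of the file handling, the core idea of the class above is to cut one long run of tokens into fixed-length pseudo-sentences. The following is a simplified sketch of that idea only; `chunk_tokens` is a hypothetical helper, not Gensim code, and it works on an in-memory token stream rather than on buffered file reads.

```python
def chunk_tokens(tokens, max_sentence_length):
    """Group a flat stream of tokens into lists of at most max_sentence_length tokens."""
    sentence = []
    for token in tokens:
        sentence.append(token)
        if len(sentence) == max_sentence_length:
            yield sentence
            sentence = []
    if sentence:  # emit the final, possibly shorter, chunk
        yield sentence


text = "anarchism originated as a term of abuse first used against"
chunks = list(chunk_tokens(iter(text.split()), max_sentence_length=4))
print(chunks)
# [['anarchism', 'originated', 'as', 'a'], ['term', 'of', 'abuse', 'first'], ['used', 'against']]
```

The chunk boundaries are arbitrary, exactly as the comment in the Gensim source says, because the corpus carries no sentence marks.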

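As mentioned above, when the input corpus is large or the data has to be gathered from multiple folders on disk, we can use an iterator instead of reading everything into memory at once, which saves RAM: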

import os

import gensim


class MySentences(object):
    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        # Stream one line at a time so only a single sentence is held in memory
        for fname in os.listdir(self.dirname):
            with open(os.path.join(self.dirname, fname)) as fin:
                for line in fin:
                    yield line.split()

sentences = MySentences('/some/directory')  # a memory-friendly iterator
model = gensim.models.Word2Vec(sentences)
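
The class above reads the files of a single flat directory. For data spread over several folders, one possible variation is to walk the directory tree. This is a sketch under assumptions: `NestedSentences` is a hypothetical name, every file under the root is assumed to be plain text with one sentence per line, and the temporary corpus is created only to make the example runnable.

```python
import os
import tempfile


class NestedSentences(object):
    """Yield tokenized lines from every file under root, including subfolders."""

    def __init__(self, root):
        self.root = root

    def __iter__(self):
        for dirpath, dirnames, filenames in os.walk(self.root):
            dirnames.sort()  # visit subfolders in a stable order
            for fname in sorted(filenames):
                with open(os.path.join(dirpath, fname)) as fin:
                    for line in fin:
                        tokens = line.split()
                        if tokens:  # skip blank lines
                            yield tokens


# Build a tiny throwaway corpus spread over two folders.
root = tempfile.mkdtemp()
os.mkdir(os.path.join(root, "a"))
os.mkdir(os.path.join(root, "b"))
with open(os.path.join(root, "a", "one.txt"), "w") as f:
    f.write("the quick brown fox\n\n")
with open(os.path.join(root, "b", "two.txt"), "w") as f:
    f.write("the lazy dogs\n")

sentences = NestedSentences(root)
first = list(sentences)
second = list(sentences)  # a second scan works because __iter__ restarts
print(first == second, len(first))  # True 2
```

Because the object can be scanned repeatedly, it suits a consumer that needs one pass to count words and another to train.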

Saving and Loading the Model

model.save('text8.model')
2015-02-24 11:19:26,059 : INFO : saving Word2Vec object under text8.model, separately None
2015-02-24 11:19:26,060 : INFO : not storing attribute syn0norm
2015-02-24 11:19:26,060 : INFO : storing numpy array 'syn0' to text8.model.syn0.npy
2015-02-24 11:19:26,742 : INFO : storing numpy array 'syn1' to text8.model.syn1.npy

model1 = Word2Vec.load('text8.model')
 
model.save_word2vec_format('text.model.bin', binary=True)
2015-02-24 11:19:52,341 : INFO : storing 71290x200 projection weights into text.model.bin
 
model1 = word2vec.Word2Vec.load_word2vec_format('text.model.bin', binary=True)
2015-02-24 11:22:08,185 : INFO : loading projection weights from text.model.bin
2015-02-24 11:22:10,322 : INFO : loaded (71290, 200) matrix from text.model.bin
2015-02-24 11:22:10,322 : INFO : precomputing L2-norms of word weight vectors
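
The word2vec exchange format also has a plain-text variant, which is easy to inspect by hand. To my understanding its layout is a header line holding the vocabulary size and the vector size, followed by one line per word with the word and its values. The sketch below reads and writes that layout in pure Python; `write_text_vectors` and `read_text_vectors` are hypothetical helpers for illustration, not Gensim functions, and the two toy vectors are made up.

```python
import io


def write_text_vectors(vectors, out):
    """Write {word: [floats]} as a header line plus one 'word v1 v2 ...' line per word."""
    dim = len(next(iter(vectors.values())))
    out.write("%d %d\n" % (len(vectors), dim))
    for word, vec in vectors.items():
        out.write(word + " " + " ".join(repr(float(v)) for v in vec) + "\n")


def read_text_vectors(inp):
    """Parse the layout produced by write_text_vectors back into a dict."""
    vocab_size, dim = (int(x) for x in inp.readline().split())
    vectors = {}
    for _ in range(vocab_size):
        parts = inp.readline().split()
        values = [float(x) for x in parts[1:]]
        if len(values) != dim:
            raise ValueError("wrong vector length for %r" % parts[0])
        vectors[parts[0]] = values
    return vectors


toy = {"king": [0.5, 0.25], "queen": [0.5, 0.75]}
buf = io.StringIO()
write_text_vectors(toy, buf)
print(buf.getvalue().splitlines()[0])  # 2 2

buf.seek(0)
print(read_text_vectors(buf) == toy)  # True
```

The header corresponds to the "71290x200" and "(71290, 200)" figures in the log lines above.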

Model Prediction

Word2Vec is best known for inferring similar words semantically:

model.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)
[('queen', 0.50882536)]
model.doesnt_match("breakfast cereal dinner lunch".split())
'cereal'
model.similarity('woman', 'man')
0.73723527
model.most_similar(['man'])
[(u'woman', 0.5686948895454407),
 (u'girl', 0.4957364797592163),
 (u'young', 0.4457539916038513),
 (u'luckiest', 0.4420626759529114),
 (u'serpent', 0.42716869711875916),
 (u'girls', 0.42680859565734863),
 (u'smokes', 0.4265017509460449),
 (u'creature', 0.4227582812309265),
 (u'robot', 0.417464017868042),
 (u'mortal', 0.41728296875953674)]
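
The king/queen query can be pictured as vector arithmetic followed by a nearest-neighbour search. The following is a simplified sketch with hand-made three-dimensional vectors; `cosine` and `analogy` are hypothetical helpers, the toy vectors are invented so that the arithmetic works out, and Gensim's own most_similar differs in detail (for example in how it normalizes vectors).

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(vectors, positive, negative, topn=1):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for word in positive:
        target = [t + x for t, x in zip(target, vectors[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, vectors[word])]
    exclude = set(positive) | set(negative)  # never return the query words
    scored = [(w, cosine(target, v)) for w, v in vectors.items() if w not in exclude]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Invented axes: royalty, maleness, femaleness.
toy = {
    "king": [1.0, 1.0, 0.0],
    "queen": [1.0, 0.0, 1.0],
    "man": [0.0, 1.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
    "apple": [0.2, 0.1, 0.1],
}

best = analogy(toy, positive=["woman", "king"], negative=["man"], topn=1)
print(best[0][0])  # queen
```

Real vectors are learned rather than designed, so the match is never this clean, which is why the score for 'queen' above is about 0.51 rather than 1.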

To get the vector representation of a word directly, index the model with the word:

model['computer']  # raw NumPy vector of a word
array([-0.00449447, -0.00310097,  0.02421786, ...], dtype=float32)

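Model Evaluation

Word2Vec training is unsupervised, so it has few objective evaluation methods of the kind available in supervised learning, and its quality depends mostly on the end application. Google has released a set of about 20,000 syntactic and semantic test examples, each following the pattern "A is to B as C is to D", and Gensim can score a model against it: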

model.accuracy('/tmp/questions-words.txt')
2014-02-01 22:14:28,387 : INFO : family: 88.9% (304/342)
2014-02-01 22:29:24,006 : INFO : gram1-adjective-to-adverb: 32.4% (263/812)
2014-02-01 22:36:26,528 : INFO : gram2-opposite: 50.3% (191/380)
2014-02-01 23:00:52,406 : INFO : gram3-comparative: 91.7% (1222/1332)
2014-02-01 23:13:48,243 : INFO : gram4-superlative: 87.9% (617/702)
2014-02-01 23:29:52,268 : INFO : gram5-present-participle: 79.4% (691/870)
2014-02-01 23:57:04,965 : INFO : gram7-past-tense: 67.1% (995/1482)
2014-02-02 00:15:18,525 : INFO : gram8-plural: 89.6% (889/992)
2014-02-02 00:28:18,140 : INFO : gram9-plural-verbs: 68.7% (482/702)
2014-02-02 00:28:18,140 : INFO : total: 74.3% (5654/7614)
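
The per-section percentages in this log come from a simple count of correct answers. The sketch below shows one way such a score could be computed, assuming the question file uses section header lines that start with ':' followed by lines of four words; `parse_analogies` and `score` are hypothetical helpers, the sample lines are written for illustration, and the stand-in predictor replaces what would normally be a query against the trained model.

```python
def parse_analogies(lines):
    """Group 'A B C D' lines under the most recent ': section' header."""
    sections, current = {}, None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(":"):
            current = line[1:].strip()
            sections.setdefault(current, [])
        else:
            a, b, c, d = line.split()
            sections.setdefault(current, []).append((a, b, c, d))
    return sections


def score(sections, predict):
    """Return {section: (correct, total)} where predict(a, b, c) guesses d."""
    results = {}
    for name, questions in sections.items():
        correct = sum(1 for a, b, c, d in questions if predict(a, b, c) == d)
        results[name] = (correct, len(questions))
    return results


sample = [
    ": family",
    "boy girl brother sister",
    "boy girl king queen",
    ": gram8-plural",
    "banana bananas bird birds",
]

# A stand-in predictor that only knows one answer.
lookup = {("boy", "girl", "king"): "queen"}
results = score(parse_analogies(sample), lambda a, b, c: lookup.get((a, b, c)))
print(results)  # {'family': (1, 2), 'gram8-plural': (0, 1)}
```

Summing the correct and total counts over all sections gives the overall figure, as in the 5654/7614 total line of the log.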

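It is worth stressing that good results on this test set do not mean Word2Vec will perform well in a real application; the model still has to be judged against the task at hand.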
