http://blog.csdn.net/pipisorry/article/details/46447561
Easily build an LDA model with Python gensim.
Introduction to gensim
gensim is a free Python library that efficiently and automatically extracts semantic topics from documents. Its algorithms, including LSA (Latent Semantic Analysis), LDA (Latent Dirichlet Allocation), and RP (Random Projections), discover the semantic structure of documents by examining statistical co-occurrence patterns of words within a corpus of training documents. These algorithms are unsupervised and can work on raw, unstructured text ("plain text").
Gensim is a rather professional Python toolkit for topic modelling. In text processing, for example product-review mining, you sometimes need to know how similar each review is to the product description, as a measure of how objective the review is. The higher the similarity between a review and the product description, the more "official" the review's wording tends to be: it carries less emotional colouring, focuses more on describing the product's attributes and characteristics, and takes a more objective angle.
Gensim: implementation language: Python; implemented models: LDA, Dynamic Topic Model, Dynamic Influence Model, HDP, LSI, Random Projections, and the deep-learning models word2vec and paragraph2vec.
[Official homepage: gensim — topic modelling for humans]
[GitHub repository: https://github.com/piskvorky/gensim]
gensim features
- Memory independence: the whole training corpus never needs to reside fully in RAM at any one time
- Efficient implementations of several popular vector space algorithms, including TF-IDF, distributed LSA, distributed LDA, and RP; adding new algorithms is easy
- I/O wrappers and converters for several popular data formats
- Similarity queries over documents in their semantic representation
- gensim was created because of the lack of a simple (Java implementations are complex) and scalable software framework for topic modelling.
gensim design principles
- Straightforward interfaces with a low learning curve; convenient for prototyping
- Memory use is independent of the size of the input corpus; all algorithms are stream-based and access one document at a time (see the streaming sketch below).
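A minimal sketch of what "stream-based, one document at a time" looks like in practice, in the spirit of the official tutorial; the file name mycorpus.txt (one document per line) and the dictionary built from it are assumptions made for illustration only:

from gensim import corpora

# build the dictionary in one streamed pass over the (assumed) file mycorpus.txt
dictionary = corpora.Dictionary(line.lower().split() for line in open('mycorpus.txt'))

class MyCorpus(object):
    """A corpus that streams one document at a time instead of loading everything into RAM."""
    def __iter__(self):
        for line in open('mycorpus.txt'):
            # each yielded item is one sparse bag-of-words vector
            yield dictionary.doc2bow(line.lower().split())

corpus_memory_friendly = MyCorpus()
for vector in corpus_memory_friendly:  # every pass re-reads the file, never holding it all in memory
    print(vector)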
gensim core concepts
The whole gensim package revolves around three concepts: corpus, vector, and model.
- Corpus
A collection of documents, used to automatically infer the structure of the documents, their topics, and so on; it is also called the training corpus.
- Vector
In the vector space model (VSM), each document is represented as an array of features. For example, a single feature can be thought of as a question-answer pair:
[1]. How many times does the word "splonge" appear in the document? Zero.
[2]. How many sentences does the document contain? Two.
[3]. How many fonts does the document use? Five.
Each question can be represented by an integer id (e.g. 1, 2, 3), so the document above becomes a series of pairs: (1, 0.0), (2, 2.0), (3, 5.0). If we know all the questions in advance, we can leave the ids implicit and write simply (0.0, 2.0, 5.0). This sequence of answers can be regarded as a vector (here 3-dimensional). For practical purposes, only questions whose answers are real numbers are used, and the questions are the same for every document. Therefore, given two vectors (representing two documents), we would like to draw conclusions such as: "if the numbers in the two vectors are similar, the original documents are similar as well." Of course, whether such a conclusion holds depends on how well we chose our questions.
- Sparse vector
Typically, the answers to most questions are 0.0. To save space, we omit them from the document's representation and write only (2, 2.0), (3, 5.0) (note that (1, 0.0) has been dropped). Since the full set of questions is known in advance, every feature missing from a sparse representation can unambiguously be taken to be 0.0.
A distinctive feature of gensim is that it does not prescribe any particular corpus format: a corpus can be anything that, when iterated over, yields these sparse vectors. For example, the collection [[(2, 2.0), (3, 5.0)], [(0, -1.0), (3, -1.0)]] is a corpus of two documents, each with two non-zero pairs; a small illustration follows below.
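A short sketch of the dense-to-sparse conversion and of a plain Python list acting as a corpus; note that the enumeration below uses 0-based ids, unlike the 1-based ids in the prose above:

# dense answer vector for the example document above
dense = [0.0, 2.0, 5.0]
# drop the zero entries to obtain the sparse (id, value) representation
sparse = [(i, v) for i, v in enumerate(dense) if v != 0.0]
print(sparse)  # [(1, 2.0), (2, 5.0)]

# any iterable yielding such sparse vectors is already a valid gensim corpus,
# e.g. the two-document collection from the text:
corpus = [[(2, 2.0), (3, 5.0)], [(0, -1.0), (3, -1.0)]]
for doc in corpus:
    print(doc)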
- Model
For our purposes, a model is a transformation from one document representation to another. Both the initial and the target representations are vectors; they differ only in which questions and answers are involved. The transformation is learned automatically from the training corpus, without human supervision, with the goal that the final document representation is more compact and more useful: similar documents end up with similar representations (see the TF-IDF sketch below).
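TF-IDF is one such transformation: it learns document frequencies from a training corpus and then maps any bag-of-words vector into TF-IDF space. A minimal sketch, where the tiny two-document corpus is made up purely for illustration:

from gensim import corpora, models

# a made-up two-document corpus, just to have something to train on
texts = [["human", "interface", "computer"],
         ["survey", "user", "computer", "system"]]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

tfidf = models.TfidfModel(corpus)                   # learn the transformation from the training corpus
new_doc = dictionary.doc2bow(["human", "computer"])
print(tfidf[new_doc])                               # the same document, now represented in TF-IDF space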
Installing gensim
gensim depends on the two big Python scientific computing packages NumPy and SciPy, which must be installed first.
Then install gensim itself: pip install gensim
Depending on which of the examples below you run, you may also need additional packages such as scikit-learn (sklearn) and nltk.
Official gensim tutorials
The official tutorial is split into several parts, for example [Experiments on the English Wikipedia].
Quickly implementing LDA with gensim
Vector representation of documents (Corpora and Vector Spaces)
Convert documents represented as strings into document vectors represented by ids:
documents = ["Human machine interface for lab abc computer applications", "A survey of user opinion of computer system response time", "The EPS user interface management system", "System and human system engineering testing of EPS", "Relation of user perceived response time to error measurement", "The generation of random binary unordered trees", "The intersection graph of paths in trees", "Graph minors IV Widths of trees and well quasi ordering", "Graph minors A survey"] """ #use StemmedCountVectorizer to get stemmed without stop words corpus Vectorizer = StemmedCountVectorizer # Vectorizer = CountVectorizer vectorizer = Vectorizer(stop_words='english') vectorizer.fit_transform(documents) texts = vectorizer.get_feature_names() # print(texts) """ texts = [doc.lower().split() for doc in documents] # print(texts) dict = corpora.Dictionary(texts) #自建詞典 # print dict, dict.token2id #經過dict將用字符串表示的文檔轉換爲用id表示的文檔向量 corpus = [dict.doc2bow(text) for text in texts] print(corpus)
Two ways to look up the topics of a document
That is, querying which topics a given document corresponds to, and with what probabilities.
topics = [lda_model[c] for c in corpus_tfidf]  # not recommended for bulk queries: too slow, only suitable for small collections
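The per-document alternative is sketched below: it queries one bag-of-words vector at a time. get_document_topics is available in newer gensim versions; lda_model and dict are assumed to be the model and dictionary from the earlier snippets, and the example document is made up:

# query a single new document instead of transforming the whole corpus
doc = "human computer interaction"
bow = dict.doc2bow(doc.lower().split())

# equivalent to lda_model[bow]; minimum_probability filters out near-zero topics
print(lda_model.get_document_topics(bow, minimum_probability=0.01))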
Implementation example
Using the gensim Python package
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
__title__ = 'topic model - build lda - 20news dataset'
__author__ = 'pi'
__mtime__ = '12/26/2014-026'
"""
from Colors import *
from collections import defaultdict
import re
import datetime
from sklearn import datasets
import nltk
from gensim import corpora
from gensim import models
import numpy as np
from scipy import spatial
from CorePyPro.Fun.TimeStump import totalTime


def load_texts(dataset_type='train', groups=None):
    """
    load datasets to bytes list
    :return: train_dataset_bunch.data bytes list
    """
    if groups == 'small':
        groups = ['comp.graphics', 'comp.os.ms-windows.misc']  # for small-scale tests only  #1368
    elif groups == 'medium':
        groups = ['comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware',
                  'comp.sys.mac.hardware', 'comp.windows.x', 'sci.space']  # for medium-sized data  #3414
    train_dataset_bunch = datasets.load_mlcomp('20news-18828', dataset_type, mlcomp_root='./datasets',
                                               categories=groups)  # 13180
    return train_dataset_bunch.data


def preprocess_texts(texts, test_doc_id=1):
    """
    texts preprocessing
    :param texts: bytes list
    :return: list of token lists
    """
    texts = [t.decode(errors='ignore') for t in texts]  # bytes2str
    # print(REDH, 'original texts[%d]: ' % test_doc_id, DEFAULT, '\n', texts[test_doc_id])

    # lowercase, split into word lists with the separators below, and drop empty tokens
    SEPS = r'[\s()-/,:.?!]\s*'
    texts = [re.split(SEPS, t.lower()) for t in texts]
    for t in texts:
        while '' in t:
            t.remove('')
    # print(REDH, 'texts[%d] lower & split(seps= %s) & delete None: #%d'
    #       % (test_doc_id, SEPS, len(texts[test_doc_id])), DEFAULT, '\n', texts[test_doc_id])

    # nltk.download()  # then choose corpora/stopwords
    stopwords = set(nltk.corpus.stopwords.words('english'))  # #127
    stopwords.update(['from', 'subject', 'writes'])  # #129
    word_usage = defaultdict(int)
    for t in texts:
        for w in t:
            word_usage[w] += 1
    COMMON_LINE = len(texts) / 10
    too_common_words = [w for w in word_usage if word_usage[w] > COMMON_LINE]  # words appearing in >10% of docs
    # print('too_common_words: #', len(too_common_words), '\n', too_common_words)  # 68
    stopwords.update(too_common_words)
    # print('stopwords: #', len(stopwords), '\n', stopwords)  # #147

    english_stemmer = nltk.SnowballStemmer('english')
    MIN_WORD_LEN = 3  # 4
    texts = [[english_stemmer.stem(w) for w in t
              if not set(w) & set('@+>0123456789*') and w not in stopwords and len(w) >= MIN_WORD_LEN]
             for t in texts]  # set('+-.?!()>@0123456789*/')
    # print(REDH, 'texts[%d] delete ^alphanum & stopwords & len<%d & stemmed: #' % (test_doc_id, MIN_WORD_LEN),
    #       len(texts[test_doc_id]), DEFAULT, '\n', texts[test_doc_id])
    return texts


def build_corpus(texts):
    """
    build corpora
    :param texts: list of token lists
    :return: corpus DirectTextCorpus(corpora.TextCorpus)
    """
    class DirectTextCorpus(corpora.TextCorpus):
        def get_texts(self):
            return self.input

        def __len__(self):
            return len(self.input)

    corpus = DirectTextCorpus(texts)
    return corpus


def build_id2word(corpus):
    """
    build the id2word mapping from the corpus
    :param corpus:
    :return: dict = corpus.dictionary
    """
    dict = corpus.dictionary  # gensim.corpora.dictionary.Dictionary
    # print(dict.id2token)
    try:
        dict['anything']  # force id2token to be built
    except Exception:
        pass  # print("dict.id2token is not {} now")
    # print(dict.id2token)
    return dict


def save_corpus_dict(dict, corpus, dictDir='./LDA/id_word.dict', corpusDir='./LDA/corpus.mm'):
    dict.save(dictDir)
    print(GREENL, 'dict saved into %s successfully ...' % dictDir, DEFAULT)
    corpora.MmCorpus.serialize(corpusDir, corpus)
    print(GREENL, 'corpus saved into %s successfully ...' % corpusDir, DEFAULT)
    # corpus.save(fname='./LDA/corpus.mm')  # stores only the (tiny) iteration object


def load_ldamodel(modelDir='./lda.pkl'):
    model = models.LdaModel.load(fname=modelDir)
    print(GREENL, 'ldamodel load from %s successfully ...' % modelDir, DEFAULT)
    return model


def load_corpus_dict(dictDir='./LDA/id_word.dict', corpusDir='./LDA/corpus.mm'):
    dict = corpora.Dictionary.load(fname=dictDir)
    print(GREENL, 'dict load from %s successfully ...' % dictDir, DEFAULT)
    # dict = corpora.Dictionary.load_from_text('./id_word.txt')
    corpus = corpora.MmCorpus(corpusDir)  # corpora.mmcorpus.MmCorpus
    print(GREENL, 'corpus load from %s successfully ...' % corpusDir, DEFAULT)
    return dict, corpus


def build_doc_word_mat(corpus, model, num_topics):
    """
    build the document matrix in topic space
    :param corpus:
    :param model:
    :param num_topics: int
    :return: doc_word_mat np.array (len(topics) * num_topics)
    """
    topics = [model[c] for c in corpus]  # one (topic_id, weight) list per document
    doc_word_mat = np.zeros((len(topics), num_topics))
    for doc, topic in enumerate(topics):
        for topic_id, weight in topic:
            doc_word_mat[doc, topic_id] += weight
    return doc_word_mat


def compute_pairwise_dist(doc_word_mat):
    """
    compute pairwise distances between documents
    :param doc_word_mat: np.array (len(topics) * num_topics)
    :return: pairwise_dist <class 'numpy.ndarray'>
    """
    pairwise_dist = spatial.distance.squareform(spatial.distance.pdist(doc_word_mat))
    max_weight = pairwise_dist.max() + 1
    for i in list(range(len(pairwise_dist))):
        pairwise_dist[i, i] = max_weight  # exclude each document from being its own nearest neighbour
    return pairwise_dist


def closest_texts(corpus, model, num_topics, original_texts, test_doc_id=1, topn=5):
    """
    find the closest_doc_ids for doc[test_doc_id]
    :param corpus:
    :param model:
    :param num_topics:
    :param original_texts: raw texts, used only for display
    :param test_doc_id:
    :param topn:
    :return:
    """
    doc_word_mat = build_doc_word_mat(corpus, model, num_topics)
    pairwise_dist = compute_pairwise_dist(doc_word_mat)
    # print(REDH, 'original texts[%d]: ' % test_doc_id, DEFAULT, '\n', original_texts[test_doc_id])
    closest_doc_ids = pairwise_dist[test_doc_id].argsort()
    # return closest_doc_ids[:topn]
    for closest_doc_id in closest_doc_ids[:topn]:
        print(RED, 'closest doc[%d]' % closest_doc_id, DEFAULT, '\n', original_texts[closest_doc_id])


def evaluate_model(model):
    """
    compute the model's perplexity on the test data
    :param model:
    :return: model.log_perplexity float
    """
    test_texts = load_texts(dataset_type='test', groups='small')
    test_texts = preprocess_texts(test_texts)
    test_corpus = build_corpus(test_texts)
    return model.log_perplexity(test_corpus)


def test_num_topics():
    dict, corpus = load_corpus_dict()
    print("#corpus_items:", len(corpus))
    for num_topics in [3, 5, 10, 30, 50, 100, 150, 200, 300]:
        start_time = datetime.datetime.now()
        model = models.LdaModel(corpus, num_topics=num_topics, id2word=dict)
        end_time = datetime.datetime.now()
        print("total running time = ", end_time - start_time)
        print(REDL, 'model.log_perplexity for test_texts with num_topics=%d : ' % num_topics,
              evaluate_model(model), DEFAULT)


def test():
    texts = load_texts(dataset_type='train', groups='small')
    original_texts = texts
    test_doc_id = 1

    # texts = preprocess_texts(texts, test_doc_id=test_doc_id)
    # corpus = build_corpus(texts=texts)  # corpus DirectTextCorpus(corpora.TextCorpus)
    # dict = build_id2word(corpus)
    # save_corpus_dict(dict, corpus)
    dict, corpus = load_corpus_dict()
    # print(len(corpus))

    num_topics = 100
    model = models.LdaModel(corpus, num_topics=num_topics, id2word=dict)  # results differ between runs
    model.show_topic(0)
    # model.save(fname='./lda.pkl')

    # model = load_ldamodel()
    # closest_texts(corpus, model, num_topics, original_texts, test_doc_id=1, topn=3)

    print(REDL, 'model.log_perplexity for test_texts', evaluate_model(model), DEFAULT)


if __name__ == '__main__':
    test()
    # test_num_topics()
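One note on the "results differ between runs" comment above: LDA training is stochastic, so the learned topics vary across runs. A hedged sketch of how to make runs reproducible, reusing corpus, dict and num_topics from test() and assuming a gensim version recent enough to accept random_state (the seed value 42 is arbitrary):

# fixing the seed makes repeated runs comparable
model = models.LdaModel(corpus, num_topics=num_topics, id2word=dict, random_state=42)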
from: http://blog.csdn.net/pipisorry/article/details/46447561
ref: [Gensim official tutorial translation (1): Quick start]