http://blog.csdn.net/pipisorry/article/details/46447561
Easily build an LDA model with Python gensim.
Introduction to gensim
gensim is a free Python library that efficiently and automatically extracts semantic topics from documents. Its algorithms, including LSA (Latent Semantic Analysis), LDA (Latent Dirichlet Allocation), and RP (Random Projections), discover the semantic structure of documents by examining statistical co-occurrence patterns of words within a corpus of training documents. These algorithms are unsupervised and can work on raw, unstructured text ("plain text").
Gensim is a rather professional Python toolkit for topic modelling. In text processing, for example product-review mining, you sometimes need to know how similar each review is to the product description, as a measure of how objective the review is. The higher the similarity between a review and the product description, the more "official" the review's wording tends to be: it carries less emotional colouring, focuses more on describing the product's attributes and characteristics, and takes a more objective angle.
Gensim: implementation language: Python; implemented models: LDA, Dynamic Topic Model, Dynamic Influence Model, HDP, LSI, Random Projections, and the deep-learning models word2vec and paragraph2vec.
[Official homepage: gensim — topic modelling for humans]
[GitHub repository: https://github.com/piskvorky/gensim]
gensim features
- Memory independence: the whole training corpus never needs to reside fully in RAM at any one time
- Efficient implementations of several popular vector space algorithms, including TF-IDF, distributed LSA, distributed LDA, and RP; adding new algorithms is easy
- I/O wrappers and converters for several popular data formats
- Similarity queries over documents in their semantic representation
- gensim was created because of the lack of a simple (Java implementations are complex) and scalable software framework for topic modelling.
gensim design principles
- Straightforward interfaces with a low learning curve; convenient for prototyping
- Memory use is independent of the size of the input corpus; all algorithms are stream-based and access one document at a time (see the streaming sketch below).
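A minimal sketch of what "stream-based, one document at a time" looks like in practice, in the spirit of the official tutorial; the file name mycorpus.txt (one document per line) and the dictionary built from it are assumptions made for illustration only:

from gensim import corpora

# build the dictionary in one streamed pass over the (assumed) file mycorpus.txt
dictionary = corpora.Dictionary(line.lower().split() for line in open('mycorpus.txt'))

class MyCorpus(object):
    """A corpus that streams one document at a time instead of loading everything into RAM."""
    def __iter__(self):
        for line in open('mycorpus.txt'):
            # each yielded item is one sparse bag-of-words vector
            yield dictionary.doc2bow(line.lower().split())

corpus_memory_friendly = MyCorpus()
for vector in corpus_memory_friendly:  # every pass re-reads the file, never holding it all in memory
    print(vector)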
gensim core concepts
The whole gensim package revolves around three concepts: corpus, vector, and model.
- Corpus
A collection of documents, used to automatically infer the structure of the documents, their topics, and so on; it is also called the training corpus.
- Vector
In the vector space model (VSM), each document is represented as an array of features. For example, a single feature can be thought of as a question-answer pair:
[1]. How many times does the word "splonge" appear in the document? Zero.
[2]. How many sentences does the document contain? Two.
[3]. How many fonts does the document use? Five.
Each question can be represented by an integer id (e.g. 1, 2, 3), so the document above becomes a series of pairs: (1, 0.0), (2, 2.0), (3, 5.0). If we know all the questions in advance, we can leave the ids implicit and write simply (0.0, 2.0, 5.0). This sequence of answers can be regarded as a vector (here 3-dimensional). For practical purposes, only questions whose answers are real numbers are used, and the questions are the same for every document. Therefore, given two vectors (representing two documents), we would like to draw conclusions such as: "if the numbers in the two vectors are similar, the original documents are similar as well." Of course, whether such a conclusion holds depends on how well we chose our questions.
- Sparse vector
Typically, the answers to most questions are 0.0. To save space, we omit them from the document's representation and write only (2, 2.0), (3, 5.0) (note that (1, 0.0) has been dropped). Since the full set of questions is known in advance, every feature missing from a sparse representation can unambiguously be taken to be 0.0.
A distinctive feature of gensim is that it does not prescribe any particular corpus format: a corpus can be anything that, when iterated over, yields these sparse vectors. For example, the collection [[(2, 2.0), (3, 5.0)], [(0, -1.0), (3, -1.0)]] is a corpus of two documents, each with two non-zero pairs; a small illustration follows below.
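A short sketch of the dense-to-sparse conversion and of a plain Python list acting as a corpus; note that the enumeration below uses 0-based ids, unlike the 1-based ids in the prose above:

# dense answer vector for the example document above
dense = [0.0, 2.0, 5.0]
# drop the zero entries to obtain the sparse (id, value) representation
sparse = [(i, v) for i, v in enumerate(dense) if v != 0.0]
print(sparse)  # [(1, 2.0), (2, 5.0)]

# any iterable yielding such sparse vectors is already a valid gensim corpus,
# e.g. the two-document collection from the text:
corpus = [[(2, 2.0), (3, 5.0)], [(0, -1.0), (3, -1.0)]]
for doc in corpus:
    print(doc)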
- Model
For our purposes, a model is a transformation from one document representation to another. Both the initial and the target representations are vectors; they differ only in which questions and answers are involved. The transformation is learned automatically from the training corpus, without human supervision, with the goal that the final document representation is more compact and more useful: similar documents end up with similar representations (see the TF-IDF sketch below).
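TF-IDF is one such transformation: it learns document frequencies from a training corpus and then maps any bag-of-words vector into TF-IDF space. A minimal sketch, where the tiny two-document corpus is made up purely for illustration:

from gensim import corpora, models

# a made-up two-document corpus, just to have something to train on
texts = [["human", "interface", "computer"],
         ["survey", "user", "computer", "system"]]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

tfidf = models.TfidfModel(corpus)                   # learn the transformation from the training corpus
new_doc = dictionary.doc2bow(["human", "computer"])
print(tfidf[new_doc])                               # the same document, now represented in TF-IDF space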
Installing gensim
gensim depends on the two big Python scientific computing packages NumPy and SciPy, which must be installed first.
Then install gensim itself: pip install gensim
Depending on which of the examples below you run, you may also need additional packages such as scikit-learn (sklearn) and nltk.
Official gensim tutorials
The official tutorial is split into several parts, for example [Experiments on the English Wikipedia].
Quickly implementing LDA with gensim
Vector representation of documents (Corpora and Vector Spaces)
Convert documents represented as strings into document vectors represented by ids:
documents = ["Human machine interface for lab abc computer applications", "A survey of user opinion of computer system response time", "The EPS user interface management system", "System and human system engineering testing of EPS", "Relation of user perceived response time to error measurement", "The generation of random binary unordered trees", "The intersection graph of paths in trees", "Graph minors IV Widths of trees and well quasi ordering", "Graph minors A survey"] """ #use StemmedCountVectorizer to get stemmed without stop words corpus Vectorizer = StemmedCountVectorizer # Vectorizer = CountVectorizer vectorizer = Vectorizer(stop_words='english') vectorizer.fit_transform(documents) texts = vectorizer.get_feature_names() # print(texts) """ texts = [doc.lower().split() for doc in documents] # print(texts) dict = corpora.Dictionary(texts) #自建詞典 # print dict, dict.token2id #經過dict將用字符串表示的文檔轉換爲用id表示的文檔向量 corpus = [dict.doc2bow(text) for text in texts] print(corpus)
Two ways to look up the topics of a document
That is, querying which topics a given document corresponds to, and with what probabilities.
topics = [lda_model[c] for c in corpus_tfidf]  # not recommended for bulk queries: too slow, only suitable for small collections
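The per-document alternative is sketched below: it queries one bag-of-words vector at a time. get_document_topics is available in newer gensim versions; lda_model and dict are assumed to be the model and dictionary from the earlier snippets, and the example document is made up:

# query a single new document instead of transforming the whole corpus
doc = "human computer interaction"
bow = dict.doc2bow(doc.lower().split())

# equivalent to lda_model[bow]; minimum_probability filters out near-zero topics
print(lda_model.get_document_topics(bow, minimum_probability=0.01))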
Implementation example
Using the gensim Python package
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
__title__ = 'topic model - build lda - 20news dataset'
__author__ = 'pi'
__mtime__ = '12/26/2014-026'
"""
from Colors import *
from collections import defaultdict
import re
import datetime
from sklearn import datasets
import nltk
from gensim import corpora
from gensim import models
import numpy as np
from scipy import spatial
from CorePyPro.Fun.TimeStump import totalTime


def load_texts(dataset_type='train', groups=None):
    """
    load datasets to bytes list
    :return: train_dataset_bunch.data bytes list
    """
    if groups == 'small':
        groups = ['comp.graphics', 'comp.os.ms-windows.misc']  # for small-scale tests only  #1368
    elif groups == 'medium':
        groups = ['comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware',
                  'comp.sys.mac.hardware', 'comp.windows.x', 'sci.space']  # for medium-sized data  #3414
    train_dataset_bunch = datasets.load_mlcomp('20news-18828', dataset_type, mlcomp_root='./datasets',
                                               categories=groups)  # 13180
    return train_dataset_bunch.data


def preprocess_texts(texts, test_doc_id=1):
    """
    texts preprocessing
    :param texts: bytes list
    :return: list of token lists
    """
    texts = [t.decode(errors='ignore') for t in texts]  # bytes2str
    # print(REDH, 'original texts[%d]: ' % test_doc_id, DEFAULT, '\n', texts[test_doc_id])

    # lowercase, split into word lists with the separators below, and drop empty tokens
    SEPS = r'[\s()-/,:.?!]\s*'
    texts = [re.split(SEPS, t.lower()) for t in texts]
    for t in texts:
        while '' in t:
            t.remove('')
    # print(REDH, 'texts[%d] lower & split(seps= %s) & delete None: #%d'
    #       % (test_doc_id, SEPS, len(texts[test_doc_id])), DEFAULT, '\n', texts[test_doc_id])

    # nltk.download()  # then choose corpora/stopwords
    stopwords = set(nltk.corpus.stopwords.words('english'))  # #127
    stopwords.update(['from', 'subject', 'writes'])  # #129
    word_usage = defaultdict(int)
    for t in texts:
        for w in t:
            word_usage[w] += 1
    COMMON_LINE = len(texts) / 10
    too_common_words = [w for w in word_usage if word_usage[w] > COMMON_LINE]  # words appearing in >10% of docs
    # print('too_common_words: #', len(too_common_words), '\n', too_common_words)  # 68
    stopwords.update(too_common_words)
    # print('stopwords: #', len(stopwords), '\n', stopwords)  # #147

    english_stemmer = nltk.SnowballStemmer('english')
    MIN_WORD_LEN = 3  # 4
    texts = [[english_stemmer.stem(w) for w in t
              if not set(w) & set('@+>0123456789*') and w not in stopwords and len(w) >= MIN_WORD_LEN]
             for t in texts]  # set('+-.?!()>@0123456789*/')
    # print(REDH, 'texts[%d] delete ^alphanum & stopwords & len<%d & stemmed: #' % (test_doc_id, MIN_WORD_LEN),
    #       len(texts[test_doc_id]), DEFAULT, '\n', texts[test_doc_id])
    return texts


def build_corpus(texts):
    """
    build corpora
    :param texts: list of token lists
    :return: corpus DirectTextCorpus(corpora.TextCorpus)
    """
    class DirectTextCorpus(corpora.TextCorpus):
        def get_texts(self):
            return self.input

        def __len__(self):
            return len(self.input)

    corpus = DirectTextCorpus(texts)
    return corpus


def build_id2word(corpus):
    """
    build the id2word mapping from the corpus
    :param corpus:
    :return: dict = corpus.dictionary
    """
    dict = corpus.dictionary  # gensim.corpora.dictionary.Dictionary
    # print(dict.id2token)
    try:
        dict['anything']  # force id2token to be built
    except Exception:
        pass  # print("dict.id2token is not {} now")
    # print(dict.id2token)
    return dict


def save_corpus_dict(dict, corpus, dictDir='./LDA/id_word.dict', corpusDir='./LDA/corpus.mm'):
    dict.save(dictDir)
    print(GREENL, 'dict saved into %s successfully ...' % dictDir, DEFAULT)
    corpora.MmCorpus.serialize(corpusDir, corpus)
    print(GREENL, 'corpus saved into %s successfully ...' % corpusDir, DEFAULT)
    # corpus.save(fname='./LDA/corpus.mm')  # stores only the (tiny) iteration object


def load_ldamodel(modelDir='./lda.pkl'):
    model = models.LdaModel.load(fname=modelDir)
    print(GREENL, 'ldamodel load from %s successfully ...' % modelDir, DEFAULT)
    return model


def load_corpus_dict(dictDir='./LDA/id_word.dict', corpusDir='./LDA/corpus.mm'):
    dict = corpora.Dictionary.load(fname=dictDir)
    print(GREENL, 'dict load from %s successfully ...' % dictDir, DEFAULT)
    # dict = corpora.Dictionary.load_from_text('./id_word.txt')
    corpus = corpora.MmCorpus(corpusDir)  # corpora.mmcorpus.MmCorpus
    print(GREENL, 'corpus load from %s successfully ...' % corpusDir, DEFAULT)
    return dict, corpus


def build_doc_word_mat(corpus, model, num_topics):
    """
    build the document matrix in topic space
    :param corpus:
    :param model:
    :param num_topics: int
    :return: doc_word_mat np.array (len(topics) * num_topics)
    """
    topics = [model[c] for c in corpus]  # one (topic_id, weight) list per document
    doc_word_mat = np.zeros((len(topics), num_topics))
    for doc, topic in enumerate(topics):
        for topic_id, weight in topic:
            doc_word_mat[doc, topic_id] += weight
    return doc_word_mat


def compute_pairwise_dist(doc_word_mat):
    """
    compute pairwise distances between documents
    :param doc_word_mat: np.array (len(topics) * num_topics)
    :return: pairwise_dist <class 'numpy.ndarray'>
    """
    pairwise_dist = spatial.distance.squareform(spatial.distance.pdist(doc_word_mat))
    max_weight = pairwise_dist.max() + 1
    for i in list(range(len(pairwise_dist))):
        pairwise_dist[i, i] = max_weight  # exclude each document from being its own nearest neighbour
    return pairwise_dist


def closest_texts(corpus, model, num_topics, original_texts, test_doc_id=1, topn=5):
    """
    find the closest_doc_ids for doc[test_doc_id]
    :param corpus:
    :param model:
    :param num_topics:
    :param original_texts: raw texts, used only for display
    :param test_doc_id:
    :param topn:
    :return:
    """
    doc_word_mat = build_doc_word_mat(corpus, model, num_topics)
    pairwise_dist = compute_pairwise_dist(doc_word_mat)
    # print(REDH, 'original texts[%d]: ' % test_doc_id, DEFAULT, '\n', original_texts[test_doc_id])
    closest_doc_ids = pairwise_dist[test_doc_id].argsort()
    # return closest_doc_ids[:topn]
    for closest_doc_id in closest_doc_ids[:topn]:
        print(RED, 'closest doc[%d]' % closest_doc_id, DEFAULT, '\n', original_texts[closest_doc_id])


def evaluate_model(model):
    """
    compute the model's perplexity on the test data
    :param model:
    :return: model.log_perplexity float
    """
    test_texts = load_texts(dataset_type='test', groups='small')
    test_texts = preprocess_texts(test_texts)
    test_corpus = build_corpus(test_texts)
    return model.log_perplexity(test_corpus)


def test_num_topics():
    dict, corpus = load_corpus_dict()
    print("#corpus_items:", len(corpus))
    for num_topics in [3, 5, 10, 30, 50, 100, 150, 200, 300]:
        start_time = datetime.datetime.now()
        model = models.LdaModel(corpus, num_topics=num_topics, id2word=dict)
        end_time = datetime.datetime.now()
        print("total running time = ", end_time - start_time)
        print(REDL, 'model.log_perplexity for test_texts with num_topics=%d : ' % num_topics,
              evaluate_model(model), DEFAULT)


def test():
    texts = load_texts(dataset_type='train', groups='small')
    original_texts = texts
    test_doc_id = 1

    # texts = preprocess_texts(texts, test_doc_id=test_doc_id)
    # corpus = build_corpus(texts=texts)  # corpus DirectTextCorpus(corpora.TextCorpus)
    # dict = build_id2word(corpus)
    # save_corpus_dict(dict, corpus)
    dict, corpus = load_corpus_dict()
    # print(len(corpus))

    num_topics = 100
    model = models.LdaModel(corpus, num_topics=num_topics, id2word=dict)  # results differ between runs
    model.show_topic(0)
    # model.save(fname='./lda.pkl')

    # model = load_ldamodel()
    # closest_texts(corpus, model, num_topics, original_texts, test_doc_id=1, topn=3)

    print(REDL, 'model.log_perplexity for test_texts', evaluate_model(model), DEFAULT)


if __name__ == '__main__':
    test()
    # test_num_topics()
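One note on the "results differ between runs" comment above: LDA training is stochastic, so the learned topics vary across runs. A hedged sketch of how to make runs reproducible, reusing corpus, dict and num_topics from test() and assuming a gensim version recent enough to accept random_state (the seed value 42 is arbitrary):

# fixing the seed makes repeated runs comparable
model = models.LdaModel(corpus, num_topics=num_topics, id2word=dict, random_state=42)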
from: http://blog.csdn.net/pipisorry/article/details/46447561
ref: [Gensim official tutorial translation (1): Quick start]