Recently I have been working on knowledge graphs, with source data coming mainly from Baidu Baike, Hudong Baike, and the Chinese Wikipedia. Of these, the Chinese Wikipedia provides database dumps for download, and this post describes how to process that Wiki data.
1. Downloading the Chinese Wikipedia Data
Download the dump from https://dumps.wikimedia.org/zhwiki/latest/. The Wikipedia data consists mainly of the following parts:
| File | Contents |
| --- | --- |
| zhwiki-latest-pages-articles.xml.bz2 | Article body text |
| zhwiki-latest-redirect.sql | Article redirects (synonyms) |
| zhwiki-latest-pagelinks.sql | Outbound links from article pages |
| zhwiki-latest-page.sql | Article titles and abstracts |
| zhwiki-latest-categorylinks.sql | Article category links |
This post works with zhwiki-latest-pages-articles.xml.bz2.
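If you prefer to script the download rather than grab the file from a browser, a minimal sketch along the following lines works (Python 3 assumed; the compressed dump is on the order of a gigabyte, and wget or curl work just as well):

```python
# Minimal download sketch (Python 3 assumed); wget/curl are equally fine.
import urllib.request

URL = ('https://dumps.wikimedia.org/zhwiki/latest/'
       'zhwiki-latest-pages-articles.xml.bz2')
urllib.request.urlretrieve(URL, 'zhwiki-latest-pages-articles.xml.bz2')
```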
2. Extracting the Data
Gensim is a mature Python toolkit for topic modelling. It provides the WikiCorpus class for extracting and processing Wiki dumps (*articles.xml.bz2), turning the downloaded archive into a clean text corpus.
```python
class WikiCorpus(TextCorpus):
    """
    Treat a wikipedia articles dump (\*articles.xml.bz2) as a (read-only) corpus.

    The documents are extracted on-the-fly, so that the whole (massive) dump
    can stay compressed on disk.

    >>> wiki = WikiCorpus('enwiki-20100622-pages-articles.xml.bz2') # create word->word_id mapping, takes almost 8h
    >>> MmCorpus.serialize('wiki_en_vocab200k.mm', wiki) # another 8h, creates a file in MatrixMarket format plus file with id->word

    """
```
The source code is here for anyone who wants to study it in detail. Below is the processing script process_wiki_1.py, which turns the Wiki dump into the plain-text corpus wiki.zh.txt (860 MB).
```python
# -*- coding: utf-8 -*-
import logging
import sys

from gensim.corpora import WikiCorpus

logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s', level=logging.INFO)
'''
extract data from wiki dumps (*articles.xml.bz2) by gensim.
@chenbingjin 2016-05-11
'''

def help():
    print "Usage: python process_wiki.py zhwiki-latest-pages-articles.xml.bz2 wiki.zh.txt"

if __name__ == '__main__':
    if len(sys.argv) < 3:
        help()
        sys.exit(1)
    logging.info("running %s" % ' '.join(sys.argv))
    inp, outp = sys.argv[1:3]
    i = 0
    output = open(outp, 'w')
    wiki = WikiCorpus(inp, lemmatize=False, dictionary={})
    for text in wiki.get_texts():
        output.write(" ".join(text) + "\n")
        i = i + 1
        if (i % 10000 == 0):
            logging.info("Saved " + str(i) + " articles")
    output.close()
    logging.info("Finished, saved " + str(i) + " articles")
```
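Before kicking off the full extraction, it can be worth peeking at a single article to confirm that WikiCorpus really yields clean, tokenised text. A small sketch (not part of the script above; note that newer gensim releases have dropped the lemmatize argument used there):

```python
# Sanity-check sketch: preview the first extracted article.
# Note: gensim >= 4.0 no longer accepts the `lemmatize` argument.
from gensim.corpora import WikiCorpus

wiki = WikiCorpus('zhwiki-latest-pages-articles.xml.bz2', dictionary={})
first = next(iter(wiki.get_texts()))  # list of tokens from the first article
print(' '.join(first[:50]))           # show the first 50 tokens
```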
3. Preprocessing the Data
Because the Chinese Wikipedia contains Traditional characters and irregular characters, the text needs Traditional-to-Simplified conversion and character-encoding conversion. To support later work, the corpus also needs word segmentation.
(1) Traditional-to-Simplified conversion: done with the open-source converter OpenCC (installation instructions are here); on Linux it can be installed with:

```bash
sudo apt-get install opencc
```
(2) Character-encoding conversion: use iconv to keep only valid UTF-8 output:

```bash
iconv -c -t UTF-8 < input_file > output_file
#iconv -c -t UTF-8 input_file -o output_file
```
(3) Word segmentation: use the jieba segmenter from the command line:

```bash
python -m jieba input_file > cut_file
```
The three steps are combined in the script process_wiki_2.sh below:
```bash
#!/bin/bash
# preprocess data
# @chenbingjin 2016-05-11

# Traditional Chinese to Simplified Chinese
echo "opencc: Traditional Chinese to Simplified Chinese..."
#time opencc -i wiki.zh.txt -o wiki.zh.chs.txt -c zht2zhs.ini
time opencc -i wiki.zh.txt -o wiki.zh.chs.txt -c t2s.json

# Cut words
echo "jieba: Cut words..."
time python -m jieba -d ' ' wiki.zh.chs.txt > wiki.zh.seg.txt

# Change encode
echo "iconv: ascii to utf-8..."
time iconv -c -t UTF-8 < wiki.zh.seg.txt > wiki.zh.seg.utf.txt
```
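For anyone who would rather stay inside Python, roughly the same pipeline can be strung together with the opencc and jieba pip packages. This is only a sketch of the idea, not the script used for the timings below; depending on which OpenCC binding is installed, the config name may be 't2s' or 't2s.json'.

```python
# -*- coding: utf-8 -*-
# Rough Python-only sketch of the same pipeline (assumes the `opencc` and
# `jieba` pip packages; not the process_wiki_2.sh script timed below).
import io

import jieba
from opencc import OpenCC

cc = OpenCC('t2s')  # Traditional -> Simplified ('t2s.json' for some bindings)

with io.open('wiki.zh.txt', encoding='utf-8') as fin, \
     io.open('wiki.zh.seg.utf.txt', 'w', encoding='utf-8') as fout:
    for line in fin:
        simplified = cc.convert(line.strip())   # simplify the article text
        tokens = jieba.cut(simplified)          # segment into words
        fout.write(u' '.join(tokens) + u'\n')   # one segmented article per line
```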
4. Results
Processor: Intel(R) Xeon(R) CPU X5650 @ 2.67GHz
Processing log: word segmentation dominates the runtime at 48m4s.
```
opencc: Traditional Chinese to Simplified Chinese...
real    0m57.765s
user    0m45.494s
sys     0m6.910s
-----------------------------
jieba: Cut words...
Building prefix dict from /usr/local/lib/python2.7/dist-packages/jieba/dict.txt ...
Loading model from cache /tmp/jieba.cache
Dumping model to file cache /tmp/jieba.cache
Loading model cost 2.141 seconds.
Prefix dict has been built succesfully.
real    48m4.259s
user    47m36.987s
sys     0m22.746s
-----------------------------
iconv: ascii to utf-8...
real    0m22.039s
user    0m9.304s
sys     0m3.464s
```
Resulting data: a 1.1 GB segmented Chinese corpus.
```
-rw-r--r-- 1 chenbingjin data 860M 7月  2 14:33 wiki.zh.txt
-rw-r--r-- 1 chenbingjin data 860M 7月  2 17:46 wiki.zh.chs.txt
-rw-r--r-- 1 chenbingjin data 1.1G 7月  2 18:34 wiki.zh.seg.txt
-rw-r--r-- 1 chenbingjin data 1.1G 7月  2 18:34 wiki.zh.seg.utf.txt
```
Addendum: the unsegmented wiki corpus is also available for download for anyone who needs it.