Some important links:
zhwiki-latest-pages-articles.xml.bz2
This post uses Gensim, a topic-modelling toolkit, to preprocess the data.
python -m gensim.scripts.segment_wiki -f zhwiki-latest-pages-articles.xml.bz2 | gzip > zhwiki-latest.json.gz
The dump is then converted into a gzip-compressed JSON file, one article per line, that Python can read directly.
from smart_open import smart_open  # newer smart_open releases provide smart_open.open instead
import json

x = 0
for line in smart_open('zhwiki-latest.json.gz'):
    # Each line holds one article as a JSON object
    article = json.loads(line)
    print("Article title: %s" % article['title'])
    for section_title, section_text in zip(article['section_titles'], article['section_texts']):
        print("Section title: %s" % section_title)
        print("Section text: %s" % section_text)
    # Stop after the first 5 articles
    x += 1
    if x == 5:
        break
Running the code above prints the first 5 articles of Chinese Wikipedia.
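The same loop can also be sketched with the standard library alone, which may help when smart_open is not installed. This is a minimal sketch, not part of Gensim: the helper names `first_articles` and `describe` are made up for illustration, and the only assumption about the data is the record layout already used above (`title`, `section_titles`, `section_texts`, one JSON object per line).

```python
import gzip
import json
from itertools import islice


def first_articles(path, limit=5):
    """Return the first `limit` articles from a gzipped JSON-lines file."""
    # Text mode ("rt") decodes each line as UTF-8 before json.loads sees it.
    with gzip.open(path, "rt", encoding="utf-8") as lines:
        # islice stops reading after `limit` lines, so the rest of the
        # file is never decompressed.
        return [json.loads(line) for line in islice(lines, limit)]


def describe(article):
    """Pair each section title with its text (hypothetical helper)."""
    return list(zip(article["section_titles"], article["section_texts"]))
```

Calling `first_articles('zhwiki-latest.json.gz')` would return the same five records that the loop above prints, as plain dictionaries.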
Traditional-to-Simplified Chinese conversion is done with OpenCC.
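#!/usr/bin/env python3
import json
import multiprocessing
import sys
from itertools import count

import opencc
from smart_open import smart_open


def t2s(s):
    # Traditional -> Simplified using OpenCC's t2s.json configuration.
    # The module-level convert() is the call used in this post; some OpenCC
    # Python packages expose an OpenCC class instead.
    return opencc.convert(s, config='t2s.json')


def convert_obj(p):
    # p is an (index, JSON line) pair
    num, line = p[0], p[1]
    article = json.loads(line)
    article['title'] = t2s(article['title'])
    article['section_titles'] = [t2s(t) for t in article['section_titles']]
    article['section_texts'] = [t2s(t) for t in article['section_texts']]
    return (num, json.dumps(article))


if __name__ == '__main__':
    p = multiprocessing.Pool()
    # imap yields results in input order, so articles keep their original order
    for out in p.imap(convert_obj, zip(count(), smart_open('zhwiki-latest.json.gz'))):
        print(out[1])
        # Report progress on stderr so stdout carries only the converted JSON
        if out[0] % 100000 == 0:
            sys.stderr.write('Processed %d\n' % (out[0]))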
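The per-record conversion can be tried without OpenCC or the full dump. The sketch below is only an illustration under stated assumptions: `DEMO_TABLE` is a hand-written four-character stand-in for OpenCC, and `convert_record` and `convert_all` are hypothetical names. It also passes `ensure_ascii=False` to `json.dumps`, which is a change from the script above: by default `json.dumps` escapes non-ASCII characters as `\uXXXX`, which is valid JSON but makes the converted Chinese text hard to read in the output file.

```python
import json

# Stand-in for OpenCC: a tiny hand-written traditional -> simplified table.
# A real run would call the OpenCC converter here instead.
DEMO_TABLE = str.maketrans({"維": "维", "檔": "档", "標": "标", "題": "题"})


def t2s(text):
    """Convert the few demo characters; everything else passes through."""
    return text.translate(DEMO_TABLE)


def convert_record(numbered_line):
    """Convert one (index, JSON line) pair and return (index, JSON line)."""
    num, line = numbered_line
    article = json.loads(line)
    article["title"] = t2s(article["title"])
    article["section_titles"] = [t2s(t) for t in article["section_titles"]]
    article["section_texts"] = [t2s(t) for t in article["section_texts"]]
    # ensure_ascii=False keeps Chinese characters readable in the output.
    return num, json.dumps(article, ensure_ascii=False)


def convert_all(lines, mapper=map):
    """Yield converted lines in input order; `mapper` may be map or Pool.imap."""
    for _, out in mapper(convert_record, enumerate(lines)):
        yield out
```

With a process pool, `convert_all(lines, mapper=pool.imap)` should produce the same sequence, since `imap` returns results in input order.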