Natural Language Communication System, phxnet Team, Innovation Training: My Blog (14)

 

Study notes on WikiExtractor:

 

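WikiExtractor is a Python script built specifically to extract and clean Wikipedia dump data. It supports Python 2.7 or Python 3.3+, has no additional dependencies, and is easy to install and use:

Installation: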

git clone https://github.com/attardi/wikiextractor.git
cd wikiextractor/
sudo python setup.py install

Usage:

WikiExtractor.py -o enwiki enwiki-latest-pages-articles.xml.bz2
......
INFO: 53665431  Pampapaul
INFO: 53665433  Charles Frederick Zimpel
INFO: Finished 11-process extraction of 5375019 articles in 8363.5s (642.7 art/s)

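The whole run took a little over two hours (8363.5 seconds, using 11 processes) and extracted roughly 5.37 million articles. For my machine configuration, see my post "深度學習主機攢機小記" (Notes on Building a Deep Learning Workstation).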
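As a quick sanity check on those figures, the summary line from the log can be parsed and the throughput recomputed. This is only an illustrative sketch: the regular expression is written by hand for this one log line and is not part of WikiExtractor.

```python
import re

# Final summary line copied from the log above.
summary = "INFO: Finished 11-process extraction of 5375019 articles in 8363.5s (642.7 art/s)"

# Hypothetical pattern written for this single line; not a WikiExtractor API.
match = re.search(r"extraction of (\d+) articles in ([\d.]+)s", summary)
articles = int(match.group(1))
seconds = float(match.group(2))

hours = seconds / 3600      # wall-clock time in hours
rate = articles / seconds   # articles processed per second

print(f"{hours:.2f} hours, {rate:.1f} articles/s")  # 2.32 hours, 642.7 articles/s
```

The recomputed rate matches the 642.7 art/s that the tool reports.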

The extracted output is split in a fixed order and stored across multiple subdirectories.
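To walk that output from Python, something like the following sketch may help. It assumes the layout described in this post (an output directory such as enwiki, one level of subdirectories, and files whose names start with wiki_). The function name list_extracted_files is made up for illustration.

```python
from pathlib import Path


def list_extracted_files(output_dir):
    """Return every wiki_* file one level below output_dir, in sorted order.

    Assumes the layout <output_dir>/<subdirectory>/wiki_<num>.
    """
    root = Path(output_dir)
    return sorted(p for p in root.glob("*/wiki_*") if p.is_file())


if __name__ == "__main__":
    # "enwiki" is the directory passed to -o in the command above.
    for path in list_extracted_files("enwiki"):
        print(path)
```

Sorting the paths keeps the traversal deterministic, which is convenient if later processing steps need to be resumed or compared between runs.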

Each subdirectory in turn holds a number of files named wiki_num, each about 1M in size. This size can be controlled with the -b parameter:

-b n[KMG], --bytes n[KMG] maximum bytes per output file (default 1M)
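To make the n[KMG] notation concrete, here is a minimal sketch of how such a size string could be turned into a byte count. It assumes binary multiples (K = 1024 bytes), and parse_size is a hypothetical helper; WikiExtractor's own parsing may differ in detail.

```python
def parse_size(spec):
    """Convert a size such as '500K', '1M' or '2G' into a number of bytes.

    Assumption: K, M and G are binary multiples (1024-based).
    A bare number without a suffix is treated as a byte count.
    """
    units = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}
    suffix = spec[-1].upper()
    if suffix in units:
        return int(spec[:-1]) * units[suffix]
    return int(spec)


print(parse_size("1M"))    # 1048576
print(parse_size("500K"))  # 512000
```

Under this reading, passing -b 500K would roughly halve the size of each output file compared with the 1M default.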

Let's look at what wiki_00 actually contains:

<doc id="12" url="https://en.wikipedia.org/wiki?curid=12" title="Anarchism">
Anarchism

Anarchism is a political philosophy that advocates self-governed societies based on voluntary institutions. These are often described as stateless societies, although several authors have defined them more specifically as institutions based on non-hierarchical free associations. Anarchism holds the state to be undesirable, unnecessary, and harmful.
...
Criticisms of anarchism include moral criticisms and pragmatic criticisms. Anarchism is often evaluated as unfeasible or utopian by its critics.


</doc>
<doc id="25" url="https://en.wikipedia.org/wiki?curid=25" title="Autism">
Autism

Autism is a neurodevelopmental disorder characterized by impaired social interaction, verbal and non-verbal communication, and restricted and repetitive behavior. Parents usually notice signs in the first two years of their child's life. These signs often develop gradually, though some children with autism reach their developmental milestones at a normal pace and then regress. The diagnostic criteria require that symptoms become apparent in early childhood, typically before age three.
...
</doc>
...

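Each wiki_num file in turn holds a number of docs. Every doc is wrapped in a tag that carries attributes including id, url, and title, so the articles are easy to tell apart.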
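Because each article is wrapped in a doc tag with the same three attributes, the files can be split back into articles with a small regular expression. The sketch below assumes the attributes always appear in the order id, url, title and that attribute values contain no unescaped double quotes. DOC_RE and iter_docs are names invented for this example, and the sample text is abbreviated from the excerpt above.

```python
import re

# Hypothetical pattern for the <doc ...> ... </doc> blocks shown above.
DOC_RE = re.compile(
    r'<doc id="(?P<id>[^"]*)" url="(?P<url>[^"]*)" title="(?P<title>[^"]*)">\n'
    r'(?P<text>.*?)</doc>',
    re.DOTALL,
)


def iter_docs(content):
    """Yield one dict per <doc> block found in the given file content."""
    for m in DOC_RE.finditer(content):
        yield {
            "id": m.group("id"),
            "url": m.group("url"),
            "title": m.group("title"),
            "text": m.group("text").strip(),
        }


# Abbreviated sample in the same shape as wiki_00.
sample = '''<doc id="12" url="https://en.wikipedia.org/wiki?curid=12" title="Anarchism">
Anarchism

Anarchism is a political philosophy that advocates self-governed societies.
</doc>
<doc id="25" url="https://en.wikipedia.org/wiki?curid=25" title="Autism">
Autism

Autism is a neurodevelopmental disorder.
</doc>
'''

for doc in iter_docs(sample):
    print(doc["id"], doc["title"])
```

For a real wiki_num file, the content would be read with open(path, encoding="utf-8").read() before being passed to iter_docs. Note that the first line of each doc body repeats the title, so downstream code may want to drop it.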

 

 

 

 

 

 
