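WikiExtractor is a Python script built specifically for extracting and cleaning Wikipedia dump data. It supports Python 2.7 or Python 3.3+, has no additional dependencies, and is easy to install and use: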
git clone https://github.com/attardi/wikiextractor.git
cd wikiextractor/
sudo python setup.py install
WikiExtractor.py -o enwiki enwiki-latest-pages-articles.xml.bz2
-b n[KMG], --bytes n[KMG] maximum bytes per output file (default 1M)
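To make the `n[KMG]` notation of the `-b` option concrete (for example `-b 100M`), here is a minimal sketch of how such a size string could be turned into a byte count. The function name `parse_size` is hypothetical, and the binary multipliers (1K = 1024 bytes) are an assumption; the tool's own parsing may differ in detail.

```python
def parse_size(spec):
    """Convert a size string such as '1M' or '500K' into a number of bytes.

    Assumption: K, M and G are binary multiples (1024, 1024**2, 1024**3).
    A string without a suffix is read as a plain byte count.
    """
    units = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}
    spec = spec.strip()
    suffix = spec[-1].upper()
    if suffix in units:
        return int(spec[:-1]) * units[suffix]
    return int(spec)


if __name__ == "__main__":
    for example in ("1M", "500K", "2G"):
        print(example, "->", parse_size(example), "bytes")
```

Under this assumption, the default of `1M` means each output file holds roughly one mebibyte of extracted text before a new file is started.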
<doc id="12" url="https://en.wikipedia.org/wiki?curid=12" title="Anarchism">
Anarchism
Anarchism is a political philosophy that advocates self-governed societies based on voluntary institutions. These are often described as stateless societies, although several authors have defined them more specifically as institutions based on non-hierarchical free associations. Anarchism holds the state to be undesirable, unnecessary, and harmful.
...
Criticisms of anarchism include moral criticisms and pragmatic criticisms. Anarchism is often evaluated as unfeasible or utopian by its critics.
</doc>
<doc id="25" url="https://en.wikipedia.org/wiki?curid=25" title="Autism">
Autism
Autism is a neurodevelopmental disorder characterized by impaired social interaction, verbal and non-verbal communication, and restricted and repetitive behavior. Parents usually notice signs in the first two years of their child's life. These signs often develop gradually, though some children with autism reach their developmental milestones at a normal pace and then regress. The diagnostic criteria require that symptoms become apparent in early childhood, typically before age three.
...
</doc>
...
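The extracted text can then be read back article by article. The following is a minimal sketch that parses the `<doc ...>` blocks shown in the sample above. The names `DOC_RE` and `iter_docs` are hypothetical, and the code assumes that every block has exactly the `id`, `url` and `title` attributes in that order and that the first body line repeats the title, as in the sample; other versions or options of the tool may produce a different layout.

```python
import re

# Matches one <doc ...> ... </doc> block in the layout shown in the sample.
# Assumption: attributes always appear as id, url, title, in that order.
DOC_RE = re.compile(
    r'<doc id="(?P<id>\d+)" url="(?P<url>[^"]*)" title="(?P<title>[^"]*)">\n'
    r'(?P<body>.*?)\n?</doc>',
    re.DOTALL,
)


def iter_docs(text):
    """Yield one dict (id, url, title, text) per <doc> block in text."""
    for match in DOC_RE.finditer(text):
        lines = match.group("body").split("\n")
        # In the sample, the first body line repeats the title; drop it if so.
        if lines and lines[0].strip() == match.group("title"):
            lines = lines[1:]
        yield {
            "id": int(match.group("id")),
            "url": match.group("url"),
            "title": match.group("title"),
            "text": "\n".join(lines).strip(),
        }


if __name__ == "__main__":
    sample = (
        '<doc id="12" url="https://en.wikipedia.org/wiki?curid=12" title="Anarchism">\n'
        "Anarchism\n"
        "Anarchism is a political philosophy.\n"
        "</doc>\n"
    )
    for doc in iter_docs(sample):
        print(doc["id"], doc["title"], len(doc["text"]))
```

To process a full extraction, read each output file under the directory given with `-o` and pass its contents to `iter_docs`. Since a single output file is limited by the `-b` setting, reading one file at a time keeps memory use small.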