A Python Arsenal for Web Crawling, Text Processing, Scientific Computing, Machine Learning, and Data Mining (Repost)

Date: 2019-11-12

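I started learning Python because of NLTK, and it gradually became my main supporting scripting language at work: my development languages were C/C++, but I handed most day-to-day text processing tasks to Python. After leaving Tencent to start a company, I built my first product, 課程圖譜 ("course graph"), on Flask, a Python web framework, and gradually moved the bulk of my work to Python. Over the years I have used many Python packages, especially for text processing, scientific computing, machine learning, and data mining. There are so many excellent ones that being a Pythoner is a rather happy position to be in. If you watch Weibo closely you will find plenty of posts on the subject. I also Googled around and found that someone had already compiled a list called "Python machine learning libraries", but it always felt like something was missing. "Full stack engineer" is a fashionable term lately, and a struggling founder naturally has to become one. Along the way these Python packages gave me plenty of firepower, which is what prompted this series. Of course, this is only a starting point; I hope readers will contribute more leads so that together we can compile an arsenal of Python tools for web crawling, text processing, scientific computing, machine learning, and data mining.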

I. Python Web Crawling Tools

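Any real project starts with getting data. Text processing, machine learning, and data mining all need data, and beyond specialist datasets that can be bought or downloaded through various channels, you often have to crawl the data yourself. That is where crawlers become especially important. Fortunately, Python offers a good set of web crawling tools and frameworks that can both fetch pages and extract and clean the data in them, so this is where we begin: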

1. Scrapy

Scrapy, a fast high-level screen scraping and web crawling framework for Python.

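The famous Scrapy needs little introduction. Many of the courses listed on 課程圖譜 were scraped with Scrapy. There are plenty of articles introducing it; I recommend an early one by pluskid, 《Scrapy 輕鬆定製網絡爬蟲》 ("Easily customizing a web crawler with Scrapy"), which has aged well.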

Official site: http://scrapy.org/
GitHub: https://github.com/scrapy/scrapy

2. Beautiful Soup

You didn’t write that awful page. You’re just trying to get some data out of it. Beautiful Soup is here to help. Since 2004, it’s been saving programmers hours or days of work on quick-turnaround screen scraping projects.

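I first learned about Beautiful Soup as a student, from the book Programming Collective Intelligence, and have used it on and off since; it is an excellent tool. Strictly speaking, Beautiful Soup is not a crawler in itself: it has to be paired with something like urllib to fetch pages, and is better described as a toolkit for parsing, cleaning, and extracting data from HTML/XML.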
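To make the division of labor concrete, here is a minimal sketch of the parsing half only. The HTML snippet, the `course` class name, and the `extract_courses` helper are all hypothetical, made up for illustration; a real page would first be fetched with something like `urllib.request.urlopen`.

```python
from bs4 import BeautifulSoup

# A small in-memory page stands in for HTML fetched with urllib
# (hypothetical markup, not taken from any real site).
HTML = """
<html><body>
  <h1>Course list</h1>
  <ul>
    <li class="course"><a href="/c/ml">Machine Learning</a></li>
    <li class="course"><a href="/c/nlp">Natural Language Processing</a></li>
    <li class="ad">Sponsored link</li>
  </ul>
</body></html>
"""


def extract_courses(html):
    """Return (title, href) pairs for every <li class="course"> entry."""
    # "html.parser" is the parser bundled with Python, so lxml is not needed.
    soup = BeautifulSoup(html, "html.parser")
    courses = []
    for item in soup.find_all("li", class_="course"):
        link = item.find("a")
        courses.append((link.get_text(strip=True), link["href"]))
    return courses


courses = extract_courses(HTML)
print(courses)
```

The "cleaning" step here is simply the class filter that skips the sponsored entry; on real pages the selection logic is usually where most of the effort goes.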

Official site: http://www.crummy.com/software/BeautifulSoup/

3. Python-Goose

Html Content / Article Extractor, web scrapping lib in Python

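Goose was originally written in Java and later rewritten in Scala, so it is now a Scala project. Python-Goose is a rewrite in Python that depends on Beautiful Soup. I used it a while ago and was impressed: given an article URL, it makes extracting the article's title and body very easy.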

GitHub: https://github.com/grangier/python-goose

II. Python Text Processing Tools

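Once the text has been pulled from the web, the next step is basic text processing, which depends on the task. English needs basic tokenization, while Chinese needs word segmentation. Beyond that, for either language, there is part-of-speech tagging, syntactic parsing, keyword extraction, text classification, sentiment analysis, and so on. There are many excellent packages here, especially for English; let us go through them one by one.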

1. NLTK — Natural Language Toolkit

NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, and an active discussion forum.

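Anyone working in natural language processing has surely heard of NLTK, so I will not say much here. I do recommend two books for readers who are new to NLTK or want to understand it in depth. The first is the official Natural Language Processing with Python, which mainly covers how to use NLTK's features and teaches some Python along the way; 陳濤 has kindly produced a Chinese translation, introduced in the post "Recommended: the Chinese translation of Natural Language Processing with Python, the NLTK companion book". The second is Python Text Processing with NLTK 2.0 Cookbook, which goes deeper: it touches on NLTK's code structure and shows how to build custom corpora and models. It is very good.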
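As a small taste of the tokenization and stemming mentioned above, the following sketch uses two NLTK components that work without downloading any corpus data: the regex-based `wordpunct_tokenize` and `PorterStemmer`. The sample sentence is made up for illustration, and other tokenizers in NLTK (which do need extra data) may split text differently.

```python
from nltk.stem import PorterStemmer
from nltk.tokenize import wordpunct_tokenize

# Hypothetical sample sentence.
text = "Crawlers were fetching pages and cleaning the extracted texts."

# Split into runs of word characters and runs of punctuation.
tokens = wordpunct_tokenize(text)

# Reduce each token to its stem with the Porter algorithm.
stemmer = PorterStemmer()
stems = [stemmer.stem(token) for token in tokens]

print(tokens)
print(stems)
```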

Official site: http://www.nltk.org/
GitHub: https://github.com/nltk/nltk

2. Pattern

Pattern is a web mining module for the Python programming language.

It has tools for data mining (Google, Twitter and Wikipedia API, a web crawler, a HTML DOM parser), natural language processing (part-of-speech taggers, n-gram search, sentiment analysis, WordNet), machine learning (vector space model, clustering, SVM), network analysis and canvas visualization.

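Pattern comes from the CLiPS lab at the University of Antwerp in Belgium. Strictly speaking it is more than a text processing toolkit: it is a web data mining toolkit that covers data collection (APIs for Google, Twitter, and Wikipedia, plus a crawler and an HTML parser), text processing (part-of-speech tagging, sentiment analysis, and so on), machine learning (VSM, clustering, SVM), and visualization. Pattern's overall pipeline is in fact the organizing logic of this article, but for now we file it under text processing. I mainly use its English module, pattern.en, which offers many good text processing features, including basic tokenization, part-of-speech tagging, sentence splitting, grammar checking, spelling correction, sentiment analysis, and syntactic parsing.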

Official site: http://www.clips.ua.ac.be/pattern

3. TextBlob: Simplified Text Processing

TextBlob is a Python (2 and 3) library for processing textual data. It provides a simple API for diving into common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more.

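TextBlob is an interesting Python text processing package. It is essentially a wrapper around the two packages above, NLTK and Pattern ("TextBlob stands on the giant shoulders of NLTK and pattern, and plays nicely with both"), and exposes interfaces for many text processing tasks, including part-of-speech tagging, noun phrase extraction, sentiment analysis, text classification, and spell checking. It even offers translation and language detection, though these rely on Google's API and are subject to call limits. TextBlob is relatively young; interested readers may want to keep an eye on it.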

Official site: http://textblob.readthedocs.org/en/dev/
GitHub: https://github.com/sloria/textblob

4. MBSP for Python

MBSP is a text analysis system based on the TiMBL and MBT memory based learning applications developed at CLiPS and ILK. It provides tools for Tokenization and Sentence Splitting, Part of Speech Tagging, Chunking, Lemmatization, Relation Finding and Prepositional Phrase Attachment.

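MBSP shares its origins with Pattern: both come from the CLiPS lab at the University of Antwerp. It provides basic text processing functions such as word tokenization, sentence splitting, part-of-speech tagging, chunking, lemmatization, and syntactic parsing. Interested readers may want to take a look.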

Official site: http://www.clips.ua.ac.be/pages/MBSP

5. Gensim: Topic modeling for humans

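Gensim is a highly professional Python package for topic modeling, in both its code and its documentation. We covered installing and using Gensim in the series 《如何計算兩個文檔的相似度》 ("How to compute the similarity of two documents"), so I will not repeat that here.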

Official site: http://radimrehurek.com/gensim/index.html
GitHub: https://github.com/piskvorky/gensim

6. langid.py: Stand-alone language identification system

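Language detection is an interesting topic, though a relatively mature one: there are many solutions and many good open source packages. In Python, the one I have used and am happy to recommend is langid. It currently detects 97 languages and offers many convenient features, including a simple built-in server whose API can be called via JSON, and the ability to train your own language detection model. Small as it is, it has everything it needs.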

GitHub: https://github.com/saffsd/langid.py

7. Jieba: Chinese word segmentation (結巴中文分詞)

"Jieba" (Chinese for "to stutter") Chinese text segmentation: built to be the best Python Chinese word segmentation module.

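At last, a Python text processing package from China: Jieba. It supports three segmentation modes (precise mode, full mode, and search engine mode), Traditional Chinese segmentation, and custom dictionaries, and is currently a very good solution for Chinese word segmentation in Python.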

GitHub: https://github.com/fxsjy/jieba

8. xTAS

xtas, the eXtensible Text Analysis Suite, a distributed text analysis package based on Celery and Elasticsearch.

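Thanks to Weibo user @大山坡的春 for the tip: "Colleagues in our group recently released xTAS, another Python-based text mining toolkit; you are welcome to try it: http://t.cn/RPbEZOW." It looks very good; I will try it out later.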

GitHub: https://github.com/NLeSC/xtas

III. Python Scientific Computing Packages

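When it comes to scientific computing, the first thing that comes to mind is Matlab, which combines numerical computing, visualization, and interactivity in one package; unfortunately it is a commercial product. On the open source side, besides GNU Octave, which aims to be a Matlab-like tool, a handful of Python packages together can replace the corresponding Matlab functionality: NumPy + SciPy + Matplotlib + iPython. These packages, NumPy and SciPy in particular, are also the foundation of many Python text processing, machine learning, and data mining packages, so they matter a great deal. Finally, I recommend the series 《用Python做科學計算》 ("Scientific computing with Python"), which covers NumPy, SciPy, and Matplotlib and makes a good reference.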

1. NumPy

NumPy is the fundamental package for scientific computing with Python. It contains among other things:
1) a powerful N-dimensional array object
2) sophisticated (broadcasting) functions
3) tools for integrating C/C++ and Fortran code
4) useful linear algebra, Fourier transform, and random number capabilities

Besides its obvious scientific uses, NumPy can also be used as an efficient multi-dimensional container of generic data. Arbitrary data-types can be defined. This allows NumPy to seamlessly and speedily integrate with a wide variety of databases.

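NumPy is a scientific computing package that is almost impossible to avoid. Its most commonly used feature is probably the N-dimensional array object; it also includes mature function libraries, tools for integrating C/C++ and Fortran code, and linear algebra, Fourier transform, and random number routines. NumPy provides two basic kinds of object: ndarray (N-dimensional array object) and ufunc (universal function object). An ndarray is a multidimensional array that stores elements of a single data type, and a ufunc is a function that operates on arrays element by element.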
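The two object types can be seen together in a short sketch; the array values are arbitrary and chosen only to keep the arithmetic easy to follow.

```python
import numpy as np

# An ndarray: a 2x3 array whose elements all share one dtype (float64).
a = np.arange(6, dtype=np.float64).reshape(2, 3)  # [[0, 1, 2], [3, 4, 5]]
print(a.shape, a.dtype)

# np.sqrt is a ufunc: it is applied to every element without a Python loop.
roots = np.sqrt(a)

# ufuncs also carry methods such as reduce; this sums down each column.
col_sums = np.add.reduce(a, axis=0)  # [3, 5, 7]

# Broadcasting: the length-3 vector is stretched across both rows.
scaled = a * np.array([1.0, 10.0, 100.0])  # [[0, 10, 200], [3, 40, 500]]

print(roots)
print(col_sums)
print(scaled)
```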

Official site: http://www.numpy.org/

2. SciPy: Scientific Computing Tools for Python

SciPy refers to several related but distinct entities:

1) The SciPy Stack, a collection of open source software for scientific computing in Python, and particularly a specified set of core packages.
2) The community of people who use and develop this stack.
3) Several conferences dedicated to scientific computing in Python – SciPy, EuroSciPy and SciPy.in.
4) The SciPy library, one component of the SciPy stack, providing many numerical routines.

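"SciPy is an open source Python library of algorithms and mathematical tools. Its modules cover optimization, linear algebra, integration, interpolation, special functions, fast Fourier transforms, signal and image processing, ordinary differential equation solvers, and other computations common in science and engineering. Its functionality is comparable to MATLAB, Scilab, and GNU Octave. NumPy and SciPy are often used together, and most Python machine learning libraries depend on both." (quoted from "Python machine learning libraries")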
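Two of the modules listed, integration and optimization, can be tried with a brief sketch. The integrand and the quadratic objective are toy functions picked because their exact answers are known.

```python
import numpy as np
from scipy import integrate, optimize

# Numerical integration: the integral of sin(x) over [0, pi] is exactly 2.
area, abs_err = integrate.quad(np.sin, 0.0, np.pi)


def objective(x):
    """Toy objective (x - 3)^2 + 1, whose minimum is at x = 3."""
    return (x - 3.0) ** 2 + 1.0


# Scalar minimization with SciPy's default method for this function.
result = optimize.minimize_scalar(objective)

print(area, abs_err)
print(result.x, result.fun)
```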

Official site: http://www.scipy.org/

3. Matplotlib

matplotlib is a python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms. matplotlib can be used in python scripts, the python and ipython shell (ala MATLAB®* or Mathematica®†), web application servers, and six graphical user interface toolkits.

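matplotlib is Python's best-known plotting library. It provides a full set of command-style APIs similar to Matlab's, which makes it well suited to interactive plotting. It can also easily be embedded in GUI applications as a plotting widget. Matplotlib works with the ipython shell to give a plotting experience on par with Matlab's; in short, everyone who has used it speaks well of it.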
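For readers who have not seen the API, here is a minimal non-interactive sketch. It assumes a headless environment, so it selects the Agg backend and writes the figure to an in-memory buffer instead of opening a window; the curves are arbitrary.

```python
import io

import matplotlib

matplotlib.use("Agg")  # render off-screen; no GUI toolkit required

import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0.0, 2.0 * np.pi, 200)

fig, ax = plt.subplots()
ax.plot(x, np.sin(x), label="sin(x)")
ax.plot(x, np.cos(x), "--", label="cos(x)")  # dashed line style
ax.set_xlabel("x")
ax.set_ylabel("value")
ax.legend()

# Save as PNG into a buffer; pass a file name instead to write to disk.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
plt.close(fig)

png_bytes = buffer.getvalue()
print(len(png_bytes), "bytes of PNG data")
```

In an interactive session the backend line and the buffer can be dropped and `plt.show()` called instead.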

Official site: http://matplotlib.org/

4. iPython

IPython provides a rich architecture for interactive computing with:

1) Powerful interactive shells (terminal and Qt-based).
2) A browser-based notebook with support for code, text, mathematical expressions, inline plots and other rich media.
3) Support for interactive data visualization and use of GUI toolkits.
4) Flexible, embeddable interpreters to load into your own projects.
5) Easy to use, high performance tools for parallel computing.

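"iPython is an interactive Python shell that is far more usable and more powerful than the default Python shell. It supports syntax highlighting, auto-completion, code debugging, and object introspection, supports Bash shell commands, and has many useful built-in features and functions, making it very easy to use." Starting iPython with the command "ipython --pylab" enables matplotlib's interactive plotting by default, which is very convenient.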

Official site: http://ipython.org/

IV. Python Machine Learning & Data Mining Packages

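Machine learning and data mining are hard to tell apart, so they are grouped together here. There are many open source Python packages in this area; I start with the ones I know and then add material from other sources. Additions are welcome.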

1. scikit-learn: Machine Learning in Python

scikit-learn (formerly scikits.learn) is an open source machine learning library for the Python programming language. It features various classification, regression and clustering algorithms including support vector machines, logistic regression, naive Bayes, random forests, gradient boosting, k-means and DBSCAN, and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy.

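First up is the well-known scikit-learn, an open source machine learning toolkit built on NumPy, SciPy, and Matplotlib. It mainly covers classification, regression, and clustering algorithms such as SVM, logistic regression, naive Bayes, random forests, and k-means. Both the code and the documentation are very good, and it is used in many Python projects. NLTK, for example, has a dedicated scikit-learn interface among its classifiers, so that scikit-learn's classification algorithms can be trained on your data to build a classifier model. I also recommend a video that I first recommended when I came across scikit-learn, in the post "Recommending the Python machine learning toolkit scikit-learn and a related video": Tutorial: scikit-learn – Machine Learning in Python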
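The usual fit/predict/score workflow looks roughly like the sketch below. The choice of the bundled iris dataset, logistic regression, and a 70/30 split is an assumption made for illustration; any of the classifiers named above could be swapped in.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load a small bundled dataset: 150 flowers, 4 features, 3 classes.
X, y = load_iris(return_X_y=True)

# Hold out 30% of the samples for evaluation, keeping class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Every scikit-learn estimator follows the same fit / predict / score pattern.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

predictions = clf.predict(X_test)
accuracy = clf.score(X_test, y_test)
print("held-out accuracy:", accuracy)
```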

Official site: http://scikit-learn.org/

2. Pandas: Python Data Analysis Library

Pandas is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series.

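I first came across Pandas because a project in Udacity's data analysis course "Introduction to Data Science" required it, so I spent some time learning it. Pandas is also built on NumPy and Matplotlib and is mainly used for data analysis and visualization. Its DataFrame data structure closely resembles data.frame in R, and it has its own machinery for analyzing time series data, which is very good. I recommend the book Python for Data Analysis, written by Pandas' lead developer. It introduces the relevant features of iPython, NumPy, and Pandas in turn, then covers data visualization, data cleaning and wrangling, and time series handling, with case studies that include mining financial stock data.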
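A brief sketch of a DataFrame with a date index shows the time series machinery in action. The `price` column and its values are invented for illustration.

```python
import pandas as pd

# Six consecutive days of made-up prices, indexed by date.
dates = pd.date_range("2019-11-01", periods=6, freq="D")
df = pd.DataFrame({"price": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]}, index=dates)

# Resample the daily series into two-day bins and average each bin.
two_day_mean = df["price"].resample("2D").mean()

# Day-over-day change; the first row has no predecessor, so it is NaN.
df["change"] = df["price"].diff()

print(df)
print(two_day_mean)
```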

Official site: http://pandas.pydata.org/

=====================================================================
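Dividing line: I have used nearly all of the packages above myself. Those below come from leads supplied by others, in particular "Python machine learning libraries" and "23 Python machine learning packages", with a few additions, deletions, and edits. Further suggestions are welcome.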
=====================================================================

3. mlpy – Machine Learning Python

mlpy is a Python module for Machine Learning built on top of NumPy/SciPy and the GNU Scientific Libraries.

mlpy provides a wide range of state-of-the-art machine learning methods for supervised and unsupervised problems and it is aimed at finding a reasonable compromise among modularity, maintainability, reproducibility, usability and efficiency. mlpy is multiplatform, it works with Python 2 and 3 and it is Open Source, distributed under the GNU General Public License version 3.

Official site: http://mlpy.sourceforge.net/

4. MDP: The Modular toolkit for Data Processing

Modular toolkit for Data Processing (MDP) is a Python data processing framework.
From the user’s perspective, MDP is a collection of supervised and unsupervised learning algorithms and other data processing units that can be combined into data processing sequences and more complex feed-forward network architectures.
From the scientific developer’s perspective, MDP is a modular framework, which can easily be expanded. The implementation of new algorithms is easy and intuitive. The new implemented units are then automatically integrated with the rest of the library.
The base of available algorithms is steadily increasing and includes signal processing methods (Principal Component Analysis, Independent Component Analysis, Slow Feature Analysis), manifold learning methods ([Hessian] Locally Linear Embedding), several classifiers, probabilistic methods (Factor Analysis, RBM), data pre-processing methods, and many others.

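"MDP, the Modular toolkit for Data Processing, is a Python data processing framework. From the user's perspective, MDP is a collection of supervised and unsupervised learning algorithms and other data processing units that can be combined into data processing sequences and more complex feed-forward network architectures. Computations are performed efficiently in terms of speed and memory requirements. From the scientific developer's perspective, MDP is a modular framework that can easily be extended. Implementing new algorithms is easy and intuitive, and newly implemented units are automatically integrated with the rest of the library. MDP was written in the context of theoretical research in neuroscience, but it has been designed to be useful wherever trainable data processing algorithms are used. Its simplicity on the user's side, the variety of readily available algorithms, and the reusability of the implemented units also make it a useful teaching tool."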

Official site: http://mdp-toolkit.sourceforge.net/

5. PyBrain

PyBrain is a modular Machine Learning Library for Python. Its goal is to offer flexible, easy-to-use yet still powerful algorithms for Machine Learning Tasks and a variety of predefined environments to test and compare your algorithms.

PyBrain is short for Python-Based Reinforcement Learning, Artificial Intelligence and Neural Network Library. In fact, we came up with the name first and later reverse-engineered this quite descriptive 「Backronym」.

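"PyBrain (Python-Based Reinforcement Learning, Artificial Intelligence and Neural Network) is a machine learning module for Python. Its goal is to provide flexible, easy-to-use, and powerful algorithms for machine learning tasks. (An impressive name.)

True to its name, PyBrain covers neural networks, reinforcement learning (and the combination of the two), unsupervised learning, and evolutionary algorithms. Because many current problems involve continuous state and action spaces, function approximators (such as neural networks) are needed to cope with high-dimensional data. PyBrain is built around neural networks, and all of its training methods take a neural network as the instance to be trained."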

Official site: http://www.pybrain.org/

6. PyML – machine learning in Python

PyML is an interactive object oriented framework for machine learning written in Python. PyML focuses on SVMs and other kernel methods. It is supported on Linux and Mac OS X.

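"PyML is a Python machine learning toolkit that provides a flexible architecture for classification and regression methods. Its main features are feature selection, model selection, combining classifiers, and classifier evaluation."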

Project page: http://pyml.sourceforge.net/

7. Milk: Machine learning toolkit in Python.

Its focus is on supervised classification with several classifiers available:
SVMs (based on libsvm), k-NN, random forests, decision trees. It also performs
feature selection. These classifiers can be combined in many ways to form
different classification systems.

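"Milk is a machine learning toolkit for Python. Its focus is supervised classification, with several classifiers available: SVMs (based on libsvm), k-NN, random forests, and decision trees. It also performs feature selection. These classifiers can be combined in many ways to form different classification systems. For unsupervised learning, it provides k-means and affinity propagation clustering."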

Official site: http://luispedro.org/software/milk

8. PyMVPA: MultiVariate Pattern Analysis (MVPA) in Python

PyMVPA is a Python package intended to ease statistical learning analyses of large datasets. It offers an extensible framework with a high-level interface to a broad range of algorithms for classification, regression, feature selection, data import and export. It is designed to integrate well with related software packages, such as scikit-learn, and MDP. While it is not limited to the neuroimaging domain, it is eminently suited for such datasets. PyMVPA is free software and requires nothing but free-software to run.

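"PyMVPA (Multivariate Pattern Analysis in Python) is a Python package for statistical learning analysis of large datasets. It offers a flexible, extensible framework whose features include classification, regression, feature selection, data import and export, and visualization."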

Official site: http://www.pymvpa.org/

9. Pyrallel – Parallel Data Analytics in Python

Experimental project to investigate distributed computation patterns for machine learning and other semi-interactive data analytics tasks.

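"Pyrallel (Parallel Data Analytics in Python) is an experimental project exploring distributed computation patterns for machine learning and semi-interactive data analytics tasks; it can run on small clusters."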

GitHub: http://github.com/pydata/pyrallel

10. Monte – gradient based learning in Python

Monte (python) is a Python framework for building gradient based learning machines, like neural networks, conditional random fields, logistic regression, etc. Monte contains modules (that hold parameters, a cost-function and a gradient-function) and trainers (that can adapt a module’s parameters by minimizing its cost-function on training data).

Modules are usually composed of other modules, which can in turn contain other modules, etc. Gradients of decomposable systems like these can be computed with back-propagation.

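"Monte (machine learning in pure Python) is a pure-Python machine learning library. It lets you quickly build models such as neural networks, conditional random fields, and logistic regression, uses inline C for optimization, and is very easy to use and extend."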

Official site: http://montepython.sourceforge.net

11. Theano

Theano is a Python library that allows you to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently. Theano features:
1) tight integration with NumPy – Use numpy.ndarray in Theano-compiled functions.
2) transparent use of a GPU – Perform data-intensive calculations up to 140x faster than with CPU. (float32 only)
3) efficient symbolic differentiation – Theano does your derivatives for function with one or many inputs.
4) speed and stability optimizations – Get the right answer for log(1+x) even when x is really tiny.
5) dynamic C code generation – Evaluate expressions faster.
6) extensive unit-testing and self-verification – Detect and diagnose many types of mistake.

Theano has been powering large-scale computationally intensive scientific investigations since 2007. But it is also approachable enough to be used in the classroom (IFT6266 at the University of Montreal).

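"Theano is a Python library for defining, optimizing, and evaluating mathematical expressions, designed to handle computations on multi-dimensional arrays efficiently. Its features: tight integration with NumPy; efficient GPU computation for data-intensive work; efficient symbolic differentiation; speed and stability optimizations; dynamic C code generation; and extensive unit testing and self-verification. Theano has been widely used in scientific computing since 2007. It makes building deep learning models easier and allows many kinds of model to be implemented quickly. PS: Theano was a Greek beauty, the daughter of Milo, the most powerful man in Croton, who later became the wife of Pythagoras."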
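The log(1+x) remark in the feature list refers to a floating-point pitfall that is easy to reproduce. The sketch below uses plain NumPy, not Theano, purely to show the problem that such a stability optimization addresses; it says nothing about how Theano implements the rewrite internally.

```python
import numpy as np

# A value far below float64 machine epsilon (about 2.2e-16).
x = 1e-18

# Naive evaluation: 1.0 + x rounds to exactly 1.0, so the log collapses to 0.
naive = np.log(1.0 + x)

# Stable evaluation: log1p computes log(1 + x) without forming 1 + x first.
stable = np.log1p(x)

print("naive :", naive)
print("stable:", stable)
```

For tiny x the true value of log(1+x) is very close to x itself, which the stable form preserves and the naive form loses entirely.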

12. Pylearn2

Pylearn2 is a machine learning library. Most of its functionality is built on top of Theano. This means you can write Pylearn2 plugins (new models, algorithms, etc) using mathematical expressions, and theano will optimize and stabilize those expressions for you, and compile them to a backend of your choice (CPU or GPU).

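"Pylearn2 is built on Theano and depends in part on scikit-learn. It is currently under development and will be able to handle vectors, images, video, and other data, providing deep learning models such as MLP, RBM, and SDA."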

Official site: http://deeplearning.net/software/pylearn2/

Additions are welcome; this article will continue to be updated.

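Note: this is an original article; when reposting, please credit the source, 我愛自然語言處理 ("I Love Natural Language Processing"): www.52nlp.cn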

Permalink: http://www.52nlp.cn/python-網頁爬蟲-文本處理-科學計算-機器學習-數據挖掘