[Repost] Python Machine Learning Toolbox

Date: 2019-11-21


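In scientific computing, Python has two important extension modules: NumPy and SciPy. NumPy is a scientific computing package for Python. It includes:

a powerful N-dimensional array object (ndarray);
a mature library of (broadcasting) functions;
tools for integrating C/C++ and Fortran code;
practical linear algebra, Fourier transform, and random number generation functions.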
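
As a rough illustration of the NumPy features listed above, the following sketch touches each one in turn (except the C/Fortran tooling). The arrays and values are made up for the example and are not from the original article.

```python
import numpy as np

# N-dimensional array: a 2x3 matrix of floats
a = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])

# Broadcasting: the length-3 row is stretched across both rows of `a`
row = np.array([10.0, 20.0, 30.0])
shifted = a + row

# Linear algebra: solve the 2x2 system M x = b
M = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([3.0, 5.0])
x = np.linalg.solve(M, b)

# Fourier transform: magnitude spectrum of a sine with 4 cycles in 32 samples
t = np.arange(32)
signal = np.sin(2 * np.pi * 4 * t / 32)
spectrum = np.abs(np.fft.rfft(signal))
peak_bin = int(np.argmax(spectrum))

# Random numbers: a seeded generator gives reproducible draws
rng = np.random.default_rng(0)
samples = rng.normal(size=5)
```

Here `shifted` keeps the shape of `a`, `x` solves the linear system, and `peak_bin` is the frequency bin where the sine's energy concentrates.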

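SciPy is an open-source Python library of algorithms and mathematical tools. Its modules cover optimization, linear algebra, integration, interpolation, special functions, fast Fourier transforms, signal and image processing, ordinary differential equation solvers, and other computations common in science and engineering. Its functionality is similar to that of MATLAB, Scilab, and GNU Octave.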
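
A minimal sketch of two of the SciPy modules mentioned above, optimization and integration. The functions being minimized and integrated are arbitrary choices for illustration.

```python
import numpy as np
from scipy import integrate, optimize

# Optimization: minimize f(x) = (x - 2)^2 + 1, whose minimum lies at x = 2
result = optimize.minimize_scalar(lambda x: (x - 2.0) ** 2 + 1.0)

# Integration: integral of sin(x) over [0, pi], which equals 2 analytically
area, abs_error = integrate.quad(np.sin, 0.0, np.pi)
```

`result.x` holds the location of the minimum, and `quad` returns both the integral estimate and an estimate of its absolute error.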

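NumPy and SciPy are often used together, and most Python machine learning libraries depend on both. Plotting and visualization rely on matplotlib, whose style resembles MATLAB's. Python has many machine learning libraries, most of them open source. The main ones are: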

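1. scikit-learn

scikit-learn is an open-source machine learning module built on SciPy and NumPy. It covers classification, regression, and clustering; its main algorithms include SVM, logistic regression, naive Bayes, k-means, and DBSCAN. It is currently funded by INRIA, with occasional support from Google.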
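
To give a feel for the library, here is a small sketch that runs one classifier (logistic regression) and one clustering algorithm (k-means) from the list above. The two-blob dataset is synthetic and invented for this example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

# Two well-separated 2-D blobs (synthetic data, made up for this sketch)
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(50, 2))
blob_b = rng.normal(loc=[5.0, 5.0], scale=0.3, size=(50, 2))
X = np.vstack([blob_a, blob_b])
y = np.array([0] * 50 + [1] * 50)

# Classification: fit logistic regression on the labeled points
clf = LogisticRegression().fit(X, y)
accuracy = clf.score(X, y)

# Clustering: k-means sees only X and should recover the two groups
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
cluster_labels = km.labels_
```

Both estimators follow the same `fit` convention, which is a large part of what makes the library easy to pick up.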

Project homepage:

https://pypi.python.org/pypi/scikit-learn/

http://scikit-learn.org/

https://github.com/scikit-learn/scikit-learn

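2. NLTK

NLTK (Natural Language Toolkit) is Python's natural language processing module, comprising a range of string-processing tools and statistical language models. NLTK is widely used in academic research and teaching, in fields such as linguistics, cognitive science, artificial intelligence, information retrieval, and machine learning. It provides more than 50 corpora and lexical resources, and its text-processing libraries cover classification, tokenization, stemming, parsing, and semantic reasoning. It runs stably on Windows, Mac OS X, and Linux.

Project homepage: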

http://sourceforge.net/projects/nltk/

https://pypi.python.org/pypi/nltk/

http://nltk.org/

3. Mlpy

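Mlpy is a Python machine learning module built on NumPy/SciPy that makes use of Cython extensions. It includes the following machine learning algorithms:

- Regression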

least squares, ridge regression, least angle regression, elastic net, kernel ridge regression, support vector machines (SVM), partial least squares (PLS)

- Classification

linear discriminant analysis (LDA), Basic perceptron, Elastic Net, logistic regression, (Kernel) Support Vector Machines (SVM), Diagonal Linear Discriminant Analysis (DLDA), Golub Classifier, Parzen-based, (kernel) Fisher Discriminant Classifier, k-nearest neighbor, Iterative RELIEF, Classification Tree, Maximum Likelihood Classifier

- Clustering

hierarchical clustering, Memory-saving Hierarchical Clustering, k-means

- Dimensionality reduction

(Kernel) Fisher discriminant analysis (FDA), Spectral Regression Discriminant Analysis (SRDA), (kernel) Principal component analysis (PCA)
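
Principal component analysis, one of the dimensionality reduction methods listed above, can be sketched in plain NumPy. This is only an illustration of the technique: it does not use Mlpy's API, and the helper name `pca` and the synthetic data are assumptions made for the example.

```python
import numpy as np

def pca(X, n_components):
    """Project the rows of X onto the top principal components."""
    X_centered = X - X.mean(axis=0)
    # SVD of the centered data: rows of Vt are the principal directions,
    # ordered by decreasing singular value
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    explained_variance = (S[:n_components] ** 2) / (X.shape[0] - 1)
    return X_centered @ components.T, components, explained_variance

# Synthetic data (made up): points close to the line y = 2x in 2-D
rng = np.random.default_rng(0)
x = rng.normal(size=200)
X = np.column_stack([x, 2.0 * x + 0.01 * rng.normal(size=200)])

projected, components, variance = pca(X, n_components=1)
```

Because the points lie almost on a line, the single retained component points along the direction (1, 2) up to sign and captures nearly all of the variance.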

Project homepage:

http://sourceforge.net/projects/mlpy

https://mlpy.fbk.eu/

4. Shogun

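Shogun is an open-source, large-scale machine learning toolbox. Its machine learning functionality is currently organized into several parts: feature representations, feature preprocessing, kernel representations, kernel normalization, distance representations, classifier representations, clustering methods, distributions, performance evaluation methods, regression methods, and structured output learners.

The core of Shogun is implemented in C++, with interfaces for MATLAB, R, Octave, and Python. It is used mainly on Linux.

Project homepage: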

http://www.shogun-toolbox.org/

5. MDP

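The Modular toolkit for Data Processing (MDP) is a Python data processing framework.

From the user's perspective, MDP is a collection of supervised and unsupervised learning algorithms and other data processing units that can be combined into data processing sequences and more complex feed-forward network architectures. Computations are performed efficiently in terms of speed and memory requirements. From the scientific developer's perspective, MDP is a modular framework that can easily be extended. Implementing new algorithms is easy and intuitive, and newly implemented units are then automatically integrated with the rest of the library. MDP was written in the context of theoretical research in neuroscience, but it has been designed to be useful wherever trainable data processing algorithms are used. Its simplicity on the user's side, the variety of readily available algorithms, and the reusability of the implemented units also make it a useful teaching tool.

Project homepage: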

http://mdp-toolkit.sourceforge.net/

https://pypi.python.org/pypi/MDP/

6. PyBrain

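PyBrain (Python-Based Reinforcement Learning, Artificial Intelligence and Neural Network) is a machine learning module for Python. Its goal is to provide flexible, easy-to-use, yet powerful algorithms for machine learning tasks. (Quite an imposing name.)

As its name suggests, PyBrain covers neural networks, reinforcement learning (and the combination of the two), unsupervised learning, and evolutionary algorithms. Because many current problems involve continuous state and action spaces, function approximators (such as neural networks) are needed to cope with high-dimensional data. PyBrain is built around neural networks, and all of its training methods take a neural network as the instance to be trained.

Project homepage: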

http://www.pybrain.org/

https://github.com/pybrain/pybrain/

7. BigML

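BigML makes machine learning easy to apply to data-driven decisions and prediction, building elegant predictive models through easy-to-understand interactive operations. BigML is accessed through BigML.io, with Python bindings.

Project homepage: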

https://bigml.com/

https://pypi.python.org/pypi/bigml

http://bigml.readthedocs.org/

8. PyML

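PyML is a Python machine learning toolkit that provides a flexible framework for classification and regression methods. Its main features are feature selection, model selection, combining classifiers, and classifier evaluation.

Project homepage: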

http://cmgm.stanford.edu/~asab/pyml/tutorial/

http://pyml.sourceforge.net/

9. Milk

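Milk is a machine learning toolbox for Python. Its focus is supervised classification, with several effective classifiers available: SVMs (based on libsvm), k-NN, random forests, and decision trees. It also performs feature selection. These classifiers can be combined in many ways to form different classification systems.

For unsupervised learning, it provides k-means and affinity propagation clustering.

Project homepage: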

https://pypi.python.org/pypi/milk/

http://luispedro.org/software/milk

10. PyMVPA

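PyMVPA (Multivariate Pattern Analysis in Python) is a Python toolkit for statistical learning analyses of large datasets, offering a flexible and extensible framework. Its features include classification, regression, feature selection, data import and export, and visualization.

Project homepage: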

http://www.pymvpa.org/

https://github.com/PyMVPA/PyMVPA

11. Pattern

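Pattern is a web mining module for Python. It bundles bindings to the Google, Twitter, and Wikipedia APIs and provides a web crawler and an HTML parser. Its text analysis covers shallow rule-based parsing, a WordNet interface, syntactic and semantic analysis, TF-IDF, and LSA, and it also offers clustering, classification, and graph network visualization.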
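
TF-IDF, mentioned above, weights a term by how often it occurs in a document and how rare it is across the collection. The sketch below is a plain-Python illustration of one common variant; it does not use Pattern's API, and the function name `tf_idf` and the sample sentences are invented for the example.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per document.

    Uses tf = count / document length and idf = log(N / document frequency),
    one common variant among several.
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

docs = ["the cat sat", "the dog barked", "the cat purred"]
scores = tf_idf(docs)
```

A word that appears in every document, such as "the" here, gets weight zero, while a word unique to one document gets the highest weight in that document.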

Project homepage:

http://www.clips.ua.ac.be/pages/pattern

https://pypi.python.org/pypi/Pattern

12. pyrallel

Pyrallel (Parallel Data Analytics in Python) is an experimental project for machine learning and semi-interactive work based on a distributed computing model. It can run on small clusters. Its scope:

- focus on small to medium dataset that fits in memory on a small (10+ nodes) to medium cluster (100+ nodes).

- focus on small to medium data (with data locality when possible).

- focus on CPU bound tasks (e.g. training Random Forests) while trying to limit disk / network access to a minimum.

- do not focus on HA / Fault Tolerance (yet).

- do not try to invent new set of high level programming abstractions (yet): use a low level programming model (IPython.parallel) to finely control the cluster elements and messages transferred and help identify what are the practical underlying constraints in distributed machine learning setting.

Project homepage:

https://pypi.python.org/pypi/pyrallel

http://github.com/pydata/pyrallel

13. Monte

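Monte (machine learning in pure Python) is a pure-Python machine learning library. It can quickly build models such as neural networks, conditional random fields, and logistic regression, uses inline C for optimization, and is very easy to use and extend.

Project homepage: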

https://pypi.python.org/pypi/Monte

http://montepython.sourceforge.net

14. Orange

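Orange is a component-based data mining and machine learning software suite. It features a friendly yet powerful, fast, and versatile visual programming front end for exploratory data analysis and visualization, and Python bindings for scripting. It contains a complete set of components for data preprocessing, feature scoring and filtering, modeling, model evaluation, and exploration. It is implemented in C++ and Python, and its graphical interface is built on the cross-platform Qt framework.

Project homepage: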

https://pypi.python.org/pypi/Orange/

http://orange.biolab.si/

15. Theano

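Theano is a Python library for defining, optimizing, and evaluating mathematical expressions, designed to handle computations on multi-dimensional arrays efficiently. Its features:

- tight integration with NumPy

- efficient, data-intensive GPU computation

- efficient symbolic differentiation

- speed and stability optimizations

- dynamic C code generation

- extensive unit testing and self-verification

Theano has been widely used in scientific computing since 2007. It makes deep learning models easier to build and allows the following models to be implemented quickly:

- Logistic Regression

- Multilayer perceptron

- Deep Convolutional Network

- Auto Encoders, Denoising Autoencoders

- Stacked Denoising Auto-Encoders

- Restricted Boltzmann Machines

- Deep Belief Networks

- HMC Sampling

- Contractive auto-encoders

The name comes from Theano, a Greek beauty and the daughter of Milo, the most powerful man in Croton; she later became the wife of Pythagoras.

Project homepage: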

http://deeplearning.net/tutorial/

https://pypi.python.org/pypi/Theano

16. Pylearn2

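Pylearn2 is built on Theano and partly depends on scikit-learn. It is currently under development and will handle vectors, images, video, and other data, providing deep learning models such as MLP, RBM, and SDA. The goals of Pylearn2 are: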

Researchers add features as they need them. We avoid getting bogged down by too much top-down planning in advance.
A machine learning toolbox for easy scientific experimentation.
All models/algorithms published by the LISA lab should have reference implementations in Pylearn2.
Pylearn2 may wrap other libraries such as scikits.learn when this is practical
Pylearn2 differs from scikits.learn in that Pylearn2 aims to provide great flexibility and make it possible for a researcher to do almost anything, while scikits.learn aims to work as a "black box" that can produce good results even if the user does not understand the implementation
Dataset interface for vector, images, video, ...
Small framework for everything needed for a normal MLP/RBM/SDA/Convolution experiment.
Easy reuse of sub-component of Pylearn2.
Using one sub-component of the library does not force you to use / learn to use all of the other sub-components if you choose not to.
Support cross-platform serialization of learned models.
Remain approachable enough to be used in the classroom (IFT6266 at the University of Montreal).