MLlib is Spark's machine learning (ML) library. Its goal is to make practical machine learning scalable and easy. It consists of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, and dimensionality reduction, as well as lower-level optimization primitives and higher-level pipeline APIs.
It divides into two packages:
spark.mllib contains the original API built on top of RDDs.
spark.ml provides a higher-level API built on top of DataFrames for constructing ML pipelines.
Using spark.ml is recommended because with DataFrames the API is more versatile and flexible. But we will keep supporting spark.mllib along with the development of spark.ml. Users should be comfortable using spark.mllib features and can expect more features to come. Developers should contribute new algorithms to spark.ml if they fit the ML pipeline concept well, e.g., feature extractors and transformers.
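To make the "ML pipeline concept" a little more concrete, here is a toy sketch in plain Python. It deliberately does not use Spark: the names Row, Tokenizer, TermCounter, and Pipeline below are hypothetical stand-ins written for this illustration, not the spark.ml classes. The only idea it tries to show is that each stage takes a table of rows, returns a new table with one more column, and that stages can be chained. Real spark.ml pipelines can also contain stages that are fit on data first, which this sketch leaves out.

```python
from collections import Counter
from typing import Dict, List, Protocol

# Hypothetical stand-in for one DataFrame row: column name -> value.
Row = Dict[str, object]


class Transformer(Protocol):
    """Anything that maps a list of rows to a new list of rows."""

    def transform(self, rows: List[Row]) -> List[Row]: ...


class Tokenizer:
    """Toy feature extractor: splits a text column into lowercase tokens."""

    def __init__(self, input_col: str, output_col: str) -> None:
        self.input_col = input_col
        self.output_col = output_col

    def transform(self, rows: List[Row]) -> List[Row]:
        # Build new rows with one extra column; the input rows are not mutated.
        return [
            {**row, self.output_col: str(row[self.input_col]).lower().split()}
            for row in rows
        ]


class TermCounter:
    """Toy transformer: turns a token-list column into a term-count dict."""

    def __init__(self, input_col: str, output_col: str) -> None:
        self.input_col = input_col
        self.output_col = output_col

    def transform(self, rows: List[Row]) -> List[Row]:
        return [
            {**row, self.output_col: dict(Counter(row[self.input_col]))}
            for row in rows
        ]


class Pipeline:
    """Applies each stage in order, feeding its output to the next stage."""

    def __init__(self, stages: List[Transformer]) -> None:
        self.stages = stages

    def transform(self, rows: List[Row]) -> List[Row]:
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


rows = [
    {"id": 0, "text": "Spark makes ML easy"},
    {"id": 1, "text": "spark spark ml"},
]
pipeline = Pipeline([Tokenizer("text", "words"), TermCounter("words", "counts")])
result = pipeline.transform(rows)
```

A new stage fits this shape well when it can be written as "read some columns, add a column", which is roughly why feature extractors and transformers are named as natural candidates for spark.ml.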
We list major functionality from both packages below, with links to detailed guides.