Spark Machine Learning Toolchain: MLflow Tutorial

Date: 2019-11-06

This article is a translation of https://www.mlflow.org/docs/latest/concepts.html
Article URL: http://www.javashuo.com/article/p-sbrldptq-nt.html, by openthings, 2018.06.07.

References:

The MLflow project was created by Databricks.
- Official homepage: https://www.mlflow.org/
- Official documentation: https://www.mlflow.org/docs/latest/index.html
Machine learning systems on Kubernetes: http://www.javashuo.com/article/p-bpgpkqza-dt.html
Kubeflow, a machine learning workflow framework: https://my.oschina.net/u/2306127/blog/1807785
Spark machine learning toolchain, MLflow: https://my.oschina.net/u/2306127/blog/1825638

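What are we building?

In this tutorial we walk through a case in which a data scientist uses MLflow to build a linear regression model end to end. We show how to use MLflow to package the code that trains the model and saves it in a reusable, reproducible model format. Finally, we use MLflow to create a simple HTTP server that serves predictions.

We use a dataset to predict wine quality from quantitative features such as "fixed acidity", "pH", and "residual sugar". The dataset comes from the UCI Machine Learning Repository [Ref].

What do you need first?

This tutorial uses MLflow, conda, and the example code under example/tutorial in the MLflow repository. Download the code as follows: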

git clone https://github.com/databricks/mlflow

Training the model

The first step is to train a linear regression model that takes two hyperparameters: alpha and l1_ratio.

The code is in example/tutorial/train.py, as follows:
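For reference, the training code in this tutorial fits scikit-learn's ElasticNet, so the two hyperparameters control a mixed L1/L2 penalty. The scikit-learn documentation states the objective roughly as follows. The symbols here are notation chosen for this note, not taken from the tutorial: \alpha stands for alpha, \rho for l1_ratio, n for the number of training samples, and w for the coefficient vector.

```latex
\min_{w} \; \frac{1}{2n} \lVert y - Xw \rVert_2^2
  + \alpha \rho \, \lVert w \rVert_1
  + \frac{\alpha (1 - \rho)}{2} \, \lVert w \rVert_2^2
```

With \rho = 1 the penalty is purely L1 (lasso-like), with \rho = 0 it is purely L2 (ridge-like), and a larger \alpha means stronger regularization overall.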

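import os
import sys

import numpy as np
import pandas as pd
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

import mlflow
import mlflow.sklearn


def eval_metrics(actual, pred):
    # Return RMSE, MAE and R2 for the predictions
    rmse = np.sqrt(mean_squared_error(actual, pred))
    mae = mean_absolute_error(actual, pred)
    r2 = r2_score(actual, pred)
    return rmse, mae, r2


# Read the wine-quality csv file (make sure you're running this from the root of MLflow!)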
wine_path = os.path.join(os.path.dirname(os.path.abspath(__file__)), "wine-quality.csv")
data = pd.read_csv(wine_path)

# Split the data into training and test sets. (0.75, 0.25) split.
train, test = train_test_split(data)

# The predicted column is "quality" which is a scalar from [3, 9]
train_x = train.drop(["quality"], axis=1)
test_x = test.drop(["quality"], axis=1)
train_y = train[["quality"]]
test_y = test[["quality"]]

alpha = float(sys.argv[1]) if len(sys.argv) > 1 else 0.5
l1_ratio = float(sys.argv[2]) if len(sys.argv) > 2 else 0.5

with mlflow.start_run():
    lr = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, random_state=42)
    lr.fit(train_x, train_y)

    predicted_qualities = lr.predict(test_x)

    (rmse, mae, r2) = eval_metrics(test_y, predicted_qualities)

    print("Elasticnet model (alpha=%f, l1_ratio=%f):" % (alpha, l1_ratio))
    print("  RMSE: %s" % rmse)
    print("  MAE: %s" % mae)
    print("  R2: %s" % r2)

    mlflow.log_param("alpha", alpha)
    mlflow.log_param("l1_ratio", l1_ratio)
    mlflow.log_metric("rmse", rmse)
    mlflow.log_metric("r2", r2)
    mlflow.log_metric("mae", mae)

    mlflow.sklearn.log_model(lr, "model")

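Here we use the pandas, numpy, and sklearn APIs to build a simple machine learning model. In addition, we use the MLflow tracking APIs to record information about each training run: the hyperparameters alpha and l1_ratio used for training, and metrics such as root mean square error used to evaluate the model. We also serialize the model in a format that MLflow knows how to deploy.

Run the code: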

python example/tutorial/train.py

Try other values of alpha and l1_ratio by passing them to train.py as arguments:

python example/tutorial/train.py <alpha> <l1_ratio>

After each run, MLflow records the information in the mlruns directory.
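As a rough illustration of what ends up in mlruns, the sketch below walks that directory with the standard library only. It assumes the local file-store layout mlruns/<experiment_id>/<run_id>/{params,metrics}/<name>, where each params file holds one value and each metrics file holds lines whose second whitespace-separated field is the metric value. That layout is an implementation detail and may differ between MLflow versions; the helper name read_runs is hypothetical.

```python
import os


def read_runs(root="mlruns"):
    """Collect params and the last logged value of each metric for every run.

    Assumes <root>/<experiment_id>/<run_id>/{params,metrics}/<name>.
    This is an assumption about MLflow's local file store, not a stable API.
    """
    runs = {}
    if not os.path.isdir(root):
        return runs
    for experiment_id in sorted(os.listdir(root)):
        experiment_dir = os.path.join(root, experiment_id)
        if not os.path.isdir(experiment_dir):
            continue
        for run_id in sorted(os.listdir(experiment_dir)):
            run_dir = os.path.join(experiment_dir, run_id)
            params_dir = os.path.join(run_dir, "params")
            metrics_dir = os.path.join(run_dir, "metrics")
            if not (os.path.isdir(params_dir) or os.path.isdir(metrics_dir)):
                continue  # not a run directory (for example a metadata file)
            record = {"params": {}, "metrics": {}}
            if os.path.isdir(params_dir):
                for name in os.listdir(params_dir):
                    with open(os.path.join(params_dir, name)) as fh:
                        record["params"][name] = fh.read().strip()
            if os.path.isdir(metrics_dir):
                for name in os.listdir(metrics_dir):
                    with open(os.path.join(metrics_dir, name)) as fh:
                        lines = [ln.split() for ln in fh if ln.strip()]
                    if lines:
                        # Assumed line format: "<timestamp> <value> ..."
                        record["metrics"][name] = float(lines[-1][1])
            runs[run_id] = record
    return runs


if __name__ == "__main__":
    for run_id, record in read_runs().items():
        print(run_id, record["params"], record["metrics"])
```

For anything beyond a quick look, the MLflow UI described in the next section is the supported way to inspect runs.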

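Comparing the models

Next, we use the MLflow UI to compare the models we just produced. Run mlflow ui in the same working directory (the one containing mlruns), then open http://localhost:5000 in a browser.

This page shows the metrics recorded for each run.

From this page we can see that a lower alpha suits our model better. The search box lets us filter runs quickly: for example, the query metrics.rmse < 0.8 returns all runs whose root mean squared error is below 0.8. For more complex analysis, download the table as a CSV and use whatever software you prefer.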
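Once the table has been downloaded, the same kind of filter can be applied offline. The following is a minimal sketch using only the standard library. The column names (run_id, alpha, l1_ratio, rmse) and the sample values are hypothetical and made up for illustration; check the header of the actual exported file before relying on them.

```python
import csv
import io


def filter_runs(csv_text, metric, threshold):
    """Return rows whose `metric` column is strictly below `threshold`."""
    rows = csv.DictReader(io.StringIO(csv_text))
    selected = []
    for row in rows:
        value = row.get(metric)
        if value is None or value == "":
            continue  # this run did not log the metric
        if float(value) < threshold:
            selected.append(row)
    return selected


# Hypothetical export; the numbers are invented for illustration only.
SAMPLE = """run_id,alpha,l1_ratio,rmse
a1,0.1,0.5,0.74
b2,0.5,0.5,0.83
c3,1.0,0.5,
"""

if __name__ == "__main__":
    # Offline counterpart of the UI query: metrics.rmse < 0.8
    for row in filter_runs(SAMPLE, "rmse", 0.8):
        print(row["run_id"], row["alpha"], row["rmse"])
```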

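Packaging the training code

Now that we have working training code, we want to package it so that other data scientists can easily reuse the model or run the training on a remote server. To do so, we follow the MLflow Projects conventions to specify the code's dependencies and entry point. In the example/tutorial/MLproject file we declare that the project's dependencies are listed in a Conda environment file named conda.yaml, and that the project has one entry point that takes two parameters, alpha and l1_ratio: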

# example/tutorial/MLproject

name: tutorial

conda_env: conda.yaml

entry_points:
  main:
    parameters:
      alpha: float
      l1_ratio: {type: float, default: 0.1}
    command: "python train.py {alpha} {l1_ratio}"

# example/tutorial/conda.yaml

name: tutorial
channels:
  - defaults
dependencies:
  - numpy=1.14.3
  - pandas=0.22.0
  - scikit-learn=0.19.1
  - pip:
    - mlflow

To run the project, simply invoke mlflow run example/tutorial -P alpha=0.42. MLflow then runs the training code in a new conda environment with the dependencies specified in conda.yaml.
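To make the role of the parameters section more concrete, here is a small sketch of how a command template like the one in the MLproject file could be filled in from -P values and declared defaults. It illustrates the idea only and is not MLflow's actual implementation; render_command and the dictionary shape are hypothetical names chosen for this example.

```python
def render_command(template, declared, supplied):
    """Fill `template` from supplied values, falling back to declared defaults.

    `declared` maps a parameter name to {"type": ..., "default": ...};
    a parameter without a "default" key is treated as required.
    """
    casts = {"float": float, "string": str}
    values = {}
    for name, spec in declared.items():
        if name in supplied:
            raw = supplied[name]
        elif "default" in spec:
            raw = spec["default"]
        else:
            raise ValueError("missing required parameter: %s" % name)
        values[name] = casts[spec["type"]](raw)
    return template.format(**values)


# Mirrors the entry point declared in example/tutorial/MLproject.
DECLARED = {
    "alpha": {"type": "float"},
    "l1_ratio": {"type": "float", "default": 0.1},
}
TEMPLATE = "python train.py {alpha} {l1_ratio}"

if __name__ == "__main__":
    # Similar in spirit to: mlflow run example/tutorial -P alpha=0.42
    print(render_command(TEMPLATE, DECLARED, {"alpha": "0.42"}))
```

In this sketch alpha has no default and must be supplied, while l1_ratio falls back to 0.1, which matches the declarations in the MLproject file.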

Projects can also be run directly from GitHub if the repository has an MLproject file in its root. We've duplicated this tutorial to the https://github.com/databricks/mlflow-example repository, which can be run with mlflow run git@github.com:databricks/mlflow-example.git -P alpha=0.42.

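Serving the model

Now that we have packaged the training code as an MLproject and identified the best model, it is time to deploy that model using MLflow Models. An MLflow Model is a standard format for packaging machine learning models so that they can be used by a variety of downstream tools, for example real-time serving through a REST API or batch inference on Spark.

In our training code, after fitting the linear regression model, we call an MLflow function that saves the model as an artifact of the run: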

mlflow.sklearn.log_model(lr, "model")

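To view this artifact, we use the UI again. Click an entry in the list on the page.

At the bottom we see that the call to mlflow.sklearn.log_model produced two files in /Users/mlflow/mlflow-prototype/mlruns/0/7c1a0d5c42844dcdb8f5191146925174/artifacts/model. The first, MLmodel, is a metadata file that tells MLflow how to load the model. The second, model.pkl, is the serialized version of the linear regression model we trained.

In this example, we show how to use the MLmodel format with MLflow to deploy a local REST server that serves predictions.

To deploy the server, run: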

mlflow sklearn serve /Users/mlflow/mlflow-prototype/mlruns/0/7c1a0d5c42844dcdb8f5191146925174/artifacts/model -p 1234

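Note:

The Python version used to create the model must be the same as the one running mlflow sklearn. Otherwise you may see errors such as UnicodeDecodeError: 'ascii' codec can't decode byte 0x9f in position 1: ordinal not in range(128) or raise ValueError, "unsupported pickle protocol: %d".

To call the prediction service, run: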
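One way to investigate the "unsupported pickle protocol" error is to look at the protocol number stored in model.pkl. The sketch below relies on a documented property of the pickle format: from protocol 2 onward, a pickle starts with the PROTO opcode (byte 0x80) followed by one byte holding the protocol number. The helper names are hypothetical. A supported protocol is necessary but not sufficient, since pickles written by Python 2 can still fail to decode under Python 3.

```python
import pickle


def pickle_protocol(data):
    """Return the protocol number declared at the start of a pickle.

    Protocols 2 and later begin with the PROTO opcode (0x80) followed by one
    byte holding the protocol number. Older pickles have no such header, so
    None is returned for them.
    """
    if len(data) >= 2 and data[0] == 0x80:
        return data[1]
    return None


def can_load_here(data):
    """True if this interpreter supports the pickle's declared protocol."""
    protocol = pickle_protocol(data)
    return protocol is None or protocol <= pickle.HIGHEST_PROTOCOL


if __name__ == "__main__":
    blob = pickle.dumps({"alpha": 0.5}, protocol=2)
    print(pickle_protocol(blob), can_load_here(blob))
```

To check a saved model, read the first two bytes of model.pkl in binary mode and pass them to pickle_protocol.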

curl -X POST -H "Content-Type:application/json" --data '[{"fixed acidity": 6.2, "volatile acidity": 0.66, "citric acid": 0.48, "residual sugar": 1.2, "chlorides": 0.029, "free sulfur dioxide": 29, "total sulfur dioxide": 75, "density": 0.98, "pH": 3.33, "sulphates": 0.39, "alcohol": 12.8}]' http://127.0.0.1:1234/invocations

# RESPONSE
# {"predictions": [6.379428821398614]}
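The same request can be sent from Python. This is a minimal sketch using only the standard library; it assumes the server from the previous step is listening on 127.0.0.1:1234 and accepts the same JSON body as the curl command. The function names build_request and predict are hypothetical.

```python
import json
import urllib.error
import urllib.request

URL = "http://127.0.0.1:1234/invocations"

# Same feature values as in the curl example.
WINE_SAMPLE = [{
    "fixed acidity": 6.2, "volatile acidity": 0.66, "citric acid": 0.48,
    "residual sugar": 1.2, "chlorides": 0.029, "free sulfur dioxide": 29,
    "total sulfur dioxide": 75, "density": 0.98, "pH": 3.33,
    "sulphates": 0.39, "alcohol": 12.8,
}]


def build_request(rows, url=URL):
    """Build the POST request that the curl command sends."""
    body = json.dumps(rows).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(rows, url=URL):
    """Send the rows to the scoring server and return the decoded JSON reply."""
    with urllib.request.urlopen(build_request(rows, url), timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))


if __name__ == "__main__":
    try:
        print(predict(WINE_SAMPLE))
    except OSError as exc:
        # Raised when the server from the previous step is not running.
        print("scoring server not reachable:", exc)
```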
