Building a Machine Learning Model API with Flask

1. Python Environment Setup and Flask Basics

Anaconda

  • Use Anaconda to create a virtual environment. If you need to build up your Python workflow, keep dependencies isolated, or share environment settings, the Anaconda distribution is a good choice.
    • Install: here
    • wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
    • bash Miniconda3-latest-Linux-x86_64.sh
    • source .bashrc
    • conda create --name <environment-name> python=3.6
    • source activate <environment-name>
    • Install the necessary Python packages: flask & gunicorn.
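    • pip install flask gunicorn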
  • Try a simple Flask hello-world application and serve it with gunicorn:

    • hello-world.py
    • Write the code:
      ```python
      from flask import Flask

      app = Flask(__name__)

      @app.route('/users/<string:username>')
      def hello_world(username=None):
          return("Hello {}!".format(username))
      ```
    • Save it
    • gunicorn --bind 0.0.0.0:8000 hello-world:app

    • If you get the response below, you're on the right track:

    Hello World

    • Visit http://localhost:8000/users/any-name in your browser

    [Browser screenshot]
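If you'd rather test from Python than from the browser, here is a minimal sketch using the requests library (assuming the server is listening on port 8000):

```python
import requests

# Hit the hello-world endpoint we just served with gunicorn
resp = requests.get('http://localhost:8000/users/any-name')
print(resp.text)  # expected: Hello any-name!
```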

You've written your first Flask application. As you've just experienced in a few simple steps, we can create web endpoints that are accessible locally. And the road ahead is just as simple.

With Flask, we can easily wrap our machine learning models and serve them as web APIs. Moreover, if we want to create more complex web applications (involving JavaScript, *gasps*), only a few modifications are needed.

2. Building the Machine Learning Model

  • As an example, we'll use a machine learning competition: the Loan Prediction Competition. The main objective is to set up a pre-processing pipeline and build an ML model, so that predictions are straightforward at deployment time.

  • Data: link: https://pan.baidu.com/s/1VjqNZxvdKm0G5iBtTs-4TA extraction code: n5ag

import os 
import json
import numpy as np
import pandas as pd
from sklearn.externals import joblib
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import RandomForestClassifier

from sklearn.pipeline import make_pipeline

import warnings
warnings.filterwarnings("ignore")
data = pd.read_csv('../data/training.csv')
list(data.columns)
['Loan_ID',
 'Gender',
 'Married',
 'Dependents',
 'Education',
 'Self_Employed',
 'ApplicantIncome',
 'CoapplicantIncome',
 'LoanAmount',
 'Loan_Amount_Term',
 'Credit_History',
 'Property_Area',
 'Loan_Status']
data.shape
(614, 13)
  • Finding out the null/NaN values in the columns:
for _ in data.columns:
    print("The number of null values in:{} == {}".format(_, data[_].isnull().sum()))
The number of null values in:Loan_ID == 0
The number of null values in:Gender == 13
The number of null values in:Married == 3
The number of null values in:Dependents == 15
The number of null values in:Education == 0
The number of null values in:Self_Employed == 32
The number of null values in:ApplicantIncome == 0
The number of null values in:CoapplicantIncome == 0
The number of null values in:LoanAmount == 22
The number of null values in:Loan_Amount_Term == 14
The number of null values in:Credit_History == 50
The number of null values in:Property_Area == 0
The number of null values in:Loan_Status == 0
  • Next step is creating training and testing datasets:
pred_var = ['Gender','Married','Dependents','Education','Self_Employed','ApplicantIncome','CoapplicantIncome',\
            'LoanAmount','Loan_Amount_Term','Credit_History','Property_Area']

X_train, X_test, y_train, y_test = train_test_split(data[pred_var], data['Loan_Status'], \
                                                    test_size=0.25, random_state=42)
  • To make sure the pre-processing steps are followed religiously even after we're done experimenting, and that we don't miss them at prediction time, we'll create a custom pre-processing scikit-learn estimator.

To follow the process of how we arrived at this estimator, read this notebook

from sklearn.base import BaseEstimator, TransformerMixin

class PreProcessing(BaseEstimator, TransformerMixin):
    """Custom Pre-Processing estimator for our use-case
    """

    def __init__(self):
        pass

    def transform(self, df):
        """Regular transform() that is a help for training, validation & testing datasets
           (NOTE: The operations performed here are the ones that we did prior to this cell)
        """
        pred_var = ['Gender','Married','Dependents','Education','Self_Employed','ApplicantIncome',\
                    'CoapplicantIncome','LoanAmount','Loan_Amount_Term','Credit_History','Property_Area']
        
        df = df[pred_var]
        
        df['Dependents'] = df['Dependents'].fillna(0)
        df['Self_Employed'] = df['Self_Employed'].fillna('No')
        df['Loan_Amount_Term'] = df['Loan_Amount_Term'].fillna(self.term_mean_)
        df['Credit_History'] = df['Credit_History'].fillna(1)
        df['Married'] = df['Married'].fillna('No')
        df['Gender'] = df['Gender'].fillna('Male')
        df['LoanAmount'] = df['LoanAmount'].fillna(self.amt_mean_)
        
        gender_values = {'Female' : 0, 'Male' : 1} 
        married_values = {'No' : 0, 'Yes' : 1}
        education_values = {'Graduate' : 0, 'Not Graduate' : 1}
        employed_values = {'No' : 0, 'Yes' : 1}
        property_values = {'Rural' : 0, 'Urban' : 1, 'Semiurban' : 2}
        dependent_values = {'3+': 3, '0': 0, '2': 2, '1': 1}
        df.replace({'Gender': gender_values, 'Married': married_values, 'Education': education_values, \
                    'Self_Employed': employed_values, 'Property_Area': property_values, \
                    'Dependents': dependent_values}, inplace=True)
        
        return df.values  # as_matrix() is deprecated in newer pandas; values is the drop-in replacement

    def fit(self, df, y=None, **fit_params):
        """Fitting the Training dataset & calculating the required values from train
           e.g: We will need the mean of X_train['Loan_Amount_Term'] that will be used in
                transformation of X_test
        """
        
        self.term_mean_ = df['Loan_Amount_Term'].mean()
        self.amt_mean_ = df['LoanAmount'].mean()
        return self
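Used on its own, the estimator behaves like any other scikit-learn transformer: fit() learns the imputation means from the training data, and transform() applies them. A quick sanity-check sketch:

```python
pp = PreProcessing()
pp.fit(X_train)                    # learns term_mean_ and amt_mean_ from train
X_train_mat = pp.transform(X_train)
X_test_mat = pp.transform(X_test)  # test data imputed with means learned on train
print(X_train_mat.shape, X_test_mat.shape)
```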
  • Convert y_train & y_test to np.array:
y_train = y_train.replace({'Y':1, 'N':0}).values
y_test = y_test.replace({'Y':1, 'N':0}).values

We'll create a pipeline so that all the pre-processing steps we apply behave as a single scikit-learn estimator.

pipe = make_pipeline(PreProcessing(),
                    RandomForestClassifier())
pipe
Pipeline(memory=None,
     steps=[('preprocessing', PreProcessing()), ('randomforestclassifier', RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators='warn', n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False))])

To search for the best hyper-parameters of the RandomForestClassifier (n_estimators, max_depth, max_leaf_nodes & min_impurity_split), we'll do a Grid Search:

  • Defining param_grid:
param_grid = {"randomforestclassifier__n_estimators" : [10, 20, 30],
             "randomforestclassifier__max_depth" : [None, 6, 8, 10],
             "randomforestclassifier__max_leaf_nodes": [None, 5, 10, 20], 
             "randomforestclassifier__min_impurity_split": [0.1, 0.2, 0.3]}
  • Running the Grid Search:
grid = GridSearchCV(pipe, param_grid=param_grid, cv=3)
  • Fitting the training data on the pipeline estimator:
grid.fit(X_train, y_train)
GridSearchCV(cv=3, error_score='raise-deprecating',
       estimator=Pipeline(memory=None,
     steps=[('preprocessing', PreProcessing()), ('randomforestclassifier', RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impu...bs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False))]),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'randomforestclassifier__n_estimators': [10, 20, 30], 'randomforestclassifier__max_depth': [None, 6, 8, 10], 'randomforestclassifier__max_leaf_nodes': [None, 5, 10, 20], 'randomforestclassifier__min_impurity_split': [0.1, 0.2, 0.3]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)
  • Let's see what parameters the Grid Search selected:
print("Best parameters: {}".format(grid.best_params_))
Best parameters: {'randomforestclassifier__max_depth': None, 'randomforestclassifier__max_leaf_nodes': None, 'randomforestclassifier__min_impurity_split': 0.3, 'randomforestclassifier__n_estimators': 30}
  • Let's score:
print("Validation set score: {:.2f}".format(grid.score(X_test, y_test)))
Validation set score: 0.79
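Since GridSearchCV refits the best configuration on the full training set by default (refit=True), the grid object already contains the best pipeline; you can pull it out directly if you prefer to persist only the estimator:

```python
best_pipe = grid.best_estimator_  # the refit Pipeline(PreProcessing, RandomForestClassifier)
```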

3. Saving the Machine Learning Model: Serialization & Deserialization

# Save the model
from sklearn.externals import joblib
joblib.dump(grid, 'loan_model.pkl')
['loan_model.pkl']
# Load the model
grid = joblib.load('loan_model.pkl')
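(Note: because the pipeline contains the custom PreProcessing class, that class must be defined or importable in whatever process later calls joblib.load; otherwise unpickling will fail.)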
# Read the test data
test_df = pd.read_csv('../data/test.csv', encoding="utf-8-sig")
test_df = test_df.head()
test_df
Loan_ID Gender Married Dependents Education Self_Employed ApplicantIncome CoapplicantIncome LoanAmount Loan_Amount_Term Credit_History Property_Area
0 LP001015 Male Yes 0 Graduate No 5720 0 110.0 360.0 1.0 Urban
1 LP001022 Male Yes 1 Graduate No 3076 1500 126.0 360.0 1.0 Urban
2 LP001031 Male Yes 2 Graduate No 5000 1800 208.0 360.0 1.0 Urban
3 LP001035 Male Yes 2 Graduate No 2340 2546 100.0 360.0 NaN Urban
4 LP001051 Male No 0 Not Graduate No 3276 0 78.0 360.0 1.0 Urban
# Use the model to make predictions
grid.predict(test_df)
array([1, 1, 1, 1, 1], dtype=int64)
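To map the encoded predictions back to the applicants, one option is to pair them with Loan_ID and undo the 'Y'/'N' encoding used during training (a sketch):

```python
results = pd.DataFrame({
    'Loan_ID': test_df['Loan_ID'],
    'Loan_Status': ['Y' if p == 1 else 'N' for p in grid.predict(test_df)]
})
print(results)
```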

4. Creating the API with Flask

We'll keep the folder structure as simple as possible:

[Folder structure]

There are three important parts to constructing the wrapper function, apicall():

  • Getting the request data

  • Loading the model

  • Predicting and returning the response

An HTTP message is made up of headers and a body. As a de facto standard, the body content is mostly sent in JSON format. We'll send (POST url-endpoint/) the incoming data as a batch to get predictions.

(NOTE: You can send plain text, XML, CSV, or images directly, but JSON is recommended for interchangeability of formats.)

import pandas as pd
from sklearn.externals import joblib
from flask import Flask, jsonify, request

app = Flask(__name__)


@app.route('/predict', methods=['POST'])
def apicall():
    try:
        # Get the test data (as JSON here; other channels work too)
        test_json = request.get_json()
        test = pd.read_json(test_json, orient='records')
        test['Dependents'] = [str(x) for x in list(test['Dependents'])]
        loan_ids = test['Loan_ID']

        # Alternative: reading from a database
        # sql = "select * from data where unif_cust_id=" + unif_cust_id
        # conn = create_engine('mysql+mysqldb://test:test@localhost:3306/score_card?charset=utf8')
        # data = pd.read_sql(sql, conn)

    except Exception as e:
        raise e

    if test.empty:
        return bad_request()
    else:
        # Load the model
        print("Loading the model...")
        loaded_model = joblib.load('loan_model.pkl')

        # Make predictions
        print("The model has been loaded...doing predictions now...")
        predictions = loaded_model.predict(test)

        # Store the predictions in a DataFrame
        prediction_series = list(pd.Series(predictions))
        final_predictions = pd.DataFrame(list(zip(loan_ids, prediction_series)))

        # Return the API response
        responses = jsonify(predictions=final_predictions.to_json(orient="records"))
        responses.status_code = 200

        return responses


@app.errorhandler(400)
def bad_request(error=None):
    message = {
        'status': 400,
        'message': 'Bad Request: ' + request.url + '--> Please check your data payload...',
    }
    resp = jsonify(message)
    resp.status_code = 400

    return resp


if __name__ == '__main__':
    app.run()
* Serving Flask app "__main__" (lazy loading)
 * Environment: production
   WARNING: Do not use the development server in a production environment.
   Use a production WSGI server instead.
 * Debug mode: off


 * Running on http://127.0.0.1:5000/ (Press CTRL+C to quit)


Loading the model...
The model has been loaded...doing predictions now...


127.0.0.1 - - [11/Nov/2019 10:05:09] "POST /predict HTTP/1.1" 200 -

Calling the API

If you're using Jupyter, start a separate notebook to make the request.

import json
import requests
import pandas as pd
"""Setting the headers to send and accept json responses
"""
header = {'Content-Type': 'application/json', \
                  'Accept': 'application/json'}

"""Reading test batch
"""
df = pd.read_csv('../data/test.csv', encoding="utf-8-sig")
df = df.head()

"""Converting Pandas Dataframe to json
"""
data = df.to_json(orient='records')
data
'[{"Loan_ID":"LP001015","Gender":"Male","Married":"Yes","Dependents":"0","Education":"Graduate","Self_Employed":"No","ApplicantIncome":5720,"CoapplicantIncome":0,"LoanAmount":110.0,"Loan_Amount_Term":360.0,"Credit_History":1.0,"Property_Area":"Urban"},{"Loan_ID":"LP001022","Gender":"Male","Married":"Yes","Dependents":"1","Education":"Graduate","Self_Employed":"No","ApplicantIncome":3076,"CoapplicantIncome":1500,"LoanAmount":126.0,"Loan_Amount_Term":360.0,"Credit_History":1.0,"Property_Area":"Urban"},{"Loan_ID":"LP001031","Gender":"Male","Married":"Yes","Dependents":"2","Education":"Graduate","Self_Employed":"No","ApplicantIncome":5000,"CoapplicantIncome":1800,"LoanAmount":208.0,"Loan_Amount_Term":360.0,"Credit_History":1.0,"Property_Area":"Urban"},{"Loan_ID":"LP001035","Gender":"Male","Married":"Yes","Dependents":"2","Education":"Graduate","Self_Employed":"No","ApplicantIncome":2340,"CoapplicantIncome":2546,"LoanAmount":100.0,"Loan_Amount_Term":360.0,"Credit_History":null,"Property_Area":"Urban"},{"Loan_ID":"LP001051","Gender":"Male","Married":"No","Dependents":"0","Education":"Not Graduate","Self_Employed":"No","ApplicantIncome":3276,"CoapplicantIncome":0,"LoanAmount":78.0,"Loan_Amount_Term":360.0,"Credit_History":1.0,"Property_Area":"Urban"}]'
"""POST <url>/predict
"""
resp = requests.post("http://127.0.0.1:5000/predict", \
                    data = json.dumps(data),\
                    headers= header)
resp.status_code
200
resp.json()
{'predictions': '[{"0":"LP001015","1":1},{"0":"LP001022","1":1},{"0":"LP001031","1":1},{"0":"LP001035","1":1},{"0":"LP001051","1":1}]'}
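The predictions come back as a JSON string with positional column keys ("0" is Loan_ID, "1" is the predicted label); a small sketch to turn the response back into a DataFrame:

```python
preds = pd.read_json(resp.json()['predictions'], orient='records')
preds.columns = ['Loan_ID', 'Loan_Status']  # rename the positional keys
preds
```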