```bash
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
source .bashrc
conda create --name <environment-name> python=3.6
source activate <environment-name>
```
Install flask and gunicorn. Then let's try a simple Flask Hello-World application and serve it with gunicorn:
Create hello-world.py and write the code:
```python
from flask import Flask

app = Flask(__name__)

@app.route('/users/<string:username>')
def hello_world(username=None):
    return "Hello {}!".format(username)
```
```bash
gunicorn --bind 0.0.0.0:8000 hello-world:app
```
Open http://localhost:8000/users/any-name in your browser. If you get a "Hello any-name!" response, you're on the right track.
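As a quick sanity check, you can also hit the endpoint from Python (a minimal sketch using the requests library, assuming gunicorn is still bound to port 8000 as above):

```python
import requests

# Call the hello-world endpoint served by gunicorn on port 8000
resp = requests.get("http://localhost:8000/users/any-name")
print(resp.status_code)  # expected: 200
print(resp.text)         # expected: Hello any-name!
```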
You've now written your first Flask application. As you just experienced through a few simple steps, we were able to create web endpoints that can be accessed locally. The road ahead remains just as simple.
Using Flask, we can easily wrap our machine learning models and serve them as web APIs. Moreover, if we want to create more complex web applications (JavaScript included, *gasps*), we only need a few modifications.
Let's take a machine learning competition as an example: the Loan Prediction Competition. The main objective is to set up a pre-processing pipeline and create an ML model, with the goal of simplifying ML predictions at deployment time.
Data: https://pan.baidu.com/s/1VjqNZxvdKm0G5iBtTs-4TA (extraction code: n5ag)
```python
import os
import json
import numpy as np
import pandas as pd
from sklearn.externals import joblib
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline

import warnings
warnings.filterwarnings("ignore")
```
```python
data = pd.read_csv('../data/training.csv')
list(data.columns)
```

```
['Loan_ID', 'Gender', 'Married', 'Dependents', 'Education', 'Self_Employed', 'ApplicantIncome', 'CoapplicantIncome', 'LoanAmount', 'Loan_Amount_Term', 'Credit_History', 'Property_Area', 'Loan_Status']
```

```python
data.shape
```

```
(614, 13)
```
Find the number of null/NaN values in the columns:

```python
for _ in data.columns:
    print("The number of null values in:{} == {}".format(_, data[_].isnull().sum()))
```

```
The number of null values in:Loan_ID == 0
The number of null values in:Gender == 13
The number of null values in:Married == 3
The number of null values in:Dependents == 15
The number of null values in:Education == 0
The number of null values in:Self_Employed == 32
The number of null values in:ApplicantIncome == 0
The number of null values in:CoapplicantIncome == 0
The number of null values in:LoanAmount == 22
The number of null values in:Loan_Amount_Term == 14
The number of null values in:Credit_History == 50
The number of null values in:Property_Area == 0
The number of null values in:Loan_Status == 0
```
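As an aside, pandas can produce the same counts in one call, if you prefer a compact check:

```python
# Per-column null counts as a single Series
print(data.isnull().sum())
```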
Next, create the training and testing datasets:

```python
pred_var = ['Gender', 'Married', 'Dependents', 'Education', 'Self_Employed',
            'ApplicantIncome', 'CoapplicantIncome', 'LoanAmount',
            'Loan_Amount_Term', 'Credit_History', 'Property_Area']

X_train, X_test, y_train, y_test = train_test_split(data[pred_var], data['Loan_Status'],
                                                    test_size=0.25, random_state=42)
```
To make sure that the pre-processing steps are followed religiously even after we are done experimenting, and that we do not miss them at prediction time, we'll create a custom pre-processing scikit-learn estimator. (To follow the process of how we ended up with this estimator, read up on this notebook.)
```python
from sklearn.base import BaseEstimator, TransformerMixin

class PreProcessing(BaseEstimator, TransformerMixin):
    """Custom Pre-Processing estimator for our use-case
    """

    def __init__(self):
        pass

    def transform(self, df):
        """Regular transform() that is a help for training, validation & testing datasets
           (NOTE: The operations performed here are the ones that we did prior to this cell)
        """
        pred_var = ['Gender', 'Married', 'Dependents', 'Education', 'Self_Employed',
                    'ApplicantIncome', 'CoapplicantIncome', 'LoanAmount',
                    'Loan_Amount_Term', 'Credit_History', 'Property_Area']

        df = df[pred_var]

        df['Dependents'] = df['Dependents'].fillna(0)
        df['Self_Employed'] = df['Self_Employed'].fillna('No')
        df['Loan_Amount_Term'] = df['Loan_Amount_Term'].fillna(self.term_mean_)
        df['Credit_History'] = df['Credit_History'].fillna(1)
        df['Married'] = df['Married'].fillna('No')
        df['Gender'] = df['Gender'].fillna('Male')
        df['LoanAmount'] = df['LoanAmount'].fillna(self.amt_mean_)

        gender_values = {'Female': 0, 'Male': 1}
        married_values = {'No': 0, 'Yes': 1}
        education_values = {'Graduate': 0, 'Not Graduate': 1}
        employed_values = {'No': 0, 'Yes': 1}
        property_values = {'Rural': 0, 'Urban': 1, 'Semiurban': 2}
        dependent_values = {'3+': 3, '0': 0, '2': 2, '1': 1}

        df.replace({'Gender': gender_values, 'Married': married_values,
                    'Education': education_values, 'Self_Employed': employed_values,
                    'Property_Area': property_values, 'Dependents': dependent_values},
                   inplace=True)

        return df.as_matrix()

    def fit(self, df, y=None, **fit_params):
        """Fitting the Training dataset & calculating the required values from train
           e.g: We will need the mean of X_train['Loan_Amount_Term'] that will be
           used in transformation of X_test
        """
        self.term_mean_ = df['Loan_Amount_Term'].mean()
        self.amt_mean_ = df['LoanAmount'].mean()
        return self
```
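As a quick sanity check, the estimator can be exercised on its own before it goes into a pipeline (a sketch; the variable names are ours, and the exact shape depends on the split):

```python
# Fit on the training split so term_mean_ and amt_mean_ are learned there,
# then transform both splits with those training-set means
preproc = PreProcessing()
preproc.fit(X_train)
X_train_mat = preproc.transform(X_train)
X_test_mat = preproc.transform(X_test)
print(X_train_mat.shape)  # about (460, 11) for a 75/25 split of 614 rows
```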
Convert y_train and y_test to np.array:

```python
y_train = y_train.replace({'Y': 1, 'N': 0}).as_matrix()
y_test = y_test.replace({'Y': 1, 'N': 0}).as_matrix()
```
We'll create a pipeline to make sure that all the preprocessing steps we do are just a single scikit-learn estimator.
```python
pipe = make_pipeline(PreProcessing(), RandomForestClassifier())
pipe
```

```
Pipeline(memory=None,
     steps=[('preprocessing', PreProcessing()),
            ('randomforestclassifier',
             RandomForestClassifier(bootstrap=True, class_weight=None,
                 criterion='gini', max_depth=None, max_features='auto',
                 max_leaf_nodes=None, min_impurity_decrease=0.0,
                 min_impurity_split=None, min_samples_leaf=1, min_samples_split=2,
                 min_weight_fraction_leaf=0.0, n_estimators='warn', n_jobs=None,
                 oob_score=False, random_state=None, verbose=0, warm_start=False))])
```
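Before tuning, the pipeline can be fit end-to-end as a baseline (a sketch with default hyper-parameters; the exact score will vary with the forest's random seed):

```python
# Fit the whole pipeline (pre-processing + model), then report
# mean accuracy on the held-out split
pipe.fit(X_train, y_train)
print(pipe.score(X_test, y_test))
```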
To search for the best hyper-parameters of the RandomForestClassifier (n_estimators, max_depth, and so on), we'll do a Grid Search:
Defining the param_grid:

```python
param_grid = {"randomforestclassifier__n_estimators": [10, 20, 30],
              "randomforestclassifier__max_depth": [None, 6, 8, 10],
              "randomforestclassifier__max_leaf_nodes": [None, 5, 10, 20],
              "randomforestclassifier__min_impurity_split": [0.1, 0.2, 0.3]}
```
Running the Grid Search:

```python
grid = GridSearchCV(pipe, param_grid=param_grid, cv=3)
```
Fitting the training data on the pipeline estimator:

```python
grid.fit(X_train, y_train)
```
```
GridSearchCV(cv=3, error_score='raise-deprecating',
       estimator=Pipeline(memory=None,
     steps=[('preprocessing', PreProcessing()), ('randomforestclassifier',
            RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impu...bs=None, oob_score=False,
            random_state=None, verbose=0, warm_start=False))]),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'randomforestclassifier__n_estimators': [10, 20, 30],
                   'randomforestclassifier__max_depth': [None, 6, 8, 10],
                   'randomforestclassifier__max_leaf_nodes': [None, 5, 10, 20],
                   'randomforestclassifier__min_impurity_split': [0.1, 0.2, 0.3]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)
```
print("Best parameters: {}".format(grid.best_params_))
Best parameters: {'randomforestclassifier__max_depth': None, 'randomforestclassifier__max_leaf_nodes': None, 'randomforestclassifier__min_impurity_split': 0.3, 'randomforestclassifier__n_estimators': 30}
print("Validation set score: {:.2f}".format(grid.score(X_test, y_test)))
Validation set score: 0.79
```python
# Save the model
from sklearn.externals import joblib
joblib.dump(grid, 'loan_model.pkl')
```

```
['loan_model.pkl']
```

```python
# Load the model
grid = joblib.load('loan_model.pkl')
```
```python
# Read the test data
test_df = pd.read_csv('../data/test.csv', encoding="utf-8-sig")
test_df = test_df.head()
test_df
```
| | Loan_ID | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | LP001015 | Male | Yes | 0 | Graduate | No | 5720 | 0 | 110.0 | 360.0 | 1.0 | Urban |
| 1 | LP001022 | Male | Yes | 1 | Graduate | No | 3076 | 1500 | 126.0 | 360.0 | 1.0 | Urban |
| 2 | LP001031 | Male | Yes | 2 | Graduate | No | 5000 | 1800 | 208.0 | 360.0 | 1.0 | Urban |
| 3 | LP001035 | Male | Yes | 2 | Graduate | No | 2340 | 2546 | 100.0 | 360.0 | NaN | Urban |
| 4 | LP001051 | Male | No | 0 | Not Graduate | No | 3276 | 0 | 78.0 | 360.0 | 1.0 | Urban |
```python
# Use the model to make predictions
grid.predict(test_df)
```

```
array([1, 1, 1, 1, 1], dtype=int64)
```
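Since Loan_Status was encoded as {'Y': 1, 'N': 0} before training, a small sketch to map the numeric predictions back to the original labels (the helper name is ours):

```python
# Invert the {'Y': 1, 'N': 0} encoding that was applied to y_train
label_map = {1: 'Y', 0: 'N'}
print([label_map[p] for p in grid.predict(test_df)])  # e.g. ['Y', 'Y', 'Y', 'Y', 'Y']
```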
We'll keep the folder structure as simple as possible:
There are three important parts in constructing our wrapper function, apicall():

- Getting the request data
- Loading the model
- Making predictions and returning the response
HTTP messages are made up of a header and a body. As a standard, most of the body content being sent is in json format. We'll be sending (POST url-endpoint/) the incoming data as a batch to get predictions.
(NOTE: You can send plain text, XML, csv, or images directly, but for interchangeability of formats it is advisable to use json.)
```python
import pandas as pd
from sklearn.externals import joblib
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route('/predict', methods=['POST'])
def apicall():
    try:
        # Get the test data; it can arrive as json or through other channels
        test_json = request.get_json()
        test = pd.read_json(test_json, orient='records')
        test['Dependents'] = [str(x) for x in list(test['Dependents'])]
        loan_ids = test['Loan_ID']

        # Reading from a database would look like this instead:
        # sql = "select * from data where unif_cust_id=" + unif_cust_id
        # conn = create_engine('mysql+mysqldb://test:test@localhost:3306/score_card?charset=utf8')
        # data = pd.read_sql(sql, conn)
    except Exception as e:
        raise e

    if test.empty:
        return bad_request()
    else:
        # Load the model
        print("Loading the model...")
        loaded_model = joblib.load('loan_model.pkl')

        # Predict
        print("The model has been loaded...doing predictions now...")
        predictions = loaded_model.predict(test)

        # Store the predictions in a DataFrame
        prediction_series = list(pd.Series(predictions))
        final_predictions = pd.DataFrame(list(zip(loan_ids, prediction_series)))

        # Return the API response
        responses = jsonify(predictions=final_predictions.to_json(orient="records"))
        responses.status_code = 200

        return responses

@app.errorhandler(400)
def bad_request(error=None):
    message = {
        'status': 400,
        'message': 'Bad Request: ' + request.url + '--> Please check your data payload...',
    }
    resp = jsonify(message)
    resp.status_code = 400

    return resp

if __name__ == '__main__':
    app.run()
```
```
 * Serving Flask app "__main__" (lazy loading)
 * Environment: production
   WARNING: Do not use the development server in a production environment.
   Use a production WSGI server instead.
 * Debug mode: off
 * Running on http://127.0.0.1:5000/ (Press CTRL+C to quit)
Loading the model...
The model has been loaded...doing predictions now...
127.0.0.1 - - [11/Nov/2019 10:05:09] "POST /predict HTTP/1.1" 200 -
```
If you're working in Jupyter, open another notebook to make the request.
```python
import json
import requests
import pandas as pd
```

```python
"""Setting the headers to send and accept json responses
"""
header = {'Content-Type': 'application/json',
          'Accept': 'application/json'}

"""Reading test batch
"""
df = pd.read_csv('../data/test.csv', encoding="utf-8-sig")
df = df.head()

"""Converting Pandas Dataframe to json
"""
data = df.to_json(orient='records')
```
```python
data
```

```
'[{"Loan_ID":"LP001015","Gender":"Male","Married":"Yes","Dependents":"0","Education":"Graduate","Self_Employed":"No","ApplicantIncome":5720,"CoapplicantIncome":0,"LoanAmount":110.0,"Loan_Amount_Term":360.0,"Credit_History":1.0,"Property_Area":"Urban"},{"Loan_ID":"LP001022","Gender":"Male","Married":"Yes","Dependents":"1","Education":"Graduate","Self_Employed":"No","ApplicantIncome":3076,"CoapplicantIncome":1500,"LoanAmount":126.0,"Loan_Amount_Term":360.0,"Credit_History":1.0,"Property_Area":"Urban"},{"Loan_ID":"LP001031","Gender":"Male","Married":"Yes","Dependents":"2","Education":"Graduate","Self_Employed":"No","ApplicantIncome":5000,"CoapplicantIncome":1800,"LoanAmount":208.0,"Loan_Amount_Term":360.0,"Credit_History":1.0,"Property_Area":"Urban"},{"Loan_ID":"LP001035","Gender":"Male","Married":"Yes","Dependents":"2","Education":"Graduate","Self_Employed":"No","ApplicantIncome":2340,"CoapplicantIncome":2546,"LoanAmount":100.0,"Loan_Amount_Term":360.0,"Credit_History":null,"Property_Area":"Urban"},{"Loan_ID":"LP001051","Gender":"Male","Married":"No","Dependents":"0","Education":"Not Graduate","Self_Employed":"No","ApplicantIncome":3276,"CoapplicantIncome":0,"LoanAmount":78.0,"Loan_Amount_Term":360.0,"Credit_History":1.0,"Property_Area":"Urban"}]'
```
"""POST <url>/predict """ resp = requests.post("http://127.0.0.1:5000/predict", \ data = json.dumps(data),\ headers= header)
```python
resp.status_code
```

```
200
```

```python
resp.json()
```

```
{'predictions': '[{"0":"LP001015","1":1},{"0":"LP001022","1":1},{"0":"LP001031","1":1},{"0":"LP001035","1":1},{"0":"LP001051","1":1}]'}
```
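Note that the predictions value is itself a JSON string (the API called to_json(orient="records") before jsonify), so it takes a second parse on the client. A minimal sketch (the column names here are our own; the API returns bare positional keys and 0/1 codes):

```python
import pandas as pd

# Second parse: the 'predictions' value is a JSON string, not a JSON object
result = pd.read_json(resp.json()['predictions'], orient='records')
result.columns = ['Loan_ID', 'Prediction']  # assumed, human-friendly names
print(result)
```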