How to implement a Jupyter-based microservice

Date: 2019-12-13

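0. Background

We have an existing Node.js project and need to integrate Python (Jupyter) code written by our Data Science colleagues in order to add data-analysis features. The idea was therefore to build a microservice. Below I describe how to use a few libraries, the demos I wrote, and the pitfalls I ran into, for future reference.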

1. jupyter_kernel_gateway

The first step is to find a way to start a Jupyter notebook as an HTTP server, so that it can accept calls from any heterogeneous project. The notebook-http mode of jupyter_kernel_gateway does this.

Official documentation: https://jupyter-kernel-gateway.readthedocs.io/en/latest/http-mode.html

1.1 Installation

pip install jupyter_kernel_gateway

1.2 Startup

jupyter kernelgateway --KernelGatewayApp.api='kernel_gateway.notebook_http' --KernelGatewayApp.seed_uri='/Users/xjnotxj/Program/PythonProject/main.ipynb'

Besides a local path, seed_uri can also be a URL, e.g. http://localhost:8890/notebooks/main.ipynb

1.3 Usage

import json

# Simulated REQUEST value (for debugging only; ignore it in normal use)
# REQUEST = json.dumps({'body': {'age': ['181']}, 'args': {'sex': ['male'], 'location': ['shanghai']}, 'path': {'name': 'colin'}, 'headers': {'Content-Type': 'multipart/form-data; boundary=--------------------------149817035181009685206727', 'Cache-Control': 'no-cache', 'Postman-Token': '96c484cb-8709-4a42-9e12-3aaf18392c92', 'User-Agent': 'PostmanRuntime/7.6.0', 'Accept': '*/*', 'Host': 'localhost:8888', 'Accept-Encoding': 'gzip, deflate', 'Content-Length': '161', 'Connection': 'keep-alive'}})

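Define a route with a comment such as # POST /post/:name (the same annotation can be used on several cells). The gateway passes the request in as the JSON string REQUEST; parsing it gives the req object: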

# POST /post/:name

req = json.loads(REQUEST)

# defined return vars
return_status = 200
return_code = 0
return_message = ''
return_data = {}

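Here I define a function that validates the req parameters. jupyter_kernel_gateway offers no way to return or exit from the current request early, so execution continues through the remaining code, and multiple outputs would interfere with the final response. As a result my code logic is not very concise; if anyone knows a better approach, please let me know.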

# POST /post/:name 
 
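def checkReqValid(req):
    # Returns True when the request is INVALID, and sets the error code and message

    global return_code
    global return_message

    # age must be in the range [0, 100)
    if 100 <= req["age"] or req["age"] < 0: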
        return_code = -2
        return_message = "'age' is out of range" 
        return True
    
    return False

Implement the controller part:

# POST /post/:name 


try :   
    
    name = req['path']['name']
    age = int(req['body']['age'][0])
    sex = req['args']['sex'][0]
    location = req['args']['location'][0]
    
    if checkReqValid({"name":name,
                        "age":age,
                        "sex":sex,
                        "location":location}) == True:
        pass
    else : 
        # do something……
        return_data = {
            "name":name,
            "age":age,
            "sex":sex,
            "location":location,
            "req":req
        }

    
except KeyError: # a required field is missing
    return_code = -1
    return_message = "some field is empty"

finally: # return data
    print(json.dumps({
        "code":return_code,
        "message":return_message,
        "data":return_data
    }))
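
One possible way around the missing early exit, offered as a sketch rather than a confirmed best practice: put the logic inside an ordinary function, where return works as usual, and print its result exactly once at the end of the cell. The name handle_request and the simulated REQUEST value are hypothetical; inside the notebook the gateway supplies REQUEST itself, and the annotation must stay on the first line of the cell.

```python
# POST /post/:name
import json

# Simulated REQUEST for local testing; the gateway normally supplies this string.
REQUEST = json.dumps({
    "body": {"age": ["18"]},
    "args": {"sex": ["male"], "location": ["shanghai"]},
    "path": {"name": "colin"},
})


def handle_request(req):
    """Return the response payload; early returns replace the global flags."""
    try:
        name = req["path"]["name"]
        age = int(req["body"]["age"][0])
        sex = req["args"]["sex"][0]
        location = req["args"]["location"][0]
    except (KeyError, IndexError):
        return {"code": -1, "message": "some field is empty", "data": {}}

    if age < 0 or age >= 100:
        return {"code": -2, "message": "'age' is out of range", "data": {}}

    data = {"name": name, "age": age, "sex": sex, "location": location}
    return {"code": 0, "message": "", "data": data}


# The cell prints exactly one JSON document, so nothing else pollutes the response.
response = handle_request(json.loads(REQUEST))
print(json.dumps(response))
```

With this layout the validation and the controller live in one place, and no global return_code or return_message variables are needed for the body of the response.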

Use # ResponseInfo POST /post/:name to define the response headers and status. The response itself is produced by printing to stdout:

# ResponseInfo POST /post/:name

print(json.dumps({
    "headers" : {
        "Content-Type" : "application/json"
    },
    "status" : return_status
}))
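
In the code above return_status always stays at 200, even when validation fails. A minimal sketch of deriving the HTTP status from the application code follows. It assumes that the ResponseInfo cell runs in the same kernel after the handler cell and can therefore see return_code, and the mapping of every non-zero code to 400 is a hypothetical choice for illustration.

```python
# ResponseInfo POST /post/:name
import json

# Value the handler cell is assumed to have set earlier; hard-coded here for the demo.
return_code = -2

# Hypothetical mapping: 0 means success, any other code is reported as a client error.
return_status = 200 if return_code == 0 else 400

response_info = {
    "headers": {"Content-Type": "application/json"},
    "status": return_status,
}
print(json.dumps(response_info))
```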

When I request localhost:8888/post/colin?sex=male&location=shanghai with a body of age:18, the response is:

{
    "code": 0,
    "message": "",
    "data": {
        "name": "colin",
        "age": 18,
        "sex": "male",
        "location": "shanghai",
        "req": {
            "body": {
                "age": [
                    "18"
                ]
            },
            "args": {
                "sex": [
                    "male"
                ],
                "location": [
                    "shanghai"
                ]
            },
            "path": {
                "name": "colin"
            },
            "headers": {
                "Content-Type": "multipart/form-data; boundary=--------------------------981201125716045634129372",
                "Cache-Control": "no-cache",
                "Postman-Token": "ec0f5364-b0ea-4828-b987-c12f15573296",
                "User-Agent": "PostmanRuntime/7.6.0",
                "Accept": "*/*",
                "Host": "localhost:8888",
                "Accept-Encoding": "gzip, deflate",
                "Content-Length": "160",
                "Connection": "keep-alive"
            }
        }
    }
}
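
For reference, the same request can be built from Python with only the standard library. This is a sketch under assumptions: build_request is a hypothetical helper, the body is sent URL-encoded rather than as the multipart/form-data that Postman used above, and the gateway is assumed to parse both form encodings the same way. The request is only built here; the commented lines show how it could be sent once the gateway is running.

```python
import urllib.parse
import urllib.request


def build_request(base_url, name, age, sex, location):
    """Build (but do not send) a POST request for the /post/:name route."""
    query = urllib.parse.urlencode({"sex": sex, "location": location})
    url = f"{base_url}/post/{urllib.parse.quote(name)}?{query}"
    body = urllib.parse.urlencode({"age": age}).encode("utf-8")
    return urllib.request.Request(url, data=body, method="POST")


request = build_request("http://localhost:8888", "colin", 18, "male", "shanghai")
print(request.full_url)

# To actually send it (requires the gateway to be running):
# with urllib.request.urlopen(request) as resp:
#     print(resp.status, resp.read().decode("utf-8"))
```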

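About status codes:

The default is 200 OK (with Content-Type: text/plain)

If a runtime error occurs, 500 Internal Server Error is returned

If no route is found, 404 Not Found is returned

If the route is found but the request method (GET, POST, etc.) does not match, 405 Not Supported is returned (the standard HTTP reason phrase for 405 is Method Not Allowed)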

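1.4 Pitfalls

(1) In a cell that defines a route through a comment annotation, the first line must not be blank; otherwise startup fails with: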

✘ xjnotxj@jiangchengzhideMacBook-Pro  ~/Program/PythonProject  jupyter kernelgateway --KernelGatewayApp.api='kernel_gateway.notebook_http' --KernelGatewayApp.seed_uri='/Users/xjnotxj/Program/PythonProject/tuo.ipynb'
[KernelGatewayApp] Kernel started: bb13bcd6-514f-4682-b627-e6809cbb13ac
Traceback (most recent call last):
  File "/anaconda3/bin/jupyter-kernelgateway", line 11, in <module>
    sys.exit(launch_instance())
  File "/anaconda3/lib/python3.7/site-packages/jupyter_core/application.py", line 266, in launch_instance
    return super(JupyterApp, cls).launch_instance(argv=argv, **kwargs)
  File "/anaconda3/lib/python3.7/site-packages/traitlets/config/application.py", line 657, in launch_instance
    app.initialize(argv)
  File "/anaconda3/lib/python3.7/site-packages/kernel_gateway/gatewayapp.py", line 382, in initialize
    self.init_webapp()
  File "/anaconda3/lib/python3.7/site-packages/kernel_gateway/gatewayapp.py", line 449, in init_webapp
    handlers = self.personality.create_request_handlers()
  File "/anaconda3/lib/python3.7/site-packages/kernel_gateway/notebook_http/__init__.py", line 112, in create_request_handlers
    raise RuntimeError('No endpoints were discovered. Check your notebook to make sure your cells are annotated correctly.')
RuntimeError: No endpoints were discovered. Check your notebook to make sure your cells are annotated correctly.
 ✘ xjnotxj@jiangchengzhideMacBook-Pro  ~/Program/PythonProject  [IPKernelApp] WARNING | Parent appears to have exited, shutting down.
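
A small pre-flight check can catch this before the gateway is started. The sketch below is a heuristic and not part of jupyter_kernel_gateway: it reads the notebook as plain JSON and reports code cells that contain a route annotation somewhere other than the first line. The helper names and the list of verbs in the regular expression are assumptions.

```python
import json
import re

# Matches annotations such as "# POST /post/:name" or "# ResponseInfo GET /x".
ANNOTATION = re.compile(r"^#\s*(ResponseInfo\s+)?(GET|POST|PUT|DELETE)\s+/\S*")


def misplaced_annotations(notebook):
    """Return indexes of code cells whose route annotation is not on the first line."""
    bad = []
    for index, cell in enumerate(notebook.get("cells", [])):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        # The notebook format stores source either as one string or as a list of lines.
        text = "".join(source) if isinstance(source, list) else source
        lines = text.split("\n")
        has_annotation = any(ANNOTATION.match(line.strip()) for line in lines)
        if has_annotation and not ANNOTATION.match(lines[0].strip()):
            bad.append(index)
    return bad


def check_file(path):
    """Load an .ipynb file from disk and run the same check."""
    with open(path, encoding="utf-8") as handle:
        return misplaced_annotations(json.load(handle))


# Example with an in-memory notebook: the second cell starts with a blank line.
example = {"cells": [
    {"cell_type": "code", "source": ["# POST /post/:name\n", "x = 1"]},
    {"cell_type": "code", "source": ["\n", "# POST /post/:name\n", "x = 1"]},
]}
print(misplaced_annotations(example))
```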

(2) In the request, each parameter value under `args` and `body` is an array of length 1

# Note how the value is read
sex = req['args']['sex'][0]
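
To avoid repeating [0] for every field, the lists can be unwrapped once. A minimal sketch, with first_values as a hypothetical helper name; it keeps only the first element, which is an assumption that no parameter is sent more than once.

```python
def first_values(params):
    """Map each key to the first element of its list; keys with empty lists are dropped."""
    return {key: values[0] for key, values in params.items() if values}


args = {"sex": ["male"], "location": ["shanghai"]}
print(first_values(args))
```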

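2. papermill

The second step is to use something like glue to join the different Data Science processing scripts together and call them in sequence.

Why use papermill instead of calling the scripts directly?

(1) It standardizes how Jupyter files are invoked and how parameters are passed to them

(2) Executing a Jupyter file produces an output file, which makes it easy to trace back what happened

(3) Context variables are stored separately for each Jupyter file, so they do not interfere with each other

2.1 Installation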

https://github.com/nteract/papermill

pip install papermill

2.2 Usage

(1) `a.ipynb`

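import papermill as pm

# data and multiple are injected by papermill as parameters
for i, item in enumerate(data):
    data[i] = item * multiple

# pm.record is available in papermill releases before 1.0; later releases moved this feature to the scrapbook library
pm.record("data", data)
print(data)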

(2) `main.ipynb`

import papermill as pm

data = [1, 2, 3]
data

# Can also be run from the command line; see the documentation for details
pm.execute_notebook(
   'a.ipynb',
   'a_out.ipynb',
   parameters = dict(data=data, multiple=3)
)
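
The "glue" idea extends naturally to several notebooks run one after another. The following is a sketch only: run_pipeline, fake_execute, b.ipynb and the threshold parameter are hypothetical names, and the executor is injectable so the loop can be tried without papermill or a Jupyter kernel. When no executor is passed, it falls back to pm.execute_notebook as used above.

```python
def run_pipeline(steps, execute=None):
    """Run notebooks in order; each step is (input_path, output_path, parameters)."""
    if execute is None:
        # Imported lazily so the function can be exercised without papermill installed.
        import papermill as pm
        execute = pm.execute_notebook

    outputs = []
    for input_path, output_path, parameters in steps:
        execute(input_path, output_path, parameters=parameters)
        outputs.append(output_path)
    return outputs


# Dry run with a stand-in executor that only logs the calls.
calls = []


def fake_execute(input_path, output_path, parameters=None):
    calls.append((input_path, output_path, parameters))


steps = [
    ("a.ipynb", "a_out.ipynb", {"data": [1, 2, 3], "multiple": 3}),
    ("b.ipynb", "b_out.ipynb", {"threshold": 5}),
]
print(run_pipeline(steps, execute=fake_execute))
```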

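Papermill supports the following kinds of input and output paths:

(1) Local file system: local

(2) HTTP and HTTPS: http://, https://

(3) Amazon Web Services: AWS S3 s3://

(4) Azure: Azure DataLake Store, Azure Blob Store adl://, abs://

(5) Google Cloud: Google Cloud Storage gs://

After executing main.ipynb:

1. A new file, a_out.ipynb, is generated (see (3) below)

2. The context variables are bound to a_out.ipynb: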

re = pm.read_notebook('a_out.ipynb').dataframe
re

name	value	type	filename
0	data	[1, 2, 3]	parameter	a_out.ipynb
1	multiple	3	parameter	a_out.ipynb
2	data	[3, 6, 9]	record	a_out.ipynb

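Reading a value back this way is a little cumbersome, so I wrapped it in a function:

# getPMNotebookData args
# [filename] path to the .ipynb file
# [field] name of the variable to read
# [default_value] value returned when the lookup fails (default: None)
# [_type] 'parameter' | 'record' (default)

def getPMNotebookData(filename, field, default_value=None, _type='record'):
    try:
        df = pm.read_notebook(filename).dataframe
        # combine both conditions in a single boolean mask
        matched = df[(df['name'] == field) & (df['type'] == _type)]
        return matched['value'].values[0]
    except Exception:
        # missing file, missing field, etc.: fall back to the default
        return default_value

data = getPMNotebookData('a_out.ipynb', 'data', 0)
data
# [3, 6, 9]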

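(3) `a_out.ipynb`

The generated file contains two additional pieces of content:

1. A new cell is automatically inserted at the very beginning, before all other cells, containing the parameters we passed in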

# Parameters
data = [1, 2, 3]
multiple = 3

2. The output of each cell

[3, 6, 9]
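
Because the output notebook is plain JSON, the inserted parameters cell can also be inspected without papermill. The sketch below assumes that the papermill version in use tags the inserted cell with injected-parameters in its metadata; the helper name and the in-memory stand-in for a_out.ipynb are hypothetical.

```python
def injected_parameters_source(notebook):
    """Return the source of the first cell tagged 'injected-parameters', or None."""
    for cell in notebook.get("cells", []):
        tags = cell.get("metadata", {}).get("tags", [])
        if "injected-parameters" in tags:
            source = cell.get("source", "")
            return "".join(source) if isinstance(source, list) else source
    return None


# In-memory stand-in for a_out.ipynb; a real file would first be loaded with json.load.
a_out = {"cells": [
    {"cell_type": "code",
     "metadata": {"tags": ["injected-parameters"]},
     "source": ["# Parameters\n", "data = [1, 2, 3]\n", "multiple = 3\n"]},
    {"cell_type": "code", "metadata": {}, "source": ["print(data)"]},
]}
print(injected_parameters_source(a_out))
```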

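2.3 Pitfalls

(1) A pd.DataFrame cannot be passed as a parameter

Doing so raises:

TypeError: Object of type DataFrame is not JSON serializable

Workarounds:

1. Serialize the DataFrame

DataFrame offers two ways to serialize, df.to_json() or df.to_csv(); for parsing and detailed usage see: https://github.com/nteract/papermill/issues/215

Drawback:

During serialization the data type of each DataFrame column is lost and has to be specified again after the data is read back.

2. Do not send the DataFrame through papermill's parameter mechanism at all; hand it over through an intermediate CSV file instead [recommended]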
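
The hand-off through an intermediate file can be sketched as follows. To stay dependency-free the example uses the standard csv module instead of pandas (with pandas the corresponding calls would be df.to_csv(path, index=False) and pd.read_csv(path, dtype=...)); the file name, the csv_path parameter name and the column set are hypothetical. It also shows the type loss described above: values come back as strings until the column types are restored.

```python
import csv
import json
import os
import tempfile

rows = [{"name": "colin", "age": 18, "score": 9.5}]

# Producer side: write the table to an intermediate CSV file.
csv_path = os.path.join(tempfile.mkdtemp(), "handoff.csv")
with open(csv_path, "w", newline="", encoding="utf-8") as handle:
    writer = csv.DictWriter(handle, fieldnames=["name", "age", "score"])
    writer.writeheader()
    writer.writerows(rows)

# Only the path travels through the parameter mechanism; a string is JSON serializable.
parameters = {"csv_path": csv_path}
json.dumps(parameters)

# Consumer side: CSV carries no type information, so every value is read as a string.
with open(csv_path, newline="", encoding="utf-8") as handle:
    raw = list(csv.DictReader(handle))

# The column types therefore have to be restored explicitly.
column_types = {"name": str, "age": int, "score": float}
restored = [{key: column_types[key](value) for key, value in row.items()} for row in raw]
print(repr(raw[0]["age"]), repr(restored[0]["age"]))
```

The parameters dictionary is what would be passed to pm.execute_notebook, and the consumer notebook would read the file at csv_path itself.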

3. Docker packaging

The third step is to package the finished microservice with Docker so that it can be deployed.

To be written……
