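We have an existing Node.js project, but need to integrate Python (Jupyter) code written by our Data Science colleagues to add extra data analysis features. The plan is therefore to build a microservice. Below I describe how to use a few libraries, the demos I wrote, and the pitfalls I ran into, for future reference.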
Step one is to find a way to start a Jupyter notebook file as an HTTP server, so that it can accept calls from any heterogeneous project. The notebook-http feature of jupyter_kernel_gateway can do this.
Official docs: https://jupyter-kernel-gateway.readthedocs.io/en/latest/http-mode.html
pip install jupyter_kernel_gateway
jupyter kernelgateway --KernelGatewayApp.api='kernel_gateway.notebook_http' --KernelGatewayApp.seed_uri='/Users/xjnotxj/Program/PythonProject/main.ipynb'
Besides a local path, seed_uri can also be a URL, e.g. http://localhost:8890/notebooks/main.ipynb
import json
# imitate REQUEST args (for debugging only; ignore otherwise)
# REQUEST = json.dumps({'body': {'age': ['181']}, 'args': {'sex': ['male'], 'location': ['shanghai']}, 'path': {'name': 'colin'}, 'headers': {'Content-Type': 'multipart/form-data; boundary=--------------------------149817035181009685206727', 'Cache-Control': 'no-cache', 'Postman-Token': '96c484cb-8709-4a42-9e12-3aaf18392c92', 'User-Agent': 'PostmanRuntime/7.6.0', 'Accept': '*/*', 'Host': 'localhost:8888', 'Accept-Encoding': 'gzip, deflate', 'Content-Length': '161', 'Connection': 'keep-alive'}})
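Define a route with a comment: # POST /post/:name (several cells can share the same annotation). The gateway injects the request as a JSON string named REQUEST, which is parsed into a req object here: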
# POST /post/:name
req = json.loads(REQUEST)

# defined return vars
return_status = 200
return_code = 0
return_message = ''
return_data = {}
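Here I define a function that checks the req parameters. jupyter_kernel_gateway does not support leaving the current request with return or exit; execution simply continues into the following code, and the extra outputs then interfere with the final response. That is why the logic below is not very concise. If anyone knows a better way, please let me know.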
# POST /post/:name
def checkReqValid(req):
    # Note: returns True when the request is INVALID, False when it is fine
    global return_code
    global return_message

    # age
    if 100 <= req["age"] or req["age"] < 0:
        return_code = -2
        return_message = "'age' is out of range"
        return True

    return False
Implement the controller part:
# POST /post/:name
try:
    name = req['path']['name']
    age = int(req['body']['age'][0])
    sex = req['args']['sex'][0]
    location = req['args']['location'][0]

    # checkReqValid returns True when a parameter is invalid
    if not checkReqValid({"name": name, "age": age, "sex": sex, "location": location}):
        # dosomething……
        return_data = {
            "name": name,
            "age": age,
            "sex": sex,
            "location": location,
            "req": req
        }
except KeyError:
    # a required field is missing
    return_code = -1
    return_message = "some field is empty"
finally:
    # return data
    print(json.dumps({
        "code": return_code,
        "message": return_message,
        "data": return_data
    }))
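On the early-exit problem mentioned above, one possible workaround is to move validation and controller logic into a single function that can `return` early, so the cell prints exactly once. The sketch below assumes everything for one route fits in one function. `ValidationError` and `handle` are hypothetical names, not part of jupyter_kernel_gateway, and the `REQUEST` defined here only imitates what the gateway would inject (drop it in the real notebook).

```python
import json


class ValidationError(Exception):
    """Hypothetical helper carrying the code/message pair used in this post."""

    def __init__(self, code, message):
        super().__init__(message)
        self.code = code
        self.message = message


def handle(req):
    """Build the response payload; every exit path returns exactly one dict."""
    try:
        name = req["path"]["name"]
        age = int(req["body"]["age"][0])
        sex = req["args"]["sex"][0]
        location = req["args"]["location"][0]
        if not 0 <= age < 100:
            raise ValidationError(-2, "'age' is out of range")
    except (KeyError, IndexError):
        return {"code": -1, "message": "some field is empty", "data": {}}
    except ValidationError as err:
        return {"code": err.code, "message": err.message, "data": {}}

    data = {"name": name, "age": age, "sex": sex, "location": location}
    return {"code": 0, "message": "", "data": data}


# Stand-in for the REQUEST string that the gateway injects into the cell.
REQUEST = json.dumps({
    "body": {"age": ["18"]},
    "args": {"sex": ["male"], "location": ["shanghai"]},
    "path": {"name": "colin"},
})

# The only print in the cell, so the response body is written once.
print(json.dumps(handle(json.loads(REQUEST))))
```

The global return_code / return_message variables are no longer needed with this layout, because each branch builds its own payload.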
Use # ResponseInfo POST /post/:name to define the response headers and status, and respond to the request by writing to stdout with print:
# ResponseInfo POST /post/:name
print(json.dumps({
    "headers": {
        "Content-Type": "application/json"
    },
    "status": return_status
}))
When I request localhost:8888/post/colin?sex=male&location=shanghai with age:18 in the body, the response is:
{
    "code": 0,
    "message": "",
    "data": {
        "name": "colin",
        "age": 18,
        "sex": "male",
        "location": "shanghai",
        "req": {
            "body": { "age": [ "18" ] },
            "args": { "sex": [ "male" ], "location": [ "shanghai" ] },
            "path": { "name": "colin" },
            "headers": {
                "Content-Type": "multipart/form-data; boundary=--------------------------981201125716045634129372",
                "Cache-Control": "no-cache",
                "Postman-Token": "ec0f5364-b0ea-4828-b987-c12f15573296",
                "User-Agent": "PostmanRuntime/7.6.0",
                "Accept": "*/*",
                "Host": "localhost:8888",
                "Accept-Encoding": "gzip, deflate",
                "Content-Length": "160",
                "Connection": "keep-alive"
            }
        }
    }
}
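The same request can be issued from code instead of Postman. Below is a minimal client sketch using only the standard library. `build_request` and `call_service` are hypothetical helper names, and the body is sent as application/x-www-form-urlencoded rather than the multipart/form-data used above, on the assumption that the gateway parses both into the same key-to-list structure.

```python
import json
import urllib.parse
import urllib.request


def build_request(name, age, sex, location, base_url="http://localhost:8888"):
    """Assemble the POST /post/:name request without sending it."""
    query = urllib.parse.urlencode({"sex": sex, "location": location})
    url = f"{base_url}/post/{urllib.parse.quote(name)}?{query}"
    body = urllib.parse.urlencode({"age": age}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


def call_service(request, timeout=10):
    """Send the request and decode the JSON body (needs a running gateway)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_request("colin", 18, "male", "shanghai")
print(request.get_method(), request.full_url)
# prints: POST http://localhost:8888/post/colin?sex=male&location=shanghai
```

With the gateway running, `call_service(request)` would return the decoded payload; it is not called here so the sketch runs without a server.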
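About response codes:
By default the response is 200 OK (with Content-Type: text/plain).
If a runtime error occurs, the response is 500 Internal Server Error.
If no route is found, the response is 404 Not Found.
If a route is found but the request method (GET, POST, etc.) does not match, the response is 405 Not Supported (the standard HTTP reason phrase for 405 is "Method Not Allowed").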
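Since the default is always 200 with text/plain, the ResponseInfo cell is the place to report business errors with a different status. A small sketch of that idea follows. The mapping from this post's return_code values to HTTP statuses is purely an assumption for illustration, and `response_info` is a hypothetical helper.

```python
import json

# Assumed mapping from the post's return_code values to HTTP statuses.
STATUS_BY_CODE = {0: 200, -1: 400, -2: 422}


def response_info(return_code):
    """Build the JSON string that a ResponseInfo cell would print."""
    status = STATUS_BY_CODE.get(return_code, 500)
    return json.dumps({
        "headers": {"Content-Type": "application/json"},
        "status": status,
    })


# In the notebook this print would sit in the "# ResponseInfo POST /post/:name" cell.
print(response_info(0))
```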
If the notebook contains no annotated cells, the gateway fails at startup:
✘ xjnotxj@jiangchengzhideMacBook-Pro ~/Program/PythonProject jupyter kernelgateway --KernelGatewayApp.api='kernel_gateway.notebook_http' --KernelGatewayApp.seed_uri='/Users/xjnotxj/Program/PythonProject/tuo.ipynb'
[KernelGatewayApp] Kernel started: bb13bcd6-514f-4682-b627-e6809cbb13ac
Traceback (most recent call last):
  File "/anaconda3/bin/jupyter-kernelgateway", line 11, in <module>
    sys.exit(launch_instance())
  File "/anaconda3/lib/python3.7/site-packages/jupyter_core/application.py", line 266, in launch_instance
    return super(JupyterApp, cls).launch_instance(argv=argv, **kwargs)
  File "/anaconda3/lib/python3.7/site-packages/traitlets/config/application.py", line 657, in launch_instance
    app.initialize(argv)
  File "/anaconda3/lib/python3.7/site-packages/kernel_gateway/gatewayapp.py", line 382, in initialize
    self.init_webapp()
  File "/anaconda3/lib/python3.7/site-packages/kernel_gateway/gatewayapp.py", line 449, in init_webapp
    handlers = self.personality.create_request_handlers()
  File "/anaconda3/lib/python3.7/site-packages/kernel_gateway/notebook_http/__init__.py", line 112, in create_request_handlers
    raise RuntimeError('No endpoints were discovered. Check your notebook to make sure your cells are annotated correctly.')
RuntimeError: No endpoints were discovered. Check your notebook to make sure your cells are annotated correctly.
✘ xjnotxj@jiangchengzhideMacBook-Pro ~/Program/PythonProject
[IPKernelApp] WARNING | Parent appears to have exited, shutting down.
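To catch this before launching, a notebook can be scanned for route annotations. The following is only a rough pre-flight check: the regular expression is my own approximation of the annotation format shown above (a verb and a path in a comment on the first line of a code cell), not the gateway's actual parser, and `find_endpoints` is a hypothetical helper.

```python
import json
import re

# Approximation of a route annotation such as "# POST /post/:name".
ANNOTATION = re.compile(r"^#\s*(GET|POST|PUT|DELETE)\s+(/\S*)", re.IGNORECASE)


def find_endpoints(notebook):
    """Return (verb, path) pairs for code cells whose first line is annotated."""
    endpoints = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        if isinstance(source, list):  # .ipynb files may store source as a list of lines
            source = "".join(source)
        lines = source.splitlines()
        match = ANNOTATION.match(lines[0].strip()) if lines else None
        if match:
            endpoints.append((match.group(1).upper(), match.group(2)))
    return endpoints


def find_endpoints_in_file(path):
    """Load an .ipynb file from disk and scan it."""
    with open(path, encoding="utf-8") as handle:
        return find_endpoints(json.load(handle))


sample = {"cells": [
    {"cell_type": "code", "source": ["# POST /post/:name\n", "req = json.loads(REQUEST)"]},
    {"cell_type": "code", "source": "# ResponseInfo POST /post/:name\nprint('...')"},
    {"cell_type": "code", "source": "import json"},
]}
print(find_endpoints(sample))
# prints: [('POST', '/post/:name')]
```

An empty result for the seed notebook suggests the startup error above is likely.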
Parameter values in args and body are arrays of length 1.
# note how the value is read
sex = req['args']['sex'][0]
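Because every value arrives wrapped in a list, a tiny accessor can save repeating the `[0]` and also cover a missing key. `first_value` is a hypothetical helper written for this post.

```python
def first_value(params, key, default=None):
    """Return the first element of params[key], or default if absent or empty."""
    values = params.get(key) or []
    return values[0] if values else default


args = {"sex": ["male"], "location": ["shanghai"]}
print(first_value(args, "sex"))             # prints: male
print(first_value(args, "missing", "n/a"))  # prints: n/a
```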
Step two is to use something like glue to join the different Data Science processing scripts together and call them in sequence.
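Why use papermill instead of calling the scripts directly?
(1) It standardizes how Jupyter files are invoked and how parameters are passed
(2) Executing a Jupyter file produces an output file, which makes it easy to trace back
(3) Context variables are stored per Jupyter file and do not interfere with each other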
https://github.com/nteract/papermill
pip install papermill
a.ipynb
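import papermill as pm

# data and multiple are injected by papermill as parameters
for i, item in enumerate(data):
    data[i] = item * multiple

# pm.record belongs to pre-1.0 papermill; later versions moved this feature to the separate scrapbook package
pm.record("data", data)
print(data)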
main.ipynb
data = [1, 2, 3]
data
import papermill as pm

# Can also be run from the command line; see the docs for details
pm.execute_notebook(
    'a.ipynb',
    'a_out.ipynb',
    parameters=dict(data=data, multiple=3)
)
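Since the goal of this step is to call several notebooks one after another, the single call can be wrapped in a loop. This is a sketch under assumptions: `run_pipeline` is a hypothetical helper, `b.ipynb` and its `threshold` parameter are made-up names, and the output naming (`x.ipynb` to `x_out.ipynb`) just follows the example above. A fake executor stands in for `pm.execute_notebook` so the sketch runs without papermill installed.

```python
import os


def run_pipeline(steps, execute=None):
    """Run (notebook, parameters) pairs in order; return the output paths."""
    if execute is None:
        # Imported lazily so the sketch can be loaded without papermill.
        import papermill as pm
        execute = pm.execute_notebook

    outputs = []
    for input_path, parameters in steps:
        stem, ext = os.path.splitext(input_path)
        output_path = f"{stem}_out{ext}"
        execute(input_path, output_path, parameters=parameters)
        outputs.append(output_path)
    return outputs


# Fake executor that only records calls, used here in place of papermill.
calls = []


def fake_execute(input_path, output_path, parameters=None):
    calls.append((input_path, output_path, parameters))


outputs = run_pipeline(
    [
        ("a.ipynb", {"data": [1, 2, 3], "multiple": 3}),
        ("b.ipynb", {"threshold": 5}),  # hypothetical second notebook
    ],
    execute=fake_execute,
)
print(outputs)
# prints: ['a_out.ipynb', 'b_out.ipynb']
```

Leaving out the `execute` argument makes the helper fall back to `pm.execute_notebook`.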
Papermill supports the following types of input and output paths:
(1) Local file system: local
(2) HTTP, HTTPS: http://, https://
(3) Amazon Web Services: AWS S3 s3://
(4) Azure: Azure DataLake Store, Azure Blob Store adl://, abs://
(5) Google Cloud: Google Cloud Storage gs://
After running main.ipynb:
1. A new file a_out.ipynb is generated (described further below)
2. Context variables are bound to a_out.ipynb:
re = pm.read_notebook('a_out.ipynb').dataframe
re
| | name | value | type | filename |
|---|---|---|---|---|
| 0 | data | [1, 2, 3] | parameter | a_out.ipynb |
| 1 | multiple | 3 | parameter | a_out.ipynb |
| 2 | data | [3, 6, 9] | record | a_out.ipynb |
Reading these values back is a little tedious, so I wrapped it in a function:
# getPMNotebookData args
#   [filename]      path to the .ipynb file
#   [field]         name of the variable to read
#   [default_value] value returned when nothing is found (default: None)
#   [_type]         'parameter' | 'record' (default)
def getPMNotebookData(filename, field, default_value=None, _type='record'):
    result = default_value
    try:
        re = pm.read_notebook(filename).dataframe
        # combine both conditions into a single boolean mask
        result = re[(re['name'] == field) & (re['type'] == _type)]["value"].values[0]
    except Exception:
        # fall back to default_value if the file or the field is missing
        pass
    return result

data = getPMNotebookData('a_out.ipynb', 'data', 0)
data  # [3, 6, 9]
a_out.ipynb
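This newly generated file contains two additional pieces of content:
1. A new cell holding the parameters we passed in is inserted automatically at the very top of the notebook (when a cell is tagged parameters, papermill places the new cell right after it instead)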
# Parameters
data = [1, 2, 3]
multiple = 3
2. The output of each cell
[3, 6, 9]
Passing a pandas DataFrame as a papermill parameter raises an error:
TypeError: Object of type DataFrame is not JSON serializable
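Solutions:
1. Serialize the DataFrame
DataFrame offers two ways to serialize, df.to_json() or df.to_csv(). For parsing and detailed usage see: https://github.com/nteract/papermill/issues/215
Drawback:
During serialization the data type of each DataFrame column is lost, so the types have to be specified again after reading the data back.
2. Do not send the DataFrame through papermill's parameter mechanism; hand it over through an intermediate CSV file instead [recommended]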
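A minimal sketch of the CSV hand-off, which also shows the type loss mentioned above. It assumes pandas is installed. The column names, the `data_path` parameter name and the temp-file location are illustrative assumptions, and the papermill call is left as a comment because it needs real notebooks.

```python
import os
import tempfile

import pandas as pd

# Example frame: "code" holds strings with leading zeros.
df = pd.DataFrame({"code": ["001", "002"], "score": [1.5, 2.0]})

# Parent notebook: write the frame to an intermediate file and pass only its path.
csv_path = os.path.join(tempfile.mkdtemp(), "frame.csv")
df.to_csv(csv_path, index=False)
# pm.execute_notebook('a.ipynb', 'a_out.ipynb', parameters=dict(data_path=csv_path))

# Child notebook, naive read: pandas infers integers, so the leading zeros are lost.
naive = pd.read_csv(csv_path)
print(naive["code"].tolist())     # prints: [1, 2]

# Child notebook, explicit dtypes: the original column types are restored.
restored = pd.read_csv(csv_path, dtype={"code": str, "score": "float64"})
print(restored["code"].tolist())  # prints: ['001', '002']
```

Passing a plain string path keeps the papermill parameters JSON serializable, which is what the TypeError above complains about.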
Step three is to package the finished microservice with Docker for deployment.
To be written...