[Python Debug] Kernel Crash While Running Neural Network with Keras | Why the Jupyter Notebook kernel dies when running Keras, and how to fix it

Date: 2019-11-20


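I've been working on a Machine Learning assignment lately that requires building a neural network with Keras in Jupyter Notebook. I couldn't even get the simplest single-layer network to run. Stranger still, a first pass on the iris dataset worked without any problem, but as soon as I ran the fashion MNIST data the instructor gave us, the notebook reported that the kernel had died and would restart. Strangest of all, the same code ran fine on a classmate's computer, so for a while I thought my aging MacBook was simply underpowered, and I nearly bought a new one >_<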

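In class today, after a few rounds of debugging, my ML professor solved it completely. A true CMU wizard! (A big shout-out to the Prof here, even though he can't read Chinese ><) I only started learning Python recently and am still far from fluent, so I picked up quite a few new skills along the way ✌️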

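The complete code that triggered the problem is below. It implements logistic regression in Keras as a simple one-layer network, but every time execution reached the last line the server died and the kernel restarted.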

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA, FastICA
from sklearn.linear_model import LogisticRegression
from keras.models import Sequential
from keras.layers import Dense, Activation, Conv2D
from keras.utils import to_categorical
from keras.datasets import fashion_mnist

(x3_train, y_train), (x3_test, y_test) = fashion_mnist.load_data()
n_classes = np.max(y_train) + 1

# Vectorize image arrays, since most methods expect this format
x_train = x3_train.reshape(x3_train.shape[0], np.prod(x3_train.shape[1:]))
x_test = x3_test.reshape(x3_test.shape[0], np.prod(x3_test.shape[1:]))

# Binary vector representation of targets (for one-hot or multinomial output networks)
y3_train = to_categorical(y_train)
y3_test = to_categorical(y_test)

from sklearn import preprocessing
scaler = preprocessing.StandardScaler()
x_train_scaled = scaler.fit_transform(x_train)
# Reuse the statistics fitted on the training set; do not refit on the test set
x_test_scaled = scaler.transform(x_test)

n_output = y3_train.shape[1]
n_input = x_train_scaled.shape[1]

nn_lr = Sequential()
nn_lr.add(Dense(units=n_output, input_dim=n_input, activation='softmax'))
nn_lr.compile(optimizer='sgd', loss='categorical_crossentropy', metrics=['accuracy'])

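Because Jupyter Notebook just kept restarting the kernel without showing any error message, there was nothing to go on. The professor pointed out that the terminal that opens when Jupyter Notebook is launched logs what is happening (a first-time discovery for this newbie...), including the details of the kernel being interrupted and restarted and the reason why: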

[I 22:11:54.603 NotebookApp] Kernel interrupted: 7e7f6646-97b0-4ec7-951c-1dce783f60c4

[I 22:13:49.160 NotebookApp] Saving file at /Documents/[Rutgers]Study/2019Spring/MACHINE LEARNING W APPLCTN LARGE DATASET/hw/Untitled1.ipynb

2019-03-28 22:13:49.829246: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA

2019-03-28 22:13:49.829534: I tensorflow/core/common_runtime/process_util.cc:69] Creating new thread pool with default inter op setting: 4. Tune using inter_op_parallelism_threads for best performance.

OMP: Error #15: Initializing libiomp5.dylib, but found libiomp5.dylib already initialized.

OMP: Hint: This means that multiple copies of the OpenMP runtime have been linked into the program. That is dangerous, since it can degrade performance or cause incorrect results. The best thing to do is to ensure that only a single OpenMP runtime is linked into the process, e.g. by avoiding static linking of the OpenMP runtime in any library. As an unsafe, unsupported, undocumented workaround you can set the environment variable KMP_DUPLICATE_LIB_OK=TRUE to allow the program to continue to execute, but that may cause crashes or silently produce incorrect results. For more information, please see http://www.intel.com/software/products/support/.

[I 22:13:51.049 NotebookApp] KernelRestarter: restarting kernel (1/5), keep random ports

kernel c1114f5a-3829-432f-a26a-c2db6c330352 restarted

Another option is to paste the code into ipython, which produces similar output. Either way, the error comes down to:

OMP: Error #15: Initializing libiomp5.dylib, but found libiomp5.dylib already initialized.
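
A third way to surface the message, sketched here under the assumption that the failing code can be handed to a separate process: run it in a child Python interpreter and capture stderr, which is where the OpenMP runtime prints its error. The helper name `run_and_capture` is made up for illustration, and the child process in this demo only imitates the crash by writing the same text to stderr.

```python
import subprocess
import sys


def run_and_capture(code):
    """Run `code` in a fresh Python process; return (exit code, stderr text)."""
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,  # requires Python 3.7+
        text=True,
    )
    return result.returncode, result.stderr


# Stand-in for the real notebook code: it only imitates the crash by
# printing the OMP message to stderr and exiting with a non-zero status.
fake_crash = (
    "import sys\n"
    "sys.stderr.write('OMP: Error #15: Initializing libiomp5.dylib, "
    "but found libiomp5.dylib already initialized.\\n')\n"
    "sys.exit(1)\n"
)

returncode, stderr_text = run_and_capture(fake_crash)
omp_lines = [line for line in stderr_text.splitlines() if line.startswith("OMP:")]
print(returncode, omp_lines)
```

With the real code in place of `fake_crash`, a non-zero exit status together with lines starting with `OMP:` would point at the same conflict without having to dig through the notebook server log.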

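I googled it and found a very detailed discussion thread on GitHub. The original poster hit the problem while running XGBoost, which reminded me that installing XGBoost over the winter break had been a rather convoluted process. I may have accidentally downloaded some file twice into different paths, causing a conflict when the program loads its packages. The thread offers several possible causes and fixes: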

1. Uninstall clang-omp

brew uninstall libiomp clang-omp

as long as u got gcc v5 from brew it come with openmp

follow steps in:
https://github.com/dmlc/xgboost/tree/master/python-package

I tried uninstalling xgboost, reinstalling it, and then uninstalling clang-omp, which produced this error:

No such keg: /usr/local/Cellar/libiomp

pip uninstall xgboost
pip install xgboost
brew uninstall libiomp clang-omp

2. Run the following directly in the Jupyter notebook:

# DANGER! DANGER!
import os
os.environ['KMP_DUPLICATE_LIB_OK']='True'

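The professor said this setting makes the system ignore the package conflict and pick one copy to use on its own. I tried it and it does work, but it is a very dangerous approach and strongly discouraged! As the OMP hint itself warns, it may cause crashes or silently produce incorrect results.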

3. Find the duplicate libiomp5.dylib files and delete one of them

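In Finder I did find two files, one in ~/anaconda3/lib and the other in ~/anaconda3/lib/python3.6/site-packages/_solib_darwin/_U@mkl_Udarwin_S_S_Cmkl_Ulibs_Udarwin___Uexternal_Smkl_Udarwin_Slib (????). I wasn't sure which one to delete, though, and this approach also felt rather risky: delete the wrong one and nothing runs at all.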
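
The same search can be done from Python instead of Finder. This is only a sketch: the helper name `find_library_copies` is made up, and `~/anaconda3` is assumed as the install location because that is where the two copies above were found. It lists the files and deletes nothing.

```python
from pathlib import Path


def find_library_copies(root, filename="libiomp5.dylib"):
    """Return every file under `root` whose name equals `filename`, sorted."""
    root = Path(root).expanduser()
    if not root.is_dir():
        return []  # nothing to search, e.g. Anaconda is installed elsewhere
    return sorted(p for p in root.rglob(filename) if p.is_file())


if __name__ == "__main__":
    # Assumed install location; adjust to wherever Anaconda actually lives.
    for path in find_library_copies("~/anaconda3"):
        print(path)
```

More than one line of output means more than one copy of the OpenMP runtime is available to be loaded, which matches the situation the OMP hint describes.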

4. OpenMP conflict

Hint: This means that multiple copies of the OpenMP runtime have been linked into the program

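Following the Hint in the message, I searched for TensorFlow and OpenMP. OpenMP is a platform for multithreaded parallel programming; TensorFlow appears to have its own parallel computing architecture and does not use OpenMP (see https://github.com/tensorflow/tensorflow/issues/12434).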

5. Install nomkl

I had the same error on my Mac with a python program using numpy, keras, and matplotlib. I solved it with 'conda install nomkl'.

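This is what finally worked! MKL stands for Math Kernel Library, a module developed by Intel to speed up math operations, and packages installed through conda use MKL automatically; nomkl is the package that opts out of it. Anaconda's official documentation on MKL Optimization has more details.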

To opt out, run conda install nomkl and then use conda install to install packages that would normally include MKL or depend on packages that include MKL, such as scipy, numpy, and pandas.
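
To see which of the two a given conda environment currently has, one option is to read the package records conda keeps in the environment's `conda-meta` directory. A rough sketch, assuming the usual layout in which each installed package has a JSON record with `name` and `version` fields; the helper name `mkl_related_packages` is made up here.

```python
import json
import sys
from pathlib import Path


def mkl_related_packages(prefix=sys.prefix):
    """Map name -> version for installed conda packages named mkl* or nomkl."""
    meta_dir = Path(prefix) / "conda-meta"
    found = {}
    if not meta_dir.is_dir():  # not a conda environment
        return found
    for record in meta_dir.glob("*.json"):
        try:
            info = json.loads(record.read_text(encoding="utf-8"))
        except (OSError, ValueError):
            continue  # skip unreadable or malformed records
        if not isinstance(info, dict):
            continue
        name = info.get("name", "")
        if name.startswith("mkl") or name == "nomkl":
            found[name] = info.get("version")
    return found


print(mkl_related_packages())
```

An empty result means either that no MKL-related package is installed or that the interpreter is not running inside a conda environment.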

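Perhaps a conflict crept in when packages such as numpy were updated. After installing nomkl the problem was, amazingly, gone. I later also tried uninstalling MKL, and the program still ran fine. The uninstall command is as follows: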