Epoch 1/2000
Traceback (most recent call last):
File "multi_type_question_number.py", line 450, in <module>
train(model)
File "multi_type_question_number.py", line 281, in train
layers='heads')
File "/root/Mask_RCNN/mrcnn/model.py", line 2381, in train
use_multiprocessing=False,
File "/root/anaconda3/envs/dl/lib/python3.5/site-packages/keras/legacy/interfaces.py", line 91, in wrapper
return func(*args, **kwargs)
File "/root/anaconda3/envs/dl/lib/python3.5/site-packages/keras/engine/training.py", line 1418, in fit_generator
initial_epoch=initial_epoch)
File "/root/anaconda3/envs/dl/lib/python3.5/site-packages/keras/engine/training_generator.py", line 217, in fit_generator
class_weight=class_weight)
File "/root/anaconda3/envs/dl/lib/python3.5/site-packages/keras/engine/training.py", line 1211, in train_on_batch
class_weight=class_weight)
File "/root/anaconda3/envs/dl/lib/python3.5/site-packages/keras/engine/training.py", line 751, in _standardize_user_data
exception_prefix='input')
File "/root/anaconda3/envs/dl/lib/python3.5/site-packages/keras/engine/training_utils.py", line 138, in standardize_input_data
str(data_shape))
ValueError: Error when checking input: expected input_image_meta to have shape (16,) but got array with shape (22,)
Cause: the class count does not match. Fix: set the number of classes to 1 (background) + the number of object classes to detect.
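The two shapes in the error message can be tied back to the class count. In the Matterport Mask_RCNN code, the image meta vector appears to hold 12 fixed fields plus one active-class flag per class, so its length would be 12 + NUM_CLASSES. A minimal sketch of that arithmetic, assuming this layout (the helper names are hypothetical, not part of the library):

```python
# Sketch: length of the Mask R-CNN image meta vector, ASSUMING the layout
# image_id (1) + original_image_shape (3) + image_shape (3)
# + window (4) + scale (1) + active_class_ids (NUM_CLASSES).
FIXED_META_FIELDS = 1 + 3 + 3 + 4 + 1  # 12 fields that do not depend on classes


def image_meta_size(num_classes):
    """Meta vector length for a given class count (background included)."""
    return FIXED_META_FIELDS + num_classes


def classes_from_meta_size(meta_size):
    """Invert image_meta_size: recover the class count from a vector length."""
    return meta_size - FIXED_META_FIELDS


# Shapes taken from the ValueError above.
configured = classes_from_meta_size(16)  # what the model config implies
in_dataset = classes_from_meta_size(22)  # what the data generator produced
print("config NUM_CLASSES:", configured)
print("dataset classes (incl. background):", in_dataset)
```

Under this assumption the model was built for 4 classes while the dataset registered 10 (background included), which is consistent with the fix of setting the class count to 1 + the number of object classes.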
100/100 [==============================] - 261s 3s/step - loss: 1.6207 - rpn_class_loss: 0.1146 - rpn_bbox_loss: 0.4441 - mrcnn_class_loss: 0.2516 - mrcnn_bbox_loss: 0.4165 - mrcnn_mask_loss: 0.3940 - val_loss: 0.8447 - val_rpn_class_loss: 0.0514 - val_rpn_bbox_loss: 0.4179 - val_mrcnn_class_loss: 0.0822 - val_mrcnn_bbox_loss: 0.1607 - val_mrcnn_mask_loss: 0.1326
Traceback (most recent call last):
File "multi_type_question_number.py", line 450, in <module>
train(model)
File "multi_type_question_number.py", line 281, in train
layers='heads')
File "/root/Mask_RCNN/mrcnn/model.py", line 2381, in train
use_multiprocessing=False,
File "/root/anaconda3/envs/dl/lib/python3.5/site-packages/keras/legacy/interfaces.py", line 91, in wrapper
return func(*args, **kwargs)
File "/root/anaconda3/envs/dl/lib/python3.5/site-packages/keras/engine/training.py", line 1418, in fit_generator
initial_epoch=initial_epoch)
File "/root/anaconda3/envs/dl/lib/python3.5/site-packages/keras/engine/training_generator.py", line 251, in fit_generator
callbacks.on_epoch_end(epoch, epoch_logs)
File "/root/anaconda3/envs/dl/lib/python3.5/site-packages/keras/callbacks.py", line 79, in on_epoch_end
callback.on_epoch_end(epoch, logs)
File "/root/anaconda3/envs/dl/lib/python3.5/site-packages/keras/callbacks.py", line 446, in on_epoch_end
self.model.save(filepath, overwrite=True)
File "/root/anaconda3/envs/dl/lib/python3.5/site-packages/keras/engine/network.py", line 1090, in save
save_model(self, filepath, overwrite, include_optimizer)
File "/root/anaconda3/envs/dl/lib/python3.5/site-packages/keras/engine/saving.py", line 382, in save_model
_serialize_model(model, f, include_optimizer)
File "/root/anaconda3/envs/dl/lib/python3.5/site-packages/keras/engine/saving.py", line 83, in _serialize_model
model_config['config'] = model.get_config()
File "/root/anaconda3/envs/dl/lib/python3.5/site-packages/keras/engine/network.py", line 931, in get_config
return copy.deepcopy(config)
File "/root/anaconda3/envs/dl/lib/python3.5/copy.py", line 155, in deepcopy
y = copier(x, memo)
File "/root/anaconda3/envs/dl/lib/python3.5/copy.py", line 243, in _deepcopy_dict
y[deepcopy(key, memo)] = deepcopy(value, memo)
File "/root/anaconda3/envs/dl/lib/python3.5/copy.py", line 155, in deepcopy
y = copier(x, memo)
File "/root/anaconda3/envs/dl/lib/python3.5/copy.py", line 218, in _deepcopy_list
y.append(deepcopy(a, memo))
File "/root/anaconda3/envs/dl/lib/python3.5/copy.py", line 155, in deepcopy
y = copier(x, memo)
File "/root/anaconda3/envs/dl/lib/python3.5/copy.py", line 243, in _deepcopy_dict
y[deepcopy(key, memo)] = deepcopy(value, memo)
File "/root/anaconda3/envs/dl/lib/python3.5/copy.py", line 155, in deepcopy
y = copier(x, memo)
File "/root/anaconda3/envs/dl/lib/python3.5/copy.py", line 243, in _deepcopy_dict
y[deepcopy(key, memo)] = deepcopy(value, memo)
File "/root/anaconda3/envs/dl/lib/python3.5/copy.py", line 155, in deepcopy
y = copier(x, memo)
File "/root/anaconda3/envs/dl/lib/python3.5/copy.py", line 223, in _deepcopy_tuple
y = [deepcopy(a, memo) for a in x]
File "/root/anaconda3/envs/dl/lib/python3.5/copy.py", line 223, in <listcomp>
y = [deepcopy(a, memo) for a in x]
File "/root/anaconda3/envs/dl/lib/python3.5/copy.py", line 155, in deepcopy
y = copier(x, memo)
File "/root/anaconda3/envs/dl/lib/python3.5/copy.py", line 223, in _deepcopy_tuple
y = [deepcopy(a, memo) for a in x]
File "/root/anaconda3/envs/dl/lib/python3.5/copy.py", line 223, in <listcomp>
y = [deepcopy(a, memo) for a in x]
File "/root/anaconda3/envs/dl/lib/python3.5/copy.py", line 182, in deepcopy
y = _reconstruct(x, rv, 1, memo)
File "/root/anaconda3/envs/dl/lib/python3.5/copy.py", line 297, in _reconstruct
state = deepcopy(state, memo)
File "/root/anaconda3/envs/dl/lib/python3.5/copy.py", line 155, in deepcopy
y = copier(x, memo)
File "/root/anaconda3/envs/dl/lib/python3.5/copy.py", line 243, in _deepcopy_dict
y[deepcopy(key, memo)] = deepcopy(value, memo)
File "/root/anaconda3/envs/dl/lib/python3.5/copy.py", line 182, in deepcopy
y = _reconstruct(x, rv, 1, memo)
File "/root/anaconda3/envs/dl/lib/python3.5/copy.py", line 297, in _reconstruct
state = deepcopy(state, memo)
File "/root/anaconda3/envs/dl/lib/python3.5/copy.py", line 155, in deepcopy
y = copier(x, memo)
File "/root/anaconda3/envs/dl/lib/python3.5/copy.py", line 243, in _deepcopy_dict
y[deepcopy(key, memo)] = deepcopy(value, memo)
File "/root/anaconda3/envs/dl/lib/python3.5/copy.py", line 174, in deepcopy
rv = reductor(4)
TypeError: can't pickle SwigPyObject objects
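Fix:
The error comes from the model-saving parameters. In model.py, configure the checkpoint callback to save only the weights. Mask R-CNN does not support saving the full model here; the cause is unknown.
save_best_only=True, save_weights_only=True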
# Callbacks
callbacks = [
    # Save only the weights, and only when val_loss improves.
    keras.callbacks.ModelCheckpoint(self.checkpoint_path, monitor='val_loss',
                                    verbose=0, save_best_only=True, save_weights_only=True),
]
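The traceback above ends inside copy.deepcopy(config), reached from get_config() during a full-model save. A minimal sketch of that failure mode, using only the standard library: FakeSwigHandle is a hypothetical stand-in for an object that refuses to be pickled, not a real SWIG object, and the config dictionaries are made up for illustration.

```python
import copy


class FakeSwigHandle:
    """Hypothetical stand-in for an object that cannot be pickled or copied."""

    def __reduce_ex__(self, protocol):
        raise TypeError("can't pickle FakeSwigHandle objects")


def try_deepcopy(config):
    """Attempt the same deep copy that a full-model save performs on its config."""
    try:
        copy.deepcopy(config)
        return "ok"
    except TypeError as exc:
        return "failed: {}".format(exc)


# A config made of plain Python values copies without trouble.
plain_config = {"layers": [{"name": "conv1", "filters": 64}]}

# A config that holds an uncopyable object somewhere in its nesting does not.
bad_config = {"layers": [{"name": "lambda_1", "arguments": {"handle": FakeSwigHandle()}}]}

print(try_deepcopy(plain_config))
print(try_deepcopy(bad_config))
```

If the Mask R-CNN model config holds such an object somewhere in its layers, this would explain why saving weights only, which does not serialize the config, avoids the error.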
Epoch 33/2000
100/100 [==============================] - 231s 2s/step - loss: 0.9838 - rpn_class_loss: 0.0180 - rpn_bbox_loss: 0.2431 - mrcnn_class_loss: 0.1373 - mrcnn_bbox_loss: 0.2723 - mrcnn_mask_loss: 0.3130 - val_loss: 0.9683 - val_rpn_class_loss: 0.0417 - val_rpn_bbox_loss: 0.3798 - val_mrcnn_class_loss: 0.0751 - val_mrcnn_bbox_loss: 0.2362 - val_mrcnn_mask_loss: 0.2355
Epoch 34/2000
100/100 [==============================] - 215s 2s/step - loss: 0.8713 - rpn_class_loss: 0.0227 - rpn_bbox_loss: 0.1844 - mrcnn_class_loss: 0.1292 - mrcnn_bbox_loss: 0.2486 - mrcnn_mask_loss: 0.2863 - val_loss: 0.7546 - val_rpn_class_loss: 0.0275 - val_rpn_bbox_loss: 0.2965 - val_mrcnn_class_loss: 0.0583 - val_mrcnn_bbox_loss: 0.1699 - val_mrcnn_mask_loss: 0.2023
Epoch 35/2000
100/100 [==============================] - 221s 2s/step - loss: nan - rpn_class_loss: 0.6370 - rpn_bbox_loss: 0.4525 - mrcnn_class_loss: nan - mrcnn_bbox_loss: 0.0335 - mrcnn_mask_loss: 0.0907 - val_loss: nan - val_rpn_class_loss: 0.7072 - val_rpn_bbox_loss: 0.4715 - val_mrcnn_class_loss: nan - val_mrcnn_bbox_loss: 0.0000e+00 - val_mrcnn_mask_loss: 0.0000e+00
Epoch 36/2000
100/100 [==============================] - 219s 2s/step - loss: nan - rpn_class_loss: 0.7061 - rpn_bbox_loss: 0.4460 - mrcnn_class_loss: nan - mrcnn_bbox_loss: 0.0000e+00 - mrcnn_mask_loss: 0.0000e+00 - val_loss: nan - val_rpn_class_loss: 0.7072 - val_rpn_bbox_loss: 0.4431 - val_mrcnn_class_loss: nan - val_mrcnn_bbox_loss: 0.0000e+00 - val_mrcnn_mask_loss: 0.0000e+00
Epoch 37/2000
100/100 [==============================] - 209s 2s/step - loss: nan - rpn_class_loss: 0.7066 - rpn_bbox_loss: 0.4681 - mrcnn_class_loss: nan - mrcnn_bbox_loss: 0.0000e+00 - mrcnn_mask_loss: 0.0000e+00 - val_loss: nan - val_rpn_class_loss: 0.7073 - val_rpn_bbox_loss: 0.4084 - val_mrcnn_class_loss: nan - val_mrcnn_bbox_loss: 0.0000e+00 - val_mrcnn_mask_loss: 0.0000e+00
.....
Epoch 167/2000
100/100 [==============================] - 210s 2s/step - loss: nan - rpn_class_loss: 0.7062 - rpn_bbox_loss: 0.4489 - mrcnn_class_loss: nan - mrcnn_bbox_loss: 0.0000e+00 - mrcnn_mask_loss: 0.0000e+00 - val_loss: nan - val_rpn_class_loss: 0.7074 - val_rpn_bbox_loss: 0.5061 - val_mrcnn_class_loss: nan - val_mrcnn_bbox_loss: 0.0000e+00 - val_mrcnn_mask_loss: 0.0000e+00
Epoch 168/2000
24/100 [======>.......................] - ETA: 1:34 - loss: nan - rpn_class_loss: 0.7069 - rpn_bbox_loss: 0.4690 - mrcnn_class_loss: nan - mrcnn_bbox_loss: 0.0000e+00 - mrcnn_mask_loss: 0.0000e+00
mrcnn_class_loss: nan causes the total loss to become nan.
Fix: this is caused by class values in the training data that fall outside the defined classes.
It took me two days to get this code running on my own dataset. I think the guidance should include more details. 1. When using add_image() in the utils.Dataset class, the image_id values must be consecutive integers from 1 to some number, because image_id is used as the index of a list. Class numbers should also be consecutive integers from 1 to some number, or you will get a nan loss.
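One way to catch this before training is to scan the labels for IDs outside the expected range. A minimal sketch, assuming class ID 0 is reserved for background and object classes are numbered 1 to num_classes - 1; the function name and the input format (one list of class IDs per image) are hypothetical, not part of Mask_RCNN:

```python
def find_invalid_class_ids(class_ids_per_image, num_classes):
    """Return {image_index: [bad ids]} for labels outside 1..num_classes-1.

    ASSUMPTION: 0 is background and object classes are 1..num_classes-1.
    """
    invalid = {}
    for index, class_ids in enumerate(class_ids_per_image):
        bad = [cid for cid in class_ids if not 1 <= cid < num_classes]
        if bad:
            invalid[index] = bad
    return invalid


# Made-up labels: the third image uses class 7, but only 3 object classes exist.
labels = [[1, 2], [3], [1, 7], []]
print(find_invalid_class_ids(labels, num_classes=4))
```

An empty result means every label fits the configured class count; any entry points at an image whose annotations need fixing.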
rpn_class_loss: this is not the classification loss over all object classes, but the loss for classifying anchors as foreground or background.
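A foreground/background loss of this kind can be sketched in plain Python. This is an illustration, not the library's implementation: it assumes each anchor carries a match label (1 = foreground, -1 = background, 0 = neutral and ignored) and a pair of logits ordered (background, foreground).

```python
import math


def rpn_class_loss(rpn_match, rpn_class_logits):
    """Mean two-class cross-entropy over non-neutral anchors.

    ASSUMPTIONS: rpn_match uses 1 / -1 / 0 for foreground / background /
    neutral, and each logits pair is ordered (background, foreground).
    """
    losses = []
    for match, (bg_logit, fg_logit) in zip(rpn_match, rpn_class_logits):
        if match == 0:
            continue  # neutral anchors do not contribute to the loss
        # Numerically stable log-sum-exp of the two logits.
        top = max(bg_logit, fg_logit)
        log_norm = top + math.log(math.exp(bg_logit - top) + math.exp(fg_logit - top))
        target_logit = fg_logit if match == 1 else bg_logit
        losses.append(log_norm - target_logit)  # -log softmax(target)
    return sum(losses) / len(losses) if losses else 0.0


# Two undecided anchors (equal logits) and one neutral anchor that is skipped.
print(rpn_class_loss([1, -1, 0], [(0.0, 0.0), (0.0, 0.0), (5.0, -5.0)]))
```

Whatever the number of object classes in the dataset, this loss only ever distinguishes two outcomes per anchor.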
Go to the training directory you created.
Monitoring training: run the Jupyter Notebook and TensorBoard services in the background on the remote ECS server:
nohup jupyter notebook --ip 0.0.0.0 --no-browser --allow-root > jupyter.log &
Monitoring the loss with TensorBoard: nohup tensorboard --logdir=./logs > tensorboard.log &
Tensorboard could not bind to unsupported address family
This is an IPv6 issue, probably because IPv4 is not the default. Specifying the host as 0.0.0.0 fixes it: nohup tensorboard --logdir=./logs --host=0.0.0.0 > tensorboard.log &
iptables -A INPUT -p tcp --dport 6006 -j ACCEPT
iptables -A OUTPUT -p tcp --sport 6006 -j ACCEPT
iptables -A INPUT -p tcp --dport 8888 -j ACCEPT
iptables -A OUTPUT -p tcp --sport 8888 -j ACCEPT
service iptables restart
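After the services are started and the ports opened, it can help to confirm that something is actually listening. A minimal sketch using only the standard library; port_is_open is a hypothetical helper, and the host shown is an assumption to be replaced with the server's address when checking from another machine:

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


# 6006 is the TensorBoard port and 8888 the Jupyter port used above.
for port in (6006, 8888):
    state = "open" if port_is_open("127.0.0.1", port) else "closed"
    print("port {}: {}".format(port, state))
```

A port that is open locally but closed from outside points at the firewall or the cloud provider's security rules rather than at the service itself.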
In the train method of model.py, turn off use_multiprocessing and set workers to 1, and that takes care of it.
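It is worth noting that in deep learning, deep architectures are commonly trained with mini-batch stochastic gradient descent (SGD). One advantage is that it does not need to traverse all the samples, which is very effective when the dataset is very large. In that case, an epoch can be defined according to the problem at hand. For example, if 10,000 iterations are defined as 1 epoch and the batch size of each iteration is 256, then 1 epoch amounts to processing 2,560,000 training samples.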
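The arithmetic behind that example is simply iterations times batch size. A minimal sketch (the function name is hypothetical):

```python
def samples_per_epoch(iterations_per_epoch, batch_size):
    """Number of training samples processed in one epoch defined by iteration count."""
    return iterations_per_epoch * batch_size


# The example from the text: 10,000 iterations with a batch size of 256.
print(samples_per_epoch(10000, 256))
```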