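VisualPytorch is published under the following domain names, served from two servers:
http://nag.visualpytorch.top/static/ (maps to 114.115.148.27)
http://visualpytorch.top/static/ (maps to 39.97.209.22)
Jointly contributed document collecting common PyTorch errors and pitfalls: 《PyTorch常見報錯/坑彙總》 (Common PyTorch Errors and Pitfalls)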
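    import torch

    net = LeNet2(classes=2019)

    # Method 1: save the entire Module, i.e. its structure as well as its parameters
    torch.save(net, path)
    # Network name, structure and model parameters are all preserved.
    # (From PyTorch 2.6 on, unpickling a whole Module needs weights_only=False.)
    net_load = torch.load(path)

    # Method 2: save only the model parameters (recommended, uses fewer resources)
    state_dict = net.state_dict()
    torch.save(state_dict, path)
    state_dict_load = torch.load(path)
    net_new = LeNet2(classes=2019)
    net_new.load_state_dict(state_dict_load)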
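As a self-contained check of the second method, the sketch below round-trips a state_dict through a temporary file. The TinyNet class and the file name are hypothetical stand-ins for LeNet2 and path; they are assumptions made for illustration, not part of the original script.

```python
import os
import tempfile

import torch
import torch.nn as nn


class TinyNet(nn.Module):
    """Hypothetical stand-in for LeNet2: one linear layer is enough to show the idea."""

    def __init__(self, classes):
        super().__init__()
        self.fc = nn.Linear(8, classes)

    def forward(self, x):
        return self.fc(x)


net = TinyNet(classes=3)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "tiny_state_dict.pkl")  # hypothetical file name

    # Save only the parameters (an ordered mapping from names to tensors).
    torch.save(net.state_dict(), path)

    # Loading needs the class definition: build the structure, then fill it in.
    state_dict_load = torch.load(path)
    net_new = TinyNet(classes=3)
    net_new.load_state_dict(state_dict_load)

# Both networks now give identical outputs for the same input.
x = torch.randn(2, 8)
same_output = torch.equal(net(x), net_new(x))
print(sorted(state_dict_load.keys()), same_output)
```

Because only tensors are stored, the file does not depend on how the class was pickled, which is one reason this method tends to be the more portable of the two.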
Saving:
    checkpoint = {
        "model_state_dict": net.state_dict(),
        "optimizer_state_dict": optimizer.state_dict(),
        "epoch": epoch,
    }
    path_checkpoint = "./checkpoint_{}_epoch.pkl".format(epoch)
    torch.save(checkpoint, path_checkpoint)
Restoring:
    # ============================ step 2/5 model ============================
    net = LeNet(classes=2)
    net.initialize_weights()

    # ============================ step 3/5 loss function ============================
    criterion = nn.CrossEntropyLoss()  # choose the loss function

    # ============================ step 4/5 optimizer ============================
    optimizer = optim.SGD(net.parameters(), lr=LR, momentum=0.9)  # choose the optimizer
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=6, gamma=0.1)  # learning-rate decay policy

    # ============================ step 5+/5 resume from checkpoint ============================
    path_checkpoint = "./checkpoint_4_epoch.pkl"
    checkpoint = torch.load(path_checkpoint)
    net.load_state_dict(checkpoint['model_state_dict'])
    optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
    scheduler.last_epoch = checkpoint['epoch']
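The save and restore halves can be exercised together in one small script. The following is a minimal sketch, assuming a toy linear model, random data and a hypothetical MAX_EPOCH; it also assumes the saved epoch had finished, so training resumes at the next one. The scheduler is left out to keep the example short.

```python
import os
import tempfile

import torch
import torch.nn as nn
import torch.optim as optim

MAX_EPOCH = 5  # hypothetical total number of epochs


def build():
    """Fresh model and optimizer, as a restarted script would create them."""
    net = nn.Linear(4, 2)
    optimizer = optim.SGD(net.parameters(), lr=0.1, momentum=0.9)
    return net, optimizer


def train_one_epoch(net, optimizer):
    """A single dummy optimisation step standing in for a real epoch."""
    x, y = torch.randn(8, 4), torch.randint(0, 2, (8,))
    loss = nn.functional.cross_entropy(net(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()


with tempfile.TemporaryDirectory() as tmp_dir:
    # First run: train epochs 0..2, save a checkpoint, then "crash".
    net, optimizer = build()
    for epoch in range(3):
        train_one_epoch(net, optimizer)
    path_checkpoint = os.path.join(tmp_dir, "checkpoint_{}_epoch.pkl".format(epoch))
    torch.save({"model_state_dict": net.state_dict(),
                "optimizer_state_dict": optimizer.state_dict(),
                "epoch": epoch}, path_checkpoint)

    # Second run: rebuild the objects, then restore their state.
    net_resumed, optimizer_resumed = build()
    checkpoint = torch.load(path_checkpoint)
    net_resumed.load_state_dict(checkpoint["model_state_dict"])
    optimizer_resumed.load_state_dict(checkpoint["optimizer_state_dict"])
    start_epoch = checkpoint["epoch"] + 1  # assumption: the saved epoch had finished

weights_restored = torch.equal(net.weight, net_resumed.weight)

# Continue where the first run stopped.
resumed_epochs = []
for epoch in range(start_epoch, MAX_EPOCH):
    train_one_epoch(net_resumed, optimizer_resumed)
    resumed_epochs.append(epoch)
print(start_epoch, resumed_epochs)
```

Restoring the optimizer state matters with momentum: the momentum buffers are part of optimizer.state_dict(), so a resumed run continues with the same update direction instead of starting cold.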
Transfer learning: a branch of machine learning that studies how knowledge from a source domain can be applied to a target domain.
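Model fine-tuning steps: build the model, load the pretrained parameters, and replace the output (fc) layer.
Fine-tuning training methods: freeze the pretrained parameters, or train the conv layers with a smaller learning rate than the fc layer.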
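The example fine-tunes ResNet-18 on an ants-vs-bees binary classification task with 114 training images and 80 test images. The training set is clearly quite small, so an already-trained model is essential.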
The model is built as follows:
    # ============================ step 2/5 model ============================
    # 1/3 build the model
    resnet18_ft = models.resnet18()

    # 2/3 load the pretrained parameters !!!
    path_pretrained_model = "resnet18-5c106cde.pth"
    state_dict_load = torch.load(path_pretrained_model)
    resnet18_ft.load_state_dict(state_dict_load)

    # Method 1: freeze the conv layers so their parameters are no longer updated
    for param in resnet18_ft.parameters():
        param.requires_grad = False

    # 3/3 replace the fc layer, changing the number of output neurons to classes = 2 !!!
    # (the new layer's parameters have requires_grad=True by default, so it is still trained)
    num_ftrs = resnet18_ft.fc.in_features
    resnet18_ft.fc = nn.Linear(num_ftrs, classes)
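Without loading the pretrained parameters, accuracy is still only 70% after 25 epochs of training, and the loss settles at around 0.5. With the parameters loaded, accuracy reaches 90% by roughly the second epoch.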
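The effect of the freezing method in the model code above can be checked without downloading any weights. This is a minimal sketch on a hypothetical ToyBackbone class that stands in for resnet18; the layer sizes and the random batch are assumptions chosen only to keep the example fast.

```python
import torch
import torch.nn as nn
import torch.optim as optim


class ToyBackbone(nn.Module):
    """Hypothetical stand-in for resnet18: a conv feature extractor plus an fc head."""

    def __init__(self, num_outputs=1000):
        super().__init__()
        self.conv = nn.Conv2d(3, 4, kernel_size=3, padding=1)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(4, num_outputs)

    def forward(self, x):
        x = self.pool(torch.relu(self.conv(x))).flatten(1)
        return self.fc(x)


classes = 2
model = ToyBackbone()  # imagine pretrained weights were loaded here

# Freeze every existing parameter ...
for param in model.parameters():
    param.requires_grad = False

# ... then replace the head; a freshly created layer is trainable by default.
model.fc = nn.Linear(model.fc.in_features, classes)

trainable = [name for name, p in model.named_parameters() if p.requires_grad]

# One training step: only the new fc layer should change.
conv_before = model.conv.weight.clone()
fc_before = model.fc.weight.clone()
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
inputs = torch.randn(4, 3, 8, 8)
labels = torch.tensor([0, 1, 0, 1])
loss = nn.functional.cross_entropy(model(inputs), labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()

conv_unchanged = torch.equal(conv_before, model.conv.weight)
fc_changed = not torch.equal(fc_before, model.fc.weight)
print(trainable, conv_unchanged, fc_changed)
```

The order matters: freezing after the fc layer has been replaced would freeze the new head as well and leave nothing to train.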
A more recommended way is to control the learning rate through the optimizer: parameter groups allow flexible, per-group control of the LR.
    # ============================ step 4/5 optimizer ============================
    # Method 2: small learning rate for the conv layers
    fc_params_id = list(map(id, resnet18_ft.fc.parameters()))  # ids (memory addresses) of the fc parameters
    base_params = filter(lambda p: id(p) not in fc_params_id, resnet18_ft.parameters())
    optimizer = optim.SGD([
        {'params': base_params, 'lr': LR*0.1},  # group 0
        {'params': resnet18_ft.fc.parameters(), 'lr': LR}],
        momentum=0.9)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=lr_decay_step, gamma=0.1)  # learning-rate decay policy
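To see what the optimizer ends up holding, the grouping can be reproduced on a small model. The sketch below assumes a hypothetical two-part model built from nn.ModuleDict and a made-up base LR; it only inspects optimizer.param_groups and does no real training.

```python
import torch.nn as nn
import torch.optim as optim

LR = 0.01  # hypothetical base learning rate

# Hypothetical two-part model: "features" plays the conv layers, "fc" the new head.
model = nn.ModuleDict({
    "features": nn.Linear(8, 4),
    "fc": nn.Linear(4, 2),
})

fc_params_id = set(map(id, model["fc"].parameters()))
# A list (rather than a one-shot filter object) can be inspected and reused safely.
base_params = [p for p in model.parameters() if id(p) not in fc_params_id]

optimizer = optim.SGD([
    {"params": base_params, "lr": LR * 0.1},         # group 0: pretrained layers
    {"params": model["fc"].parameters(), "lr": LR},  # group 1: new fc layer
], momentum=0.9)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.1)

lrs_start = [group["lr"] for group in optimizer.param_groups]
group_sizes = [len(group["params"]) for group in optimizer.param_groups]

optimizer.step()  # no gradients exist yet, so nothing is updated; keeps the usual call order
scheduler.step()  # StepLR scales every group by gamma, so the 1:10 ratio is preserved
lrs_after = [group["lr"] for group in optimizer.param_groups]
print(lrs_start, lrs_after, group_sizes)
```

Setting a group's lr to 0 would behave like freezing for plain SGD, but the gradients would still be computed, so requires_grad = False is the cheaper way to freeze.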
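CPU (Central Processing Unit): consists mainly of a control unit and an arithmetic unit.
GPU (Graphics Processing Unit): handles uniform, dependency-free, large-scale data computation.
During model training/testing, all data and model parameters must be on the same device. Also note that calling to() on data (a tensor) is not an in-place operation.
In an experiment with the ResNet-18 transfer learning above, one training epoch took 58.362s without a GPU and only 6.626s with one!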
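A common way to satisfy the same-device requirement is to pick the device once and move both the model and every batch to it. The following is a minimal sketch with a toy linear model and random data (both assumptions); it falls back to the CPU when no GPU is visible.

```python
import torch
import torch.nn as nn

# Fall back to the CPU when no GPU is visible, so the same script runs anywhere.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

net = nn.Linear(4, 2)
net.to(device)              # modules are moved in place

inputs = torch.randn(8, 4)
labels = torch.randint(0, 2, (8,))
inputs = inputs.to(device)  # tensors are not: keep the returned tensor
labels = labels.to(device)

outputs = net(inputs)
loss = nn.functional.cross_entropy(outputs, labels)
print(device, outputs.device, loss.item())
```

Forgetting the reassignment on the tensor lines is the usual cause of "expected all tensors to be on the same device" errors on GPU machines.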
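Difference: to() is not in place for tensors, but it is in place for models.
    x = torch.ones((3, 3))
    x = x.to(torch.float64)  # tensor dtype

    x = torch.ones((3, 3))
    x = x.to("cuda")  # tensor device

    linear = nn.Linear(2, 2)
    linear.to(torch.double)  # module dtype; converted in place, the module object itself stays the same

    gpu1 = torch.device("cuda")
    linear.to(gpu1)  # module device; moved in place, the module object itself stays the same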
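The distinction can be verified on a CPU-only machine by converting dtypes instead of devices. This sketch only compares object identities and dtypes; the variable names are illustrative.

```python
import torch
import torch.nn as nn

# Tensor: to() returns a new tensor and leaves the original untouched.
x = torch.ones((3, 3))
x_double = x.to(torch.float64)
tensor_is_new_object = x_double is not x
original_dtype = x.dtype  # x itself is still float32

# Module: to() converts the parameters of this very module and returns it.
linear = nn.Linear(2, 2)
returned = linear.to(torch.double)
module_is_same_object = returned is linear
module_dtype = linear.weight.dtype

print(tensor_is_new_object, original_dtype, module_is_same_object, module_dtype)
```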
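    torch.nn.DataParallel(module,              # the model to wrap and distribute
                          device_ids=None,     # GPUs to distribute to; defaults to all visible, available GPUs
                          output_device=None,  # device on which the results are gathered
                          dim=0)
Purpose: wraps a model so that each input batch is split across the GPUs and processed in parallel.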
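A short usage sketch follows, assuming a toy nn.Sequential model and a random batch. On a machine without GPUs the wrapper simply calls the inner module, so the script still runs there; the point to notice is the "module." prefix that the wrapper adds to every state_dict key.

```python
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

net = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
net = nn.DataParallel(net)  # device_ids=None: use every visible GPU
net.to(device)              # parameters must live on the first device in device_ids

batch = torch.randn(16, 4).to(device)  # split along dim=0 across the GPUs
outputs = net(batch)                   # gathered again on output_device

wrapped_keys = list(net.state_dict().keys())
plain_keys = list(net.module.state_dict().keys())
print(tuple(outputs.shape), wrapped_keys[0], plain_keys[0])
```

Saving net.module.state_dict() instead of net.state_dict() is one way to keep the prefix out of checkpoint files in the first place.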
Querying the free memory of each GPU:
    import os
    import numpy as np

    def get_gpu_memory():
        os.system('nvidia-smi -q -d Memory | grep -A4 GPU | grep Free > tmp.txt')
        with open('tmp.txt', 'r') as f:
            memory_gpu = [int(x.split()[2]) for x in f.readlines()]
        os.system('rm tmp.txt')
        return memory_gpu

    gpu_memory = get_gpu_memory()
    gpu_list = np.argsort(gpu_memory)[::-1]  # GPU indices, most free memory first
    gpu_list_str = ','.join(map(str, gpu_list))
    os.environ.setdefault("CUDA_VISIBLE_DEVICES", gpu_list_str)
    print("\ngpu free memory: {}".format(gpu_memory))
    print("CUDA_VISIBLE_DEVICES :{}".format(os.environ["CUDA_VISIBLE_DEVICES"]))

    >>> gpu free memory: [10362, 10058, 9990, 9990]
    >>> CUDA_VISIBLE_DEVICES :0,1,3,2
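The same query can be written without the temporary file and the grep pipeline. This is a sketch of an alternative, not the author's method: it uses nvidia-smi's --query-gpu CSV output through subprocess, and the helper names (parse_free_memory, rank_gpus, query_free_memory) are hypothetical. The demonstration at the bottom runs offline on the free-memory numbers printed above.

```python
import shutil
import subprocess


def parse_free_memory(csv_text):
    """Turn one-number-per-line text into a list of ints (MiB per GPU)."""
    return [int(line) for line in csv_text.splitlines() if line.strip()]


def rank_gpus(free_memory):
    """GPU indices sorted by free memory, largest first (ties keep index order)."""
    return sorted(range(len(free_memory)), key=lambda i: -free_memory[i])


def query_free_memory():
    """Ask nvidia-smi for free memory per GPU; returns [] when it is not installed."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.free", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True)
    return parse_free_memory(result.stdout)


# Offline demonstration on the numbers printed above.
sample = "10362\n10058\n9990\n9990\n"
ranking = rank_gpus(parse_free_memory(sample))
visible_devices = ",".join(map(str, ranking))
print(visible_devices)  # 0,1,2,3
```

The two GPUs tied at 9990 come out in index order here, whereas the reversed argsort above lists them as 3,2; either order is valid. In both versions CUDA_VISIBLE_DEVICES has to be set before CUDA is first initialized to have any effect.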
Loading GPU-trained models:
Running GPU code on a machine without a GPU:
RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False...
Fix: torch.load(path_state_dict, map_location="cpu")
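A runnable sketch of this fix, assuming a toy linear model and an in-memory buffer in place of path_state_dict: the state_dict is saved from whatever device is available and then loaded with every tensor remapped to the CPU.

```python
import io

import torch
import torch.nn as nn

# Save from whatever device is available (a GPU if there is one) ...
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
net = nn.Linear(4, 2).to(device)
buffer = io.BytesIO()  # in-memory stand-in for path_state_dict
torch.save(net.state_dict(), buffer)

# ... and load with every tensor remapped to the CPU.
buffer.seek(0)
state_dict_load = torch.load(buffer, map_location="cpu")
devices = {name: tensor.device.type for name, tensor in state_dict_load.items()}

net_cpu = nn.Linear(4, 2)
net_cpu.load_state_dict(state_dict_load)
print(devices)
```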
Loading parameters from a multi-GPU-trained model on a single-GPU machine (the parameter keys will contain the prefix "module."):
RuntimeError: Error(s) in loading state_dict for FooNet: Missing key(s) in state_dict: "linears.0.weight", "linears.1.weight", "linears.2.weight". Unexpected key(s) in state_dict: "module.linears.0.weight", "module.linears.1.weight", "module.linears.2.weight".
Fix:
    from collections import OrderedDict

    new_state_dict = OrderedDict()
    for k, v in state_dict_load.items():
        namekey = k[7:] if k.startswith('module.') else k  # strip the 7-character "module." prefix
        new_state_dict[namekey] = v
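The renaming loop can be tried end to end by simulating a multi-GPU checkpoint. In this sketch make_net is a hypothetical small network (the original FooNet is not shown in the text) and strip_module_prefix is an illustrative helper wrapping the same loop; wrapping a model in nn.DataParallel is used only to produce keys with the "module." prefix.

```python
from collections import OrderedDict

import torch
import torch.nn as nn


def strip_module_prefix(state_dict, prefix="module."):
    """Return a copy of state_dict with a leading prefix removed from each key."""
    new_state_dict = OrderedDict()
    for key, value in state_dict.items():
        name = key[len(prefix):] if key.startswith(prefix) else key
        new_state_dict[name] = value
    return new_state_dict


def make_net():
    """Hypothetical small network standing in for FooNet."""
    return nn.Sequential(nn.Linear(3, 3), nn.ReLU(), nn.Linear(3, 3))


# Simulate a multi-GPU checkpoint: DataParallel adds "module." to every key.
state_dict_load = nn.DataParallel(make_net()).state_dict()

net = make_net()
new_state_dict = strip_module_prefix(state_dict_load)
net.load_state_dict(new_state_dict)  # strict by default: raises on any key mismatch
print(list(state_dict_load)[0], "->", list(new_state_dict)[0])
```

Keys without the prefix pass through unchanged, so the helper is safe to apply to checkpoints from single-GPU runs as well.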