case7: Lymphoma Subtype Classification Experiment Log

Introduction

Classification task: 3-class (identifying three sub-types of lymphoma: Chronic Lymphocytic Leukemia (CLL), Follicular Lymphoma (FL), and Mantle Cell Lymphoma (MCL))
Network model: AlexNet
Dataset: original images are 1388*1040, 374 in total, 1.4G. CLL: 113, FL: 139, MCL: 122


Experiment workflow

Preparation
The caffe environment is configured; the dataset and code have been downloaded.

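  • Cut the large images into small patches. Code: step1_make_patches.m.
    Only the paths in the code need changing, so check them carefully. For convenience, put the dataset in the same directory as the .m file.
    Before that, to match the image naming used by the tutorial's dataset, add the class name as a prefix to every image in each class folder. Here is a way to batch-rename files on Ubuntu:

    cd into the directory of the sub-class
    suppose the class prefix to add is CLL-
    sudo rename 's/^/CLL-/' *tif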
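The Perl-based `rename` tool is not installed on every system, and root privileges are normally not needed for files you own. The same prefixing can be sketched with a plain loop; `add_prefix` is a hypothetical helper name, not part of the tutorial code, and it assumes the images end in `.tif`.

```shell
# add_prefix DIR PREFIX: prepend PREFIX to every .tif file name in DIR.
# Hypothetical helper for illustration only.
add_prefix() {
    dir=$1
    prefix=$2
    for f in "$dir"/*.tif; do
        [ -e "$f" ] || continue            # the glob matched nothing
        name=$(basename "$f")
        case "$name" in
            "$prefix"*) continue ;;        # already prefixed, leave it alone
        esac
        mv -- "$f" "$dir/$prefix$name"
    done
}

# Example call (path taken from the step1 script):
# add_prefix ./case7_lymphoma_classification/CLL CLL-
```

Because already-prefixed files are skipped, running the helper twice does not produce names such as CLL-CLL-xxx.tif.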

The corrected code, with brief notes, follows:

    clc
    clear all
    % output directory for the sub-images (patches)
    outdir='./subs/'; %output directory for all of the sub files
    mkdir(outdir)

    % stride used when extracting patches
    step_size=32;
    % patch size; the author notes that Caffe later crops these to 32*32
    patch_size=36; %size of the patches we would like to extract, bigger since Caffe will randomly crop 32 x 32 patches from them
    % extract patches class by class
    classes={'CLL','FL','MCL'};
    class_struct={};
    for classi=1:length(classes)
        % list every image in the folder of the current class
        files=dir([sprintf('./case7_lymphoma_classification/%s/', classes{classi}),'*.tif']); % we only use images for which we have a mask available, so we use their filenames to limit the patients
        % build the list of unique patient IDs. The author's pipeline builds a patient-indexed structure; this dataset has no patient information, but the same structure is kept for consistency
        % arrayfun applies the function to every element of the array; x{1}{1} picks the part of the file name before the first '.', returned in a 1x1 cell
        patients=unique(arrayfun(@(x) x{1}{1},arrayfun(@(x) strsplit(x.name,'.'),files,'UniformOutput',0),'UniformOutput',0)); %this creates a list of patient id numbers
        patient_struct=[];
        % parfor runs the loop iterations in parallel
        parfor ci=1:length(patients) % for each of the *patients* we extract patches
            % the base field holds the name
            patient_struct(ci).base=patients{ci}; %we keep track of the base filename so that we can split it into sets later. a "base" is for example 12750 in 12750_500_f00003_original.tif
            % the sub_file field holds the patch file names of this patient (large image)
            patient_struct(ci).sub_file=[]; %this will hold all of the patches we extract which will later be written
            % find the large images that belong to this patient
            files=dir(sprintf([sprintf('./case7_lymphoma_classification/%s/', classes{classi}),'%s*.tif'],patients{ci})); %get a list of all of the image files associated with this particular patient
    
            for fi=1:length(files) %for each of the files..... as noted above, names are unique in this dataset, so each "patient" maps to exactly one large image
                disp([ci,length(patients),fi,length(files)])
                fname=files(fi).name;
                % store the name of each large image of this patient
                patient_struct(ci).sub_file(fi).base=fname; %each individual image name gets saved as well
        
                io=imread([sprintf('./case7_lymphoma_classification/%s/', classes{classi}),fname]); %read the image
            
                [nrow,ncol,ndim]=size(io);
                fnames_sub={};
                i=1;
                % extract patches, i.e. sub-blocks of the image matrix; the count is floor((1388-36)/32+1)*floor((1040-36)/32+1)*2
                for rr=1:step_size:nrow-patch_size
                    for cc=1:step_size:ncol-patch_size
                        for rot=1:2  % rotate by 0 and 90 degrees, doubling the dataset
                            try
                                % alternative: loop over rr=1:step_size:nrow-patch_size+1 (likewise for cc) and use
                                % subio=io(rr:rr+patch_size-1,cc:cc+patch_size-1,:);
                        
                                subio=io(rr+1:rr+patch_size,cc+1:cc+patch_size,:);                            
                                subio=imrotate(subio,(rot-1)*90);
                                % patch naming: the running index of the patch
                                subfname=sprintf('%s_sub_%d.tif',fname(1:end-4),i);
                                fnames_sub{end+1}=subfname;
                                imwrite(subio,[outdir,subfname]);
                                i=i+1;
                            catch err
                                disp(err);
                                continue
                            end
                        end
                    end
                end
        
        
                patient_struct(ci).sub_file(fi).fnames_subs=fnames_sub;
            end
    
        end
        class_struct{classi}=patient_struct;

    end

    save('class_struct.mat','class_struct') %save this just incase the computer crashes before the next step finishes

Each image yields 2752 patches.
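The figure of 2752 can be checked by counting the loop start positions. A minimal sketch, assuming 1040 rows and 1388 columns per image and the MATLAB loop bounds `1:step_size:n-patch_size` from the script; `patches_per_image` is a hypothetical helper name.

```shell
# patches_per_image NROW NCOL PATCH STEP COPIES
# Counts the start positions of the MATLAB loop 1:STEP:(N-PATCH) along each
# axis and multiplies by the number of augmented copies written per position.
patches_per_image() {
    nrow=$1; ncol=$2; patch=$3; step=$4; copies=$5
    rows=$(( (nrow - patch - 1) / step + 1 ))
    cols=$(( (ncol - patch - 1) / step + 1 ))
    echo $(( rows * cols * copies ))
}

# 36x36 patches, stride 32, 2 rotations per position
per_image=$(patches_per_image 1040 1388 36 32 2)
echo "patches per image: $per_image"
echo "patches for 374 images: $(( per_image * 374 ))"
```

This gives 32 x 43 positions, so 2752 patches per image and 1029248 patches for 374 images.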


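  • Generate the cross-validation sets, in order to obtain the best model. 5-fold cross-validation is used. Code: step2_make_training_lists.m.
    Each fold needs 4 txt files. Taking the first fold as an example:
    train_w32_parent_1.txt, test_w32_parent_1.txt: lists of the patient names in this fold
    train_w32_1.txt, test_w32_1.txt: lists of the image names in this fold and their class labels
    The code is fairly straightforward; the main thing is to understand how 5-fold cross-validation works. The code, for the record: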

    load class_struct % load the structure saved by step1
      % 5-fold cross-validation
      nfolds=5; %determine how many folds we want to use during cross validation
      fidtrain=[];
      fidtest=[];
    
    
      fidtrain_parent=[];
      fidtest_parent=[];
      % open the file handles for all folds
      for zz=1:nfolds %open all of the file Ids for the training and testing files
          %each fold has 4 files created (as discussed in the tutorial)
          fidtrain(zz)=fopen(sprintf('train_w32_%d.txt',zz),'w');
          fidtest(zz)=fopen(sprintf('test_w32_%d.txt',zz),'w');
    
          fidtrain_parent(zz)=fopen(sprintf('train_w32_parent_%d.txt',zz),'w');
          fidtest_parent(zz)=fopen(sprintf('test_w32_parent_%d.txt',zz),'w');
      end
    
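      % write the patient IDs to the parent txt files, and each patient's patch file names with class labels (CLL: 0, FL: 1, MCL: 2) to the other two txt files
      % 5-fold cross-validation: 4 folds form the training set and the remaining one the test set, which gives 5 train/test combinations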
      for classi=1:length(class_struct)
    
          patient_struct=class_struct{classi};
    
          npatients=length(patient_struct); %get the number of patients that we have
          indices=crossvalind('Kfold',npatients,nfolds); %use the matlab function to generate a k-fold set
    
          for fi=1:npatients %for each patient
              disp([fi,npatients]);
              for k=1:nfolds %for each fold
    
                  if(indices(fi)==k) %if this patient is in the test set for this fold, set the file descriptor accordingly
                      fid=fidtest(k);
                      fid_parent=fidtest_parent(k);
                  else %otherwise its in the training set
                      fid=fidtrain(k);
                      fid_parent=fidtrain_parent(k);
                  end
    
                  fprintf(fid_parent,'%s\n',patient_struct(fi).base); %print this patien's ID to the parent file
    
                  subfiles=patient_struct(fi).sub_file; %get the patient's images
    
                  for subfi=1:length(subfiles) %for each of the patient images
                      try
                          subfnames=subfiles(subfi).fnames_subs; %now get all of the negative patches
                          % Note: change %s\t%d to %s %d so that a space is the separator; otherwise the later format conversion fails with: could not open or find file...
                          cellfun(@(x) fprintf(fid,'%s %d\n',x,classi-1),subfnames); %write them to the list with label classi-1
    
                      catch err
                          disp(err)
                          disp([patient_struct(fi).base,'  ',patient_struct(fi).sub_file(subfi).base]) %if there are any errors, display them, but continue
                          continue
                      end
                  end
    
              end
          end
    
      end
    
      for zz=1:nfolds %now that we're done, make sure that we close all of the files
          fclose(fidtrain(zz));
          fclose(fidtest(zz));
    
          fclose(fidtrain_parent(zz));
          fclose(fidtest_parent(zz));
    
      end

This gives 5 train/test splits, one model each. Each test set has 203648 patches and each training set 825600, so test:training ≈ 1:4.
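Since the split is made per patient, no patient name should appear in both parent lists of a fold. A minimal sketch of such a sanity check, assuming the four list files of a fold sit in one directory under the names used above; `check_fold` is a hypothetical helper.

```shell
# check_fold K [DIR]: print the list sizes for fold K and return non-zero if
# any patient name occurs in both the training and the test parent list.
check_fold() {
    k=$1
    dir=${2:-.}
    train_n=$(wc -l < "$dir/train_w32_$k.txt")
    test_n=$(wc -l < "$dir/test_w32_$k.txt")
    # count lines of the train parent list that also occur verbatim in the test parent list
    overlap=$(grep -cFxf "$dir/test_w32_parent_$k.txt" "$dir/train_w32_parent_$k.txt" || true)
    echo "fold $k: train=$((train_n)) test=$((test_n)) shared_patients=$overlap"
    [ "$overlap" -eq 0 ]
}

# Example: check all five folds in the current directory
# for k in 1 2 3 4 5; do check_fold "$k" || echo "fold $k leaks patients"; done
```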


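  • Build the databases. Here caffe's command-line tools generate the data in leveldb format together with the corresponding mean files. The image layer is not used directly because the mean in the required format would still have to be computed, and the image layer is not designed for reading large volumes of data, so the caffe command line is more convenient.
    Code: step3_make_dbs.sh; **run it inside the sub folder** so that the paths resolve. Some paths and a few small errors in the original code still need fixing: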

    #!/bin/bash
    
      filepath=$(cd "$(dirname "$0")"; pwd)
    
      for kfoldi in {1..5}
      do
      echo "doing fold $kfoldi"
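      # Note: this works if the experiment directory is under the caffe path; otherwise use absolute paths. Also, {{kfoldi}} inside the for loop of the original code should be changed to kfoldi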
      #~/py-R-FCN/caffe/build/tools/convert_imageset -shuffle -backend leveldb   subs/ train_w32_${kfoldi}.txt DB_train_${kfoldi} &
      #~/py-R-FCN/caffe/build/tools/convert_imageset -shuffle -backend leveldb   subs/ test_w32_${kfoldi}.txt DB_test_${kfoldi} &
      /home/mz/py-R-FCN/caffe/build/tools/convert_imageset -shuffle -backend leveldb   subs/ train_w32_$kfoldi.txt DB_train_$kfoldi &
      /home/mz/py-R-FCN/caffe/build/tools/convert_imageset -shuffle -backend leveldb   subs/ test_w32_$kfoldi.txt DB_test_$kfoldi &
      done
    
    
    
    
      FAIL=0
      for job in `jobs -p`
      do
          echo $job
          wait $job || let "FAIL+=1"
      done
    
    
    
    
      echo "number failed: $FAIL"
    
      cd ../
    
      for kfoldi in {1..5}
      do
      echo "doing fold $kfoldi"
      # same changes as above
      /home/mz/py-R-FCN/caffe/build/tools/compute_image_mean DB_train_$kfoldi DB_train_w32_$kfoldi.binaryproto -backend leveldb  &
      done

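  • Train the DL classifier

Note: the network is called alexnet, but it is much leaner than the official AlexNet: only 3 convolution/pooling pairs and two fully connected layers. It is in fact the network used for CIFAR-10 classification. The input images here are 32*32 (the network definition also crops the input to 32) and the task has 3 classes (AlexNet has 1000), so the model does not need AlexNet's complexity. The depth and some parameters therefore have to change; AlexNet cannot be copied as-is. One question worth testing: must pathology images be cut into small patches, or would larger patches also work, with fewer pooling layers and a deeper network, and how would that perform?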

Files needed, from the common folder at the same level as 7-lymphoma: BASE-alexnet_solver_ada.prototxt; BASE-alexnet_traing_32w_db.prototxt and BASE-alexnet_traing_32w_dropout_db.prototxt (without and with dropout); deploy_train32.prototxt and deploy_train32_dropout.prototxt (test-time network definitions).
Make 5 copies, one per model (5-fold cross-validation), named 1-alexnet_solver_ada.prototxt and so on.
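The five per-fold copies can be generated in a loop instead of by hand. A minimal sketch, assuming each BASE file contains a literal fold placeholder that is passed in as an argument; `make_fold_copies` is a hypothetical helper, and the placeholder string must be copied from the actual files.

```shell
# make_fold_copies BASE_FILE PLACEHOLDER [NFOLDS]
# Writes 1-<name>, 2-<name>, ... into the current directory, replacing every
# literal occurrence of PLACEHOLDER with the fold number.
make_fold_copies() {
    base=$1
    placeholder=$2
    nfolds=${3:-5}
    name=$(basename "$base")
    name=${name#BASE-}                      # BASE-foo.prototxt -> foo.prototxt
    # escape characters that are special in a sed basic regular expression
    pattern=$(printf '%s' "$placeholder" | sed 's/[]\/$*.^[]/\\&/g')
    k=1
    while [ "$k" -le "$nfolds" ]; do
        sed "s/$pattern/$k/g" "$base" > "$k-$name"
        k=$((k + 1))
    done
}

# Example (placeholder quoted so the shell does not expand it):
# make_fold_copies ../common/BASE-alexnet_solver_ada.prototxt '$(kfoldi)d'
```

The output size of the last ip layer and any paths still have to be edited separately in each copy.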
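What to modify:

  1. Check every occurrence of $(kfoldi)d in all files and replace it with the fold number, 1-5. Change the output size of the last ip layer in the prototxt files to 3.
  2. Fix the paths. The paths in the files (data, prototxt) assume that the model definitions are under caffe's ./model and that the dataset databases and mean files are in the caffe root directory. If that is not the case, replace them with absolute paths.
  3. Set caffe's number of test iterations, test_iter in the solver file. It is computed as the test set size divided by the test batch_size. batch_size = 128, and the test set size can be obtained quickly with: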

    wc -l test_w32_1.txt

    Or open the file, jump to the last line, and read the line number shown at the bottom of the text editor.
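The division can be scripted together with the line count. A minimal sketch that rounds up, so a final partial batch is still covered; `test_iter` here is a hypothetical shell helper named after the solver field.

```shell
# test_iter LIST_FILE BATCH_SIZE: number of batches needed to see every line
# of LIST_FILE at least once, i.e. ceil(lines / batch_size).
test_iter() {
    lines=$(wc -l < "$1")
    batch=$2
    echo $(( (lines + batch - 1) / batch ))
}

# Example: test_iter test_w32_1.txt 128
```

For a test list of 203648 lines and batch_size = 128 this gives 1591.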

Run the training:

/home/mz/py-R-FCN/caffe/build/tools/caffe train --solver=1-alexnet_solver_ada.prototxt

Model 5, 600000 iterations, without dropout: accuracy 0.841879, loss = 0.513672;
with dropout: accuracy 0.826787, loss = 0.576846
Model 4, 600000 iterations, without dropout: accuracy 0.86142, loss = 0.364765;
with dropout: accuracy 0.85352, loss = 0.500288
Model 3, 600000 iterations, without dropout: accuracy 0.840632, loss = 0.448586;
with dropout: accuracy 0.814813, loss = 0.546735
Model 2, 600000 iterations, without dropout: accuracy 0.817167, loss = 0.466199;
with dropout: accuracy 0.797229, loss = 0.557098
Model 1, 600000 iterations, without dropout: accuracy 0.85496, loss = 0.435163;
with dropout: accuracy 0.828828, loss = 0.577961
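To compare the two variants across folds, the five accuracies can be averaged. A minimal sketch using the numbers listed above; `mean` is a hypothetical helper.

```shell
# mean: average of the numbers given as arguments, printed with 4 decimals
mean() {
    printf '%s\n' "$@" | awk '{ s += $1 } END { printf "%.4f\n", s / NR }'
}

mean 0.85496 0.817167 0.840632 0.86142 0.841879     # folds 1-5, without dropout
mean 0.828828 0.797229 0.814813 0.85352 0.826787    # folds 1-5, with dropout
```

The means come out at about 0.843 without dropout and 0.824 with dropout; the dropout variant is also lower on every individual fold.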


Trying larger patches with different network architectures (AlexNet, VGG-16, GoogLeNet, ResNet)

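  1. Data preparation.
    Now try larger patches, cut to 227*227. The following experiments no longer use cross-validation: all data is merged into one dataset and split into training, validation and test sets at 2:1:1.
    Method: create a new folder for the experiment data and change the code in the original step1 and step2 files, as follows:
    step1. Cut 227×227 patches from the original images; each patch is also rotated by 90 degrees and horizontally flipped to augment the data. One original image yields 96 patches.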
clc
clear all
% output directory for the sub-images (patches)
outdir='./subs_227/'; %output directory for all of the sub files
mkdir(outdir)

% stride used when extracting patches
step_size=227;
% patch size: 227*227 in this experiment
patch_size=227; %size of the patches we would like to extract

% whether to also write horizontally flipped copies
flip = true;

% extract patches class by class
classes={'CLL','FL','MCL'};
class_struct={};
for classi=1:length(classes)
    % list every image in the folder of the current class
    files=dir([sprintf('./case7_lymphoma_classification/%s/', classes{classi}),'*.tif']); % we only use images for which we have a mask available, so we use their filenames to limit the patients
    % build the list of unique patient IDs. The author's pipeline builds a patient-indexed structure; this dataset has no patient information, but the same structure is kept for consistency
    % arrayfun applies the function to every element of the array; x{1}{1} picks the part of the file name before the first '.', returned in a 1x1 cell
    patients=unique(arrayfun(@(x) x{1}{1},arrayfun(@(x) strsplit(x.name,'.'),files,'UniformOutput',0),'UniformOutput',0)); %this creates a list of patient id numbers
    patient_struct=[];
    % parfor runs the loop iterations in parallel
    parfor ci=1:length(patients) % for each of the *patients* we extract patches
        % the base field holds the name
        patient_struct(ci).base=patients{ci}; %we keep track of the base filename so that we can split it into sets later. a "base" is for example 12750 in 12750_500_f00003_original.tif
        % the sub_file field holds the patch file names of this patient (large image)
        patient_struct(ci).sub_file=[]; %this will hold all of the patches we extract which will later be written
        % find the large images that belong to this patient
        files=dir(sprintf([sprintf('./case7_lymphoma_classification/%s/', classes{classi}),'%s*.tif'],patients{ci})); %get a list of all of the image files associated with this particular patient
        
        for fi=1:length(files) %for each of the files..... as noted above, names are unique in this dataset, so each "patient" maps to exactly one large image
            disp([ci,length(patients),fi,length(files)])
            fname=files(fi).name;
            % store the name of each large image of this patient
            patient_struct(ci).sub_file(fi).base=fname; %each individual image name gets saved as well
            
            io=imread([sprintf('./case7_lymphoma_classification/%s/', classes{classi}), fname]); %read the image
                
            [nrow,ncol,ndim]=size(io);
            fnames_sub={};
            i=1;
            % extract patches, i.e. sub-blocks of the image matrix; 4*6 positions per image, times 2 rotations, times 2 with the flip = 96
            for rr=1:step_size:nrow-patch_size
                for cc=1:step_size:ncol-patch_size
                    for rot=1:2  % rotate by 0 and 90 degrees, doubling the dataset
                        try
                            % alternative: loop over rr=1:step_size:nrow-patch_size+1 (likewise for cc) and use
                            % subio=io(rr:rr+patch_size-1,cc:cc+patch_size-1,:);
                            
                            subio=io(rr+1:rr+patch_size,cc+1:cc+patch_size,:);                            
                            subio=imrotate(subio,(rot-1)*90);
                            % patch naming: the running index of the patch
                            subfname=sprintf('%s_sub_%d.tif',fname(1:end-4),i);
                            fnames_sub{end+1}=subfname;
                            imwrite(subio,[outdir,subfname]);
                            i=i+1;
                            if flip
                                subio_flip = subio(:,end:-1:1,1:3);
                                % patch naming: the running index of the patch
                                subfname=sprintf('%s_sub_%d.tif',fname(1:end-4),i);
                                fnames_sub{end+1}=subfname;
                                imwrite(subio_flip,[outdir,subfname]);
                                i=i+1;
                            end
                        catch err
                            disp(err);
                            continue
                        end
                    end
                end
            end
            
            
            patient_struct(ci).sub_file(fi).fnames_subs=fnames_sub;
        end
        
    end
    class_struct{classi}=patient_struct;

end

save('class_struct.mat','class_struct') %save this just incase the computer crashes before the next step finishes

step2. Generate the TXT files holding the image name lists of the training, validation and test sets. Training set: 17856; test set: 9120; validation set: 8928

load class_struct % load the structure saved by step1

% open the file handles

fidtrain=fopen(sprintf('train_w227.txt'),'w');
fidval=fopen(sprintf('val_w227.txt'),'w');
fidtest=fopen(sprintf('test_w227.txt'),'w');
fidtrain_parent  = fopen(sprintf('train_w227_parent.txt'),'w');
fidval_parent  = fopen(sprintf('val_w227_parent.txt'),'w');
fidtest_parent  = fopen(sprintf('test_w227_parent.txt'),'w');
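% write each patient's patch file names and class labels (CLL: 0, FL: 1, MCL: 2) to the training, validation and test txt files

% training, validation and test sets at a ratio of 2:1:1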


for classi=1:length(class_struct)
    
    patient_struct=class_struct{classi};
    
    npatients=length(patient_struct); %get the number of patients that we have
    % shuffle the order
    RandIndex = randperm(npatients);
    test_index = RandIndex(1:ceil(0.25*npatients));
    val_index = RandIndex(ceil(0.25*npatients)+1:ceil(0.5*npatients));
    train_index = RandIndex(ceil(0.5*npatients)+1:end);
        
    for fi=1:npatients %for each patient
        disp([fi,npatients]);
            
        if(ismember(fi, test_index)) %if this patient is in the test set for this fold, set the file descriptor accordingly
            fid=fidtest;
            fid_parent=fidtest_parent;
        elseif(ismember(fi, train_index)) %otherwise its in the training set
            fid=fidtrain;
            fid_parent=fidtrain_parent;
        else
            fid=fidval;
            fid_parent=fidval_parent;
        end
            
        fprintf(fid_parent,'%s\n',patient_struct(fi).base); %print this patien's ID to the parent file

        subfiles=patient_struct(fi).sub_file; %get the patient's images

        for subfi=1:length(subfiles) %for each of the patient images
            try
                subfnames=subfiles(subfi).fnames_subs; %now get all of the negative patches
                cellfun(@(x) fprintf(fid,'%s %d\n',x,classi-1),subfnames); %write them to the list as belonging to the 0 class (non nuclei)

            catch err
                disp(err)
                disp([patient_struct(fi).base,'  ',patient_struct(fi).sub_file(subfi).base]) %if there are any errors, display them, but continue
                continue
            end
        end

    end
end
    

 %now that we're done, make sure that we close all of the files
fclose(fidtrain);
fclose(fidtest);
fclose(fidval);
fclose(fidtrain_parent);
fclose(fidtest_parent);
fclose(fidval_parent);
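The list sizes quoted for step2 follow from the index arithmetic in the script: per class, ceil(25%) of the images go to test, the images up to ceil(50%) go to validation, and the rest go to training. A minimal sketch, assuming per-class image counts of 113, 139 and 122 (the counts that add up to the 374 images) and 96 patches per image; `split_totals` is a hypothetical helper.

```shell
# split_totals PER_IMAGE N1 N2 ...: patch totals per split for classes with
# N1, N2, ... images, using the same ceil() boundaries as the MATLAB script.
split_totals() {
    per_image=$1; shift
    train=0; val=0; tst=0
    for n in "$@"; do
        q=$(( (n + 3) / 4 ))          # ceil(0.25 * n): test images
        h=$(( (n + 1) / 2 ))          # ceil(0.5 * n): test + validation images
        tst=$(( tst + q ))
        val=$(( val + h - q ))
        train=$(( train + n - h ))
    done
    echo "train=$(( train * per_image )) val=$(( val * per_image )) test=$(( tst * per_image ))"
}

split_totals 96 113 139 122   # assumed CLL, FL, MCL image counts
```

Under these assumptions the output is train=17856 val=8928 test=9120, matching the sizes listed for step2.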

step3. Generate the LMDB databases and the corresponding mean file. Put the step3 file, modified as below, into subs_227

#!/bin/bash

echo "create lmdb data"
/home/mz/py-R-FCN/caffe/build/tools/convert_imageset -shuffle -backend lmdb   ./ ../train_w227.txt ../DB_train &
/home/mz/py-R-FCN/caffe/build/tools/convert_imageset -shuffle -backend lmdb   ./ ../test_w227.txt ../DB_test &
/home/mz/py-R-FCN/caffe/build/tools/convert_imageset -shuffle -backend lmdb   ./ ../val_w227.txt ../DB_val &





FAIL=0
for job in `jobs -p`
do
    echo $job
    wait $job || let "FAIL+=1"
done




echo "number failed: $FAIL"

cd ../


echo "create mean binary"
/home/mz/py-R-FCN/caffe/build/tools/compute_image_mean DB_train DB_train_w227.binaryproto -backend lmdb  &

Different models

AlexNet

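  1. Copy the bvlc-alexnet folder from caffe/models to get the AlexNet model definition prototxt and solver.prototxt. Adjust the relevant parameters and train.
    Parameters: iterations: 50000; test_iter=179; test_interval=200; fc8-output=3;
  2. Results
    val-accuracy: 0.927598; train-loss = 8.55613e-05;
    For testing, train_val.prototxt is reused: save a copy named train_test.prototxt and change only the validation set path to the test set path. Then run: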
sudo /home/mz/py-R-FCN/caffe/build/tools/caffe test -model=train_test.prototxt -weights=../models/caffe_alexnet_train_iter_50000.caffemodel -gpu 0 -iterations=183

The -iterations parameter is computed as: test set size / batch_size
test-accuracy: 0.856721 loss=1.03727

GoogLeNet

VGG

ResNet
