MXNet (35): Semantic Segmentation with Fully Convolutional Networks (FCN)

1. Transposed Convolution

A transposed convolution layer is used to increase the height and width of its input.

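Consider a basic case in which the input and output each have a single channel, the padding is 0, and the stride is 1. The figure below illustrates how a transposed convolution with a 2×2 kernel produces a 3×3 output from a 2×2 input matrix.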

The process above can be written as the following code, where the kernel is K and the input is X:

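from mxnet import gluon, init, np, npx
from mxnet.gluon import nn
npx.set_np()  # use MXNet's NumPy-compatible interface

def trans_conv(X, K):
    # each input element X[i, j] scales the whole kernel K, and the scaled
    # block is added into the output at rows i..i+h-1, columns j..j+w-1
    h, w = K.shape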
    Y = np.zeros((X.shape[0] + h - 1, X.shape[1] + w - 1))
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            Y[i: i + h, j: j + w] += X[i, j] * K
    return Y

X = np.array([[0, 1], [2, 3]])
K = np.array([[0, 1], [2, 3]])
trans_conv(X, K)
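Why is this operation called a *transposed* convolution? The pure-Python sketch below (no MXNet needed; `conv_matrix` and `transposed_conv_via_matrix` are hypothetical helper names) builds the matrix W that performs an ordinary stride-1, unpadded convolution of a 3×3 input with K. It then checks that multiplying the flattened 2×2 X by the transpose of W gives the same 3×3 result as the scatter-add in trans_conv.

```python
def conv_matrix(K, in_h, in_w):
    """Matrix W such that W @ vec(X) equals a stride-1, no-padding
    cross-correlation of an (in_h, in_w) input X with kernel K."""
    kh, kw = len(K), len(K[0])
    out_h, out_w = in_h - kh + 1, in_w - kw + 1
    W = [[0] * (in_h * in_w) for _ in range(out_h * out_w)]
    for i in range(out_h):
        for j in range(out_w):
            row = i * out_w + j
            for a in range(kh):
                for b in range(kw):
                    W[row][(i + a) * in_w + (j + b)] = K[a][b]
    return W

def transposed_conv_via_matrix(X, K):
    """Transposed convolution computed as W^T @ vec(X)."""
    kh, kw = len(K), len(K[0])
    out_h, out_w = len(X) + kh - 1, len(X[0]) + kw - 1
    W = conv_matrix(K, out_h, out_w)  # maps out_h*out_w values -> len(X)*len(X[0])
    x_vec = [v for row in X for v in row]
    # multiply by the transpose: column c of W becomes output element c
    y_vec = [sum(W[r][c] * x_vec[r] for r in range(len(x_vec)))
             for c in range(out_h * out_w)]
    return [y_vec[i * out_w:(i + 1) * out_w] for i in range(out_h)]

X = [[0, 1], [2, 3]]
K = [[0, 1], [2, 3]]
print(transposed_conv_via_matrix(X, K))  # [[0, 0, 1], [0, 4, 6], [4, 12, 9]]
```

The convolution maps 9 values to 4, and its transpose maps 4 values back to 9, which is exactly the shape change of the transposed convolution.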

Gluon's nn.Conv2DTranspose gives the same result. As with nn.Conv2D, both the input and the kernel must be 4D tensors.

X, K = X.reshape(1, 1, 2, 2), K.reshape(1, 1, 2, 2)
tconv = nn.Conv2DTranspose(1, kernel_size=2)
tconv.initialize(init.Constant(K))
tconv(X)

1.1 Padding, Stride, and Channels

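In a convolution, padding is applied to the input. In a transposed convolution, it is applied to the output. A padding of 1×1 means that we first compute the output as usual and then remove its first and last rows and columns.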

tconv = nn.Conv2DTranspose(1, kernel_size=2, padding=1)
tconv.initialize(init.Constant(K))
tconv(X)

# array([[[[4.]]]])
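The padding rule can be checked by hand. In this pure-Python sketch (helper names are hypothetical), the full stride-1 output is computed first and then `padding` rows and columns are stripped from every border. For the X and K above, only the centre value 4 remains, matching the MXNet output.

```python
def trans_conv_full(X, K):
    """Stride-1 transposed convolution: scatter-add X[i][j] * K."""
    kh, kw = len(K), len(K[0])
    Y = [[0] * (len(X[0]) + kw - 1) for _ in range(len(X) + kh - 1)]
    for i, row in enumerate(X):
        for j, x in enumerate(row):
            for a in range(kh):
                for b in range(kw):
                    Y[i + a][j + b] += x * K[a][b]
    return Y

def trans_conv_padded(X, K, padding):
    """Padding in a transposed convolution crops rows/columns from the output."""
    Y = trans_conv_full(X, K)
    p = padding
    return [row[p:len(row) - p] for row in Y[p:len(Y) - p]]

X = [[0, 1], [2, 3]]
K = [[0, 1], [2, 3]]
print(trans_conv_full(X, K))       # [[0, 0, 1], [0, 4, 6], [4, 12, 9]]
print(trans_conv_padded(X, K, 1))  # [[4]]
```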

Strides likewise apply to the output:

tconv = nn.Conv2DTranspose(1, kernel_size=2, strides=2)
tconv.initialize(init.Constant(K))
tconv(X)
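With stride s, the block X[i, j]·K is added at offset (s·i, s·j) instead of (i, j), so an n×n input and a k×k kernel give an output of size (n - 1)·s + k. A pure-Python sketch of the stride-2 case above (the helper name is hypothetical):

```python
def trans_conv_strided(X, K, stride):
    """Transposed convolution with stride: X[i][j] * K is added at (stride*i, stride*j)."""
    kh, kw = len(K), len(K[0])
    out_h = (len(X) - 1) * stride + kh
    out_w = (len(X[0]) - 1) * stride + kw
    Y = [[0] * out_w for _ in range(out_h)]
    for i, row in enumerate(X):
        for j, x in enumerate(row):
            for a in range(kh):
                for b in range(kw):
                    Y[stride * i + a][stride * j + b] += x * K[a][b]
    return Y

for row in trans_conv_strided([[0, 1], [2, 3]], [[0, 1], [2, 3]], stride=2):
    print(row)
# [0, 0, 0, 1]
# [0, 0, 2, 3]
# [0, 2, 0, 3]
# [4, 6, 6, 9]
```

With stride 2 and a 2×2 kernel the scaled blocks no longer overlap, so each output entry comes from exactly one input element.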

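Transposed convolution can also restore the number of channels, here reducing it from 20 back to 10. The transposed convolution below changes the shape in exactly the opposite way to the convolution before it: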

X = np.random.uniform(size=(1, 10, 16, 16))
conv = nn.Conv2D(20, kernel_size=5, padding=2, strides=3)
tconv = nn.Conv2DTranspose(10, kernel_size=5, padding=2, strides=3)
conv.initialize()
tconv.initialize()
tconv(conv(X)).shape == X.shape

# True
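Why does the shape come back? A sketch using the standard output-size formulas, assuming no dilation and no extra output padding: a convolution maps a side length n to floor((n + 2p - k)/s) + 1, and a transposed convolution maps n to (n - 1)·s - 2p + k.

```python
def conv_out(n, k, p, s):
    """Output side length of a convolution (no dilation)."""
    return (n + 2 * p - k) // s + 1

def trans_conv_out(n, k, p, s):
    """Output side length of a transposed convolution (no dilation, no output padding)."""
    return (n - 1) * s - 2 * p + k

h = conv_out(16, k=5, p=2, s=3)               # 16 -> 6
print(h, trans_conv_out(h, k=5, p=2, s=3))    # 6 16
print(conv_out(17, k=5, p=2, s=3))            # 6 as well
```

Because the floor division discards remainders, the round trip is not guaranteed for every input size: 17 also shrinks to 6, and the transposed convolution then returns 16, not 17.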

2. Fully Convolutional Networks (FCN)

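A fully convolutional network (FCN) uses a convolutional neural network to transform image pixels into pixel categories. Unlike the convolutional neural networks introduced earlier, an FCN uses a transposed convolution layer to transform the height and width of the intermediate feature maps back to the size of the input image, so that the predictions correspond one-to-one with the input image in the spatial dimensions (height and width). Given a position in the spatial dimensions, the output along the channel dimension is the category prediction for the pixel at that position.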
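Concretely, if the network outputs a score array of shape (C, H, W), the prediction for pixel (h, w) is the channel with the largest score at that position. A minimal pure-Python sketch with made-up scores for C = 3 classes on a 2×2 image (the helper name is hypothetical); it does what `argmax(axis=1)` does in the predict function later in this post, minus the batch axis:

```python
def per_pixel_argmax(scores):
    """scores: nested list of shape (C, H, W) -> (H, W) list of class indices."""
    C, H, W = len(scores), len(scores[0]), len(scores[0][0])
    return [[max(range(C), key=lambda c: scores[c][h][w]) for w in range(W)]
            for h in range(H)]

scores = [
    [[0.9, 0.1], [0.2, 0.3]],   # class 0
    [[0.05, 0.7], [0.1, 0.2]],  # class 1
    [[0.05, 0.2], [0.7, 0.5]],  # class 2
]
print(per_pixel_argmax(scores))  # [[0, 1], [2, 2]]
```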

2.1 Building the Model

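The FCN first uses a convolutional neural network to extract image features. It then uses a 1×1 convolution layer to transform the number of channels into the number of categories, and finally a transposed convolution layer to transform the height and width of the feature maps back to the size of the input image. The model output therefore has the same height and width as the input image and corresponds to it one-to-one in spatial position. The output channels contain the category predictions for the pixels at the corresponding spatial positions.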
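A 1×1 convolution applies the same linear map to the channel vector at every spatial position, which is why it can change the number of channels (for example 512 to the number of classes) without touching the height and width. A pure-Python sketch with toy sizes (2 input channels, 3 output channels) and made-up weights; `conv1x1` is a hypothetical helper name:

```python
def conv1x1(X, W, b):
    """X: (C_in, H, W) input; W: (C_out, C_in) weights; b: (C_out,) bias.
    Returns (C_out, H, W): the same linear map applied at every pixel."""
    c_in, H, Wd = len(X), len(X[0]), len(X[0][0])
    return [[[b[o] + sum(W[o][i] * X[i][h][w] for i in range(c_in))
              for w in range(Wd)] for h in range(H)]
            for o in range(len(W))]

X = [[[1, 2], [3, 4]],   # input channel 0
     [[0, 1], [1, 0]]]   # input channel 1
W = [[1, 0], [0, 1], [1, 1]]
b = [0, 0, 10]
Y = conv1x1(X, W, b)
print(len(Y), len(Y[0]), len(Y[0][0]))  # 3 2 2
print(Y[2])                             # [[11, 13], [14, 14]]
```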

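Below, we fine-tune a ResNet-18 model pretrained on ImageNet. The last two layers of the model's features member variable are the global average pooling layer GlobalAvgPool2D and the flattening layer Flatten, and the output module contains the fully connected layer used for the final output. A fully convolutional network does not need these layers.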

pretrained_net = gluon.model_zoo.vision.resnet18_v2(pretrained=True)
pretrained_net.features[-4:], pretrained_net.output

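Next, we create the fully convolutional network instance net. It copies all layers of the features member variable of pretrained_net except the last two, together with their pretrained model parameters.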

net = nn.HybridSequential()
for layer in pretrained_net.features[:-2]:
    net.add(layer)

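Given an input with a height of 320 and a width of 480, the forward computation of net reduces the height and width to 1/32 of the original, namely 10 and 15.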

X = np.random.uniform(size=(1, 3, 320, 480))
net(X).shape

# (1, 512, 10, 15)
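The factor 1/32 comes from five stride-2 layers in ResNet-18. Assuming the standard layout (a 7×7 stem convolution with stride 2 and padding 3, 3×3 max pooling with stride 2 and padding 1, and a stride-2 3×3 convolution with padding 1 at the start of each of the last three residual stages), a sketch of the arithmetic; the 3×3 stride-1 convolutions with padding 1 in between keep the size unchanged and are omitted:

```python
def conv_out(n, k, p, s):
    """Output side length of a convolution or pooling layer (no dilation)."""
    return (n + 2 * p - k) // s + 1

# (kernel, padding, stride) of the downsampling layers, assumed to follow
# the standard ResNet-18 design
DOWNSAMPLING = [(7, 3, 2),   # stem convolution
                (3, 1, 2),   # max pooling
                (3, 1, 2),   # first block of stage 2
                (3, 1, 2),   # first block of stage 3
                (3, 1, 2)]   # first block of stage 4

def feature_size(n):
    for k, p, s in DOWNSAMPLING:
        n = conv_out(n, k, p, s)
    return n

print(feature_size(320), feature_size(480))  # 10 15
```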

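Next, a 1×1 convolution layer transforms the number of output channels into the number of categories in the dataset, which is 21 for Pascal VOC2012. A transposed convolution layer then enlarges the height and width of the feature maps by a factor of 32. Setting the stride to 32, the padding to 32/2 = 16, and the kernel size to 64×64 achieves exactly this 32× enlargement.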

num_classes = 21
net.add(
    nn.Conv2D(num_classes, kernel_size=1),
    nn.Conv2DTranspose(num_classes, kernel_size=64, padding=16, strides=32)
)
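Why do these hyperparameters give exactly 32×? Plugging k = 64, p = 16, s = 32 into the transposed-convolution size formula (n - 1)·s - 2p + k gives 32n - 32 - 32 + 64 = 32n for every n. A quick check in plain Python:

```python
def trans_conv_out(n, k, p, s):
    """Output side length of a transposed convolution (no dilation, no output padding)."""
    return (n - 1) * s - 2 * p + k

for n in (10, 15):
    print(n, "->", trans_conv_out(n, k=64, p=16, s=32))
# 10 -> 320
# 15 -> 480
```

The same pattern (kernel = 2·stride, padding = stride/2) gives an exact s-fold enlargement for any even stride s, for example stride 2, kernel 4, padding 1 as used in the bilinear experiment below.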

2.2 Initializing the Transposed Convolution Layer

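We already know that a transposed convolution layer can enlarge feature maps. In image processing, we sometimes need to enlarge an image, which is called upsampling. There are many upsampling methods, and a common one is bilinear interpolation. Briefly, to obtain the output pixel at coordinates (x, y), we first map those coordinates to coordinates (x', y') in the input image. We then find the four input pixels closest to (x', y') and compute the output pixel at (x, y) from their values, weighted by their relative distances to (x', y'). The function below builds a kernel that performs bilinear interpolation upsampling when used to initialize a transposed convolution layer.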

def bilinear_kernel(in_channels, out_channels, kernel_size):
    factor = (kernel_size + 1) // 2
    if kernel_size % 2 == 1:
        center = factor - 1
    else:
        center = factor - 0.5
    og = (np.arange(kernel_size).reshape(-1, 1),
          np.arange(kernel_size).reshape(1, -1))
    filt = (1 - np.abs(og[0] - center) / factor) * (1 - np.abs(og[1] - center) / factor)
    weight = np.zeros((in_channels, out_channels, kernel_size, kernel_size))
    weight[range(in_channels), range(out_channels), :, :] = filt
    return np.array(weight)
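The bilinear interpolation procedure described above can also be written out directly. This pure-Python sketch relies on two assumptions not stated in the text: output coordinates are mapped with the half-pixel convention x' = (x + 0.5)/scale - 0.5, and neighbours that fall outside the image are clamped to the border. The names `bilinear_sample` and `bilinear_upsample` are hypothetical.

```python
import math

def bilinear_sample(img, yf, xf):
    """Interpolate img (H x W nested list) at fractional coordinates (yf, xf)
    from the four nearest pixels, clamping indices to the image border."""
    H, W = len(img), len(img[0])
    y0, x0 = math.floor(yf), math.floor(xf)
    dy, dx = yf - y0, xf - x0

    def px(y, x):
        return img[min(max(y, 0), H - 1)][min(max(x, 0), W - 1)]

    top = (1 - dx) * px(y0, x0) + dx * px(y0, x0 + 1)
    bottom = (1 - dx) * px(y0 + 1, x0) + dx * px(y0 + 1, x0 + 1)
    return (1 - dy) * top + dy * bottom

def bilinear_upsample(img, scale):
    """Map each output pixel (y, x) to input coordinates (y', x') and interpolate there."""
    H, W = len(img), len(img[0])
    return [[bilinear_sample(img, (y + 0.5) / scale - 0.5, (x + 0.5) / scale - 0.5)
             for x in range(W * scale)] for y in range(H * scale)]

for row in bilinear_upsample([[0, 4], [8, 12]], 2):
    print(row)
# [0.0, 1.0, 3.0, 4.0]
# [2.0, 3.0, 5.0, 6.0]
# [6.0, 7.0, 9.0, 10.0]
# [8.0, 9.0, 11.0, 12.0]
```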

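Now let's experiment with bilinear interpolation upsampling implemented by a transposed convolution layer. We construct a transposed convolution layer that doubles the height and width of its input and initialize its kernel with the bilinear_kernel function.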

conv_trans = nn.Conv2DTranspose(3, kernel_size=4, padding=1, strides=2)
conv_trans.initialize(init.Constant(bilinear_kernel(3, 3, 4)))
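Why does a transposed convolution with this kernel upsample bilinearly? For kernel_size = 4, bilinear_kernel uses factor = 2 and center = 1.5, so its filter is the outer product of the 1D weights [0.25, 0.75, 0.75, 0.25] with themselves. The 2D operation is therefore separable, and one dimension is enough to see the effect. A pure-Python sketch of a stride-2, padding-1 transposed convolution with these weights (`trans_conv1d` is a hypothetical helper name):

```python
def trans_conv1d(x, w, stride, padding):
    """1D transposed convolution: x[i] * w is added at offset stride*i,
    then `padding` elements are cropped from each end of the output."""
    full = [0.0] * ((len(x) - 1) * stride + len(w))
    for i, v in enumerate(x):
        for t, wt in enumerate(w):
            full[stride * i + t] += v * wt
    return full[padding:len(full) - padding]

w = [0.25, 0.75, 0.75, 0.25]   # 1D factor of bilinear_kernel(..., 4)
x = [0.0, 4.0, 8.0]            # a linear ramp
print(trans_conv1d(x, w, stride=2, padding=1))  # [0.0, 1.0, 3.0, 5.0, 7.0, 6.0]
```

The output length doubles from 3 to 6. The interior values 1, 3, 5, 7 are exactly the linear interpolation of the ramp 4·x' at x' = 0.25, 0.75, 1.25, 1.75. At the borders the missing neighbour outside the input counts as zero, so the last value is 6 rather than 8.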

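Read the image into X and record the upsampled result as Y. To display the image, we need to move the channel dimension back to the last position.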

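from mxnet import image
import plotly.express as px

img = image.imread('img/catdog.jpg')
# HWC uint8 -> NCHW float32 in [0, 1]
X = np.expand_dims(img.astype('float32').transpose(2, 0, 1), axis=0)/255
Y = conv_trans(X)
# move the channel axis back to the end (CHW -> HWC) for display
out_img = Y[0].transpose(1, 2, 0)
print('input image shape:', img.shape)
print('output image shape:', out_img.shape)
px.imshow(out_img.asnumpy(), width=img.shape[1]/2, height=img.shape[0]/2)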

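Initialize the transposed convolution layer with the bilinear interpolation kernel and the 1×1 convolution layer with Xavier initialization: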

W = bilinear_kernel(num_classes, num_classes, 64)
net[-1].initialize(init.Constant(W))
net[-2].initialize(init=init.Xavier())

3. Training

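The loss function and accuracy computation here are essentially the same as those used in image classification. Because the channels of the transposed convolution layer predict the pixel categories, we specify axis=1 (the channel dimension) in SoftmaxCrossEntropyLoss. In addition, accuracy is computed from whether the predicted category of each pixel is correct.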

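import plotly.graph_objects as go
from d2l import mxnet as d2l
from mxnet import autograd, gluon

def accuracy(y_hat, y):
    # number of pixels whose predicted class (argmax over channels) matches the label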
    if len(y_hat.shape) > 1 and y_hat.shape[1] > 1:
        y_hat = y_hat.argmax(axis=1)
    cmp = y_hat.astype(y.dtype) == y
    return float(cmp.sum())
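For intuition about SoftmaxCrossEntropyLoss(axis=1): at every pixel, the softmax is taken across the channel (class) axis, and the loss is the negative log-probability of the true class. This is a minimal pure-Python sketch for a single image with scores of shape (C, H, W) and labels of shape (H, W), averaging over pixels. It is not Gluon's implementation, and Gluon's own reduction over axes and sample weights are not reproduced here; `pixel_softmax_ce` is a hypothetical name.

```python
import math

def pixel_softmax_ce(scores, labels):
    """Mean over pixels of -log softmax(scores[:, h, w])[labels[h][w]]."""
    C, H, W = len(scores), len(scores[0]), len(scores[0][0])
    total = 0.0
    for h in range(H):
        for w in range(W):
            logits = [scores[c][h][w] for c in range(C)]
            m = max(logits)  # subtract the max for numerical stability
            log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
            total += log_sum - logits[labels[h][w]]
    return total / (H * W)

# Two classes on a 1x2 image: uniform scores at the first pixel,
# a confident and correct prediction at the second.
scores = [[[0.0, 5.0]],   # class 0
          [[0.0, -5.0]]]  # class 1
labels = [[1, 0]]
print(round(pixel_softmax_ce(scores, labels), 4))  # 0.3466
```

The uniform pixel contributes log 2 ≈ 0.693 and the confident pixel almost nothing, so the mean is about half of log 2.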

def train_batch(net, features, labels, loss, trainer, devices, split_f=d2l.split_batch):
    X_shards, y_shards = split_f(features, labels, devices)
    with autograd.record():
        pred_shards = [net(X_shard) for X_shard in X_shards]
        ls = [loss(pred_shard, y_shard) for pred_shard, y_shard
              in zip(pred_shards, y_shards)]
    for l in ls:
        l.backward()
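    # ignore_stale_grad=True: skip (instead of raising an error for) parameters
    # whose gradients were not updated by backward since the last step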
    trainer.step(labels.shape[0], ignore_stale_grad=True)
    train_loss_sum = sum([float(l.sum()) for l in ls])
    train_acc_sum = sum(accuracy(pred_shard, y_shard)
                        for pred_shard, y_shard in zip(pred_shards, y_shards))
    return train_loss_sum, train_acc_sum

def train(net, train_iter, test_iter, loss, trainer, num_epochs,
               devices=d2l.try_all_gpus(), split_f=d2l.split_batch):
    num_batches, timer = len(train_iter), d2l.Timer()
    epochs_lst, loss_lst, train_acc_lst, test_acc_lst = [],[],[],[]
    for epoch in range(num_epochs):
        metric = d2l.Accumulator(4)
        for i, (features, labels) in enumerate(train_iter):
            timer.start()
            l, acc = train_batch(
                net, features, labels, loss, trainer, devices, split_f)
            metric.add(l, acc, labels.shape[0], labels.size)
            timer.stop()
            # log about five times per epoch (max avoids modulo by zero for tiny datasets)
            if (i + 1) % max(1, num_batches // 5) == 0:
                epochs_lst.append(epoch + i / num_batches)
                loss_lst.append(metric[0] / metric[2])
                train_acc_lst.append(metric[1] / metric[3])
        test_acc_lst.append(d2l.evaluate_accuracy_gpus(net, test_iter, split_f))
        print(f"[epoch {epoch+1}] train loss: {metric[0] / metric[2]:.3f} train acc: {metric[1] / metric[3]:.3f}",
              f" test acc: {test_acc_lst[-1]:.3f}")
    print(f'loss {metric[0] / metric[2]:.3f}, train acc '
          f'{metric[1] / metric[3]:.3f}, test acc {test_acc_lst[-1]:.3f}')
    print(f'{metric[2] * num_epochs / timer.sum():.1f} examples/sec on '
          f'{str(devices)}')
    fig = go.Figure()
    fig.add_trace(go.Scatter(x=epochs_lst, y=loss_lst, name='train loss'))
    fig.add_trace(go.Scatter(x=epochs_lst, y=train_acc_lst, name='train acc'))
    fig.add_trace(go.Scatter(x=list(range(1,len(test_acc_lst)+1)), y=test_acc_lst, name='test acc'))
    fig.update_layout(width=800, height=480, xaxis_title='epoch', yaxis_range=[0, 1])
    fig.show()

Load the data. This is fairly memory-intensive, so we use a batch size of 16:

batch_size, crop_size = 16, (320, 480)  # crop (height, width), matching the 320x480 inputs above
train_iter, test_iter = load_data_voc(batch_size, crop_size)

Because the images are fairly large and are all loaded into memory, consider reducing the amount of data if you run out of memory.

num_epochs, lr, wd, devices = 5, 0.1, 1e-3, [npx.gpu()]
loss = gluon.loss.SoftmaxCrossEntropyLoss(axis=1)
net.collect_params().reset_ctx(devices)
trainer = gluon.Trainer(net.collect_params(), 'sgd', { 'learning_rate': lr, 'wd': wd})
train(net, train_iter, test_iter, loss, trainer, num_epochs, devices)

4. Prediction

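During prediction, we need to standardize the input image in each channel and transform it into the four-dimensional input format required by the convolutional neural network.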

def predict(img):
    X = test_iter._dataset.normalize_image(img)
    X = np.expand_dims(X.transpose(2, 0, 1), axis=0)
    pred = net(X.as_in_ctx(devices[0])).argmax(axis=1)
    return pred.reshape(pred.shape[1], pred.shape[2])

def label2image(pred):
    colormap = VOC_COLORMAP.as_in_ctx(devices[0])
    X = pred.astype('int32')
    return colormap[X, :]

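Now read the test data and make predictions. Because the model uses a transposed convolution layer with a stride of 32, when the height or width of the input image is not divisible by 32, the height or width of the transposed convolution output deviates from the size of the input image. To address this, we can crop several rectangular regions whose heights and widths are integer multiples of 32 from the image and run the forward computation on the pixels of each region separately. Together, these regions must completely cover the input image. When a pixel is covered by several regions, the average of the transposed convolution outputs from the different regions can be used as the input to the softmax operation to predict its category. For simplicity, the code below crops only a single 320×480 region from the top-left corner of each test image.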

test_images, test_labels = d2l.read_voc_images(voc_dir, False)
n, imgs = 4, []
for i in range(n):
    crop_rect = (0, 0, 480, 320)
    X = image.fixed_crop(test_images[i], *crop_rect)
    pred = label2image(predict(X))
    imgs += [X, pred, image.fixed_crop(test_labels[i], *crop_rect)]
Image(show_imgs(imgs[::3] + imgs[1::3] + imgs[2::3], 3, n, scale=1.5))

The first row shows the original images, the second row the predictions, and the third row the labels.
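The tiling strategy described before the prediction code can be sketched as follows: cover the image with windows whose sides are multiples of 32 and average the outputs where windows overlap. This is a pure-Python illustration under simplifying assumptions. `tile_starts` and `average_over_tiles` are hypothetical helpers, `predict_tile` stands in for running net on a crop, and only one score channel is averaged; the real model would average each of the num_classes channels before the softmax/argmax.

```python
def tile_starts(length, tile):
    """Start offsets of windows of size `tile` that together cover [0, length).
    Assumes tile <= length; the last window is aligned with the border."""
    assert tile <= length
    starts = list(range(0, length - tile + 1, tile))
    if starts[-1] + tile < length:
        starts.append(length - tile)
    return starts

def average_over_tiles(height, width, tile_h, tile_w, predict_tile):
    """Call predict_tile(y0, x0) for every window (it must return a tile_h x
    tile_w grid of scores) and average the scores where windows overlap."""
    sums = [[0.0] * width for _ in range(height)]
    counts = [[0] * width for _ in range(height)]
    for y0 in tile_starts(height, tile_h):
        for x0 in tile_starts(width, tile_w):
            scores = predict_tile(y0, x0)
            for dy in range(tile_h):
                for dx in range(tile_w):
                    sums[y0 + dy][x0 + dx] += scores[dy][dx]
                    counts[y0 + dy][x0 + dx] += 1
    return [[s / c for s, c in zip(srow, crow)] for srow, crow in zip(sums, counts)]

# Example sizes: a 375 x 500 image covered by 320 x 480 windows
print(tile_starts(375, 320), tile_starts(500, 480))  # [0, 55] [0, 20]

# Toy check: each window "predicts" its own x0 everywhere; overlaps are averaged
print(average_over_tiles(1, 5, 1, 3, lambda y0, x0: [[float(x0)] * 3]))
# [[0.0, 0.0, 1.0, 2.0, 2.0]]
```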

5. References

https://d2l.ai/chapter_computer-vision/fcn.html

6. Code

github