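Self-supervised image paper reproduction | BYOL (PyTorch) | 2020

This continues the previous post, which walked through the paper and architecture of the Bootstrap Your Own Latent (BYOL) self-supervised model:
https://juejin.cn/post/6922347006144970760

Now let's look at how this architecture is implemented in PyTorch, and deepen our understanding of the paper along the way.
github: https://github.com/lucidrains/byol-pytorch

[Note]: I have not actually run this code. After all, I am just a poor soul without a GPU.

Main model code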

class BYOL(nn.Module):
    def __init__(
        self,
        net,
        image_size,
        hidden_layer = -2,
        projection_size = 256,
        projection_hidden_size = 4096,
        augment_fn = None,
        augment_fn2 = None,
        moving_average_decay = 0.99,
        use_momentum = True
    ):
        super().__init__()
        self.net = net

        # default SimCLR augmentation

        DEFAULT_AUG = torch.nn.Sequential(
            RandomApply(
                T.ColorJitter(0.8, 0.8, 0.8, 0.2),
                p = 0.3
            ),
            T.RandomGrayscale(p=0.2),
            T.RandomHorizontalFlip(),
            RandomApply(
                T.GaussianBlur((3, 3), (1.0, 2.0)),
                p = 0.2
            ),
            T.RandomResizedCrop((image_size, image_size)),
            T.Normalize(
                mean=torch.tensor([0.485, 0.456, 0.406]),
                std=torch.tensor([0.229, 0.224, 0.225])),
        )

        self.augment1 = default(augment_fn, DEFAULT_AUG)
        self.augment2 = default(augment_fn2, self.augment1)

        self.online_encoder = NetWrapper(net, projection_size, projection_hidden_size, layer=hidden_layer)

        self.use_momentum = use_momentum
        self.target_encoder = None
        self.target_ema_updater = EMA(moving_average_decay)

        self.online_predictor = MLP(projection_size, projection_size, projection_hidden_size)

        # get device of network and make wrapper same device
        device = get_module_device(net)
        self.to(device)

        # send a mock image tensor to instantiate singleton parameters
        self.forward(torch.randn(2, 3, image_size, image_size, device=device))

    @singleton('target_encoder')
    def _get_target_encoder(self):
        target_encoder = copy.deepcopy(self.online_encoder)
        set_requires_grad(target_encoder, False)
        return target_encoder

    def reset_moving_average(self):
        del self.target_encoder
        self.target_encoder = None

    def update_moving_average(self):
        assert self.use_momentum, 'you do not need to update the moving average, since you have turned off momentum for the target encoder'
        assert self.target_encoder is not None, 'target encoder has not been created yet'
        update_moving_average(self.target_ema_updater, self.target_encoder, self.online_encoder)

    def forward(self, x, return_embedding = False):
        if return_embedding:
            return self.online_encoder(x)

        image_one, image_two = self.augment1(x), self.augment2(x)

        online_proj_one, _ = self.online_encoder(image_one)
        online_proj_two, _ = self.online_encoder(image_two)

        online_pred_one = self.online_predictor(online_proj_one)
        online_pred_two = self.online_predictor(online_proj_two)

        with torch.no_grad():
            target_encoder = self._get_target_encoder() if self.use_momentum else self.online_encoder
            target_proj_one, _ = target_encoder(image_one)
            target_proj_two, _ = target_encoder(image_two)
            target_proj_one.detach_()
            target_proj_two.detach_()

        loss_one = loss_fn(online_pred_one, target_proj_two.detach())
        loss_two = loss_fn(online_pred_two, target_proj_one.detach())

        loss = loss_one + loss_two
        return loss.mean()
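  • Start with forward(): it takes a batch of images as input and returns the loss computed on that batch.
  • For inference, set return_embedding=True. forward() then returns the output of online_encoder directly and the predictor is skipped. Note that what the code calls the encoder is the paper's encoder + projector, so in this version the call returns a (projection, representation) pair.
  • The input is passed through self.augment1 and self.augment2 to produce two different images. In the previous post we called these views.
  • Both images go through online_encoder. You might ask: shouldn't one image go through the online network and the other through the target network? That is correct. Both views pass through both networks only because the loss is symmetric, and the symmetric loss needs each view to go through the online network and the target network.
  • Nothing computed in the target network needs gradients, because the target network's parameters are derived from the online network's parameters rather than trained by backpropagation.
  • If self.use_momentum=False, the paper's target-update rule is not used and the online encoder itself serves as the target network. With self.use_momentum=True it may look as if the online network is simply copied to the target as well, but _get_target_encoder is decorated with @singleton, so the deepcopy happens only on the first call. After that the target is expected to be updated through update_moving_average(), which is not called inside forward() and has to be called from the training loop.
  • Finally, the loss is computed with loss_fn and forward() returns loss.mean().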
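The lazily created target network and its moving-average update can be sketched as follows. This is only an illustration under assumptions: the post does not show the repository's `singleton` decorator, `EMA` class, or `update_moving_average` helper, so the `singleton`, `ema_update`, and `Toy` names below are hypothetical stand-ins and not the repository's code.

```python
import copy
from functools import wraps

import torch
from torch import nn


def singleton(cache_key):
    """Hypothetical reconstruction: cache a method's result on `self`."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(self, *args, **kwargs):
            cached = getattr(self, cache_key)
            if cached is not None:
                return cached  # later calls reuse the first result
            instance = fn(self, *args, **kwargs)
            setattr(self, cache_key, instance)
            return instance
        return wrapper
    return decorator


@torch.no_grad()
def ema_update(target, online, decay=0.99):
    """Hypothetical helper: target <- decay * target + (1 - decay) * online."""
    for t, o in zip(target.parameters(), online.parameters()):
        t.mul_(decay).add_(o, alpha=1 - decay)


class Toy(nn.Module):
    """Tiny stand-in for BYOL: one online layer and a lazily created target."""
    def __init__(self):
        super().__init__()
        self.online = nn.Linear(4, 2)
        self.target = None

    @singleton('target')
    def get_target(self):
        target = copy.deepcopy(self.online)  # executed only on the first call
        for p in target.parameters():
            p.requires_grad = False
        return target


toy = Toy()
target = toy.get_target()
before = target.weight.clone()

with torch.no_grad():
    toy.online.weight.add_(1.0)  # pretend an optimizer step moved the online weights

ema_update(target, toy.online, decay=0.99)  # target moves 1% of the way toward online
```

In a real training loop, the EMA step would presumably follow each optimizer step, which is why the model exposes `update_moving_average()` as a separate method.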

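So far, the BYOL structure turns out to be quite simple. Four questions remain open:

  • How is online_encoder defined?
  • How is the predictor defined?
  • How is the image augmentation defined?
  • How is the loss function loss_fn defined?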

augment

The code above contains this part:

# default SimCLR augmentation

        DEFAULT_AUG = torch.nn.Sequential(
            RandomApply(
                T.ColorJitter(0.8, 0.8, 0.8, 0.2),
                p = 0.3
            ),
            T.RandomGrayscale(p=0.2),
            T.RandomHorizontalFlip(),
            RandomApply(
                T.GaussianBlur((3, 3), (1.0, 2.0)),
                p = 0.2
            ),
            T.RandomResizedCrop((image_size, image_size)),
            T.Normalize(
                mean=torch.tensor([0.485, 0.456, 0.406]),
                std=torch.tensor([0.229, 0.224, 0.225])),
        )

        self.augment1 = default(augment_fn, DEFAULT_AUG)
        self.augment2 = default(augment_fn2, self.augment1)

As we can see:

  • This is the image augmentation pipeline. augment1 and augment2 can be customized; by default both are the DEFAULT_AUG shown above.
  • T comes from: from torchvision import transforms as T

The least familiar piece is probably torchvision.transforms.ColorJitter().

According to the official API documentation, it randomly changes the brightness, contrast, saturation, and hue of an image.

encoder+projector

class NetWrapper(nn.Module):
    def __init__(self, net, projection_size, projection_hidden_size, layer = -2):
        super().__init__()
        self.net = net
        self.layer = layer

        self.projector = None
        self.projection_size = projection_size
        self.projection_hidden_size = projection_hidden_size

        self.hidden = None
        self.hook_registered = False

    def _find_layer(self):
        if type(self.layer) == str:
            modules = dict([*self.net.named_modules()])
            return modules.get(self.layer, None)
        elif type(self.layer) == int:
            children = [*self.net.children()]
            return children[self.layer]
        return None

    def _hook(self, _, __, output):
        self.hidden = flatten(output)

    def _register_hook(self):
        layer = self._find_layer()
        assert layer is not None, f'hidden layer ({self.layer}) not found'
        handle = layer.register_forward_hook(self._hook)
        self.hook_registered = True

    @singleton('projector')
    def _get_projector(self, hidden):
        _, dim = hidden.shape
        projector = MLP(dim, self.projection_size, self.projection_hidden_size)
        return projector.to(hidden)

    def get_representation(self, x):
        if self.layer == -1:
            return self.net(x)

        if not self.hook_registered:
            self._register_hook()

        _ = self.net(x)
        hidden = self.hidden
        self.hidden = None
        assert hidden is not None, f'hidden layer {self.layer} never emitted an output'
        return hidden

    def forward(self, x, return_embedding = False):
        representation = self.get_representation(x)

        if return_embedding:
            return representation

        projector = self._get_projector(representation)
        projection = projector(representation)
        return projection, representation

This is the basic encoder+projector module: it wraps both the encoder and the projector.

encoder

The encoder is passed in as an argument when NetWrapper is constructed, so I checked the training script and found the model is:

from torchvision import models, transforms
resnet = models.resnet50(pretrained=True)

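So the encoder is a ResNet-50, as in the paper. The full ResNet outputs a (batch_size, 1000) tensor of class logits. With the default layer = -2, however, NetWrapper hooks the second-to-last child module, which in torchvision's ResNet-50 is the average-pooling layer, so the representation passed to the projector is the flattened (batch_size, 2048) feature rather than the logits.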

projector

The projector is built from this MLP class:

class MLP(nn.Module):
    def __init__(self, dim, projection_size, hidden_size = 4096):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden_size),
            nn.BatchNorm1d(hidden_size),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_size, projection_size)
        )

    def forward(self, x):
        return self.net(x)

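The structure is fully connected layer + BN + activation, followed by a final fully connected layer with no BN or ReLU after it, which is close to what the paper describes. The MLP returns a tensor of shape (batch_size, projection_size).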

predictor

self.online_predictor = MLP(projection_size, projection_size, projection_hidden_size)

The predictor has exactly the same structure as the projector. The only difference is that its input and output feature sizes are both projection_size.
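A quick shape check makes the projector and predictor dimensions concrete. The MLP class is copied from the listing above; the input width of 2048 is an assumption that corresponds to a ResNet-50 pooled feature, and the batch size of 4 is arbitrary.

```python
import torch
from torch import nn


class MLP(nn.Module):
    """Same structure as the MLP shown above: Linear -> BN -> ReLU -> Linear."""
    def __init__(self, dim, projection_size, hidden_size=4096):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden_size),
            nn.BatchNorm1d(hidden_size),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_size, projection_size),
        )

    def forward(self, x):
        return self.net(x)


projector = MLP(2048, 256, 4096)  # representation -> projection (2048 is assumed)
predictor = MLP(256, 256, 4096)   # projection -> prediction, same width in and out

representation = torch.randn(4, 2048)  # fake encoder output for a batch of 4
projection = projector(representation)
prediction = predictor(projection)
```

Both `projection` and `prediction` have shape (4, 256), so the prediction can be compared directly against the target network's projection.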

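I have not read the self-supervised literature systematically, and BYOL is the first paper in this area I have looked at, so I cannot fully explain why the predictor exists. Judging from how it behaves, it seems to keep the online and target networks from having exactly the same structure. If they were identical, the two models could end up producing exactly the same outputs, that is, the trivial case where loss = 0.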

loss_fn

def loss_fn(x, y):
    x = F.normalize(x, dim=-1, p=2)
    y = F.normalize(y, dim=-1, p=2)
    return 2 - 2 * (x * y).sum(dim=-1)

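This matches the paper: after L2 normalization, 2 - 2 * (x * y).sum(dim=-1) equals 2 minus twice the cosine similarity, which is the same as the squared Euclidean distance between the normalized vectors.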
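The equivalence between this loss, the squared distance of normalized vectors, and cosine similarity can be verified numerically. This is a small check on random tensors; the batch size 8 and feature size 256 are arbitrary choices.

```python
import torch
import torch.nn.functional as F


def loss_fn(x, y):
    # Same definition as above
    x = F.normalize(x, dim=-1, p=2)
    y = F.normalize(y, dim=-1, p=2)
    return 2 - 2 * (x * y).sum(dim=-1)


x = torch.randn(8, 256)
y = torch.randn(8, 256)

# Squared Euclidean distance between the L2-normalized vectors
xn = F.normalize(x, dim=-1, p=2)
yn = F.normalize(y, dim=-1, p=2)
squared_distance = ((xn - yn) ** 2).sum(dim=-1)

# Cosine similarity of the raw vectors
cosine = F.cosine_similarity(x, y, dim=-1)

loss = loss_fn(x, y)          # one value per sample, each in [0, 4]
self_loss = loss_fn(x, x)     # identical inputs give (numerically) zero loss
```

Because ||a - b||² = ||a||² + ||b||² - 2 a·b and both norms are 1 after normalization, `loss`, `squared_distance`, and `2 - 2 * cosine` agree up to floating-point error.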

In summary, BYOL is a simple yet interesting self-supervised framework.
