This is not an official style guide for PyTorch. This document summarizes best practices from more than a year of experience with deep learning using the PyTorch framework. Note that the learnings we share come mostly from a research and startup perspective.
This is an open project, and contributors are very welcome to edit and improve the document.
You will find three main parts of this doc. First, a quick recap of best practices in Python, followed by some tips and recommendations using PyTorch. Finally, we share some insights and experiences using other frameworks which helped us generally improve our workflow.
Update 30.4.2019
After so much positive feedback, I also added a summary of commonly used building blocks from our projects at Mirage. You will find building blocks for Self-Attention, Perceptual Loss using VGG, Spectral Normalization, Adaptive Instance Normalization, and more.
Code Snippets for Losses, Layers and other building blocks
From our experience we recommend using Python 3.6+ because of the following features which became very handy for clean and simple code:
We try to follow the Google style guide for Python. Please refer to the well-documented style guide for Python code provided by Google.
We provide here a summary of the most commonly used rules:
From 3.16.4
| Type | Convention | Example |
|---|---|---|
| Packages & Modules | lower_with_under | from prefetch_generator import BackgroundGenerator |
| Classes | CapWords | class DataLoader |
| Instances | lower_with_under | dataset = Dataset |
| Methods & Functions | lower_with_under() | def visualize_tensor() |
| Variables | lower_with_under | background_color='Blue' |
In general, we recommend using an IDE such as Visual Studio Code or PyCharm. VS Code provides syntax highlighting and autocompletion in a relatively lightweight editor, whereas PyCharm has many advanced features for working with remote clusters.
If set up properly this allows you to do the following:
In general, we recommend using Jupyter notebooks for initial exploration and for experimenting with new models and code. Switch to Python scripts as soon as you want to train the model on a bigger dataset, where reproducibility also becomes more important.
Our recommended workflow:
| Jupyter Notebook | Python Scripts |
|---|---|
| + Exploration | + Running longer jobs without interruption |
| + Debugging | + Easy to track changes with git |
| - Can become a huge file | - Debugging mostly means rerunning the whole script |
| - Can be interrupted (don't use for long training) | |
| - Prone to errors and becoming a mess | |
Commonly used libraries:
| Name | Description | Used for |
|---|---|---|
| torch | Base framework for working with neural networks | Creating tensors, networks and training them using backprop |
| torchvision | Datasets, model architectures and image transformations for computer vision | Data preprocessing, augmentation, postprocessing |
| Pillow (PIL) | Python Imaging Library | Loading images and storing them |
| NumPy | Package for scientific computing with Python | Data preprocessing & postprocessing |
| prefetch_generator | Library for background processing | Loading next batch in background during computation |
| tqdm | Progress bar | Progress during training of each epoch |
| torchsummary | Keras-style summary for PyTorch | Displays network, its parameters and sizes at each layer |
| tensorboardX | Tensorboard without tensorflow | Logging experiments and showing them in tensorboard |
Don't put all layers and models into the same file. A best practice is to separate the final networks into a separate file (networks.py) and keep the layers, losses, and ops in their respective files (layers.py, losses.py, ops.py). The finished model (composed of one or multiple networks) should be referenced in a file named after it (e.g. yolov3.py, DCGAN.py).
The main routine and the train and test scripts should only import from the file named after the model.
We recommend breaking up the network into smaller reusable pieces. A network is an nn.Module consisting of operations or other nn.Modules as building blocks. Loss functions are also nn.Modules and can, therefore, be integrated directly into the network.
A class inheriting from nn.Module must have a forward method implementing the forward pass of the respective layer or operation.
An nn.Module can be applied to input data using self.net(input). This simply uses the __call__() method of the object to feed the input through the module.
output = self.net(input)
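As a minimal sketch of the call pattern described above: `Scale` is a hypothetical module made up for this illustration; only `nn.Module`, `forward` and the call syntax come from the text.

```python
import torch
import torch.nn as nn


class Scale(nn.Module):
    """Hypothetical module that multiplies its input by a fixed factor."""

    def __init__(self, factor: float):
        super().__init__()
        self.factor = factor

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The forward pass of this layer: a single multiplication.
        return x * self.factor


net = Scale(2.0)
x = torch.ones(3)
# Calling the module goes through __call__(), which runs forward().
output = net(x)
```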
Use the following pattern for simple networks with a single input and single output:
class ConvBlock(nn.Module):
    def __init__(self):
        super(ConvBlock, self).__init__()
        block = [nn.Conv2d(...)]
        block += [nn.ReLU()]
        block += [nn.BatchNorm2d(...)]
        self.block = nn.Sequential(*block)

    def forward(self, x):
        return self.block(x)


class SimpleNetwork(nn.Module):
    def __init__(self, num_resnet_blocks=6):
        super(SimpleNetwork, self).__init__()
        # here we add the individual layers
        layers = [ConvBlock(...)]
        for i in range(num_resnet_blocks):
            layers += [ResBlock(...)]
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)
Note the following:
class ResnetBlock(nn.Module):
    def __init__(self, dim, padding_type, norm_layer, use_dropout, use_bias):
        super(ResnetBlock, self).__init__()
        self.conv_block = self.build_conv_block(...)

    def build_conv_block(self, ...):
        conv_block = []

        conv_block += [nn.Conv2d(...),
                       norm_layer(...),
                       nn.ReLU()]
        if use_dropout:
            conv_block += [nn.Dropout(...)]

        conv_block += [nn.Conv2d(...),
                       norm_layer(...)]

        return nn.Sequential(*conv_block)

    def forward(self, x):
        out = x + self.conv_block(x)
        return out
Here the skip connection of a ResNet block has been implemented directly in the forward pass. PyTorch allows for dynamic operations during the forward pass.
For a network requiring multiple outputs, such as when building a perceptual loss using a pretrained VGG network, we use the following pattern:
import torch
import torch.nn as nn
from torchvision import models


class Vgg19(nn.Module):
    def __init__(self, requires_grad=False):
        super(Vgg19, self).__init__()
        vgg_pretrained_features = models.vgg19(pretrained=True).features
        self.slice1 = torch.nn.Sequential()
        self.slice2 = torch.nn.Sequential()
        self.slice3 = torch.nn.Sequential()

        # split the pretrained feature extractor into three consecutive slices
        for x in range(7):
            self.slice1.add_module(str(x), vgg_pretrained_features[x])
        for x in range(7, 21):
            self.slice2.add_module(str(x), vgg_pretrained_features[x])
        for x in range(21, 30):
            self.slice3.add_module(str(x), vgg_pretrained_features[x])
        if not requires_grad:
            for param in self.parameters():
                param.requires_grad = False

    def forward(self, x):
        h_relu1 = self.slice1(x)
        h_relu2 = self.slice2(h_relu1)
        h_relu3 = self.slice3(h_relu2)
        out = [h_relu1, h_relu2, h_relu3]
        return out
Note here the following:
Even though PyTorch already has a lot of standard loss functions, it is sometimes necessary to create your own. For this, create a separate file losses.py and extend the nn.Module class to create your custom loss function:
class CustomLoss(nn.Module):
    def __init__(self):
        super(CustomLoss, self).__init__()

    def forward(self, x, y):
        loss = torch.mean((x - y)**2)
        return loss
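A short usage sketch for such a loss, assuming the same mean-squared-error body as above. The tensors `prediction` and `target` are made-up illustration data; the comparison with `F.mse_loss` is only a sanity check and not part of the original pattern.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CustomLoss(nn.Module):
    """Mean squared error written as a custom nn.Module loss."""

    def forward(self, x, y):
        return torch.mean((x - y) ** 2)


criterion = CustomLoss()
# Illustration data (assumed): one of three predictions is off by 2.
prediction = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
target = torch.tensor([1.0, 0.0, 3.0])

loss = criterion(prediction, target)
# The loss is an ordinary tensor in the graph, so backward() works as usual.
loss.backward()
reference = F.mse_loss(prediction, target)
```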
Note that we used the following patterns:
# import statements
import argparse
import os
import time

import numpy as np
import torch
import torch.nn as nn
from torch.utils import data
from torchvision import datasets, transforms
from prefetch_generator import BackgroundGenerator
from tensorboardX import SummaryWriter
from tqdm import tqdm
...

# set flags / seeds
torch.backends.cudnn.benchmark = True
np.random.seed(1)
torch.manual_seed(1)
torch.cuda.manual_seed(1)
...

# start with main code
if __name__ == '__main__':
    # argparse for additional flags for experiment
    parser = argparse.ArgumentParser(description="Train a network for ...")
    ...
    opt = parser.parse_args()

    # add code for datasets (we always use train and validation/test set)
    data_transforms = transforms.Compose([
        transforms.Resize((opt.img_size, opt.img_size)),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
    ])

    train_dataset = datasets.ImageFolder(
        root=os.path.join(opt.path_to_data, "train"),
        transform=data_transforms)
    train_data_loader = data.DataLoader(train_dataset, ...)

    test_dataset = datasets.ImageFolder(
        root=os.path.join(opt.path_to_data, "test"),
        transform=data_transforms)
    test_data_loader = data.DataLoader(test_dataset, ...)
    ...

    # instantiate network (which has been imported from *networks.py*)
    net = MyNetwork(...)
    ...

    # create losses (criterion in pytorch)
    criterion_L1 = torch.nn.L1Loss()
    ...

    # if running on GPU and we want to use cuda move model there
    use_cuda = torch.cuda.is_available()
    if use_cuda:
        net = net.cuda()
    ...

    # create optimizers
    optim = torch.optim.Adam(net.parameters(), lr=opt.lr)
    ...

    # load checkpoint if needed/wanted
    start_n_iter = 0
    start_epoch = 0
    if opt.resume:
        ckpt = load_checkpoint(opt.path_to_checkpoint)  # custom method for loading last checkpoint
        net.load_state_dict(ckpt['net'])
        start_epoch = ckpt['epoch']
        start_n_iter = ckpt['n_iter']
        optim.load_state_dict(ckpt['optim'])
        print("last checkpoint restored")
        ...

    # if we want to run experiment on multiple GPUs we move the models there
    net = torch.nn.DataParallel(net)
    ...

    # typically we use tensorboardX to keep track of experiments
    writer = SummaryWriter(...)

    # now we start the main loop
    n_iter = start_n_iter
    for epoch in range(start_epoch, opt.epochs):
        # set models to train mode
        net.train()
        ...

        # use prefetch_generator and tqdm for iterating through data
        pbar = tqdm(enumerate(BackgroundGenerator(train_data_loader, ...)),
                    total=len(train_data_loader))
        start_time = time.time()

        # for loop going through dataset
        # (the loop variable is called batch so it does not shadow the data module)
        for i, batch in pbar:
            # data preparation
            img, label = batch
            if use_cuda:
                img = img.cuda()
                label = label.cuda()
            ...

            # It's very good practice to keep track of preparation time and
            # computation time using tqdm to find any issues in your dataloader
            prepare_time = time.time() - start_time

            # forward and backward pass
            optim.zero_grad()
            ...
            loss.backward()
            optim.step()
            ...

            # update tensorboardX
            writer.add_scalar(..., n_iter)
            ...

            # compute computation time and *compute_efficiency*
            process_time = time.time() - start_time - prepare_time
            pbar.set_description("Compute efficiency: {:.2f}, epoch: {}/{}:".format(
                process_time/(process_time+prepare_time), epoch, opt.epochs))
            start_time = time.time()

        # maybe do a test pass every x epochs
        if epoch % x == x-1:
            # bring models to evaluation mode
            net.eval()
            ...
            # do some tests
            pbar = tqdm(enumerate(BackgroundGenerator(test_data_loader, ...)),
                        total=len(test_data_loader))
            for i, batch in pbar:
                ...

            # save checkpoint if needed
            ...
There are two distinct patterns in PyTorch for using multiple GPUs for training. In our experience both patterns are valid. The first one, however, results in cleaner and shorter code. The second one seems to have a slight performance advantage due to less communication between the GPUs. I asked a question in the official PyTorch forum about the two approaches here
The most common pattern is to simply split the batches of all networks across the individual GPUs.
A model running on 1 GPU with batch size 64 would, therefore, run on 2 GPUs with a batch size of 32 each. This can be done automatically by wrapping the model with nn.DataParallel(model).
The second pattern is less commonly used. A repository implementing this approach is shown here in the pix2pixHD implementation by Nvidia
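The batch-splitting pattern can be sketched as follows. The `nn.Linear` model and the random batch are placeholders chosen for this illustration; on a machine without GPUs the wrapper simply forwards to the wrapped module, so the sketch also runs on CPU.

```python
import torch
import torch.nn as nn

# Placeholder model (assumption): any nn.Module works the same way.
model = nn.Linear(8, 2)
use_cuda = torch.cuda.is_available()
if use_cuda:
    model = model.cuda()

# DataParallel scatters the batch dimension across the visible GPUs,
# runs a replica on each device and gathers the outputs again.
parallel_model = nn.DataParallel(model)

batch = torch.randn(64, 8)
if use_cuda:
    batch = batch.cuda()

# With 2 GPUs each replica would see 32 of the 64 samples.
out = parallel_model(batch)
```

The training code does not change otherwise: the wrapped model is called exactly like the original one.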
NumPy runs on the CPU only, so NumPy code inside the training loop cannot use the GPU and is usually slower than the equivalent torch code. Since torch was designed to be similar to NumPy, most NumPy functions already have a PyTorch counterpart.
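To illustrate how closely the two APIs match, here is a small sketch that computes the same result once with NumPy and once with torch. The input `values` is made-up illustration data.

```python
import numpy as np
import torch

# Illustration data (assumed).
values = [[-1.0, 0.5], [2.0, 3.0]]

# NumPy version: clip to [0, 1], then average over the first axis.
np_result = np.clip(np.array(values), 0.0, 1.0).mean(axis=0)

# torch version: the same operations, which could also run on a GPU tensor.
torch_result = torch.clamp(torch.tensor(values), 0.0, 1.0).mean(dim=0)
```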
The data loading pipeline should be independent of your main training code. PyTorch uses background workers for loading the data more efficiently and without disturbing the main training process.
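A minimal sketch of a data pipeline kept separate from the training code. `SquaresDataset` is a hypothetical toy dataset; in a real project it would live in its own file and load images or other samples.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class SquaresDataset(Dataset):
    """Hypothetical toy dataset yielding (x, x**2) pairs."""

    def __init__(self, n: int):
        self.n = n

    def __len__(self) -> int:
        return self.n

    def __getitem__(self, idx: int):
        x = torch.tensor([float(idx)])
        return x, x ** 2


# num_workers=0 loads in the main process; a value > 0 starts that many
# background worker processes (it is kept at 0 here so the sketch runs anywhere).
loader = DataLoader(SquaresDataset(10), batch_size=4, shuffle=False, num_workers=0)
batches = list(loader)
```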
Typically we train our models for thousands of steps. Therefore, it is enough to log the loss and other results every n-th step to reduce the overhead. Saving intermediate results as images in particular can be costly during training.
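A sketch of logging only every n-th step. The interval `log_every` and the stand-in loss value are assumptions for illustration; in a real loop the append would be a call to your logger.

```python
log_every = 100  # assumed logging interval
logged_steps = []

for n_iter in range(1000):
    loss_value = 1.0 / (n_iter + 1)  # stand-in for the real training loss
    # Only every log_every-th iteration pays the logging overhead.
    if n_iter % log_every == 0:
        logged_steps.append((n_iter, loss_value))
```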
It's very handy to use command-line arguments to set parameters during code execution (batch size, learning rate, etc.). An easy way to keep track of the arguments for an experiment is to write the namespace returned by parse_args to a file:
# saves arguments to config.txt file
opt = parser.parse_args()
with open("config.txt", "w") as f:
    f.write(str(opt))
...
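As a variation on the snippet above, the arguments can also be stored as JSON so they can be read back later. This is a sketch, not the authors' method: the argument names, the file name `config.json` and the temporary directory are assumptions for illustration.

```python
import argparse
import json
import os
import tempfile

parser = argparse.ArgumentParser(description="Hypothetical training script")
parser.add_argument("--batch_size", type=int, default=64)
parser.add_argument("--lr", type=float, default=1e-3)
# An explicit argument list stands in for the real command line here.
opt = parser.parse_args(["--lr", "0.01"])

# vars() turns the argparse Namespace into a plain dictionary.
config_path = os.path.join(tempfile.mkdtemp(), "config.json")
with open(config_path, "w") as f:
    json.dump(vars(opt), f, indent=2)

# Reading the file back restores the exact experiment settings.
with open(config_path) as f:
    restored = json.load(f)
```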
PyTorch keeps track of all operations involving tensors for automatic differentiation. Use .detach() to prevent recording of unnecessary operations.
You can print variables directly; however, it's recommended to use variable.detach() or variable.item(). In PyTorch versions before 0.4 you had to use .data to access the tensor of a variable.
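A small sketch of the difference between the two calls, using a made-up parameter `w`: `.item()` returns a plain Python number, while `.detach()` returns a tensor that is no longer tracked by autograd.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)
loss = (w ** 2).sum()

# .item() converts a one-element tensor to a Python float (handy for logging).
loss_value = loss.item()

# .detach() returns a tensor sharing the data but cut off from the graph.
detached = loss.detach()
```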
Calling net.forward(input) directly and calling net(input) are not identical, as pointed out in one of the issues here:
output = self.net.forward(input)
# they are not equal!
output = self.net(input)
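One difference that is easy to demonstrate (it may not be the only one the linked issue discusses) is that hooks registered on a module only run when the module is called, not when `forward` is invoked directly. The `nn.Linear` layer and the recording hook below are illustration choices.

```python
import torch
import torch.nn as nn

net = nn.Linear(2, 2)
calls = []
# The hook only records that it ran; it returns None, so the output is unchanged.
handle = net.register_forward_hook(
    lambda module, inputs, output: calls.append("hook"))

x = torch.zeros(1, 2)

net.forward(x)  # bypasses __call__, so the hook is skipped
hooks_after_forward = len(calls)

net(x)  # goes through __call__, so the hook runs
hooks_after_call = len(calls)

handle.remove()
```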
We recommend setting the following seeds at the beginning of your code:
torch.manual_seed(1)
torch.cuda.manual_seed(1)
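The seeds can be bundled in a small helper. `set_seed` is a hypothetical name, and seeding Python's `random` and NumPy as well is an addition to the two lines above for scripts that also use those libraries.

```python
import random

import numpy as np
import torch


def set_seed(seed: int) -> None:
    """Hypothetical helper that seeds every random number generator in use."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    # Silently ignored when CUDA is not available.
    torch.cuda.manual_seed_all(seed)


set_seed(1)
first = torch.rand(3)
set_seed(1)
second = torch.rand(3)  # same values as `first` after re-seeding
```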
On Nvidia GPUs you can add the following line at the beginning of your code. It lets the cuDNN backend benchmark different convolution algorithms during the first execution and pick the fastest ones for your input shapes. However, be aware that if the network input/output tensor size changes, this tuning runs again each time a change occurs. This can lead to very slow runtime and out-of-memory errors. Only set this flag if your input and output always have the same shape. In our experience, this usually results in an improvement of about 20%.
torch.backends.cudnn.benchmark = True
The compute efficiency depends on the machine used, the preprocessing pipeline and the network size. Running from an SSD on a 1080Ti GPU we see a compute efficiency of almost 1.0, which is the ideal scenario. If shallow (small) networks or a slow hard disk are used, the number may drop to around 0.1-0.2, depending on your setup.
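The compute efficiency reported in the training loop is the share of each iteration spent on computation rather than on waiting for data. A sketch of that calculation, with `time.sleep` standing in for data loading and the forward/backward pass (the helper name `compute_efficiency` and the sleep durations are assumptions):

```python
import time


def compute_efficiency(prepare_time: float, process_time: float) -> float:
    """Fraction of an iteration spent computing (1.0 means no waiting for data)."""
    total = prepare_time + process_time
    return process_time / total if total > 0 else 0.0


efficiencies = []
start_time = time.perf_counter()
for step in range(3):
    time.sleep(0.001)  # stand-in for waiting on the data loader
    prepare_time = time.perf_counter() - start_time

    time.sleep(0.003)  # stand-in for the forward and backward pass
    process_time = time.perf_counter() - start_time - prepare_time

    efficiencies.append(compute_efficiency(prepare_time, process_time))
    start_time = time.perf_counter()
```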
In PyTorch we can implement virtual batch sizes very easily. We just prevent the optimizer from updating the parameters and sum up the gradients for batch_size iterations (here batch_size is the number of accumulated iterations, not the size of a single mini-batch).
# in the main loop
out = net(input)
loss = criterion(out, label)
# we just call backward to sum up gradients but don't perform step here
loss.backward()
total_loss += loss.item() / batch_size
if n_iter % batch_size == batch_size-1:
    # here we perform our optimization step using a virtual batch size
    optim.step()
    optim.zero_grad()
    print('Total loss: ', total_loss)
    total_loss = 0.0
...
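A self-contained sketch of gradient accumulation on toy data. Unlike the snippet above, it divides each loss by the number of accumulation steps so that the accumulated gradient matches the gradient of one large batch; that scaling, the model, and the data are assumptions made for this illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Linear(4, 1)
criterion = nn.MSELoss()
optim = torch.optim.SGD(net.parameters(), lr=0.1)

inputs = torch.randn(8, 4)
labels = torch.randn(8, 1)
accumulation_steps = 4  # virtual batch of 8 = 4 mini-batches of 2

# Reference: gradient of the full batch in one pass.
criterion(net(inputs), labels).backward()
full_batch_grad = net.weight.grad.clone()
optim.zero_grad()

mini_batches = zip(inputs.split(2), labels.split(2))
for n_iter, (x, y) in enumerate(mini_batches):
    # Scale so the summed gradients equal the mean over the virtual batch.
    loss = criterion(net(x), y) / accumulation_steps
    loss.backward()  # gradients are added to the existing .grad
    if n_iter % accumulation_steps == accumulation_steps - 1:
        accumulated_grad = net.weight.grad.clone()
        optim.step()
        optim.zero_grad()
```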
We can access the learning rate directly using the instantiated optimizer as shown here:
for param_group in optim.param_groups:
    old_lr = param_group['lr']
    new_lr = old_lr * 0.1
    param_group['lr'] = new_lr
    print('Updated lr from {} to {}'.format(old_lr, new_lr))
...
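The same idea wrapped in a reusable function. `scale_learning_rate` is a hypothetical helper name, and the model and optimizer are placeholders.

```python
import torch
import torch.nn as nn

net = nn.Linear(2, 1)
optim = torch.optim.Adam(net.parameters(), lr=0.01)


def scale_learning_rate(optimizer, factor):
    """Hypothetical helper: multiply the lr of every parameter group by factor."""
    new_rates = []
    for param_group in optimizer.param_groups:
        param_group["lr"] *= factor
        new_rates.append(param_group["lr"])
    return new_rates


new_rates = scale_learning_rate(optim, 0.1)
```

For schedules that follow a fixed rule, the schedulers in torch.optim.lr_scheduler (for example StepLR) adjust the same param_groups entries for you.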
If you want to use a pretrained model such as VGG to compute a loss but not train it (e.g. Perceptual loss in style-transfer/ GANs/ Auto-encoder) you can use the following pattern:
# instantiate the model
pretrained_VGG = Vgg19(...)

# disable gradients (prevent training)
for p in pretrained_VGG.parameters():  # reset requires_grad
    p.requires_grad = False
...
# you don't have to use the no_grad() context manager but can just run the model
# no gradients will be computed for the VGG parameters
out_real = pretrained_VGG(input_a)
out_fake = pretrained_VGG(input_b)
loss = any_criterion(out_real, out_fake)
...
The methods .train() and .eval() are used to switch layers such as BatchNorm2d or Dropout2d between training and inference mode. Every module which inherits from nn.Module has an attribute called training. .train() and .eval() simply set this attribute to True or False, respectively, on the module and all its submodules. For more information on how this is implemented, please have a look at the module code in PyTorch
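A short sketch showing the flag being toggled on a small placeholder network. In evaluation mode dropout is disabled, so two passes over the same input give the same output.

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(4, 4), nn.Dropout(p=0.5))
x = torch.ones(2, 4)

net.eval()  # sets .training = False on the module and all submodules
eval_flags = [m.training for m in net.modules()]
with torch.no_grad():
    # Dropout is inactive in eval mode, so the output is deterministic.
    same_output = torch.equal(net(x), net(x))

net.train()  # sets .training = True again
train_flags = [m.training for m in net.modules()]
```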
Make sure that no gradients get computed and stored during your code execution. You can simply use the following pattern to ensure that:
with torch.no_grad():
    # run model here
    out_tensor = net(in_tensor)
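A runnable version of this pattern with a placeholder model, showing that the output produced inside the block is not attached to any computation graph:

```python
import torch
import torch.nn as nn

net = nn.Linear(3, 1)  # placeholder model (assumption)
in_tensor = torch.randn(5, 3)

net.eval()
with torch.no_grad():
    # No graph is recorded here, which saves memory during inference.
    out_tensor = net(in_tensor)

# Outside the block the same call is tracked by autograd again.
tracked = net(in_tensor)
```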
In PyTorch you can freeze layers. This will prevent them from being updated during an optimization step.
# you can freeze whole modules using
for p in pretrained_VGG.parameters():  # reset requires_grad
    p.requires_grad = False
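A sketch of freezing one part of a model while training another. The two `nn.Linear` layers stand in for a pretrained backbone and a trainable head; passing only the trainable parameters to the optimizer is a common convention, not something the original snippet requires.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
backbone = nn.Linear(4, 4)  # stand-in for a pretrained network
head = nn.Linear(4, 1)      # the part we actually want to train

# Freeze the backbone.
for p in backbone.parameters():
    p.requires_grad = False

model = nn.Sequential(backbone, head)
trainable = [p for p in model.parameters() if p.requires_grad]
optim = torch.optim.SGD(trainable, lr=0.1)

frozen_before = backbone.weight.detach().clone()
head_before = head.weight.detach().clone()

loss = model(torch.randn(8, 4)).pow(2).mean()
loss.backward()
optim.step()  # only the head is updated
```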
Since PyTorch 0.4, Variable and Tensor have been merged. We don't have to explicitly create a Variable object anymore.
The C++ version is about 10% faster.
From our experience you can gain about 20% speed-up. But the first time you run your model it takes quite some time to build the optimized graph. In some cases (loops in forward pass, no fixed input shape, if/else in forward, etc.) this flag might result in out of memory or other errors.
It frees a tensor from a computation graph. A nice illustration is shown here