Introduction

PixelCNN was introduced by DeepMind as one of three autoregressive models it released. Several iterations have followed the original PixelCNN to improve speed and efficiency. In this article, we'll go through the basic PixelCNN.

This article is an excerpt from the book PyTorch Deep Learning Hands-On by Sherin Thomas and Sudhanshu Passi. This PyTorch publication has numerous examples and dynamic AI applications and demonstrates the simplicity and efficiency of the PyTorch approach to machine intelligence and deep learning.

PixelCNN generates one pixel at a time, and each new pixel is predicted from the pixels generated before it: the second pixel depends on the first, the third depends on the first two, and so on.

Figure 1.1: Images generated from PixelCNN

PixelCNN is a probabilistic density model that learns the distribution of images and generates new images by sampling from that distribution. It does so by conditioning each generated pixel on all the previously generated pixels, so the joint probability of an image is the product of these per-pixel conditional probabilities.
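The factorization described above can be written out explicitly. The notation here is an assumption made for illustration: an image x is treated as n × n pixels x_1, ..., x_{n^2} taken in raster order (row by row, left to right).

```latex
p(\mathbf{x}) = \prod_{i=1}^{n^2} p\left(x_i \mid x_1, \dots, x_{i-1}\right)
```

Each factor is the distribution of one pixel given every pixel that comes before it, which is exactly what the network is trained to predict.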

PixelCNN uses convolutional layers to define the receptive field of each pixel, which lets it read the whole input in parallel instead of pixel by pixel. Consider an image partially occluded by something; let's say we only have half of the image and our algorithm needs to generate the second half. PixelCNN takes in the image in a single shot through the convolutional layers. However, generation in PixelCNN still has to be sequential. You might be wondering how only half of the image goes through the convolution; the answer is masked convolution, which we'll explain later.

Figure 1.2 shows how the convolution operation is applied on a set of pixels to predict the center pixel. The main advantage of autoregressive models over other models is that the joint probability is tractable and can be learned using gradient descent. There is no approximation and no workaround; we just try to predict each pixel value given all the previous pixel values, and training is completely backed by backpropagation. However, autoregressive models struggle with scalability because generation is always sequential. PixelCNN is architected so that the joint probability of an image is the product of the individual conditional probabilities, each conditioned on all the previous pixels. In an RNN model, this is the default behavior, but the CNN model achieves it by using a cleverly designed mask.
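As a small numeric illustration of taking the product of individual probabilities, the sketch below uses three made-up conditional probabilities (they are assumptions, not values from any trained model) and shows that the joint probability and the summed log-probabilities describe the same quantity.

```python
import math

# Hypothetical conditional probabilities the model assigned to the
# observed values of three consecutive pixels: p(x1), p(x2|x1), p(x3|x1,x2)
conditionals = [0.5, 0.25, 0.8]

# Joint probability of the three pixels is the product of the conditionals
joint = math.prod(conditionals)

# In practice the logs are summed instead, which avoids numerical underflow
# when an image has thousands of pixels
log_likelihood = sum(math.log(p) for p in conditionals)

print(joint, math.exp(log_likelihood))
```

Minimizing the negative of this summed log-probability is what a per-pixel cross-entropy loss does during training.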

PixelCNN captures the dependencies between pixels directly in its parameters, unlike other approaches. VAEs, for example, learn the distribution by generating a hidden latent vector, which introduces independence assumptions. In PixelCNN, the dependencies learned are not just between the previous pixels but also between the different channels; in a normal color image, these are red, green, and blue (RGB).

Figure 1.2: Predicting pixel value from surrounding pixels

There is a fundamental problem: what if the CNN tries to use the current pixel or the future pixels to learn the current pixel? This is also managed by the mask, which applies the same restriction down to the channel level. For instance, the current pixel's red channel won't learn from the current pixel but will learn from the previous pixels. The green channel, however, can use the current red channel and all the previous pixels. Similarly, the blue channel can learn from both the green and red channels of the current pixel, as well as all the previous pixels.
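The channel ordering described above can be sketched as a small rule. The function name and the "A"/"B" labels below are illustrative assumptions: "A" follows the strict rule from this paragraph, and "B" is the relaxed variant in which a channel may also see itself. Note that the single-channel model shown later in this article does not need this channel split.

```python
CHANNELS = ("R", "G", "B")

def can_see_current_pixel(mask_type, out_channel, in_channel):
    """Return True if predicting out_channel may read in_channel
    of the same pixel (hypothetical helper for illustration)."""
    out_index = CHANNELS.index(out_channel)
    in_index = CHANNELS.index(in_channel)
    if mask_type == "A":
        # Strict rule: only channels that come earlier in the R, G, B order
        return in_index < out_index
    # Relaxed rule: earlier channels and the channel itself
    return in_index <= out_index

for out_channel in CHANNELS:
    visible = [c for c in CHANNELS if can_see_current_pixel("A", out_channel, c)]
    print(out_channel, "can use", visible)
```

Under the strict rule, red sees nothing of the current pixel, green sees red, and blue sees red and green, matching the description in the text.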

The strict form of this restriction is needed only in the first layer. The later layers don't need that safeguard against the current pixel, although they still need to emulate sequential learning while doing the parallel convolution operation. So, the PixelCNN paper (https://arxiv.org/pdf/1606.05328.pdf) uses two types of masks across the network: type A and type B.

One major architectural difference that makes PixelCNN stand out from other traditional CNN models is the absence of the pooling layers. Since the aim of the PixelCNN is not capturing the essence of the image in a dimensionally reduced form and we cannot afford losing context through pooling, the authors deliberately removed the pooling layer.

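import torch.nn as nn

fm = 64  # number of feature maps in every hidden layer

net = nn.Sequential(
    # First layer uses a type A mask, so the current pixel stays hidden
    MaskedConv2d('A', 1, fm, 7, 1, 3, bias=False),
    nn.BatchNorm2d(fm), nn.ReLU(True),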
    MaskedConv2d('B', fm, fm, 7, 1, 3, bias=False),
    nn.BatchNorm2d(fm), nn.ReLU(True),
    MaskedConv2d('B', fm, fm, 7, 1, 3, bias=False),
    nn.BatchNorm2d(fm), nn.ReLU(True),
    MaskedConv2d('B', fm, fm, 7, 1, 3, bias=False),
    nn.BatchNorm2d(fm), nn.ReLU(True),
    MaskedConv2d('B', fm, fm, 7, 1, 3, bias=False),
    nn.BatchNorm2d(fm), nn.ReLU(True),
    MaskedConv2d('B', fm, fm, 7, 1, 3, bias=False),
    nn.BatchNorm2d(fm), nn.ReLU(True),
    MaskedConv2d('B', fm, fm, 7, 1, 3, bias=False),
    nn.BatchNorm2d(fm), nn.ReLU(True),
    MaskedConv2d('B', fm, fm, 7, 1, 3, bias=False),
    nn.BatchNorm2d(fm), nn.ReLU(True),
    nn.Conv2d(fm, 256, 1))

The preceding code snippet is the complete PixelCNN model, wrapped inside a sequential unit. It consists of a stack of MaskedConv2d instances; MaskedConv2d inherits from torch.nn.Conv2d and accepts all the *args and **kwargs of Conv2d. Each convolution unit is followed by a batch norm layer and a ReLU layer, a combination known to work well with convolution layers. Instead of using a linear layer at the end, the authors use a normal two-dimensional convolution with a 1 x 1 kernel, which is reported to work better than a linear layer.
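In the snippet, the positional arguments 7, 1, 3 passed to each MaskedConv2d are the kernel size, stride, and padding of Conv2d. The sketch below applies the standard convolution output-size formula to show that this combination leaves the spatial size unchanged. The 28 x 28 input size is an assumption for illustration; the article does not state the image size.

```python
def conv_output_size(size, kernel_size, stride, padding):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel_size) // stride + 1

# Assumed input height (or width); kernel 7, stride 1, padding 3 as in the model
size = 28
for layer in range(8):  # the eight masked convolutions in the network
    size = conv_output_size(size, kernel_size=7, stride=1, padding=3)

# The final 1 x 1 convolution uses stride 1 and no padding
size = conv_output_size(size, kernel_size=1, stride=1, padding=0)
print(size)
```

Because the size never shrinks, the network can emit one 256-way prediction for every input pixel.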

Masked Convolution

Masked convolution is used in PixelCNN to prevent information flow from the future and current pixel to the generation task while training the network. This is essential because while generating the pixels, we don't have access to the future pixels or current pixel. However, there is one exception, which was described previously. The generation of the current green channel value can use the prediction of the red channel and the generation of the current blue channel can use the prediction of both the green and red channels.

Masking is done by zeroing out all the kernel positions that are not required. A mask tensor of the same size as the kernel is created, with 1 for the positions to keep and 0 for all the unnecessary ones. This mask tensor is then multiplied with the weight tensor before doing the convolution operation:

Figure 1.3: On the left is the mask and on the right is the context in PixelCNN

Since PixelCNN doesn't use pooling layers and deconvolution layers, the channel size should remain constant as the flow progresses. While mask A is solely responsible for preventing the network from learning the value from the current pixel, mask B keeps the channel size to three (RGB) and allows more flexibility in the network by letting the current pixel's value depend on its own value as well:

Figure 1.4: Mask A and mask B
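To make the two masks concrete, here is a pure-Python sketch that builds the spatial part of each mask as nested lists. The helper name build_mask is an assumption for illustration; its indexing mirrors the slicing used in the article's MaskedConv2d class, but for a single two-dimensional kernel.

```python
def build_mask(mask_type, kernel_height, kernel_width):
    """Build a 2D mask of 1s and 0s (hypothetical helper)."""
    assert mask_type in ("A", "B")
    mask = [[1] * kernel_width for _ in range(kernel_height)]
    center_row, center_col = kernel_height // 2, kernel_width // 2

    # In the center row, block everything to the right of the center;
    # type A also blocks the center itself, type B keeps it
    first_blocked = center_col + (1 if mask_type == "B" else 0)
    for col in range(first_blocked, kernel_width):
        mask[center_row][col] = 0

    # Block every row below the center row
    for row in range(center_row + 1, kernel_height):
        mask[row] = [0] * kernel_width
    return mask

mask_a = build_mask("A", 3, 3)
mask_b = build_mask("B", 3, 3)
for row_a, row_b in zip(mask_a, mask_b):
    print(row_a, row_b)
```

The only difference between the two results is the center entry: 0 for type A and 1 for type B.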

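import torch.nn as nn

class MaskedConv2d(nn.Conv2d):
    def __init__(self, mask_type, *args, **kwargs):
        super().__init__(*args, **kwargs)
        assert mask_type in ('A', 'B')
        # Register the mask as a buffer so it is stored in state_dict
        self.register_buffer('mask', self.weight.data.clone())
        _, _, kH, kW = self.weight.size()
        self.mask.fill_(1)
        # Center row: zero the positions right of the center
        # (type A also zeroes the center itself)
        self.mask[:, :, kH // 2, kW // 2 + (mask_type == 'B'):] = 0
        # Zero every row below the center row
        self.mask[:, :, kH // 2 + 1:] = 0

    def forward(self, x):
        # Apply the mask to the weights before the regular convolution
        self.weight.data *= self.mask
        return super().forward(x)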

The preceding class MaskedConv2d inherits from torch.nn.Conv2d instead of inheriting directly from torch.nn.Module. We normally inherit from torch.nn.Module to create a custom model class, but since we want to extend the Conv2d operation with a mask, we inherit from torch.nn.Conv2d, which in turn inherits from torch.nn.Module. The method register_buffer is one of the convenient APIs that PyTorch provides to add a tensor to the state_dict dictionary object, which in turn gets saved to disk along with the model when you save it.

The obvious way of adding a stateful variable, which can then be reused in the forward function, would be to add that as the object attribute:

self.mask = self.weight.data.clone()

But this would never be part of the state_dict and would never be saved to disk. With register_buffer, we make sure that the new tensor is part of state_dict. The mask tensor is then filled with 1s using the in-place fill_ operation, and the unwanted positions are set to 0 to get a tensor like the one in Figure 1.3. The figure shows only a two-dimensional tensor, whereas the actual weight tensor is four-dimensional (output channels, input channels, kernel height, kernel width). The forward function masks the weight tensor by multiplying it with the mask tensor. The multiplication keeps all the values at indices where the mask is 1 and zeroes out the values at indices where the mask is 0. Then a normal call to the parent Conv2d layer uses the masked weight tensor to do the two-dimensional convolution.
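The difference between a registered buffer and a plain attribute is easy to check. The two tiny modules below are hypothetical and exist only for this comparison; they assume PyTorch is installed.

```python
import torch
import torch.nn as nn

class WithBuffer(nn.Module):
    """Stores the mask through register_buffer."""
    def __init__(self):
        super().__init__()
        self.register_buffer('mask', torch.ones(3, 3))

class WithAttribute(nn.Module):
    """Stores the mask as a plain Python attribute."""
    def __init__(self):
        super().__init__()
        self.mask = torch.ones(3, 3)

buffer_keys = list(WithBuffer().state_dict().keys())
attribute_keys = list(WithAttribute().state_dict().keys())
print(buffer_keys)     # the buffer is listed
print(attribute_keys)  # the plain attribute is not
```

A registered buffer is also moved along with the module when it is transferred to another device, which a plain tensor attribute is not.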

The network ends in a softmax over the 256 possible values of a pixel, which discretizes the output of the network, whereas the previous state-of-the-art autoregressive models generated continuous values at the final layer. In the code, the last convolution produces 256 scores per pixel and the softmax is applied inside the cross-entropy loss:

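import torch.nn.functional as F
from torch import optim

# `tr` is assumed to be the training DataLoader; `net` is the model above
optimizer = optim.Adam(net.parameters())
for epoch in range(25):
    net.train(True)
    for input, _ in tr:
        # Turn pixel intensities in [0, 1] into integer class labels 0..255
        target = (input[:,0] * 255).long()
        out = net(input)
        loss = F.cross_entropy(out, target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()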

Training uses the Adam optimizer with its default hyperparameters (learning rate and momentum coefficients). The loss function comes from PyTorch's functional module, torch.nn.functional. Everything else is the same as in a normal training loop, apart from the creation of the target variable.

Until now, we have worked with supervised learning, where the labels are given explicitly. In this case, the target is derived from the input itself, since we are trying to recreate the same output. The torchvision transforms used when loading the data convert the pixel values from the 0 to 255 range to floating-point values between 0 and 1. We need to convert back to the 0 to 255 range, which is why the input is multiplied by 255, since we are using softmax at the final layer, and that generates a probability distribution over the values 0 to 255.
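Once the network is trained, generation has to proceed one pixel at a time, as noted earlier. The following is a minimal sketch of such a sampling loop, not the book's own code. The helper name generate is an assumption, and the stand-in model is an untrained, unmasked 1 x 1 convolution used only so the loop can run on its own; in practice you would pass the trained PixelCNN instead.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def generate(net, n_images, height, width):
    """Sample single-channel images pixel by pixel in raster order."""
    sample = torch.zeros(n_images, 1, height, width)
    net.eval()
    with torch.no_grad():
        for i in range(height):
            for j in range(width):
                logits = net(sample)                          # (N, 256, H, W)
                probs = F.softmax(logits[:, :, i, j], dim=1)  # (N, 256)
                value = torch.multinomial(probs, 1)           # (N, 1), in 0..255
                # Write the sampled intensity back, scaled to [0, 1]
                # to match the input range used during training
                sample[:, :, i, j] = value.float() / 255.0
    return sample

# Stand-in for a trained model (assumption, for illustration only)
stand_in = nn.Conv2d(1, 256, 1)
images = generate(stand_in, n_images=2, height=4, width=4)
print(images.shape)
```

The full network is evaluated once per pixel, which is the scalability cost of sequential generation mentioned earlier.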

In this article, we implemented PixelCNN, the foundation block of WaveNet, which is built on autoregressive convolutional neural networks (CNNs). PixelCNN uses masked convolutional layers to read the whole input in parallel during training, while generation remains sequential.

About the Authors

Sherin Thomas started his career as an information security expert and shifted his focus to deep learning-based security systems. He has helped several companies across the globe to set up their AI pipelines and worked recently for CoWrks, a fast-growing start-up based out of Bengaluru. Sherin is working on several open source projects including PyTorch, RedisAI, and many more, and is leading the development of TuringNetwork.ai.

Sudhanshu Passi is a technologist employed at CoWrks. Among other things, he has been the driving force behind everything related to machine learning at CoWrks. His expertise in simplifying complex concepts makes his work an ideal read for beginners and experts alike. This can be verified by his many blogs and this debut book publication. In his spare time, he can be found at his local swimming pool computing gradient descent underwater.