Posted 29 May 2019


# PixelCNN in Autoregressive Models

This article is an excerpt from the book PyTorch Deep Learning Hands-On by Sherin Thomas and Sudhanshu Passi. It has numerous examples and dynamic AI applications and demonstrates the simplicity and efficiency of the PyTorch approach to machine intelligence and deep learning.

## Introduction

PixelCNN was introduced by DeepMind as one of its three autoregressive models. Several iterations since the first PixelCNN have improved its speed and efficiency. In this article, we'll go through the basic PixelCNN.


PixelCNN generates one pixel at a time: it uses the first pixel to generate the second, then uses the first two pixels to generate the third, and so on.

Figure 1.1: Images generated from PixelCNN

PixelCNN is a probabilistic density model: it learns the density distribution of all images and generates images by sampling from that distribution. It does this by conditioning each generated pixel on all the previously generated pixels, so that the joint probability of an image is the product of these per-pixel conditional predictions.
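Written out, this is the standard autoregressive factorization. As a sketch, treat the image as a sequence of $n$ pixels $x_1, \dots, x_n$ in raster-scan order (row by row, left to right); the symbols are notation introduced here for illustration and do not appear in the book excerpt:

$$
p(\mathbf{x}) = \prod_{i=1}^{n} p(x_i \mid x_1, \dots, x_{i-1})
$$

Each factor is the distribution of one pixel given every pixel that comes before it, which is exactly what the network is trained to predict.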

PixelCNN uses convolutional layers as receptive fields, which improves the reading time of the input. Consider an image partially occluded by something; let's say we only have half of the image and our algorithm needs to generate the second half. PixelCNN reads the image in a single shot through the convolutional layers. However, generation in PixelCNN still has to be sequential. You might be wondering how only half of the image goes through the convolution; the answer is masked convolution, which is explained later.

Figure 1.2 shows how the convolution operation is applied to a set of pixels to predict the center pixel. The main advantage of autoregressive models over other models is that the joint probability is tractable and can be learned using gradient descent. There is no approximation and no workaround: we simply predict each pixel value given all the previous pixel values, and training is done entirely by backpropagation. However, autoregressive models struggle with scalability because generation is always sequential. PixelCNN is designed so that the joint probability of an image is the product of the individual probabilities of each pixel given all the previous pixels. In an RNN model, this conditioning is the default behavior; a CNN achieves it by using a cleverly designed mask.

PixelCNN captures the distribution of dependencies between pixels in its parameters, unlike other approaches. VAEs, for example, learn this distribution by generating a hidden latent vector, which introduces independence assumptions. In PixelCNN, the dependencies learned are not just between previous pixels but also between the different channels; in a normal color image, these are red, green, and blue (RGB).

Figure 1.2: Predicting pixel value from surrounding pixels

There is a fundamental problem: what if the CNN tries to use the current pixel or future pixels to predict the current pixel? The mask handles this as well, and it does so down to the level of individual channels. For instance, the current pixel's red channel doesn't learn from the current pixel at all, only from previous pixels. The green channel can use the current red channel and all the previous pixels. Similarly, the blue channel can use both the green and red channels of the current pixel, as well as all the previous pixels.
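The channel ordering described above can be written as a further factorization of each pixel's distribution. This is a sketch in notation introduced here for illustration: $\mathbf{x}_{<i}$ stands for all pixels before pixel $i$, and $x_{i,R}$, $x_{i,G}$, $x_{i,B}$ for the red, green, and blue values of pixel $i$:

$$
p(x_i \mid \mathbf{x}_{<i}) = p(x_{i,R} \mid \mathbf{x}_{<i}) \; p(x_{i,G} \mid \mathbf{x}_{<i}, x_{i,R}) \; p(x_{i,B} \mid \mathbf{x}_{<i}, x_{i,R}, x_{i,G})
$$

Red sees only previous pixels, green additionally sees the current red value, and blue additionally sees the current red and green values.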

The first layer must be prevented from seeing the current pixel at all. The later layers don't need this safeguard, although they still need to emulate sequential learning while doing the parallel convolution operation. So, the PixelCNN paper (https://arxiv.org/pdf/1606.05328.pdf) introduces two types of masks: type A and type B.

One major architectural difference that makes PixelCNN stand out from other traditional CNN models is the absence of the pooling layers. Since the aim of the PixelCNN is not capturing the essence of the image in a dimensionally reduced form and we cannot afford losing context through pooling, the authors deliberately removed the pooling layer.

Python
```python
import torch.nn as nn

fm = 64  # number of feature maps in every hidden layer

# MaskedConv2d is defined later in this article.
# Positional arguments after the channels: kernel_size=7, stride=1, padding=3
net = nn.Sequential(
    MaskedConv2d('A', 1, fm, 7, 1, 3, bias=False),
    nn.BatchNorm2d(fm), nn.ReLU(True),
    MaskedConv2d('B', fm, fm, 7, 1, 3, bias=False),
    nn.BatchNorm2d(fm), nn.ReLU(True),
    MaskedConv2d('B', fm, fm, 7, 1, 3, bias=False),
    nn.BatchNorm2d(fm), nn.ReLU(True),
    MaskedConv2d('B', fm, fm, 7, 1, 3, bias=False),
    nn.BatchNorm2d(fm), nn.ReLU(True),
    MaskedConv2d('B', fm, fm, 7, 1, 3, bias=False),
    nn.BatchNorm2d(fm), nn.ReLU(True),
    MaskedConv2d('B', fm, fm, 7, 1, 3, bias=False),
    nn.BatchNorm2d(fm), nn.ReLU(True),
    MaskedConv2d('B', fm, fm, 7, 1, 3, bias=False),
    nn.BatchNorm2d(fm), nn.ReLU(True),
    MaskedConv2d('B', fm, fm, 7, 1, 3, bias=False),
    nn.BatchNorm2d(fm), nn.ReLU(True),
    nn.Conv2d(fm, 256, 1))  # 1x1 convolution: 256 logits per pixel
```

The preceding code snippet is the complete PixelCNN model, wrapped inside a sequential unit. It consists of a series of `MaskedConv2d` instances; `MaskedConv2d` inherits from `torch.nn.Conv2d` and accepts all the `*args` and `**kwargs` of `Conv2d`. Each convolution unit is followed by a batch norm layer and a ReLU layer, a combination known to work well with convolution layers. Instead of using a linear layer as the final layer, the authors use a normal two-dimensional convolution with a 1×1 kernel, which produces 256 output values for every pixel and, according to the book, works better than a linear layer.
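The arguments `7, 1, 3` passed to each masked convolution are the kernel size $k$, stride $s$, and padding $p$ of `Conv2d`. As a quick worked check (using the output-size formula for a convolution with dilation 1, and $H$ as an illustrative symbol for the input height), these values leave the spatial size unchanged:

$$
H_{\text{out}} = \left\lfloor \frac{H + 2p - k}{s} \right\rfloor + 1 = \left\lfloor \frac{H + 2 \cdot 3 - 7}{1} \right\rfloor + 1 = H
$$

The same holds for the width, so every layer outputs one value per input pixel position, which is what per-pixel prediction needs.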

Masked convolution is used in PixelCNN to prevent information flow from the future and current pixel to the generation task while training the network. This is essential because while generating the pixels, we don't have access to the future pixels or current pixel. However, there is one exception, which was described previously. The generation of the current green channel value can use the prediction of the red channel and the generation of the current blue channel can use the prediction of both the green and red channels.
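The generation procedure that this masking protects can be sketched as follows. This is a minimal illustration, not code from the book excerpt: the function name `generate`, the image size, and running on the CPU are all assumptions, and the network is expected to map `(N, 1, H, W)` inputs in the 0 to 1 range to `(N, 256, H, W)` logits.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


@torch.no_grad()
def generate(net, n_images=4, height=28, width=28):
    """Sample images one pixel at a time (hypothetical helper, CPU only)."""
    net.eval()
    sample = torch.zeros(n_images, 1, height, width)
    for i in range(height):
        for j in range(width):
            logits = net(sample)                          # full forward pass per pixel
            probs = F.softmax(logits[:, :, i, j], dim=1)  # (N, 256) distribution at (i, j)
            level = torch.multinomial(probs, 1)           # (N, 1) sampled level in 0..255
            sample[:, :, i, j] = level.float() / 255.0    # write back in the 0..1 range
    return sample


# Untrained 1x1 convolution used only as a stand-in to exercise the loop.
images = generate(nn.Conv2d(1, 256, kernel_size=1), n_images=2, height=4, width=4)
print(images.shape)
```

Every pixel needs its own forward pass through the whole network, which is why generation is slow even though training is parallel.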

Masking is done by zeroing out the kernel weights that correspond to pixels the network must not see. A mask tensor of the same size as the kernel is created, with 1 at the permitted positions and 0 everywhere else. This mask tensor is then multiplied with the weight tensor before the convolution operation:

Figure 1.3: On the left is the mask and on the right is the context in PixelCNN
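As a rough illustration of the mask in Figure 1.3, the following dependency-free sketch builds the two-dimensional mask for a small kernel using plain Python lists. The helper name `build_mask` is hypothetical and not part of the article's code; it zeroes every position that comes after the center in raster order, and keeps the center itself only for type B.

```python
def build_mask(mask_type, k):
    """Return a k x k list of 0/1 values for a PixelCNN-style mask (hypothetical helper)."""
    assert mask_type in {"A", "B"} and k % 2 == 1
    centre = k // 2
    mask = [[1] * k for _ in range(k)]
    # Centre row: zero from the centre (type A) or from just right of it (type B).
    start = centre + (1 if mask_type == "B" else 0)
    for col in range(start, k):
        mask[centre][col] = 0
    # Every row below the centre lies in the "future" and is zeroed entirely.
    for row in range(centre + 1, k):
        mask[row] = [0] * k
    return mask


for name in ("A", "B"):
    print("type", name)
    for row in build_mask(name, 3):
        print(row)
```

For a 3×3 kernel, type A yields `[1, 1, 1]`, `[1, 0, 0]`, `[0, 0, 0]`, and type B differs only in keeping the center: `[1, 1, 1]`, `[1, 1, 0]`, `[0, 0, 0]`.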

Since PixelCNN uses neither pooling layers nor deconvolution layers, the spatial size of the feature maps stays constant as data flows through the network. Mask A is solely responsible for preventing the network from learning the value of the current pixel. Mask B, used in the later layers, allows more flexibility by letting the value at the current position depend on its own value from the previous layer as well:

Python
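```python
class MaskedConv2d(nn.Conv2d):
    def __init__(self, mask_type, *args, **kwargs):
        super().__init__(*args, **kwargs)
        assert mask_type in {'A', 'B'}
        # Buffer: stored in state_dict but not trained
        self.register_buffer('mask', self.weight.data.clone())
        _, _, kH, kW = self.weight.size()
        self.mask.fill_(1)
        # Centre row: zero from the centre (A) or just right of it (B)
        self.mask[:, :, kH // 2, kW // 2 + (mask_type == 'B'):] = 0
        # Zero every row below the centre
        self.mask[:, :, kH // 2 + 1:] = 0

    def forward(self, x):
        self.weight.data *= self.mask
        return super().forward(x)
```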

The preceding class `MaskedConv2d` inherits from `torch.nn.Conv2d` rather than directly from `torch.nn.Module`. We normally inherit from `torch.nn.Module` to create a custom model class, but since we are extending the `Conv2d` operation with a mask, we inherit from `torch.nn.Conv2d`, which in turn inherits from `torch.nn.Module`. The `register_buffer` method is a convenient API that PyTorch provides for adding a tensor to the `state_dict` dictionary, which is saved to disk along with the model when you save the model.

The obvious way of adding a stateful variable, which can then be reused in the forward function, would be to add that as the object attribute:

Python
`self.mask = self.weight.data.clone()`

But this would never be part of the `state_dict` and would never be saved to disk. With `register_buffer`, we can make sure that the new tensor we have created will be part of `state_dict`. The mask tensor is first filled with 1s using the in-place `fill_` operation, and then 0s are assigned to the masked positions to get a tensor like the one in Figure 1.3. The figure shows only a two-dimensional tensor, whereas the actual weight tensor is four-dimensional (output channels, input channels, kernel height, kernel width). The forward function masks the weight tensor by multiplying it with the mask tensor. The multiplication keeps every weight at an index where the mask has 1 and zeroes every weight at an index where the mask has 0. A normal call to the parent `Conv2d` layer then uses the masked weight tensor to do the two-dimensional convolution.
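To make these two points concrete, the following sketch restates the masked convolution in a self-contained form and checks that the registered mask appears in `state_dict`, and that with a type A mask the output at a position ignores the current pixel and everything after it. The tensor sizes, the seed, and the use of `torch.ones_like` to create the mask are choices made for this illustration rather than the book's code.

```python
import torch
import torch.nn as nn


class MaskedConv2d(nn.Conv2d):
    """Masked convolution restated for a self-contained check."""

    def __init__(self, mask_type, *args, **kwargs):
        super().__init__(*args, **kwargs)
        assert mask_type in {"A", "B"}
        self.register_buffer("mask", torch.ones_like(self.weight))
        _, _, kH, kW = self.weight.size()
        self.mask[:, :, kH // 2, kW // 2 + (mask_type == "B"):] = 0
        self.mask[:, :, kH // 2 + 1:] = 0

    def forward(self, x):
        self.weight.data *= self.mask
        return super().forward(x)


torch.manual_seed(0)
layer = MaskedConv2d("A", 1, 1, 7, 1, 3, bias=False)

# The buffer is saved with the model, unlike a plain tensor attribute.
print(sorted(layer.state_dict().keys()))

x = torch.rand(1, 1, 9, 9)
x_changed = x.clone()
x_changed[0, 0, 4, 4:] += 1.0   # perturb the current pixel and those to its right
x_changed[0, 0, 5:, :] += 1.0   # perturb every row below

with torch.no_grad():
    y = layer(x)
    y_changed = layer(x_changed)

# The prediction at (4, 4) only depends on pixels above and to the left.
unchanged = torch.allclose(y[0, 0, 4, 4], y_changed[0, 0, 4, 4])
print(unchanged)
```

Swapping the mask type to `"B"` in this check would make the output at (4, 4) depend on the input at (4, 4), which is why type B is only safe after a type A layer has already removed the current pixel.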

The final layer of the network outputs 256 values for each pixel, and a softmax over them gives the probability of each of the 256 possible pixel values. This discretizes the output of the network, whereas the previous state-of-the-art autoregressive models generated continuous values at the final layer:

Python
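```python
import torch.nn.functional as F
from torch import optim

# `net` is the model defined above; `tr` is assumed to be the training
# data loader yielding (image, label) batches.
optimizer = optim.Adam(net.parameters())
for epoch in range(25):
    net.train(True)
    for input, _ in tr:
        # Targets are the integer pixel levels (0-255) of the single channel
        target = (input[:, 0] * 255).long()
        out = net(input)                     # (N, 256, H, W) logits
        loss = F.cross_entropy(out, target)
        optimizer.zero_grad()                # clear gradients from the last step
        loss.backward()
        optimizer.step()
```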

Training uses the Adam optimizer with its default hyperparameters. The loss function comes from PyTorch's functional module (`torch.nn.functional`). Everything else is the same as in a normal training loop, apart from the creation of the target variable.

Until now, we have worked with supervised learning, where the labels are given explicitly. In this case, the target is derived from the input itself, since we are trying to recreate the input. The torchvision transformation applied when loading the data converts the pixel values from the 0 to 255 range to the 0 to 1 range, which is why the code multiplies by 255. We need to convert back to the 0 to 255 range because the softmax at the final layer produces a probability distribution over the integer values 0 to 255.
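The target creation and the shapes expected by the loss can be checked in isolation. This sketch uses random stand-in tensors instead of real data and a real network; the batch size and image size are arbitrary assumptions. It also rounds before casting, an added safeguard against floating-point error that the training code above does not use, since it truncates with `.long()` directly.

```python
import torch
import torch.nn.functional as F

# Stand-in batch: 2 grayscale 4x4 images, stored as floats in the 0..1 range.
levels = torch.randint(0, 256, (2, 1, 4, 4))
images = levels.float() / 255.0

# Recover integer class labels 0..255 from the single channel.
target = (images[:, 0] * 255).round().long()   # shape (2, 4, 4)

# Stand-in for the network output: 256 logits per pixel.
logits = torch.randn(2, 256, 4, 4)

# cross_entropy applies log-softmax over dim 1 and averages over all pixels.
loss = F.cross_entropy(logits, target)
print(target.shape, loss.item())
```

Note that `cross_entropy` takes raw logits, so the network itself does not need an explicit softmax layer during training.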

In this article, we implemented the foundation block of WaveNet called PixelCNN, which is built on autoregressive convolutional neural networks (CNNs). PixelCNN uses convolutional layers as receptive fields, which improves the reading time of the input.

Sherin Thomas started his career as an information security expert and shifted his focus to deep learning-based security systems. He has helped several companies across the globe to set up their AI pipelines and worked recently for CoWrks, a fast-growing start-up based out of Bengaluru. Sherin is working on several open source projects including PyTorch, RedisAI, and many more, and is leading the development of TuringNetwork.ai.

Sudhanshu Passi is a technologist employed at CoWrks. Among other things, he has been the driving force behind everything related to machine learning at CoWrks. His expertise in simplifying complex concepts makes his work an ideal read for beginners and experts alike. This can be verified by his many blogs and this debut book publication. In his spare time, he can be found at his local swimming pool computing gradient descent underwater.

Founded in 2004 in Birmingham, UK, Packt's mission is to help the world put software to work in new ways, through the delivery of effective learning and information services to IT professionals.

Working towards that vision, we have published over 5000 books and videos so far, providing IT professionals with the actionable knowledge they need to get the job done - whether that's specific learning on an emerging technology or optimizing key skills in more established tools.