Autoencoders in Computer Vision

An autoencoder is a type of artificial neural network used to learn data encodings in an unsupervised manner. It learns a function that transforms each input into a compact, meaningful representation from which the input can be reconstructed.

What is an Autoencoder?

Autoencoders are made up of two primary parts: an encoder and a decoder. The encoder takes an input sample (for example, a picture of a cat) and turns it into a vector, which is effectively a list of numbers. This vector is known as a latent vector.

The decoder then takes this latent vector and expands it in order to recreate the input sample.

When we employ autoencoders, we are often less concerned with the output itself than with the vector created in the middle (the latent vector).

This vector matters because it is a compact representation of the input picture, and we can use it for numerous tasks, such as rebuilding the original image.

The latent vector may be thought of as a code for the image, which is where the terms encode and decode come from.

Structure of a Computer Vision Autoencoder

Autoencoder architecture. (Source: https://medium.com/@birla.deepak26/autoencoders-76bb49ae6a8f)

Autoencoders consist of three elements:

Encoder: a module that compresses the input data into an encoded representation that is typically much smaller than the input. The encoder is made up of a series of convolutional blocks followed by pooling modules, or of basic linear layers, that compress the model's input into a compact section known as the "bottleneck" or "code."
Bottleneck/Code: as you may have guessed, the code is the most crucial and, perhaps surprisingly, the smallest component of the network. The code exists to limit the flow of information from the encoder to the decoder, allowing only the most important information to pass through. It gives us a compressed knowledge representation of the input. Because it is compact, the code discourages the network from memorising and overfitting the data: the smaller the code, the lower the risk of overfitting, although a code that is too small loses information needed for reconstruction. This layer is often a linear layer, or a tensor if convolutions are used.
Decoder: the decoder serves as an interpreter for the code. It helps the network "decompress" the knowledge representation and reassemble the data from its compressed form. The output is then compared to the ground truth. This is commonly done with transposed convolutions when working with pictures, or with linear layers.
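
To make the three elements concrete, here is a minimal NumPy sketch of a single forward pass through an encoder, a bottleneck and a decoder. The layer sizes, the tanh activation and the names (W_enc, W_dec, encode, decode) are assumptions chosen for illustration, and the weights are random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)

input_dim, code_dim = 64, 8   # hypothetical sizes: an 8x8 image flattened, 8-dim code

# Randomly initialised (untrained) weights, for illustration only.
W_enc = rng.normal(0.0, 0.1, size=(input_dim, code_dim))
b_enc = np.zeros(code_dim)
W_dec = rng.normal(0.0, 0.1, size=(code_dim, input_dim))
b_dec = np.zeros(input_dim)

def encode(x):
    """Encoder: compress the input into the bottleneck code."""
    return np.tanh(x @ W_enc + b_enc)

def decode(z):
    """Decoder: expand the code back to the input's shape."""
    return z @ W_dec + b_dec

def reconstruction_loss(x, x_hat):
    """Mean squared error between the input and its reconstruction."""
    return float(np.mean((x - x_hat) ** 2))

x = rng.random((4, input_dim))      # batch of 4 fake "images" with values in [0, 1)
z = encode(x)                       # latent vectors, shape (4, 8)
x_hat = decode(z)                   # reconstructions, shape (4, 64)
loss = reconstruction_loss(x, x_hat)
```

Training would adjust the weights to lower this loss. A real computer vision model would likely use convolutional layers in a deep learning framework instead of these two matrices.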

Types of Autoencoders for Computer Vision

The structure described above is a high-level overview. There are several varieties of autoencoders; here are the main ones to be aware of:

Variational AutoEncoders (VAE)

Traditional autoencoders are poorly suited to generating new data. Why is that?

During training, we give the model the input picture and ask it to learn the encoder and decoder parameters needed to reconstruct the image.

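To generate a picture we only need the decoder, because it is the part that creates the picture, and the decoder needs a latent vector as input. However, we have no idea which vectors correspond to realistic pictures.

If we merely supply random values as input, we will almost certainly get a picture that does not look coherent. A Variational Autoencoder addresses this by having the encoder output the parameters of a probability distribution over latent vectors, so that new latent vectors can be sampled from that distribution and decoded.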

Variational Autoencoder architecture (VAE). (Source: https://towardsdatascience.com/difference-between-autoencoder-ae-and-variational-autoencoder-vae-ed7be1c038f2)
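
One common way to make the latent space something we can sample from is for the encoder to output a mean and a log-variance per latent dimension instead of a single vector. The sketch below shows that sampling step and the usual KL regulariser towards a standard normal prior. The function names and the example values of mu and log_var are made up for illustration, and no encoder or decoder network is included.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_latent(mu, log_var, rng):
    """Reparameterisation trick: z = mu + sigma * eps, with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, sigma^2) || N(0, I)), summed over the latent dimensions."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=-1)

# Hypothetical encoder outputs for one image with a 3-dimensional latent space.
mu = np.array([[0.5, -1.0, 0.0]])
log_var = np.array([[0.0, -0.5, 0.2]])

z = sample_latent(mu, log_var, rng)        # latent vector used during training
kl = kl_to_standard_normal(mu, log_var)    # added to the reconstruction loss

# At generation time no encoder is needed: draw z from the prior and decode it.
z_new = rng.standard_normal((5, 3))
```

Because the KL term pulls the encoded distributions towards N(0, I), vectors drawn from N(0, I) at generation time are more likely to land in regions the decoder has learned to handle.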

Undercomplete Autoencoders

An Undercomplete Autoencoder takes a picture as input and attempts to predict the same image as output, reconstructing it from the compressed code. Undercomplete Autoencoders are unsupervised: they need no labels, because the target is the same as the input. The principal use of this sort of autoencoder is dimensionality reduction.

Why should autoencoders be used for dimensionality reduction rather than alternative approaches such as Principal Component Analysis (PCA)? PCA can only capture linear relationships. Undercomplete Autoencoders can learn non-linear relationships and so can outperform PCA in dimensionality reduction when the data has non-linear structure.

If we remove all non-linear activations from an Undercomplete Autoencoder and use only linear layers, we effectively reduce the autoencoder to something that performs on par with PCA.

A graphical illustration of the difference between Autoencoder and PCA.
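
The link between a purely linear autoencoder and PCA can be sketched directly: PCA itself can be written as a linear "encoder" followed by a linear "decoder". The toy data, the choice of k = 3 and all variable names below are assumptions for illustration, and the weights come from an SVD rather than from gradient training.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 samples in 10 dimensions that mostly vary along 3 directions.
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 10))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 10))
X = X - X.mean(axis=0)              # PCA assumes centred data

k = 3
# Principal directions are the top right-singular vectors of X.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
W = Vt[:k].T                        # shape (10, k)

code = X @ W                        # "encoder": a single linear layer
X_hat = code @ W.T                  # "decoder": the transposed linear layer
pca_error = np.mean((X - X_hat) ** 2)

# Another k-dimensional orthogonal projection reconstructs no better than PCA.
Q, _ = np.linalg.qr(rng.normal(size=(10, k)))
random_error = np.mean((X - X @ Q @ Q.T) ** 2)
```

A linear autoencoder trained with squared error can at best match pca_error, which is why the non-linear activations are what let an autoencoder go beyond PCA.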

Denoising Autoencoders

Denoising Autoencoders are used to eliminate noise from images.

How does this autoencoder accomplish this? Instead of giving the original image to the network, we make a copy of it, add artificial noise to the copy, and feed this noisy version to the network. The output is then compared to the original, clean image.

In this manner, the autoencoder removes noise by learning a representation of the input from which the noise can be readily filtered out. The structure of a Denoising Autoencoder is shown below.

Denoising Autoencoder architecture. (Source: https://www.jeremyjordan.me/autoencoders/)
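
The key detail is that the network sees the noisy copy while the loss is computed against the clean original. Here is a small sketch of that setup, assuming Gaussian noise, images scaled to [0, 1] and a placeholder model. The function names and the identity stand-in are illustrative, not part of any library.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_gaussian_noise(images, std, rng):
    """Return a corrupted copy of `images`, clipped back to the [0, 1] range."""
    noisy = images + rng.normal(0.0, std, size=images.shape)
    return np.clip(noisy, 0.0, 1.0)

def denoising_loss(model, clean, std, rng):
    """Feed the noisy copy to the model, but score the output against the clean image."""
    noisy = add_gaussian_noise(clean, std, rng)
    reconstruction = model(noisy)
    return float(np.mean((reconstruction - clean) ** 2))

clean = rng.random((4, 28, 28))     # fake batch of grayscale images in [0, 1)

def identity(x):
    """Stand-in for an untrained autoencoder: returns its input unchanged."""
    return x

loss = denoising_loss(identity, clean, std=0.2, rng=rng)
```

With the identity stand-in the loss simply measures the injected noise. A trained denoising autoencoder would be expected to push it lower.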

Another use for Denoising Autoencoders is the removal of watermarks from images. An example is shown below.

Sparse Autoencoders

To prevent overfitting, this autoencoder explicitly penalises the activation of hidden nodes. How does it accomplish this? We build a loss function that penalises activations inside a layer.

For any given observation, we encourage the network to learn an encoding and decoding that relies on only a small number of active neurons. This is comparable to dropout.

This lets individual hidden nodes become sensitive to specific characteristics of the incoming data. Unlike an Undercomplete Autoencoder, which uses the full network for every observation, a Sparse Autoencoder is compelled to selectively activate parts of the network depending on the input data.

As a consequence, we reduce the network's ability to memorise the input data without reducing its ability to extract features from the data.
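
One simple way to penalise activations is to add an L1 term on the hidden layer to the reconstruction loss. A penalty based on KL divergence is another common choice. The sketch below assumes the L1 variant; the function name, the sparsity_weight value and the toy arrays are all illustrative.

```python
import numpy as np

def sparse_autoencoder_loss(x, x_hat, hidden, sparsity_weight=1e-3):
    """Reconstruction error plus an L1 penalty on the hidden activations."""
    reconstruction = np.mean((x - x_hat) ** 2)
    l1_penalty = np.mean(np.sum(np.abs(hidden), axis=1))
    return float(reconstruction + sparsity_weight * l1_penalty)

x = np.array([[1.0, 0.0, 1.0, 0.0]])
x_hat = np.array([[0.9, 0.1, 0.8, 0.0]])

# Two hypothetical hidden codes that lead to the same reconstruction.
dense_code = np.array([[0.5, 0.5, 0.5, 0.5, 0.5, 0.5]])    # every neuron active
sparse_code = np.array([[1.5, 0.0, 0.0, 0.0, 0.0, 0.0]])   # a single active neuron

dense_loss = sparse_autoencoder_loss(x, x_hat, dense_code)
sparse_loss = sparse_autoencoder_loss(x, x_hat, sparse_code)
```

For the same reconstruction quality, the code with fewer active neurons gets the lower loss, which is the pressure that produces sparse encodings.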

Contractive Autoencoders

One would anticipate the learnt encoding to be quite comparable for relatively similar inputs. This, however, is not always the case.

So, how can we make sure that the encoding of a neighbouring input does not differ significantly from the encoding of the input itself? We can look at the derivatives of the hidden layer's activations with respect to the input and require that they be small.

By doing so, we try to ensure that small changes in the input leave the encoded state almost unchanged. This is accomplished by adding a penalty term to the loss function.
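
For a single sigmoid hidden layer, this penalty (often called a contractive penalty) has a simple closed form: the squared Frobenius norm of the Jacobian of the hidden activations with respect to the input. The sketch below assumes that one-layer sigmoid encoder. The shapes and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def encode(x, W, b):
    """Single sigmoid hidden layer: h = sigmoid(W x + b)."""
    return sigmoid(W @ x + b)

def contractive_penalty(x, W, b):
    """Squared Frobenius norm of the Jacobian dh/dx for a sigmoid layer."""
    h = encode(x, W, b)
    # dh_j/dx_i = h_j * (1 - h_j) * W[j, i], so the squared entries
    # sum to (h_j * (1 - h_j))^2 * sum_i W[j, i]^2 for each hidden unit j.
    return float(np.sum((h * (1.0 - h)) ** 2 * np.sum(W ** 2, axis=1)))

W = rng.normal(size=(3, 5))     # 5 inputs, 3 hidden units (hypothetical sizes)
b = np.zeros(3)
x = rng.normal(size=5)

penalty = contractive_penalty(x, W, b)
# Full loss would be: reconstruction_error + weight * penalty
```

The penalty is added to the reconstruction loss with a weighting factor, so the network trades reconstruction accuracy against insensitivity to small input changes.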

Vector Quantised-Variational AutoEncoder (VQ-VAE)

Normal VAEs learn a continuous latent representation, in which any value can be decoded back into an image. Given such a large set of possibilities, the model can have a difficult time learning a good representation.

Instead, we want to limit the number of potential latent vectors so that the model can focus on them; in other words, we want a discrete latent representation.

Vector quantization is used to accomplish this. The idea is to employ a codebook, a learnt collection of vectors from which latent vectors are drawn. In practice, we first use the encoder to encode an image, which yields many latent vectors.

Then we replace each latent vector with the nearest vector in the codebook. Producing more than one latent vector per image lets us represent an image as a combination of codebook vectors, so we search a much larger, yet still discrete, space.

VQ-Variational Autoencoder architecture. (Source: https://arxiv.org/pdf/1711.00937v2.pdf)
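
The codebook lookup itself is a nearest-neighbour search. The sketch below shows only that quantization step, assuming Euclidean distance and a random, untrained codebook. The sizes and the quantize helper are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize(latents, codebook):
    """Replace each latent vector with its nearest codebook entry (Euclidean distance)."""
    # distances[i, j] = squared distance between latent i and codebook vector j
    diffs = latents[:, None, :] - codebook[None, :, :]
    distances = np.sum(diffs ** 2, axis=-1)
    indices = np.argmin(distances, axis=1)
    return codebook[indices], indices

codebook = rng.normal(size=(16, 4))   # 16 code vectors of dimension 4 (hypothetical sizes)
latents = rng.normal(size=(9, 4))     # e.g. a 3x3 grid of encoder outputs, flattened

quantized, indices = quantize(latents, codebook)
# `indices` is the discrete representation; `quantized` is what the decoder receives.
```

Training also has to learn the codebook and pass gradients through the non-differentiable argmin (the VQ-VAE paper uses a straight-through estimator), which this sketch leaves out.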

Applications for Autoencoders in Computer Vision

Autoencoders in Segmentation Tasks

With autoencoders, we generally start with an input image and rebuild it using the autoencoder architecture. Instead of using the input itself as the target, how about using a segmentation mask as the expected output? We would be able to perform tasks such as semantic segmentation in this manner. This is the idea behind U-Net, an encoder-decoder architecture closely related to autoencoders (with added skip connections between encoder and decoder) that is commonly employed for semantic segmentation of medical images.

Anomaly Detection

Autoencoders are an effective tool for detecting anomalies. By encoding and decoding normal data, you learn how well that data can typically be reconstructed. If the autoencoder is then given atypical data, something the model has never seen before, the reconstruction error will be significantly larger.

For example, a model that does a good job of compressing and reconstructing cat images might perform poorly when confronted with a giraffe image. Even with complicated, high-dimensional datasets, reconstruction error is often a useful indicator of anomalies.
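
The thresholding logic can be sketched without training a network. In the example below a PCA projection stands in for a trained autoencoder, and the synthetic data, the 99th-percentile threshold and all names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Normal" data lies close to a 2-dimensional subspace of a 10-dimensional space.
basis = rng.normal(size=(2, 10))
normal_train = rng.normal(size=(500, 2)) @ basis + 0.05 * rng.normal(size=(500, 10))
mean = normal_train.mean(axis=0)

# Stand-in "autoencoder": project onto the top-2 principal directions and back.
_, _, Vt = np.linalg.svd(normal_train - mean, full_matrices=False)
W = Vt[:2].T

def reconstruction_error(x):
    """Per-sample mean squared error after encoding and decoding."""
    centred = x - mean
    x_hat = centred @ W @ W.T
    return np.mean((centred - x_hat) ** 2, axis=1)

# Threshold: the 99th percentile of the errors seen on normal training data.
threshold = np.percentile(reconstruction_error(normal_train), 99)

normal_test = rng.normal(size=(5, 2)) @ basis + 0.05 * rng.normal(size=(5, 10))
anomalies = 3.0 * rng.normal(size=(5, 10))     # points far from the learned subspace

is_anomaly = reconstruction_error(anomalies) > threshold
```

The percentile controls the trade-off between false alarms and missed anomalies, and in practice it would be tuned on held-out data.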

Super-resolution

The term "super-resolution" refers to the process of increasing the resolution of a low-quality image. In principle, we could simply upsample the image and use bilinear interpolation to fill in the extra pixel values, but the resulting image would be blurry, since interpolation cannot add information to the image. To address this, we can train an autoencoder to predict the pixel values of a high-resolution image from a low-resolution input.
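
Training such a model needs pairs of low-resolution inputs and high-resolution targets, which are often made by downsampling existing images. The sketch below builds one such pair and measures the error of a naive upsampling baseline. It uses average pooling and nearest-neighbour upsampling as simple stand-ins (not bilinear interpolation), and the helper names and image size are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def downsample(image, factor):
    """Average-pool a 2-D image by an integer factor."""
    h, w = image.shape
    return image.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

def upsample_nearest(image, factor):
    """Naive upsampling: repeat every pixel factor x factor times."""
    return np.repeat(np.repeat(image, factor, axis=0), factor, axis=1)

high_res = rng.random((8, 8))            # target the network should predict
low_res = downsample(high_res, 2)        # input the network receives, shape (4, 4)

# Baseline: naive upsampling cannot restore the detail lost by downsampling.
baseline = upsample_nearest(low_res, 2)
baseline_error = float(np.mean((baseline - high_res) ** 2))
```

A super-resolution network would be trained to map low_res to high_res, and it is only useful if its error falls below this kind of interpolation baseline.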

Generating New Data (Data Synthesis)

Variational Autoencoders can generate both image and time series data. The parameterized distribution at the autoencoder's code can be randomly sampled to produce latent vectors, which are then passed to the decoder to generate image data. Time series data, such as music, may also be modelled using Variational Autoencoders.