Recently, we showed how to generate images using generative adversarial networks (GANs). GANs may yield amazing results, but the contract there basically is: what you see is what you get.
Sometimes this may be all we want. In other cases, we may be more interested in actually modelling a domain. We don't just want to generate realistic-looking samples – we want our samples to be located at specific coordinates in domain space.
For example, imagine our domain to be the space of facial expressions. Then our latent space could be conceived as two-dimensional: in line with underlying emotional states, expressions vary on a positive-negative scale. At the same time, they vary in intensity. Now if we trained a VAE on a set of facial expressions adequately covering the ranges, and it did in fact "uncover" our hypothesized dimensions, we could then use it to generate previously nonexistent incarnations of points (faces, that is) in latent space.
Variational autoencoders are similar to probabilistic graphical models in that they assume a latent space that is responsible for the observations, but unobservable. They are similar to plain autoencoders in that they compress, and then decompress again, the input space. In contrast to plain autoencoders though, the crucial point here is to devise a loss function that allows one to obtain informative representations in latent space.
In a nutshell
In standard VAEs (Kingma and Welling 2013), the objective is to maximize the evidence lower bound (ELBO):
\[ ELBO = E[\log p(x|z)] - KL(q(z) || p(z)) \]
In plain words, and expressed in terms of how we use it in practice, the first component is the reconstruction loss we also see in plain (non-variational) autoencoders. The second is the Kullback-Leibler divergence between a prior imposed on the latent space (typically, a standard normal distribution) and the representation of latent space as learned from the data.
A major criticism regarding the standard VAE loss is that it results in uninformative latent space. Alternatives include \(\beta\)-VAE (Burgess et al. 2018), Info-VAE (Zhao, Song, and Ermon 2017), and more. The MMD-VAE (Zhao, Song, and Ermon 2017) implemented below is a subtype of Info-VAE that, instead of making each individual representation in latent space as similar as possible to the prior, coerces the respective distributions to be as close as possible. Here MMD stands for maximum mean discrepancy, a similarity measure for distributions based on matching their respective moments. We explain this in more detail below.
Our objective today
In this post, we will first implement a standard VAE that strives to maximize the ELBO. Then, we compare its performance to that of an Info-VAE using the MMD loss.
Our focus will be on inspecting the latent spaces and seeing if, and how, they differ as a consequence of the optimization criteria used.
The domain we are going to model will be glamorous (fashion!), but for the sake of manageability, confined to size 28 x 28: we will compress and reconstruct images from the Fashion MNIST dataset, which has been developed as a drop-in replacement for MNIST.
A standard variational autoencoder
Seeing we haven't used TensorFlow eager execution for some weeks, we'll do the model in an eager way.
If you're new to eager execution, don't worry: as with every new technique, it takes some getting used to, but you'll quickly find that many tasks are made easier if you use it. A simple yet complete, template-like example is available as part of the Keras documentation.
Setup and data preparation
As usual, we start by making sure we are using the TensorFlow implementation of Keras and enabling eager execution. Besides tensorflow and keras, we also load tfdatasets for use in data streaming.
By the way: there is no need to copy-paste any of the code snippets below. The two approaches are available among our Keras examples, namely as eager_cvae.R and mmd_cvae.R.
The data comes conveniently with keras; all we need to do is the usual normalization and reshaping.
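What follows is a minimal sketch of what that preparation could look like; the variable names train_x and test_x and the use of array_reshape are illustrative assumptions, the post's actual code being in eager_cvae.R:

library(keras)
library(tensorflow)
library(tfdatasets)

use_implementation("tensorflow")
tfe_enable_eager_execution(device_policy = "silent")

fashion <- dataset_fashion_mnist()
c(train_images, train_labels) %<-% fashion$train
c(test_images, test_labels) %<-% fashion$test

# scale pixel values to [0, 1] and add a channels dimension
train_x <- array_reshape(train_images / 255, dim = c(dim(train_images)[1], 28, 28, 1))
test_x <- array_reshape(test_images / 255, dim = c(dim(test_images)[1], 28, 28, 1))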
What do we need the test set for, given we are going to train an unsupervised (a better term being: semi-supervised) model? We will use it to see how (previously unknown) data points cluster together in latent space.
Now prepare for streaming the data to keras:
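Again a sketch, this time of the streaming setup; the batch size of 100 is an assumption (batch_size and batches_per_epoch are referred to later, in the training loop):

buffer_size <- 60000
batch_size <- 100
batches_per_epoch <- buffer_size / batch_size

train_dataset <- tensor_slices_dataset(train_x) %>%
  dataset_shuffle(buffer_size) %>%
  dataset_batch(batch_size)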
Next up is defining the model.
Encoder-decoder model
The model really is two models: the encoder and the decoder. As we will see shortly, in the standard version of the VAE there is a third component in between, performing the so-called reparameterization trick.
The encoder is a custom model, comprised of two convolutional layers and a dense layer. It returns the output of the dense layer split into two parts, one storing the mean of the latent variables, the other their variance.
latent_dim <- 2
encoder_model <- function(name = NULL)
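  # The body of this custom model is not reproduced in this extract; the following
  # is a minimal reconstruction based on the description above. Filter counts
  # (32, 64), kernel size 3 and strides 2 are assumptions mirroring the decoder below.
  keras_model_custom(name = name, function(self) {
    self$conv1 <- layer_conv_2d(
      filters = 32,
      kernel_size = 3,
      strides = 2,
      activation = "relu"
    )
    self$conv2 <- layer_conv_2d(
      filters = 64,
      kernel_size = 3,
      strides = 2,
      activation = "relu"
    )
    self$flatten <- layer_flatten()
    self$dense <- layer_dense(units = 2 * latent_dim)

    function(x, mask = NULL) {
      x %>%
        self$conv1() %>%
        self$conv2() %>%
        self$flatten() %>%
        self$dense() %>%
        # split the dense output into mean and log-variance, half the units each
        tf$split(num_or_size_splits = 2L, axis = 1L)
    }
  })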
We choose the latent space to be of dimension 2 – just because that makes visualization easy.
With more complex data, you will probably benefit from choosing a higher dimensionality here.
So the encoder compresses real data into estimates of mean and variance of the latent space.
We then "indirectly" sample from this distribution (the so-called reparameterization trick):
reparameterize <- function(mean, logvar)
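  # The body is not reproduced in this extract; this is a sketch of the usual
  # reparameterization: sample standard-normal noise and shift/scale it by the
  # mean and (log-)variance produced by the encoder. The tf$float64 dtype
  # follows the loss code below.
  {
    eps <- k_random_normal(shape = k_shape(mean), dtype = tf$float64)
    eps * k_exp(logvar * 0.5) + mean
  }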
The sampled values will serve as input to the decoder, which will attempt to map them back to the original space.
The decoder is basically a sequence of transposed convolutions, upsampling until we reach a resolution of 28x28.
decoder_model <- function(name = NULL) {
keras_model_custom(name = name, function(self) {
self$dense <- layer_dense(units = 7 * 7 * 32, activation = "relu")
self$reshape <- layer_reshape(target_shape = c(7, 7, 32))
self$deconv1 <-
layer_conv_2d_transpose(
filters = 64,
kernel_size = 3,
strides = 2,
padding = "similar",
activation = "relu"
)
self$deconv2 <-
layer_conv_2d_transpose(
filters = 32,
kernel_size = 3,
strides = 2,
padding = "similar",
activation = "relu"
)
self$deconv3 <-
layer_conv_2d_transpose(
filters = 1,
kernel_size = 3,
strides = 1,
padding = "similar"
)
function(x, mask = NULL) {
x %>%
self$dense() %>%
self$reshape() %>%
self$deconv1() %>%
self$deconv2() %>%
self$deconv3()
}
})
}
Note how the final deconvolution does not have the sigmoid activation you might have expected. This is because we will be using tf$nn$sigmoid_cross_entropy_with_logits when calculating the loss.
Speaking of losses, let's look at them now.
Loss calculations
One way to implement the VAE loss is to combine reconstruction loss (cross entropy, in the present case) and Kullback-Leibler divergence. In Keras, the latter is directly available as loss_kullback_leibler_divergence.
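As an aside only – this is not the route taken below – for a diagonal Gaussian posterior and a standard normal prior, the KL term also has a simple closed form that could be added to the reconstruction loss. A sketch, where mean and logvar stand for the encoder outputs introduced later:

# closed-form KL( N(mean, exp(logvar)) || N(0, I) ), summed over the latent dimensions
kl_loss <- -0.5 * k_sum(
  1 + logvar - k_square(mean) - k_exp(logvar),
  axis = 2L
)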
Here, we follow a recent Google Colaboratory notebook in batch-estimating the complete ELBO instead (instead of just estimating reconstruction loss and computing the KL divergence analytically):
\[ ELBO\ batch\ estimate = \log p(x_{batch}|z_{sampled}) + \log p(z) - \log q(z_{sampled}|x_{batch}) \]
The calculation of the normal loglikelihood is packaged into a function so we can reuse it during the training loop.
# log density of `sample` under a diagonal Gaussian with the given mean and log-variance
normal_loglik <- function(sample, mean, logvar, reduce_axis = 2) {
loglik <- k_constant(0.5, dtype = tf$float64) *
(k_log(2 * k_constant(pi, dtype = tf$float64)) +
logvar +
k_exp(-logvar) * (sample - mean) ^ 2)
- k_sum(loglik, axis = reduce_axis)
}
Peeking ahead a bit: during training, we will compute the above as follows.
First,
crossentropy_loss <- tf$nn$sigmoid_cross_entropy_with_logits(
logits = preds,
labels = x
)
logpx_z <- - k_sum(crossentropy_loss)
yields \(\log p(x|z)\), the loglikelihood of the reconstructed samples given values sampled from latent space (a.k.a. reconstruction loss).
Then,
logpz <- normal_loglik(
z,
k_constant(0, dtype = tf$float64),
k_constant(0, dtype = tf$float64)
)
gives \(\log p(z)\), the prior loglikelihood of \(z\). The prior is assumed to be standard normal, as is most often the case with VAEs.
Lastly,
logqz_x <- normal_loglik(z, mean, logvar)
yields \(\log q(z|x)\), the loglikelihood of the samples \(z\) given mean and variance computed from the observed samples \(x\).
From these three components, we will compute the final loss as
loss <- -k_mean(logpx_z + logpz - logqz_x)
After this peeking ahead, let's quickly finish the setup so we are ready for training.
Final setup
Besides the loss, we need an optimizer that will attempt to minimize it.
optimizer <- tf$train$AdamOptimizer(1e-4)
We instantiate our models …
encoder <- encoder_model()
decoder <- decoder_model()
and set up checkpointing, so we can later restore trained weights.
checkpoint_dir <- "./checkpoints_cvae"
checkpoint_prefix <- file.path(checkpoint_dir, "ckpt")
checkpoint <- tf$train$Checkpoint(
optimizer = optimizer,
encoder = encoder,
decoder = decoder
)
From the training loop, we will, at certain intervals, also call three functions not reproduced here (but available in the code example): generate_random_clothes, used to generate clothes from random samples from latent space; show_latent_space, which displays the complete test set in latent (2-dimensional, thus easily visualizable) space; and show_grid, which generates clothes according to input values systematically spaced out on a grid.
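As an illustration only, here is a minimal sketch of what show_latent_space might look like – the ggplot2-based plotting, the variables test_x and test_labels, and saving to a PNG file are all assumptions, not the code example's actual implementation:

library(ggplot2)

show_latent_space <- function(epoch) {
  # encode the complete test set and plot the 2-d latent means, colored by class
  c(test_mean, test_logvar) %<-% encoder(test_x)
  latent <- as.array(test_mean)
  df <- data.frame(
    z1 = latent[, 1],
    z2 = latent[, 2],
    class = factor(test_labels)
  )
  p <- ggplot(df, aes(x = z1, y = z2, color = class)) +
    geom_point(size = 0.5) +
    theme_minimal()
  ggsave(paste0("latent_space_epoch_", epoch, ".png"), plot = p)
}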
Let's start training! Actually, before we do that, let's check what these functions display before any training: instead of clothes, we see random pixels. Latent space has no structure. And different types of clothes do not cluster together in latent space.

Training loop
We are training for 50 epochs here. For each epoch, we loop over the training set in batches. For each batch, we follow the usual eager execution flow: inside the context of a GradientTape, apply the model and calculate the current loss; then, outside this context, calculate the gradients and let the optimizer perform backprop.
What is special here is that we have two models that both need their gradients calculated and weights adjusted. This can be taken care of by a single gradient tape, provided we create it persistent.
After each epoch, we save current weights, and every ten epochs, we also save plots for later inspection.
num_epochs <- 50
for (epoch in seq_len(num_epochs)) {
iter <- make_iterator_one_shot(train_dataset)
total_loss <- 0
logpx_z_total <- 0
logpz_total <- 0
logqz_x_total <- 0
until_out_of_range({
x <- iterator_get_next(iter)
with(tf$GradientTape(persistent = TRUE) %as% tape, {
c(mean, logvar) %<-% encoder(x)
z <- reparameterize(imply, logvar)
preds <- decoder(z)
crossentropy_loss <-
tf$nn$sigmoid_cross_entropy_with_logits(logits = preds, labels = x)
logpx_z <-
- k_sum(crossentropy_loss)
logpz <-
normal_loglik(z,
k_constant(0, dtype = tf$float64),
k_constant(0, dtype = tf$float64)
)
logqz_x <- normal_loglik(z, mean, logvar)
loss <- -k_mean(logpx_z + logpz - logqz_x)
})
total_loss <- total_loss + loss
logpx_z_total <- tf$reduce_mean(logpx_z) + logpx_z_total
logpz_total <- tf$reduce_mean(logpz) + logpz_total
logqz_x_total <- tf$reduce_mean(logqz_x) + logqz_x_total
encoder_gradients <- tape$gradient(loss, encoder$variables)
decoder_gradients <- tape$gradient(loss, decoder$variables)
optimizer$apply_gradients(
purrr::transpose(list(encoder_gradients, encoder$variables)),
global_step = tf$train$get_or_create_global_step()
)
optimizer$apply_gradients(
purrr::transpose(list(decoder_gradients, decoder$variables)),
global_step = tf$train$get_or_create_global_step()
)
})
checkpoint$save(file_prefix = checkpoint_prefix)
cat(
glue(
"Losses (epoch): {epoch}:",
" {(as.numeric(logpx_z_total)/batches_per_epoch) %>% spherical(2)} logpx_z_total,",
" {(as.numeric(logpz_total)/batches_per_epoch) %>% spherical(2)} logpz_total,",
" {(as.numeric(logqz_x_total)/batches_per_epoch) %>% spherical(2)} logqz_x_total,",
" {(as.numeric(total_loss)/batches_per_epoch) %>% spherical(2)} complete"
),
"n"
)
if (epoch %% 10 == 0) {
generate_random_clothes(epoch)
show_latent_space(epoch)
show_grid(epoch)
}
}
Results
How well did that work? Let's see the types of clothes generated after 50 epochs.

Also, how disentangled (or not) are the different classes in latent space?

And now watch different clothes morph into one another.

How good are these representations? This is hard to say when there is nothing to compare with.
So let's dive into MMD-VAE and see how it does on the same dataset.
MMD-VAE
MMD-VAE promises to generate more informative latent features, so we would hope to see different behavior especially in the clustering and morphing plots.
Data setup is the same, and there are only very slight differences in the model. Please check out the complete code for this example, mmd_cvae.R, as here we will just highlight the differences.
Differences in the model(s)
There are three differences with regard to model architecture.
One, the encoder does not have to return the variance, so there is no need for tf$split. The encoder's call method now just is
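# (a sketch – the actual call method is not reproduced in this extract; the layer
#  names self$conv1, self$conv2, self$flatten and self$dense follow the encoder
#  sketch shown for the standard VAE above)
function(x, mask = NULL) {
  x %>%
    self$conv1() %>%
    self$conv2() %>%
    self$flatten() %>%
    self$dense()
}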
Between the encoder and the decoder, we don't need the sampling step anymore, so there is no reparameterization.
And since we won't use tf$nn$sigmoid_cross_entropy_with_logits to compute the loss, we let the decoder apply the sigmoid in the last deconvolution layer:
self$deconv3 <- layer_conv_2d_transpose(
filters = 1,
kernel_size = 3,
strides = 1,
padding = "similar",
activation = "sigmoid"
)
Loss calculations
Now, as expected, the big novelty is in the loss function.
The loss, maximum mean discrepancy (MMD), is based on the idea that two distributions are identical if and only if all their moments are identical.
Concretely, MMD is estimated using a kernel, such as the Gaussian kernel
\[ k(z, z') = e^{-\frac{||z - z'||^2}{2\sigma^2}} \]
to assess similarity between distributions.
The idea then is that if two distributions are identical, the average similarity between samples from each distribution should be identical to the average similarity between mixed samples from both distributions:
\[ MMD(p(z)||q(z)) = E_{p(z),p(z')}[k(z,z')] + E_{q(z),q(z')}[k(z,z')] - 2E_{p(z),q(z')}[k(z,z')] \]
The following code is a direct port of the author's original TensorFlow code:
compute_kernel <- function(x, y) {
x_size <- k_shape(x)[1]
y_size <- k_shape(y)[1]
dim <- k_shape(x)[2]
tiled_x <- k_tile(
k_reshape(x, k_stack(list(x_size, 1, dim))),
k_stack(list(1, y_size, 1))
)
tiled_y <- k_tile(
k_reshape(y, k_stack(list(1, y_size, dim))),
k_stack(list(x_size, 1, 1))
)
k_exp(-k_mean(k_square(tiled_x - tiled_y), axis = 3) /
k_cast(dim, tf$float64))
}
compute_mmd <- function(x, y, sigma_sqr = 1) {
x_kernel <- compute_kernel(x, x)
y_kernel <- compute_kernel(y, y)
xy_kernel <- compute_kernel(x, y)
k_mean(x_kernel) + k_mean(y_kernel) - 2 * k_mean(xy_kernel)
}
Training loop
The training loop differs from the standard VAE example only in the loss calculations.
Here are the respective lines:
with(tf$GradientTape(persistent = TRUE) %as% tape, {
mean <- encoder(x)
preds <- decoder(mean)
true_samples <- k_random_normal(
shape = c(batch_size, latent_dim),
dtype = tf$float64
)
loss_mmd <- compute_mmd(true_samples, mean)
loss_nll <- k_mean(k_square(x - preds))
loss <- loss_nll + loss_mmd
})
So we simply compute MMD loss as well as reconstruction loss, and add them up. No sampling is involved in this version.
Of course, we are curious to see how well that worked!
Results
Again, let's look at some generated clothes first. It seems like edges are much sharper here.

The clusters, too, look more nicely spread out over the two dimensions. And they are centered at (0,0), as we would have hoped for.

Finally, let's see clothes morph into one another. Here, the smooth, continuous evolutions are impressive!
Also, nearly all of the space is filled with meaningful objects, which hasn't been the case above.

MNIST
For curiosity's sake, we generated the same kinds of plots after training on original MNIST.
Here, there are hardly any differences visible in the generated random digits after 50 epochs of training.

Also, the differences in clustering are not that big.

But here too, the morphing looks much more organic with MMD-VAE.

Conclusion
To us, this demonstrates impressively how big a difference the cost function can make when working with VAEs.
Another component open to experimentation would be the prior used for the latent space – see this talk for an overview of alternative priors and the "Variational Mixture of Posteriors" paper (Tomczak and Welling 2017) for a popular recent approach.
For both cost functions and priors, we expect effective differences to become much bigger still once we leave the controlled environment of (Fashion) MNIST and work with real-world datasets.
Burgess, Christopher P., Irina Higgins, Arka Pal, Loic Matthey, Nick Watters, Guillaume Desjardins, and Alexander Lerchner. 2018. "Understanding Disentangling in β-VAE." CoRR abs/1804.03599.
Kingma, Diederik P., and Max Welling. 2013. "Auto-Encoding Variational Bayes." CoRR abs/1312.6114.
Tomczak, Jakub M., and Max Welling. 2017. "VAE with a VampPrior." CoRR abs/1705.07120.
Zhao, Shengjia, Jiaming Song, and Stefano Ermon. 2017. "InfoVAE: Information Maximizing Variational Autoencoders." CoRR abs/1706.02262.
