Tutorial 8 — Molecular Generation with VAEs · AI4Chemical Sciences Bootcamp

What you will learn

Implement an LSTM-based encoder and decoder for SMILES strings in PyTorch
Train a VAE with the ELBO objective (reconstruction loss plus a KL divergence term, with KL annealing)
Sample from the latent space and decode into SMILES, measuring validity and novelty
Perform latent-space interpolation between two molecules and visualise intermediate structures

Instructions

Getting started

Open the Colab notebook linked under Resources. Run the cells in order; cells marked Exercise require you to fill in code.

Set up the environment. Install torch and rdkit. Load the ZINC-250k SMILES dataset and build the vocabulary (character-to-index mapping).
Build the VAE. Complete the Encoder (bidirectional LSTM → linear layers for μ and log σ²) and Decoder (linear projection of the latent vector → LSTM → character logits) classes. Implement the reparameterisation trick.
Train with KL annealing. Implement the cyclical KL annealing schedule. Train for 30 epochs, logging reconstruction accuracy and KL divergence separately.
Sample and evaluate. Sample 1000 latent vectors from N(0,I). Decode each to a SMILES string using greedy and temperature-based sampling. Compute validity, uniqueness, and novelty.
Interpolate. Pick two valid molecules. Encode both to get μ₁ and μ₂. Linearly interpolate 10 points between them, decode each, and draw the resulting molecules.
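
The reparameterisation trick, the KL term, and the cyclical annealing schedule from the steps above can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions, not the notebook's implementation: the function names, the four cycles, and the linear ramp over the first half of each cycle are illustrative choices, and in the notebook the same arithmetic would operate on PyTorch tensors rather than lists.

```python
import math
import random


def reparameterise(mu, logvar, rng=random):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1), where sigma = exp(0.5 * logvar)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, logvar)]


def kl_divergence(mu, logvar):
    """Closed-form KL( N(mu, sigma^2) || N(0, I) ) for a diagonal Gaussian."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, logvar))


def kl_weight(step, total_steps, n_cycles=4, ramp_fraction=0.5, beta_max=1.0):
    """Cyclical KL weight (assumed shape: linear ramp from 0, hold at beta_max, restart)."""
    cycle_length = total_steps / n_cycles
    # Position within the current cycle, in [0, 1).
    position = (step % cycle_length) / cycle_length
    # Ramp linearly during the first `ramp_fraction` of the cycle, then hold.
    return beta_max * min(1.0, position / ramp_fraction)


def negative_elbo(reconstruction_loss, kl, beta):
    """Training loss: reconstruction term plus the annealed KL term."""
    return reconstruction_loss + beta * kl
```

Logging `reconstruction_loss` and `kl` separately, as step 3 asks, makes it easier to see whether the KL term collapses towards zero each time the weight restarts.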

Questions

Answer these questions as you work through the notebook. Discuss with your neighbour — some have no single right answer.

Warm-up

After 30 epochs, what is the reconstruction accuracy (fraction of training SMILES decoded exactly)? What fraction of sampled SMILES are chemically valid?

Easy
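
The quantities in this question can be computed along the following lines. This is a sketch under assumptions: the function names are hypothetical, the metric definitions follow one common convention (validity over all samples, uniqueness over valid samples, novelty over unique valid samples), and the validity check is passed in as a callable. In the notebook that callable would typically wrap RDKit, for example `Chem.MolFromSmiles(s) is not None`, and SMILES would be canonicalised before comparison; here strings are compared as-is.

```python
def reconstruction_accuracy(originals, reconstructions):
    """Fraction of input SMILES that the VAE decodes back exactly."""
    matches = sum(o == r for o, r in zip(originals, reconstructions))
    return matches / len(originals)


def generation_metrics(generated, training_set, is_valid):
    """Return (validity, uniqueness, novelty) for a list of sampled SMILES.

    `is_valid` is a caller-supplied function (an assumption of this sketch).
    """
    valid = [s for s in generated if is_valid(s)]
    unique = set(valid)
    novel = unique - set(training_set)
    validity = len(valid) / len(generated) if generated else 0.0
    uniqueness = len(unique) / len(valid) if valid else 0.0
    novelty = len(novel) / len(unique) if unique else 0.0
    return validity, uniqueness, novelty
```

For example, with samples `["CCO", "CCO", "c1ccccc1", "C(("]`, a training set containing only `"CCO"`, and a checker that rejects `"C(("`, three of four samples are valid, two of the three valid samples are unique, and one of the two unique samples is novel.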

Core

How does sampling temperature affect validity vs. diversity? At temperature 1.0 vs. 0.5, which gives higher validity? Which gives more diverse structures?

Medium
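
To make the role of temperature concrete, the sketch below applies it to the logits for a single decoding step. The function names are hypothetical and plain Python lists stand in for the decoder's output tensor; the mechanism is the usual one, where logits are divided by the temperature before the softmax, so values below 1.0 sharpen the distribution and greedy decoding is the limit of always taking the most likely character.

```python
import math
import random


def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    shift = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_index(logits, temperature=1.0, rng=random):
    """Sample one character index from the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def greedy_index(logits):
    """Greedy decoding: always pick the highest-scoring character."""
    return max(range(len(logits)), key=lambda i: logits[i])
```

Comparing `softmax_with_temperature(logits, 1.0)` with `softmax_with_temperature(logits, 0.5)` for a few decoder steps is a quick way to see how much probability mass moves onto the top character before running the full validity and diversity comparison.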

Core

Examine the interpolation path between two molecules with different scaffolds. Is the transition gradual or does it jump abruptly? How many intermediate structures are chemically valid?

Medium
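
The interpolation path examined here comes from linearly blending the two encoded means. A minimal sketch, assuming the endpoints are included among the 10 points and using plain lists in place of latent tensors (the function name is hypothetical):

```python
def interpolate(mu_1, mu_2, n_points=10):
    """Return n_points latent vectors evenly spaced from mu_1 to mu_2, endpoints included.

    Requires n_points >= 2.
    """
    points = []
    for k in range(n_points):
        t = k / (n_points - 1)  # t runs from 0.0 to 1.0
        points.append([(1 - t) * a + t * b for a, b in zip(mu_1, mu_2)])
    return points
```

Each returned vector would then be passed through the decoder, and the decoded SMILES checked for validity before drawing.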

Challenge

Add a property predictor head (logP) trained jointly with the VAE. Use it to guide latent-space optimisation (gradient ascent on logP) from a seed molecule. Report the change in logP and whether the optimised SMILES is valid.

Challenge

Resources

Notebook (Colab)
GitHub repo
Paired lecture notes
