Tutorial 8

Molecular Generation with VAEs

Encode and decode molecules in a continuous latent space using a character-level VAE, then explore the space to find new valid candidates.

August 13, 2026 · 14:45 – 17:00
105 min
Python · Google Colab

Open in Google Colab

The notebook has most of the code pre-filled. Complete the exercises marked ### YOUR CODE HERE ###.

Open Notebook

Getting started

Open the Colab notebook using the button above. Run each cell in order; cells marked Exercise require you to fill in code.

  1. Set up the environment. Install torch and rdkit. Load the ZINC-250k SMILES dataset and build the vocabulary (character-to-index mapping).

  2. Build the VAE. Complete the Encoder (bidirectional LSTM → linear layers for μ and log σ²) and Decoder (linear projection of z → LSTM → character logits) classes. Implement the reparameterisation trick.

  3. Train with KL annealing. Implement the cyclical KL annealing schedule. Train for 30 epochs, logging reconstruction accuracy and KL divergence separately.

  4. Sample and evaluate. Sample 1000 latent vectors from N(0,I). Decode each to a SMILES string using greedy and temperature-based sampling. Compute validity, uniqueness, and novelty.

  5. Interpolate. Pick two valid molecules. Encode both to get μ₁ and μ₂. Linearly interpolate 10 points between them, decode each, and draw the resulting molecules.
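
The two pieces that steps 2 and 3 ask you to implement can be sketched without any framework. This is a minimal illustration in plain Python, not the notebook's solution: in the notebook the same arithmetic would run on torch tensors, and the names `reparameterise`, `cyclical_beta`, `steps_per_cycle`, and `ramp_fraction` are assumptions made here. The schedule shown is one common cyclical form (linear ramp, then hold); the notebook may use a different shape.

```python
import math
import random


def reparameterise(mu, logvar, rng=random):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1), element-wise.

    sigma is recovered from the log-variance as exp(0.5 * logvar).
    """
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, logvar)]


def cyclical_beta(step, steps_per_cycle, ramp_fraction=0.5, beta_max=1.0):
    """KL weight at a given training step under a cyclical linear schedule.

    Within each cycle, beta rises linearly from 0 to beta_max over the first
    ramp_fraction of the cycle, then stays at beta_max until the cycle ends.
    """
    position = (step % steps_per_cycle) / steps_per_cycle  # in [0, 1)
    return beta_max * min(1.0, position / ramp_fraction)


# The per-batch loss would then be:
#   reconstruction_loss + cyclical_beta(step, steps_per_cycle) * kl_divergence
betas = [cyclical_beta(s, steps_per_cycle=10) for s in range(20)]
```

Plotting `betas` against the step index is a quick way to check that the weight resets to zero at the start of every cycle before you start a 30-epoch run.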


Answer these questions as you work through the notebook. Discuss with your neighbour — some have no single right answer.

    Warm-up · Easy

    After 30 epochs, what is the reconstruction accuracy (fraction of training SMILES decoded exactly)? What fraction of sampled SMILES are chemically valid?

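
For reference, here is one way the three generation metrics could be computed. Definitions vary between papers; this sketch assumes validity is measured over all samples, uniqueness over the valid ones, and novelty over the unique valid ones. The function name `generation_metrics` and the `canonicalise` callback are assumptions. In the notebook the callback would typically wrap RDKit's `Chem.MolFromSmiles` (which returns `None` for unparsable input) followed by `Chem.MolToSmiles`; a toy stand-in is used here so the example runs anywhere.

```python
def generation_metrics(sampled, training_set, canonicalise):
    """Return (validity, uniqueness, novelty) for a list of sampled SMILES.

    canonicalise(smiles) must return a canonical SMILES string, or None when
    the input cannot be parsed. training_set is assumed to hold canonical
    SMILES produced by the same function.
    """
    canonical = [canonicalise(s) for s in sampled]
    valid = [c for c in canonical if c is not None]
    unique = set(valid)
    novel = unique - set(training_set)

    validity = len(valid) / len(sampled) if sampled else 0.0
    uniqueness = len(unique) / len(valid) if valid else 0.0
    novelty = len(novel) / len(unique) if unique else 0.0
    return validity, uniqueness, novelty


# Toy stand-in for RDKit (an assumption for illustration only): strings
# containing "X" count as invalid, everything else is already canonical.
def toy_canonicalise(smiles):
    return None if "X" in smiles else smiles


metrics = generation_metrics(
    sampled=["CCO", "CCO", "c1ccccc1", "CX"],
    training_set={"CCO"},
    canonicalise=toy_canonicalise,
)
```

Comparing canonical forms matters because the same molecule can be written as several different SMILES strings, which would otherwise inflate uniqueness and novelty.
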
    Core · Medium

    How does sampling temperature affect validity vs. diversity? At temperature 1.0 vs. 0.5, which gives higher validity? Which gives more diverse structures?

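
To make the role of temperature concrete, the sketch below divides the logits by the temperature before the softmax and then samples one character index. It is a plain-Python illustration; the function names and the example logits are assumptions, and the notebook's decoder would apply the same idea to a tensor of logits at every time step.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_index(logits, temperature, rng=random):
    """Sample one character index; temperature 0 falls back to greedy argmax."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.0]  # hypothetical decoder output for one time step
p_warm = softmax_with_temperature(logits, 1.0)
p_cool = softmax_with_temperature(logits, 0.5)
```

Printing `p_warm` and `p_cool` side by side shows that the lower temperature concentrates probability mass on the highest-scoring character, which is the mechanism behind the trade-off the question asks about.
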
    Core · Medium

    Examine the interpolation path between two molecules with different scaffolds. Is the transition gradual or does it jump abruptly? How many intermediate structures are chemically valid?

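
The interpolation itself is a convex combination of the two encoded means, z(t) = (1 − t)·μ₁ + t·μ₂. The sketch below is one possible reading of "10 points": it assumes both endpoints are included, and the function name `interpolate` is hypothetical. Latent vectors are plain lists here; the notebook would use tensors.

```python
def interpolate(mu_1, mu_2, n_points=10):
    """Return n_points latent vectors evenly spaced from mu_1 to mu_2 inclusive."""
    if n_points < 2:
        raise ValueError("n_points must be at least 2 to include both endpoints")
    path = []
    for k in range(n_points):
        t = k / (n_points - 1)  # t runs from 0.0 to 1.0
        path.append([(1 - t) * a + t * b for a, b in zip(mu_1, mu_2)])
    return path


# Toy 2-D latent vectors, chosen only for illustration.
path = interpolate([0.0, 0.0], [9.0, -9.0], n_points=10)
```

Each vector in `path` would then be passed to the decoder, and the decoded SMILES checked for validity before drawing.
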
    Challenge · Challenge

    Add a property predictor head (logP) trained jointly with the VAE. Use it to guide latent-space optimisation (gradient ascent on logP) from a seed molecule. Report the change in logP and whether the optimised SMILES is valid.
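
The optimisation loop amounts to repeatedly stepping the latent vector along the gradient of the predicted property: z ← z + η·∇f(z). The sketch below shows only that loop, under loud assumptions: `toy_predictor` is a made-up smooth function standing in for the trained logP head, and the gradient is estimated by central finite differences so the example needs no framework. In the notebook, autograd on the predictor head would supply the gradient instead.

```python
def numerical_gradient(f, z, eps=1e-5):
    """Estimate the gradient of f at z by central finite differences."""
    grad = []
    for i in range(len(z)):
        up = z[:i] + [z[i] + eps] + z[i + 1:]
        down = z[:i] + [z[i] - eps] + z[i + 1:]
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad


def gradient_ascent(f, z_start, step_size=0.1, n_steps=50):
    """Take n_steps gradient-ascent steps on f, starting from z_start."""
    z = list(z_start)
    for _ in range(n_steps):
        grad = numerical_gradient(f, z)
        z = [zi + step_size * gi for zi, gi in zip(z, grad)]
    return z


# Hypothetical stand-in for the logP head: a smooth bowl with its maximum
# at (1, -2). A real predictor would be a learned network.
def toy_predictor(z):
    return -((z[0] - 1.0) ** 2 + (z[1] + 2.0) ** 2)


z_seed = [0.0, 0.0]
z_opt = gradient_ascent(toy_predictor, z_seed)
```

With a real predictor it is worth keeping the step size and step count small, because a latent vector that drifts far from the region the decoder saw during training may no longer decode to a valid SMILES string.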

Notebook (Colab) · GitHub repo · Paired lecture notes