Encode and decode molecules in a continuous latent space using a character-level VAE, then explore the space to find new valid candidates.
What you will learn
Instructions
Open the Colab notebook using the button above. Run each cell in order; cells marked Exercise require you to fill in code.
Set up the environment. Install torch and rdkit. Load the ZINC-250k SMILES dataset and build the vocabulary (character-to-index mapping).
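As a rough illustration of the character-to-index mapping, the sketch below builds a vocabulary from a list of SMILES strings and round-trips one string through it. The special tokens `<pad>`, `<sos>`, `<eos>` and the helper names `build_vocab`, `encode`, `decode` are assumptions made for this example; the notebook may name or structure them differently.

```python
# Hypothetical special tokens; the notebook may use different ones.
PAD, SOS, EOS = "<pad>", "<sos>", "<eos>"


def build_vocab(smiles_list):
    """Return (stoi, itos) for every character seen in the dataset."""
    chars = sorted({ch for smi in smiles_list for ch in smi})
    itos = [PAD, SOS, EOS] + chars          # index -> token
    stoi = {tok: i for i, tok in enumerate(itos)}  # token -> index
    return stoi, itos


def encode(smiles, stoi, max_len):
    """Map a SMILES string to a fixed-length list of indices."""
    ids = [stoi[SOS]] + [stoi[ch] for ch in smiles] + [stoi[EOS]]
    if len(ids) > max_len:
        raise ValueError("SMILES longer than max_len")
    return ids + [stoi[PAD]] * (max_len - len(ids))  # right-pad


def decode(ids, itos):
    """Map indices back to a string, stopping at the first EOS."""
    chars = []
    for i in ids:
        tok = itos[i]
        if tok == EOS:
            break
        if tok in (PAD, SOS):
            continue
        chars.append(tok)
    return "".join(chars)


stoi, itos = build_vocab(["CCO", "c1ccccc1Cl"])
ids = encode("CCO", stoi, max_len=8)
print(ids, decode(ids, itos))
```

Note that a purely character-level vocabulary splits two-letter atoms such as `Cl` into `C` and `l`, so the decoder has to learn to emit them as a pair.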
Build the VAE. Complete the Encoder (bidirectional LSTM → linear layers for μ and log σ²) and Decoder (linear reparameterisation → LSTM → character logits) classes. Implement the reparameterisation trick.
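For reference, the reparameterisation trick and the KL term it feeds into are usually written as below. This assumes a diagonal Gaussian posterior and a standard normal prior; the latent dimension d and the index j are symbols introduced here, and the notebook's notation may differ.

```latex
\begin{aligned}
\sigma &= \exp\!\left(\tfrac{1}{2}\log\sigma^{2}\right), \qquad \epsilon \sim \mathcal{N}(0, I), \\
z &= \mu + \sigma \odot \epsilon, \\
D_{\mathrm{KL}}\!\left(\mathcal{N}(\mu, \operatorname{diag}\sigma^{2}) \,\|\, \mathcal{N}(0, I)\right)
  &= -\tfrac{1}{2}\sum_{j=1}^{d}\left(1 + \log\sigma_j^{2} - \mu_j^{2} - \sigma_j^{2}\right).
\end{aligned}
```

Because the randomness enters only through ε, gradients can flow through μ and log σ² back to the encoder.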
Train with KL annealing. Implement the cyclical KL annealing schedule. Train for 30 epochs, logging reconstruction accuracy and KL divergence separately.
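One common form of a cyclical schedule ramps the KL weight β linearly from 0 to its maximum over the first part of each cycle and then holds it there until the cycle ends. The sketch below follows that form; the function name `cyclical_beta`, the number of cycles, and the ramp fraction are assumptions for illustration, not values taken from the notebook.

```python
def cyclical_beta(step, total_steps, n_cycles=4, ramp_fraction=0.5, beta_max=1.0):
    """KL weight at a given training step under a cyclical linear schedule.

    n_cycles, ramp_fraction and beta_max are illustrative defaults.
    """
    if total_steps <= 0 or n_cycles <= 0:
        raise ValueError("total_steps and n_cycles must be positive")
    if not 0.0 < ramp_fraction <= 1.0:
        raise ValueError("ramp_fraction must be in (0, 1]")
    cycle_len = total_steps / n_cycles
    # Position within the current cycle, in [0, 1).
    position = (step % cycle_len) / cycle_len
    # Linear ramp during the first ramp_fraction of the cycle, then flat.
    return beta_max * min(1.0, position / ramp_fraction)


# With 100 steps and 4 cycles, beta restarts from 0 every 25 steps.
for step in (0, 5, 12, 20, 25, 30):
    print(step, round(cyclical_beta(step, total_steps=100), 2))
```

In a training loop the returned value would multiply the KL term, so the loss takes the form `recon_loss + beta * kl`, while the unweighted reconstruction and KL values are still logged separately.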
Sample and evaluate. Sample 1000 latent vectors from N(0,I). Decode each to a SMILES string using greedy and temperature-based sampling. Compute validity, uniqueness, and novelty.
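The three metrics can be defined in more than one way, so it helps to pin the definitions down in code. The sketch below assumes a common convention: validity is measured over all samples, uniqueness over the valid ones, and novelty over the unique valid ones that do not appear in the training set. The names `generation_metrics` and `rdkit_canonicalise` are hypothetical; the RDKit calls used are `Chem.MolFromSmiles` and `Chem.MolToSmiles`.

```python
def rdkit_canonicalise(smiles):
    """Return canonical SMILES, or None if RDKit cannot parse the string."""
    from rdkit import Chem  # imported lazily so the metric code runs without RDKit
    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(mol) if mol is not None else None


def generation_metrics(sampled, training, canonicalise):
    """Compute validity, uniqueness and novelty for sampled SMILES.

    canonicalise maps a SMILES string to a canonical form, or to None
    (or an empty string) when the input is not a valid molecule.
    """
    canon = [canonicalise(s) for s in sampled]
    valid = [c for c in canon if c]          # drops None and empty strings
    unique = set(valid)
    train_canon = {c for c in (canonicalise(s) for s in training) if c}
    novel = unique - train_canon
    return {
        "validity": len(valid) / len(sampled) if sampled else 0.0,
        "uniqueness": len(unique) / len(valid) if valid else 0.0,
        "novelty": len(novel) / len(unique) if unique else 0.0,
    }
```

With RDKit installed, a call might look like `generation_metrics(decoded_smiles, train_smiles, rdkit_canonicalise)`, where `decoded_smiles` and `train_smiles` are placeholder names for the notebook's own lists. Comparing canonical forms avoids counting two spellings of the same molecule as different.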
Interpolate. Pick two valid molecules. Encode both to get μ₁ and μ₂. Linearly interpolate 10 points between them, decode each, and draw the resulting molecules.
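The interpolation step can be sketched in plain Python as follows. The function name `interpolate` is hypothetical, and the sketch assumes the 10 points include both endpoints μ₁ and μ₂; if the notebook means 10 strictly interior points, the spacing changes accordingly.

```python
def interpolate(mu1, mu2, n_points=10):
    """Return n_points latent vectors evenly spaced on the line from mu1 to mu2.

    Endpoints are included (an assumption of this sketch).
    """
    if len(mu1) != len(mu2):
        raise ValueError("latent vectors must have the same dimension")
    if n_points < 2:
        raise ValueError("need at least two points")
    path = []
    for i in range(n_points):
        t = i / (n_points - 1)  # t runs from 0.0 to 1.0
        path.append([(1 - t) * a + t * b for a, b in zip(mu1, mu2)])
    return path


# Toy 2-dimensional example; real latent vectors come from the encoder.
for z in interpolate([0.0, 0.0], [9.0, -9.0]):
    print(z)
```

Each vector on the path would then be passed to the trained decoder, and only the decoded strings that parse as valid molecules can be drawn.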
Questions
Answer these questions as you work through the notebook. Discuss with your neighbour — some have no single right answer.
After 30 epochs, what is the reconstruction accuracy (fraction of training SMILES decoded exactly)? What fraction of sampled SMILES are chemically valid? (Easy)
How does sampling temperature affect validity vs. diversity? At temperature 1.0 vs. 0.5, which gives higher validity? Which gives more diverse structures? (Medium)
Examine the interpolation path between two molecules with different scaffolds. Is the transition gradual or does it jump abruptly? How many intermediate structures are chemically valid? (Medium)
Add a property predictor head (logP) trained jointly with the VAE. Use it to guide latent-space optimisation (gradient ascent on logP) from a seed molecule. Report the change in logP and whether the optimised SMILES is valid. (Challenge)
Resources