Tutorial 10 — Active Learning for DFT Triage · AI4Chemical Sciences Bootcamp

What you will learn

Set up a pool-based active learning loop with a Gaussian Process surrogate and Morgan fingerprint features
Implement uncertainty sampling, random sampling, and greedy (exploit-only) query strategies
Plot learning curves (MAE vs. number of DFT calculations) comparing all three strategies
Analyse which molecular structures are selected earliest by each strategy

Instructions

Getting started

Open the Colab notebook linked under Resources. Run each cell in order; cells marked Exercise require you to fill in code.

Set up the environment. Install scikit-learn, gpytorch, and rdkit. Load the ANI-1 energy dataset subset (5000 molecules) as the unlabelled pool; 50 random molecules form the initial labelled set.
Implement the GP surrogate. Build a GP with an RBF kernel on 2048-bit Morgan fingerprints. Fit on the initial labelled set and compute posterior mean and variance on the full pool.
Implement query strategies. Complete three query functions: random (select uniformly at random), uncertainty (select highest posterior variance), and greedy (select lowest predicted energy). Each selects a batch of 10 molecules per round.
Run the AL loop. Run 20 rounds (200 total queries) for each strategy. After each round, retrain the GP and evaluate MAE on a held-out test set. Log MAE per round.
Analyse selection patterns. For uncertainty sampling, plot a histogram of the molecular weight of selected molecules across rounds. Does the strategy initially focus on a particular region of chemical space?
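
The steps above can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the notebook's solution: it replaces gpytorch with a hand-written exact GP posterior in NumPy, uses random 256-bit vectors as stand-ins for Morgan fingerprints, and uses a synthetic linear target in place of ANI-1 energies. The function names (gp_posterior, run_al, query_random, query_uncertainty, query_greedy), the length scale, and the noise level are all hypothetical choices made for the example.

```python
import numpy as np

def rbf_kernel(A, B, length_scale):
    """RBF (squared-exponential) kernel between rows of A and rows of B."""
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-0.5 * np.maximum(sq, 0.0) / length_scale**2)

def gp_posterior(X_tr, y_tr, X_q, length_scale=5.0, noise=1e-2):
    """Exact GP regression: posterior mean and variance at the query points."""
    K = rbf_kernel(X_tr, X_tr, length_scale) + noise * np.eye(len(X_tr))
    K_s = rbf_kernel(X_tr, X_q, length_scale)
    L = np.linalg.cholesky(K)
    y_mean = y_tr.mean()  # constant mean function (assumption)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_tr - y_mean))
    v = np.linalg.solve(L, K_s)
    mean = K_s.T @ alpha + y_mean
    var = np.maximum(1.0 - (v**2).sum(0), 0.0)  # k(x, x) = 1 for the RBF kernel
    return mean, var

# Each strategy returns positions into the current unlabelled candidates.
def query_random(mean, var, batch, rng):
    return rng.choice(len(mean), size=batch, replace=False)

def query_uncertainty(mean, var, batch, rng):
    return np.argsort(var)[-batch:]   # highest posterior variance

def query_greedy(mean, var, batch, rng):
    return np.argsort(mean)[:batch]   # lowest predicted energy

def run_al(X_pool, y_pool, X_test, y_test, query,
           n_init=50, batch=10, rounds=20, seed=0):
    """Pool-based loop: query a batch, label it, refit, log test MAE."""
    rng = np.random.default_rng(seed)
    labelled = rng.choice(len(X_pool), size=n_init, replace=False).tolist()
    maes = []
    for _ in range(rounds):
        unlabelled = np.setdiff1d(np.arange(len(X_pool)), labelled)
        mean, var = gp_posterior(X_pool[labelled], y_pool[labelled],
                                 X_pool[unlabelled])
        picked = unlabelled[query(mean, var, batch, rng)]
        labelled.extend(picked.tolist())  # "run DFT" = reveal the stored label
        pred, _ = gp_posterior(X_pool[labelled], y_pool[labelled], X_test)
        maes.append(float(np.abs(pred - y_test).mean()))
    return maes, labelled

# Synthetic stand-in data (assumption): sparse binary features, linear target.
data_rng = np.random.default_rng(0)
X = (data_rng.random((600, 256)) < 0.1).astype(float)
y = X @ data_rng.normal(size=256) + 0.05 * data_rng.normal(size=600)
X_pool, y_pool, X_test, y_test = X[:500], y[:500], X[500:], y[500:]

strategies = {"random": query_random,
              "uncertainty": query_uncertainty,
              "greedy": query_greedy}
curves = {}
for name, fn in strategies.items():
    curves[name], _ = run_al(X_pool, y_pool, X_test, y_test, fn)
    print(f"{name:12s} final test MAE = {curves[name][-1]:.3f}")
```

Each entry of curves holds one MAE value per round, so plotting it against 50 + 10 × round gives the learning curve asked for in the notebook. On real fingerprints the kernel length scale would need to be tuned or learned rather than fixed.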

Questions

Answer these questions as you work through the notebook. Discuss with your neighbour — some have no single right answer.

Warm-up (Easy)

After 50 initial labelled molecules, what is the GP test MAE? How does this compare to a model trained on all 5000 molecules?

Core (Medium)

After 20 AL rounds, which strategy achieves the lowest MAE: random, uncertainty, or greedy? Is the ordering consistent across all rounds, or does it change?

Core (Medium)

Plot the posterior variance on the pool before and after 10 rounds of uncertainty sampling. How does the spatial distribution of uncertainty change? Are there still high-uncertainty regions?

Challenge

Implement a diversity-aware query strategy (determinantal point process or k-means++ seeding) that selects a batch of 10 molecules that are both uncertain and mutually dissimilar. Compare its learning curve to pure uncertainty sampling.
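
As a starting point for the challenge, one possible reading of "k-means++ seeding" is sketched below. It is an assumption, not a reference solution: the first pick is the most uncertain candidate, and each later pick is sampled with probability proportional to posterior variance times squared Euclidean distance to the nearest molecule already in the batch. The function name query_diverse and the toy data are hypothetical, and for binary fingerprints a Tanimoto distance may be a better choice than the Euclidean one used here.

```python
import numpy as np

def query_diverse(X_cand, var, batch, rng):
    """k-means++-style seeding weighted by posterior variance (a sketch)."""
    first = int(np.argmax(var))
    chosen = [first]
    # Squared distance from every candidate to its nearest chosen point.
    d2 = ((X_cand - X_cand[first]) ** 2).sum(1)
    while len(chosen) < batch:
        w = var * d2
        w[chosen] = 0.0                      # never re-select a chosen point
        if w.sum() <= 0.0:                   # degenerate case: fall back to uniform
            w = np.ones(len(var))
            w[chosen] = 0.0
        nxt = int(rng.choice(len(var), p=w / w.sum()))
        chosen.append(nxt)
        d2 = np.minimum(d2, ((X_cand - X_cand[nxt]) ** 2).sum(1))
    return np.array(chosen)

# Toy candidates (assumption): a tight, highly uncertain cluster (rows 0-49)
# and a spread-out, moderately uncertain region (rows 50-199).
rng = np.random.default_rng(0)
X_cand = np.vstack([0.01 * rng.normal(size=(50, 16)),
                    3.0 * rng.normal(size=(150, 16))])
var = np.concatenate([np.full(50, 1.0), np.full(150, 0.5)])

top_var = np.argsort(var)[-10:]              # pure uncertainty sampling
diverse = query_diverse(X_cand, var, 10, rng)
print("pure uncertainty picks from tight cluster:", int((top_var < 50).sum()))
print("diverse picks from tight cluster:", int((diverse < 50).sum()))
```

In this toy setting pure uncertainty sampling spends its whole batch on near-duplicate points, which is the failure mode the diversity term is meant to avoid.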

Resources

Notebook (Colab) · GitHub repo · Paired lecture notes
