Run a pool-based active learning loop over QM9/ANI-1 molecules, using Gaussian Process uncertainty to prioritise which calculations to run.
What you will learn
Instructions
Open the Colab notebook using the button above. Run each cell in order; cells marked Exercise require you to fill in code.
Set up the environment. Install scikit-learn, gpytorch, and rdkit. Load a 5000-molecule subset of the ANI-1 energy dataset as the unlabelled pool, and draw 50 molecules at random to form the initial labelled set.
Implement the GP surrogate. Build a GP with an RBF kernel on 2048-bit Morgan fingerprints. Fit on the initial labelled set and compute posterior mean and variance on the full pool.
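A minimal sketch of this step, assuming scikit-learn's GaussianProcessRegressor as the surrogate (the notebook may use gpytorch instead). Random bit vectors and a synthetic energy stand in for real Morgan fingerprints and ANI-1 labels so that the snippet runs on its own; the pool size of 500 and the starting length scale are arbitrary choices, not values from the notebook.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)

# Stand-in data (assumption): random 2048-bit vectors instead of real Morgan
# fingerprints, and a synthetic "energy" so the sketch needs no dataset.
n_pool, n_bits = 500, 2048
X_pool = rng.integers(0, 2, size=(n_pool, n_bits)).astype(float)
y_pool = X_pool @ rng.normal(size=n_bits) / np.sqrt(n_bits)

# 50 random molecules form the initial labelled set.
labelled = rng.choice(n_pool, size=50, replace=False)

# RBF kernel; the length scale is tuned by maximum likelihood during fit,
# and alpha adds a small jitter to the kernel diagonal for stability.
gp = GaussianProcessRegressor(kernel=RBF(length_scale=10.0), alpha=1e-3,
                              normalize_y=True, random_state=0)
gp.fit(X_pool[labelled], y_pool[labelled])

# Posterior mean and variance over the full pool.
mean, std = gp.predict(X_pool, return_std=True)
variance = std ** 2
print(mean.shape, variance.shape)
```

Random bit vectors are almost equidistant from one another, so the stand-in GP learns very little; with real fingerprints the posterior variance should vary much more across the pool.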
Implement query strategies. Complete three query functions: random (select uniformly at random), uncertainty (select highest posterior variance), and greedy (select lowest predicted energy). Each selects a batch of 10 molecules per round.
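One possible shape for the three query functions is sketched below. The function names, the shared signature, and the `candidates` index array (pool indices that are still unlabelled) are assumptions made for illustration; the notebook's exercise cells may use a different interface.

```python
import numpy as np

def query_random(mean, variance, candidates, batch_size, rng):
    """Select batch_size candidates uniformly at random, without replacement."""
    return rng.choice(candidates, size=batch_size, replace=False)

def query_uncertainty(mean, variance, candidates, batch_size, rng=None):
    """Select the candidates with the highest posterior variance."""
    order = np.argsort(variance[candidates])[::-1]  # descending variance
    return candidates[order[:batch_size]]

def query_greedy(mean, variance, candidates, batch_size, rng=None):
    """Select the candidates with the lowest predicted energy."""
    order = np.argsort(mean[candidates])  # ascending predicted energy
    return candidates[order[:batch_size]]

# Toy posterior over five molecules; index 3 is already labelled.
mean = np.array([-2.0, -1.0, 2.0, -3.0, 0.0])
variance = np.array([0.1, 0.9, 0.3, 0.2, 0.8])
candidates = np.array([0, 1, 2, 4])

picked_uncertain = query_uncertainty(mean, variance, candidates, 2)
picked_greedy = query_greedy(mean, variance, candidates, 2)
picked_random = query_random(mean, variance, candidates, 2, np.random.default_rng(0))
```

Passing only unlabelled indices as `candidates` keeps each strategy from re-selecting molecules that already have labels.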
Run the AL loop. Run 20 rounds (200 total queries) for each strategy. After each round, retrain the GP and evaluate MAE on a held-out test set. Log MAE per round.
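The loop can be sketched as follows, again under stated assumptions: scikit-learn's GaussianProcessRegressor with fixed kernel hyperparameters for speed, and a small synthetic dataset (300 pool and 100 test points with 64-bit vectors) in place of the ANI-1 subset. `fit_gp` and `run_al` are hypothetical helper names, not functions from the notebook.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def fit_gp(X, y):
    # Fixed hyperparameters (optimizer=None) keep the sketch fast.
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=4.0), alpha=1e-3,
                                  normalize_y=True, optimizer=None)
    return gp.fit(X, y)

def run_al(strategy, X_pool, y_pool, X_test, y_test,
           n_init=50, n_rounds=20, batch_size=10, seed=0):
    """Return the per-round test MAE and the final labelled indices."""
    rng = np.random.default_rng(seed)
    labelled = rng.choice(len(X_pool), size=n_init, replace=False)
    gp = fit_gp(X_pool[labelled], y_pool[labelled])
    mae_log = []
    for _ in range(n_rounds):
        # Only molecules without a label can be queried.
        candidates = np.setdiff1d(np.arange(len(X_pool)), labelled)
        mean, std = gp.predict(X_pool[candidates], return_std=True)
        if strategy == "random":
            pick = rng.choice(len(candidates), size=batch_size, replace=False)
        elif strategy == "uncertainty":
            pick = np.argsort(std)[-batch_size:]   # highest posterior std
        elif strategy == "greedy":
            pick = np.argsort(mean)[:batch_size]   # lowest predicted energy
        else:
            raise ValueError(f"unknown strategy: {strategy}")
        labelled = np.concatenate([labelled, candidates[pick]])
        # Retrain on the enlarged labelled set, then log the held-out MAE.
        gp = fit_gp(X_pool[labelled], y_pool[labelled])
        mae_log.append(float(np.mean(np.abs(gp.predict(X_test) - y_test))))
    return mae_log, labelled

# Synthetic stand-in data (assumption), split into pool and held-out test set.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(400, 64)).astype(float)
y = X @ rng.normal(size=64) / 8.0
X_pool, y_pool, X_test, y_test = X[:300], y[:300], X[300:], y[300:]

results = {s: run_al(s, X_pool, y_pool, X_test, y_test)
           for s in ("random", "uncertainty", "greedy")}
for name, (mae_log, _) in results.items():
    print(f"{name:12s} MAE round 1: {mae_log[0]:.3f}  round 20: {mae_log[-1]:.3f}")
```

Using the same seed for every strategy gives all three the same initial labelled set, so differences between the learning curves come from the query rule rather than from the starting point.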
Analyse selection patterns. For uncertainty sampling, plot a histogram of the molecular weight of selected molecules across rounds. Does the strategy initially focus on a particular region of chemical space?
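One way to prepare this comparison is to bin early and late selections on shared bin edges, so that the two histograms are directly comparable with each other and with the pool. In the sketch below the molecular weights and the per-round selections are random stand-ins (assumptions); in the notebook the weights would typically come from RDKit's `Descriptors.MolWt` and the selections from the uncertainty-sampling run.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in values (assumption): one molecular weight per pool molecule and
# the 10 pool indices selected in each of the 20 rounds.
mol_weight = rng.uniform(16.0, 200.0, size=5000)
selected_by_round = [rng.choice(5000, size=10, replace=False) for _ in range(20)]

# Shared bin edges so that all histograms use the same molecular-weight bins.
bins = np.linspace(mol_weight.min(), mol_weight.max(), 11)  # 10 bins

early = np.concatenate(selected_by_round[:5])    # rounds 1 to 5
late = np.concatenate(selected_by_round[-5:])    # rounds 16 to 20

early_counts, _ = np.histogram(mol_weight[early], bins=bins)
late_counts, _ = np.histogram(mol_weight[late], bins=bins)
pool_counts, _ = np.histogram(mol_weight, bins=bins)

# Fractions per bin make the selections comparable with the much larger pool.
print("pool :", np.round(pool_counts / pool_counts.sum(), 2))
print("early:", np.round(early_counts / early_counts.sum(), 2))
print("late :", np.round(late_counts / late_counts.sum(), 2))
```

If the early fractions are concentrated in a few bins relative to the pool, the strategy is initially focusing on that molecular-weight range. The same bin edges can be passed to a plotting library to draw the histograms.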
Questions
Answer these questions as you work through the notebook. Discuss with your neighbour; some have no single right answer.
After 50 initial labelled molecules, what is the GP test MAE? How does this compare to a model trained on all 5000 molecules? (Easy)
After 20 AL rounds, which strategy achieves the lowest MAE: random, uncertainty, or greedy? Is the ordering consistent across all rounds, or does it change? (Medium)
Plot the posterior variance on the pool before and after 10 rounds of uncertainty sampling. How does the spatial distribution of uncertainty change? Are there still high-uncertainty regions? (Medium)
Implement a diversity-aware query strategy (determinantal point process or k-means++ seeding) that selects a batch of 10 molecules that are both uncertain and mutually dissimilar. Compare its learning curve to pure uncertainty sampling. (Challenge)
Resources