Lecture 10 — Active Learning Theory · AI4Chemical Sciences Bootcamp

Recording

Recording will be available after the bootcamp.

August 2026

Learning Objectives

Formalise the pool-based active learning setting and contrast it with stream-based and membership-query paradigms
Implement uncertainty sampling, margin sampling, and entropy-based query strategies
Explain the BALD acquisition function and its connection to mutual information
Design a realistic active learning loop for a DFT screening campaign
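
As a concrete reference for the three query strategies named in the objectives, here is a minimal NumPy sketch that scores pool candidates from predicted class probabilities. The function names, the `select_batch` helper, and the toy `probs` array are illustrative assumptions, not part of the lecture material; any classifier that outputs per-class probabilities could supply `probs`.

```python
import numpy as np

def least_confidence(probs):
    """Uncertainty sampling score: 1 - max class probability (higher = less confident)."""
    return 1.0 - probs.max(axis=1)

def margin_score(probs):
    """Margin sampling score: negative gap between the two most probable classes.
    A small gap means an ambiguous point, so the negated gap is higher for better queries."""
    top2 = np.sort(probs, axis=1)[:, -2:]      # two largest probabilities per row, ascending
    return -(top2[:, 1] - top2[:, 0])

def predictive_entropy(probs, eps=1e-12):
    """Entropy of the predictive distribution (higher = more spread out)."""
    return -(probs * np.log(probs + eps)).sum(axis=1)

def select_batch(scores, k):
    """Indices of the k highest-scoring pool candidates."""
    return np.argsort(scores)[::-1][:k]

# Hypothetical predicted probabilities: 4 pool candidates, 3 classes
probs = np.array([
    [0.98, 0.01, 0.01],
    [0.50, 0.49, 0.01],
    [0.40, 0.30, 0.30],
    [0.70, 0.20, 0.10],
])

for name, fn in [("least confidence", least_confidence),
                 ("margin", margin_score),
                 ("entropy", predictive_entropy)]:
    print(name, select_batch(fn(probs), k=1))
```

On this toy pool the strategies disagree: margin sampling picks candidate 1 (two classes nearly tied), while least confidence and entropy both pick candidate 2 (probability spread over all three classes).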

Key Takeaways

Takeaway 1. Random sampling is a surprisingly strong baseline for active learning in chemistry — always benchmark against it before claiming that a fancy query strategy helps.
Takeaway 2. Uncertainty sampling selects the points the model is least confident about, but high uncertainty near the boundary of the training distribution is not the same as informativeness: the most uncertain points can be outliers or near-duplicates of one another. Diversity-promoting methods (core-set, BADGE) often do better.
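Takeaway 3. BALD (Bayesian Active Learning by Disagreement) selects the points that maximise the mutual information between the model parameters and the predicted label, i.e. points where individual posterior samples are each confident but disagree with one another. It is theoretically principled but computationally expensive for large pools, because every candidate must be scored under many posterior samples.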
Takeaway 4. In chemistry, the cost of labelling (DFT calculation, synthesis, assay) dwarfs the cost of the model. Even a modest reduction in the number of required experiments translates directly to saved resources.
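
To make Takeaway 3 concrete, the mutual information can be written as the entropy of the averaged prediction minus the average entropy of the individual predictions, I(y; θ | x) = H[E_θ p(y | x, θ)] − E_θ H[p(y | x, θ)]. The sketch below estimates this from Monte Carlo samples of the predictive distribution. It assumes the samples are already available as an array `mc_probs` of shape (T, N, C), for example from MC dropout or an ensemble; how they are produced, and the toy numbers, are assumptions made for illustration only.

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Shannon entropy along the last (class) axis."""
    return -(p * np.log(p + eps)).sum(axis=-1)

def bald_scores(mc_probs):
    """Monte Carlo estimate of the BALD score for each pool point.

    mc_probs: array of shape (T, N, C) with T posterior samples,
              N pool points, and C classes.
    """
    mean_probs = mc_probs.mean(axis=0)            # (N, C) averaged prediction
    total = entropy(mean_probs)                   # entropy of the averaged prediction
    expected = entropy(mc_probs).mean(axis=0)     # average entropy of each sample
    return total - expected                       # disagreement between samples

# Toy example (hypothetical numbers): 2 posterior samples, 2 pool points, 2 classes
mc_probs = np.array([
    [[0.5, 0.5], [0.9, 0.1]],   # sample 1
    [[0.5, 0.5], [0.1, 0.9]],   # sample 2
])
scores = bald_scores(mc_probs)
print(scores)
```

Both toy points have the same averaged prediction (50/50), so entropy-based sampling cannot tell them apart. BALD scores the first point at essentially zero, because the samples agree that it is inherently ambiguous, and ranks the second point higher, because the samples are individually confident but contradict each other.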