Tutorial 12 — Uncertainty Quantification in Practice

What you will learn

Build reliability diagrams and compute Expected Calibration Error (ECE) for a trained model
Apply split conformal prediction to produce valid prediction intervals for molecular properties
Train a deep ensemble of 5 models and compare its uncertainty to MC-Dropout and a Gaussian Process
Identify out-of-distribution molecules where model uncertainty should be highest

Instructions

Getting started

Open the Colab notebook using the button above. Run each cell in order; cells marked Exercise require you to fill in code.

Set up the environment. Install scikit-learn, torch, and mapie. Load the ESOL dataset and a pre-trained random forest that supports predict_std, which estimates uncertainty from the spread of the individual trees' predictions.
Build calibration plots. Split the nominal confidence levels into 10 bins. For each level, compute the fraction of true values that fall inside the predicted confidence interval at that level. Plot the observed fractions against the nominal levels as a reliability diagram and compute ECE.
Apply temperature scaling. Fit a scalar temperature parameter T on a held-out calibration set. Re-plot the reliability diagram after temperature scaling. Report the change in ECE.
Conformal prediction. Use MAPIE's MapieRegressor with a ridge regression base model. Set the target coverage to 90%. Report the empirical coverage and mean interval width on the test set.
Deep ensemble. Train 5 independently initialised MLPs on the training set. Compute ensemble mean and variance on the test set. Compare interval width and coverage to conformal prediction.
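
The aggregation in the deep ensemble step can be sketched as follows. This is a minimal illustration, not the notebook's code: `member_preds` is a hypothetical array holding the test-set predictions of the 5 MLPs (simulated here with random numbers), and the intervals assume a Gaussian predictive distribution built from the ensemble mean and the spread between members.

```python
import numpy as np
from statistics import NormalDist


def ensemble_intervals(member_preds, confidence=0.90):
    """Return ensemble mean, std, and Gaussian interval bounds.

    member_preds has shape (n_members, n_samples).
    """
    mean = member_preds.mean(axis=0)
    std = member_preds.std(axis=0, ddof=1)  # spread between ensemble members
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.645 for 90%
    return mean, std, mean - z * std, mean + z * std


# Hypothetical stand-in for real model outputs: 5 members, 200 test molecules.
rng = np.random.default_rng(0)
y_test = rng.normal(size=200)
member_preds = y_test + rng.normal(scale=0.5, size=(5, 200))

ens_mean, ens_std, lower, upper = ensemble_intervals(member_preds)
coverage = float(np.mean((y_test >= lower) & (y_test <= upper)))
mean_width = float(np.mean(upper - lower))
print(f"empirical coverage: {coverage:.3f}, mean width: {mean_width:.3f}")
```

The same two numbers (empirical coverage and mean interval width) are what you report for conformal prediction, so the two methods can be compared directly. Note that the variance between members only reflects model disagreement, so these intervals carry no coverage guarantee.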

Questions

Answer these questions as you work through the notebook. Discuss with your neighbour — some have no single right answer.

Warm-up (Easy)

Before calibration, is your model overconfident or underconfident? How can you tell from the reliability diagram?
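
One way to compute the numbers behind a reliability diagram for a regression model is sketched below. It assumes Gaussian predictive distributions with mean `mu` and standard deviation `sigma`, and uses the mean absolute gap between observed and nominal coverage as a regression analogue of ECE. The function name and the synthetic data are illustrative assumptions, not part of the notebook.

```python
import numpy as np
from statistics import NormalDist


def regression_reliability(y_true, mu, sigma, n_bins=10):
    """Observed vs nominal coverage of central Gaussian intervals."""
    levels = np.linspace(0.05, 0.95, n_bins)  # nominal confidence levels
    abs_z = np.abs(y_true - mu) / sigma
    observed = np.array([
        # Fraction of targets inside the central interval at level p.
        np.mean(abs_z <= NormalDist().inv_cdf(0.5 + float(p) / 2))
        for p in levels
    ])
    ece = float(np.mean(np.abs(observed - levels)))
    return levels, observed, ece


# Synthetic example: true noise std is 1.0 but the model claims 0.5.
rng = np.random.default_rng(0)
mu = rng.normal(size=2000)
y_true = mu + rng.normal(scale=1.0, size=2000)
sigma = np.full(2000, 0.5)

levels, observed, ece = regression_reliability(y_true, mu, sigma)
for p, o in zip(levels, observed):
    print(f"nominal {p:.2f}  observed {o:.2f}")
print(f"ECE: {ece:.3f}")
```

In this synthetic case the predicted standard deviations are too small, so the observed coverage falls below the nominal level at every bin and the curve lies under the diagonal. Plotting `observed` against `levels` gives the reliability diagram.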

Core (Medium)

After temperature scaling, how much does ECE decrease? Is the model now well-calibrated across all probability bins, or only in certain regions?
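
For a regression model, one plausible reading of temperature scaling is a single scalar T that multiplies every predicted standard deviation. The sketch below makes that assumption and fits T by minimising the Gaussian negative log-likelihood on the calibration set, which has the closed form T = sqrt(mean(z^2)) with z = (y - mu) / sigma. The notebook may fit T differently, for example by gradient descent.

```python
import numpy as np


def gaussian_nll(y, mu, sigma):
    """Mean Gaussian negative log-likelihood."""
    return float(np.mean(0.5 * np.log(2 * np.pi * sigma**2)
                         + (y - mu) ** 2 / (2 * sigma**2)))


def fit_temperature(y_cal, mu_cal, sigma_cal):
    """Scalar T that minimises the Gaussian NLL of N(mu, (T * sigma)^2)."""
    z = (y_cal - mu_cal) / sigma_cal
    return float(np.sqrt(np.mean(z**2)))


# Synthetic calibration set: true noise std is 1.0, predicted std is 0.5.
rng = np.random.default_rng(0)
mu_cal = rng.normal(size=1000)
y_cal = mu_cal + rng.normal(scale=1.0, size=1000)
sigma_cal = np.full(1000, 0.5)

T = fit_temperature(y_cal, mu_cal, sigma_cal)
nll_before = gaussian_nll(y_cal, mu_cal, sigma_cal)
nll_after = gaussian_nll(y_cal, mu_cal, T * sigma_cal)
print(f"T = {T:.2f}, NLL before = {nll_before:.3f}, after = {nll_after:.3f}")
```

Because T is a single global factor, it can correct the overall scale of the uncertainties but cannot change their ranking across molecules, which is worth keeping in mind when you inspect individual bins.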

Core (Medium)

For conformal prediction at 90% target coverage, what is the empirical coverage on the test set? Is it at least 90%? What happens to interval width if you raise the target to 95%?
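
To see what split conformal prediction does under the hood, the sketch below computes the interval half-width by hand from calibration residuals, using the usual finite-sample quantile index ceil((n + 1)(1 - alpha)). It is a hand-rolled illustration on synthetic data, not MAPIE's implementation; `conformal_halfwidth` and the simulated predictions are assumptions made for the example.

```python
import numpy as np


def conformal_halfwidth(residuals_cal, alpha):
    """Split conformal half-width from absolute calibration residuals."""
    scores = np.sort(np.abs(residuals_cal))
    n = len(scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))  # finite-sample corrected rank
    if k > n:
        return float("inf")  # too few calibration points for this alpha
    return float(scores[k - 1])


# Synthetic stand-in for a fitted base model's predictions.
rng = np.random.default_rng(0)
pred_cal = rng.normal(size=2000)
y_cal = pred_cal + rng.normal(scale=0.7, size=2000)
pred_test = rng.normal(size=5000)
y_test = pred_test + rng.normal(scale=0.7, size=5000)

q90 = conformal_halfwidth(y_cal - pred_cal, alpha=0.10)
q95 = conformal_halfwidth(y_cal - pred_cal, alpha=0.05)
cov90 = float(np.mean(np.abs(y_test - pred_test) <= q90))
cov95 = float(np.mean(np.abs(y_test - pred_test) <= q95))
print(f"90% target: coverage {cov90:.3f}, width {2 * q90:.3f}")
print(f"95% target: coverage {cov95:.3f}, width {2 * q95:.3f}")
```

The coverage guarantee is marginal, meaning it holds on average over calibration and test draws, so the empirical coverage on one finite test set can land slightly above or below the target.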

Challenge

Identify the 10 test molecules with highest ensemble variance. Are they structurally different from the training set (measure Tanimoto distance to nearest training neighbour)? What does this tell you about the applicability domain?
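
The nearest-neighbour Tanimoto distance can be sketched with plain NumPy on binary fingerprint arrays. The toy 4-bit fingerprints and variance values below are made up for illustration; in the notebook you would presumably use real fingerprints (for example Morgan fingerprints from RDKit) and the ensemble variance from your own models.

```python
import numpy as np


def tanimoto_distance_to_nearest(test_fps, train_fps):
    """1 - max Tanimoto similarity of each test row to any training row."""
    test_fps = np.asarray(test_fps).astype(int)
    train_fps = np.asarray(train_fps).astype(int)
    inter = test_fps @ train_fps.T  # shared on-bits, shape (n_test, n_train)
    union = (test_fps.sum(axis=1)[:, None]
             + train_fps.sum(axis=1)[None, :] - inter)
    sim = np.divide(inter, union,
                    out=np.zeros(inter.shape, dtype=float), where=union > 0)
    return 1.0 - sim.max(axis=1)


# Toy fingerprints (hypothetical): 2 training and 3 test molecules.
train_fps = np.array([[1, 1, 0, 0],
                      [0, 0, 1, 1]])
test_fps = np.array([[1, 1, 0, 0],
                     [1, 0, 1, 0],
                     [0, 0, 0, 1]])
ens_var = np.array([0.1, 0.9, 0.4])  # made-up ensemble variances

dist = tanimoto_distance_to_nearest(test_fps, train_fps)
top_k = np.argsort(ens_var)[::-1][:2]  # indices of the highest-variance molecules
for i in top_k:
    print(f"molecule {i}: variance {ens_var[i]:.2f}, NN distance {dist[i]:.2f}")
```

Replacing `[:2]` with `[:10]` gives the ten molecules the question asks about. Comparing their distances with the distribution over the whole test set shows whether high variance coincides with being far from the training data.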

Resources

Notebook (Colab)
GitHub repo
Paired lecture notes
