Tutorial 12

Uncertainty Quantification in Practice

Build calibration curves, apply conformal prediction, and compare ensemble variance on molecular property prediction datasets.

August 18, 2026 · 14:45 – 17:00
105 min
Python · Google Colab

Open in Google Colab

The notebook has most of the code pre-filled. Complete the exercises marked ### YOUR CODE HERE ###.

Open Notebook

Getting started

Open the Colab notebook using the button above. Run each cell in order; cells marked Exercise require you to fill in code.

  1. Set up the environment. Install scikit-learn, torch, and mapie. Load the ESOL dataset and a pre-trained random forest that supports predict_std (computed from the variance across the forest's trees).

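  2. Build calibration plots. Bin the predicted probabilities (nominal confidence levels) into 10 bins. For each bin, compute the fraction of true values that fall within the corresponding predicted confidence interval. Plot a reliability diagram (observed against nominal coverage) and compute the expected calibration error (ECE).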

  3. Apply temperature scaling. Fit a scalar temperature parameter T on a held-out calibration set. Re-plot the reliability diagram after temperature scaling. Report the change in ECE.

  4. Conformal prediction. Use MAPIE's MapieRegressor with a ridge regression base model. Set the target coverage to 90%. Report the empirical coverage and mean interval width on the test set.

  5. Deep ensemble. Train 5 independently initialised MLPs on the training set. Compute ensemble mean and variance on the test set. Compare interval width and coverage to conformal prediction.
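
The reliability diagram and ECE from step 2 can be sketched as follows. This is a minimal illustration, not the notebook's code: it assumes each prediction comes with a mean and a standard deviation, that intervals are central Gaussian intervals, and that ECE is taken as the mean absolute gap between nominal and observed coverage. The function names and the synthetic data are hypothetical, and only the standard library is used so the sketch runs outside Colab.

```python
import random
from statistics import NormalDist


def observed_coverage(y_true, mu, sigma, level):
    """Fraction of targets inside the central `level` Gaussian interval."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)  # half-width in units of sigma
    hits = sum(abs(y - m) <= z * s for y, m, s in zip(y_true, mu, sigma))
    return hits / len(y_true)


def reliability_curve(y_true, mu, sigma, n_levels=10):
    """Nominal vs. observed coverage at n_levels evenly spaced levels."""
    nominal = [(i + 0.5) / n_levels for i in range(n_levels)]  # 0.05 ... 0.95
    observed = [observed_coverage(y_true, mu, sigma, p) for p in nominal]
    return nominal, observed


def calibration_error(nominal, observed):
    """Mean absolute gap between nominal and observed coverage."""
    return sum(abs(p - o) for p, o in zip(nominal, observed)) / len(nominal)


# Synthetic stand-in for model output (assumption): the true noise std is 1.0
# but the "model" reports sigma = 0.5, so it is overconfident by construction.
rng = random.Random(0)
mu = [rng.uniform(-3.0, 3.0) for _ in range(2000)]
y_true = [m + rng.gauss(0.0, 1.0) for m in mu]
sigma = [0.5] * len(mu)

nominal, observed = reliability_curve(y_true, mu, sigma)
ece = calibration_error(nominal, observed)
for p, o in zip(nominal, observed):
    print(f"nominal {p:.2f}  observed {o:.2f}")
print(f"calibration error: {ece:.3f}")
```

For an overconfident model the observed coverage sits below the nominal level in every bin, which is the pattern to look for in the plotted diagram.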

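For step 3, one possible reading of temperature scaling in a regression setting is to multiply every predicted standard deviation by a single scalar T fitted on the calibration set. The sketch below assumes a Gaussian negative log-likelihood objective, for which the minimiser has a closed form (T squared equals the mean squared standardised residual). The notebook may fit T differently; all names and the synthetic data here are hypothetical.

```python
import math
import random
from statistics import NormalDist


def fit_temperature(y_cal, mu_cal, sigma_cal):
    """Closed-form T minimising Gaussian NLL when sigma becomes T * sigma."""
    z2 = [((y - m) / s) ** 2 for y, m, s in zip(y_cal, mu_cal, sigma_cal)]
    return math.sqrt(sum(z2) / len(z2))


def coverage(y_true, mu, sigma, level=0.9):
    """Fraction of targets inside the central `level` Gaussian interval."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    hits = sum(abs(y - m) <= z * s for y, m, s in zip(y_true, mu, sigma))
    return hits / len(y_true)


rng = random.Random(0)


def make_split(n):
    """Synthetic data (assumption): true noise std 1.0, reported sigma 0.5."""
    mu = [rng.uniform(-3.0, 3.0) for _ in range(n)]
    y = [m + rng.gauss(0.0, 1.0) for m in mu]
    return y, mu, [0.5] * n


y_cal, mu_cal, sigma_cal = make_split(1000)
y_test, mu_test, sigma_test = make_split(1000)

# Fit T on the calibration split only, then apply it to the test split.
T = fit_temperature(y_cal, mu_cal, sigma_cal)
sigma_scaled = [T * s for s in sigma_test]

before = coverage(y_test, mu_test, sigma_test)
after = coverage(y_test, mu_test, sigma_scaled)
print(f"T = {T:.2f}, 90% coverage before = {before:.3f}, after = {after:.3f}")
```

Because T is a single scalar, it can only widen or narrow all intervals together; it cannot repair miscalibration that differs from one region of the data to another.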

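The idea behind step 4 can be illustrated with a hand-rolled split conformal procedure. This is a sketch of the general technique, not MAPIE's API or internals: a one-feature least-squares line stands in for the ridge model, the data are synthetic, and every function name is hypothetical. The interval half-width is the finite-sample-corrected quantile of the absolute residuals on a held-out calibration set.

```python
import math
import random


def fit_line(x, y):
    """Ordinary least squares for one feature; stands in for the ridge model."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def conformal_halfwidth(abs_residuals, target_coverage):
    """Finite-sample-corrected quantile of the calibration residuals."""
    n = len(abs_residuals)
    k = math.ceil((n + 1) * target_coverage)  # rank of the conformal quantile
    if k > n:
        return math.inf  # too few calibration points for this coverage
    return sorted(abs_residuals)[k - 1]


rng = random.Random(0)


def sample(n):
    """Synthetic data (assumption): y = 2x + 1 plus unit Gaussian noise."""
    x = [rng.uniform(0.0, 10.0) for _ in range(n)]
    y = [2.0 * a + 1.0 + rng.gauss(0.0, 1.0) for a in x]
    return x, y


x_train, y_train = sample(500)
x_cal, y_cal = sample(500)
x_test, y_test = sample(1000)

slope, intercept = fit_line(x_train, y_train)


def predict(a):
    return slope * a + intercept


cal_residuals = [abs(b - predict(a)) for a, b in zip(x_cal, y_cal)]

results = {}
for target in (0.90, 0.95):
    q = conformal_halfwidth(cal_residuals, target)
    covered = sum(abs(b - predict(a)) <= q for a, b in zip(x_test, y_test))
    results[target] = (covered / len(y_test), 2.0 * q)  # (coverage, width)
    print(f"target {target:.2f}: coverage {results[target][0]:.3f}, "
          f"width {results[target][1]:.2f}")
```

The coverage guarantee is marginal and holds on average over calibration draws, so the empirical coverage on a single test set can land slightly below the target.
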
Answer these questions as you work through the notebook. Discuss with your neighbour — some have no single right answer.

    Warm-up · Easy

    Before calibration, is your model overconfident or underconfident? How can you tell from the reliability diagram?

    Core · Medium

    After temperature scaling, how much does ECE decrease? Is the model now well-calibrated across all probability bins, or only in certain regions?

    Core · Medium

    For conformal prediction at 90% target coverage, what is the empirical coverage on the test set? Is it at least 90%? What happens to interval width if you raise the target to 95%?

    Challenge · Challenge

    Identify the 10 test molecules with highest ensemble variance. Are they structurally different from the training set (measure Tanimoto distance to nearest training neighbour)? What does this tell you about the applicability domain?
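
For the last question, a minimal sketch of the two ingredients is given below: ranking molecules by ensemble variance and measuring the Tanimoto distance to the nearest training neighbour. It assumes fingerprints are available as Python sets of on-bit indices (for example, converted from Morgan fingerprints produced by a cheminformatics toolkit) and that ensemble predictions are stored one list per member. The toy values and all function names are hypothetical.

```python
from statistics import fmean, pvariance


def tanimoto_distance(a, b):
    """1 - |a & b| / |a | b| for two sets of on-bit indices."""
    union = len(a | b)
    if union == 0:
        return 0.0  # two empty fingerprints are treated as identical
    return 1.0 - len(a & b) / union


def nearest_train_distance(fp, train_fps):
    """Tanimoto distance from one fingerprint to its closest training one."""
    return min(tanimoto_distance(fp, t) for t in train_fps)


def ensemble_stats(member_preds):
    """member_preds[m][i] is member m's prediction for molecule i."""
    per_molecule = list(zip(*member_preds))
    means = [fmean(p) for p in per_molecule]
    variances = [pvariance(p) for p in per_molecule]  # population variance
    return means, variances


def top_k_uncertain(variances, k):
    """Indices of the k molecules with the largest ensemble variance."""
    order = sorted(range(len(variances)), key=lambda i: variances[i],
                   reverse=True)
    return order[:k]


# Toy data (assumption): 3 ensemble members, 4 test molecules.
member_preds = [
    [1.0, 2.0, 0.5, -1.0],
    [1.1, 2.1, 1.5, -1.0],
    [0.9, 1.9, -0.5, -1.1],
]
train_fps = [{1, 2, 3, 4}, {2, 3, 5}]
test_fps = [{1, 2, 3, 4}, {2, 3, 5, 6}, {7, 8, 9}, {1, 2, 3}]

means, variances = ensemble_stats(member_preds)
top = top_k_uncertain(variances, k=1)
for i in top:
    d = nearest_train_distance(test_fps[i], train_fps)
    print(f"molecule {i}: variance {variances[i]:.3f}, nearest distance {d:.2f}")
```

In this toy example the molecule with the largest variance shares no bits with any training fingerprint. On the real data, comparing the distances of the top 10 against those of the remaining test molecules shows whether high variance tracks structural novelty.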

Notebook (Colab) · GitHub repo · Paired lecture notes