Lecture 3 — Supervised Learning: Regression & Classification

Recording

Recording will be available after the bootcamp.

August 2026

Learning Objectives

Explain the bias-variance trade-off and use it to diagnose underfitting and overfitting
Implement cross-validation correctly for chemical datasets, avoiding scaffold leakage
Apply random forests and gradient boosting to regression and classification tasks
Tune hyperparameters with grid search or random search and report generalisation performance honestly
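
As a rough illustration of the second objective, the sketch below builds cross-validation folds in which no scaffold appears on both the training and the test side. It assumes a scaffold label has already been computed for each molecule (for example a Bemis-Murcko scaffold SMILES from RDKit). The scaffold names, the function name scaffold_kfold and the greedy balancing rule are illustrative assumptions, not the method used in the lecture.

```python
from collections import defaultdict


def scaffold_kfold(scaffolds, n_splits=3):
    """Yield (train_idx, test_idx) pairs in which no scaffold spans both sides."""
    # Collect molecule indices per scaffold label.
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)
    if len(groups) < n_splits:
        raise ValueError("need at least as many distinct scaffolds as folds")

    folds = [[] for _ in range(n_splits)]
    # Greedy balancing: the largest scaffold group goes to the currently smallest fold.
    for members in sorted(groups.values(), key=len, reverse=True):
        min(folds, key=len).extend(members)

    for k in range(n_splits):
        test_idx = sorted(folds[k])
        train_idx = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
        yield train_idx, test_idx


# Hypothetical scaffold labels for eight molecules (placeholders, not real data).
scaffolds = ["benzene", "benzene", "pyridine", "indole",
             "indole", "indole", "furan", "pyridine"]

splits = list(scaffold_kfold(scaffolds, n_splits=3))
for train_idx, test_idx in splits:
    train_scaffolds = {scaffolds[i] for i in train_idx}
    test_scaffolds = {scaffolds[i] for i in test_idx}
    # Leakage check: the two sides must not share any scaffold.
    assert train_scaffolds.isdisjoint(test_scaffolds)
    print(len(train_idx), "train /", len(test_idx), "test")
```

With scikit-learn available, GroupKFold with the scaffold labels passed as groups should give comparable behaviour. The pure-Python version is shown only to make the grouping logic explicit.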

Key Takeaways

Takeaway 1. Random train/test splits overestimate real-world performance on chemical data, because close analogues sharing a scaffold end up on both sides of the split. Use scaffold or time-based splits to measure how well a model extrapolates to new chemistry.
Takeaway 2. More features are not always better. Feature selection and regularisation are essential when the number of descriptors (thousands of fingerprint bits) exceeds the number of training molecules.
Takeaway 3. Gradient boosting (XGBoost, LightGBM) often outperforms random forests on tabular chemical data when the dataset is large enough and the model is carefully tuned, but random forests tend to be more robust on very small datasets.
Takeaway 4. Hyperparameter optimisation on the test set is data leakage. Always tune on a held-out validation set or in an inner cross-validation loop, and report the final metric once on an untouched test set.
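
The tuning discipline in Takeaway 4 can be sketched with a toy example: candidate hyperparameters are compared on a validation split only, and the test split is scored once at the very end. Everything here is an assumption made for illustration: the data are synthetic (a noisy quadratic), the model is a 1-D k-nearest-neighbour regressor written in plain Python, and the split sizes and candidate values of k are arbitrary.

```python
import random


def knn_predict(train, x, k):
    """Average the targets of the k training points nearest to x (1-D inputs)."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)


def mse(train, data, k):
    """Mean squared error of a k-NN model 'fitted' on train, evaluated on data."""
    return sum((knn_predict(train, x, k) - y) ** 2 for x, y in data) / len(data)


random.seed(0)
# Synthetic data: y = x^2 plus Gaussian noise (not a chemical dataset).
xs = [random.uniform(-3, 3) for _ in range(120)]
points = [(x, x * x + random.gauss(0, 0.5)) for x in xs]

# Three disjoint splits: train for fitting, val for tuning, test for the final report.
train, val, test = points[:72], points[72:96], points[96:]

# Tune k using the validation split only; the test split is never consulted here.
candidates = [1, 3, 5, 9, 15]
val_scores = {k: mse(train, val, k) for k in candidates}
best_k = min(val_scores, key=val_scores.get)

# Refit on train + val with the chosen k, then score the test split exactly once.
test_mse = mse(train + val, test, best_k)
print("best k:", best_k, "test MSE:", round(test_mse, 3))
```

On small datasets a single validation split is noisy, so the same pattern is usually run as an inner cross-validation loop inside each outer fold. In either case the rule is the same: no selection decision may depend on the test split.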