Core supervised ML methods — bias-variance trade-off, model selection, and hyperparameter optimisation — applied to chemical property prediction.
Recording
Recording will be available after the bootcamp.
August 2026
Learning Objectives
Key Takeaways
Takeaway 1. Random train/test splits overestimate real-world performance on chemical data — always use scaffold or time-based splits to measure how well a model extrapolates.
Takeaway 2. More features are not always better. Feature selection and regularisation are essential when the number of descriptors (thousands of fingerprint bits) exceeds the number of training molecules.
Takeaway 3. Gradient boosting (XGBoost, LightGBM) often outperforms random forests on tabular chemical data when the dataset is large enough, but random forests tend to be more robust on very small datasets.
Takeaway 4. Tuning hyperparameters on the test set is a form of data leakage. Always tune on a held-out validation set or in an inner cross-validation loop, and report the final metric on a test set that played no part in tuning.
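The splitting advice in Takeaways 1 and 4 can be sketched in a few lines of plain Python. This is a minimal illustration, not the bootcamp's reference implementation: the function name `scaffold_split`, its parameters, and the example scaffold keys are all hypothetical, and the scaffold key for each molecule (for example a Murcko scaffold SMILES produced by a cheminformatics toolkit) is assumed to have been computed beforehand. The sketch groups molecules by scaffold and assigns whole groups to train, validation, or test, so that no scaffold is shared between subsets.

```python
from collections import defaultdict


def scaffold_split(scaffolds, frac_train=0.8, frac_valid=0.1):
    """Split molecule indices into train/valid/test by scaffold group.

    `scaffolds` holds one scaffold key per molecule (assumed precomputed).
    Whole groups are assigned to a subset, so no scaffold appears in two
    subsets. The fractions are targets and may not be met exactly.
    """
    # Group molecule indices by their scaffold key.
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)

    # Largest groups first; ties broken by first index for determinism.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))

    n = len(scaffolds)
    train_cutoff = frac_train * n
    valid_cutoff = (frac_train + frac_valid) * n

    train, valid, test = [], [], []
    for group in ordered:
        if len(train) + len(group) <= train_cutoff:
            train.extend(group)
        elif len(train) + len(valid) + len(group) <= valid_cutoff:
            valid.extend(group)
        else:
            # Small, rare scaffolds end up here, which makes the test set
            # a harder, more extrapolative benchmark than a random split.
            test.extend(group)
    return train, valid, test


# Hypothetical scaffold keys for ten molecules (illustrative only).
scaffolds = (
    ["c1ccccc1"] * 5      # benzene-like scaffold
    + ["c1ccncc1"] * 2    # pyridine-like scaffold
    + ["C1CCCCC1", "c1ccoc1", "c1ccsc1"]
)
train_idx, valid_idx, test_idx = scaffold_split(scaffolds)
print("train:", train_idx)
print("valid:", valid_idx)
print("test: ", test_idx)
```

With a split like this, hyperparameters would be chosen using only the training and validation indices, and the test indices would be scored once, at the end. Filling the training set with the largest scaffold groups is one common convention; a time-based split would follow the same pattern, with molecules ordered by date instead of grouped by scaffold.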
Further Reading & Resources