Tutorial 3 — Model Training & Evaluation · AI4Chemical Sciences Bootcamp

What you will learn

Split chemical datasets using random, scaffold, and time-based strategies with DeepChem or RDKit
Implement nested cross-validation to tune hyperparameters without touching the test set
Plot learning curves and validation curves to diagnose underfitting and overfitting
Detect and fix a data leakage bug in a preprocessing pipeline

Instructions

Getting started

Open the Colab notebook linked under Resources. Run the cells in order; cells marked Exercise require you to fill in code.

Set up the environment. Install deepchem, scikit-learn, and rdkit. Confirm the dataset loads and SMILES parse cleanly.
Explore the dataset. Load the BBBP (blood–brain barrier permeability) dataset. Plot the class balance and identify any duplicated SMILES strings.
Implement splitting strategies. Complete the three splitting cells: random split, scaffold split (Bemis–Murcko), and temporal split (by year of publication). Compare the resulting class distributions in each split.
Tune with nested cross-validation. Fill in the inner-loop hyperparameter grid for a RandomForestClassifier (n_estimators, max_depth, min_samples_leaf). Report AUC-ROC on the outer test fold only.
Diagnose your model. Plot a learning curve (training size vs. validation AUC) and a validation curve (max_depth vs. AUC). State whether your model is in a high-bias (underfitting) or high-variance (overfitting) regime.
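
The nested cross-validation step can be sketched as follows. This is a minimal illustration, not the notebook's solution: the data come from make_classification as a stand-in for featurized BBBP molecules, and the grid values and fold counts are assumptions chosen only to keep the run short. The inner loop (GridSearchCV) picks hyperparameters; the outer loop scores the tuned model on folds the search never saw.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

# Synthetic stand-in for featurized molecules (assumption, not BBBP itself).
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           weights=[0.25, 0.75], random_state=0)

# Hypothetical grid; the values are illustrative only.
param_grid = {
    "n_estimators": [25, 50],
    "max_depth": [3, None],
    "min_samples_leaf": [1, 5],
}

# Inner folds tune hyperparameters; outer folds estimate generalization.
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid,
                      scoring="roc_auc", cv=inner_cv)

# Each outer training fold runs its own grid search, so the outer test
# fold is never used to choose hyperparameters.
outer_scores = cross_val_score(search, X, y, scoring="roc_auc", cv=outer_cv)
print(f"outer-fold AUC-ROC: {outer_scores.mean():.3f} +/- {outer_scores.std():.3f}")
```

Passing the GridSearchCV object itself to cross_val_score is what makes the procedure nested: the reported AUC-ROC comes from the outer folds only.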

Questions

Answer these questions as you work through the notebook. Discuss them with your neighbour; some have no single right answer.

Warm-up

What fraction of molecules are BBB-positive? Would you expect this class imbalance to affect AUC-ROC? What about accuracy?

Easy
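
To see why the two metrics can react differently to imbalance, consider a sketch with made-up labels. The 75% positive rate below is a hypothetical value, not the BBBP class ratio, and the "model" is a deliberately useless one that gives every molecule the same score.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Hypothetical labels: 75 positives, 25 negatives (not the real BBBP ratio).
y_true = np.array([1] * 75 + [0] * 25)

# A "model" that ignores its input and always predicts the majority class.
y_pred = np.ones_like(y_true)
y_score = np.full(y_true.shape, 0.9)  # identical score for every molecule

acc = accuracy_score(y_true, y_pred)   # fraction of correct hard labels
auc = roc_auc_score(y_true, y_score)   # ranking quality of the scores
print(f"accuracy = {acc:.2f}, AUC-ROC = {auc:.2f}")
```

This prints accuracy = 0.75, AUC-ROC = 0.50: accuracy rewards the constant predictor for matching the majority class, while AUC-ROC stays at chance because the scores carry no ranking information.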

Core

Compare test AUC under random vs. scaffold splitting. Which is lower? By how much? What does this gap imply for model deployment?

Medium
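
The mechanism behind this comparison is that a scaffold split keeps every molecule sharing a scaffold on the same side of the split. A minimal pure-Python sketch follows, assuming the scaffold keys have already been computed (in the notebook they would come from RDKit or DeepChem; here they are hard-coded, hypothetical strings). The helper name scaffold_split and the 80/20 ratio are illustrative choices.

```python
from collections import defaultdict

def scaffold_split(scaffolds, frac_train=0.8):
    """Assign whole scaffold groups to train or test; return index lists."""
    groups = defaultdict(list)
    for idx, scaf in enumerate(scaffolds):
        groups[scaf].append(idx)
    # Largest groups first, so rare scaffolds tend to end up in the test set.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))
    n_train_target = frac_train * len(scaffolds)
    train, test = [], []
    for group in ordered:
        # A group is never divided: it goes to train if it fits, else to test.
        if len(train) + len(group) <= n_train_target:
            train.extend(group)
        else:
            test.extend(group)
    return train, test

# Hypothetical scaffold keys for ten molecules (e.g. Bemis-Murcko scaffold SMILES).
scaffolds = ["c1ccccc1"] * 5 + ["c1ccncc1"] * 3 + ["C1CCCCC1", "c1ccoc1"]
train_idx, test_idx = scaffold_split(scaffolds)

train_scaffolds = {scaffolds[i] for i in train_idx}
test_scaffolds = {scaffolds[i] for i in test_idx}
print(train_idx, test_idx)
```

Because the test molecules have scaffolds the model never trained on, their scores reflect performance on unfamiliar chemotypes, which a random split does not guarantee.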

Core

Identify the data leakage bug introduced in the notebook (hint: look at where StandardScaler is fit). Fix it and report how test AUC changes.

Medium
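
One common way to avoid this kind of leakage is to put the scaler inside a scikit-learn Pipeline so that it is fit on training rows only. The sketch below uses synthetic data and a LogisticRegression classifier as stand-ins; both are assumptions, and the notebook's actual bug and model may differ.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption, not the notebook's dataset).
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Leaky pattern (avoid): StandardScaler().fit(X) before splitting, which
# lets test-set means and variances influence the training features.

# Leak-free pattern: the scaler lives in the pipeline and is fit on
# X_train only; X_test is transformed with the training statistics.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

scaler = model.named_steps["standardscaler"]
test_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"test AUC-ROC: {test_auc:.3f}")
```

The same pipeline object can be passed to cross-validation utilities, in which case the scaler is refit on each training fold automatically.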

Challenge

Implement a stratified scaffold split that preserves class balance within each fold. How does the AUC distribution across folds compare to an unstratified scaffold split?

Challenge

Resources

Notebook (Colab) · GitHub repo · Paired lecture notes
