Tutorial 3

Model Training & Evaluation

Diagnose overfitting, prevent data leakage, and select hyperparameters rigorously for chemical property prediction.

August 11, 2026 · 10:15 – 12:00
105 min
Python · Google Colab

Open in Google Colab

The notebook has most of the code pre-filled. Complete the exercises marked ### YOUR CODE HERE ###.

Open Notebook

Getting started

Open the Colab notebook using the button above. Run each cell in order; cells marked Exercise require you to fill in code.

  1. Set up the environment. Install deepchem, scikit-learn, and rdkit. Confirm that the dataset loads and that every SMILES string parses without errors.

  2. Explore the dataset. Load the BBBP (blood–brain barrier permeability) dataset. Plot the class balance and identify any duplicated SMILES strings.

  3. Implement splitting strategies. Complete the three splitting cells: random split, scaffold split (Bemis–Murcko), and temporal split (by year of publication). Compare the resulting class distributions in each split.

  4. Tune with nested cross-validation. Fill in the inner-loop hyperparameter grid for a RandomForestClassifier (n_estimators, max_depth, min_samples_leaf). Report AUC-ROC on the outer test folds only; the inner loop, which selects the hyperparameters, must never see the outer test data.

  5. Diagnose your model. Plot a learning curve (training size vs. validation AUC) and a validation curve (max_depth vs. AUC). Identify the bias-variance regime your model is in.
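The nested cross-validation in step 4 can be sketched as follows. This is a minimal illustration, not the notebook's solution: it runs on synthetic data from make_classification rather than BBBP features, and the grid values and fold counts are assumptions chosen to keep it fast.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

# Synthetic stand-in for featurized molecules (assumption, not BBBP).
X, y = make_classification(
    n_samples=300, n_features=20, weights=[0.3, 0.7], random_state=0
)

# Illustrative grid only; the notebook's values may differ.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [3, None],
    "min_samples_leaf": [1, 5],
}

# Inner loop: picks hyperparameters using only the outer-training data.
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
# Outer loop: estimates how well the whole tuning procedure generalizes.
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid=param_grid,
    scoring="roc_auc",
    cv=inner_cv,
)

# Each outer fold refits the grid search from scratch on its training part.
outer_auc = cross_val_score(search, X, y, scoring="roc_auc", cv=outer_cv)
print(f"outer-fold AUC-ROC: {outer_auc.mean():.3f} +/- {outer_auc.std():.3f}")
```

Passing the GridSearchCV object itself to cross_val_score is what keeps tuning inside each outer training fold, so the reported score is never computed on data that influenced the chosen hyperparameters.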


Answer these questions as you work through the notebook. Discuss with your neighbour — some have no single right answer.

    Warm-up · Easy

    What fraction of molecules are BBB-positive? Would you expect this class imbalance to affect AUC-ROC? What about accuracy?

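One way to build intuition for the warm-up question is to score a model that ignores its input. The sketch below uses made-up labels with 80% positives (an assumption for illustration, not the BBBP class ratio) and a constant prediction.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Illustrative labels: 80% positive. This is NOT the BBBP class ratio.
y_true = np.array([1] * 80 + [0] * 20)

# A "model" that ignores its input and always outputs the same score.
constant_scores = np.full(len(y_true), 0.9)
constant_labels = (constant_scores >= 0.5).astype(int)

# Accuracy rewards guessing the majority class; AUC-ROC does not,
# because a constant score cannot rank positives above negatives.
baseline_accuracy = accuracy_score(y_true, constant_labels)
baseline_auc = roc_auc_score(y_true, constant_scores)
print(f"accuracy={baseline_accuracy:.2f}  AUC-ROC={baseline_auc:.2f}")
```

Here the constant predictor reaches 0.80 accuracy but only 0.50 AUC-ROC, which is a useful reference point when reading the metrics in the notebook.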
    Core · Medium

    Compare test AUC under random vs. scaffold splitting. Which is lower? By how much? What does this gap imply for model deployment?

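The idea behind a scaffold split can be sketched without any chemistry library: molecules that share a scaffold key are kept together, so no scaffold appears in both train and test. The function name scaffold_split, the largest-group-first ordering, and the scaffold strings below are illustrative assumptions; in the notebook the keys would come from a Bemis–Murcko scaffold routine such as RDKit's MurckoScaffold.MurckoScaffoldSmiles.

```python
from collections import defaultdict


def scaffold_split(scaffolds, frac_train=0.8):
    """Assign whole scaffold groups to train or test; return index lists."""
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)

    # Largest groups first (ties broken by first index), so that rare
    # scaffolds tend to land in the test set.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))

    n_train_target = frac_train * len(scaffolds)
    train_idx, test_idx = [], []
    for group in ordered:
        if len(train_idx) + len(group) <= n_train_target:
            train_idx.extend(group)
        else:
            test_idx.extend(group)
    return train_idx, test_idx


# Hypothetical precomputed scaffold keys, one per molecule.
scaffolds = ["c1ccccc1"] * 5 + ["c1ccncc1"] * 3 + ["C1CCCCC1"] + ["c1ccoc1"]
train_idx, test_idx = scaffold_split(scaffolds, frac_train=0.8)
print(train_idx, test_idx)
```

Because the test set contains only scaffolds the model has never seen, its AUC is a closer proxy for performance on new chemical series than a random split provides.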
    Core · Medium

    Identify the data leakage bug introduced in the notebook (hint: look at where StandardScaler is fit). Fix it and report how test AUC changes.

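A general pattern for avoiding scaler leakage is to put preprocessing inside a scikit-learn pipeline, so its statistics come from training rows only. The sketch below is not the notebook's code: the synthetic data and the LogisticRegression model are stand-ins chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption, not BBBP).
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Leaky pattern to avoid: StandardScaler().fit(X) on the full matrix lets
# test rows influence the mean and variance used to scale training rows.

# Leak-free pattern: the pipeline fits the scaler on X_train only and
# reuses those statistics when transforming X_test at prediction time.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

test_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
scaler = model.named_steps["standardscaler"]
print(f"test AUC-ROC: {test_auc:.3f}")
```

The same pipeline object can be passed to cross-validation utilities, which then refit the scaler inside every training fold.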
    Challenge

    Implement a stratified scaffold split that preserves class balance within each fold. How does the AUC distribution across folds compare to an unstratified scaffold split?

Notebook (Colab) · GitHub repo · Paired lecture notes