Diagnose overfitting, prevent data leakage, and select hyperparameters rigorously for chemical property prediction.
What you will learn
Instructions
Open the Colab notebook using the button above. Run each cell in order; cells marked Exercise require you to fill in code.
Set up the environment. Install deepchem, scikit-learn, and rdkit. Confirm the dataset loads and SMILES parse cleanly.
Explore the dataset. Load the BBBP (blood–brain barrier permeability) dataset. Plot the class balance and identify any duplicated SMILES strings.
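As a starting point for the exploration step, here is a minimal sketch of the two checks. The helper names (`rdkit_canonical`, `summarize`) and the toy molecules and labels are illustrative assumptions, not part of the notebook or of BBBP. Exact string matching misses duplicates written with different atom orderings (for example `CCO` and `OCC`), so the sketch accepts an optional canonicalization function.

```python
from collections import Counter

def rdkit_canonical(smiles):
    """Canonical SMILES via RDKit, or None if parsing fails (requires RDKit)."""
    from rdkit import Chem  # lazy import: only needed when this helper is called
    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(mol) if mol is not None else None

def summarize(smiles, labels, canonicalize=None):
    """Return (positive fraction, {key: count} for keys seen more than once)."""
    keys = [canonicalize(s) if canonicalize else s for s in smiles]
    counts = Counter(keys)
    duplicates = {k: c for k, c in counts.items() if c > 1}
    positive_fraction = sum(labels) / len(labels)
    return positive_fraction, duplicates

# Toy data for illustration only; these are NOT real BBBP labels.
toy_smiles = ["CCO", "c1ccccc1", "CCO", "CC(=O)O"]
toy_labels = [1, 1, 1, 0]

frac, dups = summarize(toy_smiles, toy_labels)
print(f"positive fraction: {frac:.2f}, duplicates: {dups}")
```

In the notebook you would pass `canonicalize=rdkit_canonical` so that equivalent SMILES strings collapse to the same key before counting.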
Implement splitting strategies. Complete the three splitting cells: random split, scaffold split (Bemis–Murcko), and temporal split (by year of publication). Compare the resulting class distributions in each split.
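The core idea of a scaffold split is that every molecule sharing a scaffold lands in the same partition. The sketch below shows one common way to do this (largest scaffold groups are assigned first); it is not necessarily how the notebook or DeepChem implements it. The function names, split fractions, and toy scaffold keys are assumptions for illustration.

```python
from collections import defaultdict

def murcko_scaffold(smiles):
    """Bemis-Murcko scaffold SMILES for one molecule (requires RDKit)."""
    # Imported lazily so the grouping logic below runs without RDKit.
    from rdkit.Chem.Scaffolds import MurckoScaffold
    return MurckoScaffold.MurckoScaffoldSmiles(smiles=smiles, includeChirality=False)

def scaffold_split(scaffolds, frac_train=0.8, frac_valid=0.1):
    """Assign whole scaffold groups to train/valid/test, largest groups first."""
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)
    # Largest groups first; ties broken by first index for determinism.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))
    n = len(scaffolds)
    train_cap = frac_train * n
    valid_cap = (frac_train + frac_valid) * n
    train, valid, test = [], [], []
    for group in ordered:
        if len(train) + len(group) <= train_cap:
            train.extend(group)
        elif len(train) + len(valid) + len(group) <= valid_cap:
            valid.extend(group)
        else:
            test.extend(group)
    return train, valid, test

# Toy scaffold keys standing in for the output of murcko_scaffold().
toy_scaffolds = ["A"] * 6 + ["B"] * 2 + ["C", "D"]
train_idx, valid_idx, test_idx = scaffold_split(toy_scaffolds)
print(len(train_idx), len(valid_idx), len(test_idx))
```

Because groups are never broken up, the realized split sizes only approximate the requested fractions, and the class balance can differ noticeably between partitions. That is worth checking when you compare the class distributions.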
Tune with nested cross-validation. Fill in the inner-loop hyperparameter grid for a RandomForestClassifier (n_estimators, max_depth, min_samples_leaf). Report AUC-ROC on the outer test folds only; the inner folds are used solely to select hyperparameters.
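The nested structure can be sketched with scikit-learn by wrapping a `GridSearchCV` (inner loop) inside `cross_val_score` (outer loop). The synthetic data from `make_classification` is a stand-in for molecular features, and the grid values and fold counts are placeholder assumptions rather than the notebook's settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

# Synthetic stand-in for featurized molecules (imbalanced binary labels).
X, y = make_classification(
    n_samples=200, n_features=20, n_informative=5,
    weights=[0.3, 0.7], random_state=0,
)

# Placeholder grid; the exercise asks you to choose your own values.
param_grid = {
    "n_estimators": [25, 50],
    "max_depth": [4, None],
    "min_samples_leaf": [1, 5],
}

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)

# Inner loop: choose hyperparameters using only the outer-training portion.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid, scoring="roc_auc", cv=inner_cv,
)

# Outer loop: each score comes from a fold the search never saw.
outer_auc = cross_val_score(search, X, y, scoring="roc_auc", cv=outer_cv)
print(f"outer AUC-ROC: {outer_auc.mean():.3f} +/- {outer_auc.std():.3f}")
```

Reporting `search.best_score_` from a single grid search instead would be optimistically biased, because the same folds would be used both to pick the hyperparameters and to score them.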
Diagnose your model. Plot a learning curve (training size vs. validation AUC) and a validation curve (max_depth vs. AUC). Identify the bias-variance regime your model is in.
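Both diagnostic curves can be computed with scikit-learn's `learning_curve` and `validation_curve` before plotting. This is a sketch on synthetic data; the depth range, training sizes, and fold count are assumptions, and the plotting itself is left to the notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, learning_curve, validation_curve

# Synthetic stand-in for featurized molecules.
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5,
    weights=[0.3, 0.7], random_state=0,
)
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
model = RandomForestClassifier(n_estimators=50, random_state=0)

# Learning curve: AUC as a function of training-set size.
sizes, lc_train, lc_val = learning_curve(
    model, X, y, train_sizes=np.linspace(0.2, 1.0, 5),
    cv=cv, scoring="roc_auc",
)

# Validation curve: AUC as a function of max_depth.
depths = [2, 4, 8, 16]
vc_train, vc_val = validation_curve(
    model, X, y, param_name="max_depth", param_range=depths,
    cv=cv, scoring="roc_auc",
)

# A large train/validation gap suggests high variance (overfitting);
# low scores on both suggest high bias (underfitting).
for n, tr, va in zip(sizes, lc_train.mean(axis=1), lc_val.mean(axis=1)):
    print(f"n={n:4d}  train AUC={tr:.3f}  val AUC={va:.3f}  gap={tr - va:.3f}")
for d, tr, va in zip(depths, vc_train.mean(axis=1), vc_val.mean(axis=1)):
    print(f"max_depth={d:2d}  train AUC={tr:.3f}  val AUC={va:.3f}")
```

Each returned score array has one row per training size (or parameter value) and one column per fold, so averaging over `axis=1` gives the curve to plot and the standard deviation gives an error band.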
Questions
Answer these questions as you work through the notebook. Discuss with your neighbour — some have no single right answer.
What fraction of molecules are BBB-positive? Would you expect this class imbalance to affect AUC-ROC? What about accuracy? (Easy)
Compare test AUC under random vs. scaffold splitting. Which is lower? By how much? What does this gap imply for model deployment? (Medium)
Identify the data leakage bug introduced in the notebook (hint: look at where StandardScaler is fit). Fix it and report how test AUC changes.
Implement a stratified scaffold split that preserves class balance within each fold. How does the AUC distribution across folds compare to an unstratified scaffold split? (Challenge)
Resources