Tutorial 2 — Molecular Property Prediction · AI4Chemical Sciences Bootcamp

What you will learn

Compute Morgan fingerprints from SMILES strings using RDKit's AllChem module
Build a scikit-learn pipeline with feature scaling and a random forest regressor
Evaluate model performance using RMSE, R², and a parity plot
Compare random vs. scaffold-based train/test splits and explain the difference in reported performance

Instructions

Getting started

Open the Colab notebook using the button above. The notebook is structured into clearly labelled sections. Run each cell in order; cells marked Exercise require you to fill in code.

Set up the environment. Run the first cell to install rdkit, scikit-learn, and matplotlib. Verify that the import statement from rdkit import Chem runs without errors.
Load the dataset. The notebook downloads the ESOL aqueous solubility dataset automatically. Inspect the first few rows, check the distribution of logS values, and identify any outliers.
Featurize molecules. Complete the featurize() function to compute Morgan fingerprints (radius=2, 2048 bits) for each SMILES string using AllChem.GetMorganFingerprintAsBitVect.
Train the model. Fit a RandomForestRegressor on the training fingerprints. Experiment with n_estimators and max_depth using the config dict at the top of the training cell.
Evaluate and visualise. Run the evaluation cell to compute RMSE and R², generate a predicted-vs-true parity plot, and compare performance under random vs. scaffold split.
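Steps 3 to 5 can be sketched as follows. This is a minimal illustration, not the notebook's reference solution: the helper names build_model and evaluate, the validity mask returned by featurize, and the default hyperparameters are assumptions. Recent RDKit releases may print a deprecation notice for AllChem.GetMorganFingerprintAsBitVect; the call below follows the function named in the instructions.

```python
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def featurize(smiles_list, radius=2, n_bits=2048):
    """Return (X, keep): Morgan bit fingerprints and a mask of parsable SMILES."""
    rows, keep = [], []
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)  # None when the SMILES cannot be parsed
        keep.append(mol is not None)
        if mol is None:
            continue
        fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
        arr = np.zeros((n_bits,), dtype=np.int8)
        DataStructs.ConvertToNumpyArray(fp, arr)  # copies the bits into arr
        rows.append(arr)
    X = np.array(rows, dtype=np.float32).reshape(-1, n_bits)
    return X, np.array(keep, dtype=bool)


def build_model(n_estimators=200, max_depth=None, random_state=0):
    """Scaling followed by a random forest (hypothetical default settings)."""
    return make_pipeline(
        StandardScaler(),
        RandomForestRegressor(
            n_estimators=n_estimators,
            max_depth=max_depth,
            random_state=random_state,
        ),
    )


def evaluate(model, X_test, y_test):
    """Compute RMSE and R² for a fitted model on held-out data."""
    y_pred = model.predict(X_test)
    rmse = float(np.sqrt(mean_squared_error(y_test, y_pred)))
    return {"rmse": rmse, "r2": float(r2_score(y_test, y_pred))}
```

A typical call sequence would be X, keep = featurize(smiles), then build_model(**config).fit(X_train, y_train), then evaluate(model, X_test, y_test). Filter the logS targets with keep before splitting so that rows and targets stay aligned.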

Questions

Answer these questions as you work through the notebook. Discuss with your neighbour — some have no single right answer.

Warm-up (Easy)

How many molecules are in the ESOL dataset? What are the mean and standard deviation of the target logS? Are there any invalid SMILES strings?
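One way to approach the warm-up question is sketched below. The column names smiles and logS are placeholders, since the actual headers depend on how the notebook loads the ESOL file; pass the real names as arguments.

```python
import pandas as pd
from rdkit import Chem


def summarize_dataset(df, smiles_col="smiles", target_col="logS"):
    """Count rows, summarise the target, and list SMILES that RDKit cannot parse.

    smiles_col and target_col are hypothetical defaults; adjust them to the
    column names in the loaded DataFrame.
    """
    invalid = [
        smi for smi in df[smiles_col]
        if Chem.MolFromSmiles(smi) is None  # None signals a parse failure
    ]
    return {
        "n_molecules": len(df),
        "mean": float(df[target_col].mean()),
        "std": float(df[target_col].std()),  # pandas default: sample std (ddof=1)
        "invalid_smiles": invalid,
    }
```

Note that pandas reports the sample standard deviation by default, while numpy.std defaults to the population value, so the two can differ slightly.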

Core (Medium)

How does model RMSE change when you increase the fingerprint radius from 2 to 3? What structural information does a larger radius capture?
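To see what a larger radius adds, it can help to look at the atom environments behind the set bits. The sketch below uses the bitInfo argument of GetMorganFingerprintAsBitVect to count environments per radius; the helper name environments_by_radius is an assumption made for illustration.

```python
from collections import Counter

from rdkit import Chem
from rdkit.Chem import AllChem


def environments_by_radius(smiles, radius, n_bits=2048):
    """Count the atom environments recorded at each radius (0 = the atom itself)."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles!r}")
    info = {}  # filled by RDKit: bit index -> tuple of (atom index, radius)
    AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits, bitInfo=info)
    counts = Counter()
    for hits in info.values():
        for _atom_idx, env_radius in hits:
            counts[env_radius] += 1
    return dict(sorted(counts.items()))
```

Calling the function on the same molecule with radius=2 and radius=3 shows that the radius-3 fingerprint keeps the smaller environments and adds larger substructures centred on each atom. Whether that extra detail lowers RMSE is what the exercise asks you to measure.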

Core (Medium)

Compare RMSE on a random split vs. a scaffold split. Which is higher? What does the gap tell you about the model's ability to generalise to new scaffolds?
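A scaffold split keeps all molecules that share a Bemis-Murcko scaffold on the same side of the train/test boundary. The sketch below follows one common convention (largest scaffold groups are assigned to training first); the notebook's own splitter may differ, and the function name and test_fraction parameter are assumptions.

```python
from collections import defaultdict

from rdkit import Chem
from rdkit.Chem.Scaffolds import MurckoScaffold


def scaffold_split(smiles_list, test_fraction=0.2):
    """Return (train_idx, test_idx) such that no scaffold appears in both sets."""
    groups = defaultdict(list)
    for idx, smi in enumerate(smiles_list):
        mol = Chem.MolFromSmiles(smi)
        if mol is None:
            continue  # unparsable SMILES are left out of both sets
        # Acyclic molecules all map to the empty scaffold "" and form one group.
        scaffold = MurckoScaffold.MurckoScaffoldSmiles(mol=mol)
        groups[scaffold].append(idx)

    # Largest groups first; ties broken by first index for reproducibility.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))
    n_valid = sum(len(g) for g in ordered)
    train_budget = (1.0 - test_fraction) * n_valid

    train_idx, test_idx = [], []
    for group in ordered:
        if len(train_idx) + len(group) <= train_budget:
            train_idx.extend(group)
        else:
            test_idx.extend(group)
    return train_idx, test_idx
```

Because whole groups are moved at once, the realised test fraction only approximates test_fraction. The test set then contains scaffolds the model has never seen, which is why errors on this split are a stricter measure than errors on a random split.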

Challenge

Replace Morgan fingerprints with RDKit 2-D descriptors (Descriptors.CalcMolDescriptors). How does performance compare? Which descriptor subset is most predictive according to feature importance?
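A possible starting point for the challenge is sketched below. It builds a descriptor table with Descriptors.CalcMolDescriptors and ranks columns by the random forest's impurity-based feature_importances_. The helper names, the column-cleaning rules, and the forest settings are assumptions, not the notebook's solution.

```python
import numpy as np
import pandas as pd
from rdkit import Chem
from rdkit.Chem import Descriptors
from sklearn.ensemble import RandomForestRegressor


def descriptor_table(smiles_list):
    """One row per molecule, one column per RDKit 2-D descriptor."""
    rows = []
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol is None:
            raise ValueError(f"Could not parse SMILES: {smi!r}")
        rows.append(Descriptors.CalcMolDescriptors(mol))  # dict: name -> value
    table = pd.DataFrame(rows).astype(float)
    # Drop columns with missing or infinite values.
    table = table.replace([np.inf, -np.inf], np.nan).dropna(axis=1)
    # scikit-learn forests cast inputs to float32, so drop columns that overflow it.
    return table.loc[:, table.abs().max() < np.finfo(np.float32).max]


def top_features(table, y, k=10, random_state=0):
    """Fit a random forest and return the k most important descriptor names."""
    model = RandomForestRegressor(n_estimators=200, random_state=random_state)
    model.fit(table.values, y)
    importance = pd.Series(model.feature_importances_, index=table.columns)
    return importance.sort_values(ascending=False).head(k)
```

Impurity-based importances can be spread across correlated descriptors, so treat the ranking as a rough guide when deciding which descriptor subset matters most.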

Resources

Notebook (Colab)
GitHub repo
Paired lecture notes
