Tutorial 2

Molecular Property Prediction

Build an end-to-end prediction pipeline — from SMILES to a trained scikit-learn model — using RDKit fingerprints.

August 10, 2026 · 14:45 – 17:00
105 min
Python · Google Colab

Open in Google Colab

The notebook has most of the code pre-filled. Complete the exercises marked ### YOUR CODE HERE ###.

Open Notebook

Getting started

Open the Colab notebook using the button above. The notebook is structured into clearly labelled sections. Run each cell in order; cells marked Exercise require you to fill in code.

  1. Set up the environment. Run the first cell to install rdkit, scikit-learn, and matplotlib. Verify that from rdkit import Chem works without errors.

  2. Load the dataset. The notebook downloads the ESOL aqueous solubility dataset automatically. Inspect the first few rows, check the distribution of logS values, and identify any outliers.

  3. Featurize molecules. Complete the featurize() function to compute Morgan fingerprints (radius=2, 2048 bits) for each SMILES string using AllChem.GetMorganFingerprintAsBitVect.

  4. Train the model. Fit a RandomForestRegressor on the training fingerprints. Experiment with n_estimators and max_depth using the config dict at the top of the training cell.

  5. Evaluate and visualise. Run the evaluation cell to compute RMSE and R², generate a predicted-vs-true parity plot, and compare performance under a random split versus a scaffold split.
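Steps 3 to 5 can be sketched roughly as follows. This is a minimal illustration, not the notebook's own solution: the `config` keys, the toy SMILES list, and the return shape of `featurize()` are assumptions, and the target is a placeholder (heavy-atom count) standing in for the real logS column. Recent RDKit releases may print a deprecation notice for `GetMorganFingerprintAsBitVect`; the call is used here because it is the one the exercise names.

```python
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Hypothetical config dict; the keys in the notebook may differ.
config = {"n_estimators": 100, "max_depth": None, "random_state": 0}


def featurize(smiles_list, radius=2, n_bits=2048):
    """Return (X, keep_idx): a 0/1 fingerprint matrix and the indices of
    the SMILES strings that RDKit could parse."""
    rows, keep_idx = [], []
    for i, smi in enumerate(smiles_list):
        mol = Chem.MolFromSmiles(smi)
        if mol is None:  # unparsable SMILES: skip instead of crashing
            continue
        fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
        arr = np.zeros((n_bits,), dtype=np.uint8)
        DataStructs.ConvertToNumpyArray(fp, arr)
        rows.append(arr)
        keep_idx.append(i)
    return np.array(rows), keep_idx


# Toy input for illustration; the last entry is deliberately invalid.
smiles = ["CCO", "CCCC", "CCCCCC", "CC(C)O", "CC(=O)O", "CCN", "OCCO",
          "c1ccccc1", "c1ccccc1O", "Cc1ccccc1", "c1ccncc1", "CCOC(C)=O",
          "not_a_smiles"]
X, keep_idx = featurize(smiles)

# Placeholder target (heavy-atom count), NOT solubility. In the notebook,
# y would be the logS values of the rows listed in keep_idx.
y = np.array([Chem.MolFromSmiles(smiles[i]).GetNumHeavyAtoms()
              for i in keep_idx], dtype=float)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=config["random_state"])

model = RandomForestRegressor(**config).fit(X_train, y_train)
y_pred = model.predict(X_test)

rmse = float(np.sqrt(mean_squared_error(y_test, y_pred)))
r2 = r2_score(y_test, y_pred)
print(f"RMSE = {rmse:.3f}, R2 = {r2:.3f}")
```

Returning `keep_idx` alongside the matrix keeps the features aligned with the target column when some SMILES fail to parse. For the parity plot, scattering `y_test` against `y_pred` with matplotlib and adding the diagonal is usually enough.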


Answer these questions as you work through the notebook. Discuss with your neighbour — some have no single right answer.

    Warm-up · Easy

    How many molecules are in the ESOL dataset? What are the mean and standard deviation of the target logS? Are there any invalid SMILES strings?

    Core · Medium

    How does model RMSE change when you increase the fingerprint radius from 2 to 3? What structural information does a larger radius capture?

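To get a feel for what the radius controls before retraining, it can help to look at one molecule directly. The sketch below uses aspirin as an arbitrary example and the `bitInfo` argument of `GetMorganFingerprintAsBitVect` to record which atom environments set each bit. The variable names are illustrative only.

```python
from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")  # aspirin, arbitrary example

on_bits = {}         # radius -> number of bits set
max_env_radius = {}  # radius -> largest environment radius actually recorded

for radius in (1, 2, 3):
    info = {}  # filled by RDKit: bit id -> tuple of (atom index, env radius)
    fp = AllChem.GetMorganFingerprintAsBitVect(
        mol, radius, nBits=2048, bitInfo=info)
    on_bits[radius] = fp.GetNumOnBits()
    max_env_radius[radius] = max(
        env_radius for envs in info.values() for _, env_radius in envs)
    print(f"radius={radius}: {on_bits[radius]} bits set, "
          f"largest environment radius {max_env_radius[radius]}")
```

Each extra unit of radius adds circular environments that reach one bond further from the central atom, so the fingerprint encodes larger substructures at the cost of more, and often rarer, bits.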
    Core · Medium

    Compare RMSE on a random split vs. a scaffold split. Which is higher? What does the gap tell you about the model's ability to generalise to new scaffolds?

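One common way to build a scaffold split is to group molecules by their Bemis-Murcko scaffold and assign whole groups to either train or test, so no scaffold appears on both sides. The sketch below is an assumption about how such a split could look, not the notebook's implementation; `murcko_scaffolds` and `scaffold_split` are hypothetical helper names, and the scaffold strings in the demo are written by hand.

```python
from collections import defaultdict


def murcko_scaffolds(smiles_list):
    """Scaffold SMILES for each (valid) input SMILES.
    RDKit is imported lazily so the split logic below works without it."""
    from rdkit.Chem.Scaffolds import MurckoScaffold
    return [MurckoScaffold.MurckoScaffoldSmiles(smiles=s) for s in smiles_list]


def scaffold_split(scaffolds, test_frac=0.2):
    """Return (train_idx, test_idx) such that no scaffold is in both sets."""
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)

    # Largest groups first, so rare scaffolds tend to land in the test set.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))

    n_train_max = len(scaffolds) - round(test_frac * len(scaffolds))
    train_idx, test_idx = [], []
    for group in ordered:
        if len(train_idx) + len(group) <= n_train_max:
            train_idx.extend(group)
        else:
            test_idx.extend(group)
    return train_idx, test_idx


# Hand-written scaffold labels for illustration: benzene, pyridine,
# acyclic (empty string), and cyclohexane.
demo_scaffolds = (["c1ccccc1"] * 4 + ["c1ccncc1"] * 3
                  + [""] * 2 + ["C1CCCCC1"])
train_idx, test_idx = scaffold_split(demo_scaffolds, test_frac=0.2)
print("train:", sorted(train_idx))
print("test: ", sorted(test_idx))
```

In practice the labels would come from `murcko_scaffolds(smiles)`. Acyclic molecules all share the empty scaffold, so they move between train and test as one block, which is worth keeping in mind when reading the RMSE gap.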
    Challenge · Challenge

    Replace Morgan fingerprints with RDKit 2-D descriptors (Descriptors.CalcMolDescriptors). How does performance compare? Which descriptor subset is most predictive according to feature importance?
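A possible starting point for the descriptor variant is sketched below. It assumes an RDKit version that provides `Descriptors.CalcMolDescriptors` (which returns a name-to-value dict), drops columns that are missing, non-finite, or extremely large, and ranks what remains by random-forest feature importance. `descriptor_matrix`, the toy molecules, and the placeholder target (heavy-atom count, not logS) are illustrative assumptions.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import Descriptors
from sklearn.ensemble import RandomForestRegressor


def descriptor_matrix(mols):
    """Return (X, names) with one row per molecule and only those
    descriptor columns that are finite and moderate in size everywhere."""
    rows = [Descriptors.CalcMolDescriptors(m) for m in mols]
    names = list(rows[0].keys())
    X = np.array([[np.nan if r[n] is None else float(r[n]) for n in names]
                  for r in rows], dtype=float)
    # Drop NaN/inf columns and values that would overflow float32 in sklearn.
    ok = np.isfinite(X).all(axis=0)
    ok &= np.abs(np.nan_to_num(X)).max(axis=0) < 1e30
    return X[:, ok], [n for n, keep in zip(names, ok) if keep]


# Toy molecules for illustration only.
smiles = ["CCO", "CCCC", "CCCCCC", "CC(C)O", "CC(=O)O", "CCN", "OCCO",
          "c1ccccc1", "c1ccccc1O", "Cc1ccccc1", "c1ccncc1", "CCOC(C)=O"]
mols = [Chem.MolFromSmiles(s) for s in smiles]
X, names = descriptor_matrix(mols)

# Placeholder target, NOT solubility; use the logS column in the notebook.
y = np.array([m.GetNumHeavyAtoms() for m in mols], dtype=float)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Indices of the five most important descriptors, highest first.
order = np.argsort(model.feature_importances_)[::-1][:5]
top = [(names[i], float(model.feature_importances_[i])) for i in order]
for name, importance in top:
    print(f"{name:<25s} {importance:.3f}")
```

Impurity-based importances from a random forest tend to be split across correlated descriptors, so it is safer to read them as a ranking of descriptor families than as precise weights.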

Notebook (Colab) · GitHub repo · Paired lecture notes