Build an end-to-end prediction pipeline — from SMILES to a trained scikit-learn model — using RDKit fingerprints.
What you will learn
Instructions
Open the Colab notebook using the button above. The notebook is structured into clearly labelled sections. Run each cell in order; cells marked Exercise require you to fill in code.
Set up the environment. Run the first cell to install rdkit, scikit-learn, and matplotlib. Verify that from rdkit import Chem works without errors.
Load the dataset. The notebook downloads the ESOL aqueous solubility dataset automatically. Inspect the first few rows, check the distribution of logS values, and identify any outliers.
Featurize molecules. Complete the featurize() function to compute Morgan fingerprints (radius=2, 2048 bits) for each SMILES string using AllChem.GetMorganFingerprintAsBitVect.
Train the model. Fit a RandomForestRegressor on the training fingerprints. Experiment with n_estimators and max_depth using the config dict at the top of the training cell.
Evaluate and visualise. Run the evaluation cell to compute RMSE and R², generate a predicted-vs-true parity plot, and compare performance under random vs. scaffold split.
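The featurize, train, and evaluate steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the notebook's reference solution: the toy SMILES list, the placeholder target (negative heavy-atom count standing in for logS), the `(X, keep)` return shape of `featurize`, and the keys of `config` are all hypothetical, and the notebook's own signatures may differ.

```python
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


def featurize(smiles_list, radius=2, n_bits=2048):
    """Return (X, keep): Morgan bit-vector matrix and indices of parseable SMILES."""
    rows, keep = [], []
    for i, smi in enumerate(smiles_list):
        mol = Chem.MolFromSmiles(smi)
        if mol is None:  # RDKit returns None for SMILES it cannot parse
            continue
        fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
        arr = np.zeros((0,), dtype=np.int8)
        DataStructs.ConvertToNumpyArray(fp, arr)  # resizes arr to n_bits and fills it
        rows.append(arr)
        keep.append(i)
    return np.array(rows), keep


# Toy input (assumption); the notebook uses the ESOL table instead.
# The last entry is deliberately invalid to show how bad SMILES are skipped.
smiles = ["CCO", "CCCCO", "c1ccccc1", "c1ccccc1O", "CC(=O)O",
          "CCCCCCCC", "c1ccc2ccccc2c1", "CC(C)O", "this-is-not-smiles"]
X, keep = featurize(smiles)

# Placeholder target, NOT solubility data: negative heavy-atom count per molecule.
y = np.array([-Chem.MolFromSmiles(smiles[i]).GetNumHeavyAtoms() for i in keep],
             dtype=float)

# Hypothetical config dict mirroring the one described for the training cell.
config = {"n_estimators": 100, "max_depth": None}

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)
model = RandomForestRegressor(random_state=0, **config).fit(X_train, y_train)

pred = model.predict(X_test)
rmse = float(np.sqrt(mean_squared_error(y_test, pred)))
r2 = r2_score(y_test, pred)
print(f"RMSE = {rmse:.3f}, R2 = {r2:.3f}")
```

Returning the kept indices alongside the matrix makes it easy to drop the matching target rows when a SMILES string fails to parse. Recent RDKit releases may print a deprecation notice for GetMorganFingerprintAsBitVect; the call is kept here because it is the one the instructions name.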
Questions
Answer these questions as you work through the notebook. Discuss with your neighbour — some have no single right answer.
How many molecules are in the ESOL dataset? What are the mean and standard deviation of the target logS? Are there any invalid SMILES strings?
Easy: How does model RMSE change when you increase the fingerprint radius from 2 to 3? What structural information does a larger radius capture?
Medium: Compare RMSE on a random split vs. a scaffold split. Which is higher? What does the gap tell you about the model's ability to generalise to new scaffolds?
Medium: Replace Morgan fingerprints with RDKit 2-D descriptors (Descriptors.CalcMolDescriptors). How does performance compare? Which descriptor subset is most predictive according to feature importance?
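For the split and descriptor questions above, the sketch below shows one simple way to build a scaffold split and to compute the 2-D descriptors for a molecule. It is an illustration under assumptions rather than the notebook's implementation: `scaffold_split` and `descriptor_row` are hypothetical helper names, the greedy "largest scaffold groups go to training" rule is just one common heuristic, and the input SMILES are assumed to be valid.

```python
from collections import defaultdict

from rdkit import Chem
from rdkit.Chem import Descriptors
from rdkit.Chem.Scaffolds import MurckoScaffold


def scaffold_split(smiles_list, test_fraction=0.2):
    """Split indices so that no Bemis-Murcko scaffold appears in both sets."""
    groups = defaultdict(list)
    for i, smi in enumerate(smiles_list):
        # Acyclic molecules share the empty scaffold "".
        scaffold = MurckoScaffold.MurckoScaffoldSmiles(smiles=smi)
        groups[scaffold].append(i)

    # Greedy heuristic (assumption): fill training with the largest groups
    # first, so rarer scaffolds end up in the test set.
    ordered = sorted(groups.values(), key=len, reverse=True)
    n_train_target = (1 - test_fraction) * len(smiles_list)
    train, test = [], []
    for members in ordered:
        if len(train) + len(members) <= n_train_target:
            train.extend(members)
        else:
            test.extend(members)
    return train, test


def descriptor_row(smi):
    """Return a dict mapping RDKit descriptor names to values for one molecule."""
    mol = Chem.MolFromSmiles(smi)
    return Descriptors.CalcMolDescriptors(mol)


# Toy molecules (assumption), not taken from ESOL.
toy = ["c1ccccc1", "Cc1ccccc1", "Oc1ccccc1", "CCO", "CCCC",
       "C1CCCCC1", "c1ccc2ccccc2c1", "c1ccncc1"]
train_idx, test_idx = scaffold_split(toy, test_fraction=0.25)
desc = descriptor_row("CCO")
print(train_idx, test_idx, len(desc))
```

Because whole scaffold groups are assigned to one side, the test molecules have ring systems the model never saw in training, which is why the scaffold-split RMSE is the more honest estimate of performance on new chemistry. The descriptor dict can be turned into a feature matrix by stacking one row per molecule with a fixed ordering of the keys.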
Resources