Tutorial 5 — Molecular Property Prediction with GNNs

What you will learn

Construct molecular graphs from SMILES using PyTorch Geometric's Data objects
Implement one full message-passing layer (message → aggregate → update) using PyG's MessagePassing base class
Train an MPNN end-to-end on QM9 and track validation MAE
Visualise atom-level embeddings with UMAP coloured by atom type and partial charge

Instructions

Getting started

Open the Colab notebook using the button above. Run each cell in order; cells marked Exercise require you to fill in code.

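Set up the environment. Install torch-geometric and the extension packages the notebook uses (torch-scatter, torch-sparse); recent PyG releases treat these as optional, and their wheels must match your installed PyTorch and CUDA versions. Confirm that torch_geometric.datasets.QM9 loads without errors.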
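Explore QM9. Load the dataset and inspect a sample graph: number of nodes, edge_index shape, and node features. Filter to molecules with ≤20 atoms for faster training; count hydrogens too, since QM9 molecules contain at most 9 heavy atoms and a heavy-atom cutoff of 20 would remove nothing.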
Build the MPNN. Complete the MessagePassing subclass: implement message() (concatenate source and edge features, apply a linear layer), aggregate() (sum), and update() (GRU). Add a global sum readout and a final MLP head.
Train the model. Run the training loop for 50 epochs. Plot train and validation MAE for the HOMO-LUMO gap target. Compare to a baseline that predicts the mean gap.
Visualise embeddings. Extract atom embeddings from the last message-passing layer. Run UMAP on 2000 randomly sampled atoms. Plot a 2-D scatter coloured by atom type.

Questions

Answer these questions as you work through the notebook. Discuss with your neighbour — some have no single right answer.

Warm-up (Easy)

What is the mean HOMO-LUMO gap in QM9? What is the standard deviation? How does MAE/std compare for your trained model vs. the mean baseline?
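
One way to set up the warm-up comparison is sketched below. The function name mean_baseline_report and the numbers passed to it are hypothetical and chosen only for illustration; they are not QM9 values. In the notebook, y_train and y_val would be the gap targets of your splits and y_pred the model's validation predictions.

```python
import numpy as np


def mean_baseline_report(y_train, y_val, y_pred):
    """Return MAE and MAE/std for a model and for a mean-predicting baseline."""
    y_train = np.asarray(y_train, dtype=float)
    y_val = np.asarray(y_val, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)

    # Statistics come from the training split only, to avoid leaking validation data.
    mean, std = y_train.mean(), y_train.std()

    baseline_mae = np.abs(y_val - mean).mean()  # always predict the training mean
    model_mae = np.abs(y_val - y_pred).mean()
    return {
        "mean": mean,
        "std": std,
        "baseline_mae": baseline_mae,
        "model_mae": model_mae,
        "baseline_mae_over_std": baseline_mae / std,
        "model_mae_over_std": model_mae / std,
    }


# Made-up numbers purely for illustration.
report = mean_baseline_report(
    y_train=[1.0, 2.0, 3.0, 4.0],
    y_val=[2.0, 3.0],
    y_pred=[2.1, 2.8],
)
for name, value in report.items():
    print(f"{name}: {value:.3f}")
```

As a rough reference point, if the target were exactly Gaussian the mean baseline would give MAE/std of about 0.8, so a useful model should land well below that.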

Core (Medium)

How does increasing the number of message-passing steps from 3 to 6 affect validation MAE? At what depth do you start to see over-smoothing?
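
To make "over-smoothing" measurable, one possible diagnostic is how far node embeddings sit from their centroid after each step. The toy below is not the tutorial's MPNN: it assumes untrained mean aggregation on a made-up 5-atom chain with random features, and the helper node_spread is a hypothetical name. It only illustrates the effect that repeated neighbourhood averaging has on node representations.

```python
import numpy as np


def node_spread(h):
    """Mean distance of node embeddings from their centroid (0 = fully smoothed)."""
    return float(np.linalg.norm(h - h.mean(axis=0), axis=1).mean())


# Toy 5-atom chain with self-loops; rows are normalised so each step averages neighbours.
adj = np.eye(5)
for i in range(4):
    adj[i, i + 1] = adj[i + 1, i] = 1.0
propagate = adj / adj.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
h = rng.normal(size=(5, 8))  # random stand-in for atom embeddings

spreads = []
for step in range(1, 13):
    h = propagate @ h  # parameter-free mean aggregation, no learned update
    spreads.append(node_spread(h))
    print(f"step {step:2d}: spread = {spreads[-1]:.4f}")
```

Applying the same spread measure to the atom embeddings of your trained model at depths 3 to 6, alongside validation MAE, gives a second signal for where extra steps stop helping.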

Core (Medium)

In your UMAP embedding, are atoms of the same type clustered together? Are carbon atoms in aromatic rings separated from aliphatic carbons? What does this tell you about the learned representation?

Challenge

Add edge features (bond type, aromaticity) to the message function. How does validation MAE change relative to the node-only model?

Resources

Notebook (Colab)
GitHub repo
Paired lecture notes
