Apply an MPNN to the QM9 dataset in PyTorch Geometric to predict HOMO-LUMO gaps, then visualise learned atom embeddings with UMAP.
What you will learn
Data objects
MessagePassing base class
Instructions
Open the Colab notebook using the button above. Run each cell in order; cells marked Exercise require you to fill in code.
Set up the environment. Install torch-geometric and its dependencies (torch-scatter, torch-sparse). Confirm that torch_geometric.datasets.QM9 loads without errors.
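A quick import check along the following lines can confirm the installation before going further. This is only a sketch: `report_versions` is a hypothetical helper, and the import names (`torch_geometric`, `torch_scatter`, `torch_sparse`) are assumed to match the installed packages.

```python
import importlib


def report_versions(packages=("torch", "torch_geometric", "torch_scatter", "torch_sparse")):
    """Return {import name: version string, or None if the import fails}."""
    versions = {}
    for name in packages:
        try:
            module = importlib.import_module(name)
            versions[name] = getattr(module, "__version__", "unknown")
        except Exception:
            # Broad on purpose: a mismatched compiled extension can raise
            # OSError rather than ImportError.
            versions[name] = None
    return versions


for name, version in report_versions().items():
    status = version if version is not None else "NOT INSTALLED"
    print(f"{name:16s} {status}")
```

If every entry shows a version, `from torch_geometric.datasets import QM9` should import cleanly. In recent PyG releases the two extension packages appear to be optional, so a missing entry there may not be fatal.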
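Explore QM9. Load the dataset and inspect a sample graph: number of nodes, edge_index shape, and node features. Filter to molecules with ≤20 heavy atoms for faster training; note that QM9 molecules contain at most nine heavy atoms (C, N, O, F), so this threshold only removes molecules if hydrogens are counted as well.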
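The exploration step might look roughly like the sketch below. It assumes that each graph stores atomic numbers in `data.z`, as PyG's QM9 does. The root path `data/QM9` and the helper names are illustrative choices, not part of the notebook.

```python
import torch


def num_heavy_atoms(z):
    """Count non-hydrogen atoms given a tensor of atomic numbers."""
    return int((z > 1).sum())


def keep_small(data, max_heavy=20):
    """Filter predicate; assumes `data.z` holds atomic numbers."""
    return num_heavy_atoms(data.z) <= max_heavy


def load_and_inspect(root="data/QM9", max_heavy=20):
    # Imported lazily so the helpers above can be used without PyG.
    from torch_geometric.datasets import QM9

    dataset = QM9(root)  # downloads and processes the data on first use
    sample = dataset[0]
    print("nodes:", sample.num_nodes)
    print("edge_index shape:", tuple(sample.edge_index.shape))  # (2, num_edges)
    print("node feature shape:", tuple(sample.x.shape))

    # Boolean mask over the whole dataset; a plain Python loop, so it is slow
    # but easy to read.
    mask = torch.tensor([keep_small(d, max_heavy) for d in dataset])
    return dataset[mask]
```

Calling `load_and_inspect()` triggers the download, so the first run can take several minutes.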
Build the MPNN. Complete the MessagePassing subclass: implement message() (concatenate source and edge features, apply a linear layer), aggregate() (sum), and update() (GRU). Add a global sum readout and a final MLP head.
Train the model. Run the training loop for 50 epochs. Plot train and validation MAE for the HOMO-LUMO gap target. Compare to a baseline that predicts the mean gap.
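The training step could be organised as in the sketch below. It assumes a model whose forward pass takes `(x, edge_index, edge_attr, batch)` and returns one value per graph, and it assumes the gap sits in column 4 of `data.y`, as in PyG's QM9. The Adam optimiser, the learning rate and the function names are illustrative choices.

```python
import torch

GAP_INDEX = 4  # assumption: column of data.y holding the HOMO-LUMO gap


def mean_baseline_mae(train_targets, val_targets):
    """Validation MAE of always predicting the training-set mean."""
    return (val_targets - train_targets.mean()).abs().mean().item()


def run_epoch(model, loader, optimizer=None):
    """One pass over `loader`; trains if an optimizer is given. Returns the MAE."""
    training = optimizer is not None
    model.train(training)
    total_abs_err, n = 0.0, 0
    with torch.set_grad_enabled(training):
        for batch in loader:
            pred = model(batch.x, batch.edge_index, batch.edge_attr, batch.batch)
            target = batch.y[:, GAP_INDEX]
            loss = torch.nn.functional.l1_loss(pred, target)
            if training:
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
            total_abs_err += (pred - target).abs().sum().item()
            n += target.numel()
    return total_abs_err / n


def fit(model, train_loader, val_loader, epochs=50, lr=1e-3):
    """Return a list of (train MAE, validation MAE) pairs, one per epoch."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    history = []
    for _ in range(epochs):
        train_mae = run_epoch(model, train_loader, optimizer)
        val_mae = run_epoch(model, val_loader)
        history.append((train_mae, val_mae))
    return history
```

The two columns of `history` can then be plotted against the epoch number, with the value from `mean_baseline_mae` drawn as a horizontal reference line.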
Visualise embeddings. Extract atom embeddings from the last message-passing layer. Run UMAP on 2000 randomly sampled atoms. Plot a 2-D scatter coloured by atom type.
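A sketch of the sampling and plotting steps follows. It assumes the atom embeddings have already been collected into a NumPy array of shape (num_atoms, hidden_dim), with a matching array of atomic numbers, and that the umap-learn and matplotlib packages are installed. The function names and the fixed seed are illustrative.

```python
import numpy as np


def sample_atoms(embeddings, atom_types, n_samples=2000, seed=0):
    """Draw up to `n_samples` atoms without replacement, keeping rows aligned."""
    rng = np.random.default_rng(seed)
    n = min(n_samples, len(embeddings))
    idx = rng.choice(len(embeddings), size=n, replace=False)
    return embeddings[idx], atom_types[idx]


def plot_umap(embeddings, atom_types, seed=0):
    """Project embeddings to 2-D with UMAP and colour points by atomic number."""
    # Imported lazily so sample_atoms works without these packages.
    import matplotlib.pyplot as plt
    import umap

    coords = umap.UMAP(n_components=2, random_state=seed).fit_transform(embeddings)
    fig, ax = plt.subplots()
    for z in np.unique(atom_types):
        mask = atom_types == z
        ax.scatter(coords[mask, 0], coords[mask, 1], s=5, label=f"Z={z}")
    ax.legend(title="atomic number")
    ax.set_xlabel("UMAP 1")
    ax.set_ylabel("UMAP 2")
    return fig
```

Colouring by atomic number is the simplest choice. To probe finer distinctions such as aromatic versus aliphatic carbon, a different label array can be passed as `atom_types`.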
Questions
Answer these questions as you work through the notebook. Discuss with your neighbour — some have no single right answer.
What is the mean HOMO-LUMO gap in QM9? What is the standard deviation? How does MAE/std compare for your trained model vs. the mean baseline? (Easy)
How does increasing the number of message-passing steps from 3 to 6 affect validation MAE? At what depth do you start to see over-smoothing? (Medium)
In your UMAP embedding, are atoms of the same type clustered together? Are carbon atoms in aromatic rings separated from aliphatic carbons? What does this tell you about the learned representation? (Medium)
Add edge features (bond type, aromaticity) to the message function. How does validation MAE change relative to the node-only model? (Challenge)
Resources