Fine-tune a pre-trained chemical language model on BACE IC50 data using HuggingFace Transformers, comparing head-only training (frozen encoder) with full fine-tuning.
What you will learn
Instructions
Open the Colab notebook using the button above. Run each cell in order; cells marked Exercise require you to fill in code.
Set up the environment. Install transformers, datasets, and scikit-learn. Load seyonec/ChemBERTa-zinc-base-v1 from HuggingFace Hub and verify the tokeniser handles a few SMILES strings correctly.
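Prepare the BACE dataset. Load BACE IC50 values, convert them to pIC50 = −log10(IC50) with IC50 expressed in mol/L (for values in nM this is 9 − log10(IC50)), and apply a scaffold split so that molecules sharing a scaffold never appear in more than one subset. Tokenise all SMILES with max_length=128 and padding.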
Fine-tune (head only). Freeze all encoder layers so that only the new head receives gradient updates. Add a regression head (dropout + linear). Train for 10 epochs with AdamW (lr=1e-3). Record validation RMSE.
Full fine-tune. Unfreeze all layers. Train for 5 epochs with a small learning rate (lr=2e-5) and linear warmup. Compare validation RMSE to head-only fine-tuning.
Visualise embeddings. Extract CLS embeddings for all test-set molecules using the full fine-tuned model. Run UMAP and colour by pIC50 value. Identify clusters.
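Two parts of the data-preparation step are easy to get subtly wrong: the unit handling in the pIC50 conversion, and keeping scaffold families from leaking across subsets. The following is a minimal sketch, not the notebook's own code. It assumes IC50 values are given in nM and that a scaffold string per molecule (for example a Bemis-Murcko scaffold SMILES) has already been computed elsewhere. The function names `pic50_from_nm` and `scaffold_split` and the 80/10/10 fractions are hypothetical choices for illustration, and the greedy largest-group-first assignment is one common heuristic rather than the only way to build a scaffold split.

```python
import math
from collections import defaultdict


def pic50_from_nm(ic50_nm):
    """Convert an IC50 in nanomolar to pIC50 = -log10(IC50 in mol/L)."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be positive")
    # 1 nM = 1e-9 mol/L, so -log10(x * 1e-9) = 9 - log10(x).
    return 9.0 - math.log10(ic50_nm)


def scaffold_split(scaffolds, frac_train=0.8, frac_valid=0.1):
    """Assign whole scaffold groups to train/valid/test index lists.

    `scaffolds` holds one scaffold string per molecule (assumed to be
    precomputed; this sketch does not call any cheminformatics library).
    """
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)

    # Largest scaffold families first; ties broken by first index so the
    # result is deterministic.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))

    n = len(scaffolds)
    train_cap = frac_train * n
    valid_cap = (frac_train + frac_valid) * n

    train, valid, test = [], [], []
    for group in ordered:
        if len(train) + len(group) <= train_cap:
            train.extend(group)
        elif len(train) + len(valid) + len(group) <= valid_cap:
            valid.extend(group)
        else:
            test.extend(group)
    return train, valid, test


if __name__ == "__main__":
    # Toy scaffold labels standing in for real scaffold SMILES.
    toy_scaffolds = ["A"] * 5 + ["B"] * 3 + ["C", "D"]
    train_idx, valid_idx, test_idx = scaffold_split(toy_scaffolds)
    print(len(train_idx), len(valid_idx), len(test_idx))
```

Because groups are assigned whole, the realised subset sizes only approximate the requested fractions, and the test set tends to collect the rarest scaffolds. That is the point of the split: validation and test RMSE then reflect generalisation to unseen chemotypes rather than to near-duplicates of training molecules.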
Questions
Answer these questions as you work through the notebook. Discuss with your neighbour — some have no single right answer.
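As a reference point for the baseline question, a mean-prediction baseline can be computed in a few lines. This is a minimal sketch with hypothetical helper names (`rmse`, `mean_baseline_rmse`). It assumes the constant prediction is the mean of the training-set pIC50 values rather than the test-set mean, which avoids using test labels to build the baseline; if the notebook intends the test-set mean, swap the argument accordingly.

```python
import math


def rmse(y_true, y_pred):
    """Root-mean-square error between two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def mean_baseline_rmse(y_train, y_test):
    """RMSE on y_test when every prediction is the mean of y_train."""
    train_mean = sum(y_train) / len(y_train)
    return rmse(y_test, [train_mean] * len(y_test))


if __name__ == "__main__":
    # Toy pIC50 values, for illustration only (not BACE data).
    print(mean_baseline_rmse([5.0, 7.0], [5.0, 7.0, 6.0]))
```

Any fine-tuned model should be judged by how far its RMSE falls below this number on the same split, since the baseline already captures how narrow or wide the pIC50 distribution is.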
Easy: What is the baseline RMSE if you predict the mean pIC50 for all test molecules? How much does head-only fine-tuning improve on this?
Medium: At what epoch does validation RMSE plateau for full fine-tuning? Does it ever increase (overfitting)? How does learning rate warmup affect early-epoch stability?
Medium: In the UMAP embedding, are molecules with high pIC50 (potent) clustered together? Do structurally similar molecules (same scaffold) cluster regardless of potency?
Challenge: Replace the CLS token with mean-pooling over all token embeddings. Does this change validation RMSE? Which pooling strategy gives better-calibrated embeddings according to your UMAP?
Resources