Tutorial 9 — Fine-tuning ChemBERTa · AI4Chemical Sciences Bootcamp

What you will learn

Load and tokenise SMILES strings with the ChemBERTa tokeniser from HuggingFace Hub
Fine-tune a BERT-style model for regression using the Trainer API
Compare frozen-encoder (head-only) vs. full fine-tuning on the BACE IC50 dataset
Visualise CLS-token embeddings with UMAP to understand the learned chemical space

Instructions

Getting started

Open the Colab notebook linked under Resources and run the cells in order. Cells marked Exercise require you to fill in code.

Set up the environment. Install transformers, datasets, and scikit-learn. Load seyonec/ChemBERTa-zinc-base-v1 from HuggingFace Hub and verify the tokeniser handles a few SMILES strings correctly.
Prepare the BACE dataset. Load the BACE IC50 values, convert them to pIC50 (pIC50 = −log10(IC50), with IC50 expressed in mol/L), and apply a scaffold split. Tokenise all SMILES with max_length=128 and padding.
Fine-tune (head only). Freeze all BERT layers. Add a regression head (dropout + linear). Train for 10 epochs with AdamW (lr=1e-3). Record validation RMSE.
Full fine-tune. Unfreeze all layers and train for 5 epochs with a small learning rate (lr=2e-5) and linear warmup. Compare the validation RMSE with the head-only result.
Visualise embeddings. Extract CLS embeddings for all test-set molecules using the fully fine-tuned model. Run UMAP on them and colour the points by pIC50 value. Identify clusters.
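The log transform in the dataset-preparation step is easy to get wrong by a constant offset if the units are overlooked. The sketch below assumes the raw IC50 values are given in nanomolar; that is an assumption, so check the units of the column you actually load. The function name and the example values are hypothetical.

```python
import math


def ic50_nm_to_pic50(ic50_nm: float) -> float:
    """Convert an IC50 in nanomolar to pIC50 = -log10(IC50 in mol/L)."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be positive to take a logarithm")
    ic50_molar = ic50_nm * 1e-9  # nM -> mol/L (assumes the input is in nM)
    return -math.log10(ic50_molar)


# Hypothetical example values, not taken from the BACE dataset.
for ic50_nm in (1.0, 100.0, 10_000.0):
    print(f"{ic50_nm:>9.1f} nM -> pIC50 = {ic50_nm_to_pic50(ic50_nm):.2f}")
```

Higher pIC50 means a more potent compound, so a 100-fold drop in IC50 shows up as an increase of 2 pIC50 units.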

Questions

Answer these questions as you work through the notebook. Discuss with your neighbour — some have no single right answer.

Warm-up (Easy)

What is the baseline RMSE if you predict the mean pIC50 for all test molecules? How much does head-only fine-tuning improve on this?
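One possible way to compute the mean-predictor baseline is sketched below. It assumes the constant prediction is the mean of the training targets, so that nothing from the test set leaks into the baseline; the question can also be read as using the test-set mean, in which case the RMSE is simply the standard deviation of the test targets. The function names and numbers are hypothetical.

```python
import math


def rmse(y_true, y_pred):
    """Root-mean-square error between two equal-length sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def mean_baseline_rmse(train_targets, test_targets):
    """RMSE on the test set when every prediction is the training-set mean."""
    mean_prediction = sum(train_targets) / len(train_targets)
    return rmse(test_targets, [mean_prediction] * len(test_targets))


# Hypothetical pIC50 values, for illustration only.
train_pic50 = [5.0, 6.0, 7.0, 8.0]
test_pic50 = [6.0, 7.0]
print(f"baseline RMSE = {mean_baseline_rmse(train_pic50, test_pic50):.3f}")
```

Any fine-tuned model should be compared against this number on the same split; an RMSE that is not clearly below it suggests the model has learned little beyond the average potency.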

Core (Medium)

At what epoch does validation RMSE plateau for full fine-tuning? Does it ever increase (overfitting)? How does learning rate warmup affect early-epoch stability?
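To reason about the warmup part of this question, it can help to look at the learning rate as a function of the optimiser step. The sketch below is a framework-free illustration of a linear warmup followed by a linear decay to zero. The decay phase is an assumption (the tutorial only specifies linear warmup), and the step counts are hypothetical. In the transformers library a schedule of this shape is provided by `get_linear_schedule_with_warmup`.

```python
def linear_warmup_decay(step: int, warmup_steps: int, total_steps: int) -> float:
    """Multiplier applied to the base learning rate at a given optimiser step."""
    if step < warmup_steps:
        # Ramp up linearly from 0 to 1 over the warmup phase.
        return step / max(1, warmup_steps)
    # Then decay linearly from 1 to 0 over the remaining steps (assumed shape).
    remaining = total_steps - step
    return max(0.0, remaining / max(1, total_steps - warmup_steps))


base_lr = 2e-5                       # learning rate from the full fine-tuning step
total_steps, warmup_steps = 100, 10  # hypothetical values

for step in (0, 5, 10, 55, 100):
    lr = base_lr * linear_warmup_decay(step, warmup_steps, total_steps)
    print(f"step {step:>3}: lr = {lr:.2e}")
```

During the first few steps the updates to the pretrained weights are tiny, which is why warmup tends to reduce large early swings in the loss while the freshly initialised regression head is still producing poor gradients.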

Core (Medium)

In the UMAP embedding, are molecules with high pIC50 (potent) clustered together? Do structurally similar molecules (same scaffold) cluster regardless of potency?

Challenge

Replace the CLS token with mean-pooling over all token embeddings. Does this change validation RMSE? Which pooling strategy gives better-calibrated embeddings according to your UMAP?
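A common pitfall with mean-pooling is averaging over padding positions as well as real tokens. The sketch below shows the arithmetic for a single molecule using plain Python lists, purely as an illustration. In the notebook the same idea would be applied with tensor operations to the model's `last_hidden_state` and the tokeniser's `attention_mask`. The function names and the toy numbers are hypothetical.

```python
def cls_pool(hidden_states):
    """Use the embedding of the first token (the CLS position)."""
    return list(hidden_states[0])


def masked_mean_pool(hidden_states, attention_mask):
    """Average token embeddings, ignoring positions where the mask is 0."""
    n_real = sum(attention_mask)
    if n_real == 0:
        raise ValueError("attention mask selects no tokens")
    dim = len(hidden_states[0])
    return [
        sum(vec[d] for vec, keep in zip(hidden_states, attention_mask) if keep) / n_real
        for d in range(dim)
    ]


# Toy example: three token positions, two embedding dimensions.
# The last position is padding and must not influence the mean.
hidden = [[1.0, 0.0], [3.0, 2.0], [9.0, 9.0]]
mask = [1, 1, 0]

print("CLS pooling: ", cls_pool(hidden))
print("mean pooling:", masked_mean_pool(hidden, mask))
```

Without the mask, the padding vector would pull the average towards values that depend on sequence length rather than on chemistry, which can distort both the regression head's input and the UMAP layout.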

Resources

Notebook (Colab)
GitHub repo
Paired lecture notes
