Lecture 9 — Transformers for Chemistry · AI4Chemical Sciences Bootcamp

Recording

Recording will be available after the bootcamp.

August 2026

Learning Objectives

Explain the scaled dot-product attention mechanism and multi-head attention
Describe how SMILES strings are tokenised and what vocabulary design choices matter for chemistry
Fine-tune a chemical BERT model for property prediction and interpret the CLS-token embedding
Compare text-based transformer models with graph-based GNNs for molecular tasks
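
As a companion to the first objective, the following is a minimal NumPy sketch of single-head scaled dot-product attention, softmax(Q K^T / sqrt(d_k)) V, plus a simplified multi-head variant. It is illustrative only and not the lecture's reference implementation. The function names, the random input and the omission of the learned projection matrices (W_Q, W_K, W_V, W_O) are assumptions made to keep the example short.

```python
import numpy as np


def scaled_dot_product_attention(Q, K, V):
    """Single-head attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                       # (n_queries, n_keys)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V, weights


def multi_head_self_attention(X, n_heads):
    """Simplified multi-head self-attention.

    Assumption: the learned projections W_Q, W_K, W_V, W_O used by real
    transformers are omitted; each head just sees its own slice of X.
    """
    n_tokens, d_model = X.shape
    if d_model % n_heads != 0:
        raise ValueError("d_model must be divisible by n_heads")
    head_inputs = np.split(X, n_heads, axis=-1)
    head_outputs = [scaled_dot_product_attention(h, h, h)[0] for h in head_inputs]
    return np.concatenate(head_outputs, axis=-1)


# Hypothetical input: 5 token embeddings of width 8.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))

out, weights = scaled_dot_product_attention(X, X, X)
multi_out = multi_head_self_attention(X, n_heads=2)

print(out.shape, multi_out.shape)
print(weights.sum(axis=-1))  # each row of attention weights sums to 1
```

Every row of `weights` spans all tokens in the sequence, which is the global receptive field referred to in the takeaways.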

Key Takeaways

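Takeaway 1. Self-attention lets every token attend to every other token within a single layer. This global receptive field is the key advantage over recurrent models, which must pass information step by step along the sequence, and because all positions are processed in parallel, it is also a large part of why transformers scale so well.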
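Takeaway 2. SMILES tokenisation is non-trivial: character-level tokenisation splits multi-character element symbols such as Cl and Br into separate characters, so the C of Cl becomes indistinguishable from a carbon atom. Atom-level tokenisation with a chemistry-aware tokeniser is strongly preferred.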
Takeaway 3. For property prediction, training only the classification head on top of a frozen BERT encoder is a fast baseline; full fine-tuning usually performs better once at least 500 labelled molecules are available.
Takeaway 4. Transformers on SMILES and GNNs on molecular graphs are not interchangeable — they encode complementary inductive biases; ensemble or multi-view models often outperform either alone.
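
To make Takeaway 2 concrete, here is a small sketch contrasting character-level and atom-level tokenisation of SMILES strings. The regular expression is a simplified pattern written for this example. It covers bracket atoms, the common organic-subset elements, ring-closure digits and bond or branch symbols, but it is an assumption rather than the tokeniser used in the lecture or by any particular chemical BERT model. The function names are hypothetical.

```python
import re

# Simplified atom-level pattern (an illustrative assumption, not a full SMILES grammar).
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]"            # bracket atoms, e.g. [NH4+] or [C@@H]
    r"|Br|Cl"                # two-letter elements, tried before single letters
    r"|[BCNOSPFI]"           # one-letter organic-subset elements
    r"|[bcnosp]"             # aromatic atoms
    r"|%\d{2}"               # two-digit ring-closure labels
    r"|\d"                   # one-digit ring-closure labels
    r"|[()=#\-+\\/:~.@*$]"   # branches, bonds and other symbols
)


def char_level_tokenise(smiles):
    """Naive baseline: one token per character."""
    return list(smiles)


def atom_level_tokenise(smiles):
    """Split a SMILES string into chemically meaningful tokens."""
    tokens = SMILES_TOKEN.findall(smiles)
    # Fail loudly if any character was not covered by the pattern.
    if "".join(tokens) != smiles:
        raise ValueError(f"Unrecognised characters in SMILES: {smiles!r}")
    return tokens


print(char_level_tokenise("ClCCBr"))             # splits Cl and Br into two characters each
print(atom_level_tokenise("ClCCBr"))             # keeps Cl and Br as single tokens
print(atom_level_tokenise("C[C@@H](N)C(=O)O"))   # bracket atom kept as one token
```

With the character-level tokeniser, the chlorine in ClCCBr contributes a C token identical to the carbon tokens; the atom-level version keeps Cl and Br intact, so the vocabulary maps more directly onto atoms.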