Lecture 7 — Learned Representations · AI4Chemical Sciences Bootcamp

Recording

Recording will be available after the bootcamp.

August 2026

Learning Objectives

Explain the pre-training / fine-tuning paradigm and contrast masked-language-model and contrastive objectives
Identify when transfer learning helps (small labelled dataset, large unlabelled corpus) and when it does not
Fine-tune a pre-trained molecular model (e.g., ChemBERTa or a pre-trained GNN) on a downstream task
Evaluate whether learned representations generalise better than fixed fingerprints on a given benchmark

Key Takeaways

Takeaway 1. Pre-training forces a model to learn general chemical knowledge (atom environments, bond patterns) before it ever sees task labels. This acts as a form of regularisation, so pre-trained models often outperform models trained from scratch in low-data regimes.
Takeaway 2. The choice of pre-training objective matters: masked-atom prediction captures local chemistry, whereas contrastive objectives (e.g., MolCLR) pull together embeddings of different views of the same molecule and so capture global, whole-molecule similarity.
Takeaway 3. Fine-tuning all layers (full fine-tuning) generally beats frozen-encoder approaches once you have more than a few hundred labelled examples; on very small datasets, however, it risks catastrophic forgetting of the pre-trained features.
Takeaway 4. Learned representations are not always better than ECFP. Always run fingerprint baselines before investing in pre-training pipelines.
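
To make the contrast in Takeaway 2 concrete, here is a minimal sketch of the two kinds of objective in plain Python. It is an illustration under stated assumptions, not the implementation used by ChemBERTa or MolCLR: the names mask_tokens and nt_xent_loss are hypothetical, the masking is a simplified version that always substitutes a [MASK] token, and NT-Xent (the SimCLR-style loss) is assumed here as a representative contrastive objective. Embeddings are plain lists of floats and are assumed to be non-zero.

```python
import math
import random

MASK = "[MASK]"  # hypothetical mask symbol; real tokenizers define their own


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Masked-prediction corruption: hide some tokens and keep them as targets.

    Returns (corrupted, targets). targets[i] holds the original token at a
    masked position and None elsewhere (those positions contribute no loss).
    """
    rng = random.Random(seed)
    corrupted, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            corrupted.append(MASK)
            targets.append(tok)
        else:
            corrupted.append(tok)
            targets.append(None)
    return corrupted, targets


def _cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nt_xent_loss(view_a, view_b, temperature=0.1):
    """NT-Xent contrastive loss over a batch of paired embeddings.

    view_a[i] and view_b[i] embed two augmented views of molecule i (the
    positive pair); every other embedding in the batch acts as a negative.
    """
    z = list(view_a) + list(view_b)
    n = len(view_a)
    total = 0.0
    for i in range(2 * n):
        pos = (i + n) % (2 * n)  # index of the other view of the same molecule
        logits = [_cosine(z[i], z[j]) / temperature
                  for j in range(2 * n) if j != i]
        pos_logit = _cosine(z[i], z[pos]) / temperature
        # Log-sum-exp over all non-self pairs, shifted by the max for stability.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denominator - pos_logit
    return total / (2 * n)
```

The masked objective only asks the model to recover individual hidden tokens from their neighbours, which is why it tends to reward local chemical context. The contrastive loss never looks at individual tokens: it compares whole-molecule embeddings, so it is low when the two views of each molecule agree (for example, nt_xent_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])) and high when the pairing is shuffled.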