Lecture 2 — Evolution of Molecular Representations

Recording

Recording will be available after the bootcamp.

August 2026

Learning Objectives

Read and write SMILES, InChI, and SELFIES strings for simple organic molecules
Compute Morgan (ECFP) fingerprints and explain how the radius and bit-length hyperparameters affect them
Compare expert-crafted descriptors, fingerprints, and learned graph-based representations
Choose an appropriate representation for a given property-prediction task
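
To make the radius and bit-length hyperparameters from the second objective concrete, the following is a minimal, standard-library-only sketch of the idea behind Morgan-style fingerprints. It is a simplification, not the ECFP algorithm as implemented in any toolkit: it ignores bond orders, charges, and hydrogens, and does not remove duplicate substructures. The names `toy_morgan_bits` and `_stable_hash`, and the atom-list and bond-list input format, are assumptions made for this illustration.

```python
import zlib


def _stable_hash(obj):
    """Deterministic 32-bit hash (the built-in hash() is salted for strings)."""
    return zlib.crc32(repr(obj).encode("utf-8"))


def toy_morgan_bits(atoms, bonds, radius=2, n_bits=2048):
    """Return the sorted 'on' bit positions of a toy Morgan-style fingerprint.

    atoms: list of element symbols, e.g. ["C", "C", "O"]
    bonds: list of (i, j) index pairs between bonded atoms
    """
    neighbours = {i: [] for i in range(len(atoms))}
    for a, b in bonds:
        neighbours[a].append(b)
        neighbours[b].append(a)

    # Radius 0: each atom is identified by its element and neighbour count.
    ids = [_stable_hash((sym, len(neighbours[i]))) for i, sym in enumerate(atoms)]
    seen = set(ids)

    # Each iteration merges an atom's identifier with those of its neighbours,
    # so the identifier describes an environment one bond larger than before.
    for _ in range(radius):
        ids = [
            _stable_hash((ids[i], tuple(sorted(ids[j] for j in neighbours[i]))))
            for i in range(len(atoms))
        ]
        seen.update(ids)

    # Fold the identifiers into a fixed-length bit vector. A smaller n_bits
    # makes collisions (two environments sharing one bit) more likely.
    return sorted({identifier % n_bits for identifier in seen})


# Ethanol (heavy atoms only): C-C-O
ethanol = (["C", "C", "O"], [(0, 1), (1, 2)])
bits_r0 = toy_morgan_bits(*ethanol, radius=0)
bits_r2 = toy_morgan_bits(*ethanol, radius=2)
print(len(bits_r0), len(bits_r2))
```

In this sketch, a larger radius can only add bits, because every smaller environment is still recorded. The result also does not depend on the order in which the atoms are listed, since the fingerprint is computed on the molecular graph rather than on a string.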

Key Takeaways

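Takeaway 1. SMILES strings are compact and human-readable, but one molecule can be written as many different valid strings. Always canonicalise before deduplicating, splitting, and featurising; otherwise the same molecule can end up in both the training and test sets under different strings, which leaks data.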
Takeaway 2. Morgan fingerprints (ECFP) remain a strong baseline: they are interpretable, fast to compute, and often perform well on small datasets, where learned neural representations tend to overfit.
Takeaway 3. Representations are not neutral: they embed assumptions about which structural features matter. A fingerprint that ignores 3-D geometry assigns the same vector to every conformer of a molecule, so it will fail on conformer-sensitive properties.
Takeaway 4. SELFIES strings are valid by construction: every string decodes to a molecule that satisfies basic valence rules. This makes them well suited to generative models, where invalid SMILES outputs would break the pipeline.
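
The leakage risk in Takeaway 1 can be sketched with the standard library alone. The lookup table below is a hypothetical stand-in for a real canonicaliser; in practice one would pass a function built on a cheminformatics toolkit (for example, RDKit's `Chem.MolToSmiles(Chem.MolFromSmiles(s))`). The helper names `toy_canonicalise` and `split_by_molecule`, and the hash-based assignment rule, are assumptions made for this illustration.

```python
import hashlib

# Hypothetical stand-in for a toolkit canonicaliser: maps a few alternative
# SMILES spellings of ethanol and benzene to one canonical string each.
_TOY_CANONICAL = {
    "CCO": "CCO",
    "OCC": "CCO",
    "C(C)O": "CCO",
    "c1ccccc1": "c1ccccc1",
    "C1=CC=CC=C1": "c1ccccc1",
}


def toy_canonicalise(smiles):
    return _TOY_CANONICAL[smiles]


def split_by_molecule(smiles_list, canonicalise, test_fraction=0.2):
    """Deduplicate on the canonical form, then assign each molecule to
    exactly one of (train, test) based on a hash of its canonical string."""
    train, test, seen = [], [], set()
    for smiles in smiles_list:
        key = canonicalise(smiles)
        if key in seen:
            continue  # same molecule written differently: keep one copy
        seen.add(key)
        bucket = int(hashlib.sha256(key.encode("utf-8")).hexdigest(), 16) % 100
        (test if bucket < test_fraction * 100 else train).append(key)
    return train, test


data = ["CCO", "OCC", "C(C)O", "c1ccccc1", "C1=CC=CC=C1"]

# With canonicalisation, five strings collapse to two molecules.
train, test = split_by_molecule(data, toy_canonicalise)

# Without it (identity function), all five strings are treated as distinct,
# so spellings of the same molecule may land on both sides of the split.
naive_train, naive_test = split_by_molecule(data, lambda s: s)

print(len(train) + len(test), len(naive_train) + len(naive_test))
```

Because the assignment depends only on the canonical string, a molecule always lands on the same side of the split, however it was originally written.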