From SMILES strings to molecular graphs — how we encode chemical structure for machine learning.
Recording
Recording will be available after the bootcamp.
August 2026
Learning Objectives
Key Takeaways
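Takeaway 1. SMILES strings are compact and human-readable, but one molecule can be written as many different valid strings (ethanol is both CCO and OCC). Always canonicalise before deduplicating, splitting, and featurising; otherwise the same molecule can land in both the training and test sets under different strings and leak data.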
Takeaway 2. Morgan fingerprints (ECFP) remain a strong baseline: they are interpretable, fast to compute, and perform well on small datasets where neural representations overfit.
Takeaway 3. Representations are not neutral: they embed assumptions about which structural features matter. A fingerprint that ignores 3-D geometry assigns the same vector to every conformer of a molecule, so it cannot capture conformer-sensitive properties.
Takeaway 4. SELFIES are valid by construction: every SELFIES string decodes to a molecule that satisfies valence rules. This makes them a strong choice of string encoding for generative models where invalid SMILES outputs would break the pipeline.
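
To make Takeaways 1 and 2 concrete, here is a minimal pure-Python sketch of a Morgan-style circular fingerprint. It is a simplified illustration, not RDKit's ECFP implementation: molecules are hand-built heavy-atom graphs rather than parsed SMILES, implicit hydrogens are ignored, and the duplicate-substructure removal of real ECFP is skipped. The names `morgan_like_bits`, `tanimoto`, and `_stable_hash` are hypothetical helpers introduced for this example. In practice a cheminformatics toolkit would handle parsing, canonicalisation, and fingerprinting.

```python
import hashlib


def _stable_hash(obj) -> int:
    """Deterministic 32-bit hash (the built-in hash() is salted for strings)."""
    digest = hashlib.sha256(repr(obj).encode()).digest()
    return int.from_bytes(digest[:4], "big")


def morgan_like_bits(atoms, bonds, radius=2, n_bits=2048):
    """Return the set of 'on' bit positions for a molecular graph.

    atoms: list of element symbols, e.g. ["C", "C", "O"]
    bonds: list of (i, j, order) tuples indexing into atoms
    """
    neighbours = {i: [] for i in range(len(atoms))}
    for i, j, order in bonds:
        neighbours[i].append((order, j))
        neighbours[j].append((order, i))

    # Iteration 0: identifiers depend only on local atom invariants
    # (here just element and heavy-atom degree, an assumption of this sketch).
    ids = [_stable_hash((sym, len(neighbours[i]))) for i, sym in enumerate(atoms)]
    on_bits = {x % n_bits for x in ids}

    # Each further iteration folds in the neighbours' previous identifiers,
    # so identifiers describe environments of growing radius.
    for _ in range(radius):
        new_ids = []
        for i in range(len(atoms)):
            # Sorting makes the result independent of atom numbering.
            env = tuple(sorted((order, ids[j]) for order, j in neighbours[i]))
            new_ids.append(_stable_hash((ids[i], env)))
        ids = new_ids
        on_bits |= {x % n_bits for x in ids}
    return on_bits


def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


# Ethanol with two different atom orderings, as in the SMILES "CCO" and "OCC".
fp_cco = morgan_like_bits(["C", "C", "O"], [(0, 1, 1), (1, 2, 1)])
fp_occ = morgan_like_bits(["O", "C", "C"], [(0, 1, 1), (1, 2, 1)])

# Dimethyl ether ("COC"): same atoms, different connectivity.
fp_coc = morgan_like_bits(["C", "O", "C"], [(0, 1, 1), (1, 2, 1)])

print("ethanol vs reordered ethanol:", tanimoto(fp_cco, fp_occ))
print("ethanol vs dimethyl ether:", tanimoto(fp_cco, fp_coc))
```

Because each identifier is built from sorted neighbour environments, the two orderings of ethanol produce identical bit sets, while dimethyl ether differs. A raw string comparison of "CCO" and "OCC" would treat them as different molecules, which is why canonicalising or comparing at the graph level matters before splitting a dataset.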