Index a corpus of chemistry abstracts and build a retrieval-augmented Q&A system that grounds LLM answers in the literature.
What you will learn
Instructions
Open the Colab notebook using the button above. Run each cell in order; cells marked Exercise require you to fill in code.
Set up the environment. Install sentence-transformers, faiss-cpu, and openai (or use a local model via ollama). Load the provided dataset of 500 chemistry paper abstracts.
Chunk and embed. Split each abstract into 200-token chunks with 20-token overlap. Embed all chunks using all-MiniLM-L6-v2. Store embeddings in a FAISS flat-L2 index.
Implement retrieval. Complete the retrieve(query, k=5) function: embed the query, search the FAISS index, and return the top-k chunks with their source metadata.
Build the RAG chain. Construct a prompt template that inserts retrieved chunks as context. Call the LLM (gpt-4o-mini or a local model) with the augmented prompt. Parse and display the answer with citations.
Evaluate. Run the 10 pre-defined chemistry questions from the notebook. Score each answer for faithfulness (is every claim in the answer supported by the retrieved context, with nothing that contradicts it?) and relevance (does it address the question?).
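The chunking, retrieval, and prompt-building steps above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the notebook's solution. `toy_embed` is a hypothetical hashed bag-of-words stand-in for all-MiniLM-L6-v2, and a plain token list stands in for model tokens. A brute-force loop computes the squared L2 distances that a FAISS flat-L2 index would return. The extra `chunks`, `vectors`, and `embed` arguments to `retrieve`, the dictionary keys `text` and `source`, and the prompt wording are also assumptions made for illustration.

```python
import hashlib
import math
import re


def chunk_tokens(tokens, size=200, overlap=20):
    """Split a token list into windows of `size` tokens sharing `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the text
    return chunks


def toy_embed(text, dim=64):
    """Hypothetical stand-in embedder: hashed bag of words, L2-normalised."""
    vec = [0.0] * dim
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        bucket = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def retrieve(query, chunks, vectors, k=5, embed=toy_embed):
    """Return the k chunks nearest to the query by squared L2 distance."""
    q = embed(query)
    scored = []
    for chunk, vec in zip(chunks, vectors):
        dist = sum((a - b) ** 2 for a, b in zip(q, vec))
        scored.append((dist, chunk))
    scored.sort(key=lambda pair: pair[0])  # smallest distance first
    return [chunk for _, chunk in scored[:k]]


def build_prompt(question, retrieved):
    """Insert retrieved chunks as numbered context so the answer can cite them."""
    context = "\n".join(
        f"[{i}] ({c['source']}) {c['text']}"
        for i, c in enumerate(retrieved, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "Cite sources by their bracketed number.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
```

In the notebook the same structure applies with `SentenceTransformer("all-MiniLM-L6-v2").encode` in place of `toy_embed` and a `faiss.IndexFlatL2` index in place of the loop. Keeping the source identifier next to each chunk is what makes the citations in the final answer traceable.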
Questions
Answer these questions as you work through the notebook. Discuss with your neighbour — some have no single right answer.
For the query "What solvents are used in Suzuki coupling?", what are the top-3 retrieved chunks? Do they contain relevant information?
Easy: Compare the LLM answer with retrieval (RAG) vs. without retrieval (bare LLM) for a question about a recent reaction. Which answer is more accurate? Does the bare LLM hallucinate?
Medium: How does increasing k (the number of retrieved chunks) from 3 to 10 affect answer quality and latency? Is there a point of diminishing returns?
Medium: Implement re-ranking: after FAISS retrieval, re-rank the top-10 chunks using a cross-encoder (cross-encoder/ms-marco-MiniLM-L-6-v2) before selecting the top-3 for the prompt. Does re-ranking improve faithfulness scores?
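The shape of the re-ranking step in the last question can be sketched as follows. This is one possible structure, not a reference solution. `rerank` and its `score_pairs` argument are hypothetical names, and `overlap_score` is a toy word-overlap scorer that stands in for the cross-encoder so the sketch runs without downloading a model.

```python
import re


def overlap_score(pairs):
    """Hypothetical stand-in scorer: count distinct query words found in the passage."""
    scores = []
    for query, passage in pairs:
        q_words = set(re.findall(r"[a-z0-9]+", query.lower()))
        p_words = set(re.findall(r"[a-z0-9]+", passage.lower()))
        scores.append(float(len(q_words & p_words)))
    return scores


def rerank(query, candidates, score_pairs=overlap_score, top_n=3):
    """Re-order retrieved chunks by a pairwise relevance score and keep top_n."""
    pairs = [(query, c["text"]) for c in candidates]
    scores = score_pairs(pairs)  # one score per (query, passage) pair, higher is better
    # sorted() is stable, so chunks with equal scores keep their retrieval order
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return [candidates[i] for i in order[:top_n]]
```

To use the model named in the question, pass `CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2").predict` from sentence-transformers as `score_pairs`. It accepts the same list of (query, passage) pairs and returns one score per pair. Retrieving 10 candidates and keeping 3 lets the slower but more precise cross-encoder see only a short list.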
Resources