Tutorial 13 — Chemistry RAG Assistant · AI4Chemical Sciences Bootcamp

What you will learn

Chunk and embed a corpus of chemistry paper abstracts using a sentence-transformer model
Build a FAISS vector index and retrieve the top-k most relevant chunks for a query
Construct a RAG prompt that provides retrieved context to an LLM before answering
Evaluate answer faithfulness and relevance using automated metrics (BERTScore, answer recall)

Instructions

Getting started

Open the Colab notebook using the button above. Run each cell in order; cells marked Exercise require you to fill in code.

Set up the environment. Install sentence-transformers, faiss-cpu, and openai (or use a local model via ollama). Load the provided dataset of 500 chemistry paper abstracts.
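Chunk and embed. Split each abstract into 200-token chunks with a 20-token overlap between consecutive chunks. Embed all chunks using all-MiniLM-L6-v2. Store the embeddings in a FAISS flat-L2 index (faiss.IndexFlatL2), which performs exact nearest-neighbour search.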
Implement retrieval. Complete the retrieve(query, k=5) function: embed the query, search the FAISS index, and return the top-k chunks with their source metadata.
Build the RAG chain. Construct a prompt template that inserts retrieved chunks as context. Call the LLM (gpt-4o-mini or a local model) with the augmented prompt. Parse and display the answer with citations.
Evaluate. Run the 10 pre-defined chemistry questions from the notebook. Score each answer for faithfulness (is every claim supported by the retrieved context, with no contradictions?) and relevance (does it address the question?).
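
The chunking, retrieval, and prompt-building steps above can be sketched without any external libraries. This is a minimal illustration under stated assumptions, not the notebook's solution: tokens are assumed to be available as a plain list, a brute-force squared-L2 search stands in for the FAISS flat-L2 index, and the helper names (chunk_tokens, l2_top_k, build_prompt), the prompt wording, and the "source"/"text" keys are all hypothetical.

```python
def chunk_tokens(tokens, size=200, overlap=20):
    """Split a token list into windows of `size` tokens sharing `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the text
    return chunks


def l2_top_k(query_vec, vectors, k=5):
    """Exact search by squared L2 distance (stand-in for a flat-L2 index)."""
    dists = [
        (sum((q - v) ** 2 for q, v in zip(query_vec, vec)), i)
        for i, vec in enumerate(vectors)
    ]
    return [i for _, i in sorted(dists)[:k]]


def build_prompt(question, chunks):
    """Number each retrieved chunk so the model can cite it as [1], [2], ..."""
    context = "\n".join(
        f"[{n}] ({c['source']}) {c['text']}" for n, c in enumerate(chunks, 1)
    )
    return (
        "Answer using only the context below and cite sources as [n].\n"
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


# Toy data: 450 dummy tokens and three 2-D "embeddings".
chunk_lengths = [len(c) for c in chunk_tokens([f"t{i}" for i in range(450)])]
print(chunk_lengths)  # [200, 200, 90]

nearest = l2_top_k([0.0, 0.0], [[1.0, 1.0], [0.0, 0.1], [3.0, 3.0]], k=2)
print(nearest)  # [1, 0]

prompt = build_prompt(
    "What solvents are used in Suzuki coupling?",
    [{"source": "paper-A", "text": "Toluene/water mixtures were used."}],
)
print(prompt)
```

With a window of 200 and an overlap of 20, each new chunk starts 180 tokens after the previous one, so the last chunk is usually shorter than the rest. In the notebook, the embeddings would come from the sentence-transformer model and the search from the FAISS index rather than from these toy helpers.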

Questions

Answer these questions as you work through the notebook. Discuss with your neighbour — some have no single right answer.

Warm-up

For the query "What solvents are used in Suzuki coupling?", what are the top-3 retrieved chunks? Do they contain relevant information?

Easy

Core

Compare the LLM answer with retrieval (RAG) vs. without retrieval (bare LLM) for a question about a recent reaction. Which answer is more accurate? Does the bare LLM hallucinate?

Medium

Core

How does increasing k (retrieved chunks) from 3 to 10 affect answer quality and latency? Is there a point of diminishing returns?

Medium
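
One way to approach the question about k is to time the full retrieve-then-answer round trip for several values of k and compare the answers side by side. The sketch below assumes two callables with hypothetical signatures, retrieve(question, k=...) returning a list of chunk strings and answer(question, chunks) returning a string; the stand-in functions only exist to make the example runnable and should be replaced by the notebook's own retrieval function and LLM call.

```python
import time


def sweep_k(question, retrieve, answer, ks=(3, 5, 10)):
    """Time retrieval plus generation for each k and collect the answers."""
    rows = []
    for k in ks:
        start = time.perf_counter()
        chunks = retrieve(question, k=k)
        reply = answer(question, chunks)
        rows.append({
            "k": k,
            "seconds": time.perf_counter() - start,
            "context_chars": sum(len(c) for c in chunks),  # rough prompt size
            "answer": reply,
        })
    return rows


# Hypothetical stand-ins so the sketch runs without a model or an index.
def fake_retrieve(question, k=5):
    return [f"chunk {i} about {question}" for i in range(k)]


def fake_answer(question, chunks):
    return f"answer based on {len(chunks)} chunks"


rows = sweep_k("Suzuki coupling solvents", fake_retrieve, fake_answer)
for row in rows:
    print(row["k"], f"{row['seconds']:.6f}s", row["context_chars"], row["answer"])
```

Recording the context size alongside the latency makes it easier to see whether slower answers are explained by longer prompts, and the collected answers can then be scored for faithfulness and relevance at each k.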

Challenge

Implement re-ranking: after FAISS retrieval, re-rank the top-10 chunks using a cross-encoder (cross-encoder/ms-marco-MiniLM-L-6-v2) before selecting the top-3 for the prompt. Does re-ranking improve faithfulness scores?

Challenge
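
The re-ranking step can be sketched independently of the scoring model. In the sketch below, rerank and overlap_score are hypothetical names, and overlap_score is a toy word-overlap scorer used only so the example runs offline. In the notebook, the scorer would presumably be the predict method of a sentence-transformers CrossEncoder loaded with cross-encoder/ms-marco-MiniLM-L-6-v2, which takes a list of (query, passage) pairs and returns one score per pair.

```python
def rerank(query, candidates, score_pairs, top_n=3):
    """Re-order retrieved chunks by a pairwise relevance score, keep the best top_n."""
    pairs = [(query, c) for c in candidates]
    scores = score_pairs(pairs)  # one score per (query, chunk) pair
    ranked = sorted(zip(scores, candidates), key=lambda sc: sc[0], reverse=True)
    return [c for _, c in ranked[:top_n]]


def overlap_score(pairs):
    """Toy stand-in scorer: fraction of query words that appear in the chunk."""
    scores = []
    for query, chunk in pairs:
        q_words = set(query.lower().split())
        c_words = set(chunk.lower().split())
        scores.append(len(q_words & c_words) / max(len(q_words), 1))
    return scores


candidates = [
    "palladium catalyst loading",
    "suzuki coupling solvents include toluene and water",
    "solvents for coupling",
]
top = rerank("suzuki coupling solvents", candidates, overlap_score, top_n=2)
print(top)
```

Passing the scorer in as a function keeps the retrieval code unchanged: the first stage still returns the top-10 chunks, and only the selection of the final top-3 depends on the cross-encoder.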

Resources

Notebook (Colab) · GitHub repo · Paired lecture notes
