Documentation: Backend (backend.py)
This script processes PDF documents into vector embeddings and builds a FAISS index for semantic search.
It is the offline preprocessing pipeline for the Indaba RAG chatbot.
Key Responsibilities
- Load PDFs from a folder.
- Extract raw text using PyPDF2.
- Chunk large documents into smaller overlapping text segments.
- Convert chunks into embeddings using SentenceTransformers.
- Build and persist a FAISS index for similarity search.
- Save the raw chunks for later retrieval.
Step-by-Step Breakdown
Imports and Setup
```python
import os
import pickle

import numpy as np
from PyPDF2 import PdfReader
from sentence_transformers import SentenceTransformer
import faiss
```
- os → file system operations.
- pickle → save the preprocessed chunks.
- numpy → numerical array handling.
- PyPDF2 → extract text from PDF files.
- SentenceTransformer → embedding model (all-MiniLM-L6-v2).
- faiss → efficient similarity search.
Constants:
```python
embedder = SentenceTransformer("all-MiniLM-L6-v2")
INDEX_FILE = "faiss_index.bin"
CHUNKS_FILE = "chunks.pkl"
```
- embedder is the model instance; loading it downloads the model weights (the first run may take a while).
- INDEX_FILE and CHUNKS_FILE define where the FAISS index and the chunks are saved.
Function to Load PDF
```python
def load_pdf(file_path):
    pdf = PdfReader(file_path)
    text = ""
    for page in pdf.pages:
        # extract_text() can return None for image-only pages, so fall back to ""
        text += (page.extract_text() or "") + "\n"
    return text
```
- Reads a PDF file with PyPDF2.
- Extracts text page by page.
- Returns the full document text as a string.
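For example, calling the function on a single file (the path below is just a placeholder) looks like this:

```python
# Hypothetical usage; "vault/example.pdf" is a placeholder path.
text = load_pdf("vault/example.pdf")
print(text[:200])  # preview the first 200 extracted characters
```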
Function for Text Chunking
```python
def chunk_text(text, chunk_size=500, overlap=100):
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        start += chunk_size - overlap
    return chunks
```
Splits the text into chunks of chunk_size characters, shifting by chunk_size - overlap each time (so consecutive chunks overlap by overlap characters).
Representation:
Chunk 1 = 0–500
Chunk 2 = 400–900 (100 overlap)
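To see the overlap in action, here is a quick sanity check with a tiny string and deliberately small parameters (the values are chosen only for illustration):

```python
# Toy example: chunk_size=4, overlap=2, so the window advances 2 characters each step.
print(chunk_text("abcdefghij", chunk_size=4, overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```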
Full Pipeline Info
Walkthrough of the pipeline:
1. Collect chunks for all PDFs:
pdf_folder = "vault" #This is the folder/path pdfs are stored in. all_chunks = [] for filename in os.listdir(pdf_folder): if filename.endswith(".pdf"): text = load_pdf(os.path.join(pdf_folder, filename)) chunks = chunk_text(text) all_chunks.extend(chunks)
- Extracts and chunks the text of each PDF and collects all chunks, as strings, in the all_chunks list.
Note: Order matters (index ids align with order).
2. Embed chunks:
```python
vectors = embedder.encode(all_chunks)
vectors = np.array(vectors)
```
embedder.encode(list_of_texts) returns a list/array of vectors. The exact dtype can vary, while FAISS expects float32, so in practice it is safer to cast the result to float32 explicitly.
Important: embedding all chunks at once can OOM if you have many chunks. Use batching:
```python
vectors = embedder.encode(all_chunks, batch_size=32, show_progress_bar=True)
vectors = np.array(vectors).astype('float32')
```
3. Create FAISS index:
```python
dim = vectors.shape[1]
index = faiss.IndexFlatL2(dim)
index.add(vectors)
```
(Basic)
- Creates a FAISS index and adds all chunk vectors into the index.
(Technical)
IndexFlatL2 = exact (brute-force) nearest neighbor search using L2 distance. Works for small-to-medium collections.
- Pros: simple and exact.
- Cons: slow on large collections.
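If the collection ever grows beyond what brute-force search handles comfortably, FAISS also offers approximate indexes. The sketch below (not part of backend.py; nlist is an assumed value) shows one common option, IndexIVFFlat, which trades a training step for faster search:

```python
# Assumed alternative for larger collections (not used in backend.py).
nlist = 100                                  # number of clusters; assumed value
quantizer = faiss.IndexFlatL2(dim)           # coarse quantizer over the same dimension
ivf_index = faiss.IndexIVFFlat(quantizer, dim, nlist)
ivf_index.train(vectors)                     # IVF indexes need a training pass
ivf_index.add(vectors)
```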
The index.add(vectors) adds vectors in the same order as all_chunks. FAISS internal ids = 0..N-1 in that order — that’s how you map back to chunks.
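As a sketch of how that mapping is used at query time (the query text and k value below are examples, not code from backend.py):

```python
# Minimal retrieval sketch: search the index, then map ids back to chunk text.
query_vec = embedder.encode(["example question"])
query_vec = np.array(query_vec).astype('float32')  # shape (1, dim)
distances, ids = index.search(query_vec, 3)        # top-3 nearest chunks
top_chunks = [all_chunks[i] for i in ids[0]]       # ids align with all_chunks order
```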
4. Save index and chunks:
```python
faiss.write_index(index, INDEX_FILE)
with open(CHUNKS_FILE, "wb") as f:
    pickle.dump(all_chunks, f)
```
- Saves the FAISS index to faiss_index.bin.
- Saves the chunks (raw text) to chunks.pkl.

These files are later loaded by the Streamlit frontend at runtime.
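A minimal sketch of how the frontend side can load them back (the actual frontend code is documented separately):

```python
import pickle
import faiss

# Reload the persisted index and the matching chunk texts.
index = faiss.read_index("faiss_index.bin")
with open("chunks.pkl", "rb") as f:
    all_chunks = pickle.load(f)
```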
How to run this script
Make sure your PDFs are in the folder set by pdf_folder (vault/ in this script), then run:

```bash
python -m backend
```