DEV Community

Oswin Heman-Ackah

Building a chatbot with Python (Backend)

Documentation: Backend (backend.py)
This script processes PDF documents into vector embeddings and builds a FAISS index for semantic search.
It is the offline preprocessing pipeline for the Indaba RAG chatbot.

Key Responsibilities

  1. Load PDFs from a folder.
  2. Extract raw text using PyPDF2.
  3. Chunk large documents into smaller overlapping text segments.
  4. Convert chunks into embeddings using SentenceTransformers.
  5. Build and persist a FAISS index for similarity search.
  6. Save the raw chunks for later retrieval.

Step-by-Step Breakdown

Imports and Setup

```python
import os
import pickle

import numpy as np
from PyPDF2 import PdfReader
from sentence_transformers import SentenceTransformer
import faiss
```

os → file system operations (listing the PDF folder).
pickle → save preprocessed chunks.
numpy → numerical array handling.
PyPDF2 → extract text from PDF files.
SentenceTransformer → embedding model (all-MiniLM-L6-v2).
faiss → efficient similarity search.

Constants:

```python
embedder = SentenceTransformer("all-MiniLM-L6-v2")
INDEX_FILE = "faiss_index.bin"
CHUNKS_FILE = "chunks.pkl"
```
  • embedder is the model instance; loading it downloads the model weights, so the first run may take some time.
  • INDEX_FILE and CHUNKS_FILE define where the FAISS index and the chunks are saved.

Function to Load PDF

```python
def load_pdf(file_path):
    pdf = PdfReader(file_path)
    text = ""
    for page in pdf.pages:
        # extract_text() can return None for pages with no extractable text
        text += (page.extract_text() or "") + "\n"
    return text
```
  • Reads a PDF file with PyPDF2.
  • Extracts text page by page.
  • Returns the full document text as a string.

Function for Text Chunking

```python
def chunk_text(text, chunk_size=500, overlap=100):
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        start += chunk_size - overlap
    return chunks
```
  • Splits the text into chunks of chunk_size characters, shifting by chunk_size - overlap each time (so consecutive chunks overlap by overlap characters).

  • Representation:
    Chunk 1 = 0–500
    Chunk 2 = 400–900 (100 overlap)
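
The sliding window is easy to verify on a toy string. This sketch re-declares chunk_text with the same logic as above, using a tiny chunk size so the overlapping boundaries are visible:

```python
def chunk_text(text, chunk_size=500, overlap=100):
    chunks = []
    start = 0
    while start < len(text):
        # each chunk starts (chunk_size - overlap) characters after the last
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

# Tiny parameters make the overlap visible:
print(chunk_text("abcdefghij", chunk_size=4, overlap=2))
# → ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```

Note the last chunk can be shorter than chunk_size; that is harmless for embedding.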

Full Pipeline
Walkthrough of the remaining steps:

1. Collect chunks for all PDFs:

```python
pdf_folder = "vault"  # the folder the PDFs are stored in
all_chunks = []
for filename in os.listdir(pdf_folder):
    if filename.endswith(".pdf"):
        text = load_pdf(os.path.join(pdf_folder, filename))
        chunks = chunk_text(text)
        all_chunks.extend(chunks)
```
  • Extracts text and chunks for each PDF and keeps every chunk in the all_chunks list as strings.

Note: Order matters (index ids align with order).

2. Embed chunks:

```python
vectors = embedder.encode(all_chunks)
vectors = np.array(vectors)
```
  • embedder.encode(list_of_texts) returns a list/array of vectors. Depending on the library version and settings, the dtype may not be float32 — but FAISS expects float32 — so in practice it is safer to force the dtype explicitly.

  • Important: embedding all chunks at once can OOM if you have many chunks. Use batching:

```python
vectors = embedder.encode(all_chunks, batch_size=32, show_progress_bar=True)
vectors = np.array(vectors).astype("float32")
```

3. Create FAISS index:

```python
dim = vectors.shape[1]
index = faiss.IndexFlatL2(dim)
index.add(vectors)
```

(Basic)

  • Creates a FAISS index and adds all chunk vectors into the index.

(Technical)
IndexFlatL2 = exact (brute-force) nearest neighbor search using L2 distance. Works for small-to-medium collections.

  • Pros: simple and exact.
  • Cons: slow on large collections.

The index.add(vectors) adds vectors in the same order as all_chunks. FAISS internal ids = 0..N-1 in that order — that’s how you map back to chunks.

4. Save index and chunks:

```python
faiss.write_index(index, INDEX_FILE)
with open(CHUNKS_FILE, "wb") as f:
    pickle.dump(all_chunks, f)
```

Saves FAISS index to faiss_index.bin.
Saves chunks (raw text) to chunks.pkl.

These files are later loaded by the Streamlit frontend at runtime.

How to run this script

Make sure your PDFs are in the vault/ folder (matching pdf_folder above), then run:

python -m backend
