Live Demo: PDFSUMMARIZATION Site
GitHub Code
Optimized PDF Q&A Assistant with Streamlit, LangChain, Hugging Face, and Supabase
When working on AI projects, you might notice that code that runs fast on Google Colab slows down on a local machine. The solution is to make the pipeline efficient: avoid redundant processing, batch expensive operations, and store results so they can be reused.
In this blog, I’ll walk you through building a PDF Q&A Assistant whose pipeline works like this:
Upload a PDF → hash & check if already stored → extract, embed, and save chunks in Supabase → take user’s question → retrieve relevant chunks → refine with LLM → display answer.
Tech Stack Used
- Streamlit → Front-end UI and deployment
- LangChain → Works with LLMs, connecting the AI “brain”
- Hugging Face → Provides powerful pre-trained models
- Supabase → Vector database for storing and retrieving PDF data
Configuration
```python
import streamlit as st
from sentence_transformers import SentenceTransformer
from supabase import create_client
from huggingface_hub import InferenceClient

SUPABASE_URL = st.secrets["SUPABASE_URL"]
SUPABASE_KEY = st.secrets["SUPABASE_KEY"]
HF_TOKEN = st.secrets["HF_TOKEN"]  # Hugging Face token

supabase = create_client(SUPABASE_URL, SUPABASE_KEY)
model = SentenceTransformer('all-MiniLM-L6-v2')
hf_client = InferenceClient(api_key=HF_TOKEN)
```
Here, Supabase is used for storage, a SentenceTransformer model handles embeddings, and Hugging Face provides an LLM client for inference.
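One detail worth checking before creating the Supabase table: the `embedding` column's dimensionality has to match the model's output. A quick sanity check, assuming the `model` defined above:

```python
# all-MiniLM-L6-v2 produces 384-dimensional embeddings, so the Supabase
# "embedding" column should be a pgvector column of size 384, e.g. vector(384).
print(model.get_sentence_embedding_dimension())  # 384
```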
Hash and Extract PDF Data
```python
import fitz  # PyMuPDF (faster alternative to pdfplumber)
import hashlib

def hash_pdf(pdf_path):
    with open(pdf_path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()

def extract_and_chunk(pdf_path, chunk_size=500):
    doc = fitz.open(pdf_path)
    text = " ".join([page.get_text() for page in doc])
    words = text.split()
    chunks = [' '.join(words[i:i+chunk_size]) for i in range(0, len(words), chunk_size)]
    return chunks
```
- hashlib → creates a unique fingerprint (hash) of the PDF, so the same file isn’t processed twice.
- fitz (PyMuPDF) → efficiently extracts text from the PDF; the text is then split into fixed-size word chunks (see the usage sketch below).
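Here is a quick usage sketch of the two helpers (the file name is just a placeholder):

```python
# Illustrative only: "sample.pdf" is a hypothetical local file
pdf_id = hash_pdf("sample.pdf")           # e.g. "9e107d9d372bb6826bd81d3542a419d6"
chunks = extract_and_chunk("sample.pdf")  # list of ~500-word text chunks
print(pdf_id, len(chunks))
```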
Embed, Store, and Retrieve
```python
def embed_chunks(chunks):
    return model.encode(chunks, batch_size=16, show_progress_bar=True).tolist()
```
```python
def store_to_supabase(chunks, embeddings, pdf_id):
    data = [{
        "id": f"{pdf_id}_chunk{i+1}",  # unique per PDF, so chunks of different PDFs don't overwrite each other
        "pdf_id": pdf_id,
        "text": chunk,
        "embedding": embedding
    } for i, (chunk, embedding) in enumerate(zip(chunks, embeddings))]
    supabase.table("documents1").upsert(data).execute()
```
```python
def retrieve_chunks(query, pdf_id, top_k=10):
    query_embedding = model.encode(query).tolist()
    response = supabase.rpc("match_documents", {
        "query_embedding": query_embedding,
        "match_count": top_k,
        "pdf_id_filter": pdf_id
    }).execute()
    relevant_chunk = [row["text"] for row in response.data] if response.data else []
    return relevant_chunk
```
- Embed Chunks → Convert text chunks into embeddings (vectors).
- Store in Supabase → Save text + embeddings for future queries.
- Retrieve Chunks → Find the most relevant text chunks via semantic similarity search.
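Note that `retrieve_chunks` calls a `match_documents` Postgres function through `supabase.rpc`, so that function (and the `documents1` table with a pgvector `embedding` column) must already exist in your Supabase project. With those in place, the three helpers chain together like this (the file name is a placeholder):

```python
# End-to-end sketch, assuming "sample.pdf" exists and Supabase is set up
chunks = extract_and_chunk("sample.pdf")
embeddings = embed_chunks(chunks)
pdf_id = hash_pdf("sample.pdf")
store_to_supabase(chunks, embeddings, pdf_id)

top_chunks = retrieve_chunks("What is the main topic of this document?", pdf_id, top_k=5)
print(top_chunks[:2])
```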
Refine with Hugging Face LLM
```python
def refine_with_llm(relevant_chunk, question):
    refinement_input = "\n\n---\n\n".join(relevant_chunk)
    prompt = f"""
Refine the following extracted text chunks for clarity, conciseness, and improved readability.
Keep the technical meaning accurate and explain any complex terms simply if needed.

Text to refine:
{refinement_input}

Question: {question}"""

    response = hf_client.chat.completions.create(
        model="mistralai/Mixtral-8x7B-Instruct-v0.1",
        messages=[
            {"role": "system", "content": "You are an expert technical editor and writer."},
            {"role": "user", "content": prompt}
        ],
        temperature=0.7,
        max_tokens=500
    )
    refined_text = response.choices[0].message.content
    return refined_text
```
This step ensures that even if retrieved chunks are messy or incomplete, the AI agent refines them into clear, concise, and context-aware answers.
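For example (chunk texts shortened and purely illustrative):

```python
answer = refine_with_llm(
    relevant_chunk=[
        "Self-attention lets each token weigh every other token in the sequence...",
        "Multi-head attention runs several attention operations in parallel...",
    ],
    question="What is self-attention?",
)
print(answer)
```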
Streamlit Front-End
```python
import uuid
import os
import streamlit as st

st.set_page_config(page_title="PDF Q&A Assistant")
st.title("📄 Ask Questions About Your PDF")

uploaded_file = st.file_uploader("Upload a PDF", type="pdf")

if uploaded_file:
    with st.spinner("Processing PDF..."):
        # Save the upload to a temporary file with a unique name
        pdf_path = f"temp_{uuid.uuid4().hex}.pdf"
        with open(pdf_path, "wb") as f:
            f.write(uploaded_file.read())

        # Hash the PDF and check whether it was already processed
        pdf_id = hash_pdf(pdf_path)
        existing = supabase.table("documents1").select("id").eq("pdf_id", pdf_id).execute()

        if existing.data:
            st.warning("⚠️ This PDF has already been processed. You can still ask questions.")
        else:
            chunks = extract_and_chunk(pdf_path)
            embeddings = embed_chunks(chunks)
            store_to_supabase(chunks, embeddings, pdf_id)

        # Clean up the temporary file
        os.remove(pdf_path)
        st.success("PDF ready for Q&A.")

    question = st.text_input("Ask a question about the uploaded PDF:")

    if question:
        with st.spinner("Generating answer..."):
            results = retrieve_chunks(question, pdf_id)
            if not results:
                st.error("No relevant chunks found.")
            else:
                answer = refine_with_llm(results, question)
                st.markdown("### Answer:")
                st.write(answer)
```
Explanation:
- UI Setup → Streamlit sets page config, title, and PDF uploader.
- Temporary Save → Uploaded PDF is saved locally with a unique name.
- Hashing → Generate an MD5 hash to uniquely identify the PDF.
- Check Supabase → Skip processing if the PDF was already stored.
- Extract & Chunk → Pull text from the PDF and split it into word chunks.
- Embed Chunks → Convert chunks into vector embeddings for semantic search.
- Store in Supabase → Save chunks, embeddings, and PDF ID in the database.
- Clean Up → Remove the temporary PDF file after processing.
- Ask Question → User inputs a question about the uploaded PDF.
- Retrieve Chunks → Fetch most relevant chunks from Supabase via similarity search.
- Refine Answer → LLM polishes the retrieved text into a clear, concise response.
- Display Result → Show the AI-generated answer in the Streamlit app.
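To try it locally, put `SUPABASE_URL`, `SUPABASE_KEY`, and `HF_TOKEN` in `.streamlit/secrets.toml` and launch the app with `streamlit run app.py` (assuming the code above lives in `app.py`); on Streamlit Community Cloud, the same values go into the app's Secrets settings.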