From PDF to Summary: Building an AI Agent with Python & Vector Databases - Basic

Live Demo: PDFSUMMARIZATION Site

Sample PDF: Download Here

GitHub: Code

The PDF Summarization AI Agent is an AI-powered tool that summarizes lengthy PDFs and answers questions based only on their content. It's useful when you need a quick overview without reading the entire document.

  • Summarizes large PDF files into concise overviews.
  • Answers user questions only from the uploaded PDF.
  • Formats responses clearly and preserves technical accuracy.

Used By

Researchers → Extract key findings from academic papers.
Lawyers → Summarize contracts & compliance documents.
Business Analysts → Turn meeting transcripts into quick insights.
Finance Teams → Condense invoices & financial statements.
Students → Create study notes from textbooks.

Tech Used

Streamlit → Front-end & deployment.
LangChain → LLM integration & chaining workflows.
Hugging Face → Pre-trained AI models (e.g., Mixtral-8x7B).
Supabase → Vector database for storing PDF embeddings.

How It Works

  1. Extract text from the PDF.
  2. Chunk the text into smaller segments (for large PDFs).
  3. Embed each chunk into vector form using a transformer model.
  4. Store the embeddings in the Supabase vector DB.
  5. Perform a similarity search to find the most relevant chunks for a query.
  6. Use a Hugging Face model to refine and format the answer.

Key Concepts

Chaining

A method of breaking a complex task into sequential steps, where the output of one step feeds into the next.
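A toy sketch of the idea (the helper functions below are trivial placeholders for illustration, not the article's actual pipeline):

# Toy chaining sketch: each step's output feeds the next step's input.
def extract(source: str) -> str:
    return source.strip()              # pretend: pull raw text from a PDF

def chunk(text: str) -> list[str]:
    return text.split(". ")            # pretend: split text into segments

def summarize(chunks: list[str]) -> str:
    return chunks[0] + "..."           # pretend: condense the segments

print(summarize(chunk(extract("First point. Second point. Third point."))))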

Embedding

A representation of text, images, or audio as points in a semantic vector space.
Similar items (e.g., mobile, smartphone, cell phone) are stored close together in this space.
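A quick way to see this with the same sentence-transformers model used later in this article (the word list is just an example):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
words = ["mobile", "smartphone", "cell phone", "banana"]
vectors = model.encode(words)

# Cosine similarity: near-synonyms score close to 1, unrelated terms lower.
print(util.cos_sim(vectors[0], vectors[1]).item())  # mobile vs smartphone
print(util.cos_sim(vectors[0], vectors[3]).item())  # mobile vs banana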

Installation

pip install pdfplumber sentence-transformers supabase
  • pdfplumber → Extract text from PDF.
  • sentence-transformers → Convert text into embeddings.
  • supabase → Store and search embeddings.

Supabase Setup

  1. Create a Supabase account.
  2. Start a new project and copy:
    • Project URL
    • API Key
  3. Enable vector extension:
CREATE EXTENSION IF NOT EXISTS vector SCHEMA extensions;
  4. Create the documents1 table:
CREATE TABLE documents1 (
  id TEXT PRIMARY KEY,
  text TEXT,
  pdf_id TEXT,
  embedding VECTOR(384)
);
  5. Create a similarity search function:
CREATE FUNCTION match_documents(
  query_embedding VECTOR(384),
  match_count INT
)
RETURNS TABLE (
  id TEXT,
  text TEXT
)
LANGUAGE plpgsql STABLE
AS $$
BEGIN
  RETURN QUERY
  SELECT documents1.id, documents1.text
  FROM documents1
  -- <-> is pgvector's Euclidean (L2) distance; smaller means more similar
  ORDER BY documents1.embedding <-> query_embedding
  LIMIT match_count;
END;
$$;
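Optionally, once the table grows past a few thousand rows, a pgvector approximate index can speed up the ORDER BY search. A minimal sketch (the lists value is a tuning assumption, not a requirement):

CREATE INDEX ON documents1
USING ivfflat (embedding vector_l2_ops)  -- vector_l2_ops matches the <-> operator above
WITH (lists = 100);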

PDF Processing

1. Upload PDF (Google Colab)

from google.colab import files

uploaded = files.upload()

2. Extract & Chunk

import pdfplumber

def extract_and_chunk(pdf_path, chunk_size=500):
    # Pull text from every page; pages with no extractable text contribute "".
    with pdfplumber.open(pdf_path) as pdf:
        text = "".join(page.extract_text() or "" for page in pdf.pages)
    # Split into fixed-size character chunks.
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    return chunks
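A quick sanity check of the chunking before storing anything (this assumes the uploaded file is named Sample.pdf, as in the next section):

chunks = extract_and_chunk("Sample.pdf")
print(f"{len(chunks)} chunks; first 100 characters of chunk 0:")
print(chunks[0][:100])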

Store in Supabase

from supabase import create_client
from sentence_transformers import SentenceTransformer

supabase_url = "YOUR_SUPABASE_URL"
supabase_key = "YOUR_API_KEY"
supabase = create_client(supabase_url, supabase_key)

model = SentenceTransformer('all-MiniLM-L6-v2')

pdf_path = "Sample.pdf"
chunks = extract_and_chunk(pdf_path)
embeddings = model.encode(chunks).tolist()

data = [
    {"id": f"chunk_{i}", "text": chunk, "embedding": embedding, "pdf_id": "doc1"}
    for i, (chunk, embedding) in enumerate(zip(chunks, embeddings))
]
supabase.table("documents1").insert(data).execute()
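For very large PDFs, a single insert call may hit request payload limits. A hedged sketch that batches the rows instead (the batch size of 100 is an arbitrary assumption):

BATCH_SIZE = 100  # assumption: small enough to stay under payload limits
for i in range(0, len(data), BATCH_SIZE):
    supabase.table("documents1").insert(data[i:i + BATCH_SIZE]).execute()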

Query Search

query = "What is the topic?"
query_embedding = model.encode(query).tolist()

response = supabase.rpc(
    "match_documents",
    {"query_embedding": query_embedding, "match_count": 3}
).execute()

relevant_chunks = [row["text"] for row in response.data]
print("\n---\n".join(relevant_chunks))

Hugging Face Integration

  1. Create a Hugging Face account.
  2. Generate a READ API token.
from huggingface_hub import InferenceClient
import os

client = InferenceClient(
    api_key=os.getenv("HUGGINGFACEHUB_API_TOKEN", "YOUR_HF_API_KEY")
)
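A quick smoke test that the token works before the longer refinement call below (uses the same model as the next section; the short prompt is arbitrary):

response = client.chat.completions.create(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=20,
)
print(response.choices[0].message.content)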

Refinement with Mixtral-8x7B

# Join outside the f-string: backslashes inside f-string expressions
# are a syntax error before Python 3.12.
joined_chunks = "\n\n---\n\n".join(relevant_chunks)

prompt = f"""
Refine the following extracted text chunks for clarity, conciseness,
and improved readability. Keep the technical meaning accurate.

Text to refine:
{joined_chunks}
"""

response = client.chat.completions.create(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",
    messages=[
        {"role": "system", "content": "You are an expert technical editor."},
        {"role": "user", "content": prompt}
    ],
    temperature=0.7,
    max_tokens=500
)

print("\nRefined Output:\n")
print(response.choices[0].message.content)

Notes

  • Delete old data before inserting chunks from a new PDF to avoid duplicate ID errors (see the sketch below).
  • Hugging Face request cost and speed depend on the chosen model.
  • The Supabase vector size (384) must match your embedding model's output dimension.
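For the first note, a minimal sketch of clearing a previous document's rows before re-inserting (assuming the pdf_id value "doc1" from the storage step):

# Remove rows from the previous upload so chunk IDs don't collide.
supabase.table("documents1").delete().eq("pdf_id", "doc1").execute()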

The full pipeline: PDF upload → chunking → storing → querying → refining.
