From Pixels to Insight: Building a Unified Multi‑Modal GenAI Knowledge Base

Summary

Modern enterprise knowledge isn’t just text. It lives in PDFs with embedded charts, scanned diagrams, and implicit relationships buried across documents. This article walks through designing a production‑grade, multi‑modal ingestion pipeline that:

  • Parses heterogeneous documents (PDF, Word, etc.)
  • Extracts embedded images → converts them to descriptive text (image2text)
  • Normalizes and chunks content
  • Generates embeddings and structured triples
  • Builds both vector and graph indices
  • Enables hybrid retrieval (semantic + relational)

All implemented as a serverless, event‑orchestrated flow (e.g., Step Functions + Lambdas) using modular services for reading, vision, embedding, indexing, and retrieval—without leaking any sensitive identifiers.


Why Multi‑Modal + Structured Matters

RAG systems relying solely on dense vector similarity can miss:

  • Diagram semantics (architecture views, flow charts)
  • Entity relationships (who owns what, dependencies)
  • Procedural context (step ordering)

By fusing:

  1. Text extraction (PDF parsing, OCR fallback)
  2. Image captioning / vision-to-text (image2text)
  3. Graph construction (subject–predicate–object triples)
  4. Dense embeddings (semantic meaning)

…you unlock richer grounding for LLM responses: precise factual lookup + contextual reasoning over relationships.


High-Level Architecture

flowchart LR
    A["Upload Document(s)"] --> B[Initialize Job]
    B --> C["Read: PDF/Text Parsing"]
    C --> D{Images?}
    D -->|Yes| E["Image2Text (Vision Models)"]
    D -->|No| F[Skip]
    E --> G["Unified Content Stream (Text + Image Captions)"]
    F --> G
    G --> H[Chunking Strategy]
    H --> I[Embedding Generation]
    I --> J["Vector Index (Similarity)"]
    I --> K[Triple Extraction + Graph Index]
    J --> L[Hybrid Retriever]
    K --> L
    L --> M[LLM Answer Synthesis]

Status Lifecycle & Resilience

A robust ingestion pipeline maintains explicit statuses for observability and idempotency:

pending → processed → image2text_completed → chunked → embedded → indexed
   |           |                  |               |
reading_failed  image2text_failed  chunking_failed  ...

Each stage validates preconditions (e.g., must be processed or image2text_completed before chunking) and writes atomic status transitions to a metadata store. This enables safe retries, partial completion, and granular metrics.
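As a minimal sketch of such a gate, assuming a plain dictionary stands in for the metadata store (a real pipeline would use a conditional/atomic update in whatever document or key-value store backs it), the transition table and check might look like this:

# Allowed forward transitions for the lifecycle above (sketch; later failure states
# are elided here just as the lifecycle diagram elides them with "...").
ALLOWED_TRANSITIONS = {
    "pending": {"processed", "reading_failed"},
    "processed": {"image2text_completed", "image2text_failed", "chunked"},  # chunking accepts either
    "image2text_completed": {"chunked", "chunking_failed"},
    "chunked": {"embedded"},
    "embedded": {"indexed"},
}

def transition(metadata_store: dict, doc_id: str, new_status: str) -> bool:
    """Advance doc_id to new_status only if the current status permits it."""
    current = metadata_store.get(doc_id, "pending")
    if new_status not in ALLOWED_TRANSITIONS.get(current, set()):
        return False  # precondition failed: skip or retry later instead of double-processing
    metadata_store[doc_id] = new_status  # in production: a conditional write, not a plain assignment
    return True

Because a repeated call with the same target status simply returns False, retries become safe no-ops rather than duplicate work.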


Core Pipeline Modules (Conceptual)

| Module | Responsibility | Key Considerations |
| --- | --- | --- |
| Reading | Detect file type; extract textual blocks; extract inline images | Hybrid extraction: native PDF libraries + fallback OCR |
| Image2Text | Caption images with a vision model | Early exit if no images; retain page provenance |
| Chunking | Build semantically coherent, token-bounded chunks | Preserve hierarchy; avoid splitting code/examples mid-block |
| Embedding | Generate vectors per chunk | Batch calls; dimension awareness |
| Graph Indexing | Extract triples (subject, predicate, object) | Confidence scoring + schema versioning |
| Retriever | Hybrid vector + graph + metadata fusion | Weighted scoring & reranking |
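One way to keep these modules independently deployable and swappable is a shared stage contract; the sketch below is purely illustrative (the interface names are not from the original pipeline):

from typing import Protocol

class PipelineStage(Protocol):
    """Illustrative contract for a pipeline module (names are hypothetical)."""
    name: str

    def accepts(self, status: str) -> bool:
        """Precondition gate: may this stage run given the document's current status?"""
        ...

    def run(self, payload: dict) -> dict:
        """Do the stage's work and return the payload with an updated status."""
        ...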

Walkthrough: PDF With Embedded Diagrams

  1. Upload technical PDF.
  2. Reading stage extracts page blocks + images.
  3. Image2Text captions each figure ("Sequence diagram showing service A calls service B").
  4. Chunking interleaves text paragraphs & captions while enforcing token limit.
  5. Embedding stage produces dense vectors (e.g., 1024‑dimensional).
  6. Graph extraction derives entity relationships (ServiceA → calls → ServiceB).
  7. Retrieval blends vector similarity + graph expansion.
  8. LLM answers grounded with both chunk text and relationship provenance.

Pseudo-Code (Abstracted)

Reading + Image Extraction

def read_document(file_path: str) -> dict:
    pages = extract_pdf_text(file_path)
    images = extract_pdf_images(file_path)
    return {"pages": pages, "images": images, "status": "processed"}

Vision Enrichment (Image2Text)

def enrich_with_image_captions(processed: dict, vision_model) -> dict:
    if not processed["images"]:
        processed["status"] = "image2text_completed"
        processed["image_captions"] = []
        return processed
    captions = []
    for img in processed["images"]:
        caption = vision_model.describe(img.binary)
        captions.append({"text": caption, "page": img.page_num, "type": "image_caption"})
    processed["image_captions"] = captions
    processed["status"] = "image2text_completed"
    return processed

Chunking Mixed Modal Content

def build_chunks(pages, image_captions, max_tokens=800):
    stream = interleave(pages, image_captions, strategy="by_page")
    chunks, current, token_count = [], [], 0
    for block in stream:
        bc = count_tokens(block["text"])
        if current and token_count + bc > max_tokens:  # flush only if something accumulated
            chunks.append(join(current))
            current, token_count = [], 0
        current.append(block["text"])
        token_count += bc
    if current:
        chunks.append(join(current))
    return [{"chunk_text": c, "modality_mix": detect_modalities(c)} for c in chunks]
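build_chunks leans on an interleave helper that is not shown above; a minimal by-page version could look like the following, assuming each page block carries page_num and text (the exact shapes depend on your reader):

def interleave(pages, image_captions, strategy="by_page"):
    """Merge page text and image captions into one ordered stream of blocks.

    Assumes pages look like {"page_num": int, "text": str}; captions come from
    enrich_with_image_captions and carry a "page" field.
    """
    if strategy != "by_page":
        raise ValueError(f"unsupported interleave strategy: {strategy}")
    stream = []
    for page in pages:
        stream.append({"text": page["text"], "page": page["page_num"], "type": "page_text"})
        # Place each figure's caption right after the text of the page it appears on.
        stream.extend(c for c in image_captions if c["page"] == page["page_num"])
    return stream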

Embedding Generation

def embed_chunks(chunks, embed_model):
    vectors = []
    for c in chunks:
        v = embed_model.embed(c["chunk_text"])
        vectors.append({**c, "vector": v})
    return vectors

Triple Extraction

def extract_triples(chunks, llm):
    triples = []
    for c in chunks:
        prompt = f"Extract (subject, predicate, object) triples:\n{c['chunk_text']}"
        t = llm.extract_triples(prompt)
        triples.extend(t)
    return normalize_triples(triples)

Hybrid Retrieval Fusion

def hybrid_retrieve(query, vector_index, graph_index, embed_model):
    q_vec = embed_model.embed(query)
    vec_hits = vector_index.search(q_vec, top_k=10)
    related_nodes = graph_index.expand_entities(query, hop_limit=2)
    scored = fuse_scores(vec_hits, related_nodes)
    return rerank(scored, query)
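For orientation, the abstracted stages compose roughly like this end to end; the upsert calls are placeholders for whatever your vector and graph stores expose, and status writes and error handling are left out:

def ingest_and_index(file_path, vision_model, embed_model, llm, vector_index, graph_index):
    """End-to-end sketch chaining the stages sketched above."""
    doc = read_document(file_path)                               # reading
    doc = enrich_with_image_captions(doc, vision_model)          # image2text
    chunks = build_chunks(doc["pages"], doc["image_captions"])   # chunking
    vectors = embed_chunks(chunks, embed_model)                  # embeddings
    vector_index.upsert(vectors)                                 # placeholder index API
    triples = extract_triples(chunks, llm)                       # triple extraction
    graph_index.upsert(triples)                                  # placeholder index API
    return {"chunks": len(chunks), "triples": len(triples)}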

Graph + Vector Synergy

Embedding nodes & edges enables:

  • Semantic expansion (similar conceptual nodes)
  • Multi-hop reasoning with relevance pruning
  • Rich provenance joining text chunks + relationships
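As a rough sketch of the first two points, assuming node embeddings are kept alongside the graph and the graph store exposes a neighbors() lookup (both assumptions, not a specific product API):

import numpy as np

def expand_semantically(query_vec, node_embeddings, graph_index,
                        top_n=5, hop_limit=2, min_sim=0.6):
    """Seed from nodes semantically close to the query, then walk outward with pruning."""
    q = np.asarray(query_vec, dtype=float)
    scored = []
    for node_id, vec in node_embeddings.items():
        v = np.asarray(vec, dtype=float)
        sim = float(np.dot(q, v) / ((np.linalg.norm(q) * np.linalg.norm(v)) or 1.0))
        if sim >= min_sim:                    # relevance pruning at the seed stage
            scored.append((sim, node_id))
    seeds = [node_id for _, node_id in sorted(scored, reverse=True)[:top_n]]
    visited, frontier = set(seeds), set(seeds)
    for _ in range(hop_limit):
        # graph_index.neighbors(node) is an assumed adjacency lookup on the graph store.
        frontier = {nbr for node in frontier for nbr in graph_index.neighbors(node)} - visited
        visited |= frontier
    return visited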

Operational Considerations

| Aspect | Notes |
| --- | --- |
| Throughput | Batch embeddings; parallel captioning where feasible |
| Memory | Guardrails for large PDFs & image sets; stream pages |
| Idempotency | Status validation gates per stage |
| Error isolation | Modality-specific fail states (continue with text) |
| Schema evolution | Version triple schema + migration plan |
| Ranking fusion | Tune α/β/γ weights offline (NDCG/MRR) |

Fusion formula example:

final_score = α * vector_similarity + β * graph_relevance + γ * metadata_boost 
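A minimal fuse_scores implementing that formula might look like the following; the hit and node shapes (score, entities, metadata_boost, relevance) are assumptions about what the vector and graph lookups return, with all inputs pre-normalised to [0, 1]:

def fuse_scores(vec_hits, related_nodes, alpha=0.6, beta=0.3, gamma=0.1):
    """Blend vector similarity, graph relevance, and metadata boosts per candidate chunk."""
    relevance_by_entity = {n["entity"]: n["relevance"] for n in related_nodes}
    fused = []
    for hit in vec_hits:
        # Graph relevance: strongest relevance among entities mentioned by this chunk.
        graph_relevance = max(
            (relevance_by_entity.get(e, 0.0) for e in hit.get("entities", [])),
            default=0.0,
        )
        final_score = (alpha * hit["score"]
                       + beta * graph_relevance
                       + gamma * hit.get("metadata_boost", 0.0))
        fused.append({**hit, "final_score": final_score})
    return sorted(fused, key=lambda h: h["final_score"], reverse=True)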

Performance & Quality Tips

| Challenge | Mitigation |
| --- | --- |
| Redundant captions | Embedding-based dedupe (cosine threshold) |
| Oversized chunks | Adaptive token sizing per density |
| Ambiguous entities | Confidence filtering + fallback noun-phrase rules |
| Hallucinated triples | Require in-text evidence & schema validation |
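For the redundant-captions row, an embedding-based dedupe can be a simple pairwise cosine check against captions already kept; the 0.92 threshold below is an illustrative starting point, not a recommendation from the original pipeline:

import numpy as np

def dedupe_captions(captions, embed_model, threshold=0.92):
    """Drop captions whose embedding is near-identical to one already kept."""
    kept, kept_vecs = [], []
    for caption in captions:
        vec = np.asarray(embed_model.embed(caption["text"]), dtype=float)
        vec = vec / (np.linalg.norm(vec) or 1.0)                  # normalise so dot == cosine
        if any(float(np.dot(vec, kv)) >= threshold for kv in kept_vecs):
            continue                                              # near-duplicate caption, skip
        kept.append(caption)
        kept_vecs.append(vec)
    return kept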

Security & Governance (Generic)

  • File-type allowlist (.pdf, .docx, .pptx, .xlsx, .txt)
  • Malware scanning pre-ingestion
  • PII redaction pass pre-embedding
  • Structured logging (no raw document dumps)
  • Principle of least privilege for each stage
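The file-type allowlist above is the cheapest guardrail to enforce, since it can run before any parsing work begins; a minimal check:

from pathlib import Path

ALLOWED_EXTENSIONS = {".pdf", ".docx", ".pptx", ".xlsx", ".txt"}

def validate_upload(file_path: str) -> None:
    """Reject files whose extension is not on the allowlist before any bytes are parsed."""
    suffix = Path(file_path).suffix.lower()
    if suffix not in ALLOWED_EXTENSIONS:
        raise ValueError(f"unsupported file type: {suffix or '<no extension>'}")

Extension checks are only a first filter; content inspection and the malware scan listed above still run before ingestion proceeds.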

Testing Strategy

| Layer | Test | Goal |
| --- | --- | --- |
| Reader | Unit | Mixed PDF → expected blocks & images |
| Chunker | Unit | Cohesive segmentation, no mid-code splits |
| Embedding | Contract | Dimensions + determinism for identical input |
| Graph | Golden | Known doc yields expected triples |
| Retriever | Integration | Query returns hybrid evidence set |
| End-to-End | Scenario | Upload → query answer grounded in provenance |
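As one concrete example, the embedding contract test can pin dimensionality and determinism using a deterministic test double (FakeEmbedModel below is hypothetical; the 1024 dimension echoes the walkthrough and should match whatever your real model produces):

import hashlib

class FakeEmbedModel:
    """Deterministic stand-in for the real embedding client, used only in tests."""
    def __init__(self, dim=1024):
        self.dim = dim

    def embed(self, text: str):
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        return [digest[i % len(digest)] / 255.0 for i in range(self.dim)]

def test_embedding_contract():
    model = FakeEmbedModel(dim=1024)
    text = "Service A publishes events to Queue Q."
    v1, v2 = model.embed(text), model.embed(text)
    assert len(v1) == 1024    # dimension registered with the vector index
    assert v1 == v2           # identical input must yield identical vectors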

Observability Essentials

  • Stage latency (p50/p95)
  • Failure counts by stage & file type
  • Chunks per document distribution
  • Triple density (per 1k tokens)
  • Retrieval latency breakdown (vector vs graph)
  • Fusion contribution (percentage of answers using graph expansion)

Future Extensions

  1. Multi-modal joint embeddings (text + image per chunk)
  2. Temporal predicates (time-aware graph queries)
  3. Incremental re-chunking on document updates (diff-based)
  4. Active learning for low-confidence triples
  5. Structured citation anchors in generated answers

Prompt Pattern for Safe Triple Extraction

You are an information extraction agent.
Extract factual (subject, predicate, object) triples ONLY if explicitly implied by the text.
Return JSON array. Ignore speculative relationships.

Text: "Service A asynchronously publishes events to Queue Q. Service B subscribes to Queue Q."

Output:
[
  {"subject": "Service A", "predicate": "publishes to", "object": "Queue Q"},
  {"subject": "Service B", "predicate": "subscribes to", "object": "Queue Q"}
]
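Pairing this prompt with a strict parser keeps hallucinated triples out of the graph; the sketch below enforces well-formed JSON and a literal in-text evidence check (fuzzier entity matching is a natural refinement):

import json

REQUIRED_KEYS = {"subject", "predicate", "object"}

def parse_and_validate_triples(llm_output: str, source_text: str):
    """Keep only well-formed triples whose subject and object actually appear in the source text."""
    try:
        candidates = json.loads(llm_output)
    except json.JSONDecodeError:
        return []  # malformed output: fail closed rather than index junk
    validated = []
    for triple in candidates:
        if not isinstance(triple, dict) or not REQUIRED_KEYS.issubset(triple):
            continue  # drop triples missing required fields
        if triple["subject"] in source_text and triple["object"] in source_text:
            validated.append(triple)
    return validated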

Implementation Checklist (Condensed)

  • Parser abstraction (text + images)
  • Vision caption enrichment with provenance
  • Adaptive chunking engine
  • Embedding batcher & dimension registry
  • Triple extraction microservice
  • Vector + graph indices
  • Hybrid retrieval fusion layer
  • Metrics, statuses, retry semantics
  • Security guardrails & PII handling
  • Evaluation harness & golden corpora

Closing

Multi-modal + graph-aware RAG reduces hallucination, improves specificity, and unlocks reasoning over implicit structure. Build modularly, invest early in observability, and iterate using evaluation-driven refinements.

Feel free to request a focused deep dive (e.g., graph schema design or fusion scoring) if helpful.

About the Author

Written by Suraj Khaitan
— Gen AI Architect | Working on serverless AI & cloud platforms.
