From Pixels to Insight: Building a Unified Multi‑Modal GenAI Knowledge Base

Summary

Modern enterprise knowledge isn’t just text. It lives in PDFs with embedded charts, scanned diagrams, and implicit relationships buried across documents. This article walks through designing a production‑grade, multi‑modal ingestion pipeline that:

  • Parses heterogeneous documents (PDF, Word, etc.)
  • Extracts embedded images → converts them to descriptive text (image2text)
  • Normalizes and chunks content
  • Generates embeddings and structured triples
  • Builds both vector and graph indices
  • Enables hybrid retrieval (semantic + relational)

All implemented as a serverless, event‑orchestrated flow (e.g., Step Functions + Lambdas) using modular services for reading, vision, embedding, indexing, and retrieval—without leaking any sensitive identifiers.


Why Multi‑Modal + Structured Matters

RAG systems relying solely on dense vector similarity can miss:

  • Diagram semantics (architecture views, flow charts)
  • Entity relationships (who owns what, dependencies)
  • Procedural context (step ordering)

By fusing:

  1. Text extraction (PDF parsing, OCR fallback)
  2. Image captioning / vision-to-text (image2text)
  3. Graph construction (subject–predicate–object triples)
  4. Dense embeddings (semantic meaning)

…you unlock richer grounding for LLM responses: precise factual lookup + contextual reasoning over relationships.


High-Level Architecture

flowchart LR
    A["Upload Document(s)"] --> B[Initialize Job]
    B --> C["Read: PDF/Text Parsing"]
    C --> D{Images?}
    D -->|Yes| E["Image2Text (Vision Models)"]
    D -->|No| F[Skip]
    E --> G["Unified Content Stream (Text + Image Captions)"]
    F --> G
    G --> H[Chunking Strategy]
    H --> I[Embedding Generation]
    I --> J["Vector Index (Similarity)"]
    I --> K[Triple Extraction + Graph Index]
    J --> L[Hybrid Retriever]
    K --> L
    L --> M[LLM Answer Synthesis]

Status Lifecycle & Resilience

A robust ingestion pipeline maintains explicit statuses for observability and idempotency:

pending → processed → image2text_completed → chunked → embedded → indexed
   |           |                  |               |
reading_failed  image2text_failed  chunking_failed  ...

Each stage validates preconditions (e.g., must be processed or image2text_completed before chunking) and writes atomic status transitions to a metadata store. This enables safe retries, partial completion, and granular metrics.
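As a minimal sketch of such a gate, assuming a plain dictionary stands in for the metadata store (a real pipeline would use a conditional/atomic update in whatever document or key-value store backs it), the transition table and check might look like this:

# Allowed forward transitions for the lifecycle above (sketch; later failure states
# are elided here just as the lifecycle diagram elides them with "...").
ALLOWED_TRANSITIONS = {
    "pending": {"processed", "reading_failed"},
    "processed": {"image2text_completed", "image2text_failed", "chunked"},  # chunking accepts either
    "image2text_completed": {"chunked", "chunking_failed"},
    "chunked": {"embedded"},
    "embedded": {"indexed"},
}

def transition(metadata_store: dict, doc_id: str, new_status: str) -> bool:
    """Advance doc_id to new_status only if the current status permits it."""
    current = metadata_store.get(doc_id, "pending")
    if new_status not in ALLOWED_TRANSITIONS.get(current, set()):
        return False  # precondition failed: skip or retry later instead of double-processing
    metadata_store[doc_id] = new_status  # in production: a conditional write, not a plain assignment
    return True

Because a repeated call with the same target status simply returns False, retries become safe no-ops rather than duplicate work.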


Core Pipeline Modules (Conceptual)

| Module | Responsibility | Key Considerations |
| --- | --- | --- |
| Reading | Detect file type; extract textual blocks; extract inline images | Hybrid extraction: native PDF libraries + fallback OCR |
| Image2Text | Caption images with a vision model | Early exit if no images; retain page provenance |
| Chunking | Build semantically coherent, token-bounded chunks | Preserve hierarchy; avoid splitting code/examples mid-block |
| Embedding | Generate vectors per chunk | Batch calls; dimension awareness |
| Graph Indexing | Extract triples (subject, predicate, object) | Confidence scoring + schema versioning |
| Retriever | Hybrid vector + graph + metadata fusion | Weighted scoring & reranking |
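One way to keep these modules independently deployable and swappable is a shared stage contract; the sketch below is purely illustrative (the interface names are not from the original pipeline):

from typing import Protocol

class PipelineStage(Protocol):
    """Illustrative contract for a pipeline module (names are hypothetical)."""
    name: str

    def accepts(self, status: str) -> bool:
        """Precondition gate: may this stage run given the document's current status?"""
        ...

    def run(self, payload: dict) -> dict:
        """Do the stage's work and return the payload with an updated status."""
        ...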

Walkthrough: PDF With Embedded Diagrams

  1. Upload technical PDF.
  2. Reading stage extracts page blocks + images.
  3. Image2Text captions each figure ("Sequence diagram showing service A calls service B").
  4. Chunking interleaves text paragraphs & captions while enforcing token limit.
  5. Embedding stage produces dense vectors (e.g., 1024‑dimensional).
  6. Graph extraction derives entity relationships (ServiceA → calls → ServiceB).
  7. Retrieval blends vector similarity + graph expansion.
  8. LLM answers grounded with both chunk text and relationship provenance.

Pseudo-Code (Abstracted)

Reading + Image Extraction

def read_document(file_path: str) -> dict:
    pages = extract_pdf_text(file_path)
    images = extract_pdf_images(file_path)
    return {"pages": pages, "images": images, "status": "processed"}

Vision Enrichment (Image2Text)

def enrich_with_image_captions(processed: dict, vision_model) -> dict:
    if not processed["images"]:
        processed["status"] = "image2text_completed"
        processed["image_captions"] = []
        return processed
    captions = []
    for img in processed["images"]:
        caption = vision_model.describe(img.binary)
        captions.append({"text": caption, "page": img.page_num, "type": "image_caption"})
    processed["image_captions"] = captions
    processed["status"] = "image2text_completed"
    return processed

Chunking Mixed Modal Content

def build_chunks(pages, image_captions, max_tokens=800):
    stream = interleave(pages, image_captions, strategy="by_page")
    chunks, current, token_count = [], [], 0
    for block in stream:
        bc = count_tokens(block["text"])
        if current and token_count + bc > max_tokens:  # flush only if something accumulated
            chunks.append(join(current))
            current, token_count = [], 0
        current.append(block["text"])
        token_count += bc
    if current:
        chunks.append(join(current))
    return [{"chunk_text": c, "modality_mix": detect_modalities(c)} for c in chunks]
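build_chunks leans on an interleave helper that is not shown above; a minimal by-page version could look like the following, assuming each page block carries page_num and text (the exact shapes depend on your reader):

def interleave(pages, image_captions, strategy="by_page"):
    """Merge page text and image captions into one ordered stream of blocks.

    Assumes pages look like {"page_num": int, "text": str}; captions come from
    enrich_with_image_captions and carry a "page" field.
    """
    if strategy != "by_page":
        raise ValueError(f"unsupported interleave strategy: {strategy}")
    stream = []
    for page in pages:
        stream.append({"text": page["text"], "page": page["page_num"], "type": "page_text"})
        # Place each figure's caption right after the text of the page it appears on.
        stream.extend(c for c in image_captions if c["page"] == page["page_num"])
    return stream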

Embedding Generation

def embed_chunks(chunks, embed_model):
    vectors = []
    for c in chunks:
        v = embed_model.embed(c["chunk_text"])
        vectors.append({**c, "vector": v})
    return vectors

Triple Extraction

def extract_triples(chunks, llm):
    triples = []
    for c in chunks:
        prompt = f"Extract (subject, predicate, object) triples:\n{c['chunk_text']}"
        t = llm.extract_triples(prompt)
        triples.extend(t)
    return normalize_triples(triples)

Hybrid Retrieval Fusion

def hybrid_retrieve(query, vector_index, graph_index, embed_model):
    q_vec = embed_model.embed(query)
    vec_hits = vector_index.search(q_vec, top_k=10)
    related_nodes = graph_index.expand_entities(query, hop_limit=2)
    scored = fuse_scores(vec_hits, related_nodes)
    return rerank(scored, query)
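For orientation, the abstracted stages compose roughly like this end to end; the upsert calls are placeholders for whatever your vector and graph stores expose, and status writes and error handling are left out:

def ingest_and_index(file_path, vision_model, embed_model, llm, vector_index, graph_index):
    """End-to-end sketch chaining the stages sketched above."""
    doc = read_document(file_path)                               # reading
    doc = enrich_with_image_captions(doc, vision_model)          # image2text
    chunks = build_chunks(doc["pages"], doc["image_captions"])   # chunking
    vectors = embed_chunks(chunks, embed_model)                  # embeddings
    vector_index.upsert(vectors)                                 # placeholder index API
    triples = extract_triples(chunks, llm)                       # triple extraction
    graph_index.upsert(triples)                                  # placeholder index API
    return {"chunks": len(chunks), "triples": len(triples)}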

Graph + Vector Synergy

Embedding nodes & edges enables:

  • Semantic expansion (similar conceptual nodes)
  • Multi-hop reasoning with relevance pruning
  • Rich provenance joining text chunks + relationships
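As a rough sketch of the first two points, assuming node embeddings are kept alongside the graph and the graph store exposes a neighbors() lookup (both assumptions, not a specific product API):

import numpy as np

def expand_semantically(query_vec, node_embeddings, graph_index,
                        top_n=5, hop_limit=2, min_sim=0.6):
    """Seed from nodes semantically close to the query, then walk outward with pruning."""
    q = np.asarray(query_vec, dtype=float)
    scored = []
    for node_id, vec in node_embeddings.items():
        v = np.asarray(vec, dtype=float)
        sim = float(np.dot(q, v) / ((np.linalg.norm(q) * np.linalg.norm(v)) or 1.0))
        if sim >= min_sim:                    # relevance pruning at the seed stage
            scored.append((sim, node_id))
    seeds = [node_id for _, node_id in sorted(scored, reverse=True)[:top_n]]
    visited, frontier = set(seeds), set(seeds)
    for _ in range(hop_limit):
        # graph_index.neighbors(node) is an assumed adjacency lookup on the graph store.
        frontier = {nbr for node in frontier for nbr in graph_index.neighbors(node)} - visited
        visited |= frontier
    return visited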

Operational Considerations

| Aspect | Notes |
| --- | --- |
| Throughput | Batch embeddings; parallel captioning where feasible |
| Memory | Guardrails for large PDFs & image sets; stream pages |
| Idempotency | Status validation gates per stage |
| Error isolation | Modality-specific fail states (continue with text) |
| Schema evolution | Version triple schema + migration plan |
| Ranking fusion | Tune α/β/γ weights offline (NDCG/MRR) |

Fusion formula example:

final_score = α * vector_similarity + β * graph_relevance + γ * metadata_boost 
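A minimal fuse_scores implementing that formula might look like the following; the hit and node shapes (score, entities, metadata_boost, relevance) are assumptions about what the vector and graph lookups return, with all inputs pre-normalised to [0, 1]:

def fuse_scores(vec_hits, related_nodes, alpha=0.6, beta=0.3, gamma=0.1):
    """Blend vector similarity, graph relevance, and metadata boosts per candidate chunk."""
    relevance_by_entity = {n["entity"]: n["relevance"] for n in related_nodes}
    fused = []
    for hit in vec_hits:
        # Graph relevance: strongest relevance among entities mentioned by this chunk.
        graph_relevance = max(
            (relevance_by_entity.get(e, 0.0) for e in hit.get("entities", [])),
            default=0.0,
        )
        final_score = (alpha * hit["score"]
                       + beta * graph_relevance
                       + gamma * hit.get("metadata_boost", 0.0))
        fused.append({**hit, "final_score": final_score})
    return sorted(fused, key=lambda h: h["final_score"], reverse=True)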

Performance & Quality Tips

| Challenge | Mitigation |
| --- | --- |
| Redundant captions | Embedding-based dedupe (cosine threshold) |
| Oversized chunks | Adaptive token sizing per density |
| Ambiguous entities | Confidence filtering + fallback noun-phrase rules |
| Hallucinated triples | Require in-text evidence & schema validation |
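For the redundant-captions row, an embedding-based dedupe can be a simple pairwise cosine check against captions already kept; the 0.92 threshold below is an illustrative starting point, not a recommendation from the original pipeline:

import numpy as np

def dedupe_captions(captions, embed_model, threshold=0.92):
    """Drop captions whose embedding is near-identical to one already kept."""
    kept, kept_vecs = [], []
    for caption in captions:
        vec = np.asarray(embed_model.embed(caption["text"]), dtype=float)
        vec = vec / (np.linalg.norm(vec) or 1.0)                  # normalise so dot == cosine
        if any(float(np.dot(vec, kv)) >= threshold for kv in kept_vecs):
            continue                                              # near-duplicate caption, skip
        kept.append(caption)
        kept_vecs.append(vec)
    return kept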

Security & Governance (Generic)

  • File-type allowlist (.pdf, .docx, .pptx, .xlsx, .txt)
  • Malware scanning pre-ingestion
  • PII redaction pass pre-embedding
  • Structured logging (no raw document dumps)
  • Principle of least privilege for each stage
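The file-type allowlist above is the cheapest guardrail to enforce, since it can run before any parsing work begins; a minimal check:

from pathlib import Path

ALLOWED_EXTENSIONS = {".pdf", ".docx", ".pptx", ".xlsx", ".txt"}

def validate_upload(file_path: str) -> None:
    """Reject files whose extension is not on the allowlist before any bytes are parsed."""
    suffix = Path(file_path).suffix.lower()
    if suffix not in ALLOWED_EXTENSIONS:
        raise ValueError(f"unsupported file type: {suffix or '<no extension>'}")

Extension checks are only a first filter; content inspection and the malware scan listed above still run before ingestion proceeds.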

Testing Strategy

| Layer | Test | Goal |
| --- | --- | --- |
| Reader | Unit | Mixed PDF → expected blocks & images |
| Chunker | Unit | Cohesive segmentation, no mid-code splits |
| Embedding | Contract | Dimensions + determinism for identical input |
| Graph | Golden | Known doc yields expected triples |
| Retriever | Integration | Query returns hybrid evidence set |
| End-to-End | Scenario | Upload → query answer grounded in provenance |
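As one concrete example, the embedding contract test can pin dimensionality and determinism using a deterministic test double (FakeEmbedModel below is hypothetical; the 1024 dimension echoes the walkthrough and should match whatever your real model produces):

import hashlib

class FakeEmbedModel:
    """Deterministic stand-in for the real embedding client, used only in tests."""
    def __init__(self, dim=1024):
        self.dim = dim

    def embed(self, text: str):
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        return [digest[i % len(digest)] / 255.0 for i in range(self.dim)]

def test_embedding_contract():
    model = FakeEmbedModel(dim=1024)
    text = "Service A publishes events to Queue Q."
    v1, v2 = model.embed(text), model.embed(text)
    assert len(v1) == 1024    # dimension registered with the vector index
    assert v1 == v2           # identical input must yield identical vectors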

Observability Essentials

  • Stage latency (p50/p95)
  • Failure counts by stage & file type
  • Chunks per document distribution
  • Triple density (per 1k tokens)
  • Retrieval latency breakdown (vector vs graph)
  • Fusion contribution (percentage of answers using graph expansion)

Future Extensions

  1. Multi-modal joint embeddings (text + image per chunk)
  2. Temporal predicates (time-aware graph queries)
  3. Incremental re-chunking on document updates (diff-based)
  4. Active learning for low-confidence triples
  5. Structured citation anchors in generated answers

Prompt Pattern for Safe Triple Extraction

You are an information extraction agent.
Extract factual (subject, predicate, object) triples ONLY if explicitly implied by the text.
Return JSON array. Ignore speculative relationships.

Text: "Service A asynchronously publishes events to Queue Q. Service B subscribes to Queue Q."

Output:
[
  {"subject": "Service A", "predicate": "publishes to", "object": "Queue Q"},
  {"subject": "Service B", "predicate": "subscribes to", "object": "Queue Q"}
]
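Pairing this prompt with a strict parser keeps hallucinated triples out of the graph; the sketch below enforces well-formed JSON and a literal in-text evidence check (fuzzier entity matching is a natural refinement):

import json

REQUIRED_KEYS = {"subject", "predicate", "object"}

def parse_and_validate_triples(llm_output: str, source_text: str):
    """Keep only well-formed triples whose subject and object actually appear in the source text."""
    try:
        candidates = json.loads(llm_output)
    except json.JSONDecodeError:
        return []  # malformed output: fail closed rather than index junk
    validated = []
    for triple in candidates:
        if not isinstance(triple, dict) or not REQUIRED_KEYS.issubset(triple):
            continue  # drop triples missing required fields
        if triple["subject"] in source_text and triple["object"] in source_text:
            validated.append(triple)
    return validated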

Implementation Checklist (Condensed)

  • Parser abstraction (text + images)
  • Vision caption enrichment with provenance
  • Adaptive chunking engine
  • Embedding batcher & dimension registry
  • Triple extraction microservice
  • Vector + graph indices
  • Hybrid retrieval fusion layer
  • Metrics, statuses, retry semantics
  • Security guardrails & PII handling
  • Evaluation harness & golden corpora

Closing

Multi-modal + graph-aware RAG reduces hallucination, improves specificity, and unlocks reasoning over implicit structure. Build modularly, invest early in observability, and iterate using evaluation-driven refinements.

Feel free to request a focused deep dive (e.g., graph schema design or fusion scoring) if helpful.

About the Author

Written by Suraj Khaitan
— Gen AI Architect | Working on serverless AI & cloud platforms.
