Summary
Modern enterprise knowledge isn’t just text. It lives in PDFs with embedded charts, scanned diagrams, and implicit relationships buried across documents. This article walks through designing a production‑grade, multi‑modal ingestion pipeline that:
- Parses heterogeneous documents (PDF, Word, etc.)
- Extracts embedded images → converts them to descriptive text (image2text)
- Normalizes and chunks content
- Generates embeddings and structured triples
- Builds both vector and graph indices
- Enables hybrid retrieval (semantic + relational)
All implemented as a serverless, event‑orchestrated flow (e.g., Step Functions + Lambdas) using modular services for reading, vision, embedding, indexing, and retrieval—without leaking any sensitive identifiers.
Why Multi‑Modal + Structured Matters
RAG systems relying solely on dense vector similarity can miss:
- Diagram semantics (architecture views, flow charts)
- Entity relationships (who owns what, dependencies)
- Procedural context (step ordering)
By fusing:
- Text extraction (PDF parsing, OCR fallback)
- Image captioning / vision-to-text (image2text)
- Graph construction (subject–predicate–object triples)
- Dense embeddings (semantic meaning)
…you unlock richer grounding for LLM responses: precise factual lookup + contextual reasoning over relationships.
High-Level Architecture
```mermaid
flowchart LR
    A["Upload Document(s)"] --> B[Initialize Job]
    B --> C["Read: PDF/Text Parsing"]
    C --> D{Images?}
    D -->|Yes| E["Image2Text (Vision Models)"]
    D -->|No| F[Skip]
    E -->|Merge text + captions| G[Unified Content Stream]
    F --> G
    G --> H[Chunking Strategy]
    H --> I[Embedding Generation]
    I --> J["Vector Index (Similarity)"]
    I --> K[Triple Extraction + Graph Index]
    J --> L[Hybrid Retriever]
    K --> L
    L --> M[LLM Answer Synthesis]
```

Status Lifecycle & Resilience
A robust ingestion pipeline maintains explicit statuses for observability and idempotency:
```
pending → processed → image2text_completed → chunked → embedded → indexed
   |            |                  |                |
   v            v                  v                v
reading_failed  image2text_failed  chunking_failed  ...
```

Each stage validates preconditions (e.g., a document must be `processed` or `image2text_completed` before chunking) and writes atomic status transitions to a metadata store. This enables safe retries, partial completion, and granular metrics.
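As one illustration, a stage gate can be enforced with a conditional write so retries never regress a document's status. This is a minimal sketch assuming a DynamoDB-style metadata store; the table name and attribute names are hypothetical, not a prescribed schema.

```python
import boto3
from botocore.exceptions import ClientError

# Hypothetical metadata table tracking per-document pipeline status.
table = boto3.resource("dynamodb").Table("ingestion-jobs")

def transition_status(doc_id: str, expected: list[str], new_status: str) -> bool:
    """Atomically move a document to new_status only if its current status
    is one of the expected predecessor states. Returns False instead of
    raising when the precondition fails, so callers can skip documents
    that were already advanced by a previous attempt."""
    try:
        table.update_item(
            Key={"doc_id": doc_id},
            UpdateExpression="SET #s = :new",
            # The conditional write is what makes retries idempotent.
            ConditionExpression="#s IN (:e0, :e1)",
            ExpressionAttributeNames={"#s": "status"},  # "status" is reserved
            ExpressionAttributeValues={
                ":new": new_status,
                ":e0": expected[0],
                ":e1": expected[1] if len(expected) > 1 else expected[0],
            },
        )
        return True
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False
        raise

# Example gate before chunking:
# transition_status(doc_id, ["processed", "image2text_completed"], "chunked")
```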
Core Pipeline Modules (Conceptual)
| Module | Responsibility | Key Considerations |
|---|---|---|
| Reading | Detect file type; extract textual blocks; extract inline images | Hybrid extraction: native PDF libraries + fallback OCR |
| Image2Text | Caption images with a vision model | Early exit if no images; retain page provenance |
| Chunking | Build semantically coherent, token‑bounded chunks | Preserve hierarchy; avoid splitting code/examples mid block |
| Embedding | Generate vectors per chunk | Batch calls; dimension awareness |
| Graph Indexing | Extract triples (subject, predicate, object) | Confidence scoring + schema versioning |
| Retriever | Hybrid vector + graph + metadata fusion | Weighted scoring & reranking |
Walkthrough: PDF With Embedded Diagrams
- Upload technical PDF.
- Reading stage extracts page blocks + images.
- Image2Text captions each figure ("Sequence diagram showing service A calls service B").
- Chunking interleaves text paragraphs & captions while enforcing token limit.
- Embedding stage produces dense vectors (e.g., 1024‑dimensional).
- Graph extraction derives entity relationships (ServiceA → calls → ServiceB).
- Retrieval blends vector similarity + graph expansion.
- LLM answers grounded with both chunk text and relationship provenance.
Pseudo-Code (Abstracted)
Reading + Image Extraction
```python
def read_document(file_path: str) -> dict:
    """Extract text blocks and inline images from a document."""
    pages = extract_pdf_text(file_path)
    images = extract_pdf_images(file_path)
    return {"pages": pages, "images": images, "status": "processed"}
```

Vision Enrichment (Image2Text)
```python
def enrich_with_image_captions(processed: dict, vision_model) -> dict:
    # Early exit: nothing to caption, but still advance the status.
    if not processed["images"]:
        processed["status"] = "image2text_completed"
        processed["image_captions"] = []
        return processed

    captions = []
    for img in processed["images"]:
        caption = vision_model.describe(img.binary)
        captions.append({
            "text": caption,
            "page": img.page_num,  # retain page provenance
            "type": "image_caption",
        })
    processed["image_captions"] = captions
    processed["status"] = "image2text_completed"
    return processed
```

Chunking Mixed Modal Content
```python
def build_chunks(pages, image_captions, max_tokens=800):
    # Interleave text blocks and captions in page order so captions
    # stay adjacent to the prose that references them.
    stream = interleave(pages, image_captions, strategy="by_page")
    chunks, current, token_count = [], [], 0
    for block in stream:
        bc = count_tokens(block["text"])
        # Flush the current chunk before it exceeds the token budget
        # (the `current` guard avoids emitting an empty chunk).
        if token_count + bc > max_tokens and current:
            chunks.append(join(current))
            current, token_count = [], 0
        current.append(block["text"])
        token_count += bc
    if current:
        chunks.append(join(current))
    return [{"chunk_text": c, "modality_mix": detect_modalities(c)} for c in chunks]
```

Embedding Generation
```python
def embed_chunks(chunks, embed_model):
    vectors = []
    for c in chunks:
        v = embed_model.embed(c["chunk_text"])
        vectors.append({**c, "vector": v})
    return vectors
```

Triple Extraction
```python
def extract_triples(chunks, llm):
    triples = []
    for c in chunks:
        prompt = f"Extract (subject, predicate, object) triples:\n{c['chunk_text']}"
        triples.extend(llm.extract_triples(prompt))
    # Deduplicate and map predicates onto the versioned schema.
    return normalize_triples(triples)
```

Hybrid Retrieval Fusion
```python
def hybrid_retrieve(query, vector_index, graph_index, embed_model):
    q_vec = embed_model.embed(query)
    vec_hits = vector_index.search(q_vec, top_k=10)
    # Expand entities mentioned in the query up to two hops out.
    related_nodes = graph_index.expand_entities(query, hop_limit=2)
    scored = fuse_scores(vec_hits, related_nodes)
    return rerank(scored, query)
```

Graph + Vector Synergy
Embedding nodes & edges enables:
- Semantic expansion (similar conceptual nodes)
- Multi-hop reasoning with relevance pruning (sketched after this list)
- Rich provenance joining text chunks + relationships
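To make the pruning idea concrete, here is a minimal sketch of query-guided graph expansion. It assumes nodes carry precomputed embeddings and a networkx-style `graph.nodes` / `graph.neighbors` interface; the 0.35 threshold is illustrative, not a recommendation.

```python
import math
from collections import deque

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def expand_with_pruning(seed_ids, graph, query_vec, hop_limit=2, min_sim=0.35):
    """Breadth-first expansion from seed entities, keeping only nodes
    whose embedding stays relevant to the query."""
    visited, keep = set(seed_ids), []
    frontier = deque((nid, 0) for nid in seed_ids)
    while frontier:
        node_id, depth = frontier.popleft()
        sim = cosine(query_vec, graph.nodes[node_id]["embedding"])
        if sim < min_sim:
            continue  # relevance pruning: stop exploring this branch
        keep.append((node_id, sim))
        if depth < hop_limit:
            for neighbor in graph.neighbors(node_id):
                if neighbor not in visited:
                    visited.add(neighbor)
                    frontier.append((neighbor, depth + 1))
    return sorted(keep, key=lambda pair: pair[1], reverse=True)
```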
Operational Considerations
| Aspect | Notes |
|---|---|
| Throughput | Batch embeddings; parallel captioning where feasible |
| Memory | Guardrails for large PDFs & image sets; stream pages |
| Idempotency | Status validation gates per stage |
| Error isolation | Modality-specific fail states (continue with text) |
| Schema evolution | Version triple schema + migration plan |
| Ranking fusion | Tune α/β/γ weights offline (NDCG/MRR) |
Fusion formula example:
```
final_score = α * vector_similarity + β * graph_relevance + γ * metadata_boost
```
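A direct reading of this formula as code might look like the following sketch. The hit/node shapes, the `chunks_mentioning` provenance helper, and the default weights are assumptions for illustration; the graph input matches the `(node_id, relevance)` pairs produced by the expansion sketch above.

```python
def fuse_scores(vec_hits, graph_nodes, alpha=0.6, beta=0.3, gamma=0.1):
    """Combine vector similarity, graph relevance, and metadata boosts
    into one ranked candidate list (final_score as defined above)."""
    by_chunk = {}
    for hit in vec_hits:  # e.g., {"chunk_id": ..., "score": ..., "metadata_boost": ...}
        by_chunk[hit["chunk_id"]] = {
            "vector_similarity": hit["score"],
            "graph_relevance": 0.0,
            "metadata_boost": hit.get("metadata_boost", 0.0),
        }
    for node_id, relevance in graph_nodes:
        for chunk_id in chunks_mentioning(node_id):  # assumed provenance join
            entry = by_chunk.setdefault(chunk_id, {
                "vector_similarity": 0.0, "graph_relevance": 0.0, "metadata_boost": 0.0,
            })
            entry["graph_relevance"] = max(entry["graph_relevance"], relevance)
    scored = [
        (cid, alpha * s["vector_similarity"]
              + beta * s["graph_relevance"]
              + gamma * s["metadata_boost"])
        for cid, s in by_chunk.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Performance & Quality Tips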
| Challenge | Mitigation |
|---|---|
| Redundant captions | Embedding-based dedupe (cosine threshold; sketched after this table) |
| Oversized chunks | Adaptive token sizing per density |
| Ambiguous entities | Confidence filtering + fallback noun phrase rules |
| Hallucinated triples | Require in-text evidence & schema validation |
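For the caption-dedupe mitigation, one minimal sketch, assuming numpy is available; the 0.92 threshold is an arbitrary starting point to tune against your corpus:

```python
import numpy as np

def dedupe_captions(captions, embed_model, threshold=0.92):
    """Drop captions whose embedding is near-identical to one already kept."""
    kept, kept_vecs = [], []
    for cap in captions:
        vec = np.asarray(embed_model.embed(cap["text"]), dtype=float)
        vec = vec / (np.linalg.norm(vec) or 1.0)  # normalize for cosine via dot
        if any(float(v @ vec) >= threshold for v in kept_vecs):
            continue  # near-duplicate of an earlier caption
        kept.append(cap)
        kept_vecs.append(vec)
    return kept
```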
Security & Governance (Generic)
- File-type allowlist (.pdf, .docx, .pptx, .xlsx, .txt); a minimal gate is sketched after this list
- Malware scanning pre-ingestion
- PII redaction pass pre-embedding
- Structured logging (no raw document dumps)
- Principle of least privilege for each stage
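A minimal version of the allowlist gate, mirroring the extension set above. Treat it as a first filter only: extension checks should be paired with content-type (magic-byte) inspection and the malware scan.

```python
from pathlib import Path

ALLOWED_EXTENSIONS = {".pdf", ".docx", ".pptx", ".xlsx", ".txt"}

def validate_upload(file_path: str) -> None:
    """Reject files outside the allowlist before any parsing begins."""
    ext = Path(file_path).suffix.lower()
    if ext not in ALLOWED_EXTENSIONS:
        raise ValueError(f"Unsupported file type: {ext or '<none>'}")
```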
Testing Strategy
| Layer | Test | Goal |
|---|---|---|
| Reader | Unit | Mixed PDF → expected blocks & images |
| Chunker | Unit | Cohesive segmentation, no mid-code splits |
| Embedding | Contract | Dimensions + determinism for identical input (sketched below) |
| Graph | Golden | Known doc yields expected triples |
| Retriever | Integration | Query returns hybrid evidence set |
| End-to-End | Scenario | Upload → query answer grounded in provenance |
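The embedding contract row can be pinned down with a test like this pytest-style sketch; `EXPECTED_DIM` and the `embed_model` fixture are placeholders for your own setup.

```python
EXPECTED_DIM = 1024  # placeholder: match your embedding model's output size

def test_embedding_contract(embed_model):
    """Contract: fixed dimensionality and deterministic output
    for identical input text."""
    text = "Service A publishes events to Queue Q."
    v1 = embed_model.embed(text)  # assumed to return a list of floats
    v2 = embed_model.embed(text)
    assert len(v1) == EXPECTED_DIM
    assert v1 == v2  # determinism; use an approx check if the backend jitters
```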
Observability Essentials
- Stage latency (p50/p95); see the timing sketch after this list
- Failure counts by stage & file type
- Chunks per document distribution
- Triple density (per 1k tokens)
- Retrieval latency breakdown (vector vs graph)
- Fusion contribution (percentage of answers using graph expansion)
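One lightweight way to capture per-stage latency and failure counts is a decorator around each stage function; `emit_metric` stands in for whatever metrics client you use (CloudWatch, StatsD, etc.).

```python
import functools
import time

def instrument_stage(stage_name):
    """Wrap a pipeline stage to emit latency and failure metrics."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            except Exception:
                emit_metric(f"{stage_name}.failures", 1)  # count by stage
                raise
            finally:
                elapsed_ms = (time.perf_counter() - start) * 1000.0
                emit_metric(f"{stage_name}.latency_ms", elapsed_ms)
        return wrapper
    return decorator

# @instrument_stage("chunking")
# def build_chunks(...): ...
```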
Future Extensions
- Multi-modal joint embeddings (text + image per chunk)
- Temporal predicates (time-aware graph queries)
- Incremental re-chunking on document updates (diff-based)
- Active learning for low-confidence triples
- Structured citation anchors in generated answers
Prompt Pattern for Safe Triple Extraction
```
You are an information extraction agent. Extract factual
(subject, predicate, object) triples ONLY if explicitly implied by the
text. Return a JSON array. Ignore speculative relationships.

Text: "Service A asynchronously publishes events to Queue Q.
Service B subscribes to Queue Q."

Output:
[
  {"subject": "Service A", "predicate": "publishes to", "object": "Queue Q"},
  {"subject": "Service B", "predicate": "subscribes to", "object": "Queue Q"}
]
```
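Pairing the prompt with a post-hoc evidence check keeps hallucinated triples out of the graph. A minimal sketch, assuming the model returns a JSON array as shown above (the substring matching is deliberately crude; entity linking can replace it later):

```python
import json

def validate_triples(raw_output: str, source_text: str) -> list[dict]:
    """Keep only well-formed triples whose subject and object
    literally appear in the source text (in-text evidence)."""
    try:
        candidates = json.loads(raw_output)
    except json.JSONDecodeError:
        return []  # malformed model output: drop and retry upstream
    valid = []
    for t in candidates:
        if not all(k in t for k in ("subject", "predicate", "object")):
            continue  # schema validation
        if t["subject"] in source_text and t["object"] in source_text:
            valid.append(t)
    return valid
```

Implementation Checklist (Condensed)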
- Parser abstraction (text + images)
- Vision caption enrichment with provenance
- Adaptive chunking engine
- Embedding batcher & dimension registry
- Triple extraction microservice
- Vector + graph indices
- Hybrid retrieval fusion layer
- Metrics, statuses, retry semantics
- Security guardrails & PII handling
- Evaluation harness & golden corpora
Closing
Multi-modal + graph-aware RAG reduces hallucination, improves specificity, and unlocks reasoning over implicit structure. Build modularly, invest early in observability, and iterate using evaluation-driven refinements.
Feel free to request a focused deep dive (e.g., graph schema design or fusion scoring) if helpful.
About the Author
Written by Suraj Khaitan
— Gen AI Architect | Working on serverless AI & cloud platforms.