AI RAG Injector

AI License Required

Overview Examples Configuration reference Changelog API reference

What is Retrieval Augmented Generation (RAG)?

Retrieval-Augmented Generation (RAG) is a technique that improves the accuracy and relevance of language model responses by enriching prompts with external data at runtime. Instead of relying solely on what the model was trained on, RAG retrieves contextually relevant information—such as documents, support articles, or internal knowledge—from connected data sources like vector databases.

This retrieved context is then automatically injected into the prompt before the model generates a response. RAG is a critical safeguard in specialized or high-stakes applications, where factual accuracy matters. LLMs are prone to hallucinations, plausible-sounding but factually incorrect or fabricated responses. RAG helps mitigate this by grounding the model’s output in real, verifiable data.

The following table describes the different use cases for RAG based on industry:

Industry	Use case
Healthcare	RAG can help surface up-to-date clinical guidelines or patient records in a timely manner, critical when treatment decisions depend on the most current information.
Legal	Lawyers can use RAG-powered assistants to instantly retrieve relevant case law, legal precedents, or compliance documentation during client consultations.
Finance	In fast-moving markets, RAG enables models to deliver financial insights based on current data, avoiding outdated or misleading responses driven by stale training snapshots.

Why use the AI RAG Injector plugin

The AI RAG Injector plugin automates the retrieval and injection of contextual data for RAG pipelines without doing manual prompt engineering or retrieval logic. Integrated at the gateway level, it handles embedding generation, vector search, and context injection transparently for each request.

Simplifies RAG workflows: Automatically embeds prompts, queries the vector DB, and injects relevant context without custom retrieval logic.
Platform-level control: Shifts RAG logic from app code to infrastructure, allowing platform teams to enforce global policies, update configurations centrally, and reduce developer overhead.
Improved security: Vector DB access is limited to the AI Gateway, eliminating the need to expose it to individual dev teams or AI agents.
Enables RAG in restricted environments: Supports RAG even where direct access to the vector database is not possible, such as external-facing or isolated services.
Developer productivity: Developers can focus on building AI features without needing to manage embeddings, similarity search, or context handling.
v3.11+ Save LLM costs: When using the AI RAG Injector plugin with the AI Prompt Compressor, you can wrap specific prompt parts in <LLMLINGUA> tags within your template to target only those sections for compression, preserving the rest of the prompt unchanged.

This plugin extends the functionality of the AI Proxy plugin, and requires either AI Proxy or AI Proxy Advanced to be configured first. To set up AI Proxy quickly, see Get started with AI Gateway.

How the AI RAG Injector plugin works

When a user sends a prompt, the RAG Injector plugin queries a configured vector database for relevant context and injects that information into the request before passing it to the language model.

You configure the AI RAG Injector plugin via the Admin API or decK, setting up the RAG content to send to the vector database.
When a request reaches the AI Gateway, the plugin generates embeddings for request prompts, then queries the vector database for the top-k most similar embeddings.
The plugin injects the retrieved content from the vector search result into the request body, and forwards the request to the upstream service.

The following diagram is a simplified overview of how the plugin works. See the following section for a more detailed description.

 sequenceDiagram participant User participant AIGateway as AI Gateway (RAG Injector Plugin) participant VectorDB as Vector DB (Data Source) participant Upstream as Upstream Service User->>AIGateway: Send request with prompt AIGateway->>VectorDB: Query for similar embeddings VectorDB-->>AIGateway: Return relevant context AIGateway->>Upstream: Inject context and forward enriched request Upstream-->>User: Return response

RAG Generation process

The RAG workflow consists of two critical phases:

Data preparation: Processes and embeds unstructured data into a vector index for efficient semantic search
Retrieval and generation: The system uses similarity search to dynamically assemble contextual prompts that guide the language model’s output.

Phase 1: Data Preparation

This phase sets up the foundation for semantic retrieval by converting raw data into a format that can be indexed and searched efficiently.

Step breakdown:

A document loader pulls content from various sources, such as PDFs, websites, emails, or internal systems.
The system breaks the unstructured data into smaller, semantically meaningful chunks to support precise retrieval.
Each chunk is transformed into a vector embedding (a numeric representation that captures its semantic content).
These embeddings are saved to a vector database, enabling a fast, similarity-based search during query time.

Phase 2: Retrieval and Generation

This phase runs in real time, taking user input and producing a context-aware response using the indexed data.

Step breakdown:

The user’s query is converted into an embedding using the same model used during data preparation.
A semantic similarity search locates the most relevant content chunks in the vector database.
The system builds a custom prompt by combining the retrieved chunks with the original query.
The LLM generates a contextually accurate response using both the retrieved context and its own internal knowledge.

The diagram below shows how data flows through both phases of the RAG pipeline, from ingestion and embedding to real-time query handling and response generation:

 sequenceDiagram autonumber actor User participant RawData as Raw Data participant EmbeddingModel as Embedding Model participant VectorDB as Vector Database participant LLM par Data Preparation Phase activate RawData RawData->>EmbeddingModel: Load and chunk documents, generate embeddings deactivate RawData activate EmbeddingModel EmbeddingModel->>VectorDB: Store embeddings deactivate EmbeddingModel activate VectorDB deactivate VectorDB end par Retrieval & Generation Phase activate User User->>EmbeddingModel: (1) Submit query and generate query embedding activate EmbeddingModel EmbeddingModel->>VectorDB: (2) Search vector DB deactivate EmbeddingModel activate VectorDB VectorDB-->>EmbeddingModel: Return relevant chunks deactivate VectorDB activate EmbeddingModel EmbeddingModel->>LLM: (3) Assemble prompt and send deactivate EmbeddingModel activate LLM LLM-->>User: (4) Generate and return response deactivate LLM deactivate User end

Rather than guessing from memory, the LLM paired with the RAG pipeline now has the ability to look up the information it needs in real time, which will dramatically reduce hallucinations and increase the accuracy of the AI output.

Vector databases

A vector database can be used to store vector embeddings, or numerical representations, of data items. For example, a response would be converted to a numerical representation and stored in the vector database so that it can compare new requests against the stored vectors to find relevant cached items.

The AI RAG Injector plugin supports the following vector databases:

Using config.vectordb.strategy: redis and parameters in config.vectordb.redis:
- Redis with Vector Similarity Search (VSS)
- AWS MemoryDB for Redis v3.12+
Using config.vectordb.strategy: pgvector and parameters in config.vectordb.pgvector:
- PostgreSQL with pgvector v3.10+

To learn more about vector databases in AI Gateway, see Embedding-based similarity matching in Kong AI gateway plugins.

FAQs

What embedding dimension should I use in my vectordb config?

The embedding dimension you use depends on your model and use case. More dimensions improve accuracy but increase cost. 1536 is a balanced default if you use the OpenAI text-embedding-3-large model.

Can I reduce embedding dimensions to save resources?

Yes. Use PCA, t-SNE, or UMAP to keep key features while lowering memory and latency.

What chunk size should I use for RAG?

Common sizes are 200–1000 tokens. Smaller chunks give precision; larger ones preserve context.

Should I add chunk overlap?

Yes. Overlap helps maintain context between chunks and improves retrieval quality.

How should I split text into chunks?

Use token-, sentence-, or semantic-based chunking based on your data and query type.

Which distance metric works best with embeddings?

Cosine similarity is the best distance metric for text. Use Euclidean only for coordinate-based data.

Where should I inject RAG context in the prompt?

It depends on your priorities:

system offers strong guidance, but carries higher prompt injection risk
user is safer for untrusted content
assistant offers moderate influence You can set this via the inject_as_role setting.

How do I resolve the MemoryDB error Number of indexes exceeds the limit?

If you see the following error in the logs:

failed to create memorydb instance failed to create index: LIMIT Number of indexes (11) exceeds the limit (10)

Copied!

This means that the hardcoded MemoryDB instance limit has been reached. To resolve this, create more MemoryDB instances to handle multiple AI RAG Injector plugin instances.