Key Takeaways
- Semantic caching is a Retrieval Augmented Generation (RAG) technique that stores queries and responses as vector embeddings, allowing the system to reuse previous answers.
- Semantic caching enables systems to retrieve accurate responses efficiently without repeatedly invoking large language models.
- Learn about the systematic journey from semantic caching failure to production success, testing seven bi-encoder models across four experimental configurations with 1,000 real banking queries.
- The model selection strategy for the evaluation covered three model types: compact, large-scale, and specialized (instruction-tuned) models.
- Achieving a sub-5% false positive rate requires a multi-layered architectural approach. This roadmap includes query pre-processing, fine-tuned domain models, a multi-vector architecture, cross-encoder reranking, and a final rule-based system for logical validation.
As natural language becomes the standard interface for interacting with software, whether through intelligent search, chatbots, analytics assistants, or enterprise knowledge explorers, systems must process vast numbers of user queries that differ in phrasing but share the same intent. Efficiently retrieving accurate responses without repeatedly invoking large language models is critical for speed, consistency, and cost control.
Semantic caching enables this efficiency. It is a Retrieval Augmented Generation (RAG) technique that stores queries and responses as vector embeddings, allowing the system to reuse previous answers when new queries carry similar meaning. Unlike traditional caching based on exact string matches, semantic caching operates on meaning and intent, ensuring continuity across diverse phrasing patterns.
In production environments, semantic caching accelerates response times, stabilizes output quality, and reduces redundant LLM calls across diverse applications, from customer support and document retrieval to conversational business intelligence. Yet if the cache is poorly designed, semantically close but incorrect results can surface, producing false positives that compromise reliability in critical contexts.
When we set out to implement semantic caching for our financial services FAQ system, we expected a straightforward path: choose a proven bi-encoder model, set reasonable similarity thresholds, and let the system learn from user interactions. The reality proved far more complex.
Our production system was giving users confident but completely wrong answers. A business owner saying 'I don't want this business account anymore' was directed to automatic payment cancellation procedures with 84.9% confidence. A customer saying 'I don't want this card anymore' was directed to investment account closure procedures with 88.7% confidence instead of credit card cancellation guidance.
Someone requesting ATM locations received loan balance checking instructions with 80.9% confidence. Despite using state-of-the-art sentence transformers with a standard similarity threshold of 0.7, our false positive rate reached 99% for some models.
This article documents our systematic journey from semantic caching failure to production success, testing seven bi-encoder models across four experimental configurations with 1,000 real banking queries. The core insight: the design of your cache is a more powerful lever for reducing false positives than model optimization.
Experimental Methodology: Production-Grade Benchmark for Semantic Caching
The insights in this article come from a rigorous, production-grade evaluation with 1,000 queries tested across multiple bi-encoder models. Our test configuration was designed to replicate a real-world environment and provide a comprehensive analysis of semantic caching performance.
System Architecture
Our test system used a query-to-query semantic caching architecture. A user query is vectorized and matched against a Facebook AI Similarity Search (FAISS)-based cache of previously answered queries. If a similar query is found above a set similarity threshold, the corresponding gold answer is returned from the cache. If the query does not meet the threshold, the system falls back to the Large Language Model (LLM) to generate a response, which is then added to the cache for future use. The evaluation environment was designed to closely mirror production scale and behavior, maintaining identical caching logic, model configurations, and performance parameters to ensure realistic and reproducible results. Figure 1 illustrates the high-level semantic caching architecture.

Figure 1: High-Level Semantic Caching Architecture
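To make the flow concrete, here is a minimal sketch of this lookup-then-fallback loop using sentence-transformers and FAISS. The class and function names are illustrative and the LLM call is left abstract; this is a simplified sketch, not our production implementation.

```python
# Minimal sketch of the query-to-query cache flow described above.
# Assumes sentence-transformers and faiss-cpu are installed; names are illustrative.
import faiss
from sentence_transformers import SentenceTransformer

class SemanticCache:
    def __init__(self, model_name="sentence-transformers/all-MiniLM-L6-v2", threshold=0.7):
        self.encoder = SentenceTransformer(model_name)
        self.threshold = threshold
        dim = self.encoder.get_sentence_embedding_dimension()
        self.index = faiss.IndexFlatIP(dim)   # inner product on normalized vectors = cosine similarity
        self.answers = []                     # answers[i] corresponds to index vector i

    def _embed(self, texts):
        vecs = self.encoder.encode(texts, convert_to_numpy=True)
        faiss.normalize_L2(vecs)
        return vecs

    def lookup(self, query):
        """Return (answer, similarity) on a cache hit, else (None, best_similarity)."""
        if self.index.ntotal == 0:
            return None, 0.0
        scores, ids = self.index.search(self._embed([query]), k=1)
        score, idx = float(scores[0][0]), int(ids[0][0])
        if score >= self.threshold:
            return self.answers[idx], score
        return None, score

    def add(self, query, answer):
        self.index.add(self._embed([query]))
        self.answers.append(answer)

def answer_query(cache, query, call_llm):
    """Cache hit returns the stored answer; a miss falls back to the LLM and caches the result."""
    cached, _ = cache.lookup(query)
    if cached is not None:
        return cached
    response = call_llm(query)   # expensive LLM call only on a cache miss
    cache.add(query, response)
    return response
```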
Test Modes
The evaluation was conducted in two distinct caching modes to simulate different production scenarios:
- Incremental Mode: This mode began with an empty cache. Queries were added only when a cache miss occurred, simulating a "cold-start" deployment where the system learns and grows over time.
- Pre-cached Mode: This mode started with a cache pre-loaded with 100 gold answers and 300 strategically crafted distractors. This setup simulated a warm-start production environment where the system is primed with foundational knowledge.
Infrastructure and Scale
Our evaluation used production grade infrastructure to ensure realistic performance measurements:
- Infrastructure: AWS g4dn.xlarge (NVIDIA T4 GPU, 16GB GPU memory)
- Scale: 1,000 query variations tested across seven bi-encoder models: all-MiniLM-L6-v2, e5-large-v2, mxbai-embed-large-v1, bge-m3, Qwen3-Embedding-0.6B, jina-embeddings-v2-base-en, and instructor-large
- Dataset: 100 banking FAQs from major banking institutions
- Validation: 99.7% accuracy against ground-truth comparison
- Metrics Measured: Cache Hit%, LLM Hit%, FP% (False Positive Rate), Recall@1, Recall@3, and Latency (computed as sketched below)
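For reference, these metrics can be derived from per-query evaluation records roughly as follows. This is a simplified sketch: the record fields and exact definitions are assumptions rather than our actual harness, with the false positive rate counted over all evaluated queries.

```python
# Simplified sketch of the reported metrics; record fields and exact definitions
# are illustrative assumptions, not our actual evaluation harness.
def summarize(records):
    """Each record describes one evaluated query:
       cache_hit (bool), correct (bool: served answer matched ground truth),
       gold_id (str), retrieved_ids (ranked cache candidates), latency_ms (float)."""
    n = len(records)
    cache_hits = [r for r in records if r["cache_hit"]]

    cache_hit_pct = 100.0 * len(cache_hits) / n
    llm_hit_pct = 100.0 - cache_hit_pct                      # misses fall back to the LLM
    # False positives: cache hits that served a wrong answer, counted over all queries.
    fp_pct = 100.0 * sum(1 for r in cache_hits if not r["correct"]) / n
    recall_at_1 = 100.0 * sum(1 for r in records if r["retrieved_ids"][:1] == [r["gold_id"]]) / n
    recall_at_3 = 100.0 * sum(1 for r in records if r["gold_id"] in r["retrieved_ids"][:3]) / n
    avg_latency_ms = sum(r["latency_ms"] for r in records) / n

    return {
        "Cache Hit%": cache_hit_pct,
        "LLM Hit%": llm_hit_pct,
        "FP%": fp_pct,
        "R@1%": recall_at_1,
        "R@3%": recall_at_3,
        "Latency (ms)": avg_latency_ms,
    }
```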
Dataset Design: Real-World Query Patterns
The dataset, sourced from real bank websites, was meticulously designed to mimic authentic customer interactions. It consisted of 1,000 queries derived from 100 canonical banking FAQs drawn from 10 different domains, including payments, loans, disputes, accounts, investments, and ATM services. Each FAQ was systematically enhanced with 10 query variations (e.g., formal, casual, slang, tiny, typo) and 3 types of query distractors to test the system's ability to handle semantic precision: topical_neighbor (0.8-0.9 similarity), semantic_near_miss (0.85-0.95 similarity), and cross_domain (0.6-0.8 similarity).
The excerpt below illustrates a sample from the dataset construction process, where each canonical FAQ was expanded into multiple real-world query variations and paired with structured distractors to test semantic precision. This example reflects the benchmark's focus on evaluating both accuracy and the cache's ability to distinguish fine-grained intent differences under realistic query conditions.
"faq_id": "Q003", "domain": "payment", "faq": "how do I cancel a Zelle payment", "gold_answer": "You can only cancel a Zelle payment if the recipient hasn't enrolled in Zelle yet. If they're already enrolled, the payment is sent immediately and cannot be canceled. Contact the recipient directly to request a return payment. For pending payments to unenrolled recipients, you may be able to cancel through your banking app or by calling customer service.", "variations": [ ["V001", "formal", "What is the procedure for canceling a Zelle transaction?", "hard"], ["V002", "casual", "can i cancel a zelle payment i just sent", "medium"], ["V003", "polite", "Is it possible to cancel a Zelle transfer I made by mistake?", "medium"], ["V004", "slang", "can i take back money i zelled to someone", "hard"], ["V005", "vague", "I need to stop something I did", "hard"], ["V006", "frustration", "I sent money to the wrong person on Zelle! How do I get it back?", "hard"], ["V007", "typo", "How do I cancle a Zele paymet?", "medium"], ["V008", "tiny", "cancel zelle?", "hard"], ["V009", "contradictory", "I want to complete this Zelle payment but how do I cancel it?", "hard"], ["V010", "grammar_error", "How I can canceling Zelle payment that I send?", "hard"] ], "query_distractors": [ ["Q1021", "topical_neighbor", "how do I view my recent transaction history", "medium", 0.83, "account_history", "high"], ["Q1022", "semantic_near_miss", "how do I reverse a completed wire transfer", "hard", 0.91, "payment_reversal", "high"], ["Q1023", "cross_domain", "how do I dispute unauthorized credit card charges", "medium", 0.74, "charge_dispute", "medium"] ] Model Selection Strategy
We evaluated seven representative bi-encoder models selected to span the embedding landscape in terms of size, architecture, and provider diversity:
- Compact Models: all-MiniLM-L6-v2 (384 dimensions) and jina-embeddings-v2-base-en (768 dimensions)
- Large-Scale Models: e5-large-v2, mxbai-embed-large-v1, and bge-m3 (all 1024 dimensions)
- Specialized Models: Qwen3-Embedding-0.6B and instructor-large (instruction-tuned for task-specific embeddings)
The experiments compared models varying in scale and representational depth. Compact models provide lower latency and computational cost, while larger architectures capture deeper semantic relationships at higher resource demand. Instruction-tuned and task-specific models add contextual alignment, enhancing precision in complex, domain-specific scenarios.
Experiment 1: The Zero-Shot Baseline False Positive Crisis
Our first experiment established baseline performance using default 0.7 similarity thresholds across all models with incremental cache building from zero.

Figure 2: Zero Shot Semantic Caching Baseline Flow
The table below summarizes the zero-shot baseline results across all models. Each model was tested with an empty cache and a fixed similarity threshold of 0.7. Metrics include cache hit rate, LLM fallback rate, false positives, recall, and latency, establishing a clear reference point against which the later cache optimizations are measured.
The False Positive Crisis
| Model | Threshold | Cache Hit% | LLM Hit% | FP% | R@1% | R@3% | Latency (ms) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| all-MiniLM-L6-v2 | 0.7 | 60.80 | 39.20 | 19.30 | 41.50 | 45.70 | 7.19 |
| e5-large-v2 | 0.7 | 99.90 | 0.10 | 99.00 | 0.90 | 0.90 | 18.31 |
| mxbai-embed-large-v1 | 0.7 | 84.90 | 15.10 | 40.00 | 44.90 | 49.30 | 18.35 |
| bge-m3 | 0.7 | 84.40 | 15.60 | 42.40 | 42.00 | 47.00 | 18.64 |
| Qwen3-Embedding-0.6B | 0.7 | 80.80 | 19.20 | 34.20 | 46.60 | 52.60 | 40.85 |
| jina-embeddings-v2-base-en | 0.7 | 97.70 | 2.30 | 85.60 | 12.10 | 13.20 | 12.11 |
| instructor-large | 0.7 | 99.90 | 0.10 | 99.00 | 0.90 | 0.90 | 21.79 |
Table 1: Zero Shot Semantic Caching Baseline Results
Our first experiment established a zero-shot baseline, revealing a significant "False Positive Crisis". With an empty cache and default thresholds, the models performed unacceptably, with two models achieving dangerously high false positive rates of 99%, meaning virtually every cache hit was incorrect. Even the best performer in this stage delivered incorrect answers 19.3% of the time. This showed that zero-shot models, which optimize for general semantic similarity without domain context, are not suitable for a domain like banking, as they match queries based on surface-level linguistic similarity rather than the necessary functional intent.
Experiment 2: Similarity Threshold Optimization Limits
Recognizing that default thresholds were inadequate, we fine-tuned similarity thresholds for each model based on validation data.

Figure 3: Optimizing Thresholds to Reduce False Positives
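As a rough illustration of this step, a per-model threshold can be chosen by sweeping candidate values over a labeled validation set and keeping the value that best balances false positives against cache hits. The sketch below reflects one reasonable selection criterion under assumed inputs, not our exact tuning procedure.

```python
# Sketch of per-model threshold tuning on a labeled validation set.
# Each item: (similarity_of_best_candidate, candidate_is_correct_answer).
def tune_threshold(validation_scores, candidates=None, max_fp_pct=5.0):
    """Return the threshold with the highest cache hit rate whose false positive
    rate (counted over all validation queries) stays within the FP budget."""
    if candidates is None:
        candidates = [round(0.60 + 0.01 * i, 2) for i in range(40)]   # 0.60 .. 0.99
    n = len(validation_scores)
    best = None
    for t in candidates:
        hits = [(s, ok) for s, ok in validation_scores if s >= t]
        fp_pct = 100.0 * sum(1 for _, ok in hits if not ok) / n
        hit_pct = 100.0 * len(hits) / n
        if fp_pct <= max_fp_pct and (best is None or hit_pct > best[1]):
            best = (t, hit_pct, fp_pct)
    return best   # (threshold, cache_hit_pct, fp_pct), or None if no threshold qualifies
```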
Similarity Threshold Optimized Results
| Model | Threshold | Cache Hit% | LLM Hit% | FP% | R@1% | R@3% | Latency (ms) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| all-MiniLM-L6-v2 | 0.7 | 60.80 | 39.20 | 19.30 | 41.50 | 45.70 | 6.91 |
| e5-large-v2 | 0.9 | 53.00 | 47.00 | 27.20 | 25.80 | 34.70 | 19.65 |
| mxbai-embed-large-v1 | 0.8 | 63.30 | 36.70 | 16.50 | 46.80 | 51.60 | 19.51 |
| bge-m3 | 0.8 | 64.90 | 35.10 | 19.00 | 45.90 | 51.60 | 19.79 |
| Qwen3-Embedding-0.6B | 0.8 | 56.40 | 43.60 | 13.40 | 43.00 | 47.70 | 43.17 |
| jina-embeddings-v2-base-en | 0.86 | 69.80 | 30.20 | 20.80 | 49.00 | 54.90 | 12.48 |
| instructor-large | 0.93 | 63.70 | 36.30 | 14.10 | 49.60 | 53.40 | 22.63 |
Table 2: Model Performance Variation Under Different Thresholds
Following Experiment 2, it became clear that simply adjusting the similarity threshold was not a viable path to production-ready accuracy. While threshold optimization provided significant improvements over the initial baseline, false positive rates remained unacceptably high for production deployment. More aggressive threshold tuning, while reducing false positives, came at the cost of increasing cache misses and driving up expensive LLM calls. This revealed a fundamental architectural flaw: the core issue was not in the model's ability to find a good match, but in the cache's lack of adequate and precise candidates for it to choose from.
Experiment 3: The Best Candidate Principle and Critical Role of Cache Content
Our breakthrough came from reconceptualizing the problem entirely. Instead of optimizing search over sparse, incrementally-built caches, we pre-loaded caches with comprehensive domain coverage. This experiment revealed a fundamental design principle with broad applicability:
The Best Candidate Principle: "Ensuring optimal candidates are available for selection is more effective than optimizing selection algorithms on inadequate candidate sets".
Pre-loaded Cache Architecture
The new cache design contained:
- 100 Gold Standard FAQs: Canonical questions covering all 10 banking domains in our use case.
- 300 Strategic Distractors: These were carefully crafted queries designed to be semantically similar but incorrect. The total number of distractors was derived from a 3:1 ratio of distractors to gold-standard FAQs (300 distractors for 100 FAQs). This approach was used to simulate real-world data where distractors are common, allowing us to rigorously test the models' ability to handle nuanced semantic boundaries.
- Measured Similarity Ranges: Distractors with known similarity scores (0.6-0.95), including topical neighbours, semantic near-misses, and cross-domain queries, to test the system's ability to handle semantic boundary precision.

Figure 4: Best Candidate Selection Principle for Cache Optimization
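For illustration, pre-loading can be as simple as embedding the gold FAQs together with the distractor queries into a single index. The benchmark itself does not prescribe how distractor hits are routed; the sketch below assumes a distractor match carries no cached answer and therefore falls through to the LLM, which is what lets distractors absorb near-miss queries instead of letting them surface a wrong gold answer. Model, field, and function names are placeholders.

```python
# Sketch: pre-loading the cache with gold FAQs plus strategic distractors.
# The routing convention for distractor hits is an illustrative assumption.
import faiss
from sentence_transformers import SentenceTransformer

def build_preloaded_cache(gold_faqs, distractors,
                          model_name="sentence-transformers/all-MiniLM-L6-v2"):
    """gold_faqs: list of {"faq": str, "gold_answer": str}
       distractors: list of {"query": str}  # semantically close, answerless entries"""
    encoder = SentenceTransformer(model_name)
    texts = [f["faq"] for f in gold_faqs] + [d["query"] for d in distractors]
    # Distractor entries act as "absorbers": a near-miss query matches them instead
    # of a gold FAQ, and the system then falls back to the LLM.
    payloads = [f["gold_answer"] for f in gold_faqs] + [None] * len(distractors)

    vecs = encoder.encode(texts, convert_to_numpy=True)
    faiss.normalize_L2(vecs)
    index = faiss.IndexFlatIP(vecs.shape[1])
    index.add(vecs)
    return encoder, index, payloads

def cached_answer(encoder, index, payloads, query, threshold=0.8):
    """Return the gold answer on a confident FAQ match; None means call the LLM."""
    q = encoder.encode([query], convert_to_numpy=True)
    faiss.normalize_L2(q)
    scores, ids = index.search(q, k=1)
    if float(scores[0][0]) >= threshold:
        return payloads[int(ids[0][0])]   # may be None if the nearest entry is a distractor
    return None
```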
Preloaded Cache with Distractors Results
| Model | Threshold | Cache Hit% | LLM Hit% | FP% | R@1% | R@3% | Latency (ms) | % FP Improvement from Experiment-2 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| all-MiniLM-L6-v2 | 0.7 | 75.20 | 24.80 | 9.40 | 65.80 | 71.00 | 6.99 | 51.3 |
| e5-large-v2 | 0.9 | 68.40 | 31.60 | 18.30 | 50.10 | 61.60 | 20.52 | 32.7 |
| mxbai-embed-large-v1 | 0.8 | 81.60 | 18.40 | 8.80 | 72.80 | 78.20 | 20.45 | 46.7 |
| bge-m3 | 0.8 | 82.00 | 18.00 | 8.30 | 73.70 | 79.90 | 20.91 | 56.3 |
| Qwen3-Embedding-0.6B | 0.8 | 74.60 | 25.40 | 8.50 | 66.10 | 72.10 | 44.73 | 36.6 |
| jina-embeddings-v2-base-en | 0.86 | 84.90 | 15.10 | 10.30 | 74.60 | 80.40 | 13.11 | 50.5 |
| instructor-large | 0.93 | 79.50 | 20.50 | 5.80 | 73.70 | 79.00 | 24.02 | 58.9 |
Table 3: Impact of Cache Content Selection on False Positives and Recall
This architectural change, based on the Best Candidate Principle, resulted in a significant breakthrough. Despite adding 300 strategic distractors to the cache, we observed a dual benefit: false positive rates dropped by as much as 59% across models, and simultaneously, cache hit rates saw a major increase, rising from a range of 53% to 69.8% in Experiment 2 to a new range of 68.4% to 84.9% in this experiment. This substantial improvement validated that cache design is a more powerful lever for production-grade accuracy than threshold tuning alone.
Experiment 4: Cache Quality Control
Our final optimization involved introducing cache quality controls that filtered out problematic query patterns before they entered the cache. This included filtering tiny queries (extremely short or underspecified inputs such as "cancel?" or "loan?"), typos, grammar errors, and vague questions that could create semantic confusion.

Figure 5: Cache Quality Control Mechanism and Impact
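A lightweight version of this admission gate can be built from simple heuristics applied before a query is written to the cache. The thresholds, word list, and checks below are illustrative assumptions rather than our production rules; a spell-checker or a small LLM prompt could equally serve as the typo and grammar filter.

```python
# Sketch of a cache admission filter for low-quality query patterns.
# Thresholds and the vague-word list are illustrative assumptions.
import re

VAGUE_MARKERS = {"something", "stuff", "thing", "things", "this", "that", "it"}

def is_cacheable(query: str, min_words: int = 4, min_chars: int = 12) -> bool:
    """Admission check: reject tiny, vague, or malformed queries before caching."""
    text = query.strip()
    words = text.lower().split()
    if len(words) < min_words or len(text) < min_chars:
        return False                                   # tiny queries such as "cancel zelle?"
    vague = sum(1 for w in words if w in VAGUE_MARKERS)
    if vague >= len(words) / 2:
        return False                                   # mostly vague filler, e.g. "i want that thing"
    if re.search(r"(.)\1{3,}", text):
        return False                                   # repeated-character noise / keyboard mashing
    return True
```

Rejected queries are still answered by the LLM; they simply never become cache entries, so they cannot pollute future retrievals.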
Preloaded Cache with Cache Quality Control Results
| Model | Threshold | Cache Hit% | LLM Hit% | FP% | R@1% | R@3% | Latency (ms) | % FP Improvement from Experiment-2 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| all-MiniLM-L6-v2 | 0.7 | 70.20 | 29.80 | 4.70 | 65.50 | 68.90 | 7.46 | 75.6 |
| e5-large-v2 | 0.9 | 60.10 | 39.90 | 9.20 | 50.90 | 59.00 | 20.34 | 66.2 |
| mxbai-embed-large-v1 | 0.8 | 77.90 | 22.10 | 5.10 | 72.80 | 76.70 | 20.29 | 69.1 |
| bge-m3 | 0.8 | 78.60 | 21.40 | 4.50 | 74.10 | 77.90 | 20.72 | 76.3 |
| Qwen3-Embedding-0.6B | 0.8 | 71.10 | 28.90 | 5.30 | 65.80 | 70.40 | 44.98 | 60.4 |
| jina-embeddings-v2-base-en | 0.86 | 80.50 | 19.50 | 5.80 | 74.70 | 79.30 | 13.22 | 72.1 |
| instructor-large | 0.93 | 77.60 | 22.40 | 3.80 | 73.80 | 77.50 | 23.79 | 73.0 |
Table 4: Impact of Quality Control on Cache Precision
This stage proved the importance of a cache quality control layer as a mandatory guardrail. It successfully addressed the persistent issues of typos, slang, and vague queries that can create semantic confusion. The results were dramatic, with all models except one achieving sub-6% false positive rates. The top performer, instructor-large, reached a 3.8% FP rate, representing a 96.2% reduction from its initial baseline. This final architectural step solidified the system's viability for a production environment.
Conclusion: From Crisis to Production-Ready
The path from a 99% to a 3.8% false positive rate in production semantic caching required a fundamental shift in system design philosophy. While model selection and parameter tuning are important, they were insufficient for a domain where accuracy is critical.
Our experiments revealed that the Best Candidate Principle, ensuring optimal candidates are available for selection, is more effective than optimizing search algorithms on inadequate candidate sets.

Figure 6: Comparative Performance After Cache Optimization
The optimal model for a production system depends on your specific use case and business requirements, which involves finding the best trade-off between latency, LLM cost, and false positive rates.
Based on these benchmarks, our recommendations are:
- Primary Choice: instructor-large for best accuracy (3.8% FP)
- Cost Optimized: bge-m3 for balanced performance
- Latency Critical: all-MiniLM-L6-v2 for real-time applications
- Models to Avoid: e5-large-v2, due to a persistently high FP rate despite optimization
Future Research and Architectural Roadmap: Addressing the Final 3.8% FP
Our experiments successfully reduced the false positive rate to 3.8%. This performance is viable for production deployment in many domains, while a sub-2% rate would provide enhanced safety margins for the most critical financial guidance scenarios.
Observations from Production Deployment
During our production deployment, we observed consistent failure modes that a purely semantic-similarity-based system struggled to resolve. These observations are based on a representative sample of the false positives identified across all models.
- Semantic Granularity Failures: The model fails to distinguish between closely related, yet distinct, banking concepts (e.g., "credit card" vs. "debit card").
- Intent Classification Failures: The model fails to understand the user's core intent. For example, a high similarity score was assigned to the query "Can I skip my loan payment this month" and the correct FAQ "What happens if I miss a loan payment". In this case, the user's intent was to ask for permission, but the system retrieved a candidate describing the consequences of a past action.
- Context Preservation Failures: The model incorrectly retrieves an answer that is technically correct but completely out of context, often in cases of typos or slang. For example, a high similarity score was assigned to the query "How do I buy stocks after hours?" and the general FAQ "How do I buy stocks". This provides strong evidence that a bi-encoder's dense vector averages out contextual qualifiers like "after hours", treating them as minor details rather than important context that changes handling requirements.
A Multi-Layered Architectural Roadmap
The path to a near-perfect false positive rate requires a multi-layered architectural approach that goes beyond pure similarity. The following roadmap is a systematic plan designed to address the remaining error categories identified in our analysis.
- Advanced Query Pre-processing: Use a fine-tuned LLM or a simple rule-based system to clean user queries, correct typos, and standardize slang before the query enters the semantic system.
- Fine-Tuned Domain Models: Further improve embedding performance by fine-tuning a base model on a limited, high-quality, in-domain dataset.
- Multi-Vector Architecture: Move beyond a single vector per query by creating separate vector spaces for different aspects of the query, such as "content", "intent", and "context".
- Cross-Encoder Re-ranking: Add a re-ranking layer that deeply analyzes the relationship between the query and a small list of candidates to improve accuracy (see the sketch after this list).
- Domain Knowledge Integration: Integrate a final layer of rule-based validation that acts as a guardrail, incorporating domain-specific knowledge.
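As one example of the re-ranking layer mentioned above, a cross-encoder can jointly score the query against the top bi-encoder candidates before the cache commits to an answer. This is a sketch under assumptions: the model name is a common public checkpoint, and the acceptance threshold must be calibrated on validation data because score scales differ across cross-encoders.

```python
# Sketch of cross-encoder re-ranking over the top-k bi-encoder candidates.
# Model choice and acceptance threshold are illustrative and need calibration.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank_and_serve(query, candidates, accept_score):
    """candidates: list of (cached_faq_text, gold_answer) pairs from the bi-encoder stage.
    The cross-encoder reads the query and each FAQ together, so qualifiers such as
    'after hours' or 'debit' vs. 'credit' influence the score instead of being averaged away."""
    if not candidates:
        return None
    scores = reranker.predict([(query, faq) for faq, _ in candidates])
    best = max(range(len(candidates)), key=lambda i: scores[i])
    if scores[best] >= accept_score:
        return candidates[best][1]   # confident match: serve the cached gold answer
    return None                      # not confident enough: defer to the LLM
```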
Lessons Beyond Banking: Principles for Any RAG System
While this case study focuses on the financial services industry, the principles apply to any domain relying on RAG semantic caching.
- Cache Design Over Model Tuning: Our findings demonstrate that cache architecture, rather than model selection or threshold tuning, is the strongest lever for reducing false positives.
- Garbage In, Garbage Out: Low quality queries such as typos, vague wording and grammar errors pollute the cache and create false positives. A preprocessing or quality control layer is a mandatory guardrail in production.
- The Limits of Threshold Tuning: Relying on similarity thresholds alone reduces false positives but at the cost of higher cache misses and LLM calls, leading to unsustainable costs and degraded user experience.
Conclusion
We began with a broken semantic caching system. By moving from a reactive, incremental cache to a proactive, architecturally sound design based on the Best Candidate Principle, we reduced false positives from 99% to 3.8%.
The path to a safe, reliable RAG system isn’t only about finding the perfect or largest model. If your pipeline is failing, fix the cache before tuning the model. Architecture, not embeddings, is what separates prototypes from production systems.