The Prompt Debugger

detect when langchain hallucinates by mixing contexts

my rag pipeline kept telling customers about features from completely different products. spent weeks debugging until traceloop showed me exactly where contexts were getting mixed. here's how to catch it.

the problem: your bot becomes a feature mixer

customer asks about basic plan, bot responds with enterprise features. the worst part? you can't see it happening without proper monitoring.

real example that almost got me fired:

  • user: "what reporting features are in the starter plan?"
  • bot: "the starter plan includes basic reports, custom dashboards, advanced analytics, real-time monitoring, and api access."

starter plan only has basic reports. everything else came from enterprise docs.
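
everything in this post assumes each retrieved chunk knows where it came from. a minimal sketch of what that tagging could look like at ingestion time (the `product`/`tier` metadata keys are my own convention, not anything langchain enforces):

```python
from langchain.schema import Document

# hypothetical ingestion-time tagging: every chunk carries its product and tier
docs = [
    Document(
        page_content="the starter plan includes basic reports.",
        metadata={"product": "analytics", "tier": "starter"},
    ),
    Document(
        page_content="enterprise adds custom dashboards, advanced analytics, and api access.",
        metadata={"product": "analytics", "tier": "enterprise"},
    ),
]
```

without metadata like this, there's nothing to check mixing against.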

how traceloop saved my debugging nightmare

before traceloop, i was print-debugging everything like an animal. after adding it:

```python
from traceloop.sdk import Traceloop

Traceloop.init(app_name="context_mixing_detector")
```

suddenly i could see in the dashboard:

  • which documents got retrieved for each query
  • exact metadata for each document
  • how the llm combined different contexts
  • real-time alerts when mixing happened

detection: let traceloop track your contexts

```python
from traceloop.sdk import Traceloop
from langchain.chains import RetrievalQA

Traceloop.init(app_name="rag_context_monitor")

def detect_context_mixing(sources, response, question):
    """use traceloop to track context mixing"""
    # extract unique contexts
    contexts = set()
    for doc in sources:
        product = doc.metadata.get('product', 'unknown')
        tier = doc.metadata.get('tier', 'unknown')
        contexts.add(f"{product}_{tier}")

    # log to traceloop for monitoring
    Traceloop.log_metric("contexts_retrieved", len(contexts))
    Traceloop.log_metric("is_mixed", 1 if len(contexts) > 1 else 0)

    # track which contexts get mixed most
    if len(contexts) > 1:
        Traceloop.log_event("context_mixing_detected", {
            "question": question,
            "contexts": list(contexts),
            "response_preview": response[:100]
        })

    return len(contexts) > 1
```
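wiring it in is one call after the chain runs. a sketch, where `qa_chain` is a placeholder for whatever RetrievalQA chain you already have (built with `return_source_documents=True`):

```python
# hypothetical usage; qa_chain is assumed to return source documents
question = "what reporting features are in the starter plan?"
result = qa_chain({"query": question})

if detect_context_mixing(result["source_documents"], result["result"], question):
    print("warning: answer was assembled from more than one product/tier context")
```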

the dashboard that changed everything

traceloop's dashboard showed me patterns i never noticed:

  • 73% of mixing happened between starter/enterprise tiers
  • mobile/desktop mixing peaked during certain queries
  • specific keywords triggered cross-context retrieval

a few numbers pulled from my actual dashboard:

  • "reporting" queries mixed contexts 89% of the time
  • "features" triggered enterprise doc retrieval even for basic users
  • average 3.2 contexts retrieved when mixing occurred
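
breakdowns like these only work if every trace carries something to slice by. one way to do that is traceloop's association properties, set before the traced call runs. a minimal sketch (the property names are made up):

```python
from traceloop.sdk import Traceloop

# attach identifying attributes to the current trace so the dashboard
# can group traces by tier or topic (property names are my own)
Traceloop.set_association_properties({
    "user_tier": "starter",
    "query_topic": "reporting",
})
```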

prevent mixing with traced filtering

```python
from traceloop.sdk import Traceloop
from traceloop.sdk.decorators import workflow
from langchain.chains import RetrievalQA

class TracedContextAwareRAG:
    def __init__(self):
        Traceloop.init(app_name="production_rag")
        self.setup_vectorstore()

    @workflow(name="context_aware_query")
    def query(self, question, user_tier="starter"):
        # traceloop automatically traces this workflow

        # filter retrieval by user context
        retriever = self.vectorstore.as_retriever(
            search_kwargs={
                "filter": {"tier": user_tier},
                "k": 4
            }
        )

        # retrieve and generate
        chain = RetrievalQA.from_chain_type(
            llm=self.llm,
            retriever=retriever,
            return_source_documents=True
        )
        result = chain({"query": question})

        # validate with traceloop monitoring
        if self.detect_mixing(result['source_documents']):
            # traceloop captures this automatically
            Traceloop.log_event("mixing_prevented", {
                "tier": user_tier,
                "question": question
            })
            # retry with stricter filter
            return self.strict_query(question, user_tier)

        return result
```
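strict_query gets called above but never shown. here's one plausible version, as a method on the same class, reusing the imports already in scope; detect_mixing is assumed to wrap the detection function from earlier:

```python
def strict_query(self, question, user_tier):
    # same tier filter, fewer documents: less room for contamination
    retriever = self.vectorstore.as_retriever(
        search_kwargs={"filter": {"tier": user_tier}, "k": 2}
    )
    chain = RetrievalQA.from_chain_type(
        llm=self.llm,
        retriever=retriever,
        return_source_documents=True
    )
    result = chain({"query": question})

    # if even the strict pass mixes contexts, fail closed instead of hallucinating
    if self.detect_mixing(result["source_documents"]):
        return {
            "result": "i can only answer that from your plan's documentation.",
            "source_documents": []
        }
    return result
```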

real-time monitoring setup

```python
from traceloop.sdk import Traceloop
from traceloop.sdk.decorators import task

class ProductionRAGMonitor:
    def __init__(self):
        Traceloop.init(
            app_name="rag_monitor",
            disable_batch=False,  # batch for performance
            api_endpoint="https://api.traceloop.com"
        )

    @task(name="validate_context_boundaries")
    def validate_response(self, response, expected_tier, sources):
        """traceloop tracks validation performance"""
        # check for tier leakage
        tier_keywords = {
            "starter": ["advanced", "enterprise", "api access"],
            "enterprise": [],  # enterprise can mention anything
            "basic": ["unlimited", "custom", "dedicated"]
        }

        violations = []
        for keyword in tier_keywords.get(expected_tier, []):
            if keyword in response.lower():
                violations.append(keyword)

        if violations:
            Traceloop.log_event("tier_violation", {
                "expected": expected_tier,
                "violations": violations,
                "source_count": len(sources)
            })

        # track success rate
        Traceloop.log_metric("validation_passed", 1 if not violations else 0)

        return violations
```
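a quick sanity check against the bad answer from the top of the post (hypothetical values, empty sources just to show the shape):

```python
monitor = ProductionRAGMonitor()

violations = monitor.validate_response(
    response="the starter plan includes advanced analytics and api access.",
    expected_tier="starter",
    sources=[],
)
print(violations)  # ['advanced', 'api access']: keywords leaked from enterprise docs
```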

traceloop insights that blew my mind

after running for a week, traceloop showed:

  • peak mixing hours: 2-4pm when support team was busiest
  • worst offenders: "pricing", "features", "capabilities" queries
  • mixing patterns: starter↔enterprise (67%), mobile↔desktop (23%)

the evaluation dashboard revealed:

  • faithfulness scores dropped 40% during context mixing
  • response time increased 2.3x when mixing occurred
  • customer satisfaction dropped as mixing frequency went up

results with proper monitoring

before traceloop:

  • couldn't see mixing happening
  • 40% of responses had context contamination
  • debugging took hours per incident

after traceloop:

  • real-time mixing detection
  • <3% context mixing in production
  • instant alerts when mixing occurs
  • 15-minute average fix time

quick wins with traceloop

  1. initialize traceloop - literally one line
  2. use @workflow decorators - automatic tracing
  3. log mixing metrics - track patterns
  4. monitor dashboards - spot issues early
  5. set up alerts - catch mixing in real-time

the scariest part about context mixing is you don't know it's happening. traceloop makes it visible. once you can see it, you can fix it.

bonus: the traceloop dashboard impressed my manager so much, we got budget for the enterprise plan. turns out "observability" sounds way better than "i added print statements everywhere."
