In my current role as an AI architect, I've faced a recurring challenge: how do we effectively manage multiple Large Language Models (LLMs) when several teams are building AI agents? With so many models available, each with different strengths, it becomes crucial to use the right model for the right task. Our solution has been to implement Model Routers and Gateways, and the results have been transformative.
The Challenge of Managing Multiple LLMs
When you're working with several different LLMs across an organization, things can quickly become confusing. Teams might default to using a single familiar model for everything, even when it's not the optimal choice. This leads to inefficiency, higher costs, and sometimes poor performance. We needed a system to intelligently direct queries to the most appropriate model based on the specific requirements of each task.
Enter Model Routers: Using the Right Tool for the Right Job
A model router essentially acts as a traffic director for your AI queries. Instead of sending everything to a single model, it analyzes each query and routes it to the most suitable LLM based on several factors (a minimal routing sketch follows the categories below):
Technical Queries
When a user needs code generation or debugging help, we route these queries to technically proficient models like OpenAI's GPT. These models excel at tasks that require a precise understanding of programming languages and technical concepts. (Anthropic's models are also a good choice for this type of query; however, we are currently using GPT models.)
General Queries
For everyday questions that don't involve complex reasoning, we use cost-effective options like DeepSeek. Because it is locally hosted, it offers good performance for standard queries without the higher costs of premium API-based models. It also helps keep company data within our premises, since these queries may retrieve employee personal information and other internal policies.
Creative Requests
Tasks like story writing or brainstorming need models with larger context windows and strong creative capabilities. For these, we route to Google Gemini, which can maintain coherence across longer, more nuanced creative outputs.
Unknown Queries
Not every question has an answer, and that's okay. When a query falls outside our models' capabilities, our system is designed to politely acknowledge its limitations rather than providing incorrect information.
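Conceptually, the routing decision boils down to a lookup from query category to model. The sketch below is only a minimal illustration of that idea, not our production router; the category names mirror the four query types above, and the provider/model identifiers are placeholders (the full implementation later in this post uses a slightly different set).

from enum import Enum
from typing import Optional, Tuple

class QueryCategory(str, Enum):
    TECHNICAL = "technical"        # code generation, debugging
    GENERAL = "general"            # everyday, non-sensitive questions
    CREATIVE = "creative"          # story writing, brainstorming
    OUT_OF_SCOPE = "out_of_scope"  # politely decline

# Hypothetical routing table: category -> (provider, model name)
ROUTING_TABLE = {
    QueryCategory.TECHNICAL: ("openai", "gpt-4o-mini"),
    QueryCategory.GENERAL: ("deepseek", "deepseek-r1"),
    QueryCategory.CREATIVE: ("gemini", "gemini-1.5-pro"),
}

def route(category: QueryCategory) -> Optional[Tuple[str, str]]:
    """Return the (provider, model) pair for a category, or None if out of scope."""
    return ROUTING_TABLE.get(category)

In practice the category itself comes from an intent classifier (covered at the end of this post), and the table is replaced by richer per-intent configuration.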
Model Gateways: The Integration Layer
While routers direct traffic, model gateways serve as the unified interface between users and various LLMs. Think of the gateway as the central hub that connects everything together, providing several critical benefits:
Centralized Access Control
The gateway creates a single point of access for all models, allowing for better management of API keys and access tokens. This reduces security risks while simplifying administration.
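A minimal sketch of this idea, assuming provider credentials live only in the gateway's environment: callers authenticate to the gateway, and the gateway alone holds the provider keys. The variable names are illustrative.

import os

# Provider keys are read once inside the gateway process; client teams never see them.
PROVIDER_KEYS = {
    "openai": os.getenv("AZURE_OPENAI_API_KEY"),
    "gemini": os.getenv("GEMINI_API_KEY"),
}

def get_provider_key(provider: str) -> str:
    """Look up a provider credential, failing loudly if it is not configured."""
    key = PROVIDER_KEYS.get(provider)
    if not key:
        raise RuntimeError(f"No credential configured for provider '{provider}'")
    return key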
Intelligent Cost Management
Different models have different pricing structures. The gateway helps manage costs by directing queries to more expensive models only when their capabilities justify the expense, while routing general queries to more affordable options.
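As a rough sketch of cost-aware selection: when a request is marked cost-sensitive, pick the cheapest model that still advertises the required capability. The per-1K-token prices and capability labels below are placeholders for illustration, not real pricing.

# Hypothetical per-1K-token costs and capabilities, for illustration only.
MODELS = {
    "gpt-4o-mini":    {"cost": 0.0001,  "capabilities": {"technical", "general"}},
    "gemini-1.5-pro": {"cost": 0.00005, "capabilities": {"general", "creative", "long_context"}},
    "deepseek-r1":    {"cost": 0.00002, "capabilities": {"general", "hr"}},
}

def pick_model(required_capability: str, cost_sensitive: bool) -> str:
    """Return the cheapest capable model when cost-sensitive, otherwise the premium option."""
    capable = [(name, spec["cost"]) for name, spec in MODELS.items()
               if required_capability in spec["capabilities"]]
    if not capable:
        raise ValueError(f"No model supports capability '{required_capability}'")
    capable.sort(key=lambda item: item[1])  # cheapest first
    return capable[0][0] if cost_sensitive else capable[-1][0]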
Built-in Redundancy
API services occasionally experience downtime or rate limiting. A well-designed gateway includes fallback mechanisms, automatically rerouting requests to alternative models when necessary to maintain continuous service.
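A minimal fallback sketch, assuming each provider is wrapped in an async callable that raises on failure: the gateway tries providers in preference order, retries briefly with backoff, and returns the first successful response.

import asyncio
from typing import Awaitable, Callable, List, Optional

async def call_with_fallback(
    prompt: str,
    providers: List[Callable[[str], Awaitable[str]]],
    retries_per_provider: int = 2,
) -> str:
    """Try each provider in order and return the first successful response."""
    last_error: Optional[Exception] = None
    for provider in providers:
        for attempt in range(retries_per_provider):
            try:
                return await provider(prompt)
            except Exception as exc:  # rate limit, timeout, outage, ...
                last_error = exc
                await asyncio.sleep(2 ** attempt)  # simple exponential backoff
    raise RuntimeError(f"All providers failed: {last_error}")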
Performance Optimization
The centralized nature of the gateway makes it ideal for implementing load balancing, caching frequent responses, and monitoring overall system performance.
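For caching, one low-effort sketch is an in-memory store keyed by model plus normalized prompt with a short time-to-live; a production gateway would more likely use Redis or similar, but the idea is the same. This is illustrative, not the gateway's actual cache.

import hashlib
import time
from typing import Dict, Optional, Tuple

_CACHE: Dict[str, Tuple[float, str]] = {}
CACHE_TTL_SECONDS = 300  # keep answers for five minutes

def _cache_key(model: str, prompt: str) -> str:
    return hashlib.sha256(f"{model}:{prompt.strip().lower()}".encode()).hexdigest()

def get_cached(model: str, prompt: str) -> Optional[str]:
    """Return a cached response if it exists and has not expired."""
    entry = _CACHE.get(_cache_key(model, prompt))
    if entry and time.time() - entry[0] < CACHE_TTL_SECONDS:
        return entry[1]
    return None

def set_cached(model: str, prompt: str, response: str) -> None:
    """Store a response alongside the time it was cached."""
    _CACHE[_cache_key(model, prompt)] = (time.time(), response)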
Security and Usage Insights
With all traffic flowing through a single point, the gateway provides comprehensive logging for security audits and usage analytics, helping inform future resource allocation decisions.
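Because all traffic passes through one FastAPI application, an HTTP middleware is a natural place to hang audit logging. A sketch under that assumption (the log fields are illustrative, and this standalone app is separate from the gateway code later in the post):

import logging
import time
import uuid

from fastapi import FastAPI, Request

logger = logging.getLogger("gateway.audit")
app = FastAPI()

@app.middleware("http")
async def audit_log(request: Request, call_next):
    """Emit one structured log line per request for security audits and usage analytics."""
    request_id = str(uuid.uuid4())
    start = time.time()
    response = await call_next(request)
    logger.info(
        "request_id=%s path=%s status=%s duration_ms=%.1f",
        request_id, request.url.path, response.status_code, (time.time() - start) * 1000,
    )
    return response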
The Impact on Our Organization
Implementing this router-gateway architecture has dramatically improved how we manage our AI systems. Teams no longer need to worry about which model to use for which task – the system handles that automatically. We've seen reduced costs, improved response quality, and better overall user experiences.
The system has also provided unexpected benefits in terms of scalability. As new models become available, we can easily integrate them into our architecture without disrupting existing workflows. This future-proofs our AI infrastructure as the LLM landscape continues to evolve. If you're managing AI systems across teams or applications, consider implementing a similar router-gateway architecture to make your systems more efficient, cost-effective, and scalable.
The right model for the right job, at the right time, with the right controls – that's the foundation of a truly effective multi-model AI system.
Router and Gateway Implementation
""" AI Model Gateway Service This FastAPI application serves as a gateway for multiple AI models, providing a unified interface to interact with different AI services (OpenAI, Gemini, and DeepSeek). It includes intelligent routing capabilities to direct queries to the most appropriate model based on intent and requirements. Key Features: - Unified API interface for multiple AI models - Intelligent query routing based on intent classification - Support for OpenAI (Azure), Google Gemini, and DeepSeek models - Health monitoring and model availability checks - Test endpoints for each supported model - Centralized access control and API key management - Cost management and usage tracking - Failover and redundancy mechanisms - Security and audit logging Environment Variables Required: - AZURE_OPENAI_ENDPOINT: Azure OpenAI service endpoint - AZURE_OPENAI_API_KEY: Azure OpenAI API key - AZURE_OPENAI_API_VERSION: Azure OpenAI API version - AZURE_OPENAI_DEPLOYMENT_NAME: Azure OpenAI deployment name - GEMINI_API_KEY: Google Gemini API key - DEEPSEEK_URL: DeepSeek API endpoint URL - JWT_SECRET: Secret key for JWT authentication Author: Sreeni Ramadurai Date: 2025-03-06 Version: 1.0.0 """ # 1. Imports from fastapi import FastAPI, HTTPException, Request, Depends, Security, Query from fastapi.security import HTTPBearer, HTTPAuthorizationCredentials from fastapi.middleware.cors import CORSMiddleware from pydantic import BaseModel, Field from typing import Dict, Any, Optional, Literal, List from enum import Enum from datetime import datetime, timedelta from functools import lru_cache import os import json import jwt import time import logging import asyncio import uuid import requests import google.generativeai as genai from dotenv import load_dotenv from langchain_openai import AzureChatOpenAI # 2. Configure logging logging.basicConfig( level=logging.INFO, format='%(asctime)s - %(name)s - %(levelname)s - %(message)s' ) logger = logging.getLogger(__name__) # 3. Load environment variables load_dotenv() # 4. Initialize FastAPI app app = FastAPI( title="AI Model Gateway", description="A gateway service for different AI models with advanced features", version="1.0.0" ) # 5. Add CORS middleware app.add_middleware( CORSMiddleware, allow_origins=["*"], # In production, replace with specific origins allow_credentials=True, allow_methods=["*"], allow_headers=["*"], ) # 6. Security settings security = HTTPBearer() JWT_SECRET = os.getenv("JWT_SECRET", "your-secret-key") ACCESS_TOKEN_EXPIRE_MINUTES = 30 API_USERS = { "admin": "admin123", # In production, use hashed passwords and a database "user": "user123" } # 7. Model costs MODEL_COSTS = { "gpt-4o-mini": 0.0001, "gemini-1.5-pro": 0.00005, "deepseek-r1": 0.00002 } # 8. Enums class QueryIntent(str, Enum): """ Enumeration of possible query intents for classification. Attributes: TECHNICAL: Technical support and system-related queries GENERAL: General information and non-specific queries HR: Human Resources related queries OUT_OF_SCOPE: Queries that are not supported by the system """ TECHNICAL = "technical" GENERAL = "general" HR = "hr" OUT_OF_SCOPE = "out_of_scope" class ModelCapability(str, Enum): """ Enumeration of model capabilities for routing decisions. 
Attributes: TECHNICAL: Ability to handle technical queries GENERAL: Ability to handle general queries HR: Ability to handle HR-related queries LONG_CONTEXT: Ability to handle long context inputs COST_EFFICIENT: Cost-effective model option """ TECHNICAL = "technical" GENERAL = "general" HR = "hr" LONG_CONTEXT = "long_context" COST_EFFICIENT = "cost_efficient" # 9. Pydantic Models class RouterConfig(BaseModel): """ Configuration for model routing based on intent. Attributes: intent: The classified intent of the query model_type: Type of model to use (openai, gemini, deepseek) model_name: Specific model name/version max_tokens: Maximum tokens for the model temperature: Model temperature setting capabilities: List of model capabilities """ intent: QueryIntent model_type: str model_name: str max_tokens: int temperature: float = 0.7 capabilities: List[ModelCapability] class RouterRequest(BaseModel): """ Request model for the router endpoint. Attributes: input_data: The text input to process context_length: Length of context in tokens cost_sensitive: Whether to prioritize cost efficiency """ input_data: str = Field(..., description="Input text to process") context_length: Optional[int] = Field(default=1000, description="Length of the context in tokens") cost_sensitive: Optional[bool] = Field(default=False, description="Whether to prioritize cost efficiency") class RouterResponse(BaseModel): """ Response model for the router endpoint. Attributes: intent: The classified intent model_type: Selected model type model_name: Selected model name confidence: Classification confidence score explanation: Explanation of the classification """ intent: QueryIntent model_type: str model_name: str confidence: float explanation: str class ModelResponse(BaseModel): """ Response model for model endpoints. Attributes: status: Success or error status model: Model used for processing response: Model's response text error: Error message if any """ status: str model: str response: Optional[str] = None error: Optional[str] = None class TestRequest(BaseModel): """ Request model for test endpoints. Attributes: input_data: Test input text max_tokens: Maximum tokens for generation """ input_data: str = Field(..., description="Input text to process") max_tokens: Optional[int] = Field(default=1000, description="Maximum number of tokens to generate") class ModelRequest(BaseModel): """ Request model for direct model access. Attributes: input_data: Input text to process model_type: Type of model to use model_name: Specific model name max_tokens: Maximum tokens for generation """ input_data: str = Field(..., description="Input text to process") model_type: str = Field(..., description="Type of model to use (openai, gemini, deepseek)") model_name: str = Field(..., description="Name of the specific model to use") max_tokens: Optional[int] = Field(default=1000, description="Maximum number of tokens to generate") class HealthResponse(BaseModel): """ Response model for health check endpoint. Attributes: status: Overall system status models: Dictionary of model availability status """ status: str models: Dict[str, bool] class UserAuth(BaseModel): """ Authentication request model. Attributes: username: User's username password: User's password """ username: str password: str class Token(BaseModel): """ Authentication token response model. Attributes: access_token: JWT access token token_type: Type of token (bearer) """ access_token: str token_type: str # 10. 
Router Configurations ROUTER_CONFIGS = { QueryIntent.TECHNICAL: RouterConfig( intent=QueryIntent.TECHNICAL, model_type="openai", model_name="gpt-4o-mini", max_tokens=1001, capabilities=[ModelCapability.TECHNICAL] ), QueryIntent.GENERAL: RouterConfig( intent=QueryIntent.GENERAL, model_type="gemini", model_name="gemini-1.5-pro", max_tokens=4001, capabilities=[ModelCapability.GENERAL, ModelCapability.LONG_CONTEXT] ), QueryIntent.HR: RouterConfig( intent=QueryIntent.HR, model_type="deepseek", model_name="deepseek-r1", max_tokens=4001, capabilities=[ModelCapability.HR] ), } # 11. Helper Functions async def verify_token(credentials: HTTPAuthorizationCredentials = Security(security)): """ Verify JWT token and return payload. Args: credentials: HTTP authorization credentials containing the JWT token Returns: dict: Decoded JWT payload Raises: HTTPException: If token is invalid """ try: payload = jwt.decode(credentials.credentials, JWT_SECRET, algorithms=["HS256"]) return payload except jwt.InvalidTokenError: raise HTTPException( status_code=401, detail="Invalid authentication token" ) def create_access_token(data: dict, expires_delta: Optional[timedelta] = None) -> str: """ Create a new JWT access token. Args: data: Data to encode in the token expires_delta: Token expiration time Returns: str: Encoded JWT token """ to_encode = data.copy() if expires_delta: expire = datetime.utcnow() + expires_delta else: expire = datetime.utcnow() + timedelta(minutes=15) to_encode.update({"exp": expire}) encoded_jwt = jwt.encode(to_encode, JWT_SECRET, algorithm="HS256") return encoded_jwt async def test_model_connectivity(model: str) -> bool: """ Test connectivity to a specific model. Args: model: Name of the model to test Returns: bool: True if model is accessible, False otherwise """ return True # 12. Model Handlers async def openai_model(input_data: str, model_name: str, max_tokens: int) -> ModelResponse: """ Handle requests to OpenAI model with retry logic. Args: input_data: Input text to process model_name: Name of the OpenAI model to use max_tokens: Maximum tokens for generation Returns: ModelResponse: Response from the model """ max_retries = 3 retry_delay = 1 for attempt in range(max_retries): try: start_time = time.time() client = AzureChatOpenAI( azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"), api_key=os.getenv("AZURE_OPENAI_API_KEY"), api_version=os.getenv("AZURE_OPENAI_API_VERSION", "2024-02-15-preview"), azure_deployment=os.getenv("AZURE_OPENAI_DEPLOYMENT_NAME", "gpt-4o-mini") ) response = await client.ainvoke(input_data) return ModelResponse(status="success", model=model_name, response=response.content) except Exception as e: if attempt == max_retries - 1: logger.error(f"OpenAI model error after {max_retries} attempts: {str(e)}") return ModelResponse(status="error", model=model_name, error=str(e)) await asyncio.sleep(retry_delay * (attempt + 1)) async def gemini_model(input_data: str, model_name: str, max_tokens: int) -> ModelResponse: """ Handle requests to Google Gemini model. 
Args: input_data: Input text to process model_name: Name of the Gemini model to use max_tokens: Maximum tokens for generation Returns: ModelResponse: Response from the model """ try: genai.configure(api_key=os.getenv("GEMINI_API_KEY")) model = genai.GenerativeModel(model_name) response = await model.generate_content_async(input_data) return ModelResponse(status="success", model=model_name, response=response.text) except Exception as e: return ModelResponse(status="error", model=model_name, error=str(e)) async def deepseek_model(input_data: str, model_name: str, max_tokens: int) -> ModelResponse: """ Handle requests to DeepSeek model. Args: input_data: Input text to process model_name: Name of the DeepSeek model to use max_tokens: Maximum tokens for generation Returns: ModelResponse: Response from the model """ try: deepseek_url = os.getenv("DEEPSEEK_URL", "http://localhost:8000/v1/chat/completions") payload = { "model": model_name, "messages": [{"role": "user", "content": input_data}], "max_tokens": max_tokens, "temperature": 0.7, "stream": False } response = requests.post(deepseek_url, json=payload) response.raise_for_status() result = response.json() return ModelResponse(status="success", model=model_name, response=result["message"]["content"]) except Exception as e: return ModelResponse(status="error", model=model_name, error=str(e)) # 13. Intent Classification async def classify_intent(input_data: str, context_length: int, cost_sensitive: bool) -> RouterResponse: """ Classify the intent of a query and determine the appropriate model. Args: input_data: Input text to classify context_length: Length of context in tokens cost_sensitive: Whether to prioritize cost efficiency Returns: RouterResponse: Classification result with model selection """ try: # Try OpenAI first if available if os.getenv("AZURE_OPENAI_API_KEY"): try: client = AzureChatOpenAI( azure_deployment="gpt-4o-mini", api_version=os.getenv("AZURE_OPENAI_API_VERSION", "2023-05-15"), azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"), api_key=os.getenv("AZURE_OPENAI_API_KEY"), temperature=0, ) prompt = f"""Analyze the following query and determine its intent. Consider: 1. Is it a technical issue? (e.g., software problems, system errors, technical support) 2. Is it an HR-related query? Look for topics like: - Benefits (401k, health insurance, dental, vision) - Payroll and compensation - Leave policies (vacation, sick leave, FMLA) - Employee relations - Hiring and recruitment - Performance reviews - Training and development - Employee policies - Workplace accommodations - Employee handbook - HR forms and documents 3. Is it a general query? (e.g., general information, office locations, company policies) 4. Is it out of scope? 
Query: {input_data} Context length: {context_length} tokens Cost sensitive: {cost_sensitive} Respond in JSON format with: - intent: one of ["technical", "general", "hr"] - confidence: float between 0 and 1 - explanation: brief explanation of the classification """ response = await client.ainvoke(prompt) result = response.content.strip() if result.startswith('``` json'): result = result[7:] if result.endswith(' ```'): result = result[:-3] result = result.strip() classification = json.loads(result) config = ROUTER_CONFIGS[QueryIntent(classification["intent"])] if context_length > 2001: config = ROUTER_CONFIGS[QueryIntent.GENERAL] # Use Gemini for long context elif cost_sensitive and config.model_type != "gemini": config = ROUTER_CONFIGS[QueryIntent.GENERAL] # Use Gemini for cost efficiency return RouterResponse( intent=QueryIntent(classification["intent"]), model_type=config.model_type, model_name=config.model_name, confidence=classification["confidence"], explanation=classification["explanation"] ) except Exception as e: print(f"OpenAI classification error: {str(e)}") # Fallback to DeepSeek if OpenAI fails if os.getenv("DEEPSEEK_URL"): try: deepseek_url = os.getenv("DEEPSEEK_URL") payload = { "model": "deepseek-chat", "messages": [ { "role": "user", "content": f"""Analyze the following query and determine its intent. Consider: 1. Is it a technical issue? (e.g., software problems, system errors, technical support) 2. Is it an HR-related query? Look for topics like: - Benefits (401k, health insurance, dental, vision) - Payroll and compensation - Leave policies (vacation, sick leave, FMLA) - Employee relations - Hiring and recruitment - Performance reviews - Training and development - Employee policies - Workplace accommodations - Employee handbook - HR forms and documents 3. Is it a general query? (e.g., general information, office locations, company policies) 4. Is it out of scope? 5. Is it ambiguous? 
Query: {input_data} Context length: {context_length} tokens Cost sensitive: {cost_sensitive} Respond in JSON format with: - intent: one of ["technical", "hr", "general", "out_of_scope", "ambiguous"] - confidence: float between 0 and 1 - explanation: brief explanation of the classification""" } ], "max_tokens": 500, "temperature": 0.3, "stream": False } response = requests.post(deepseek_url, json=payload) response.raise_for_status() result = response.json() classification = { "intent": result["message"]["content"].strip(), "confidence": 0.8, "explanation": "Classified using DeepSeek model" } config = ROUTER_CONFIGS[QueryIntent(classification["intent"])] if context_length > 2001: config = ROUTER_CONFIGS[QueryIntent.GENERAL] # Use Gemini for long context elif cost_sensitive and config.model_type != "gemini": config = ROUTER_CONFIGS[QueryIntent.GENERAL] # Use Gemini for cost efficiency return RouterResponse( intent=QueryIntent(classification["intent"]), model_type=config.model_type, model_name=config.model_name, confidence=classification["confidence"], explanation=classification["explanation"] ) except Exception as e: print(f"DeepSeek classification error: {str(e)}") return RouterResponse( intent=QueryIntent.GENERAL, model_type="gemini", model_name="gemini-1.5-pro", confidence=0.5, explanation="Fallback to general model due to classification errors" ) except Exception as e: print(f"Error in intent classification: {str(e)}") return RouterResponse( intent=QueryIntent.GENERAL, model_type="gemini", model_name="gemini-1.5-pro", confidence=0.5, explanation="Fallback to general model due to classification error" ) # 14. Main Endpoints @app.get("/") async def root(): """ Root endpoint returning API information and available endpoints. Returns: dict: API information and endpoint list """ return { "name": "AI Model Gateway", "version": "1.0.0", "documentation": "/docs", "endpoints": { "model": "/model", "router": "/router", "smart_model": "/smart-model", "health": "/health", "test_openai": "/test/openai", "test_gemini": "/test/gemini", "test_deepseek": "/test/deepseek" } } @app.post("/auth/login", response_model=Token) async def login(user_auth: UserAuth): """ Authenticate user and return JWT token. Args: user_auth: User authentication credentials Returns: Token: JWT access token Raises: HTTPException: If credentials are invalid """ if user_auth.username not in API_USERS or API_USERS[user_auth.username] != user_auth.password: raise HTTPException( status_code=401, detail="Incorrect username or password", headers={"WWW-Authenticate": "Bearer"}, ) access_token_expires = timedelta(minutes=ACCESS_TOKEN_EXPIRE_MINUTES) access_token = create_access_token( data={"sub": user_auth.username}, expires_delta=access_token_expires ) return {"access_token": access_token, "token_type": "bearer"} @app.post("/router", response_model=RouterResponse) async def route_query(request: RouterRequest): """ Route query to appropriate model based on intent classification. 
Args: request: Router request containing query details Returns: RouterResponse: Routing decision with model selection Raises: HTTPException: If routing fails """ try: return await classify_intent( request.input_data, request.context_length, request.cost_sensitive ) except Exception as e: raise HTTPException( status_code=500, detail=f"Routing error: {str(e)}" ) @app.post("/smart-model", response_model=ModelResponse) async def smart_model( input_data: str = Query(..., description="Input text to process"), context_length: Optional[int] = Query(default=1000, description="Length of context in tokens"), cost_sensitive: Optional[bool] = Query(default=False, description="Whether to prioritize cost efficiency"), user: dict = Depends(verify_token) ): """ Smart model endpoint that automatically routes queries to appropriate model. Args: input_data: Input text to process context_length: Length of context in tokens cost_sensitive: Whether to prioritize cost efficiency user: Authenticated user information Returns: ModelResponse: Response from the selected model Raises: HTTPException: If processing fails """ try: if context_length <= 0: raise HTTPException( status_code=400, detail="context_length must be greater than 0" ) routing = await classify_intent( input_data, context_length, cost_sensitive ) max_tokens = ROUTER_CONFIGS[routing.intent].max_tokens if context_length > 2001: routing.model_type = "gemini" routing.model_name = "gemini-1.5-pro" routing.intent = QueryIntent.GENERAL max_tokens = 4001 if routing.model_type == "openai" and max_tokens > 1001: max_tokens = 1001 elif routing.model_type == "deepseek" and max_tokens > 4001: max_tokens = 4001 elif routing.model_type == "gemini" and max_tokens > 4001: max_tokens = 4001 if routing.model_type == "openai": result = await openai_model( input_data, routing.model_name, max_tokens ) elif routing.model_type == "gemini": result = await gemini_model( input_data, routing.model_name, max_tokens ) elif routing.model_type == "deepseek": result = await deepseek_model( input_data, routing.model_name, max_tokens ) else: raise HTTPException( status_code=400, detail=f"Unsupported model type: {routing.model_type}" ) return result except Exception as e: logger.error(f"Error in smart_model: {str(e)}") return ModelResponse( status="error", model="router", error=str(e) ) @app.post("/model", response_model=ModelResponse) async def model_gateway(request: ModelRequest): """ Direct model access endpoint. Args: request: Model request containing input and model details Returns: ModelResponse: Response from the specified model Raises: HTTPException: If processing fails """ try: if request.model_type == "openai": result = await openai_model( request.input_data, request.model_name, request.max_tokens ) elif request.model_type == "gemini": result = await gemini_model( request.input_data, request.model_name, request.max_tokens ) elif request.model_type == "deepseek": result = await deepseek_model( request.input_data, request.model_name, request.max_tokens ) else: raise HTTPException( status_code=400, detail=f"Unsupported model type: {request.model_type}" ) if result.status == "error": raise HTTPException( status_code=500, detail=result.error ) return result except HTTPException: raise except Exception as e: raise HTTPException( status_code=500, detail=f"Gateway error: {str(e)}" ) @app.get("/health", response_model=HealthResponse) async def health_check(): """ Health check endpoint to verify model availability. 
Returns: HealthResponse: Status of all models """ model_status = {} for model, env_var in { "openai": "AZURE_OPENAI_API_KEY", "gemini": "GEMINI_API_KEY", "deepseek": "DEEPSEEK_URL" }.items(): is_available = bool(os.getenv(env_var)) if is_available: try: await test_model_connectivity(model) except Exception as e: logger.error(f"Model {model} connectivity test failed: {str(e)}") is_available = False model_status[model] = is_available return HealthResponse( status="healthy" if all(model_status.values()) else "degraded", models=model_status ) # 15. Test Endpoints @app.post("/test/openai", response_model=ModelResponse) async def test_openai(request: TestRequest): """ Test endpoint for OpenAI model. Args: request: Test request with input data Returns: ModelResponse: Response from OpenAI model Raises: HTTPException: If test fails """ try: result = await openai_model( request.input_data, "gpt-4", request.max_tokens ) if result.status == "error": raise HTTPException( status_code=500, detail=result.error ) return result except Exception as e: raise HTTPException( status_code=500, detail=f"OpenAI test error: {str(e)}" ) @app.post("/test/gemini", response_model=ModelResponse) async def test_gemini(request: TestRequest): """ Test endpoint for Gemini model. Args: request: Test request with input data Returns: ModelResponse: Response from Gemini model Raises: HTTPException: If test fails """ try: result = await gemini_model( request.input_data, "gemini-2.0-flash", request.max_tokens ) if result.status == "error": raise HTTPException( status_code=500, detail=result.error ) return result except Exception as e: raise HTTPException( status_code=500, detail=f"Gemini test error: {str(e)}" ) @app.post("/test/deepseek", response_model=ModelResponse) async def test_deepseek(request: TestRequest): """ Test endpoint for DeepSeek model. Args: request: Test request with input data Returns: ModelResponse: Response from DeepSeek model Raises: HTTPException: If test fails """ try: result = await deepseek_model( request.input_data, "deepseek-chat", request.max_tokens ) if result.status == "error": raise HTTPException( status_code=500, detail=result.error ) return result except Exception as e: raise HTTPException( status_code=500, detail=f"DeepSeek test error: {str(e)}" ) # 16. Main entry point if __name__ == '__main__': import uvicorn uvicorn.run(app, host="0.0.0.0", port=8050) Gateway Endpoints
(Screenshots in the original post show the gateway endpoints in the interactive API docs, a model invocation with context and its output, and a general query being routed by the router to the locally hosted DeepSeek LLM.)
Implementing an Effective Intent Classification System
The key to a successful router is accurate intent classification. Here's how we've optimized our classification function with few-shot prompting:
# Intent Classification Prompt Template
def classify_intent(user_query):
    """
    Classify user query intent using few-shot examples to achieve high confidence scores.

    Input: User query text
    Output: Intent classification with confidence score

    Example few-shot prompts:

    Query: "Can you help me debug this Python function?"
    Intent: TECHNICAL
    Confidence: 0.95
    Reasoning: Contains programming language reference and technical task request.

    Query: "How do I set up a CI/CD pipeline for my Node.js application?"
    Intent: TECHNICAL
    Confidence: 0.97
    Reasoning: Involves DevOps implementation and programming framework specifics.

    Query: "Explain the error in this SQL query: SELECT * FROM users WHERE username = 'john' AND AND email = 'john@example.com'"
    Intent: TECHNICAL
    Confidence: 0.98
    Reasoning: Contains specific database query syntax and error identification request.

    Query: "What's the capital of France?"
    Intent: GENERAL
    Confidence: 0.98
    Reasoning: Simple factual question requiring basic knowledge.

    Query: "Can you summarize the key points of climate change?"
    Intent: GENERAL
    Confidence: 0.94
    Reasoning: Requests information synthesis on a general knowledge topic without specialized expertise.

    Query: "What are the main differences between capitalism and socialism?"
    Intent: GENERAL
    Confidence: 0.96
    Reasoning: Comparative analysis question on broad economic/political systems.

    Query: "What's the process for requesting time off in our company?"
    Intent: HR
    Confidence: 0.93
    Reasoning: Involves company policy related to employee leave and HR procedures.

    Query: "How should I prepare for my annual performance review?"
    Intent: HR
    Confidence: 0.91
    Reasoning: Related to employee evaluation process and professional development.

    Query: "What are best practices for addressing conflicts between team members?"
    Intent: HR
    Confidence: 0.89
    Reasoning: Involves workplace relationship management and conflict resolution.
    """

By providing diverse examples with explicit reasoning, our router achieves higher accuracy in intent classification. We've included examples in multiple languages and with varying complexity to handle edge cases.
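In practice, the few-shot examples above are prepended to the user's query and sent to a lightweight classifier model, whose JSON reply is parsed into an intent and confidence. A hedged sketch of that wiring, with the classifier call left abstract and the helper names invented for illustration:

import json

FEW_SHOT_EXAMPLES = "...(the examples above)..."  # placeholder for the few-shot block

def build_classifier_prompt(user_query: str) -> str:
    """Assemble the few-shot prompt sent to the classifier model."""
    return (
        f"{FEW_SHOT_EXAMPLES}\n\n"
        f"Query: \"{user_query}\"\n"
        "Respond in JSON with keys: intent, confidence, reasoning."
    )

def parse_classification(raw: str) -> dict:
    """Parse the classifier's JSON reply, falling back to GENERAL on malformed output."""
    try:
        cleaned = raw.strip().removeprefix("```json").removesuffix("```")
        result = json.loads(cleaned)
        return {"intent": result["intent"], "confidence": float(result["confidence"])}
    except (json.JSONDecodeError, KeyError, ValueError):
        return {"intent": "GENERAL", "confidence": 0.5}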
Thanks
Sreeni Ramadorai





