A high-performance, stateless OpenRouter proxy service built with Node.js, TypeScript, and Express. It provides a REST API and WebSocket streaming for LLM inference, with no authentication or user tracking.
- Dual Interface: REST API for standard requests and WebSocket for streaming
- Comprehensive Parameter Support: System prompts, model/provider selection, temperature, tools, etc.
- Multi-modal Support: Text, audio, and image generation capabilities
- Robust Error Handling: Graceful failure recovery and informative error responses
- High Performance: Optimized for speed and low latency
- IP-based Rate Limiting: Protection against abuse while maintaining simplicity
- Node.js 20+
- npm or yarn
- OpenRouter API key
- Clone the repository:

```bash
git clone <repository-url>
cd llm-proxy
```

- Install dependencies:

```bash
npm install
```

- Set up environment variables:

```bash
cp .env.example .env
# Edit .env with your OpenRouter API key
```

- Build the project:

```bash
npm run build
```

- Start the server:

```bash
npm start
```

For development:

```bash
npm run dev
```

The service uses environment variables for configuration. See `.env.example` for all available options:
- `OPENROUTER_API_KEY`: Your OpenRouter API key
- `PORT`: Server port (default: 3000)
- `HOST`: Server host (default: 0.0.0.0)
- `NODE_ENV`: Environment (development/production/test)
- `LOG_LEVEL`: Logging level (debug/info/warn/error)
- `RATE_LIMIT_WINDOW_MS`: Rate limit window in milliseconds (default: 900000)
- `RATE_LIMIT_MAX_REQUESTS`: Max requests per window (default: 100)
- `WS_MAX_CONNECTIONS`: Max WebSocket connections (default: 1000)
- `WS_HEARTBEAT_INTERVAL`: WebSocket heartbeat interval (default: 30000)
- `MAX_CONCURRENT_REQUESTS`: Max concurrent requests (default: 100)
- `REQUEST_TIMEOUT`: Request timeout in milliseconds (default: 30000)
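For reference, a `.env` built from the defaults above might look like the following; the API key is a placeholder and every other value simply mirrors the documented defaults:

```env
# Example .env (illustrative; replace the key with your own)
OPENROUTER_API_KEY=sk-or-your-key-here
PORT=3000
HOST=0.0.0.0
NODE_ENV=development
LOG_LEVEL=info
RATE_LIMIT_WINDOW_MS=900000
RATE_LIMIT_MAX_REQUESTS=100
WS_MAX_CONNECTIONS=1000
WS_HEARTBEAT_INTERVAL=30000
MAX_CONCURRENT_REQUESTS=100
REQUEST_TIMEOUT=30000
```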
Check service health status via `GET /health`.
Response:
{ "status": "healthy", "timestamp": "2024-01-01T00:00:00.000Z", "uptime": 123.45, "version": "1.0.0", "environment": "production" }Create a completion using the specified model.
Request Body:
{ "model": "openai/gpt-4o", "messages": [ { "role": "user", "content": "Hello, world!" } ], "temperature": 0.7, "max_tokens": 100, "stream": false }Response:
{ "id": "chatcmpl-123", "choices": [ { "finish_reason": "stop", "message": { "content": "Hello! How can I help you today?", "role": "assistant" } } ], "usage": { "prompt_tokens": 10, "completion_tokens": 15, "total_tokens": 25 }, "model": "openai/gpt-4o", "created": 1704067200, "object": "chat.completion" }Create a streaming completion using the specified model.
Request Body:
{ "model": "openai/gpt-4o", "messages": [ { "role": "user", "content": "Tell me a story" } ], "temperature": 0.7, "max_tokens": 500, "stream": true }Response: Server-Sent Events stream
data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1704067200,"model":"openai/gpt-4o","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]} data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1704067200,"model":"openai/gpt-4o","choices":[{"index":0,"delta":{"content":"Once"},"finish_reason":null}]} data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1704067200,"model":"openai/gpt-4o","choices":[{"index":0,"delta":{"content":" upon"},"finish_reason":null}]} data: [DONE] List all available models with optional filtering and pagination.
Query Parameters:
- `provider` (optional): Filter by provider (e.g., "openai", "anthropic")
- `search` (optional): Search in model name or description
- `limit` (optional): Number of models to return (default: 50, max: 100)
- `offset` (optional): Number of models to skip (default: 0)
Example:
```
GET /api/v1/models?provider=openai&search=gpt&limit=10&offset=0
```

Response:
{ "data": [ { "id": "openai/gpt-4o", "name": "GPT-4o", "description": "Most advanced GPT-4 model", "context_length": 128000, "pricing": { "prompt": "0.005", "completion": "0.015" }, "supported_parameters": ["temperature", "max_tokens", "top_p"], "is_moderated": true, "max_completion_tokens": 4096 } ], "pagination": { "total": 150, "limit": 10, "offset": 0, "hasMore": true } }Get detailed information about a specific model.
Example:
```
GET /api/v1/models/openai/gpt-4o
```

Response:
{ "data": { "id": "openai/gpt-4o", "name": "GPT-4o", "description": "Most advanced GPT-4 model", "context_length": 128000, "pricing": { "prompt": "0.005", "completion": "0.015" }, "supported_parameters": ["temperature", "max_tokens", "top_p", "frequency_penalty", "presence_penalty"], "is_moderated": true, "max_completion_tokens": 4096 } }Get supported parameters for a specific model.
Example:
```
GET /api/v1/models/openai/gpt-4o/parameters
```

Response:
{ "data": { "model": "openai/gpt-4o", "supported_parameters": [ "temperature", "max_tokens", "top_p", "frequency_penalty", "presence_penalty", "stop", "stream" ] } }Get pricing information for a specific model.
Example:
```
GET /api/v1/models/openai/gpt-4o/pricing
```

Response:
{ "data": { "model": "openai/gpt-4o", "pricing": { "prompt": "0.005", "completion": "0.015" } } }Get top models by context length.
Query Parameters:
- `limit` (optional): Number of models to return (default: 10)
Example:
```
GET /api/v1/models/top?limit=5
```

Response:
{ "data": [ { "id": "anthropic/claude-3-5-sonnet-20241022", "name": "Claude 3.5 Sonnet", "context_length": 200000, "pricing": { "prompt": "0.003", "completion": "0.015" } } ] }Search models by query.
Query Parameters:
- `q` (required): Search query
- `limit` (optional): Number of results to return (default: 20)
Example:
```
GET /api/v1/models/search?q=code&limit=5
```

Response:
{ "data": [ { "id": "openai/gpt-4o", "name": "GPT-4o", "description": "Most advanced GPT-4 model with code capabilities" } ], "query": "code", "total": 25 }Get all available providers.
Response:
{ "data": [ "openai", "anthropic", "google", "meta", "mistral" ] }Get models by provider.
Example:
```
GET /api/v1/models/providers/openai
```

Response:
{ "data": [ { "id": "openai/gpt-4o", "name": "GPT-4o", "context_length": 128000 } ], "provider": "openai" }Connect to the WebSocket endpoint:
```
ws://localhost:3000/ws
```

Inference request:

```json
{
  "type": "inference_request",
  "id": "req-123",
  "data": {
    "model": "openai/gpt-4o",
    "messages": [
      { "role": "user", "content": "Hello, world!" }
    ],
    "temperature": 0.7,
    "max_tokens": 100
  }
}
```

Inference response:

```json
{
  "type": "inference_response",
  "id": "req-123",
  "data": {
    "content": "Hello! How can I help you today?",
    "finish_reason": "stop",
    "usage": {
      "prompt_tokens": 10,
      "completion_tokens": 15,
      "total_tokens": 25
    },
    "model": "openai/gpt-4o",
    "created": 1704067200
  }
}
```

Heartbeat:

```json
{ "type": "heartbeat", "timestamp": 1704067200000 }
```

Error:

```json
{
  "type": "error",
  "id": "req-123",
  "error": {
    "code": 400,
    "message": "Invalid model",
    "type": "validation"
  }
}
```

Close:

```json
{ "type": "close", "reason": "Client requested close", "code": 1000 }
```

JavaScript (fetch):

```javascript
// Standard completion
const response = await fetch('http://localhost:3000/api/v1/inference', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    model: 'openai/gpt-4o',
    messages: [
      { role: 'user', content: 'Hello, world!' }
    ],
    temperature: 0.7,
    max_tokens: 100
  })
});

const data = await response.json();
console.log(data.choices[0].message.content);
```

```javascript
// Streaming completion
const response = await fetch('http://localhost:3000/api/v1/inference/stream', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    model: 'openai/gpt-4o',
    messages: [
      { role: 'user', content: 'Tell me a story' }
    ],
    temperature: 0.7,
    max_tokens: 500,
    stream: true
  })
});

const reader = response.body.getReader();
const decoder = new TextDecoder();

while (true) {
  const { done, value } = await reader.read();
  if (done) break;

  const chunk = decoder.decode(value);
  const lines = chunk.split('\n');

  for (const line of lines) {
    if (line.startsWith('data: ')) {
      const data = line.slice(6);
      if (data === '[DONE]') break;

      try {
        const parsed = JSON.parse(data);
        if (parsed.choices?.[0]?.delta?.content) {
          console.log(parsed.choices[0].delta.content);
        }
      } catch (e) {
        // Ignore invalid JSON
      }
    }
  }
}
```

WebSocket client (JavaScript):

```javascript
const ws = new WebSocket('ws://localhost:3000/ws');

ws.onopen = () => {
  // Send inference request
  ws.send(JSON.stringify({
    type: 'inference_request',
    id: 'req-123',
    data: {
      model: 'openai/gpt-4o',
      messages: [
        { role: 'user', content: 'Hello, world!' }
      ],
      temperature: 0.7
    }
  }));
};

ws.onmessage = (event) => {
  const message = JSON.parse(event.data);

  switch (message.type) {
    case 'inference_response':
      if (message.data.content) {
        console.log(message.data.content);
      }
      if (message.data.finish_reason) {
        console.log('Finished:', message.data.finish_reason);
      }
      break;
    case 'error':
      console.error('Error:', message.error.message);
      break;
    case 'heartbeat':
      console.log('Heartbeat received');
      break;
  }
};

ws.onclose = () => {
  console.log('WebSocket connection closed');
};
```

Python:

```python
import requests

# Standard completion
response = requests.post(
    'http://localhost:3000/api/v1/inference',
    json={
        'model': 'openai/gpt-4o',
        'messages': [
            {'role': 'user', 'content': 'Hello, world!'}
        ],
        'temperature': 0.7,
        'max_tokens': 100
    }
)

data = response.json()
print(data['choices'][0]['message']['content'])
```

cURL:

```bash
# Standard completion
curl -X POST http://localhost:3000/api/v1/inference \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-4o",
    "messages": [
      {"role": "user", "content": "Hello, world!"}
    ],
    "temperature": 0.7,
    "max_tokens": 100
  }'

# List models
curl http://localhost:3000/api/v1/models

# Get model details
curl http://localhost:3000/api/v1/models/openai/gpt-4o

# Search models
curl "http://localhost:3000/api/v1/models/search?q=gpt&limit=5"
```

All errors follow this format:
{ "error": { "code": 400, "message": "Validation error", "type": "validation", "details": { "field": "model", "message": "Model is required" } } }validation: Request validation failedrate_limit: Rate limit exceededopenrouter: OpenRouter API errorinternal: Internal server error
- `400`: Bad Request - Invalid request data
- `404`: Not Found - Model or endpoint not found
- `429`: Too Many Requests - Rate limit exceeded
- `500`: Internal Server Error - Server error
- `502`: Bad Gateway - OpenRouter API error
- `503`: Service Unavailable - Service temporarily unavailable
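A client can branch on the documented `error.type` to decide how to recover. The TypeScript sketch below is illustrative only; the `ProxyError` type simply mirrors the fields shown above and is not exported by the service:

```typescript
// Sketch of client-side handling for the documented error envelope (illustrative).
interface ProxyError {
  error: {
    code: number;
    message: string;
    type: 'validation' | 'rate_limit' | 'openrouter' | 'internal';
    details?: { field?: string; message?: string };
  };
}

async function createCompletion(body: unknown): Promise<unknown> {
  const res = await fetch('http://localhost:3000/api/v1/inference', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(body),
  });

  if (!res.ok) {
    const { error } = (await res.json()) as ProxyError;
    switch (error.type) {
      case 'validation':
        // 400: fix the request before retrying
        throw new Error(`Bad request: ${error.message}`);
      case 'rate_limit':
        // 429: back off until the rate-limit window resets
        throw new Error('Rate limited; retry after the window resets');
      default:
        // openrouter (502) or internal (500/503): retrying later may help
        throw new Error(`Server error (${error.code}): ${error.message}`);
    }
  }
  return res.json();
}
```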
The service implements IP-based rate limiting:
- Default: 100 requests per 15 minutes per IP
- Inference endpoints: 50 requests per 15 minutes per IP
- WebSocket: 5 connections per minute per IP
Rate limit headers are included in responses:
- `X-RateLimit-Limit`: Maximum requests allowed
- `X-RateLimit-Remaining`: Requests remaining in the current window
- `X-RateLimit-Reset`: Time when the rate limit resets
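A client can read these headers to pace itself. The sketch below is illustrative; the exact format of `X-RateLimit-Reset` is not specified here, so it only surfaces the raw values:

```typescript
// Minimal sketch: report the documented rate-limit headers after each request.
async function fetchWithRateInfo(url: string, init?: RequestInit): Promise<Response> {
  const res = await fetch(url, init);

  const limit = res.headers.get('X-RateLimit-Limit');
  const remaining = res.headers.get('X-RateLimit-Remaining');
  const reset = res.headers.get('X-RateLimit-Reset');

  if (remaining !== null && Number(remaining) === 0) {
    // At this point further requests will receive 429 until the window resets.
    console.warn(`Rate limit of ${limit} reached; window resets at ${reset}`);
  }
  return res;
}
```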
- `npm run dev` - Start the development server with hot reload
- `npm run build` - Build the project
- `npm start` - Start the production server
- `npm test` - Run tests
- `npm run test:watch` - Run tests in watch mode
- `npm run test:coverage` - Run tests with coverage
- `npm run lint` - Run ESLint
- `npm run lint:fix` - Fix ESLint errors
```
src/
├── controllers/   # Request handlers
├── services/      # Business logic
├── middleware/    # Express middleware
├── routes/        # API routes
├── types/         # TypeScript definitions
├── utils/         # Utility functions
├── app.ts         # Express app setup
└── server.ts      # Server entry point
```

The project includes comprehensive tests:
- Unit tests: Test individual functions and classes
- Integration tests: Test complete request/response cycles
- Load tests: Test performance under load
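For instance, an integration test against the health endpoint might look roughly like the sketch below. It assumes Jest and supertest and that `src/app.ts` exports the Express app; none of that is prescribed by this README:

```typescript
// Hypothetical integration test sketch (Jest + supertest assumed, not prescribed).
import request from 'supertest';
import app from '../src/app'; // assumes app.ts exports the Express app

describe('GET /health', () => {
  it('reports a healthy status', async () => {
    const res = await request(app).get('/health');

    expect(res.status).toBe(200);
    expect(res.body.status).toBe('healthy');
    expect(typeof res.body.uptime).toBe('number');
  });
});
```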
Run tests:
```bash
npm test
```

```bash
# Build the image
docker build -f docker/Dockerfile -t llm-proxy .

# Run the container
docker run -p 3000:3000 -e OPENROUTER_API_KEY=your-key llm-proxy
```

```bash
# Start all services
docker-compose -f docker/docker-compose.yml up -d

# Stop all services
docker-compose -f docker/docker-compose.yml down
```

The service provides monitoring endpoints:
- `GET /health` - Health check with uptime and version info
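A simple external probe can poll this endpoint; the sketch below is one way to do it, with the interval and failure handling chosen purely for illustration:

```typescript
// Illustrative health poller: checks /health every 30 seconds and logs failures.
const HEALTH_URL = 'http://localhost:3000/health';

async function checkHealth(): Promise<void> {
  try {
    const res = await fetch(HEALTH_URL);
    const body = await res.json();
    if (!res.ok || body.status !== 'healthy') {
      console.error('Service unhealthy:', body);
    }
  } catch (err) {
    console.error('Health check failed:', err);
  }
}

setInterval(checkHealth, 30_000);
```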
- IP-based rate limiting
- Input validation and sanitization
- CORS protection
- Security headers (Helmet)
- No authentication required (stateless design)
- Connection pooling for OpenRouter API
- Efficient WebSocket handling
- Memory-optimized streaming
- Request/response compression
- Caching for model information
- Stateless design for horizontal scaling
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests
- Submit a pull request
MIT License - see LICENSE file for details.
For issues and questions, please open an issue on GitHub.