CustomKB transforms your documents into AI-powered, searchable knowledgebases, using state-of-the-art embedding models, vector search, and language models to deliver contextually relevant answers from your data.
- Key Features
- How It Works
- Prerequisites
- Installation
- Quick Start
- Core Commands
- Configuration
- Advanced Features
- Security
- Performance Optimization
- Troubleshooting
- FAQ
- Contributing
- Support & Community
- License
- Semantic Search: Find information by meaning, not just keywords
- Multi-Provider AI: OpenAI, Anthropic, Google, xAI, and local models via Ollama
- Universal Document Support: Process Markdown, HTML, code, PDFs, and plain text
- 27+ Language Support: Multi-language processing with automatic detection
- Hybrid Search: Combines vector similarity with BM25 keyword matching
- Cross-Encoder Reranking: Boosts accuracy by 20-40% with advanced models
- Enterprise Security: Hardened against injection attacks, safe serialization (no pickle), input validation, path protection
- Memory-Optimized Tiers: Automatically adapts from 4GB to 128GB+ systems
- GPU Acceleration: CUDA support for faster reranking
- Concurrent Processing: Batch operations with configurable thread pools
- Smart Caching: Two-tier cache system with LRU eviction
- Production Ready: Checkpoint saving, automatic retries, graceful error handling
CustomKB follows a three-stage pipeline to transform your documents into an intelligent knowledgebase:
```
1. Document Processing
   ├─ Text extraction (Markdown, HTML, PDF, code, plain text)
   ├─ Language detection (27+ languages)
   ├─ Intelligent chunking (200-400 tokens, context-aware)
   └─ Metadata extraction (filenames, categories, timestamps)

2. Embedding Generation
   ├─ Vector embeddings via OpenAI, Google, or local models
   ├─ Batch processing with checkpoints
   ├─ FAISS index creation for fast similarity search
   └─ Optional BM25 index for hybrid search

3. Semantic Search & Query
   ├─ Query embedding generation
   ├─ Vector similarity search (k-NN via FAISS)
   ├─ Optional: Hybrid search (vector + BM25 keyword matching)
   ├─ Optional: Cross-encoder reranking for precision
   ├─ Context assembly from top results
   └─ LLM response generation with retrieved context
```

Why This Approach Works:
- Semantic Understanding: Vector embeddings capture meaning, not just keywords
- Hybrid Accuracy: Combining vector and keyword search catches both conceptual and exact matches
- Reranking Precision: Cross-encoders evaluate query-document pairs for superior relevance
- Efficient Retrieval: FAISS enables sub-millisecond search across millions of vectors
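The hybrid-scoring step is easiest to see in code. Below is a minimal sketch (not CustomKB's actual implementation; the function name and min-max normalization are illustrative) of blending vector-similarity and BM25 scores with a configurable weight, mirroring the `bm25_weight` setting shown in the configuration section:

```python
# Illustrative hybrid scoring; names and normalization are hypothetical.
import numpy as np

def hybrid_scores(vector_scores: np.ndarray,
                  bm25_scores: np.ndarray,
                  bm25_weight: float = 0.5) -> np.ndarray:
    """Blend vector-similarity and BM25 scores for one candidate set."""
    def normalize(x: np.ndarray) -> np.ndarray:
        rng = x.max() - x.min()
        return (x - x.min()) / rng if rng > 0 else np.zeros_like(x, dtype=float)
    return ((1.0 - bm25_weight) * normalize(vector_scores)
            + bm25_weight * normalize(bm25_scores))
```

With `bm25_weight = 0.5` both signals contribute equally; raising it favors exact keyword matches, lowering it favors semantic similarity.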
- Python: 3.12 or higher
- SQLite: 3.45+ (usually included with Python)
- RAM: 4GB+ (8GB+ recommended for optimal performance)
- GPU (optional): NVIDIA GPU with CUDA 11 or 12 for acceleration
- API Keys: For your chosen AI providers (OpenAI, Anthropic, Google, xAI)
```bash
git clone https://github.com/Open-Technology-Foundation/customkb.git
cd customkb

python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
pip install -r requirements.txt

# Automatic FAISS installation (detects GPU and CUDA version)
./setup/install_faiss.sh

# Or manual installation:
# CPU-only:       pip install -r requirements-faiss-cpu.txt
# GPU (CUDA 12):  pip install -r requirements-faiss-gpu-cu12.txt
# GPU (CUDA 11):  pip install -r requirements-faiss-gpu-cu11.txt

# Force specific variant:
# FAISS_VARIANT=cpu ./setup/install_faiss.sh

sudo ./setup/nltk_setup.py download cleanup
```

Choose between system-wide or user-local installation:
Option A: System-wide (requires sudo)
```bash
sudo mkdir -p /var/lib/vectordbs
sudo chown $USER:$USER /var/lib/vectordbs
export VECTORDBS="/var/lib/vectordbs"
```

Option B: User-local (no sudo required, recommended)
mkdir -p "$HOME/knowledgebases" export VECTORDBS="$HOME/knowledgebases"Add to your shell profile (~/.bashrc, ~/.zshrc, etc.):
export VECTORDBS="$HOME/knowledgebases" # or /var/lib/vectordbsexport OPENAI_API_KEY="your-openai-key" export ANTHROPIC_API_KEY="your-anthropic-key" export GOOGLE_API_KEY="your-google-key" # Optional export XAI_API_KEY="your-xai-key" # OptionalAdd these to your shell profile for persistence.
```bash
# 1. Create knowledgebase directory
mkdir -p "$VECTORDBS/myproject"

# 2. Create configuration
cat > "$VECTORDBS/myproject/myproject.cfg" << 'EOF'
[DEFAULT]
vector_model = text-embedding-3-small
query_model = gpt-4o-mini
db_min_tokens = 200
db_max_tokens = 400
EOF

# 3. Process documents (from your project directory)
customkb database myproject docs/*.md *.txt

# 4. Generate embeddings
customkb embed myproject

# 5. Query your knowledgebase
customkb query myproject "What are the main features?"
```

That's it! Your knowledgebase is ready to answer questions about your documents.
```bash
customkb database <kb_name> [files...] [options]
```

Process and store text files in the knowledgebase.
Options:
- `-l, --language`: Stopwords language (en, fr, de, etc.)
- `--detect-language`: Auto-detect language per file
- `-f, --force`: Reprocess existing files
- `-v, --verbose`: Detailed output
Examples:
```bash
# Process all markdown files
customkb database myproject ~/docs/**/*.md

# Auto-detect language for multilingual docs
customkb database myproject ~/docs/ --detect-language

# Force reprocess existing files
customkb database myproject ~/docs/*.md --force
```

```bash
customkb embed <kb_name> [options]
```

Create vector embeddings for all text chunks.
Options:
- `-r, --reset-database`: Reset embedding status
- `-v, --verbose`: Show progress
Examples:
```bash
# Generate embeddings with progress
customkb embed myproject --verbose

# Reset and regenerate all embeddings
customkb embed myproject --reset-database
```

```bash
customkb query <kb_name> "<question>" [options]
```

Perform semantic search and generate AI responses.
Options:
- `-c, --context-only`: Return only context, no AI response
- `-m, --model`: AI model to use
- `-k, --top-k`: Number of results (default: 50)
- `-t, --temperature`: Response creativity (0-2)
- `-f, --format`: Output format (xml, json, markdown, plain)
- `-p, --prompt-template`: Response style template
Examples:
```bash
# Simple query
customkb query myproject "How does authentication work?"

# Advanced query with specific model
customkb query myproject "Explain the architecture" \
  --model claude-sonnet-4-5 \
  --format json \
  --prompt-template technical

# Get context only (no LLM response)
customkb query myproject "Find authentication docs" --context-only
```

CustomKB uses INI-style configuration with environment variable overrides.
- Environment variables (highest)
- Configuration file (`.cfg`)
- Default values (lowest)
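To make the precedence concrete, here is a minimal sketch of a lookup that honors it, assuming a hypothetical convention of upper-cased environment variable names; CustomKB's actual key mapping may differ:

```python
# Sketch only: env var (highest), then .cfg file, then built-in default (lowest).
import configparser
import os

DEFAULTS = {"query_top_k": "30"}  # hypothetical built-in defaults

def get_setting(cfg_path: str, key: str) -> str | None:
    env_val = os.environ.get(key.upper())      # 1. environment variable
    if env_val is not None:
        return env_val
    parser = configparser.ConfigParser()
    parser.read(cfg_path)
    if parser.has_option("DEFAULT", key):      # 2. configuration file
        return parser.get("DEFAULT", key)
    return DEFAULTS.get(key)                   # 3. default value
```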
A complete example configuration:

```ini
[DEFAULT]
# Models
vector_model = text-embedding-3-small
query_model = gpt-4o-mini

# Text Processing
db_min_tokens = 200       # Minimum chunk size
db_max_tokens = 400       # Maximum chunk size

# Query Settings
query_max_tokens = 4096   # Max tokens in LLM response
query_top_k = 30          # Number of chunks to retrieve
query_temperature = 0.1   # LLM creativity (0=precise, 2=creative)
query_role = You are a helpful expert assistant.

# Output Format
reference_format = json             # xml, json, markdown, plain
query_prompt_template = technical   # Response style

[ALGORITHMS]
# Search Configuration
similarity_threshold = 0.6    # Minimum similarity score (0-1)
enable_hybrid_search = true   # Combine vector + keyword search
bm25_weight = 0.5             # Weight for BM25 in hybrid mode
bm25_max_results = 1000       # Max results from BM25

# Reranking
enable_reranking = true       # Use cross-encoder for precision
reranking_model = cross-encoder/ms-marco-MiniLM-L-6-v2
reranking_top_k = 30          # Rerank top N results

[PERFORMANCE]
# Optimization
embedding_batch_size = 100    # Chunks per batch
cache_thread_pool_size = 4    # Concurrent cache operations
memory_cache_size = 10000     # LRU cache entries
checkpoint_interval = 10      # Save every N batches

[API]
# Rate Limiting
api_call_delay_seconds = 0.05  # Delay between API calls
api_max_concurrency = 8        # Parallel API requests
api_max_retries = 20           # Retry attempts for failed calls
```

Key settings:

- `db_min_tokens`/`db_max_tokens`: Control chunk size. Smaller = more precise, larger = more context
- `similarity_threshold`: Lower (0.5) for broader results, higher (0.7) for strict relevance
- `enable_hybrid_search`: Enable for technical docs, disable for narrative content
- `query_temperature`: 0.0-0.3 for factual, 0.7-1.0 for creative responses
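As a rough illustration of what the chunk-size settings govern, here is a greedy token-window chunker using tiktoken. CustomKB's real chunker is context-aware, so treat this only as a sketch of the token-counting mechanics:

```python
# Sketch: split text into windows of at most max_tokens tokens.
import tiktoken

def chunk_text(text: str, max_tokens: int = 400):
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    for start in range(0, len(tokens), max_tokens):
        yield enc.decode(tokens[start:start + max_tokens])
```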
OpenAI
- GPT-4o, GPT-4o-mini (128k context)
- GPT-4.1, GPT-4.1-mini, GPT-4.1-nano (1M context)
- o3, o3-mini, o3-pro (advanced reasoning)
- o4-mini (multimodal reasoning)
Anthropic
- Claude Sonnet 4.5, Haiku 4.5 (200k context, extended thinking)
- Claude Opus 4.1, Sonnet 4.0 (200k context)
Google
- Gemini 2.5 Pro/Flash/Lite (thinking models, 1M+ context)
- Gemini 1.5 Pro/Flash-8B
xAI
- Grok 4.0, Grok 4.0-fast (256k-2M context, reasoning)
Local (Ollama)
- Llama 3.3 (8B-70B)
- Gemma 3 (4B-27B)
- DeepSeek R1, Qwen 2.5, Mistral, Phi-4
OpenAI
- `text-embedding-3-large` (3072 dims, best quality)
- `text-embedding-3-small` (1536 dims, cost-effective)
- `text-embedding-ada-002` (1536 dims, legacy)
Google
- `gemini-embedding-001` (768/1536/3072 dims)
- 68% MTEB score vs 64.6% for OpenAI
- 30k token context vs 8k
- Matryoshka Representation Learning
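Matryoshka Representation Learning is what makes the configurable dimensions work: a full embedding can be truncated to a prefix and re-normalized with little quality loss. A minimal sketch of that truncation (check the Gemini documentation for the officially supported sizes):

```python
# Sketch of MRL-style truncation: keep the first k dimensions, re-normalize.
import numpy as np

def truncate_embedding(vec: np.ndarray, dims: int) -> np.ndarray:
    v = vec[:dims]
    return v / np.linalg.norm(v)
```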
Customize response styles:
customkb query myproject "question" --prompt-template <template>Available templates:
- `default`: Balanced, helpful responses
- `instructive`: Step-by-step explanations
- `scholarly`: Academic, citation-rich
- `concise`: Brief, to-the-point
- `analytical`: Deep analysis with reasoning
- `conversational`: Friendly, approachable
- `technical`: Precise, developer-focused
Control how results are formatted:
```bash
# JSON for APIs
customkb query myproject "search" --format json

# XML with structured references
customkb query myproject "search" --format xml

# Markdown for documentation
customkb query myproject "search" --format markdown

# Plain text
customkb query myproject "search" --format plain
```

Organize and filter results by categories:
```bash
# Categorize documents
customkb categorize myproject --import

# Query with category filters
customkb query myproject "query" --categories "Technical,Legal"
```

Process documents in a specific language, or auto-detect:

```bash
# Process with specific language
customkb database myproject docs/*.txt --language french

# Auto-detect languages (recommended for multilingual docs)
customkb database myproject docs/ --detect-language
```

Supported languages: English, French, German, Spanish, Italian, Portuguese, Dutch, Swedish, Norwegian, Danish, Finnish, Russian, Turkish, Arabic, Hebrew, Japanese, Chinese, Korean, and more.
CustomKB implements enterprise-grade security measures to protect your data and systems.
Safe Serialization
- ✓ Zero pickle deserialization vulnerabilities
- ✓ JSON format for reranking cache (human-readable, secure)
- ✓ JSON format for categorization checkpoints
- ✓ NPZ + JSON hybrid for BM25 indexes (efficient + secure)
- ✓ Automatic migration from legacy pickle formats
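The NPZ + JSON pattern is straightforward to picture. Here is a sketch with hypothetical file names and fields, showing numeric arrays in a compressed `.npz` and metadata in human-readable JSON, with no pickle anywhere:

```python
# Sketch: arrays in .npz (no pickle), metadata in JSON.
import json
import numpy as np

def save_index(path_base: str, doc_freqs: np.ndarray, meta: dict) -> None:
    np.savez_compressed(path_base + ".npz", doc_freqs=doc_freqs)
    with open(path_base + ".json", "w") as f:
        json.dump(meta, f)

def load_index(path_base: str):
    arrays = np.load(path_base + ".npz")   # allow_pickle defaults to False
    with open(path_base + ".json") as f:
        return arrays["doc_freqs"], json.load(f)
```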
Injection Prevention
- ✓ SQL injection protection via table name validation
- ✓ Path traversal protection in file operations
- ✓ Input validation for all user-provided parameters
- ✓ Parameterized queries for database operations
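Checks of this kind typically look like the following sketch (illustrative, not CustomKB's actual code): table names restricted to plain identifiers, and file paths resolved and confined to the knowledgebase directory:

```python
# Sketch of identifier and path validation; names are hypothetical.
import os
import re

def safe_table_name(name: str) -> str:
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", name):
        raise ValueError(f"invalid table name: {name!r}")
    return name

def safe_kb_path(base_dir: str, user_path: str) -> str:
    base = os.path.realpath(base_dir)
    resolved = os.path.realpath(os.path.join(base, user_path))
    if os.path.commonpath([base, resolved]) != base:
        raise ValueError("path escapes knowledgebase directory")
    return resolved
```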
API Security
- ✓ API key validation and secure storage
- ✓ Environment variable-based configuration
- ✓ No API keys in logs or error messages
- ✓ Secure credential handling
Data Protection
- ✓ Database integrity checks
- ✓ Atomic operations with rollback support
- ✓ Backup support for critical operations
- ✓ File permission validation
When deploying CustomKB in production:
- API Keys: Store in environment variables, never in code or config files
- File Permissions: Restrict knowledgebase directories to application user only
- Network Access: Run on localhost or behind authentication proxy
- Updates: Regularly check CHANGELOG.md for security patches
- Backups: Enable automatic backups before migrations
If you discover a security vulnerability:
- Do not create a public GitHub issue
- Email security concerns to: [Create issue for security contact]
- Include:
- Steps to reproduce
- Potential impact assessment
- Suggested remediation (if any)
- Allow reasonable time for patching before public disclosure
See CHANGELOG.md for detailed security update history.
```bash
# Analyze system and show recommendations
customkb optimize --analyze

# Apply optimizations automatically
customkb optimize myproject

# Preview changes without applying
customkb optimize myproject --dry-run
```

CustomKB automatically configures based on available memory:
| Memory | Tier | Features | Batch Size | Cache Size |
|---|---|---|---|---|
| <16GB | Low | Conservative, no hybrid search | 50 | 5,000 |
| 16-64GB | Medium | Balanced, moderate caching | 100 | 10,000 |
| 64-128GB | High | Large batches, hybrid search | 200 | 20,000 |
| >128GB | Very High | Maximum performance | 300 | 50,000 |
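A sketch of how tier selection might be derived from total RAM, with thresholds taken from the table above (`psutil` usage is illustrative, not necessarily how CustomKB detects memory):

```python
# Sketch: map total RAM to the documented optimization tiers.
import psutil

def memory_tier() -> str:
    gb = psutil.virtual_memory().total / 2**30
    if gb < 16:
        return "low"        # conservative, no hybrid search
    if gb < 64:
        return "medium"     # balanced, moderate caching
    if gb < 128:
        return "high"       # large batches, hybrid search
    return "very_high"      # maximum performance
```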
```bash
# Verify performance indexes
customkb verify-indexes myproject

# Build BM25 hybrid search index
customkb bm25 myproject
```

CustomKB automatically detects and uses NVIDIA GPUs for:
- Cross-encoder reranking (20-40% faster)
- FAISS index search (GPU-enabled builds)
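For reference, this is roughly what GPU-backed reranking looks like with the `sentence-transformers` library and the `reranking_model` from the configuration example; device selection here is manual, whereas CustomKB detects the GPU automatically:

```python
# Sketch: score (query, document) pairs with a cross-encoder on GPU.
from sentence_transformers import CrossEncoder

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2", device="cuda")
pairs = [("what is hybrid search?",
          "Hybrid search combines vector similarity with BM25 matching.")]
scores = model.predict(pairs)   # higher score = more relevant
```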
```bash
# Benchmark GPU vs CPU performance
./scripts/benchmark_gpu.py

# Monitor GPU usage during operations
./scripts/gpu_monitor.sh
```

"Knowledgebase not found"
```bash
# Verify KB exists
ls -la $VECTORDBS/

# Check for .cfg file
ls -la $VECTORDBS/myproject/myproject.cfg

# Error message shows available KBs
customkb query nonexistent "test"
```

"API rate limit exceeded"
```ini
# Increase delay between calls in config
api_call_delay_seconds = 0.1
api_max_concurrency = 4
```

"Out of memory during embedding"
```bash
# Run optimizer for your system
customkb optimize myproject
```

Or manually reduce the batch size in config:

```ini
embedding_batch_size = 50
```

"Low similarity scores" or poor results
```ini
# Try a lower threshold
similarity_threshold = 0.5

# Enable hybrid search
enable_hybrid_search = true

# Or use a stronger embedding model
vector_model = text-embedding-3-large
```

"Import failed: unsupported file type"
```bash
# CustomKB supports: .md, .txt, .html, .pdf
# Convert other formats to supported types first
# For code files, use .txt extension or markdown fenced blocks
```

For general debugging:

```bash
# Enable verbose logging
customkb query myproject "test" -v

# Check detailed logs
tail -f $VECTORDBS/myproject/logs/myproject.log

# Run diagnostics
./scripts/diagnose_crashes.py myproject
```

All knowledgebases live in `$VECTORDBS`:
```
$VECTORDBS/
├── myproject/
│   ├── myproject.cfg     # Configuration (required)
│   ├── myproject.db      # SQLite database with chunks
│   ├── myproject.faiss   # FAISS vector index
│   ├── myproject.bm25    # BM25 index (optional, for hybrid search)
│   ├── .rerank_cache/    # Reranking cache (optional)
│   └── logs/             # Runtime logs
```

The system intelligently resolves KB names:
```bash
# All resolve to the same KB:
customkb query myproject "test"
customkb query myproject.cfg "test"
customkb query $VECTORDBS/myproject "test"
customkb query $VECTORDBS/myproject/myproject.cfg "test"
# → All use $VECTORDBS/myproject/myproject.cfg
```
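The resolution behavior can be sketched as below (illustrative only; the real implementation may handle more cases):

```python
# Sketch: collapse all four accepted spellings to one canonical .cfg path.
import os

def resolve_kb(name: str) -> str:
    base = os.environ["VECTORDBS"]
    name = os.path.basename(name)        # strip any directory component
    if name.endswith(".cfg"):
        name = name[:-4]
    return os.path.join(base, name, f"{name}.cfg")
```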
Located in the `scripts/` directory:

- `show_optimization_tiers.py` - Display memory tier settings
- `emergency_optimize.py` - Conservative recovery settings
- `clean_corrupted_cache.py` - Clean corrupted cache files
- `benchmark_gpu.py` - Compare GPU vs CPU performance
- `gpu_monitor.sh` - Real-time GPU utilization monitoring
- `gpu_env.sh` - GPU environment setup
- `rebuild_bm25_filtered.py` - Rebuild BM25 indexes with filters
- `upgrade_bm25_tokens.py` - Upgrade database for BM25 tokens
- `diagnose_crashes.py` - Analyze crash logs and system state
- `update_dependencies.py` - Check and update Python dependencies
- `security-check.sh` - Run security validation checks
```bash
# Install test dependencies
pip install -r requirements-test.txt

# Run all tests
python run_tests.py

# Run specific test suites
python run_tests.py --unit         # Unit tests only
python run_tests.py --integration  # Integration tests only

# Run with safety limits (recommended for CI)
python run_tests.py --safe --memory-limit 2048

# Generate coverage report
python run_tests.py --coverage
```

Q: Can I use CustomKB without any API keys?
A: Yes! Use local Ollama models for both embeddings and queries. No external API calls required. Performance depends on your local hardware.
Q: How much does it cost to process documents?
A: Costs vary by provider and model:
- OpenAI `text-embedding-3-small`: $0.02 per 1M tokens (~750k words)
- Google `gemini-embedding-001`: $0.15 per 1M tokens
- Local Ollama models: Free (just electricity)
Example: A 500-page technical manual (~250k tokens) costs about $0.005 to embed with OpenAI.
Q: Is my data private and secure?
A: Your documents stay local. Only text chunks are sent to API providers during embedding and query operations. The full document contents never leave your system. For maximum privacy, use local Ollama models.
Q: What's the difference between CustomKB and vector databases like Pinecone?
A: CustomKB is a complete RAG (Retrieval-Augmented Generation) system including:
- Document processing pipeline
- Embedding generation
- Vector + hybrid search
- LLM integration
- Response generation
Vector databases only handle storage and retrieval. You'd need to build the rest yourself.
Q: Can I use multiple embedding models in one knowledgebase?
A: No, each knowledgebase uses one embedding model. To switch models, create a new KB or regenerate embeddings with `--reset-database`.
Q: How do I update my knowledgebase when documents change?
A: Re-run the database command with updated files:
```bash
customkb database myproject docs/*.md --force
customkb embed myproject
```

Without `--force`, only changed or new files are reprocessed.
Q: What's the maximum knowledgebase size?
A: Tested up to 10M+ chunks (~4GB database). FAISS scales to billions of vectors. Practical limits depend on your RAM and disk space.
Q: Can I run CustomKB in a Docker container?
A: Yes, though no official Docker image yet. Use a Python 3.12+ base image and install dependencies. Mount your $VECTORDBS directory as a volume.
Q: Does CustomKB support real-time document monitoring?
A: Not yet. You manually trigger document processing. Consider using filesystem watchers (inotify) to trigger updates automatically.
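A minimal way to approximate that today is a filesystem watcher that shells out to the CLI. The sketch below uses the third-party `watchdog` package; the watched path, file filter, and lack of debouncing are assumptions for illustration:

```python
# Sketch: re-run database/embed when a Markdown file changes.
import subprocess
import time
from watchdog.events import FileSystemEventHandler
from watchdog.observers import Observer

class Reindex(FileSystemEventHandler):
    def on_modified(self, event):
        if event.src_path.endswith(".md"):
            subprocess.run(["customkb", "database", "myproject",
                            event.src_path, "--force"])
            subprocess.run(["customkb", "embed", "myproject"])

observer = Observer()
observer.schedule(Reindex(), path="docs/", recursive=True)
observer.start()
try:
    while True:
        time.sleep(1)
finally:
    observer.stop()
    observer.join()
```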
We welcome contributions from the community! Whether you're fixing bugs, adding features, improving documentation, or sharing ideas, your help makes CustomKB better for everyone.
- Report Bugs: Open an issue
- Suggest Features: Open an issue
- Improve Documentation: Fix typos, clarify instructions, add examples
- Submit Code: Bug fixes, new features, performance improvements
- Share Knowledge: Answer questions, write tutorials, create examples
1. Fork the repository on GitHub

2. Clone your fork

   ```bash
   git clone https://github.com/YOUR-USERNAME/customkb.git
   cd customkb
   ```

3. Create a feature branch

   ```bash
   git checkout -b feature/amazing-feature
   ```

4. Set up development environment

   ```bash
   python -m venv .venv
   source .venv/bin/activate
   pip install -r requirements.txt
   pip install -r requirements-test.txt
   ```

5. Make your changes
   - Write clean, documented code
   - Follow existing code style
   - Add tests for new features
   - Update documentation as needed

6. Run tests

   ```bash
   python run_tests.py
   python run_tests.py --coverage
   ```

7. Commit your changes

   ```bash
   git add .
   git commit -m "Add amazing feature"
   ```

8. Push to your fork

   ```bash
   git push origin feature/amazing-feature
   ```

9. Open a Pull Request
   - Go to the original repository
   - Click "New Pull Request"
   - Select your branch
   - Describe your changes clearly
- Code Style: Follow PEP 8 for Python code
- Type Hints: Use type annotations for function signatures
- Testing: Maintain or improve test coverage
- Documentation: Update README and docstrings
- Commits: Write clear, descriptive commit messages
- Be respectful and inclusive
- Welcome newcomers and different perspectives
- Focus on what's best for the community
- Show empathy towards others
- Join discussions in GitHub Discussions
- Ask questions in issues (label with `question`)
- Review existing PRs to see the process
- Documentation: You're reading it! Check the sections above
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Releases: Watch the repository for release notifications
- Changelog: See CHANGELOG.md for version history
- Security: Check Security section for vulnerability reporting
- GitHub: Open-Technology-Foundation/customkb
- Maintainer: Indonesian Open Technology Foundation
- License: GPL-3.0 (see LICENSE)
Here's a complete workflow for creating a production-ready knowledgebase:
```bash
# 1. Setup environment
export VECTORDBS="$HOME/knowledgebases"
export OPENAI_API_KEY="your-key-here"

# 2. Create KB directory
mkdir -p "$VECTORDBS/techbase"
cd "$VECTORDBS/techbase"

# 3. Create optimized configuration
cat > techbase.cfg << 'EOF'
[DEFAULT]
vector_model = text-embedding-3-small
query_model = gpt-4o-mini
db_min_tokens = 250
db_max_tokens = 500

[ALGORITHMS]
enable_hybrid_search = true
enable_reranking = true
similarity_threshold = 0.65
bm25_weight = 0.5

[PERFORMANCE]
embedding_batch_size = 100
memory_cache_size = 20000
checkpoint_interval = 10
EOF

# 4. Process documents with language detection
customkb database techbase ~/docs/**/*.md --detect-language --verbose

# 5. Generate embeddings with progress
customkb embed techbase --verbose

# 6. Build hybrid search index
customkb bm25 techbase

# 7. Optimize for your system
customkb optimize techbase

# 8. Verify everything is set up correctly
customkb verify-indexes techbase

# 9. Test with sample queries
customkb query techbase "What are the best practices?" \
  --prompt-template technical \
  --format markdown

# 10. Test context-only retrieval
customkb query techbase "authentication implementation" \
  --context-only \
  --top-k 10
```

Environment variables:

```bash
OPENAI_API_KEY      # OpenAI API key
ANTHROPIC_API_KEY   # Anthropic API key
GOOGLE_API_KEY      # Google/Gemini API key
XAI_API_KEY         # xAI API key
VECTORDBS           # Knowledgebase base directory
NLTK_DATA           # NLTK data location (optional)
```

Model quick reference:

```
# Embedding models
text-embedding-3-small  → OpenAI small (1536 dims)
text-embedding-3-large  → OpenAI large (3072 dims)
gemini-embedding-001    → Google Gemini (configurable dims)

# LLM models
gpt-4o            → OpenAI GPT-4 Omni
gpt-4o-mini       → OpenAI GPT-4 Omni Mini (cost-effective)
claude-sonnet-4-5 → Anthropic Claude Sonnet 4.5
gemini-2.5-flash  → Google Gemini 2.5 Flash
grok-4            → xAI Grok 4
```

Performance tips:

- Large datasets: Increase `embedding_batch_size` up to system limits
- Technical content: Enable `enable_hybrid_search = true`
- GPU available: Install the FAISS GPU variant for 2-4x speedup
- Low memory: Run `customkb optimize` to adjust for your system
- Better accuracy: Enable reranking, lower the similarity threshold
- Faster queries: Increase cache size, disable reranking for speed
GPL-3.0 License - see LICENSE file for details.
Copyright © 2024 Indonesian Open Technology Foundation
Actively maintained by the Indonesian Open Technology Foundation