AI Documentation Vector Database Hybrid Scraper

Python 3.11+ · Vector DB: Qdrant · License: MIT

AI-focused documentation ingestion and retrieval stack that combines Firecrawl- and Crawl4AI-powered scraping with a Qdrant vector database. The project exposes both FastAPI and MCP interfaces, offers mode-aware configuration (solo-developer vs. enterprise feature sets), and ships with tooling for embeddings, hybrid search, retrieval-augmented generation (RAG) workflows, and operational monitoring.

Overview

The system ingests documentation sources, generates dense and sparse embeddings, stores them in Qdrant, and serves hybrid search and RAG building blocks. It is built for AI engineers who need reliable documentation ingestion pipelines, reproducible retrieval quality, and integration points for agents or applications.

Highlights

  • Multi-tier crawling orchestration (src/services/browser/unified_manager.py) covering lightweight HTTP, Crawl4AI, browser-use, Playwright, and Firecrawl, plus a resumable bulk embedder CLI (src/crawl4ai_bulk_embedder.py).
  • Hybrid retrieval stack with dense embeddings from OpenAI or LangChain's FastEmbed wrapper, sparse BM25 signals via FastEmbedSparse, reranking, and HyDE augmentation, all served through the modular Qdrant service (src/services/vector_db/ and src/services/hyde/).
  • Dual interfaces: REST endpoints in FastAPI (src/api/routers/v1/) and a FastMCP server (src/unified_mcp_server.py) that registers search, document management, analytics, and content intelligence tools for Claude Desktop / Code.
  • Built-in API hardening with SlowAPI-powered global rate limiting configured through the SecurityConfig model and middleware stack.
  • Observability built in: OpenTelemetry tracing, structured logging, health checks, optional Dragonfly cache, and /metrics exposure via prometheus-fastapi-instrumentator (src/services/observability/).
  • Developer ergonomics with uv-managed environments, dependency-injector driven service wiring, Ruff + pytest quality gates, and a unified developer CLI (scripts/dev.py).


Architecture

```mermaid
flowchart LR
    subgraph clients["Clients"]
        mcp["Claude Desktop / MCP"]
        rest["REST / CLI clients"]
    end
    subgraph mcp_server["FastMCP server"]
        registry["Tool registry (register_all_tools)"]
    end
    subgraph api["FastAPI application"]
        router["Mode-aware routers"]
        factory["Service factory"]
    end
    subgraph processing["Processing layer"]
        crawl["Unified crawling manager"]
        embed["Embedding manager"]
        search["Hybrid retrieval"]
    end
    subgraph data["Storage & caching"]
        qdrant[("Qdrant vector DB")]
        redis[("Redis / Dragonfly cache")]
        storage["Local docs & artifacts"]
    end
    subgraph observability["Observability"]
        metrics["Prometheus exporter"]
        health["Health & diagnostics"]
    end
    mcp --> registry
    registry --> processing
    rest --> api
    api --> processing
    processing --> crawl
    processing --> embed
    processing --> search
    crawl --> firecrawl["Firecrawl API"]
    crawl --> crawl4ai["Crawl4AI"]
    crawl --> browseruse["browser-use / Playwright"]
    embed --> openai["OpenAI"]
    embed --> fastembed["FastEmbed / FlagEmbedding"]
    search --> qdrant
    processing --> redis
    api --> metrics
    metrics --> observability
    processing --> health
    health --> observability
```

The FastMCP server now registers tool modules directly through register_all_tools. Dependencies are resolved once from the shared ApplicationContainer and provided to the tool registrars without an intermediate services layer, keeping the runtime surface minimal and maintainable.

The observability surface relies entirely on the OpenTelemetry and Prometheus stack: FastAPI and MCP emit spans and metrics that flow into the collector, with Grafana dashboards consuming those native feeds. No bespoke analytics dashboards or visualization engines ship with the application.

Core Components

Infrastructure Orchestration

  • ApplicationContainer (src/infrastructure/container.py) is the single source of truth for wiring clients (OpenAI, Qdrant, Redis, Firecrawl), caches, vector storage, crawling, embeddings, monitoring, and RAG helpers. Runtime surfaces pull dependencies from the container using dependency-injector providers rather than instantiating bespoke managers (see the sketch after this list).
  • src/infrastructure/bootstrap.py exposes ensure_container and container_session helpers so FastAPI lifespans, the unified MCP server, CLI utilities, and evaluation scripts share identical lifecycle management without reimplementing startup/shutdown logic.
  • Service initialization is coordinated through container lifecycle hooks with deterministic startup/shutdown ordering, ensuring shared resources (HTTP sessions, vector stores, MCP sessions, monitoring tasks) are initialised once and cleaned up safely.
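
The container-first wiring can be sketched with dependency-injector in a few lines (illustrative only; the real ApplicationContainer registers far more providers, and the names here are assumptions):

```python
# Hypothetical, minimal container in the spirit of ApplicationContainer;
# provider and config names are illustrative assumptions.
from dependency_injector import containers, providers
from qdrant_client import AsyncQdrantClient


class Container(containers.DeclarativeContainer):
    config = providers.Configuration()

    # Singleton so every surface (FastAPI, MCP, CLI) shares one client.
    qdrant_client = providers.Singleton(AsyncQdrantClient, url=config.qdrant.url)


container = Container()
container.config.qdrant.url.from_env("AI_DOCS__QDRANT__URL", "http://localhost:6333")
client = container.qdrant_client()  # resolved once, reused everywhere
```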

Crawling & Ingestion

  • AutomationRouter centralizes tier analysis while UnifiedBrowserManager delegates multi-tier scraping and tracks quality metrics (a fallback sketch follows this list).
  • Firecrawl and Crawl4AI adapters plus browser-use / Playwright integrations cover static and dynamic sites.
  • src/crawl4ai_bulk_embedder.py streams bulk ingestion, chunking, and embedding into Qdrant with resumable state and progress reporting.
  • docs/users/web-scraping.md and docs/users/examples-and-recipes.md include tier selection guidance and code samples.
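
The escalation idea behind the tiers can be sketched as a simple fallback loop (illustrative only; AutomationRouter's real heuristics and interfaces differ):

```python
# Illustrative tier-escalation loop; not the project's AutomationRouter API.
from collections.abc import Awaitable, Callable

Fetcher = Callable[[str], Awaitable[str | None]]


async def scrape_with_fallback(url: str, tiers: list[Fetcher]) -> str | None:
    """Try cheap tiers first (plain HTTP), escalating to heavier browsers."""
    for fetch in tiers:
        try:
            content = await fetch(url)
        except Exception:
            continue  # this tier failed outright; escalate to the next
        if content and len(content) > 200:  # crude quality gate
            return content
    return None
```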

LangChain splitter matrix & hybrid retrieval glue

The ingestion stack no longer ships bespoke "basic" or "enhanced" chunkers. Every surface (FastAPI, MCP, CLI) calls the shared chunk_to_documents helper in src/services/document_chunking, which orchestrates a matrix of LangChain text splitters selected from document metadata:

  • Markdown → MarkdownHeaderTextSplitter + recursive refinement for heading-aware segments.
  • HTML → HTMLSemanticPreservingSplitter (or header/section splitters when semantic parsing is disabled) with optional whitespace normalisation.
  • Code → RecursiveCharacterTextSplitter.from_language seeded from inferred file extensions or crawler metadata.
  • JSON → RecursiveJsonSplitter for structured payloads.
  • Token-aware → TokenTextSplitter.from_tiktoken_encoder for strict token budgets.
  • Plain text → RecursiveCharacterTextSplitter using newline/space fallbacks.

All variants share canonical chunk metadata (chunk_id, chunk_index, inferred kind, provenance fields) so downstream services never rely on bespoke schemas. See the LangChain text splitter catalogue for implementation details. [1]
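
For instance, the Markdown branch can be approximated with stock LangChain splitters (a sketch under assumed sizes and header depth, not the project's exact configuration):

```python
# Sketch of the Markdown branch: header-aware splitting plus recursive
# refinement; chunk sizes and header depth are assumptions.
from langchain_text_splitters import (
    MarkdownHeaderTextSplitter,
    RecursiveCharacterTextSplitter,
)

markdown_text = "# Intro\nQdrant stores vectors.\n## Setup\nRun docker compose up."

header_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "h1"), ("##", "h2"), ("###", "h3")]
)
refiner = RecursiveCharacterTextSplitter(chunk_size=1600, chunk_overlap=200)

sections = header_splitter.split_text(markdown_text)  # heading-aware Documents
chunks = refiner.split_documents(sections)            # enforce the size budget
for index, chunk in enumerate(chunks):
    chunk.metadata["chunk_index"] = index             # canonical metadata field
```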

Dense and sparse embeddings are sourced from the LangChain FastEmbed wrappers (FastEmbedEmbeddings for dense vectors, FastEmbedSparse for BM25-compatible representations), allowing hybrid search without vendor SDK drift. [2] VectorStoreService wires both outputs into LangChain's QdrantVectorStore, so ingestion code and documentation examples share a single integration point. [3]

```python
from langchain_community.embeddings import FastEmbedEmbeddings
from langchain_qdrant import FastEmbedSparse, QdrantVectorStore, RetrievalMode

from src.config.models import ChunkingConfig
from src.services.document_chunking import chunk_to_documents, infer_document_kind

# raw_text and metadata come from your crawler output.
config = ChunkingConfig(chunk_size=1600, chunk_overlap=200)
documents = chunk_to_documents(
    raw_text,
    metadata,
    infer_document_kind(metadata, "text"),
    config,
)
store = QdrantVectorStore.from_documents(
    documents,
    embedding=FastEmbedEmbeddings(model_name="BAAI/bge-small-en-v1.5"),
    sparse_embedding=FastEmbedSparse(model_name="Qdrant/bm25"),
    retrieval_mode=RetrievalMode.HYBRID,  # persist and score both modalities
    url="http://localhost:6333",
    collection_name="documentation",
)
```

Toggle dense-only, sparse-only, or hybrid retrieval by setting EmbeddingConfig.retrieval_mode (and the equivalent CLI/MCP options). Hybrid mode persists both vector modalities and enables Qdrant's sparse+dense scoring during search. [4]
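
Once the store above is built in hybrid mode, querying it is a one-liner. A usage sketch:

```python
# Query the hybrid store built above; Qdrant fuses sparse and dense scores.
results = store.similarity_search_with_score("enable hybrid search", k=5)
for doc, score in results:
    print(f"{score:.3f}  {doc.metadata.get('chunk_id')}  {doc.page_content[:60]}")
```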

Vector Search & Retrieval

  • src/services/vector_db/ wraps collection management, hybrid search orchestration, adaptive fusion, and payload indexing (a fusion-query sketch follows this list).
  • Dense embeddings via OpenAI or FastEmbed, optional sparse vectors via SPLADE, and reranking hooks are configurable through Pydantic models (src/config/models.py).
  • HyDE augmentation and caching live under src/services/hyde/, enabling query expansion for RAG pipelines.
  • Search responses return timing, scoring metadata, and diagnostics suitable for observability dashboards.
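
Under the hood, sparse+dense fusion maps onto Qdrant's Query API. A hedged sketch with the raw client (the vector names and placeholder vectors are assumptions; the project's service layer wraps this behind its own interface):

```python
# Hedged sketch of Qdrant-native hybrid fusion via the Query API (RRF).
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

# Placeholder vectors; in practice these come from your dense and sparse
# embedders (e.g., FastEmbed).
dense_vector = [0.1] * 384
sparse_vector = models.SparseVector(indices=[12, 845, 2001], values=[0.8, 0.5, 0.3])

hits = client.query_points(
    collection_name="documentation",
    prefetch=[
        models.Prefetch(query=dense_vector, using="dense", limit=20),
        models.Prefetch(query=sparse_vector, using="sparse", limit=20),
    ],
    query=models.FusionQuery(fusion=models.Fusion.RRF),  # reciprocal rank fusion
    limit=5,
)
```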

Caching

  • The unified CacheManager (src/services/cache/manager.py) fronts the Dragonfly cache, hashing keys and enforcing TTL policies across embeddings, search, HyDE, and browser flows (a simplified sketch follows this list).
  • Specialized helpers such as the HyDE cache depend on CacheManager, ensuring a single entry point for embeddings, search results, and warm-up flows.
  • Configure Dragonfly URLs and TTLs via the cache models in src/config/models.py; the manager automatically wires Prometheus metrics when enabled.
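
The key-hashing and TTL pattern can be approximated in a few lines (illustrative names, not the CacheManager API; Dragonfly speaks the Redis protocol, so a stock Redis client works):

```python
# Simplified get-or-compute cache with hashed keys and per-namespace TTLs;
# function names are illustrative, not the project's CacheManager API.
import hashlib
import json

import redis

client = redis.Redis.from_url("redis://localhost:6379")


def cache_key(namespace: str, payload: dict) -> str:
    # Hash the payload so long queries still yield bounded, collision-safe keys.
    digest = hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()
    return f"{namespace}:{digest}"


def get_or_compute(namespace: str, payload: dict, ttl_seconds: int, compute):
    key = cache_key(namespace, payload)
    cached = client.get(key)
    if cached is not None:
        return json.loads(cached)
    value = compute()
    client.set(key, json.dumps(value), ex=ttl_seconds)  # enforce the TTL policy
    return value
```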

Interfaces & Tooling

  • FastAPI routes (/api/v1/search, /api/v1/documents, /api/v1/collections) expose the core ingestion and retrieval capabilities.
  • The FastMCP server (src/unified_mcp_server.py) registers search, document, embedding, scraping, analytics, cache, and content intelligence tool modules (src/mcp_tools/). MCP tooling expects dependencies to be supplied explicitly: services resolve once from the dependency-injector container during startup and are passed into the tool register_tools() functions rather than through intermediate service wrappers (see the registration sketch after this list).
  • Developer CLI (scripts/dev.py) manages services, testing profiles, benchmarks, linting, and type checking.
  • Example notebooks and scripts under examples/ demonstrate agentic RAG flows and advanced search orchestration.
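
The explicit-dependency registration pattern looks roughly like this (a sketch; the actual registrar signatures in src/mcp_tools/ may differ):

```python
# Illustrative register_tools() shape: dependencies are resolved once at
# startup and passed in explicitly; all names here are assumptions.
from dataclasses import dataclass
from typing import Any

from fastmcp import FastMCP


@dataclass
class SearchDeps:
    vector_service: Any  # resolved from the DI container during startup


def register_tools(mcp: FastMCP, deps: SearchDeps) -> None:
    @mcp.tool()
    async def search_documents(query: str, limit: int = 5) -> list[dict]:
        """Hybrid search over the documentation collection."""
        return await deps.vector_service.search(query, limit=limit)
```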

Observability & Operations

  • /metrics endpoints are exposed through prometheus-fastapi-instrumentator, while OpenTelemetry spans capture embedding, cache, database, and RAG pipeline telemetry; the database manager emits db.query.duration histograms for each session. All analytics are powered by these native observability feeds instead of custom dashboards. See docs/observability/embeddings_telemetry.md and docs/operators/monitoring.md for configuration details.
  • Health probes for system resources, Qdrant, Redis, RAG configuration, and application metadata are centrally coordinated by the HealthCheckManager (src/services/observability/health_manager.py), ensuring MCP tools and FastAPI dependencies share the same health state.
  • A single GET /health endpoint on the FastAPI and FastMCP servers reports the aggregated system status; per-service health endpoints have been removed.
  • Optional Dragonfly cache, PostgreSQL, ARQ workers, and Grafana dashboards are provisioned via docker-compose.yml profiles.
  • Structured logging and SlowAPI-based rate limiting are configured through the middleware manager (src/services/fastapi/middleware/manager.py) and security helpers (src/services/fastapi/middleware/security.py).

AI Telemetry Quickstart

  1. Enable OTLP export (set in .env or deployment secrets):
    ```bash
    export OTEL_EXPORTER_OTLP_ENDPOINT="http://localhost:4317"
    export OTEL_EXPORTER_OTLP_HEADERS="Authorization=Bearer <token>"
    ```
  2. Run an OpenTelemetry Collector that forwards spans/metrics to your backend:
    ```yaml
    receivers:
      otlp:
        protocols:
          grpc:
          http:
    processors:
      batch: {}
    exporters:
      prometheus:
        endpoint: "0.0.0.0:9464"
      otlp:
        endpoint: "https://observability.example.com:4317"
        headers:
          authorization: "Bearer ${OBSERVABILITY_API_TOKEN}"
    service:
      pipelines:
        metrics:
          receivers: [otlp]
          processors: [batch]
          exporters: [prometheus, otlp]
        traces:
          receivers: [otlp]
          processors: [batch]
          exporters: [otlp]
    ```
  3. Scrape /metrics with Prometheus (the collector exposes the OTLP pipeline to Prometheus in the example above). Helpful PromQL snippets:
    ```promql
    sum by (model) (rate(ai_operation_tokens{operation="embedding"}[5m]))
    sum by (operation) (increase(ai_operation_cost[1h]))
    histogram_quantile(
      0.95,
      sum by (le) (rate(ai_operation_duration_bucket{operation="embedding"}[5m]))
    )
    ```
  4. Trace dashboards can group spans with gen_ai.operation.name, gen_ai.request.model, and gen_ai.usage.* attributes to visualize synchronous embeds versus asynchronous batch jobs.
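
A minimal span-emission sketch using those attributes (values are illustrative and assume the OTLP exporter from step 1 is configured):

```python
# Emit a span carrying the gen_ai.* attributes referenced in step 4.
from opentelemetry import trace

tracer = trace.get_tracer("ai_docs.embeddings")
with tracer.start_as_current_span("embed_documents") as span:
    span.set_attribute("gen_ai.operation.name", "embedding")
    span.set_attribute("gen_ai.request.model", "text-embedding-3-small")
    span.set_attribute("gen_ai.usage.input_tokens", 512)
```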

Security & Validation

  • src/security/ml_security.py provides the consolidated MLSecurityValidator for URL, query, and filename validation alongside dependency and container scanning hooks.
  • FastAPI and MCP flows use the shared validator via dependency helpers, ensuring a single source of truth for sanitization and auditing logic.
  • Rate limiting defaults (default_rate_limit, rate_limit_window, optional Redis storage) are controlled through SecurityConfig and applied via the global SlowAPI limiter.
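
The global limiter SlowAPI provides looks roughly like this ("100/minute" is an assumed default; the project derives its values from SecurityConfig):

```python
# Minimal SlowAPI global rate limit; the limit string is an assumption, the
# project reads its equivalent from SecurityConfig.default_rate_limit.
from fastapi import FastAPI
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.errors import RateLimitExceeded
from slowapi.middleware import SlowAPIMiddleware
from slowapi.util import get_remote_address

limiter = Limiter(key_func=get_remote_address, default_limits=["100/minute"])

app = FastAPI()
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)
app.add_middleware(SlowAPIMiddleware)  # applies the default limit globally
```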

Quick Start

Prerequisites

  • Python 3.11 (or 3.12) and uv for dependency management.
  • A running Qdrant instance (e.g., via Docker: docker compose --profile simple up -d qdrant).
  • API keys for the providers you plan to use (e.g., OPENAI_API_KEY, AI_DOCS__FIRECRAWL__API_KEY).

Environment variables

| Variable | Purpose | Example |
| --- | --- | --- |
| AI_DOCS__QDRANT__URL | Points services at your Qdrant instance. | http://localhost:6333 |
| OPENAI_API_KEY | Enables OpenAI embeddings and HyDE prompts. | sk-... |
| AI_DOCS__FIRECRAWL__API_KEY | Authenticates Firecrawl API usage. | fc-... |
| AI_DOCS__CACHE__REDIS_URL | Enables Dragonfly/Redis caching layers. | redis://localhost:6379 |
| AI_DOCS__ENABLE_ADVANCED_MONITORING | Toggles advanced monitoring dashboards. | true |
| AI_DOCS__ENABLE_DEPLOYMENT_FEATURES | Enables deployment automation endpoints. | true |
| AI_DOCS__ENABLE_AB_TESTING | Enables experimentation helpers. | false |
| FASTMCP_TRANSPORT | Chooses MCP transport (streamable-http or stdio). | streamable-http |
| FASTMCP_HOST / FASTMCP_PORT | Hostname and port for MCP HTTP transport. | 0.0.0.0 / 8001 |
| FASTMCP_BUFFER_SIZE | Tunes MCP stream buffer size (bytes). | 8192 |

Store secrets in a .env file or your secrets manager and export them before running the services.

Clone & Install

```bash
git clone https://github.com/BjornMelin/ai-docs-vector-db-hybrid-scraper
cd ai-docs-vector-db-hybrid-scraper
uv sync --dev
```

Run the FastAPI application

```bash
# Ensure Qdrant is reachable at http://localhost:6333
export OPENAI_API_KEY="sk-..."               # optional if using OpenAI
export AI_DOCS__FIRECRAWL__API_KEY="fc-..."  # optional but recommended
uv run python -m src.api.main
```

Visit http://localhost:8000/docs for interactive OpenAPI docs. Feature flags such as AI_DOCS__ENABLE_ADVANCED_MONITORING=true adjust optional services without switching application modes.

Search API payloads

All search endpoints accept the canonical SearchRequest body. Minimal example:

{ "query": "vector databases", "limit": 5 }

Responses are emitted as SearchResponse payloads containing canonical SearchRecord entries. Example response payload:

{ "query": "vector databases", "total_results": 2, "processing_time_ms": 12.4, "records": [ { "id": "doc-1", "content": "Install Qdrant with Docker...", "score": 0.91, "collection": "documentation" }, { "id": "doc-2", "content": "Manage hybrid sparse+dense search pipelines...", "score": 0.88, "collection": "documentation" } ] }

Run the MCP server

```bash
uv run python src/unified_mcp_server.py
```

The server validates configuration on startup and registers the available MCP tools. Configure Claude Desktop / Code with the generated transport details (see config/claude-mcp-config.example.json).

  1. Copy config/claude-mcp-config.example.json to your Claude settings directory and update the command field if you use a virtual environment wrapper.
  2. If you prefer HTTP transport, export FASTMCP_TRANSPORT=streamable-http and set FASTMCP_HOST/FASTMCP_PORT to match the values referenced in the Claude config.
  3. Restart Claude Desktop / Code so it reloads the MCP manifest and tool list.

Bulk ingestion CLI

```bash
uv run python src/crawl4ai_bulk_embedder.py --help
```

Use CSV/JSON/TXT URL lists to scrape, chunk, embed, and upsert into Qdrant with resumable checkpoints.

Docker Compose

  • Simple profile (API + Qdrant): docker compose --profile simple up -d
  • Enterprise profile (adds Dragonfly, PostgreSQL, worker, Prometheus, Grafana): docker compose --profile enterprise up -d

Stop with docker compose down when finished.

Configuration

  • Configuration is defined with Pydantic models in src/config/models.py and can be overridden via environment variables (AI_DOCS__*) or YAML files in config/templates/ (an override sketch follows this list).
  • Mode-aware settings enable or disable services such as advanced caching, A/B testing, and observability.
  • Detailed configuration guidance lives in docs/developers/setup-and-configuration.md and operator runbooks under docs/operators/.
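
The AI_DOCS__* overrides follow the standard pydantic-settings nested-delimiter pattern. A self-contained sketch (field names simplified from src/config/models.py):

```python
# Sketch of env-driven overrides: AI_DOCS__QDRANT__URL maps onto the nested
# model via the "__" delimiter; the real models define many more fields.
from pydantic import BaseModel
from pydantic_settings import BaseSettings, SettingsConfigDict


class QdrantConfig(BaseModel):
    url: str = "http://localhost:6333"


class Settings(BaseSettings):
    model_config = SettingsConfigDict(
        env_prefix="AI_DOCS__", env_nested_delimiter="__"
    )

    qdrant: QdrantConfig = QdrantConfig()


settings = Settings()  # AI_DOCS__QDRANT__URL, if set, overrides the default
```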

Testing & Quality

```bash
# Quick unit + fast integration tests
python scripts/dev.py test --profile quick

# Full suite with coverage (mirrors CI)
python scripts/dev.py test --profile ci

# Lint, format, type-check, and tests in one pass
python scripts/dev.py quality
```

Performance and benchmark suites are available via python scripts/dev.py benchmark, and chaos-focused stress suites live under tests/ with dedicated markers.

Documentation & Resources

  • User guides: docs/users/ (quick start, search, scraping recipes, troubleshooting).
  • Developer deep dives: docs/developers/ (API reference, integration, architecture).
  • Operator handbook: docs/operators/ (deployment, monitoring, security).
  • Research notes and experiments: docs/research/.

Publishable MkDocs output is generated under site/ when running the documentation pipeline.

Contributing

Contributions are welcome. Read the CONTRIBUTING.md guide for development workflow, coding standards, and review expectations. Please include tests and documentation updates with feature changes. If this stack accelerates your RAG pipelines, consider starring the repository so other developers can discover it.

License

Released under the MIT License.

Footnotes

  [1] LangChain text splitter reference – https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/

  [2] FastEmbed dense and sparse configuration guide – https://qdrant.tech/documentation/fastembed/

  [3] LangChain Qdrant vector store integration – https://python.langchain.com/docs/integrations/vectorstores/qdrant/

  [4] Qdrant hybrid sparse+dense search overview – https://qdrant.tech/articles/hybrid-search/
