RAGnificent combines Python and Rust components to scrape websites and convert HTML content to markdown, JSON, or XML formats. It supports sitemap parsing, semantic chunking for RAG (Retrieval-Augmented Generation), and includes performance optimizations through Rust integration.
Key features include HTML-to-markdown/JSON/XML conversion with support for various elements, intelligent content chunking that preserves document structure, and systematic content discovery through sitemap parsing. The hybrid architecture uses Python for high-level operations and Rust for performance-critical tasks.
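To make "chunking that preserves document structure" concrete, here is a minimal sketch of heading-aware chunking. It is an illustration only, not RAGnificent's actual implementation: the `Chunk` class, the `chunk_by_headings` function, and the `max_chars` parameter are hypothetical names chosen for this example.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    heading_path: list[str]  # trail of headings above this text, outermost first
    text: str


def chunk_by_headings(markdown: str, max_chars: int = 1000) -> list[Chunk]:
    """Split markdown at ATX headings (hypothetical helper, not the library API).

    Each chunk remembers the headings it sits under, so retrieval can show
    where a passage came from. Sections longer than max_chars are split again
    on blank lines. Caveat: '#' lines inside fenced code are treated as
    headings by this simple sketch.
    """
    heading_re = re.compile(r"^(#{1,6})\s+(.+?)\s*$")
    chunks: list[Chunk] = []
    path: list[str] = []  # current heading trail
    body: list[str] = []  # lines collected under the current heading

    def flush() -> None:
        text = "\n".join(body).strip()
        body.clear()
        if not text:
            return
        piece = ""
        for para in re.split(r"\n\s*\n", text):
            if piece and len(piece) + len(para) + 2 > max_chars:
                chunks.append(Chunk(list(path), piece))
                piece = para
            else:
                piece = f"{piece}\n\n{para}" if piece else para
        chunks.append(Chunk(list(path), piece))

    for line in markdown.splitlines():
        match = heading_re.match(line)
        if match:
            flush()  # close the previous section before the trail changes
            level = len(match.group(1))
            del path[level - 1:]  # drop headings at this level or deeper
            path.append(match.group(2))
        else:
            body.append(line)
    flush()
    return chunks


if __name__ == "__main__":
    doc = "# Guide\nIntro.\n\n## Install\nRun it.\n\n## Use\nCall it."
    for chunk in chunk_by_headings(doc):
        print(" > ".join(chunk.heading_path), "|", chunk.text)
```

Carrying the heading trail with every chunk is one common way to keep context attached to a passage after it has been separated from the rest of the page.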
Check out the deepwiki for a granular breakdown of the repository contents, purpose and structure.
- Features - Feature overview and capabilities
- Configuration - Configuration management and environment setup
- Optimization - Performance tuning and optimization guide
```bash
git clone https://github.com/krljakob/RAGnificent.git
cd RAGnificent

# Quick setup
./build_all.sh  # Unix/macOS
# or: .\build_all.ps1  # Windows

# Manual setup
uv venv && export PATH=".venv/bin:$PATH"
uv pip install -r requirements.txt && uv pip install -e .
pytest
```

```bash
# Basic conversion
python -m RAGnificent https://example.com -o output.md

# With RAG chunking
python -m RAGnificent https://example.com --save-chunks --chunk-dir chunks

# Multiple formats and parallel processing
python -m RAGnificent --links-file urls.txt --parallel -f json
```

```python
# Python API
from RAGnificent.core.scraper import MarkdownScraper

scraper = MarkdownScraper()
html = scraper.scrape_website("https://example.com")
markdown = scraper.convert_to_markdown(html, "https://example.com")
chunks = scraper.create_chunks(markdown, "https://example.com")
```

```bash
# Run all tests (including benchmarks)
pytest

# Fast development testing (recommended)
./run_tests.sh fast  # ~15 seconds
# or
pytest -m "not benchmark and not slow"

# Run specific test categories
./run_tests.sh unit         # Unit tests only
./run_tests.sh integration  # Integration tests
./run_tests.sh benchmark    # Performance benchmarks
./run_tests.sh profile      # Show slowest tests

# Run specific test files
pytest tests/unit/test_chunk_utils.py -v
pytest tests/rust/test_python_bindings.py -v
```

Test Performance: Tests are organized by speed - fast unit tests run in ~15 seconds, while full suite including benchmarks takes ~22 seconds. Benchmarks are skipped by default for rapid development cycles.
Current Status: 48 tests with comprehensive coverage across core functionality.
- `RAGnificent/`: Main Python package
  - `core/`: Core functionality (scraper, cache, config, etc.)
  - `rag/`: RAG-specific components (embedding, vector store, search)
  - `utils/`: Utility modules (chunking, sitemap parsing)
- `src/`: Rust source code for performance-critical operations
- `tests/`: Comprehensive test suite
- `examples/`: Demo scripts and usage examples
- `docs/`: Detailed documentation
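As a rough picture of what sitemap parsing involves, the sketch below reads `<loc>` entries from a sitemap document using only the standard library. It follows the public sitemaps.org XML format; the function name `parse_sitemap` is hypothetical, and the project's own sitemap utilities may be structured differently.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def parse_sitemap(xml_text: str) -> tuple[bool, list[str]]:
    """Return (is_index, urls) for a sitemap document (hypothetical helper).

    is_index is True for a <sitemapindex>, whose URLs point at further
    sitemaps, and False for a <urlset>, whose URLs are the pages themselves.
    """
    root = ET.fromstring(xml_text)
    urls = [
        loc.text.strip()
        for loc in root.iter(f"{{{SITEMAP_NS}}}loc")
        if loc.text
    ]
    is_index = root.tag == f"{{{SITEMAP_NS}}}sitemapindex"
    return is_index, urls


if __name__ == "__main__":
    sample = (
        f'<urlset xmlns="{SITEMAP_NS}">'
        "<url><loc>https://example.com/a</loc></url>"
        "<url><loc>https://example.com/b</loc></url>"
        "</urlset>"
    )
    is_index, urls = parse_sitemap(sample)
    print(is_index, urls)
```

A crawler would typically fetch each URL of an index recursively and pass the page URLs from each `<urlset>` on to the scraper.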
```bash
cargo bench
python scripts/visualize_benchmarks.py
```

- Fork the repository
- Create your feature branch (`git checkout -b feature/fix-rate-limiting`)
- Commit your changes (`git commit -m 'Fix rate limiting edge case'`)
- Push to the branch (`git push origin feature/fix-rate-limiting`)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
krljakob