Super-parallel corpus crawler for multilingual NLP and Computational Linguistics research
A type-safe ReScript implementation for building massive parallel corpora across 1500+ languages from Bible translation sources.
- 1500+ Languages: Crawl parallel texts from multiple Bible corpus sources
- Type-Safe: Built with ReScript for compile-time correctness
- Proof-Verified: Echidna integration for mathematical verification
- RSR Gold Compliant: Follows Rhodium Standard Repository specifications
- Semantic Grounding: OpenCyc integration for common-sense reasoning
- Container-Ready: Podman-native (no Docker required)
Installation:

    # Clone the repository
    git clone https://github.com/Hyperpolymath/1000Langs.git
    cd 1000Langs

    # Using Nix (recommended)
    nix develop

    # Or using npm directly
    npm install

Project structure:

    1000Langs/
    ├── src/
    │   ├── Lang1000.res              # Main entry point
    │   ├── crawlers/                 # Web crawler implementations
    │   │   ├── Crawler.res           # Base crawler module
    │   │   ├── BibleCloud.res        # bible.cloud crawler
    │   │   ├── BibleCom.res          # bible.com crawler
    │   │   └── PngScriptures.res     # pngscriptures.org crawler
    │   ├── api/                      # API client wrappers
    │   │   └── DigitalBiblePlatform.res
    │   ├── corpus/                   # Corpus management
    │   │   └── Alignment.res         # Parallel text alignment
    │   ├── utils/                    # Utility modules
    │   │   ├── Iso639.res            # Language code handling
    │   │   └── Statistics.res        # Statistical functions
    │   ├── proofs/                   # Mathematical proofs
    │   └── cyc/                      # OpenCyc integration
    ├── test/                         # Test suites
    ├── proofs/                       # Echidna proof files
    ├── config/                       # Nickel configuration
    ├── meta/                         # Reference data
    ├── .well-known/                  # Discovery files
    ├── justfile                      # Task automation
    ├── flake.nix                     # Nix development environment
    ├── Containerfile                 # Podman container definition
    └── rescript.json                 # ReScript configuration

| Source | Type | Languages |
|---|---|---|
| Bible Cloud | API | 1500+ |
| Bible.com | Scraper | 2000+ |
| PNG Scriptures | Download | 800+ |
| eBible | Download | 1000+ |
| Find.Bible | API | 1200+ |
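To make the idea of a parallel corpus built from these sources concrete, here is a minimal sketch in Python. Python is used for illustration only: the project itself is written in ReScript, and this is not the logic in `Alignment.res`. The sketch assumes each source yields verse texts keyed by a hypothetical `(book, chapter, verse)` reference under an ISO 639-3 language code. The function names `align_by_verse` and `fully_parallel` and the placeholder texts are assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical verse reference: (book code, chapter, verse).
VerseRef = tuple[str, int, int]


def align_by_verse(
    translations: dict[str, dict[VerseRef, str]],
) -> dict[VerseRef, dict[str, str]]:
    """Group the verse texts of all languages under their shared reference."""
    aligned: dict[VerseRef, dict[str, str]] = defaultdict(dict)
    for lang, verses in translations.items():
        for ref, text in verses.items():
            aligned[ref][lang] = text
    return dict(aligned)


def fully_parallel(
    aligned: dict[VerseRef, dict[str, str]],
    languages: list[str],
) -> dict[VerseRef, dict[str, str]]:
    """Keep only the references that have a text in every requested language."""
    wanted = set(languages)
    return {ref: row for ref, row in aligned.items() if wanted.issubset(row)}


# Placeholder texts stand in for real verse content.
translations = {
    "eng": {("GEN", 1, 1): "eng GEN 1:1", ("GEN", 1, 2): "eng GEN 1:2"},
    "deu": {("GEN", 1, 1): "deu GEN 1:1", ("GEN", 1, 2): "deu GEN 1:2"},
    "tpi": {("GEN", 1, 1): "tpi GEN 1:1"},
}

aligned = align_by_verse(translations)
complete = fully_parallel(aligned, ["eng", "deu", "tpi"])
print(sorted(complete))  # [('GEN', 1, 1)]
```

Keying rows by verse reference rather than by position means a translation that lacks a verse simply leaves a gap in that row and does not shift the verses after it.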
Configuration is managed through Nickel for type-safe, validated settings:
    # Validate configuration
    just nickel-check

    # Export to JSON
    just nickel-export

    # Show resolved config
    just nickel-show

See config/main.ncl for all configuration options.
    # Run all tests
    just test

    # Run with coverage
    just test-coverage

    # Run proof verification
    just prove

This project integrates with Echidna for mathematical proof verification:
- Data integrity proofs
- Alignment correctness verification
- Statistical property validation
- Type safety guarantees
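As a rough runtime analogue of two of the properties above (data integrity and alignment correctness), the following Python sketch checks a small in-memory corpus. It is illustrative only: it does not use Echidna and does not reflect the project's actual proof files. The helper names `text_digest` and `check_alignment`, and the specific invariants chosen (no unexpected language codes, no empty texts), are assumptions made for this example.

```python
import hashlib


def text_digest(text: str) -> str:
    """Return the SHA-256 hex digest of the UTF-8 encoded text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def check_alignment(aligned, languages):
    """Return a list of violations; an empty list means all checks passed."""
    problems = []
    known = set(languages)
    for ref, row in aligned.items():
        # Invariant 1 (assumed): a row only contains expected language codes.
        unknown = set(row) - known
        if unknown:
            problems.append(f"{ref}: unexpected languages {sorted(unknown)}")
        # Invariant 2 (assumed): no aligned text is empty or whitespace-only.
        for lang, text in row.items():
            if not text.strip():
                problems.append(f"{ref}: empty text for {lang}")
    return problems


# Placeholder corpus; "xxx" is a deliberately invalid entry.
aligned = {
    ("GEN", 1, 1): {"eng": "eng GEN 1:1", "deu": "deu GEN 1:1"},
    ("GEN", 1, 2): {"eng": "eng GEN 1:2", "xxx": "  "},
}

for problem in check_alignment(aligned, ["eng", "deu"]):
    print(problem)

# Integrity check: a digest recorded at crawl time should match on reload.
recorded = text_digest("eng GEN 1:1")
assert text_digest(aligned[("GEN", 1, 1)]["eng"]) == recorded
```

A formal proof establishes such properties for all inputs, whereas a runtime check like this one only covers the data it is actually run on.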
To run the proofs:

    # Run all proofs
    just prove

    # Check specific proof
    just prove-check alignment_correctness

Uses Podman (not Docker) for container operations:
    # Build container
    just container-build

    # Run container
    just container-run

    # Deploy with volume mounts
    just container-dev

This project targets Gold (100%) compliance with the Rhodium Standard Repository specification:
    # Run compliance audit
    just rsr-audit

    # Generate HTML report
    just rsr-audit-html

See CONTRIBUTING.adoc for guidelines.
This project uses the Tri-Perimeter Contribution Framework (TPCF):
- Perimeter 1 (Core): Maintainers only
- Perimeter 2 (Expert): Trusted contributors
- Perimeter 3 (Community): Open contributions
Dual licensed under:
- MIT License
- Palimpsest License v0.8
See LICENSE.txt for details.
Commercial use with attribution is permitted. Proprietary AI training without attribution is prohibited.