hyperpolymath/lol

LOL


Super-parallel corpus crawler for multilingual NLP and Computational Linguistics research

A type-safe ReScript implementation for building massive parallel corpora across 1500+ languages from Bible translation sources.

Features

  • 1500+ Languages - Crawl parallel texts from multiple Bible corpus sources

  • Type-Safe - Built with ReScript for compile-time correctness

  • Proof-Verified - Echidna integration for mathematical verification

  • RSR Gold Target - Aims for Gold compliance with the Rhodium Standard Repository (RSR) specification

  • Semantic Grounding - OpenCyc integration for common-sense reasoning

  • Container-Ready - Podman-native (no Docker required)

Quick Start

Prerequisites

  • Node.js 20+

  • Just command runner

  • Nix (recommended) or npm

  • Podman (for containers)

Installation

    # Clone the repository
    git clone https://github.com/Hyperpolymath/1000Langs.git
    cd 1000Langs

    # Using Nix (recommended)
    nix develop

    # Or using npm directly
    npm install

Build & Run

    # Build the project
    just build

    # Run tests
    just test

    # Run the crawler
    just crawl-all

Project Structure

    1000Langs/
    ├── src/
    │   ├── Lang1000.res            # Main entry point
    │   ├── crawlers/               # Web crawler implementations
    │   │   ├── Crawler.res         # Base crawler module
    │   │   ├── BibleCloud.res      # bible.cloud crawler
    │   │   ├── BibleCom.res        # bible.com crawler
    │   │   └── PngScriptures.res   # pngscriptures.org crawler
    │   ├── api/                    # API client wrappers
    │   │   └── DigitalBiblePlatform.res
    │   ├── corpus/                 # Corpus management
    │   │   └── Alignment.res       # Parallel text alignment
    │   ├── utils/                  # Utility modules
    │   │   ├── Iso639.res          # Language code handling
    │   │   └── Statistics.res      # Statistical functions
    │   ├── proofs/                 # Mathematical proofs
    │   └── cyc/                    # OpenCyc integration
    ├── test/                       # Test suites
    ├── proofs/                     # Echidna proof files
    ├── config/                     # Nickel configuration
    ├── meta/                       # Reference data
    ├── .well-known/                # Discovery files
    ├── justfile                    # Task automation
    ├── flake.nix                   # Nix development environment
    ├── Containerfile               # Podman container definition
    └── rescript.json               # ReScript configuration
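
The tree lists corpus/Alignment.res for parallel text alignment, but this README does not show how it works. As a minimal sketch, assuming verses are keyed by a book/chapter/verse identifier such as GEN.1.1 (a common convention for Bible corpora, not confirmed by this repository), alignment reduces to keeping the verse identifiers that every language shares. The sketch is in TypeScript rather than ReScript so it can run directly under Node.js; `alignVerses`, `VerseTable`, `AlignedRow`, and the sample data are hypothetical names for illustration.

```typescript
// Hypothetical shape: verse ID -> verse text, one table per language.
type VerseTable = Record<string, string>;

interface AlignedRow {
  verseId: string;
  texts: Record<string, string>; // language code -> verse text
}

// Keep only verses present in every language, in the order of the first language.
function alignVerses(corpora: Record<string, VerseTable>): AlignedRow[] {
  const langs = Object.keys(corpora);
  const firstLang = langs[0];
  if (firstLang === undefined) return [];
  const first = corpora[firstLang];
  if (first === undefined) return [];

  const rows: AlignedRow[] = [];
  for (const verseId of Object.keys(first)) {
    const texts: Record<string, string> = {};
    let complete = true;
    for (const lang of langs) {
      const table = corpora[lang];
      const text = table === undefined ? undefined : table[verseId];
      if (text === undefined) {
        complete = false; // this language lacks the verse, so drop the row
        break;
      }
      texts[lang] = text;
    }
    if (complete) rows.push({ verseId, texts });
  }
  return rows;
}

// Illustrative data only: the German table is missing GEN.1.2.
const sample: Record<string, VerseTable> = {
  eng: { "GEN.1.1": "In the beginning ...", "GEN.1.2": "And the earth ..." },
  deu: { "GEN.1.1": "Am Anfang ..." },
};
const aligned = alignVerses(sample);
console.log(aligned.map((row) => row.verseId).join(", ")); // GEN.1.1
```

Intersecting on verse identifiers is the simplest policy; a real corpus would also need to handle differing versification schemes, which this sketch ignores.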

Supported Sources

    Source          URL                        Type      Languages
    Bible Cloud     https://bible.cloud        API       1500+
    Bible.com       https://bible.com          Scraper   2000+
    PNG Scriptures  https://pngscriptures.org  Download  800+
    eBible          https://ebible.org         Download  1000+
    Find.Bible      https://find.bible         API       1200+
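
The source list can also be held as typed data, so that an unknown access type is rejected at compile time. The following is a sketch in TypeScript, not the project's ReScript code; `CorpusSource`, `AccessKind`, `sourcesByKind`, and `largestFirst` are hypothetical names, and `minLanguages` stores the "N+" figures listed for each source as lower bounds.

```typescript
// The three access types listed for the supported sources.
type AccessKind = "API" | "Scraper" | "Download";

interface CorpusSource {
  name: string;
  url: string;
  kind: AccessKind;
  minLanguages: number; // lower bound, e.g. "1500+" -> 1500
}

const SOURCES: CorpusSource[] = [
  { name: "Bible Cloud", url: "https://bible.cloud", kind: "API", minLanguages: 1500 },
  { name: "Bible.com", url: "https://bible.com", kind: "Scraper", minLanguages: 2000 },
  { name: "PNG Scriptures", url: "https://pngscriptures.org", kind: "Download", minLanguages: 800 },
  { name: "eBible", url: "https://ebible.org", kind: "Download", minLanguages: 1000 },
  { name: "Find.Bible", url: "https://find.bible", kind: "API", minLanguages: 1200 },
];

// Select the sources that share one access type.
function sourcesByKind(kind: AccessKind): CorpusSource[] {
  return SOURCES.filter((source) => source.kind === kind);
}

// Sort a copy by language count, largest first, leaving the input untouched.
function largestFirst(sources: CorpusSource[]): CorpusSource[] {
  return sources.slice().sort((a, b) => b.minLanguages - a.minLanguages);
}

console.log(sourcesByKind("Download").map((s) => s.name).join(", ")); // PNG Scriptures, eBible
console.log(largestFirst(SOURCES).map((s) => s.name).join(", "));
```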

Configuration

Configuration is managed through Nickel for type-safe, validated settings:

    # Validate configuration
    just nickel-check

    # Export to JSON
    just nickel-export

    # Show resolved config
    just nickel-show

See config/main.ncl for all configuration options.

Testing

    # Run all tests
    just test

    # Run with coverage
    just test-coverage

    # Run proof verification
    just prove

Proof Verification

This project integrates with Echidna for mathematical proof verification:

  • Data integrity proofs

  • Alignment correctness verification

  • Statistical property validation

  • Type safety guarantees

    # Run all proofs
    just prove

    # Check specific proof
    just prove-check alignment_correctness
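
The proof files themselves are not shown in this README. To make "alignment correctness" concrete, the following TypeScript sketch checks two plausible invariants at runtime: every aligned row covers the same set of languages, and no verse identifier appears twice. This is an assumption about what such a property could mean, and a runtime check is weaker than a proof; `checkAlignment` and `AlignedRow` are hypothetical names, not part of the project's API.

```typescript
interface AlignedRow {
  verseId: string;
  texts: Record<string, string>; // language code -> verse text
}

// Returns human-readable violations; an empty list means both invariants hold.
function checkAlignment(rows: AlignedRow[], langs: string[]): string[] {
  const problems: string[] = [];
  const seen: Record<string, boolean> = {};
  const expected = langs.slice().sort().join(",");

  for (const row of rows) {
    // Invariant 1: each verse ID appears at most once.
    if (seen[row.verseId] === true) {
      problems.push(`duplicate verse ${row.verseId}`);
    }
    seen[row.verseId] = true;

    // Invariant 2: each row covers exactly the expected languages.
    const actual = Object.keys(row.texts).sort().join(",");
    if (actual !== expected) {
      problems.push(`verse ${row.verseId} covers [${actual}], expected [${expected}]`);
    }
  }
  return problems;
}

// Illustrative data only.
const goodRows: AlignedRow[] = [
  { verseId: "GEN.1.1", texts: { eng: "In the beginning ...", deu: "Am Anfang ..." } },
  { verseId: "GEN.1.2", texts: { eng: "And the earth ...", deu: "Und die Erde ..." } },
];
const badRows: AlignedRow[] = [
  { verseId: "GEN.1.1", texts: { eng: "In the beginning ...", deu: "Am Anfang ..." } },
  { verseId: "GEN.1.1", texts: { eng: "In the beginning ..." } }, // repeated ID, missing deu
];

console.log(checkAlignment(goodRows, ["eng", "deu"]).length); // 0
console.log(checkAlignment(badRows, ["eng", "deu"]).length); // 2
```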

Container Deployment

Container operations use Podman, not Docker:

    # Build container
    just container-build

    # Run container
    just container-run

    # Deploy with volume mounts
    just container-dev

RSR Compliance

This project targets Gold (100%) compliance with the Rhodium Standard Repository specification:

    # Run compliance audit
    just rsr-audit

    # Generate HTML report
    just rsr-audit-html

Contributing

See CONTRIBUTING.adoc for guidelines.

This project uses the Tri-Perimeter Contribution Framework (TPCF):

  • Perimeter 1 (Core): Maintainers only

  • Perimeter 2 (Expert): Trusted contributors

  • Perimeter 3 (Community): Open contributions

License

Dual licensed under:

  • MIT License

  • Palimpsest License v0.8

See LICENSE.txt for details.

Commercial use with attribution is permitted. Proprietary AI training without attribution is prohibited.

Acknowledgments

  • Original Python implementation by Ehsaneddin Asgari (LMU Munich)

  • Bible corpus data from various open Bible translation projects

  • Echidna for proof verification

  • RSR for compliance framework
