A Scientifically Rigorous Infrastructure for LLM Reliability and Performance Research

The Crucible Framework is a comprehensive infrastructure for conducting reproducible, statistically rigorous experiments on large language model (LLM) reliability, performance optimization, and cost-accuracy trade-offs. Built on Elixir/OTP, it leverages the BEAM virtual machine's strengths in concurrency, fault tolerance, and distributed computing to enable research at scale.
Target Audience:
- PhD students and researchers studying LLM reliability
- ML engineers requiring rigorous experimental evaluation
- Research labs investigating AI system optimization
- Anyone needing publication-quality experimental infrastructure
Key Features:
- Multi-model ensemble voting with 4 voting strategies
- Request hedging for tail latency reduction (50-75% P99 improvement)
- Statistical testing with 15+ tests (parametric & non-parametric)
- Research-grade instrumentation with complete event capture
- Causal transparency for LLM decision provenance
- Automated experiment orchestration with checkpointing
- Multi-format reporting (Markdown, LaTeX, HTML, Jupyter)
Quick Start:

```bash
# Clone repository
git clone https://github.com/North-Shore-AI/crucible_framework.git
cd crucible_framework

# Install dependencies
mix deps.get

# Compile
mix compile

# Run tests
mix test
```

Define your first experiment:

```elixir
defmodule MyFirstExperiment do
  use ResearchHarness.Experiment

  # Define experiment
  name "Ensemble vs Single Model"
  dataset :mmlu_stem, sample_size: 100

  conditions [
    %{name: "baseline", fn: &baseline/1},
    %{name: "ensemble", fn: &ensemble/1}
  ]

  metrics [:accuracy, :cost_per_query, :latency_p99]
  repeat 3

  # Baseline: Single model
  def baseline(query) do
    # Your single-model implementation
    %{prediction: "answer", latency: 800, cost: 0.01}
  end

  # Treatment: 3-model ensemble
  def ensemble(query) do
    {:ok, result} = Ensemble.predict(query,
      models: [:gpt4_mini, :claude_haiku, :gemini_flash],
      strategy: :majority
    )

    %{
      prediction: result.answer,
      latency: result.metadata.latency_ms,
      cost: result.metadata.cost_usd
    }
  end
end

# Run experiment
{:ok, report} = ResearchHarness.run(MyFirstExperiment, output_dir: "results")

# Results saved to:
# - results/exp_abc123_report.md
# - results/exp_abc123_results.csv
# - results/exp_abc123_analysis.json
```

Output:
```markdown
# Experiment Results

## Summary
- Ensemble accuracy: 96.3% (±1.2%)
- Baseline accuracy: 89.1% (±2.1%)
- Improvement: +7.2 percentage points
- Statistical significance: p < 0.001, d = 3.42
- Cost increase: 3.0× ($0.03 vs $0.01)

## Conclusion
Ensemble significantly outperforms baseline with very large effect size.
Cost-accuracy ratio: $0.29 per percentage point improvement.
```

The framework consists of 6 layers organized as 8 independent OTP applications:
```mermaid
graph TB
    subgraph "Layer 6: Orchestration"
        RH[ResearchHarness<br/>Experiment DSL]
    end
    subgraph "Layer 5: Analysis & Reporting"
        BENCH[Bench<br/>Statistical Tests]
        TEL[TelemetryResearch<br/>Metrics]
        REP[Reporter<br/>Multi-Format]
    end
    subgraph "Layer 4: Reliability Strategies"
        ENS[Ensemble<br/>Multi-Model Voting]
        HEDGE[Hedging<br/>Latency Reduction]
    end
    subgraph "Layer 3: Transparency"
        CT[CausalTrace<br/>Decision Provenance]
    end
    subgraph "Layer 2: Data Management"
        DS[DatasetManager<br/>Benchmarks]
    end
    subgraph "Layer 1: Foundation"
        OTP[Elixir/OTP Runtime]
    end

    RH --> BENCH
    RH --> TEL
    RH --> ENS
    RH --> HEDGE
    RH --> DS
    ENS --> CT
    BENCH --> OTP
    TEL --> OTP
    ENS --> OTP
    HEDGE --> OTP
    DS --> OTP
```

See ARCHITECTURE.md for complete system design.
The Ensemble library increases reliability by querying multiple models and aggregating their responses.
```elixir
{:ok, result} = Ensemble.predict("What is 2+2?",
  models: [:gpt4, :claude, :gemini],
  strategy: :majority,
  execution: :parallel
)

result.answer              # => "4"
result.metadata.consensus  # => 1.0 (100% agreement)
result.metadata.cost_usd   # => 0.045
```

Voting Strategies:
- `:majority` - Most common answer wins (default)
- `:weighted` - Weight by confidence scores
- `:best_confidence` - Highest confidence answer
- `:unanimous` - All must agree
Execution Strategies:
- `:parallel` - All models simultaneously (fastest)
- `:sequential` - One at a time until consensus (cheapest)
- `:hedged` - Primary with backups (balanced)
- `:cascade` - Fast/cheap → slow/expensive
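
For instance, the execution strategy is chosen per call; a minimal sketch using only options shown above (model names and `execution:` values are the ones listed here):

```elixir
# Cheapest: query one model at a time and stop once consensus is reached
{:ok, cheap} = Ensemble.predict("What is 2+2?",
  models: [:gpt4_mini, :claude_haiku, :gemini_flash],
  strategy: :majority,
  execution: :sequential
)

# Balanced: a primary model with backups that fire only if it is slow
{:ok, balanced} = Ensemble.predict("What is 2+2?",
  models: [:gpt4_mini, :claude_haiku, :gemini_flash],
  strategy: :majority,
  execution: :hedged
)
```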
Expected Results:
- Reliability: 96-99% accuracy (vs 89-92% single model)
- Cost: 3-5× single model cost
- Latency: equal to the slowest model in parallel mode
See ENSEMBLE_GUIDE.md for deep dive.
The Hedging library reduces P99 latency by sending backup requests after a delay.
```elixir
Hedging.request(fn -> call_api() end,
  strategy: :percentile,
  percentile: 95,
  enable_cancellation: true
)
```

Strategies:
- `:fixed` - Fixed delay (e.g., 100ms)
- `:percentile` - Delay = Pth percentile latency
- `:adaptive` - Learn optimal delay over time
- `:workload_aware` - Different delays per workload type
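
A short sketch of two of these strategies; `Hedging.request/2`, `strategy:`, and `enable_cancellation:` appear in the call above, while the `delay_ms:` option for `:fixed` is an assumption rather than confirmed API:

```elixir
# Fixed delay: fire a backup request if the primary hasn't returned after 100ms
# (the :delay_ms option name is assumed for illustration)
Hedging.request(fn -> call_api() end,
  strategy: :fixed,
  delay_ms: 100
)

# Adaptive: let the hedger learn the delay from observed latencies over time
Hedging.request(fn -> call_api() end,
  strategy: :adaptive,
  enable_cancellation: true
)
```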
Expected Results:
- P99 latency: 50-75% reduction
- Cost: 5-15% increase (hedge fires ~10% of time)
- Based on: Google's "The Tail at Scale" research (Dean & Barroso, 2013)
See HEDGING_GUIDE.md for theory and practice.
The Bench library provides rigorous statistical analysis for publication-quality results.
```elixir
control   = [0.89, 0.87, 0.90, 0.88, 0.91]
treatment = [0.96, 0.97, 0.94, 0.95, 0.98]

result = Bench.compare(control, treatment)
```

Output:
```elixir
%Bench.Result{
  test: :welch_t_test,
  p_value: 0.00012,
  effect_size: %{cohens_d: 4.52, interpretation: "very large"},
  confidence_interval: %{interval: {0.051, 0.089}, level: 0.95},
  interpretation: """
  Treatment group shows significantly higher accuracy (M=0.96, SD=0.015)
  compared to control (M=0.89, SD=0.015), t(7.98)=8.45, p<0.001, d=4.52.
  """
}
```

Features:
- 15+ statistical tests: t-tests, ANOVA, Mann-Whitney, Wilcoxon, Kruskal-Wallis
- Automatic test selection: Checks assumptions, selects appropriate test
- Effect sizes: Cohen's d, η², ω², odds ratios
- Power analysis: A priori and post-hoc
- Multiple comparison correction: Bonferroni, Holm, Benjamini-Hochberg
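
As an illustration of correcting for multiple comparisons, the sketch below applies a manual Bonferroni threshold to per-comparison `Bench.compare/2` results; the built-in correction API is not shown here, and the treatment data are invented for the example:

```elixir
control = [0.89, 0.87, 0.90, 0.88, 0.91]

# Hypothetical accuracy samples for two treatments
treatments = %{
  "ensemble_3" => [0.94, 0.95, 0.93, 0.96, 0.94],
  "ensemble_5" => [0.96, 0.97, 0.94, 0.95, 0.98]
}

alpha = 0.05
m = map_size(treatments)  # number of comparisons

for {name, data} <- treatments do
  result = Bench.compare(control, data)
  significant? = result.p_value < alpha / m  # Bonferroni-adjusted threshold
  IO.puts("#{name}: p=#{result.p_value}, significant after correction: #{significant?}")
end
```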
See STATISTICAL_TESTING.md for complete guide.
The TelemetryResearch library provides complete event capture for reproducible research.
```elixir
# Start experiment
{:ok, exp} = TelemetryResearch.start_experiment(
  name: "ensemble_vs_single",
  condition: "treatment",
  tags: ["accuracy", "h1"]
)

# All events automatically captured

# Stop and export
{:ok, exp} = TelemetryResearch.stop_experiment(exp.id)
{:ok, path} = TelemetryResearch.export(exp.id, :csv)
```

Features:
- Experiment isolation: Multiple concurrent experiments, no cross-contamination
- Event enrichment: Automatic metadata (timestamp, experiment context, process info)
- Storage backends: ETS (in-memory) or PostgreSQL (persistent)
- Export formats: CSV, JSON Lines, Parquet
- Metrics calculation: Latency percentiles, cost, reliability
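
As a sketch of experiment isolation, two conditions can be instrumented concurrently and exported separately, using only the calls shown above:

```elixir
# Two isolated experiments running at the same time
{:ok, control} = TelemetryResearch.start_experiment(
  name: "ensemble_vs_single",
  condition: "control",
  tags: ["accuracy", "h1"]
)

{:ok, treatment} = TelemetryResearch.start_experiment(
  name: "ensemble_vs_single",
  condition: "treatment",
  tags: ["accuracy", "h1"]
)

# ... run both conditions; events are attributed to the correct experiment ...

{:ok, _} = TelemetryResearch.stop_experiment(control.id)
{:ok, _} = TelemetryResearch.stop_experiment(treatment.id)

{:ok, control_csv}   = TelemetryResearch.export(control.id, :csv)
{:ok, treatment_csv} = TelemetryResearch.export(treatment.id, :csv)
```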
See INSTRUMENTATION.md for complete guide.
The DatasetManager library offers a unified interface to standard benchmarks with caching.
```elixir
# Load dataset
{:ok, dataset} = DatasetManager.load(:mmlu_stem, sample_size: 200)

# Evaluate predictions
{:ok, results} = DatasetManager.evaluate(predictions,
  dataset: dataset,
  metrics: [:exact_match, :f1]
)

results.accuracy  # => 0.96
```

Supported Datasets:
- MMLU: 15,908 questions across 57 subjects
- HumanEval: 164 Python programming problems
- GSM8K: 8,500 grade school math problems
- Custom: Load from JSONL files
Features:
- Automatic caching: First load downloads, subsequent loads from cache
- Version tracking: Dataset versions locked for reproducibility
- Multiple metrics: Exact match, F1, BLEU, CodeBLEU
- Sampling: Random, stratified, k-fold cross-validation
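
A hypothetical sketch of loading a custom JSONL benchmark; the `:custom` dataset identifier and `path:` option are illustrative assumptions, not confirmed API (see DATASETS.md for the actual interface):

```elixir
# Assumed option names for illustration only; consult DATASETS.md for the real API
{:ok, dataset} = DatasetManager.load(:custom,
  path: "data/my_benchmark.jsonl",
  sample_size: 100
)

{:ok, results} = DatasetManager.evaluate(predictions,
  dataset: dataset,
  metrics: [:exact_match, :f1]
)
```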
See DATASETS.md for details.
The CausalTrace library captures and visualizes LLM decision-making for transparency.
```elixir
# Parse LLM output with event tags
{:ok, chain} = CausalTrace.parse_llm_output(llm_response, "Task Name")

# Save and visualize
CausalTrace.save(chain)
CausalTrace.open_visualization(chain)
```

Event Types:
- Task decomposition, hypothesis formation, pattern application
- Alternative consideration, constraint identification
- Decision making, uncertainty flagging
Features:
- Interactive HTML visualization: Timeline, alternatives, confidence levels
- Storage: JSON format on disk
- Search: Query chains by criteria
- Export: Markdown, CSV for analysis
Use Cases:
- Debugging LLM code generation
- User trust studies
- Model comparison
- Prompt engineering
See CAUSAL_TRANSPARENCY.md for details.
Protect your LLM systems from attacks, bias, and data quality issues with a comprehensive 4-library security stack.
```elixir
# Complete security pipeline
{:ok, result} = SecurePipeline.process(user_input)

# Automatically protects against:
# - 21 adversarial attack types
# - Prompt injection (24+ patterns)
# - Bias and fairness violations
# - Data quality issues and drift
```

Four Security Libraries:
- CrucibleAdversary - 21 attack types (character, word, semantic, injection, jailbreak), defense mechanisms, robustness metrics (ASR, accuracy drop, consistency)
- LlmGuard - AI firewall with 24+ prompt injection patterns, pipeline architecture, <10ms latency
- ExFairness - 4 fairness metrics (demographic parity, equalized odds, equal opportunity, predictive parity), EEOC 80% rule compliance, bias mitigation
- ExDataCheck - 22 data quality expectations, drift detection (KS test, PSI), outlier detection, continuous monitoring
Key Features:
- 21 Attack Types: Character perturbations, prompt injection, jailbreak techniques, semantic attacks
- Defense Mechanisms: Detection, filtering, sanitization with risk scoring
- Fairness Auditing: 4 metrics with legal compliance (EEOC 80% rule)
- Data Quality: 22 built-in expectations, distribution drift detection
- Production-Ready: <30ms total security overhead, >90% test coverage
Example: Comprehensive Security Evaluation
```elixir
# Evaluate model robustness against a set of attack types
{:ok, evaluation} = CrucibleAdversary.evaluate(
  MyModel,
  test_set,
  attacks: [:prompt_injection, :jailbreak_roleplay, :character_swap],
  metrics: [:accuracy_drop, :asr, :consistency],
  defense_mode: :strict
)

# Results:
# - Attack Success Rate: 3.2% (target: <5%)
# - Accuracy Drop: 6.8% (target: <10%)
# - Fairness Compliance: ✅ Passes EEOC 80% rule
# - Data Quality Score: 92/100
```

See ADVERSARIAL_ROBUSTNESS.md for a complete technical deep dive, including:
- All 21 attack types with examples
- Defense mechanisms and integration patterns
- Fairness metrics and bias mitigation strategies
- Data quality validation and drift detection
- Complete security pipeline architecture
- Links to all 4 component repositories
ResearchHarness provides a high-level DSL for defining and running complete experiments.
```elixir
defmodule MyExperiment do
  use ResearchHarness.Experiment

  name "Hypothesis 1: Ensemble Reliability"
  dataset :mmlu_stem, sample_size: 200

  conditions [
    %{name: "baseline", fn: &baseline/1},
    %{name: "ensemble", fn: &ensemble/1}
  ]

  metrics [:accuracy, :latency_p99, :cost_per_query]
  repeat 3
  seed 42  # For reproducibility
end

{:ok, report} = ResearchHarness.run(MyExperiment)
```

Features:
- Declarative DSL: Define experiments clearly
- Automatic execution: Parallel processing with GenStage
- Fault tolerance: Checkpointing every N queries
- Cost estimation: Preview before running
- Statistical analysis: Automatic with Bench
- Multi-format reports: Markdown, LaTeX, HTML, Jupyter
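
For example, a single run can emit reports in several formats at once; both options below also appear in the full example later in this README:

```elixir
{:ok, report} = ResearchHarness.run(MyExperiment,
  output_dir: "results/h1",
  formats: [:markdown, :latex, :html, :jupyter]
)
```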
The framework supports rigorous experimental research with 6 core hypotheses:
H1 (Ensemble Reliability): A 5-model ensemble achieves ≥99% accuracy on MMLU-STEM, significantly higher than the best single model (≤92%).
Design: Randomized controlled trial
- Control: GPT-4 single model
- Treatment: 5-model majority vote ensemble
- n=200 queries, 3 repetitions
Analysis: Independent t-test, Cohen's d effect size
Expected Result: d > 1.0 (very large effect)
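
As a rough a priori check, the required per-group sample size at d = 1.0 (two-sided α = 0.05, power = 0.80) can be approximated with the standard normal-approximation formula n ≈ 2(z₁₋α/₂ + z₁₋β)² / d²; a minimal sketch:

```elixir
z_alpha_half = 1.96  # two-sided alpha = 0.05
z_beta       = 0.84  # power = 0.80
d            = 1.0   # smallest effect size of interest

n_per_group = 2 * :math.pow(z_alpha_half + z_beta, 2) / :math.pow(d, 2)
IO.puts("Required n per group: #{Float.ceil(n_per_group) |> trunc()}")  # => 16
```

At n=200 queries per condition, the design is therefore powered for effects considerably smaller than d = 1.0.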
H2: Request hedging reduces P99 latency by ≥50% with a <15% cost increase.
Design: Paired comparison
- Baseline vs P95 hedging
- n=1000 API calls, 3 repetitions
Analysis: Paired t-test on log-transformed latencies
Expected Result: 60% P99 reduction, 10% cost increase
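
A minimal sketch of this analysis step, assuming `baseline_latencies` and `hedged_latencies` are collected as in the hedging example later in this README; whether `Bench.compare/2` applies a paired test here is an assumption:

```elixir
# Log-transform to reduce the right skew typical of latency distributions
log_baseline = Enum.map(baseline_latencies, &:math.log/1)
log_hedged   = Enum.map(hedged_latencies, &:math.log/1)

# Bench checks assumptions and selects a test; a paired design is assumed here
result = Bench.compare(log_baseline, log_hedged)
IO.inspect(result.p_value)
```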
H3-H6: See RESEARCH_METHODOLOGY.md
Complete methodology includes:
- Experimental designs (factorial, repeated measures, time series)
- Statistical methods (parametric & non-parametric tests)
- Power analysis for sample size determination
- Multiple comparison correction
- Reproducibility protocols
- Publication guidelines
Requirements:

- Elixir: 1.14 or higher
- Erlang/OTP: 25 or higher
- PostgreSQL: 14+ (optional, for persistent telemetry storage)
Install Elixir on macOS:
```bash
brew install elixir
```

Ubuntu/Debian:
```bash
wget https://packages.erlang-solutions.com/erlang-solutions_2.0_all.deb
sudo dpkg -i erlang-solutions_2.0_all.deb
sudo apt-get update
sudo apt-get install elixir
```

From source:
```bash
git clone https://github.com/elixir-lang/elixir.git
cd elixir
make clean test
```

Install the framework:

```bash
# Clone repository
git clone https://github.com/North-Shore-AI/crucible_framework.git
cd crucible_framework

# Install dependencies
mix deps.get

# Compile all apps
mix compile

# Run tests
mix test

# Generate documentation
mix docs

# View docs
open doc/index.html
```

Create config/config.exs:
```elixir
import Config

# API Keys
config :ensemble,
  openai_api_key: System.get_env("OPENAI_API_KEY"),
  anthropic_api_key: System.get_env("ANTHROPIC_API_KEY"),
  google_api_key: System.get_env("GOOGLE_API_KEY")

# Dataset caching
config :dataset_manager,
  cache_dir: "~/.cache/crucible_framework/datasets"

# Telemetry storage
config :telemetry_research,
  storage_backend: :ets  # or :postgres

# Research harness
config :research_harness,
  checkpoint_dir: "./checkpoints",
  results_dir: "./results"
```

Set your API keys as environment variables:

```bash
export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="sk-ant-..."
export GOOGLE_API_KEY="..."
```

Or create a .env file and use dotenv:
```bash
# .env
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
GOOGLE_API_KEY=...
```

Example: test whether a 3-model ensemble beats a single model.

```elixir
# Test if 3-model ensemble beats single model
alias Ensemble
alias DatasetManager

# Load dataset
{:ok, dataset} = DatasetManager.load(:mmlu_stem, sample_size: 100)

# Run baseline (single model)
baseline_results =
  Enum.map(dataset.items, fn item ->
    # Call GPT-4
    {:ok, answer} = call_gpt4(item.input)
    %{predicted: answer, expected: item.expected}
  end)

# Run ensemble (3 models)
ensemble_results =
  Enum.map(dataset.items, fn item ->
    {:ok, result} = Ensemble.predict(item.input,
      models: [:gpt4_mini, :claude_haiku, :gemini_flash],
      strategy: :majority
    )

    %{predicted: result.answer, expected: item.expected}
  end)

# Evaluate
{:ok, baseline_eval} = DatasetManager.evaluate(baseline_results, dataset: dataset)
{:ok, ensemble_eval} = DatasetManager.evaluate(ensemble_results, dataset: dataset)

# Compare with statistics
baseline_accuracy = baseline_eval.accuracy
ensemble_accuracy = ensemble_eval.accuracy

# Note: a single accuracy value per condition gives nothing to test statistically;
# repeat the run several times and compare the resulting accuracy distributions.
Bench.compare([baseline_accuracy], [ensemble_accuracy])
```

Example: measure the P99 latency reduction from hedging.

```elixir
alias Hedging

# Measure baseline latencies
baseline_latencies =
  1..1000
  |> Enum.map(fn _ ->
    start = System.monotonic_time(:millisecond)
    call_api()
    finish = System.monotonic_time(:millisecond)
    finish - start
  end)

# Measure hedged latencies
hedged_latencies =
  1..1000
  |> Enum.map(fn _ ->
    start = System.monotonic_time(:millisecond)

    Hedging.request(fn -> call_api() end,
      strategy: :percentile,
      percentile: 95
    )

    finish = System.monotonic_time(:millisecond)
    finish - start
  end)

# Calculate P99
baseline_p99 = Statistics.percentile(baseline_latencies, 0.99)
hedged_p99 = Statistics.percentile(hedged_latencies, 0.99)

reduction = (baseline_p99 - hedged_p99) / baseline_p99
IO.puts("P99 reduction: #{Float.round(reduction * 100, 1)}%")
```

Example: a full ResearchHarness experiment comparing ensemble sizes.

```elixir
defmodule EnsembleExperiment do
  use ResearchHarness.Experiment

  name "Ensemble Size Comparison"
  description "Test 1, 3, 5, 7 model ensembles"

  dataset :mmlu_stem, sample_size: 200

  conditions [
    %{name: "single", fn: &single_model/1},
    %{name: "ensemble_3", fn: &ensemble_3/1},
    %{name: "ensemble_5", fn: &ensemble_5/1},
    %{name: "ensemble_7", fn: &ensemble_7/1}
  ]

  metrics [
    :accuracy,
    :consensus,
    :cost_per_query,
    :latency_p50,
    :latency_p99
  ]

  repeat 3
  seed 42

  def single_model(query) do
    {:ok, result} = call_gpt4(query)
    %{prediction: result.answer, cost: 0.01, latency: 800}
  end

  def ensemble_3(query) do
    {:ok, result} = Ensemble.predict(query,
      models: [:gpt4_mini, :claude_haiku, :gemini_flash],
      strategy: :majority
    )

    %{
      prediction: result.answer,
      consensus: result.metadata.consensus,
      cost: result.metadata.cost_usd,
      latency: result.metadata.latency_ms
    }
  end

  def ensemble_5(query) do
    {:ok, result} = Ensemble.predict(query,
      models: [:gpt4_mini, :claude_haiku, :gemini_flash, :gpt35, :claude_sonnet],
      strategy: :majority
    )

    %{
      prediction: result.answer,
      consensus: result.metadata.consensus,
      cost: result.metadata.cost_usd,
      latency: result.metadata.latency_ms
    }
  end

  def ensemble_7(query) do
    {:ok, result} = Ensemble.predict(query,
      models: [:gpt4_mini, :claude_haiku, :gemini_flash, :gpt35, :claude_sonnet, :gpt4, :gemini_pro],
      strategy: :majority
    )

    %{
      prediction: result.answer,
      consensus: result.metadata.consensus,
      cost: result.metadata.cost_usd,
      latency: result.metadata.latency_ms
    }
  end
end

# Run experiment
{:ok, report} = ResearchHarness.run(EnsembleExperiment,
  output_dir: "results/ensemble_size",
  formats: [:markdown, :latex, :html, :jupyter]
)

# Results in:
# - results/ensemble_size/exp_abc123_report.md
# - results/ensemble_size/exp_abc123_report.tex
# - results/ensemble_size/exp_abc123_report.html
# - results/ensemble_size/exp_abc123_analysis.ipynb
```

Typical latency by configuration:

| Configuration | P50 | P95 | P99 |
|---|---|---|---|
| Single model | 800ms | 2000ms | 5000ms |
| 3-model ensemble | 1200ms | 2500ms | 6000ms |
| Ensemble + hedging | 1200ms | 1800ms | 2200ms |

Throughput:

| Parallelism | Queries/sec |
|---|---|
| Sequential | 0.83 |
| 50 concurrent | 41.7 |
| 100 concurrent | 83.3 |

Cost:

| Configuration | Per Query | Per 1000 Queries |
|---|---|---|
| GPT-4 Mini | $0.0002 | $0.19 |
| 3-model ensemble (cheap) | $0.0007 | $0.66 |
| 5-model ensemble (mixed) | $0.0057 | $5.66 |

Memory:

| Component | Memory Usage |
|---|---|
| Empty process | ~2 KB |
| Query process | ~5 KB |
| HTTP client | ~15 KB |
| ETS storage (100k events) | ~200 MB |
All experiments are fully reproducible:
1. Deterministic Seeding
```elixir
# Set seed for all randomness
seed 42

# Query order, sampling, tie-breaking all deterministic
```

2. Version Tracking
```yaml
# Saved in experiment metadata
framework_version: 0.1.0
elixir_version: 1.14.0
dataset_version: mmlu-1.0.0
model_versions:
  gpt4: gpt-4-0613
  claude: claude-3-opus-20240229
```

3. Complete Artifact Preservation
```
results/exp_abc123/
├── config.json        # Full configuration
├── environment.json   # System info
├── dataset.jsonl      # Exact dataset used
├── results.csv        # Raw results
├── analysis.json      # Statistical analysis
└── checkpoints/       # Intermediate state
```

4. Verification Protocol
```bash
# Reproduce experiment
git checkout <commit>
mix deps.get
export EXPERIMENT_SEED=42
mix run experiments/h1_ensemble.exs

# Results should be identical
diff results/original results/reproduction
```

If you use this framework in your research, please cite:
```bibtex
@software{crucible_framework2025,
  title   = {Crucible Framework: Infrastructure for LLM Reliability Research},
  author  = {Research Infrastructure Team},
  year    = {2025},
  url     = {https://github.com/North-Shore-AI/crucible_framework},
  version = {0.1.0}
}
```

For specific experiments, also cite:
```bibtex
@misc{your_experiment2025,
  title        = {Your Experiment Title},
  author       = {Your Name},
  year         = {2025},
  howpublished = {Open Science Framework},
  url          = {https://osf.io/xxxxx/}
}
```

See PUBLICATIONS.md for paper templates and guidelines.
- ARCHITECTURE.md - Complete system architecture (6-layer stack, library interactions)
- RESEARCH_METHODOLOGY.md - 6 hypotheses, experimental designs, statistical methods
- GETTING_STARTED.md - Installation, first experiment, troubleshooting
- ENSEMBLE_GUIDE.md - Deep dive into ensemble library, voting strategies
- HEDGING_GUIDE.md - Request hedging explained, Google's research
- STATISTICAL_TESTING.md - Using Bench for rigorous analysis
- ADVERSARIAL_ROBUSTNESS.md - Complete security stack: 21 attacks, defenses, fairness, data quality
- INSTRUMENTATION.md - TelemetryResearch complete guide
- DATASETS.md - Supported datasets, adding custom datasets
- CAUSAL_TRANSPARENCY.md - Using CausalTrace, user study protocols
- CONTRIBUTING.md - How to contribute, code standards
- PUBLICATIONS.md - How to cite, paper templates
- FAQ.md - Common questions, troubleshooting
We welcome contributions! Please see CONTRIBUTING.md for:
- Code of conduct
- Development workflow
- Code standards
- Testing requirements
- Documentation standards
- Pull request process
Quick contribution guide:
- Fork repository
- Create feature branch (`git checkout -b feature/amazing-feature`)
- Write tests (`mix test`)
- Commit changes (`git commit -m 'Add amazing feature'`)
- Push to branch (`git push origin feature/amazing-feature`)
- Open pull request
This project is licensed under the MIT License - see the LICENSE file for details.
Key points:
- Free for academic and commercial use
- Attribution required
- No warranty
Support:

- Documentation: https://hexdocs.pm/crucible_framework
- Issues: https://github.com/North-Shore-AI/crucible_framework/issues
- Discussions: https://github.com/North-Shore-AI/crucible_framework/discussions
- Email: research@example.com
Inspired by:
- Google's "The Tail at Scale" research (Dean & Barroso, 2013)
- Ensemble methods in ML (Breiman, 1996; Dietterich, 2000)
- Statistical best practices (Cohen, 1988; Cumming, 2014)
Built with:
- Elixir - Functional programming language
- OTP - Fault-tolerant runtime
- Req - HTTP client
- Jason - JSON parsing
- Telemetry - Instrumentation
Datasets:

- MMLU, HumanEval, GSM8K (see DATASETS.md)

Roadmap:
- Distributed execution across multiple nodes
- Real-time experiment monitoring dashboard
- Additional datasets (BigBench, HELM)
- Model-specific optimizations
- AutoML for ensemble configuration
- Cost optimization algorithms
- Advanced visualizations (Plotly, D3.js)
- Cloud deployment templates (AWS, GCP, Azure)
- Production-ready stability
- Complete test coverage (>95%)
- Performance optimizations
- Extended documentation
- Tutorial videos
Why Elixir/OTP?

Elixir/OTP provides:
- Lightweight processes (10k+ concurrent requests)
- Fault tolerance (supervision trees handle failures)
- Immutability (no race conditions)
- Hot code reloading (update without stopping)
- Distributed (scale across machines)
Perfect for research requiring massive parallelism and reproducibility.
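
As a rough illustration of that concurrency model (standard library only, not framework API; `call_api/1` is a hypothetical stand-in for a model request):

```elixir
# Fan out many independent API calls with bounded concurrency.
queries = Enum.map(1..1_000, &"query #{&1}")

results =
  queries
  |> Task.async_stream(fn q -> call_api(q) end,
    max_concurrency: 100,
    timeout: 30_000
  )
  |> Enum.map(fn {:ok, result} -> result end)
```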
How long does it take to learn?

- Beginner: 1-2 days to run existing experiments
- Intermediate: 1 week to write custom experiments
- Advanced: 2-4 weeks to extend framework
Prior Elixir experience helps but not required. Framework provides high-level DSL.
How much do experiments cost?

It depends on experiment size:
- Small (100 queries): $0.50-$5
- Medium (1000 queries): $5-$50
- Large (10k queries): $50-$500
Use cheap models (GPT-4 Mini, Claude Haiku, Gemini Flash) to minimize costs.
Is the framework production-ready?

For research: Yes. The framework is stable, well-tested, and actively used.
For production systems: Partially. Core libraries (Ensemble, Hedging) are production-ready. ResearchHarness is research-focused.
Can I use the framework without knowing Elixir?

Basic experiments: Yes, using provided templates.
Custom experiments: Some Elixir knowledge required (functions, pattern matching, pipelines).
Resources:
How do I cite the framework?

See the Citation section above and PUBLICATIONS.md for LaTeX/BibTeX templates.
- Status: Active development
- Version: 0.1.0
- Last Updated: 2025-10-08
- Maintainers: Research Infrastructure Team
Built with ❤️ by researchers, for researchers.