Conversation

JoeXic commented Nov 2, 2025

Describe this PR

Add real-time web monitoring dashboard for GAIA validation benchmark with progress tracking and visualization capabilities.

What changed?

  • Added run_gaia_with_monitor.py to run GAIA benchmark with integrated web monitoring
  • Added utils/progress_check/gaia_web_monitor.py - web dashboard for real-time progress tracking
  • Added utils/progress_check/generate_gaia_report.py - report generation utility
  • Updated main.py to support the new monitoring command
  • Web dashboard accessible at http://localhost:8080 during benchmark execution (see the sketch below)
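
A minimal sketch of what such a dashboard could look like, using only the Python standard library. The log location, helper name, and JSON response format below are illustrative assumptions, not the actual structure of gaia_web_monitor.py:

```python
# Illustrative sketch only: serve a tiny JSON progress endpoint on port 8080.
# LOG_DIR, count_completed, and the "*.json" log naming are hypothetical.
import http.server
import json
from pathlib import Path

LOG_DIR = Path("logs/gaia-validation")  # assumed log folder

def count_completed(log_dir: Path) -> int:
    """Count how many per-task log files have been written so far."""
    return len(list(log_dir.glob("*.json"))) if log_dir.exists() else 0

class ProgressHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        body = json.dumps({"completed": count_completed(LOG_DIR)}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    http.server.HTTPServer(("", 8080), ProgressHandler).serve_forever()
```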

Why?

Running long benchmarks like GAIA validation takes hours, and users need a way to:

  • Monitor real-time progress without constantly checking logs
  • Visualize task completion status
  • Track performance metrics during execution
  • Generate comprehensive reports after completion

  • Add run-gaia-with-monitor command for running benchmark with real-time monitoring
  • Add web dashboard for monitoring benchmark progress (gaia_web_monitor.py)
  • Add generate_gaia_report.py to utils/progress_check/ for generating task reports
JoeXic closed this Nov 2, 2025
JoeXic reopened this Nov 2, 2025
JoeXic changed the title from "feat(monitoring): add real-time web dashboard for GAIA benchmark progress" to "feat(monitoring): add real-time web dashboard for monitoring benchmark progress" Nov 10, 2025

JoeXic commented Nov 10, 2025

Describe this PR

Refactor monitoring system from GAIA-specific to generic benchmark monitoring, supporting GAIA, FutureX, xbench, and FinSearchComp benchmarks with real-time web dashboards.

What changed?

Core Changes

  • Replaced run_gaia_with_monitor.py → run_benchmark_with_monitor.py (generic benchmark runner)
  • Replaced utils/progress_check/gaia_web_monitor.py → utils/progress_check/benchmark_monitor.py (generic monitor)
  • Replaced utils/progress_check/generate_gaia_report.py → utils/progress_check/generate_benchmark_report.py (generic report generator)
  • Updated main.py to use the new generic monitoring system
  • Updated utils/progress_check/check_finsearchcomp_progress.py (fixed type annotation)

New Features

  • Auto-detect benchmark type from log folder path
  • Support benchmark-specific metrics:
    • GAIA/FinSearchComp: Correctness evaluation (accuracy)
    • FutureX/xbench: Prediction tracking (prediction rate)
    • FinSearchComp: Task type breakdown (T1/T2/T3) and regional analysis
  • Extract attempt number from log filename for accurate report generation
  • Suppress verbose HTTP logs in web dashboard
  • Automatic port conflict resolution (see the sketch after this list)
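
The sketch below shows, under stated assumptions, how the monitor could implement three of the features above: benchmark auto-detection from the log folder path, attempt-number extraction from a log filename, and falling back to a free port when 8080 is busy. The filename pattern and function names are hypothetical, not taken from benchmark_monitor.py:

```python
# Illustrative sketch only; names and the "attempt_<n>" filename pattern are assumptions.
import re
import socket
from pathlib import Path

BENCHMARKS = ("gaia", "futurex", "xbench", "finsearchcomp")

def detect_benchmark(log_dir: Path) -> str | None:
    """Return the first known benchmark name appearing in the log folder path."""
    path_str = str(log_dir).lower()
    return next((name for name in BENCHMARKS if name in path_str), None)

def extract_attempt(log_name: str) -> int:
    """Parse an attempt number from a filename like 'task_042_attempt_2.json'."""
    match = re.search(r"attempt[_-](\d+)", log_name)
    return int(match.group(1)) if match else 1

def find_free_port(preferred: int = 8080) -> int:
    """Bind the preferred port if it is free; otherwise let the OS pick one."""
    with socket.socket() as sock:
        try:
            sock.bind(("", preferred))
        except OSError:
            sock.bind(("", 0))
        return sock.getsockname()[1]
```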

Documentation

  • Added monitor_guide.md - Web monitoring dashboard guide

Why?

Running long benchmarks (GAIA, FutureX, xbench, FinSearchComp) takes hours, and users need a way to:

  • Monitor real-time progress without constantly checking logs
  • Visualize task completion status with benchmark-specific metrics
  • Track performance metrics during execution (accuracy for GAIA, prediction rate for FutureX/xbench)
  • Generate comprehensive reports after completion
  • Use a unified monitoring system across all benchmarks instead of benchmark-specific solutions