An intelligent Python tool to extract and catalog software repositories from JOSS published papers
Features • Quick Start • Output • Usage • Statistics

    # 1. Open in Codespaces (click badge above)
    # 2. Install uv
    # 3. Create the virtual environment
    # 4. Activate the environment (Linux)
    pip install uv
    uv venv
    source .venv/bin/activate

    # Run only if the requirements file is present
    uv pip install -r requirements.txt

    # Run only if the requirements file is absent
    uv pip install requests beautifulsoup4

    python joss_extractor.py
    python helmholtzRSD_extractor.py

The script generates a timestamped CSV file with software repositories:

    software_repository
    "https://github.com/example/awesome-tool"
    "https://gitlab.com/research/data-analyzer"
    "https://codeberg.org/dev/ml-framework"

Output file names:

- `joss_repositories_YYYYMMDD_HHMMSS.csv`
- `Helmholtz_software_repositories_YYYYMMDD_HHMMSS.csv`

Example: `joss_repositories_20250805_143022.csv`
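
The naming scheme and quoting shown above can be reproduced with the standard library. The following is a minimal sketch, not the code in `joss_extractor.py`: the helper names `build_output_name` and `write_repositories` are hypothetical, and it assumes the header row is left unquoted while every URL is quoted, as in the sample.

```python
import csv
import io
from datetime import datetime


def build_output_name(prefix="joss_repositories", now=None):
    """Return a name such as joss_repositories_20250805_143022.csv."""
    now = now or datetime.now()
    return f"{prefix}_{now.strftime('%Y%m%d_%H%M%S')}.csv"


def write_repositories(handle, urls):
    """Write an unquoted header, then one explicitly quoted URL per row."""
    handle.write("software_repository\n")
    # QUOTE_ALL wraps every field in double quotes, matching the sample rows.
    writer = csv.writer(handle, quoting=csv.QUOTE_ALL, lineterminator="\n")
    for url in urls:
        writer.writerow([url])
    return len(urls)


# Demo on an in-memory buffer; a real run would open build_output_name() instead.
buffer = io.StringIO()
write_repositories(buffer, ["https://github.com/example/awesome-tool"])
print(build_output_name(now=datetime(2025, 8, 5, 14, 30, 22)))
print(buffer.getvalue(), end="")
```

Passing `prefix="Helmholtz_software_repositories"` would produce the second file name pattern listed above.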

    python joss_extractor.py
    python helmholtzRSD_extractor.py

Example console output:

    JOSS Papers Data Extractor
    ==================================================
    Started at: 2025-08-05 14:30:15
    Fetching page 1/156...
    Retrieved 20 papers (Total: 20)
    Fetching page 2/156...
    Retrieved 20 papers (Total: 40)
    ...
    ============================================================
    EXTRACTION SUMMARY
    ============================================================
    Total papers processed: 3,111
    Records written to CSV: 3,089
    Papers without repositories: 22
    Repository coverage: 99.3%
    Output file: joss_repositories_20250805_143022.csv
    Extraction completed at: 2025-08-05 14:32:18
    VERIFICATION:
    Processed 3,111 papers from API
    Wrote 3,089 repository URLs to CSV
    Data integrity: 3,089 + 22 = 3,111
    Total execution time: 123.4 seconds

| Metric | Typical Value |
|---|---|
| Total Papers | ~3,100+ |
| Repository Coverage | ~99% |
| Execution Time | 2-5 minutes |
| Output Size | ~200KB |
| API Pages | ~156 pages |
- Python 3.6+
- `requests` library
- Internet connection
- Base URL: `https://joss.theoj.org/papers/published.json`
- Pagination: 20 records per page
- Total Pages: ~156 pages
- Rate Limiting: 100ms delay between requests
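
The API details above suggest a simple paginated fetch loop. The sketch below is illustrative only and is not taken from `joss_extractor.py`. It assumes the endpoint accepts a `page` query parameter, returns a JSON list per page, and returns an empty list past the last page. The names `iter_papers` and `fetch_page_with_requests` are hypothetical.

```python
import time

BASE_URL = "https://joss.theoj.org/papers/published.json"
DELAY_SECONDS = 0.1  # the 100 ms delay between requests listed above


def fetch_page_with_requests(page):
    """Fetch one page of papers (assumes a `page` query parameter)."""
    import requests  # imported lazily so the loop below can be tested offline

    response = requests.get(BASE_URL, params={"page": page}, timeout=30)
    response.raise_for_status()
    return response.json()


def iter_papers(fetch_page=fetch_page_with_requests, delay=DELAY_SECONDS, max_pages=None):
    """Yield papers page by page until an empty page or max_pages is reached."""
    page = 1
    while max_pages is None or page <= max_pages:
        papers = fetch_page(page)
        if not papers:  # assumption: an empty list marks the end of the data
            break
        yield from papers
        page += 1
        time.sleep(delay)  # pause between requests to stay polite to the API
```

Calling `list(iter_papers(max_pages=2))` would request only the first two pages. Taking the fetch function as a parameter lets the loop be exercised with a stub instead of the live API.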
- Fetch all pages from JOSS API
- Filter papers with valid repository URLs
- Format URLs with explicit quotes
- Export to timestamped CSV file
- Verify data integrity
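
The filtering and verification steps above can be sketched as follows. This is a hedged illustration, not the project's actual logic. It assumes each paper is a dict with a `software_repository` key (the name is borrowed from the CSV header) and treats a URL as valid when it starts with `http://` or `https://`. The function names `split_by_repository` and `verify_counts` are hypothetical.

```python
def split_by_repository(papers, field="software_repository"):
    """Return (repository URLs, number of papers without a usable URL)."""
    urls = []
    missing = 0
    for paper in papers:
        url = str(paper.get(field) or "").strip()
        # Assumption: "valid" means an http(s) URL; the real rule may differ.
        if url.startswith(("http://", "https://")):
            urls.append(url)
        else:
            missing += 1
    return urls, missing


def verify_counts(total, written, missing):
    """Check written + missing == total and return coverage in percent."""
    if written + missing != total:
        raise ValueError(
            f"integrity check failed: {written} + {missing} != {total}"
        )
    return 100.0 * written / total if total else 0.0
```

With the figures from the example run, `verify_counts(3111, 3089, 22)` passes the integrity check and gives a coverage that rounds to 99.3%.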
This project was generated with the assistance of Claude AI. Contributions are welcome!
- Fork the repository
- Create your feature branch (`git checkout -b feature/AmazingFeature`)
- Commit your changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
- [TODO: Fix licence extraction logic for non-GitHub repositories]

This project is open source and available under the MIT License.
- JOSS - For providing the excellent API
- Claude AI - For assisting in code generation
- GitHub Codespaces - For seamless development environment