A comprehensive, Python-based JavaScript scraping and archiving tool built on Playwright. Designed for security researchers, bug bounty hunters, developers, and analysts to extract, filter, and save JavaScript files (external & inline) from any target website.
jsScraper allows you to scan web pages for JavaScript files, both external and inline, and archive them with options to:
- Filter out common libraries and tracking scripts
- Deduplicate using SHA-256 hashes
- Crawl internal pages
- Collect cross-origin resources (optional)
- Generate verbose output logs
- Process entire URL lists
Its powerful combination of asynchronous scraping, Playwright automation, and smart filtering makes it suitable for recon, compliance, forensics, and competitive intelligence.
| Feature | Description |
|---|---|
| External JS Collection | Captures all loaded `.js` files on the target page |
| Inline Script Parsing | Extracts `<script>` blocks from the HTML content |
| Filtering Engine | Removes tracking scripts, analytics, and known libraries using regex |
| Deduplication | Saves only unique scripts, based on their SHA-256 hash |
| Crawling | Optional crawling of internal links up to a specified depth |
| Cross-Origin Capture | Capture JS from third-party domains if required |
| Logging | Verbose log file (`verbose.log`) and clean CLI logging |
| Batch Mode | Accepts a list of target URLs from a file |
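As a rough illustration of the external and inline collection described above, here is a minimal sketch using Playwright's async API and BeautifulSoup. The function and variable names are illustrative assumptions, not jsScraper's actual internals.

```python
import asyncio

from bs4 import BeautifulSoup
from playwright.async_api import async_playwright

async def collect_scripts(url):
    """Return ({script_url: body_bytes}, [inline_source, ...]) for one page."""
    external, inline = {}, []
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()

        # Record every network response so JS bodies can be read after the load
        responses = []
        page.on("response", lambda r: responses.append(r))
        await page.goto(url, wait_until="networkidle")

        for r in responses:
            ctype = r.headers.get("content-type", "")
            if r.url.endswith(".js") or "javascript" in ctype:
                external[r.url] = await r.body()

        # Inline scripts: parse the rendered HTML and keep <script> tags without src
        soup = BeautifulSoup(await page.content(), "html.parser")
        for tag in soup.find_all("script", src=False):
            if tag.string:
                inline.append(tag.string)

        await browser.close()
    return external, inline

# Example: asyncio.run(collect_scripts("https://example.com"))
```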
Requirements:

- Python 3.8+
- Dependencies: `playwright`, `validators`, `beautifulsoup4`
Installation:

```bash
git clone https://github.com/exe249/jsScraper.git
cd jsScraper
pip install -r requirements.txt
playwright install
```

Scrape a single URL:

```bash
python jsScraper.py https://example.com
```

Batch mode (list of URLs from a file):

```bash
python jsScraper.py --url-file urls.txt
```

Full example with options:

```bash
python jsScraper.py https://example.com \
  --output output_dir \
  --filter strict \
  --min-size 200 \
  --crawl \
  --max-depth 2 \
  --cross-origin \
  --clear \
  --verbose
```

| Argument | Description |
|---|---|
| `url` | Target website to scrape (e.g., `https://site.com`) |
| `--url-file` | Path to a file with a list of URLs (overrides `url`) |
| `-o, --output` | Output directory (default: `getJsOutput`) |
| `--filter` | Filtering mode: `strict` (default) or `relaxed` |
| `--min-size` | Minimum file size in bytes (default: 150) |
| `--crawl` | Enable crawling of internal links |
| `--max-depth` | Max depth for crawling (default: 2) |
| `--cross-origin` | Include third-party JS |
| `--clear` | Clear the output folder before writing new data |
| `-t, --timeout` | Page timeout in seconds (default: 60) |
| `-r, --delay` | Delay between downloads in seconds (default: 0.5) |
| `-v, --verbose` | Enable verbose logging (saved to `verbose.log`) |
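To illustrate how `--crawl`, `--max-depth`, `-t/--timeout`, and `-r/--delay` fit together, here is a minimal sketch of depth-limited, same-origin crawling with Playwright's async API. It is an assumption-laden sketch, not jsScraper's actual crawl code.

```python
import asyncio
from urllib.parse import urljoin, urlparse

from playwright.async_api import async_playwright

async def crawl(start_url, max_depth=2, timeout=60, delay=0.5):
    """Visit internal links breadth-first, up to max_depth hops from start_url."""
    seen = {start_url}
    queue = [(start_url, 0)]
    origin = urlparse(start_url).netloc
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()
        while queue:
            url, depth = queue.pop(0)
            await page.goto(url, timeout=timeout * 1000)  # Playwright expects ms
            # ... collect external/inline JS from this page here ...
            if depth < max_depth:
                hrefs = await page.eval_on_selector_all(
                    "a[href]", "els => els.map(e => e.href)"
                )
                for href in hrefs:
                    absolute = urljoin(url, href)
                    # Only follow same-origin links that have not been queued yet
                    if urlparse(absolute).netloc == origin and absolute not in seen:
                        seen.add(absolute)
                        queue.append((absolute, depth + 1))
            await asyncio.sleep(delay)  # polite delay between page loads
        await browser.close()
    return seen

# Example: asyncio.run(crawl("https://example.com", max_depth=2))
```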
Files are saved as:
```
<output_dir>/<domain>/<filter_mode>/
├── example_com_main_f3ab23d4.js
├── example_com_inline_1a2b3c4d.js
├── ...
└── verbose.log
```

Each JS file is uniquely named using:
- Domain
- Path
- Content hash (SHA-256, first 8 chars)
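A minimal sketch of how such a name can be derived from the URL and content, where the first 8 hex characters of the SHA-256 digest double as the dedup key. The helper name and slug rules are assumptions; the real logic in jsScraper.py may differ.

```python
import hashlib
import re
from urllib.parse import urlparse

def script_filename(script_url, content):
    """Build '<domain>_<slug>_<hash8>.js' from a script URL and its raw bytes."""
    parsed = urlparse(script_url)
    domain = parsed.netloc.replace(".", "_")
    name = parsed.path.rsplit("/", 1)[-1] or "inline"
    name = re.sub(r"\.js$", "", name)             # drop the extension
    slug = re.sub(r"[^A-Za-z0-9]+", "_", name)    # filesystem-safe slug
    digest = hashlib.sha256(content).hexdigest()[:8]
    return f"{domain}_{slug}_{digest}.js"

# script_filename("https://example.com/main.js", b"console.log(1)")
# -> "example_com_main_<8-char-hash>.js"
```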
Use cases:

- Extract inline secrets, endpoints, or tokens
- Identify outdated/vulnerable JS libraries
- Use in bug bounty / recon workflows
- Archive all JS on a domain for future analysis
- Identify scripts used in past attacks or shady behavior
Filtering modes:

- `strict`: blocks most common analytics, CDNs, and libraries
- `relaxed`: allows more JS through (themes, plugins, etc.)

Custom patterns can be added to `UNINTERESTING_JS_STRICT` and `UNINTERESTING_JS_RELAXED` in the script.
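As a rough illustration of how such pattern lists can be applied, the sketch below tests regexes against each script URL. The patterns shown are made-up examples, not the tool's actual lists.

```python
import re

# Example patterns only; jsScraper ships its own UNINTERESTING_JS_STRICT list
UNINTERESTING_JS_STRICT = [
    r"google-analytics\.com",
    r"gtag/js",
    r"jquery[-.]?\d",        # versioned jQuery bundles
    r"cdn\.jsdelivr\.net",
]

def should_skip(url, patterns):
    """Return True if the script URL matches any filter pattern."""
    return any(re.search(p, url, re.IGNORECASE) for p in patterns)

# should_skip("https://www.google-analytics.com/analytics.js", UNINTERESTING_JS_STRICT)
# -> True
```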
Contents of `requirements.txt`:

```
playwright
validators
beautifulsoup4
```

MIT License: free to use, modify, and redistribute. See LICENSE for details.
Contributions are welcome, whether it's a feature idea, bug fix, optimization, or doc update!
- Open a pull request with your improvements
- Create an issue for bug reports or suggestions
- Discuss new ideas via issues or discussions
Planned features:

- Plugin engine (e.g., secrets detection, URL extraction, JS analysis)
- JS beautification / deobfuscation support
- JSON-based summary reports
- Docker wrapper for easy deployment
Developed by 249BUG, built for recon professionals, security analysts, and digital investigators.