A fast, polite, single-file Bash spider built around wget --spider.
It takes one or more start targets (URLs, hostnames, IPv4/IPv6 — with optional ports), crawls only the same domains by default, and writes a clean, de-duplicated list of discovered URLs to ./urls.
You can aim it at videos, audio, images, pages, or everything, and optionally emit a sitemap.txt and/or sitemap.xml.
- Respectful by default — honors `robots.txt` unless you opt out.
- Same-site only — strict allowlist built from your inputs, so it won’t wander off the domains you give it.
- Smart normalization
  - Adds `https://` to scheme-less seeds (or use `--http` to default to HTTP).
  - Adds a trailing `/` to directory-like URLs (avoids `/dir` → `/dir/` redirect hiccups).
  - Fully supports IPv6 (`[2001:db8::1]:8443`).
- Flexible output modes — `--video` (default), `--audio`, `--images`, `--pages`, `--files`, or `--all`; `--ext 'pat|tern'` overrides any preset (e.g., `pdf|docx|xlsx`).
- Status filter — `--status-200` keeps only URLs that returned HTTP 200 OK.
- Polite pacing — `--delay SECONDS` + `--random-wait` (default 0.5 s).
- Sitemaps — `--sitemap-txt` and/or `--sitemap-xml` built from the final filtered set.
- Robust log parsing — handles both `URL: http://…` and `URL:http://…`.
- Single-dash synonyms — `-video`, `-images`, `-all`, `-ext`, `-delay`, etc.
- Bash (arrays & `set -euo pipefail` support; Bash 4+ recommended)
- `wget`, `awk`, `sed`, `grep`, `sort`, `mktemp`, `paste` (standard GNU userland)
```bash
# Clone or copy the script into your PATH
git clone https://github.com/Pryodon/Web-Spider-Linux-shell-script.git
cd Web-Spider-Linux-shell-script
chmod +x webspider

# optional: symlink as 'spider'
ln -s "$PWD/webspider" ~/bin/spider
```

```bash
# Crawl one site (video mode by default) and write results to ./urls
webspider https://www.example.com/

# Crawl one site searching only for .mkv and .mp4 files
webspider --ext 'mkv|mp4' https://nyx.mynetblog.com/xc/

# Multiple seeds (scheme-less is OK; defaults to https)
webspider nyx.mynetblog.com www.mynetblog.com example.com

# From a file (one seed per line — URLs, hostnames, IPv4/IPv6 OK)
webspider seeds.txt
```

- Results:
  - `urls` — your filtered, unique URL list
  - `log` — verbose `wget` crawl log
By default the spider respects robots, stays on your domains, and returns video files only.
```
webspider [--http|--https] [--video|--audio|--images|--pages|--files|--all] [--ext 'pat|tern'] [--delay SECONDS] [--status-200] [--no-robots] [--sitemap-txt] [--sitemap-xml] <links.txt | URL...>
```

- `--video`: video files only — `mp4|mkv|avi|mov|wmv|flv|webm|m4v|ts|m2ts`
- `--audio`: audio files only — `mp3|mpa|mp2|aac|wav|flac|m4a|ogg|opus|wma|alac|aif|aiff`
- `--images`: image files only — `jpg|jpeg|png|gif|webp|bmp|tiff|svg|avif|heic|heif`
- `--pages`: directories (`…/`) + common page extensions — `html|htm|shtml|xhtml|php|phtml|asp|aspx|jsp|jspx|cfm|cgi|pl|do|action|md|markdown`
- `--files`: all files (excludes directories and `.html?` pages)
- `--all`: everything (directories + pages + files)
- `--ext 'pat|tern'`: override the extension set used by `--video`/`--audio`/`--images`/`--pages`. Example: `--files --ext 'pdf|docx|xlsx'`
- `--delay S`: polite crawl delay in seconds (default: 0.5); works with `--random-wait`
- `--status-200`: only keep URLs that returned HTTP 200 OK
- `--no-robots`: ignore `robots.txt` (default is to respect robots)
- `--http|--https`: default scheme for scheme-less seeds (default: `--https`)
- `-h|--help`: show usage

Single-dash forms work too: `-video`, `-images`, `-files`, `-all`, `-ext`, `-delay`, `-status-200`, `-no-robots`, etc.
```bash
# Keep only 200 OK URLs, with a 1-second delay
webspider --status-200 --delay 1.0 https://www.example.com/

# Produces: urls (images only) and sitemap.txt (same set)
webspider --images --sitemap-txt https://www.example.com/

# Produces sitemap.xml containing directories and page-like URLs
webspider --pages --sitemap-xml https://www.example.com/

# Scheme-less IPv4 seed, defaulting to HTTP, document files only
webspider --http --files --ext 'pdf|epub|zip' 192.168.1.50:8080

# Mix a path-only seed, a seeds file, and a full URL (audio mode)
webspider --audio nyx.mynetblog.com/xc seeds.txt https://www.mynetblog.com/

# Bracketed IPv6 seed with a port
webspider --images https://[2001:db8::1]:8443/gallery/

# Spider a path, then mirror the discovered files with wget
webspider --files https://www.example.com/some/path/
wget --no-host-directories --force-directories --no-clobber --cut-dirs=0 -i urls
```

Accepted seed formats:

- Full URLs: `https://host/path`, `http://1.2.3.4:8080/dir/`
- Hostnames: `example.com`, `sub.example.com`
- IPv4: `10.0.0.5`, `10.0.0.5:8080/foo`
- IPv6: `[2001:db8::1]`, `[2001:db8::1]:8443/foo`
- If a seed has no scheme: prefix it with the default (`https://`), or use `--http`.
- If a seed looks like a directory (no dot in the last path segment, and no `?`/`#`): append `/`.
- The domain allowlist is built from the seeds (the `www.` variant is auto-added for bare domains; however, there is a bug here: only the root page on the `www.` domain gets listed).
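A rough sketch of the kind of normalization described above (illustrative only; the function name `normalize_seed` and its logic are assumptions, not the script’s actual code):

```bash
normalize_seed() {
  local seed="$1" default_scheme="${2:-https}"
  # Add a scheme if the seed doesn't already have one
  [[ "$seed" =~ ^https?:// ]] || seed="${default_scheme}://${seed}"
  # Append a trailing slash when the last path segment looks like a directory
  # (no dot in it, and no query string or fragment)
  local path="${seed#*://}"; path="${path#*/}"
  if [[ "$seed" != *\?* && "$seed" != *#* && "${path##*/}" != *.* && "$seed" != */ ]]; then
    seed="${seed}/"
  fi
  printf '%s\n' "$seed"
}

normalize_seed "example.com/media"      # -> https://example.com/media/
normalize_seed "example.com/file.mp4"   # -> https://example.com/file.mp4
```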
- The spider runs `wget --spider --recursive --no-parent --level=inf` on your seed set.
- It stays on the same domains (via `--domains=<comma-list>`), unless a seed is an IP/IPv6 literal (then `--domains` is skipped, and `wget` still naturally sticks to that host).
- It extracts every `URL:` line from the `log` file, normalizes away query strings and fragments, dedupes, and then applies your mode filter (`--video`/`--audio`/`--images`/`--pages`/`--files`/`--all`).
- If `--status-200` is set, only URLs with an observed HTTP 200 OK are kept.
Heads-up: `wget --spider` generally uses HEAD requests where possible. Some servers don’t return 200 to HEAD even though GET would succeed. If filtering looks too strict, try again without `--status-200`.
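Under the hood, the flow roughly resembles this pipeline (a simplified sketch; the domains and seed URL are placeholders, and the real script adds mode filtering, status checks, and IPv6 handling):

```bash
# 1) Spider the seeds, logging everything wget visits
wget --spider --recursive --no-parent --level=inf \
     --wait=0.5 --random-wait \
     --domains=example.com,www.example.com \
     -o log https://www.example.com/

# 2) Pull every "URL:" line out of the log (with or without a space),
#    strip query strings/fragments, and dedupe
grep -oE 'URL: ?[^ ]+' log \
  | sed -E 's/^URL: ?//; s/[?#].*$//' \
  | sort -u > urls
```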
- `--sitemap-txt` → `sitemap.txt` (newline-delimited URLs)
- `--sitemap-xml` → `sitemap.xml` (Sitemaps.org format)

Both are generated from the final filtered set (`urls`).
For an SEO-style sitemap, use `--pages` (or `--all` if you really want everything).
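If you are curious what the XML flavor amounts to, a Sitemaps.org-format file can be produced from a plain URL list with something this small (a sketch, not the script’s exact code; it assumes a `urls` file in the current directory and does no XML escaping):

```bash
{
  printf '<?xml version="1.0" encoding="UTF-8"?>\n'
  printf '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
  # one <url> entry per line of the urls file
  while IFS= read -r url; do
    printf '  <url><loc>%s</loc></url>\n' "$url"
  done < urls
  printf '</urlset>\n'
} > sitemap.xml
```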
- The default delay is `0.5` seconds, with `--random-wait` to jitter requests.
- Tune with `--delay 1.0` (or higher) for shared hosts or when rate-limited.
- You can combine this with `--status-200` to avoid collecting dead links.

Other knobs to consider (edit the script if you want to hard-wire them; see the sketch below):

- `--level=` to cap depth (the script currently uses `inf`)
- `--quota=` or `--reject=` patterns if you need to skip classes of files
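These knobs are stock wget options; a hard-wired version of the crawl call might look like this (a sketch only — the script’s actual invocation may differ, and the URL is a placeholder):

```bash
# --level caps recursion depth (webspider currently uses --level=inf)
# --reject skips classes of files by suffix/pattern
# --quota stops the crawl after roughly the given amount of traffic
wget --spider --recursive --no-parent \
     --wait=1.0 --random-wait \
     --level=5 --reject='*.iso,*.zip' --quota=500m \
     -o log https://www.example.com/
```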
- Respect `robots.txt` (the default). Only use `--no-robots` when you own the host(s) or have permission.
- Be mindful of server load and your network AUP. Increase `--delay` if unsure.
- `urls` — final, filtered, unique URLs (overwritten each run)
- `log` — full `wget` log (overwritten each run)
- Optional: `sitemap.txt`, `sitemap.xml` (when requested)
To keep results separate across runs, copy or rename `urls`, or run the script in a different directory.
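A minimal sketch of the per-directory approach (the `runs/` layout and seed URL are just examples):

```bash
# Give each crawl its own directory so urls and log aren't overwritten
run_dir="runs/$(date +%Y%m%d-%H%M%S)"
mkdir -p "$run_dir"
(cd "$run_dir" && webspider https://www.example.com/)
```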
It is very easy to append the current `urls` list to another file:

```bash
cat urls >> biglist
```
Passing a few dozen seeds on the command line is fine:

```bash
webspider nyx.mynetblog.com www.mynetblog.com example.com
```

For very large lists, avoid shell ARG_MAX limits:

```bash
# write to a file
generate_seeds > seeds.txt
webspider --images --status-200 seeds.txt

# or batch with xargs (runs webspider repeatedly with 100 seeds per call)
generate_seeds | xargs -r -n100 webspider --video --delay 0.8
```
- “Found no broken links.” but `urls` is empty — You likely hit `robots.txt` rules, or your mode filtered everything out. Try `--no-robots` (if permitted) and/or a different mode (e.g., `--all`).
- Seeds without a trailing slash don’t crawl — The script appends `/` to directory-like paths; if you still see issues, make sure redirects aren’t blocked upstream.
- `--status-200` drops too many URLs — Some servers don’t return 200 for HEAD. Re-run without `--status-200`.
- IPv6 seeds — Always bracket them: `https://[2001:db8::1]/`. The script helps, but explicit is best.
- Off-site crawl — The allowlist comes from your seeds. If you seed `example.com`, it also allows `www.example.com` (the `www.` variant is auto-added for bare domains; however, there is a bug here: only the root page on the `www.` domain gets listed). If you see off-site URLs, confirm they truly share the same registrable domain, or seed more specifically (e.g., `sub.example.com/`).
Can I mix HTTP and HTTPS?
Yes. Provide the scheme per seed where needed, or use `--http` to default scheme-less seeds to HTTP.
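For instance (these hostnames are placeholders):

```bash
# explicit schemes win per seed; the scheme-less seed gets the default (https)
webspider http://legacy.example.org/ https://www.example.com/ intranet.example.net
```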
Will it download files?
No. It runs `wget` in spider mode (HEAD/GET checks only) and writes URLs to the `urls` file. To actually download the files listed in `urls`, do something like this:

```bash
wget -i urls
```

Or:

```bash
wget --no-host-directories --force-directories --no-clobber --cut-dirs=0 -i urls
```
Can I make a “pages + files” hybrid?
Use `--all` (includes everything), or `--files --ext 'html|htm|php|…'` if you want files only while still counting page extensions as files. See the example below.
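A hypothetical hybrid invocation (the extension list and URL are only examples):

```bash
# files only, but treat common page extensions as "files" too
webspider --files --ext 'pdf|zip|html|htm|php' https://www.example.com/
```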
How do I only keep 200 OK pages in a search-engine sitemap?
Use `--pages --status-200 --sitemap-xml`.
- Video: `mp4|mkv|avi|mov|wmv|flv|webm|m4v|ogv|ts|m2ts`
- Audio: `mp3|mpa|mp2|aac|wav|flac|m4a|ogg|opus|wma|alac|aif|aiff`
- Images: `jpg|jpeg|png|gif|webp|bmp|tiff|svg|avif|heic|heif`
- Pages: `html|htm|shtml|xhtml|php|phtml|asp|aspx|jsp|jspx|cfm|cgi|pl|do|action|md|markdown`

Override any of these with your own extension pattern: `--ext 'pat|tern'`.
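A minimal sketch of how such a preset can be applied as a post-crawl filter (an assumption about the approach, not the script’s exact code; it assumes a `urls` file in the current directory):

```bash
# Keep only URLs whose path ends in one of the preset extensions
EXT='mp4|mkv|webm'               # e.g. supplied via --ext 'mp4|mkv|webm'
grep -Ei "\.(${EXT})$" urls > urls.filtered
```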