# Web Spider (Linux shell script)

A fast, polite, single-file Bash spider built around `wget --spider`.
It takes one or more **start targets** (URLs, hostnames, IPv4/IPv6 — with optional ports), crawls **only the same domains** by default, and writes a clean, de-duplicated list of discovered URLs to `./urls`.

You can aim it at videos, audio, images, pages, **or everything**, and optionally emit a `sitemap.txt` and/or `sitemap.xml`.

---

## Features

- **Respectful by default** — honors `robots.txt` unless you opt out.
- **Same-site only** — strict allowlist built from your inputs, so it won’t wander off the domains you give it.
- **Smart normalization**
  - Adds `https://` to scheme-less seeds (or use `--http` to default to HTTP).
  - Adds a trailing `/` to directory-like URLs (avoids `/dir` → `/dir/` redirect hiccups).
  - Fully supports IPv6 (`[2001:db8::1]:8443`).
- **Flexible output modes**
  - `--video` (default), `--audio`, `--images`, `--pages`, `--files`, or `--all`.
  - `--ext 'pat|tern'` to override any preset (e.g., `pdf|docx|xlsx`).
- **Status filter** — `--status-200` keeps only URLs that returned **HTTP 200 OK**.
- **Polite pacing** — `--delay SECONDS` + `--random-wait` (default **0.5s**).
- **Sitemaps** — `--sitemap-txt` and/or `--sitemap-xml` from the **final filtered set**.
- **Robust log parsing** — handles both `URL: http://…` and `URL:http://…`.
- **Single-dash synonyms** — `-video`, `-images`, `-all`, `-ext`, `-delay`, etc.

---

## Requirements

- Bash (arrays & `set -euo pipefail` support; Bash 4+ recommended)
- `wget`, `awk`, `sed`, `grep`, `sort`, `mktemp`, `paste` (standard GNU userland)
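
A quick way to confirm those tools are on your `PATH` (just a convenience check that mirrors the list above, not part of the script):

```bash
# Report any missing prerequisite
for t in wget awk sed grep sort mktemp paste; do
  command -v "$t" >/dev/null || echo "missing: $t"
done
```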

---

## Installation

```bash
# Clone or copy the script into your PATH
git clone https://github.com/Pryodon/Web-Spider-Linux-shell-script.git
cd Web-Spider-Linux-shell-script
chmod +x webspider
# optional: symlink as 'spider'
ln -s "$PWD/webspider" ~/bin/spider
```

## Quick Start
```
# Crawl one site (video mode by default) and write results to ./urls
webspider https://www.example.com/

# Crawl one site searching only for .mkv and .mp4 files
webspider --ext 'mkv|mp4' https://nyx.mynetblog.com/xc/

# Multiple seeds (scheme-less is OK; defaults to https)
webspider nyx.mynetblog.com www.mynetblog.com example.com

# From a file (one seed per line — URLs, hostnames, IPv4/IPv6 ok)
webspider seeds.txt
```

- Results:
  - `urls` — your filtered, unique URL list
  - `log` — verbose `wget` crawl log

By default the spider **respects robots**, stays on your domains, and returns **video files** only.

## Usage
```
webspider [--http|--https]
          [--video|--audio|--images|--pages|--files|--all]
          [--ext 'pat|tern'] [--delay SECONDS] [--status-200]
          [--no-robots]
          [--sitemap-txt] [--sitemap-xml]
          <links.txt | URL...>
```

### Modes (choose one; default is `--video`)
- `--video` : video files only
  - mp4|mkv|avi|mov|wmv|flv|webm|m4v|ts|m2ts
- `--audio` : audio files only
  - mp3|mpa|mp2|aac|wav|flac|m4a|ogg|opus|wma|alac|aif|aiff
- `--images` : image files only
  - jpg|jpeg|png|gif|webp|bmp|tiff|svg|avif|heic|heif
- `--pages` : directories (…/) + common page extensions
  - html|htm|shtml|xhtml|php|phtml|asp|aspx|jsp|jspx|cfm|cgi|pl|do|action|md|markdown
- `--files` : all files (excludes directories and `.html`/`.htm` pages)
- `--all` : everything (directories + pages + files)

### Options
- `--ext 'pat|tern'` : override the extension set used by `--video`/`--audio`/`--images`/`--pages`
  - Example: `--files --ext 'pdf|docx|xlsx'`
- `--delay S` : polite crawl delay in seconds (default: 0.5; decimals accepted); uses `wget --wait` + `--random-wait`
- `--status-200` : only keep URLs that returned HTTP 200 OK
- `--no-robots` : ignore robots.txt (default is to respect robots)
- `--sitemap-txt` : write `sitemap.txt` (plain newline-separated list of final URLs)
- `--sitemap-xml` : write `sitemap.xml` (Sitemaps.org XML from final URLs)
- `--http` | `--https` : default scheme for scheme-less seeds (default: `--https`)
- `-h` | `--help` : show usage

Single-dash forms work too: `-video`, `-images`, `-files`, `-all`, `-ext`, `-delay`, `-status-200`, `-no-robots`, etc.
Use `--` to end options if a seed path starts with a dash.

## Examples

### 1) Video crawl (default), with strict 200 OK and slower pacing
`webspider --status-200 --delay 1.0 https://www.example.com/`

### 2) Images only, write a simple text sitemap
```
webspider --images --sitemap-txt https://www.example.com/
# Produces: urls (images only) and sitemap.txt (same set)
```

### 3) Pages-only crawl for a classic site sitemap
```
webspider --pages --sitemap-xml https://www.example.com/
# Produces sitemap.xml containing directories and page-like URLs
```

### 4) Plain HTTP on a high port (IPv4), custom extensions
`webspider --http --files --ext 'pdf|epub|zip' 192.168.1.50:8080`

### 5) Mixed seeds and a seed file
`webspider --audio nyx.mynetblog.com/xc seeds.txt https://www.mynetblog.com/`

### 6) IPv6 with port
`webspider --images https://[2001:db8::1]:8443/gallery/`

### 7) Partially mirror a website
```
webspider --files https://www.example.com/some/path/

wget --no-host-directories --force-directories --no-clobber --cut-dirs=0 -i urls
```
The spider only searches under the `/some/path/` directory, so the mirror stays within that path.

## What counts as a “seed”?

### You can pass:
- **Full URLs** — `https://host/path`, `http://1.2.3.4:8080/dir/`
- **Hostnames** — `example.com`, `sub.example.com`
- **IPv4** — `10.0.0.5`, `10.0.0.5:8080/foo`
- **IPv6** — `[2001:db8::1]`, `[2001:db8::1]:8443/foo`

### Normalization rules:
- If there is no scheme: prefix the default (`https://`), or use `--http` to default to HTTP
- If it looks like a **directory** (no dot in the last path segment, and no `?`/`#`): append `/`
- The domain allowlist is built from the seeds. Bare domains also get a `www.` variant, though there is a known bug: only the root page on the `www.` host is listed. (A sketch of these rules follows below.)

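For illustration, the scheme and directory rules above can be approximated in a few lines of Bash. This is a minimal sketch of the idea, not the script's actual code; `normalize_seed` and `DEFAULT_SCHEME` are hypothetical names.

```bash
#!/usr/bin/env bash
# Hypothetical helper illustrating the normalization rules above
# (not taken from the script itself).
normalize_seed() {
  local s="$1" scheme="${DEFAULT_SCHEME:-https}"
  # Rule 1: prefix a scheme when none is present.
  [[ "$s" =~ ^https?:// ]] || s="${scheme}://${s}"
  # Rule 2: if the last path segment has no dot and there is no query/fragment,
  # treat it as a directory and append a trailing slash.
  local rest="${s#*://}" path="/"
  [[ "$rest" == */* ]] && path="/${rest#*/}"
  if [[ "$s" != *\?* && "$s" != *#* && "${path##*/}" != *.* && "$s" != */ ]]; then
    s="${s}/"
  fi
  printf '%s\n' "$s"
}

normalize_seed "example.com/downloads"    # -> https://example.com/downloads/
normalize_seed "10.0.0.5:8080/foo"        # -> https://10.0.0.5:8080/foo/
normalize_seed "https://host/file.mp4"    # -> https://host/file.mp4 (unchanged)
```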

---

## What gets crawled?

- The spider runs `wget --spider --recursive --no-parent --level=inf` on your seed set.
- It **stays on the same domains** (via `--domains=<comma-list>`), unless a seed is an IP/IPv6 literal (then `--domains` is skipped, and `wget` still naturally sticks to that host).
- It extracts every `URL:` line from the `log` file, strips query strings and fragments, de-duplicates, and then applies your mode filter (`--video`/`--audio`/`--images`/`--pages`/`--files`/`--all`). A minimal sketch of this step appears below.
- If `--status-200` is set, only URLs with an observed **HTTP 200 OK** are kept.

**Heads-up:** `wget --spider` generally uses **HEAD** requests where possible. Some servers don’t return 200 to HEAD even though GET would succeed. If filtering looks too strict, try without `--status-200`.
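
As a rough illustration of that extract-and-filter step (a sketch only; the script's own parsing is more thorough, and the `ext` value here is just the video preset):

```bash
# Sketch: pull URLs out of the verbose wget spider log and keep only video files.
# Not the script's exact code; the real filter also handles --pages, --files,
# --all and the --status-200 check.
ext='mp4|mkv|avi|mov|wmv|flv|webm|m4v|ts|m2ts'

grep -Eo 'URL:[[:space:]]?https?://[^[:space:]]+' log |  # "URL: http://…" or "URL:http://…"
  sed -E 's/^URL:[[:space:]]?//; s/[?#].*$//' |          # drop prefix, strip query/fragment
  grep -Ei "\.(${ext})$" |                               # keep the chosen extensions
  sort -u > urls                                         # de-duplicate into ./urls
```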

## Sitemaps
- `--sitemap-txt` → `sitemap.txt` (newline-delimited URLs)
- `--sitemap-xml` → `sitemap.xml` (Sitemaps.org format)

Both are generated from the **final filtered set** (`urls`).<br/>
For an SEO-style sitemap, use `--pages` (or `--all` if you really want everything).
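
For reference, a minimal Sitemaps.org file can be built from `urls` with a small loop like the one below. This is only a sketch of the idea, not the script's exact output (for instance, it does not XML-escape `&` in URLs):

```bash
# Minimal sitemap.xml from the final URL list (illustrative only).
{
  printf '<?xml version="1.0" encoding="UTF-8"?>\n'
  printf '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
  while IFS= read -r u; do
    printf '  <url><loc>%s</loc></url>\n' "$u"
  done < urls
  printf '</urlset>\n'
} > sitemap.xml
```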

## Performance & Politeness
- The default delay is `0.5` seconds, with `--random-wait` to jitter requests.
- Tune with `--delay 1.0` (or higher) for shared hosts or when you are rate-limited.
- Combine with `--status-200` to avoid collecting dead links.

Other `wget` knobs to consider (edit the script if you want to hard-wire them; see the sketch below):
- `--level=` to cap depth (the script currently uses `inf`)
- `--quota=` or `--reject=` patterns if you need to skip classes of files
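
For illustration, those knobs sit on the underlying `wget` call roughly like this (a sketch with placeholder values; the script's actual invocation may differ):

```bash
# Illustrative only; not the script's exact command line.
# --level caps crawl depth (the script uses inf), --quota caps total traffic,
# --reject skips classes of files by suffix/pattern.
wget --spider --recursive --no-parent \
  --level=3 --quota=50m --reject='*.iso,*.zip' \
  --wait=0.5 --random-wait \
  --domains=example.com,www.example.com \
  -o log https://www.example.com/
```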

## Security & Ethics
- Respect `robots.txt` (the default). Only use `--no-robots` when you **own** the host(s) or have permission.
- Be mindful of server load and your network's AUP. Increase `--delay` if unsure.

## Output Files
- `urls` — final, filtered, unique URLs (overwritten each run)
- `log` — full `wget` log (overwritten each run)
- Optional: `sitemap.txt`, `sitemap.xml` (when requested)

To keep results separate across runs, copy or rename `urls`, or run the script in a different directory.<br/>
It is easy to append the current list of URLs to another file:<br/>
`cat urls >>biglist`
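
You can also keep a dated copy before the next run overwrites it (a convenience suggestion, not something the script does for you; the timestamp format is arbitrary):<br/>
`cp urls "urls.$(date +%Y%m%d-%H%M%S)"`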

## Piping & Large seed sets
Passing a few dozen seeds on the command line is fine:<br/>
`webspider nyx.mynetblog.com www.mynetblog.com example.com`

For **very large** lists, avoid shell ARG_MAX limits:
```
# write to a file
generate_seeds > seeds.txt
webspider --images --status-200 seeds.txt

# or batch with xargs (runs webspider repeatedly with 100 seeds per call)
generate_seeds | xargs -r -n100 webspider --video --delay 0.8
```

## Troubleshooting

- **“Found no broken links.” but `urls` is empty**<br/>
  You likely hit `robots.txt` rules, or your mode filtered everything out.<br/>
  Try `--no-robots` (if permitted) and/or a different mode (e.g., `--all`).

- **Seeds without a trailing slash don’t crawl**<br/>
  The script appends `/` to directory-like paths; if you still see issues, make sure redirects aren’t blocked upstream.

- **`--status-200` drops too many URLs**<br/>
  Some servers don’t return 200 for HEAD. Re-run without `--status-200`.

- **IPv6 seeds**<br/>
  Always bracket: `https://[2001:db8::1]/`. The script helps, but explicit is best.

- **Off-site crawl**<br/>
  The allowlist comes from your seeds. If you seed `example.com`, it also allows `www.example.com` (subject to the known bug noted above: only the root page on the `www.` host is listed).<br/>
  If you see off-site URLs, confirm they truly share the same registrable domain, or seed more specifically (e.g., `sub.example.com/`).

---

## FAQ

**Can I mix HTTP and HTTPS?**<br/>
Yes. Provide the scheme per seed where needed, or use `--http` to default scheme-less seeds to HTTP.

**Will it download files?**<br/>
No. It runs `wget` in **spider mode** (HEAD/GET checks only) and writes URLs to the `urls` file.
To actually download the files listed in `urls`, do something like this:<br/>
`wget -i urls`<br/>
Or:<br/>
`wget --no-host-directories --force-directories --no-clobber --cut-dirs=0 -i urls`

**Can I make a “pages + files” hybrid?**<br/>
Use `--all` (includes everything), or `--files --ext 'html|htm|php|…'` if you want a files-only list that also includes page extensions.

**How do I only keep 200 OK pages in a search-engine sitemap?**<br/>
Use `--pages --status-200 --sitemap-xml`.


## Appendix: Preset extension lists
- Video: `mp4|mkv|avi|mov|wmv|flv|webm|m4v|ogv|ts|m2ts`
- Audio: `mp3|mpa|mp2|aac|wav|flac|m4a|ogg|opus|wma|alac|aif|aiff`
- Images: `jpg|jpeg|png|gif|webp|bmp|tiff|svg|avif|heic|heif`
- Pages: `html|htm|shtml|xhtml|php|phtml|asp|aspx|jsp|jspx|cfm|cgi|pl|do|action|md|markdown`

Override any of these with your own file extensions: `--ext 'pat|tern'`.