
Commit 74ccb96

Update README.md
rewrote the readme
1 parent 681900c commit 74ccb96

1 file changed: README.md (238 additions, 60 deletions)

# Web Spider (Linux shell script)

A fast, polite, single-file Bash spider built around `wget --spider`.
It takes one or more **start targets** (URLs, hostnames, IPv4/IPv6 — with optional ports), crawls **only the same domains** by default, and writes a clean, de-duplicated list of discovered URLs to `./urls`.

You can aim it at videos, audio, images, pages, **or everything**, and optionally emit a `sitemap.txt` and/or `sitemap.xml`.

---

## Features

- **Respectful by default** — honors `robots.txt` unless you opt out.
- **Same-site only** — strict allowlist built from your inputs, so it won’t wander off the domains you give it.
- **Smart normalization**
  - Adds `https://` to scheme-less seeds (or use `--http` to default to HTTP).
  - Adds trailing `/` to directory-like URLs (avoids `/dir → /dir/` redirect hiccups).
  - Fully supports IPv6 (`[2001:db8::1]:8443`).
- **Flexible output modes**
  - `--video` (default), `--audio`, `--images`, `--pages`, `--files`, or `--all`.
  - `--ext 'pat|tern'` to override any preset (e.g., `pdf|docx|xlsx`).
- **Status filter** — `--status-200` keeps only URLs that returned **HTTP 200 OK**.
- **Polite pacing** — `--delay SECONDS` + `--random-wait` (default **0.5s**).
- **Sitemaps** — `--sitemap-txt` and/or `--sitemap-xml` from the **final filtered set**.
- **Robust log parsing** — handles both `URL: http://…` and `URL:http://…`.
- **Single-dash synonyms** — `-video`, `-images`, `-all`, `-ext`, `-delay`, etc.

---
## Requirements

- Bash (arrays & `set -euo pipefail` support; Bash 4+ recommended)
- `wget`, `awk`, `sed`, `grep`, `sort`, `mktemp`, `paste` (standard GNU userland)

---

## Installation

```bash
# Clone or copy the script into your PATH
git clone https://github.com/Pryodon/Web-Spider-Linux-shell-script.git
cd Web-Spider-Linux-shell-script
chmod +x webspider
# optional: symlink as 'spider'
ln -s "$PWD/webspider" ~/bin/spider
```

## Quick Start
```
# Crawl one site (video mode by default) and write results to ./urls
webspider https://www.example.com/

# Crawl one site searching only for .mkv and .mp4 files.
webspider --ext 'mkv|mp4' https://nyx.mynetblog.com/xc/

# Multiple seeds (scheme-less is OK; defaults to https)
webspider nyx.mynetblog.com www.mynetblog.com example.com

# From a file (one seed per line — URLs, hostnames, IPv4/IPv6 ok)
webspider seeds.txt
```

- Results:
  - `urls` — your filtered, unique URL list
  - `log` — verbose `wget` crawl log

By default the spider **respects robots**, stays on your domains, and returns **video files** only.
## Usage
```
webspider [--http|--https]
          [--video|--audio|--images|--pages|--files|--all]
          [--ext 'pat|tern'] [--delay SECONDS] [--status-200]
          [--no-robots]
          [--sitemap-txt] [--sitemap-xml]
          <links.txt | URL...>
```

### Modes (choose one; default is --video)
- `--video` : video files only
  - mp4|mkv|avi|mov|wmv|flv|webm|m4v|ts|m2ts
- `--audio` : audio files only
  - mp3|mpa|mp2|aac|wav|flac|m4a|ogg|opus|wma|alac|aif|aiff
- `--images` : image files only
  - jpg|jpeg|png|gif|webp|bmp|tiff|svg|avif|heic|heif
- `--pages` : directories (…/) + common page extensions
  - html|htm|shtml|xhtml|php|phtml|asp|aspx|jsp|jspx|cfm|cgi|pl|do|action|md|markdown
- `--files` : all files (excludes directories and `.html`/`.htm` pages)
- `--all` : everything (directories + pages + files)

### Options
- `--ext 'pat|tern'` : override extension set used by `--video`/`--audio`/`--images`/`--pages`.
  - Example: `--files --ext 'pdf|docx|xlsx'`
- `--delay S` : polite crawl delay in seconds (default: 0.5); works with `--random-wait`
- `--status-200` : only keep URLs that returned HTTP 200 OK
- `--no-robots` : ignore robots.txt (default is to respect robots)
- `--http` | `--https` : default scheme for scheme-less seeds (default: `--https`)
- `-h` | `--help` : show usage

Single-dash forms work too: `-video`, `-images`, `-files`, `-all`, `-ext`, `-delay`, `-status-200`, `-no-robots`, etc.

## Examples
### 1) Video crawl (default), with strict 200 OK and slower pacing
`webspider --status-200 --delay 1.0 https://www.example.com/`

### 2) Images only, write a simple text sitemap
```
webspider --images --sitemap-txt https://www.example.com/
# Produces: urls (images only) and sitemap.txt (same set)
```

### 3) Pages-only crawl for a classic site sitemap
```
webspider --pages --sitemap-xml https://www.example.com/
# Produces sitemap.xml containing directories and page-like URLs
```

### 4) Plain HTTP on a high port (IPv4), custom extensions
`webspider --http --files --ext 'pdf|epub|zip' 192.168.1.50:8080`

### 5) Mixed seeds and a seed file
`webspider --audio nyx.mynetblog.com/xc seeds.txt https://www.mynetblog.com/`

### 6) IPv6 with port
`webspider --images https://[2001:db8::1]:8443/gallery/`

### 7) Partially mirror a website
```
webspider --files https://www.example.com/some/path/

wget --no-host-directories --force-directories --no-clobber --cut-dirs=0 -i urls
```
The spider only searches under the `/some/path/` directory; `wget -i urls` then downloads what it found.
## What counts as a “seed”?

### You can pass:
- **Full URLs:** `https://host/path`, `http://1.2.3.4:8080/dir/`
- **Hostnames:** `example.com`, `sub.example.com`
- **IPv4:** `10.0.0.5`, `10.0.0.5:8080/foo`
- **IPv6:** `[2001:db8::1]`, `[2001:db8::1]:8443/foo`

### Normalization rules:
- If no scheme: prefix with the default (`https://`) or use `--http`
- If it looks like a **directory** (no dot in the last path segment, and no `?`/`#`): append `/` (see the sketch below)
- The domain allowlist is built from the seeds. A `www.` variant is auto-added for bare domains, but there is a known bug: only the root page is listed on the `www.` variant.

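A minimal Bash sketch of the first two rules (illustrative only; `normalize_seed` and `DEFAULT_SCHEME` are hypothetical names, not the script's actual code):

```bash
#!/usr/bin/env bash
# Illustrative sketch of the normalization rules above.
# `normalize_seed` and `DEFAULT_SCHEME` are hypothetical names, not taken from the script.
DEFAULT_SCHEME="https"   # becomes "http" when --http is given

normalize_seed() {
  local seed="$1"

  # Rule: prefix scheme-less seeds with the default scheme.
  if [[ "$seed" != http://* && "$seed" != https://* ]]; then
    seed="${DEFAULT_SCHEME}://${seed}"
  fi

  # Leave URLs with a query or fragment alone.
  if [[ "$seed" == *\?* || "$seed" == *"#"* ]]; then
    printf '%s\n' "$seed"
    return
  fi

  # Rule: append '/' when the seed is a bare host, or the last path
  # segment has no dot (i.e., it looks like a directory).
  local rest="${seed#*://}"
  if [[ "$rest" != */* ]]; then
    seed="${seed}/"
  else
    local last="${rest##*/}"
    if [[ -z "$last" || "$last" != *.* ]]; then
      [[ "$seed" == */ ]] || seed="${seed}/"
    fi
  fi

  printf '%s\n' "$seed"
}

normalize_seed "example.com"               # https://example.com/
normalize_seed "10.0.0.5:8080/foo"         # https://10.0.0.5:8080/foo/
normalize_seed "http://example.com/a.mp4"  # http://example.com/a.mp4 (unchanged)
```
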
<hr>
## What gets crawled?

- The spider runs `wget --spider --recursive --no-parent --level=inf` on your seed set.
- It **stays on the same domains** (via `--domains=<comma-list>`), unless a seed is an IP/IPv6 literal (then `--domains` is skipped and `wget` still naturally sticks to that host).
- It extracts every `URL:` line from the `log` file, normalizes away queries/fragments, dedupes, and then applies your mode filter (`--video`/`--audio`/`--images`/`--pages`/`--files`/`--all`); a rough equivalent is sketched below.
- If `--status-200` is set, only URLs with an observed **HTTP 200 OK** are kept.
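
For reference, a roughly equivalent hand-rolled pipeline for one seed looks like this (illustrative only; the script's exact flags, regexes, and temp handling may differ):

```bash
# Spider one seed in video mode and rebuild ./urls by hand; a sketch of the
# steps above, not the script's actual code.
seed="https://www.example.com/"
ext='mp4|mkv|avi|mov|wmv|flv|webm|m4v|ts|m2ts'

# 1) Crawl politely; wget writes its verbose output to ./log.
wget --spider --recursive --no-parent --level=inf \
     --wait=0.5 --random-wait \
     --domains=www.example.com,example.com \
     -o log "$seed"

# 2) Pull every discovered URL out of the log (both "URL: http..." and "URL:http..."),
#    drop queries/fragments, dedupe, then keep only the chosen extensions.
grep -oE 'URL: ?https?://[^ ]+' log \
  | sed -E 's/^URL: ?//; s/[?#].*$//' \
  | sort -u \
  | grep -Ei "\.(${ext})$" > urls
```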

**Heads-up:** `wget --spider` generally uses **HEAD** requests where possible. Some servers don’t return 200 to HEAD even though GET would succeed. If filtering looks too strict, try without `--status-200`.

## Sitemaps
- `--sitemap-txt` → `sitemap.txt` (newline-delimited URLs)
- `--sitemap-xml` → `sitemap.xml` (Sitemaps.org format)

Both are generated from the **final filtered set** (`urls`).<br/>
For an SEO-style sitemap, use `--pages` (or `--all` if you really want everything).

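Conceptually, the XML output is just a Sitemaps.org wrapper around the `urls` list. A minimal sketch (illustrative, not the script's actual code; a real sitemap should also XML-escape characters like `&` in URLs):

```bash
# Build a bare-bones sitemap.xml from the final ./urls list.
{
  printf '<?xml version="1.0" encoding="UTF-8"?>\n'
  printf '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
  while IFS= read -r url; do
    printf '  <url><loc>%s</loc></url>\n' "$url"
  done < urls
  printf '</urlset>\n'
} > sitemap.xml
```
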
## Performance & Politeness
- Default delay is `0.5` seconds with `--random-wait` to jitter requests.
- Tune with `--delay 1.0` (or higher) for shared hosts or when rate-limited.
- You can combine with `--status-200` to avoid collecting dead links.

Other knobs to consider (edit the script if you want to hard-wire them):
- `--level=` to cap depth (the script currently uses `inf`)
- `--quota=` or `--reject=` patterns if you need to skip classes of files

## Security & Ethics
- Respect `robots.txt` (default). Only use `--no-robots` when you **own** the host(s) or have permission.
- Be mindful of server load and your network AUP. Increase `--delay` if unsure.

## Output Files
- `urls` — final, filtered, unique URLs (overwritten each run)
- `log` — full `wget` log (overwritten each run)
- Optional: `sitemap.txt`, `sitemap.xml` (when requested)

To keep results separate across runs, copy/rename `urls` or run the script in a different directory.
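For instance (illustrative commands; the dated filename and `crawls/` directory are just examples):

```bash
# Archive this run's list with a timestamp
cp urls "urls.$(date +%Y%m%d-%H%M%S)"

# ...or give each site its own working directory
mkdir -p crawls/example.com
cd crawls/example.com && webspider https://www.example.com/
```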

It is also easy to append the current list of URLs to another file:<br/>
`cat urls >>biglist`

## Piping & Large seed sets
Passing a few dozen seeds on the command line is fine:<br/>
`webspider nyx.mynetblog.com www.mynetblog.com example.com`

For **very large** lists, avoid shell ARG_MAX limits:
```
# write to a file
generate_seeds > seeds.txt
webspider --images --status-200 seeds.txt

# or batch with xargs (runs webspider repeatedly with 100 seeds per call)
generate_seeds | xargs -r -n100 webspider --video --delay 0.8
```

## Troubleshooting

- **“Found no broken links.” but `urls` is empty**<br/>
  You likely hit `robots.txt` rules, or your mode filtered everything out.<br/>
  Try `--no-robots` (if permitted) and/or a different mode (e.g., `--all`).

- **Seeds without a trailing slash don’t crawl**<br/>
  The script appends `/` to directory-like paths; if you still see issues, make sure redirects aren’t blocked upstream.

- **`--status-200` drops too many URLs**<br/>
  Some servers don’t return 200 for HEAD. Re-run without `--status-200` (see the quick check after this list).

- **IPv6 seeds**<br/>
  Always bracket the address: `https://[2001:db8::1]/`. The script helps, but explicit is best.

- **Off-site crawl**<br/>
  The allowlist comes from your seeds. If you seed `example.com`, the `www.example.com` variant is also allowed (with the known bug that only its root page gets listed).<br/>
  If you see off-site URLs, confirm they truly share the same registrable domain, or seed more specifically (e.g., `sub.example.com/`).

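One quick way to see whether a server answers HEAD differently from GET (illustrative; `-S` just prints the server's response headers):

```bash
# HEAD-style check, as the spider typically performs
wget --spider -S https://www.example.com/video.mp4

# Full GET of the same URL, discarding the body
wget -S -O /dev/null https://www.example.com/video.mp4
```
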
<hr>
## FAQ

**Can I mix HTTP and HTTPS?**<br/>
Yes. Provide the scheme per seed where needed, or use `--http` to default scheme-less seeds to HTTP.

**Will it download files?**<br/>
No. It runs `wget` in **spider mode** (HEAD/GET checks only) and writes the discovered URLs to the `urls` file.
To actually download the files listed in `urls`, do something like this:<br/>
`wget -i urls`<br/>
Or:<br/>
`wget --no-host-directories --force-directories --no-clobber --cut-dirs=0 -i urls`

**Can I make a “pages + files” hybrid?**<br/>
Use `--all` (includes everything), or `--files --ext 'html|htm|php|…'` if you want files only but with page extensions included.

**How do I keep only 200 OK pages in a search-engine sitemap?**<br/>
Use `--pages --status-200 --sitemap-xml`.

## Appendix: Preset extension lists
- Video: `mp4|mkv|avi|mov|wmv|flv|webm|m4v|ogv|ts|m2ts`
- Audio: `mp3|mpa|mp2|aac|wav|flac|m4a|ogg|opus|wma|alac|aif|aiff`
- Images: `jpg|jpeg|png|gif|webp|bmp|tiff|svg|avif|heic|heif`
- Pages: `html|htm|shtml|xhtml|php|phtml|asp|aspx|jsp|jspx|cfm|cgi|pl|do|action|md|markdown`

Override any of these with your own file extensions: `--ext 'pat|tern'`.