A simple and efficient web crawler for Python.
- Recursively crawls web pages and extracts links starting from a root URL
- Concurrent workers and a configurable crawl delay
- Handles both relative and absolute URLs (see the sketch after this list)
- Designed with simplicity in mind, making it easy to use and extend for various web crawling tasks
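For context, link extraction hinges on resolving each discovered href against the page it was found on. Below is a minimal sketch of that general technique using only the standard library; it illustrates the idea, not tiny-web-crawler's internals:

```python
from urllib.parse import urljoin, urlparse

def resolve_link(page_url: str, href: str) -> str | None:
    """Resolve a (possibly relative) href against the page it was found on."""
    absolute = urljoin(page_url, href)  # e.g. '/about' -> 'https://example.com/about'
    scheme = urlparse(absolute).scheme
    # Skip non-crawlable schemes such as mailto: or javascript:
    return absolute if scheme in ("http", "https") else None

print(resolve_link("https://example.com/docs/", "../blog"))     # https://example.com/blog
print(resolve_link("https://example.com/", "mailto:hi@x.com"))  # None
```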
Install using pip:

```bash
pip install tiny-web-crawler
```

Basic usage:

```python
from tiny_web_crawler import Spider, SpiderSettings

settings = SpiderSettings(
    root_url='http://github.com',
    max_links=2
)

spider = Spider(settings)
spider.start()

# Set workers and delay (defaults: delay is 0.5 sec and verbose is True).
# If you do not want a delay, set delay=0.
settings = SpiderSettings(
    root_url='https://github.com',
    max_links=5,
    max_workers=5,
    delay=1,
    verbose=False
)

spider = Spider(settings)
spider.start()
```

Crawled output sample for https://github.com:
{ "http://github.com": { "urls": [ "http://github.com/", "https://githubuniverse.com/", "..." ], "https://github.com/solutions/ci-cd": { "urls": [ "https://github.com/solutions/ci-cd/", "https://githubuniverse.com/", "..." ] } } }Thank you for considering to contribute.
Thank you for considering contributing.

- If you are a first-time contributor, you can pick a good-first-issue and get started.
- Please feel free to ask questions.
- Before starting work on an issue, please get it assigned to you so that multiple people do not end up working on the same issue.
- We are working on our first major release. Please check this issue and see if anything interests you.
To set up a development environment:

- Install poetry on your system: `pipx install poetry`
- Clone the repo you forked
- Create a venv or use `poetry shell`
- Run `poetry install --with dev`
- Install the git hooks: `pre-commit install` and `pre-commit install --hook-type pre-push` (see the pre-commit documentation)
Before opening a pull request, please ensure:

- An issue exists or is created that addresses the PR
- Tests are written for the changes
- All lint and test checks pass
