WebCrawler

A simple web crawler based on requests-html, mainly intended for link validation testing.

Features

  • runs in BFS or DFS mode
  • configurable number of concurrent workers in BFS mode
  • multiple crawl seed URLs can be specified
  • supports crawling with cookies
  • configurable hyperlink regexes, including match and ignore patterns
  • groups visited URLs by HTTP status code
  • flexible configuration in YAML (see the sketch after this list)
  • sends test results by mail, via SMTP or the Mailgun service
  • supports cancelling jobs
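
The YAML config schema is not documented in this README, so the sketch below is only an assumption: the key names simply mirror the command-line flags listed under Usage.

# config.yml -- hypothetical layout, not a documented schema;
# key names are assumed to mirror the CLI flags
seeds: http://debugtalk.com|http://blog.debugtalk.com
include_hosts: static.debugtalk.com
cookies: 'lang:en,country:us|lang:zh,country:cn'
crawl_mode: bfs
max_depth: 5
concurrency: 20
save_results: NO

Pass the file with --config-file path/to/config.yml, as in the first example under Examples.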

Installation/Upgrade

$ pip install -U git+https://github.com/debugtalk/WebCrawler.git#egg=requests-crawler --process-dependency-links

To verify that the installation or upgrade succeeded, run webcrawler -V and check that the expected version numbers are reported.

$ webcrawler -V
jenkins-mail-py version: 0.2.4
WebCrawler version: 0.3.0

Usage

$ webcrawler -h
usage: webcrawler [-h] [-V] [--log-level LOG_LEVEL]
                  [--config-file CONFIG_FILE] [--seeds SEEDS]
                  [--include-hosts INCLUDE_HOSTS] [--cookies COOKIES]
                  [--crawl-mode CRAWL_MODE] [--max-depth MAX_DEPTH]
                  [--concurrency CONCURRENCY] [--save-results SAVE_RESULTS]
                  [--grey-user-agent GREY_USER_AGENT]
                  [--grey-traceid GREY_TRACEID]
                  [--grey-view-grey GREY_VIEW_GREY]
                  [--mailgun-api-id MAILGUN_API_ID]
                  [--mailgun-api-key MAILGUN_API_KEY]
                  [--mail-sender MAIL_SENDER]
                  [--mail-recepients [MAIL_RECEPIENTS [MAIL_RECEPIENTS ...]]]
                  [--mail-subject MAIL_SUBJECT] [--mail-content MAIL_CONTENT]
                  [--jenkins-job-name JENKINS_JOB_NAME]
                  [--jenkins-job-url JENKINS_JOB_URL]
                  [--jenkins-build-number JENKINS_BUILD_NUMBER]

A web crawler for testing website links validation.

optional arguments:
  -h, --help            show this help message and exit
  -V, --version         show version
  --log-level LOG_LEVEL
                        Specify logging level, default is INFO.
  --config-file CONFIG_FILE
                        Specify config file path.
  --seeds SEEDS         Specify crawl seed url(s), several urls can be
                        specified with pipe; if auth needed, seeds can be
                        specified like user1:pwd1@url1|user2:pwd2@url2
  --include-hosts INCLUDE_HOSTS
                        Specify extra hosts to be crawled.
  --cookies COOKIES     Specify cookies, several cookies can be joined by
                        '|'. e.g. 'lang:en,country:us|lang:zh,country:cn'
  --crawl-mode CRAWL_MODE
                        Specify crawl mode, BFS or DFS.
  --max-depth MAX_DEPTH
                        Specify max crawl depth.
  --concurrency CONCURRENCY
                        Specify concurrent workers number.
  --save-results SAVE_RESULTS
                        Specify if save results, default is NO.
  --grey-user-agent GREY_USER_AGENT
                        Specify grey environment header User-Agent.
  --grey-traceid GREY_TRACEID
                        Specify grey environment cookie traceid.
  --grey-view-grey GREY_VIEW_GREY
                        Specify grey environment cookie view_gray.
  --mailgun-api-id MAILGUN_API_ID
                        Specify mailgun api id.
  --mailgun-api-key MAILGUN_API_KEY
                        Specify mailgun api key.
  --mail-sender MAIL_SENDER
                        Specify email sender.
  --mail-recepients [MAIL_RECEPIENTS [MAIL_RECEPIENTS ...]]
                        Specify email recepients.
  --mail-subject MAIL_SUBJECT
                        Specify email subject.
  --mail-content MAIL_CONTENT
                        Specify email content.
  --jenkins-job-name JENKINS_JOB_NAME
                        Specify jenkins job name.
  --jenkins-job-url JENKINS_JOB_URL
                        Specify jenkins job url.
  --jenkins-build-number JENKINS_BUILD_NUMBER
                        Specify jenkins build number.

Examples

Crawl with options read from a config file.

$ webcrawler --seeds http://debugtalk.com --crawl-mode bfs --max-depth 5 --config-file path/to/config.yml

Crawl in BFS mode with 20 concurrent workers, and set maximum depth to 5.

$ webcrawler --seeds http://debugtalk.com --crawl-mode bfs --max-depth 5 --concurrency 20

Crawl in DFS mode, and set maximum depth to 10.

$ webcrawler --seeds http://debugtalk.com --crawl-mode dfs --max-depth 10
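
Crawl one site while also following links into an extra host, via --include-hosts (static.debugtalk.com below is a placeholder host):

$ webcrawler --seeds http://debugtalk.com --include-hosts static.debugtalk.com --crawl-mode bfs --max-depth 5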

Crawl several websites in BFS mode with 20 concurrent workers, and set maximum depth to 10.

$ webcrawler --seeds http://debugtalk.com,http://blog.debugtalk.com --crawl-mode bfs --max-depth 10 --concurrency 20
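
Crawl a seed that requires authentication by embedding credentials, following the user1:pwd1@url1 form from the help text (the credentials and host below are placeholders):

$ webcrawler --seeds user1:pwd1@http://staging.debugtalk.com --crawl-mode bfs --max-depth 3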

Crawl with different cookies.

$ webcrawler --seeds http://debugtalk.com --crawl-mode BFS --max-depth 10 --concurrency 50 --cookies 'lang:en,country:us|lang:zh,country:cn' 
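
Mail the test result through the Mailgun service by adding the mail flags from the help text (the API id/key, sender, and recipient below are placeholders):

$ webcrawler --seeds http://debugtalk.com --crawl-mode bfs --max-depth 5 --mailgun-api-id YOUR_API_ID --mailgun-api-key YOUR_API_KEY --mail-sender bot@debugtalk.com --mail-recepients team@debugtalk.com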

Supported Python Versions

WebCrawler supports Python 2.7, 3.3, 3.4, 3.5, and 3.6.

License

Open source licensed under the MIT license (see LICENSE file for details).
