DotnetSpider

DotnetSpider is a .NET Standard web crawling library similar to WebMagic and Scrapy. It is a lightweight, efficient, and fast high-level web crawling and scraping framework for .NET.

DESIGN

DEVELOPMENT ENVIRONMENT

  • Visual Studio 2017 (15.3 or later)
  • .NET Core 2.0 or later
  • MySQL, used to store data. Download MySQL, then grant privileges to the local root account:

        grant all on *.* to 'root'@'localhost' IDENTIFIED BY '' with grant option;
        flush privileges;

OPTIONAL ENVIRONMENT

MORE DOCUMENTS

https://github.com/dotnetcore/DotnetSpider/wiki

SAMPLES

Please see the DotnetSpider.Sample project in the solution.

BASIC USAGE

Basic usage code

ADDITIONAL USAGE: Configurable Entity Spider

View the complete code

public class EntityModelSpider
{
    public static void Run()
    {
        Spider spider = new Spider();
        spider.Run();
    }

    private class Spider : EntitySpider
    {
        protected override void OnInit(params string[] arguments)
        {
            // Search keywords ("cola|sprite"), also attached to the request as "Keyword".
            var word = "可乐|雪碧";
            AddRequest(string.Format("http://news.baidu.com/ns?word={0}&tn=news&from=news&cl=2&pn=0&rn=20&ct=1", word), new Dictionary<string, dynamic> { { "Keyword", word } });
            AddEntityType<BaiduSearchEntry>();
            AddPipeline(new ConsoleEntityPipeline());
        }

        // Each div with class 'result' on the page is mapped to one entity.
        [Schema("baidu", "baidu_search_entity_model")]
        [Entity(Expression = ".//div[@class='result']", Type = SelectorType.XPath)]
        class BaiduSearchEntry : BaseEntity
        {
            [Column]
            [Field(Expression = "Keyword", Type = SelectorType.Enviroment)]
            public string Keyword { get; set; }

            [Column]
            [Field(Expression = ".//h3[@class='c-title']/a")]
            [ReplaceFormatter(NewValue = "", OldValue = "<em>")]
            [ReplaceFormatter(NewValue = "", OldValue = "</em>")]
            public string Title { get; set; }

            [Column]
            [Field(Expression = ".//h3[@class='c-title']/a/@href")]
            public string Url { get; set; }

            [Column]
            [Field(Expression = ".//div/p[@class='c-author']/text()")]
            [ReplaceFormatter(NewValue = "-", OldValue = "&nbsp;")]
            public string Website { get; set; }

            [Column]
            [Field(Expression = ".//div/span/a[@class='c-cache']/@href")]
            public string Snapshot { get; set; }

            [Column]
            [Field(Expression = ".//div[@class='c-summary c-row ']", Option = FieldOptions.InnerText)]
            [ReplaceFormatter(NewValue = "", OldValue = "<em>")]
            [ReplaceFormatter(NewValue = "", OldValue = "</em>")]
            [ReplaceFormatter(NewValue = " ", OldValue = "&nbsp;")]
            public string Details { get; set; }

            [Column(Length = 0)]
            [Field(Expression = ".", Option = FieldOptions.InnerText)]
            [ReplaceFormatter(NewValue = "", OldValue = "<em>")]
            [ReplaceFormatter(NewValue = "", OldValue = "</em>")]
            [ReplaceFormatter(NewValue = " ", OldValue = "&nbsp;")]
            public string PlainText { get; set; }
        }
    }
}

public static void Main()
{
    EntityModelSpider.Run();
}

Run via Startup

Command: -s:[spider type name | TaskName attribute] -i:[identity] -a:[arg1,arg2...] --tid:[taskId] -n:[name] -c:[configuration file path or name]
  1. -s: The type name of the spider, or the name given in its TaskName attribute, for example: DotnetSpider.Sample.BaiduSearchSpider
  2. -i: Sets the identity.
  3. -a: Passes arguments to the spider's Run method.
  4. --tid: Sets the task id.
  5. -n: Sets the name.
  6. -c: Sets the configuration file path or name. For example, to run with a custom configuration: -c:app.my.config
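
The option syntax listed above can be illustrated with a short sketch. This is not DotnetSpider's own Startup parser: the class `StartupArgs` and its methods `Parse` and `SpiderArguments` are hypothetical names introduced only for illustration. The sketch assumes that every option is a single `-key:value` token and that the `-a` value is a comma-separated list.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of DotnetSpider: shows how tokens such as
// "-s:TypeName" or "--tid:42" could be split into option/value pairs.
public static class StartupArgs
{
    public static Dictionary<string, string> Parse(string[] args)
    {
        var options = new Dictionary<string, string>();
        foreach (var arg in args)
        {
            // Skip tokens that do not look like "-key:value".
            if (!arg.StartsWith("-", StringComparison.Ordinal))
            {
                continue;
            }
            int colon = arg.IndexOf(':');
            if (colon < 0)
            {
                continue;
            }
            // Split at the first colon only, so a value such as a Windows
            // path ("C:\conf\app.my.config") keeps its own colons.
            string key = arg.Substring(0, colon);
            string value = arg.Substring(colon + 1);
            options[key] = value;
        }
        return options;
    }

    // Assumption: the -a value is a comma-separated argument list.
    public static string[] SpiderArguments(Dictionary<string, string> options)
    {
        string raw;
        if (options.TryGetValue("-a", out raw) && raw.Length > 0)
        {
            return raw.Split(',');
        }
        return new string[0];
    }
}
```

With this sketch, calling `StartupArgs.Parse` on the tokens `-s:DotnetSpider.Sample.BaiduSearchSpider` and `-a:arg1,arg2` yields a dictionary with the keys `-s` and `-a`, and `SpiderArguments` then splits the `-a` value into `arg1` and `arg2`. Splitting at the first colon is a deliberate choice so that configuration paths containing colons survive intact.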

WebDriver Support

To collect a page whose content is loaded by JavaScript, you only need to do one thing: set the downloader to WebDriverDownloader.

Downloader=new WebDriverDownloader(Browser.Chrome); 

See a complete sample

NOTE:

  1. Make sure there is a ChromeDriver.exe in the bin folder when you use Chrome. You can install it in your project via the NuGet package manager: Chromium.ChromeDriver
  2. Make sure you have already added a *.webdriver Firefox profile when you use Firefox: https://support.mozilla.org/en-US/kb/profile-manager-create-and-remove-firefox-profiles
  3. Make sure there is a PhantomJS.exe in the bin folder when you use PhantomJS. You can install it in your project via the NuGet package manager: PhantomJS

Storing logs and status in a database

DotnetSpider.Hub

https://github.com/zlzforever/DotnetSpider.Hub

  1. Depends on a CI platform; for example, I currently use TeamCity.
  2. Depends on Scheduler.NET: https://github.com/zlzforever/Scheduler.NET
  3. More documentation to come...

NOTICE

When you use the Redis scheduler, please update your Redis config:

timeout 0
tcp-keepalive 60

Buy me a coffee

AREAS FOR IMPROVEMENT

QQ Group: 477731655
Email: zlzforever@163.com
