spider-rs

The spider project ported to Node.js

Getting Started

  1. npm i @spider-rs/spider-rs --save
import { Website, pageTitle } from '@spider-rs/spider-rs'

const website = new Website('https://rsseau.fr')
  .withHeaders({
    authorization: 'somerandomjwt',
  })
  .withBudget({
    '*': 20, // limit max request 20 pages for the website
    '/docs': 10, // limit only 10 pages on the `/docs` paths
  })
  .withBlacklistUrl(['/resume']) // regex or pattern matching to ignore paths
  .build()

// optional: page event handler
const onPageEvent = (_err, page) => {
  const title = pageTitle(page) // comment out to increase performance if title not needed
  console.info(`Title of ${page.url} is '${title}'`)
  website.pushData({
    status: page.statusCode,
    html: page.content,
    url: page.url,
    title,
  })
}

await website.crawl(onPageEvent)
await website.exportJsonlData('./storage/rsseau.jsonl')
console.log(website.getLinks())

Collect the resources for a website.

import { Website } from '@spider-rs/spider-rs'

const website = new Website('https://rsseau.fr')
  .withBudget({
    '*': 20,
    '/docs': 10,
  })
  // you can use regex or string matches to ignore paths
  .withBlacklistUrl(['/resume'])
  .build()

await website.scrape()
console.log(website.getPages())
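
If you want to inspect the collected resources afterwards, a minimal sketch like the one below walks the scrape result. It assumes the objects returned by getPages() expose the same url and content fields as the pages passed to crawl event handlers above; adjust the field names if your version differs.

import { Website } from '@spider-rs/spider-rs'

const website = new Website('https://rsseau.fr')
  .withBudget({ '*': 20 })
  .build()

await website.scrape()

// assumption: each page returned by getPages() carries `url` and `content`,
// like the pages delivered to the crawl event handlers above
for (const page of website.getPages()) {
  console.log(page.url, page.content?.length ?? 0)
}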

Run the crawls in the background on another thread.

import { Website } from '@spider-rs/spider-rs'

const website = new Website('https://rsseau.fr')

const onPageEvent = (_err, page) => {
  console.log(page)
}

await website.crawl(onPageEvent, true) // runs immediately

Use headless Chrome rendering for crawls.

import { Website } from '@spider-rs/spider-rs'

const website = new Website('https://rsseau.fr').withChromeIntercept(true, true)

const onPageEvent = (_err, page) => {
  console.log(page)
}

// the third param determines headless chrome usage.
await website.crawl(onPageEvent, false, true)
console.log(website.getLinks())
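
Rendered pages can be stored the same way as in the Getting Started example. The sketch below assumes that withChromeIntercept chains with the other builder methods and that pushData and exportJsonlData behave the same when the content comes from headless Chrome; the output path is illustrative.

import { Website, pageTitle } from '@spider-rs/spider-rs'

const website = new Website('https://rsseau.fr')
  .withChromeIntercept(true, true) // assumption: chains with the other builder methods
  .withBudget({ '*': 20 })
  .build()

const onPageEvent = (_err, page) => {
  // store the rendered page; assumes the same page fields as the plain crawl
  website.pushData({
    status: page.statusCode,
    html: page.content,
    url: page.url,
    title: pageTitle(page),
  })
}

// third param enables headless Chrome, as above
await website.crawl(onPageEvent, false, true)
await website.exportJsonlData('./storage/rsseau-chrome.jsonl')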

Cron jobs can be set up as follows.

import { Website } from '@spider-rs/spider-rs'

const website = new Website('https://choosealicense.com').withCron('1/5 * * * * *')

// sleep function to test cron
const stopCron = (time: number, handle) => {
  return new Promise((resolve) => {
    setTimeout(() => {
      resolve(handle.stop())
    }, time)
  })
}

const links = []

const onPageEvent = (err, value) => {
  links.push(value)
}

const handle = await website.runCron(onPageEvent)

// stop the cron in 4 seconds
await stopCron(4000, handle)

Use the crawl shortcut to get the page content and URL.

import { crawl } from '@spider-rs/spider-rs'

const { links, pages } = await crawl('https://rsseau.fr')
console.log(pages)
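
To post-process the shortcut result, a small sketch follows. It assumes the returned pages carry the same url field as the event pages above and reuses the pageTitle helper; treat it as illustrative rather than the library's documented shape.

import { crawl, pageTitle } from '@spider-rs/spider-rs'

const { links, pages } = await crawl('https://rsseau.fr')

// assumption: the shortcut's pages expose `url` like the crawl event pages
for (const page of pages) {
  console.log(`${page.url}: ${pageTitle(page)}`)
}

console.log(`found ${links.length} links across ${pages.length} pages`)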

Benchmarks

View the benchmarks below to see a breakdown between libraries and platforms.

Test URL: https://espn.com

libraries                 pages      speed
spider(rust): crawl       150,387    1m
spider(nodejs): crawl     150,387    153s
spider(python): crawl     150,387    186s
scrapy(python): crawl     49,598     1h
crawlee(nodejs): crawl    18,779     30m

The benchmarks above were run on a Mac M1; on Linux ARM machines, spider performs about 2-10x faster.

Development

Install the napi CLI: npm i @napi-rs/cli --global.

  1. yarn build:test
