The spider project ported to Node.js
- `npm i @spider-rs/spider-rs --save`
```ts
import { Website, pageTitle } from '@spider-rs/spider-rs'

const website = new Website('https://rsseau.fr')
  .withHeaders({
    authorization: 'somerandomjwt',
  })
  .withBudget({
    '*': 20, // limit the crawl to a max of 20 pages for the website
    '/docs': 10, // limit the `/docs` paths to 10 pages
  })
  .withBlacklistUrl(['/resume']) // regex or pattern matching to ignore paths
  .build()

// optional: page event handler
const onPageEvent = (_err, page) => {
  const title = pageTitle(page) // comment out to increase performance if the title is not needed
  console.info(`Title of ${page.url} is '${title}'`)
  website.pushData({
    status: page.statusCode,
    html: page.content,
    url: page.url,
    title,
  })
}

await website.crawl(onPageEvent)
await website.exportJsonlData('./storage/rsseau.jsonl')

console.log(website.getLinks())
```
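The JSONL export can be read back with Node's core modules. A minimal sketch, assuming each line of `./storage/rsseau.jsonl` holds one JSON record shaped like the `pushData` call above:

```ts
import { createReadStream } from 'node:fs'
import { createInterface } from 'node:readline'

// stream the export line by line; each line is one JSON record
const lines = createInterface({
  input: createReadStream('./storage/rsseau.jsonl'),
  crlfDelay: Infinity,
})

for await (const line of lines) {
  const record = JSON.parse(line) // assumed shape: { status, html, url, title }
  console.log(record.url, record.status, record.title)
}
```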
Collect the resources for a website.

```ts
import { Website } from '@spider-rs/spider-rs'

const website = new Website('https://rsseau.fr')
  .withBudget({
    '*': 20,
    '/docs': 10,
  })
  // you can use regex or string matches to ignore paths
  .withBlacklistUrl(['/resume'])
  .build()

await website.scrape()
console.log(website.getPages())
```
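The page objects returned by `getPages` can then be processed directly. A small sketch, assuming each page exposes `url` and `content` as in the event-handler example above:

```ts
// report the size of each scraped page
for (const page of website.getPages()) {
  console.log(`${page.url}: ${page.content?.length ?? 0} characters of HTML`)
}
```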
Run the crawls in the background on another thread.

```ts
import { Website } from '@spider-rs/spider-rs'

const website = new Website('https://rsseau.fr')

const onPageEvent = (_err, page) => {
  console.log(page)
}

// the second param runs the crawl in the background, returning immediately
await website.crawl(onPageEvent, true)
```
Use headless Chrome rendering for crawls.

```ts
import { Website } from '@spider-rs/spider-rs'

const website = new Website('https://rsseau.fr').withChromeIntercept(true, true)

const onPageEvent = (_err, page) => {
  console.log(page)
}

// the third param determines headless chrome usage
await website.crawl(onPageEvent, false, true)
console.log(website.getLinks())
```
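The Chrome options compose with the other builder methods shown earlier; a sketch combining interception with a small page budget (the budget value here is arbitrary):

```ts
import { Website } from '@spider-rs/spider-rs'

// combine Chrome interception with a page budget
const website = new Website('https://rsseau.fr')
  .withChromeIntercept(true, true)
  .withBudget({ '*': 5 })
  .build()

// third param enables headless Chrome, as above
await website.crawl((_err, page) => console.log(page.url), false, true)
```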
Cron jobs can be done with the following.

```ts
import { Website } from '@spider-rs/spider-rs'

const website = new Website('https://choosealicense.com').withCron('1/5 * * * * *')

// sleep function to test cron
const stopCron = (time: number, handle) => {
  return new Promise((resolve) => {
    setTimeout(() => {
      resolve(handle.stop())
    }, time)
  })
}

const links = []

const onPageEvent = (err, value) => {
  links.push(value)
}

const handle = await website.runCron(onPageEvent)

// stop the cron in 4 seconds
await stopCron(4000, handle)
```
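The sleep helper can equally be written with Node's built-in promise timers; an equivalent sketch:

```ts
import { setTimeout as sleep } from 'node:timers/promises'

const handle = await website.runCron(onPageEvent)

// stop the cron after 4 seconds
await sleep(4000)
await handle.stop()
```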
Use the crawl shortcut to get the page content and url.

```ts
import { crawl } from '@spider-rs/spider-rs'

const { links, pages } = await crawl('https://rsseau.fr')

console.log(pages)
```
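`pageTitle` from the first example also works on the pages returned by the shortcut; a short sketch:

```ts
import { crawl, pageTitle } from '@spider-rs/spider-rs'

const { pages } = await crawl('https://rsseau.fr')

// print the <title> of every fetched page
for (const page of pages) {
  console.log(`${page.url} -> ${pageTitle(page)}`)
}
```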
View the benchmarks to see a breakdown between libs and platforms.

Test url: `https://espn.com`
| libraries | pages | duration |
|---|---|---|
| spider(rust): crawl | 150,387 | 1m | 
| spider(nodejs): crawl | 150,387 | 153s | 
| spider(python): crawl | 150,387 | 186s | 
| scrapy(python): crawl | 49,598 | 1h | 
| crawlee(nodejs): crawl | 18,779 | 30m | 
The benches above were run on a Mac M1; spider performs about 2-10x faster on Linux ARM machines.
Install the napi CLI: `npm i @napi-rs/cli --global`.
- `yarn build:test`
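After the build, a quick smoke test can confirm the bindings load; a minimal sketch, assuming the local build resolves as `@spider-rs/spider-rs`:

```ts
import { crawl } from '@spider-rs/spider-rs'

// crawl a small site and verify pages come back
const { links, pages } = await crawl('https://choosealicense.com')

if (pages.length === 0) {
  throw new Error('smoke test failed: no pages returned')
}
console.log(`ok: ${links.length} links, ${pages.length} pages`)
```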