DEV Community

Mario

If you needed to scrape many different websites nowadays, which tool/language combo would you pick?

Basically I want to crawl simple blogs and extract their blog posts. The biggest challenge here will probably be parsing the data and understanding the different content parts within a blog post.

Top comments (6)

Médéric Burlet

Would depend on the type of scraping.

If we need to interact as a human, then Puppeteer with JS / TS would be good: github.com/puppeteer/puppeteer
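
Roughly what that looks like — a minimal sketch assuming Node.js with the puppeteer package installed; the URL and selector are placeholders, not anything from this thread:

```typescript
import puppeteer from 'puppeteer';

(async () => {
  // Launch a headless Chromium that Puppeteer drives like a real visitor
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Wait for the network to go quiet so client-side-rendered posts exist in the DOM
  await page.goto('https://example.com/blog', { waitUntil: 'networkidle2' });

  // Read the rendered post titles straight out of the live page
  const titles = await page.$$eval('article h2', headings =>
    headings.map(h => h.textContent?.trim()),
  );
  console.log(titles);

  await browser.close();
})();
```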

If you just need to parse data, I really like to use Cheerio with JS / TS: github.com/cheeriojs/cheerio
It lets you access webpage information with jQuery syntax, which can be quite practical.
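
A minimal sketch of that jQuery-style access, assuming Node 18+ (for the built-in fetch) and the cheerio package; the URL and selectors are placeholders for whatever blog markup you target:

```typescript
import * as cheerio from 'cheerio';

async function scrapePosts(url: string) {
  // Plain HTTP fetch; no browser involved, so this only works for server-rendered pages
  const html = await (await fetch(url)).text();
  const $ = cheerio.load(html);

  // jQuery-style selection and traversal over the parsed document
  return $('article')
    .map((_, el) => ({
      title: $(el).find('h2').first().text().trim(),
      link: $(el).find('a').attr('href'),
    }))
    .get(); // convert the Cheerio collection into a plain array
}

scrapePosts('https://example.com/blog').then(console.log);
```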

Mario

Thanks for the response!

I do not need to interact as a human, but just collect news articles from different websites, at scale. Looking at Cheerio, it seems like a very decent option. Thanks!

Pacharapol Withayasakpunt

Node.js +/- Puppeteer would probably be the first natural choice, although I am not that accustomed to Puppeteer.

I used to use the Selenium API with Python when I needed to scrape dynamic websites, but async in Python does not seem as natural as in Node.js.

I don't know much about Golang. How often is it used for web scraping?

Talha Mansoor

"I do not need to interact as a human, but just collect news articles from different websites, at scale."

If it is scale you are looking for, then the best option would be Scrapy (scrapy.org) with Scrapy Cloud. You can also run multiple Scrapy spiders in a single process.

João Veiga

Elixir + Floki

Jennifer Fadriquela

I'm also a beginner at web scraping. The Scrapy framework is a good tool but has a steeper learning curve than just using libraries directly (Selenium, BeautifulSoup, Requests).