How to use headless browsers with Scrapy?

by scrapecrow Apr 23, 2023

Python offers several libraries for headless browser control, such as Playwright and Selenium, but integrating them with Scrapy can be difficult.

To use Playwright with Scrapy, the scrapy-playwright community extension can be used. It works by registering a new download handler that renders requests through Playwright. To activate it, set the DOWNLOAD_HANDLERS setting:

DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
# and switch to the asyncio reactor as Playwright is asynchronous
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
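
If only one spider in a project needs a browser, the same two settings could be applied per spider instead of project-wide. Below is a minimal sketch of that idea; the helper name with_playwright and the constant names are hypothetical and not part of scrapy-playwright. It only merges the values shown above into an existing settings dict without mutating it:

```python
# Hypothetical helper (not part of scrapy-playwright): merges the Playwright
# download handlers and the asyncio reactor into a copy of a settings dict.
PLAYWRIGHT_HANDLER = "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler"
ASYNCIO_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"


def with_playwright(settings: dict) -> dict:
    """Return a copy of `settings` with the Playwright handler enabled."""
    merged = dict(settings)
    # Copy the nested dict so the caller's handlers are left untouched.
    handlers = dict(merged.get("DOWNLOAD_HANDLERS", {}))
    handlers.update({"http": PLAYWRIGHT_HANDLER, "https": PLAYWRIGHT_HANDLER})
    merged["DOWNLOAD_HANDLERS"] = handlers
    merged["TWISTED_REACTOR"] = ASYNCIO_REACTOR
    return merged
```

The result could then be assigned to a spider's custom_settings class attribute, for example custom_settings = with_playwright({"CONCURRENT_REQUESTS": 4}). This assumes the spider is started with the scrapy crawl command, so that the reactor setting is picked up before the reactor is installed.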

Then, to enable Playwright, attach the meta={"playwright": True} parameter to each outgoing scrapy.Request object:

import scrapy


class PlaywrightSpider(scrapy.Spider):
    name = "playwright-spider"

    def start_requests(self):
        yield scrapy.Request("https://httpbin.dev/get", meta={"playwright": True})
        # or POST request
        yield scrapy.FormRequest(
            url="https://httpbin.dev/post",
            formdata={"foo": "bar"},
            meta={"playwright": True},
        )

    def parse(self, response):
        # 'response' contains the page as seen by the browser
        return {"url": response.url}
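
Since the flag has to be set on every request, it can help to build the meta dict in one place. The sketch below is one possible way to do that; playwright_meta is a hypothetical helper name, and the optional wait step assumes scrapy-playwright's PageMethod class and its playwright_page_methods meta key:

```python
from typing import Any, Optional


def playwright_meta(wait_for: Optional[str] = None, **extra: Any) -> dict:
    """Build a Request.meta dict that routes the request through Playwright.

    wait_for: optional CSS selector to wait for before the response is returned.
    extra: any other meta keys to pass along unchanged.
    """
    meta: dict = {"playwright": True, **extra}
    if wait_for is not None:
        # Imported lazily: scrapy-playwright is only needed for the wait step.
        from scrapy_playwright.page import PageMethod

        # Calls page.wait_for_selector(wait_for) in the browser before returning.
        meta["playwright_page_methods"] = [PageMethod("wait_for_selector", wait_for)]
    return meta
```

In a spider this might be used as scrapy.Request(url, meta=playwright_meta(wait_for="div.quote")), where the selector is only an illustrative placeholder for an element that the page renders with JavaScript.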

While scrapy-playwright doesn't give full control of the web browser, it integrates easily with Scrapy spiders and can be a simple solution for scraping dynamic web content with Scrapy.

Alternatively, check out Scrapfly's Scrapy SDK with the headless browser feature, which configures Scrapy requests to go through Scrapfly's managed cloud browsers.

Related Articles

Web Scraping Dynamic Websites With Scrapy Playwright

Learn about Scrapy Playwright, a Scrapy integration that allows web scraping dynamic web pages with Scrapy. We'll explain web scraping with Scrapy Playwright through an example project and how to use it for common scraping use cases, such as clicking elements, scrolling and waiting for elements.


Web Scraping Dynamic Web Pages With Scrapy Selenium

Learn how to scrape dynamic web pages with Scrapy Selenium. You will also learn how to use Scrapy Selenium for common scraping use cases, such as waiting for elements, clicking buttons and scrolling.


Scrapy Splash Guide: Scrape Dynamic Websites With Scrapy

Learn about web scraping with Scrapy Splash, which lets Scrapy scrape dynamic web pages. We'll define Splash, cover installation and navigation, and provide a step-by-step guide for using Scrapy Splash.


Web Scraping with Playwright and JavaScript

Learn about Playwright, a browser automation toolkit for server-side JavaScript runtimes like NodeJS, Deno or Bun.


Guide to SeleniumBase — A Better & Easier Selenium

SeleniumBase streamlines browser automation with simple syntax, cross-browser support, and robust features, perfect for testing and web scraping.


Playwright vs Selenium

Explore the key differences between Playwright vs Selenium in terms of performance, web scraping, and automation testing for modern web applications.
