A powerful web scraping library built with Playwright that provides a declarative, step-by-step approach to web automation and data extraction.
- 🚀 Declarative Scraping: Define scraping workflows using Python dictionaries or dataclasses
- 🔄 Pagination Support: Built-in support for next button and scroll-based pagination
- 📊 Data Collection: Extract text, HTML, values, and files from web pages
- 🔗 Multi-tab Support: Handle multiple tabs and complex navigation flows
- 📄 PDF Generation: Save pages as PDFs or trigger print-to-PDF actions
- 📥 File Downloads: Download files with automatic directory creation
- 🔁 Looping & Iteration: ForEach loops for processing multiple elements
- 📡 Streaming Results: Real-time result processing with callbacks
- 🎯 Error Handling: Graceful error handling with configurable termination
- 🔧 Flexible Selectors: Support for ID, class, tag, and XPath selectors
- 🔁 Retry Logic: Automatic retry on failure with configurable delays
- 🎛️ Conditional Execution: Skip or execute steps based on JavaScript conditions
- ⏳ Smart Waiting: Wait for selectors before actions with configurable timeouts
- 🔀 Fallback Selectors: Multiple selector fallbacks for increased robustness
- 🖱️ Enhanced Clicks: Double-click, right-click, modifier keys, and force clicks
- ⌨️ Input Enhancements: Clear before input, human-like typing delays
- 🔍 Data Transformations: Regex extraction, JavaScript transformations, default values
- 🌐 Page Actions: Reload, get URL/title, meta tags, cookies, localStorage, viewport
- 🤖 Human-like Behavior: Random delays to mimic human interaction
- ✅ Element State Checks: Require visible/enabled before actions
Install via pip, or from source:

```bash
# Using pip
pip install stepwright

# Using pip with development dependencies
pip install stepwright[dev]

# From source
git clone https://github.com/lablnet/stepwright.git
cd stepwright
pip install -e .
```

A minimal quick-start example:

```python
import asyncio

from stepwright import run_scraper, TabTemplate, BaseStep


async def main():
    templates = [
        TabTemplate(
            tab="example",
            steps=[
                BaseStep(
                    id="navigate",
                    action="navigate",
                    value="https://example.com"
                ),
                BaseStep(
                    id="get_title",
                    action="data",
                    object_type="tag",
                    object="h1",
                    key="title",
                    data_type="text"
                )
            ]
        )
    ]

    results = await run_scraper(templates)
    print(results)


if __name__ == "__main__":
    asyncio.run(main())
```

`run_scraper` is the main function to execute scraping templates.
Parameters:
- `templates`: List of `TabTemplate` objects
- `options`: Optional `RunOptions` object
Returns: `List[Dict[str, Any]]`
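As a purely illustrative note (the data below is made up, not real output), each result is a dict whose keys come from the `key` fields of the steps that collected data, so `.get()` is the safest way to read them:

```python
# Hypothetical results list mimicking run_scraper's List[Dict[str, Any]]
# for a template whose data steps used key="title" and key="link".
results = [
    {"title": "First article", "link": "https://example.com/1"},
    {"title": "Second article"},  # a step may not have produced "link"
]

# .get() avoids KeyError when a step produced no value for a key.
links = [r.get("link", "N/A") for r in results]
print(links)
```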
```python
results = await run_scraper(templates, RunOptions(
    browser={"headless": True}
))
```

`run_scraper_with_callback` executes scraping with streaming results via a callback.
Parameters:
- `templates`: List of `TabTemplate` objects
- `on_result`: Callback function invoked for each result (can be sync or async)
- `options`: Optional `RunOptions` object
```python
async def process_result(result, index):
    print(f"Result {index}: {result}")

await run_scraper_with_callback(templates, process_result)
```

`TabTemplate` describes a scraping tab:

```python
@dataclass
class TabTemplate:
    tab: str
    initSteps: Optional[List[BaseStep]] = None     # Steps executed once before pagination
    perPageSteps: Optional[List[BaseStep]] = None  # Steps executed for each page
    steps: Optional[List[BaseStep]] = None         # Single steps array
    pagination: Optional[PaginationConfig] = None
```

`BaseStep` describes a single step:

```python
@dataclass
class BaseStep:
    id: str
    description: Optional[str] = None
    object_type: Optional[SelectorType] = None  # 'id' | 'class' | 'tag' | 'xpath'
    object: Optional[str] = None
    action: Literal[
        "navigate", "input", "click", "data", "scroll", "eventBaseDownload",
        "foreach", "open", "savePDF", "printToPDF", "downloadPDF", "downloadFile",
        "reload", "getUrl", "getTitle", "getMeta", "getCookies", "setCookies",
        "getLocalStorage", "setLocalStorage", "getSessionStorage", "setSessionStorage",
        "getViewportSize", "setViewportSize", "screenshot", "waitForSelector", "evaluate"
    ] = "navigate"
    value: Optional[str] = None
    key: Optional[str] = None
    data_type: Optional[DataType] = None  # 'text' | 'html' | 'value' | 'default' | 'attribute'
    wait: Optional[int] = None
    terminateonerror: Optional[bool] = None
    subSteps: Optional[List["BaseStep"]] = None
    autoScroll: Optional[bool] = None

    # Retry configuration
    retry: Optional[int] = None       # Number of retries on failure (default: 0)
    retryDelay: Optional[int] = None  # Delay between retries in ms (default: 1000)

    # Conditional execution
    skipIf: Optional[str] = None  # JavaScript expression - skip step if true
    onlyIf: Optional[str] = None  # JavaScript expression - execute only if true

    # Element waiting and state
    waitForSelector: Optional[str] = None         # Wait for selector before action
    waitForSelectorTimeout: Optional[int] = None  # Timeout for waitForSelector in ms (default: 30000)
    waitForSelectorState: Optional[Literal["visible", "hidden", "attached", "detached"]] = None

    # Multiple selector fallbacks
    fallbackSelectors: Optional[List[Dict[str, str]]] = None  # List of {object_type, object}

    # Click enhancements
    clickModifiers: Optional[List[ClickModifier]] = None  # ['Control', 'Meta', 'Shift', 'Alt']
    doubleClick: Optional[bool] = None  # Perform double click
    forceClick: Optional[bool] = None   # Force click even if not visible/actionable
    rightClick: Optional[bool] = None   # Perform right click

    # Input enhancements
    clearBeforeInput: Optional[bool] = None  # Clear input before typing (default: True)
    inputDelay: Optional[int] = None         # Delay between keystrokes in ms

    # Data extraction enhancements
    required: Optional[bool] = None     # Raise error if extraction returns None/empty
    defaultValue: Optional[str] = None  # Default value if extraction fails
    regex: Optional[str] = None         # Regex pattern to extract from data
    regexGroup: Optional[int] = None    # Regex group to extract (default: 0)
    transform: Optional[str] = None     # JavaScript expression to transform data

    # Timeout configuration
    timeout: Optional[int] = None  # Step-specific timeout in ms

    # Navigation enhancements
    waitUntil: Optional[Literal["load", "domcontentloaded", "networkidle", "commit"]] = None

    # Human-like behavior
    randomDelay: Optional[Dict[str, int]] = None  # {min: ms, max: ms} for random delay

    # Element state checks
    requireVisible: Optional[bool] = None  # Require element visible (default: True for click)
    requireEnabled: Optional[bool] = None  # Require element enabled

    # Skip/continue logic
    skipOnError: Optional[bool] = None      # Skip step if error occurs (default: False)
    continueOnEmpty: Optional[bool] = None  # Continue if element not found (default: True)
```

`RunOptions` configures a run:

```python
@dataclass
class RunOptions:
    browser: Optional[dict] = None  # Playwright launch options
    onResult: Optional[Callable] = None
```

The `navigate` action navigates to a URL:
```python
BaseStep(
    id="go_to_page",
    action="navigate",
    value="https://example.com"
)
```

The `input` action fills form fields:
```python
BaseStep(
    id="search",
    action="input",
    object_type="id",
    object="search-box",
    value="search term"
)
```

The `click` action clicks on elements:
```python
BaseStep(
    id="submit",
    action="click",
    object_type="class",
    object="submit-button"
)
```

The `data` action extracts data from elements:
```python
BaseStep(
    id="get_title",
    action="data",
    object_type="tag",
    object="h1",
    key="title",
    data_type="text"
)
```

The `foreach` action processes multiple elements:
```python
BaseStep(
    id="process_items",
    action="foreach",
    object_type="class",
    object="item",
    subSteps=[
        BaseStep(
            id="get_item_title",
            action="data",
            object_type="tag",
            object="h2",
            key="title",
            data_type="text"
        )
    ]
)
```

Download a file via a download event:

```python
BaseStep(
    id="download_file",
    action="eventBaseDownload",
    object_type="class",
    object="download-link",
    value="./downloads/file.pdf",
    key="downloaded_file"
)
```

Download a linked PDF:

```python
BaseStep(
    id="download_pdf",
    action="downloadPDF",
    object_type="class",
    object="pdf-link",
    value="./output/document.pdf",
    key="pdf_file"
)
```

Save the current page as a PDF:

```python
BaseStep(
    id="save_pdf",
    action="savePDF",
    value="./output/page.pdf",
    key="pdf_file"
)
```

Next-button pagination:

```python
PaginationConfig(
    strategy="next",
    nextButton=NextButtonConfig(
        object_type="class",
        object="next-page",
        wait=2000
    ),
    maxPages=10
)
```

Scroll-based pagination:

```python
PaginationConfig(
    strategy="scroll",
    scroll=ScrollConfig(
        offset=800,
        delay=1500
    ),
    maxPages=5
)
```

Paginate first, then collect data from each page:
```python
TabTemplate(
    tab="news",
    initSteps=[...],
    perPageSteps=[...],  # Collect data from each page
    pagination=PaginationConfig(
        strategy="next",
        nextButton=NextButtonConfig(...),
        paginationFirst=True  # Go to next page before collecting
    )
)
```

Paginate through all pages first, then collect all data at once:
```python
TabTemplate(
    tab="articles",
    initSteps=[...],
    perPageSteps=[...],  # Collect all data after all pagination
    pagination=PaginationConfig(
        strategy="next",
        nextButton=NextButtonConfig(...),
        paginateAllFirst=True  # Load all pages first
    )
)
```

Run behind a proxy:

```python
from stepwright import run_scraper, RunOptions

results = await run_scraper(templates, RunOptions(
    browser={
        "proxy": {
            "server": "http://proxy-server:8080",
            "username": "user",
            "password": "pass"
        }
    }
))
```

Other browser launch options:

```python
results = await run_scraper(templates, RunOptions(
    browser={
        "headless": False,
        "slow_mo": 1000,
        "args": ["--no-sandbox", "--disable-setuid-sandbox"]
    }
))
```

Stream results as they arrive:

```python
async def process_result(result, index):
    print(f"Result {index}: {result}")
    # Process result immediately (e.g., save to database)
    await save_to_database(result)

await run_scraper_with_callback(
    templates,
    process_result,
    RunOptions(browser={"headless": True})
)
```

Use collected data in subsequent steps:
```python
BaseStep(
    id="get_title",
    action="data",
    object_type="id",
    object="page-title",
    key="page_title",
    data_type="text"
),
BaseStep(
    id="save_with_title",
    action="savePDF",
    value="./output/{{page_title}}.pdf",  # Uses collected page_title
    key="pdf_file"
)
```

Use the loop index in foreach steps:
```python
BaseStep(
    id="process_items",
    action="foreach",
    object_type="class",
    object="item",
    subSteps=[
        BaseStep(
            id="save_item",
            action="savePDF",
            value="./output/item_{{i}}.pdf"  # i = 0, 1, 2, ...
            # or value="./output/item_{{i_plus1}}.pdf"  # i_plus1 = 1, 2, 3, ...
        )
    ]
)
```

Steps can be configured to terminate on error:
```python
BaseStep(
    id="critical_step",
    action="click",
    object_type="id",
    object="important-button",
    terminateonerror=True  # Stop execution if this fails
)
```

Without `terminateonerror=True`, errors are logged but execution continues.
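Conceptually, the `{{key}}` and `{{i}}` placeholders shown in the examples above behave like simple string substitution over the collected data. A minimal stdlib sketch of the idea (the real logic lives in `helpers.py` and may differ):

```python
import re


def substitute_placeholders(template: str, collected: dict) -> str:
    """Replace each {{name}} with the matching value from `collected`.

    Unknown placeholders are left untouched; values are coerced to str.
    """
    def repl(match: re.Match) -> str:
        value = collected.get(match.group(1))
        return str(value) if value is not None else match.group(0)

    return re.sub(r"\{\{(\w+)\}\}", repl, template)


# Mirrors the savePDF examples: page_title was collected earlier,
# i / i_plus1 are the foreach loop indices.
collected = {"page_title": "My Page", "i": 0, "i_plus1": 1}
print(substitute_placeholders("./output/{{page_title}}.pdf", collected))    # ./output/My Page.pdf
print(substitute_placeholders("./output/item_{{i_plus1}}.pdf", collected))  # ./output/item_1.pdf
```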
Automatically retry failed steps with configurable delays:
```python
BaseStep(
    id="click_button",
    action="click",
    object_type="id",
    object="flaky-button",
    retry=3,         # Retry up to 3 times
    retryDelay=1000  # Wait 1 second between retries
)
```

Execute or skip steps based on JavaScript conditions:
```python
# Skip step if condition is true
BaseStep(
    id="optional_click",
    action="click",
    object_type="id",
    object="optional-button",
    skipIf="document.querySelector('.modal').classList.contains('hidden')"
)

# Execute only if condition is true
BaseStep(
    id="conditional_data",
    action="data",
    object_type="id",
    object="dynamic-content",
    key="content",
    onlyIf="document.querySelector('#dynamic-content') !== null"
)
```

Wait for elements to appear (or disappear) before performing actions:
```python
BaseStep(
    id="click_after_load",
    action="click",
    object_type="id",
    object="target-button",
    waitForSelector="#loading-indicator",  # Wait for this selector
    waitForSelectorTimeout=5000,           # Timeout: 5 seconds
    waitForSelectorState="hidden"          # Wait until hidden
)
```

Provide multiple selector options for increased robustness:
```python
BaseStep(
    id="click_with_fallback",
    action="click",
    object_type="id",
    object="primary-button",  # Try this first
    fallbackSelectors=[
        {"object_type": "class", "object": "btn-primary"},
        {"object_type": "class", "object": "submit-btn"},
        {"object_type": "xpath", "object": "//button[contains(text(), 'Submit')]"}
    ]
)
```

Advanced click options for different interaction types:
```python
# Double click
BaseStep(
    id="double_click",
    action="click",
    object_type="id",
    object="item",
    doubleClick=True
)

# Right click (context menu)
BaseStep(
    id="right_click",
    action="click",
    object_type="id",
    object="context-menu-trigger",
    rightClick=True
)

# Click with modifier keys (Ctrl/Cmd+Click)
BaseStep(
    id="multi_select",
    action="click",
    object_type="class",
    object="item",
    clickModifiers=["Control"]  # or ["Meta"] for Mac
)

# Force click (click hidden elements)
BaseStep(
    id="force_click",
    action="click",
    object_type="id",
    object="hidden-button",
    forceClick=True
)
```

More control over input behavior:
```python
# Clear input before typing (default: True)
BaseStep(
    id="clear_and_input",
    action="input",
    object_type="id",
    object="search-box",
    value="new search term",
    clearBeforeInput=True  # Clear existing value first
)

# Human-like typing with delays
BaseStep(
    id="human_like_input",
    action="input",
    object_type="id",
    object="form-field",
    value="slowly typed text",
    inputDelay=100  # 100ms delay between each character
)
```

Advanced data extraction and transformation options:
```python
# Extract with regex
BaseStep(
    id="extract_price",
    action="data",
    object_type="id",
    object="price",
    key="price",
    regex=r"\$(\d+\.\d+)",  # Extract dollar amount
    regexGroup=1            # Get first capture group
)

# Transform extracted data with JavaScript
BaseStep(
    id="transform_data",
    action="data",
    object_type="id",
    object="raw-data",
    key="processed",
    transform="value.toUpperCase().trim()"  # JavaScript transformation
)

# Required field with default value
BaseStep(
    id="get_required_data",
    action="data",
    object_type="id",
    object="important-field",
    key="important",
    required=True,      # Raise error if not found
    defaultValue="N/A"  # Use if extraction fails
)

# Continue even if element not found
BaseStep(
    id="optional_data",
    action="data",
    object_type="id",
    object="optional-content",
    key="optional",
    continueOnEmpty=True  # Don't raise error if not found
)
```

Validate element state before actions:
```python
BaseStep(
    id="click_visible",
    action="click",
    object_type="id",
    object="button",
    requireVisible=True,  # Ensure element is visible
    requireEnabled=True   # Ensure element is enabled
)
```

Add human-like random delays to actions:
```python
BaseStep(
    id="human_like_action",
    action="click",
    object_type="id",
    object="button",
    randomDelay={"min": 500, "max": 2000}  # Random delay between 500-2000ms
)
```

Skip steps that fail instead of stopping execution:
```python
BaseStep(
    id="optional_step",
    action="click",
    object_type="id",
    object="optional-button",
    skipOnError=True  # Continue even if this step fails
)
```

Reload the current page with optional wait conditions:
```python
BaseStep(
    id="reload",
    action="reload",
    waitUntil="networkidle"  # Wait for network to be idle
)
```

Get the current URL:

```python
BaseStep(
    id="get_url",
    action="getUrl",
    key="current_url"  # Store in collector
)
```

Get the page title:

```python
BaseStep(
    id="get_title",
    action="getTitle",
    key="page_title"
)
```

Read meta tags:

```python
# Get specific meta tag
BaseStep(
    id="get_description",
    action="getMeta",
    object="description",  # Meta name or property
    key="meta_description"
)

# Get all meta tags
BaseStep(
    id="get_all_meta",
    action="getMeta",
    key="all_meta_tags"  # Returns dictionary of all meta tags
)
```

Read and write cookies:

```python
# Get all cookies
BaseStep(
    id="get_cookies",
    action="getCookies",
    key="cookies"
)

# Get specific cookie
BaseStep(
    id="get_session_cookie",
    action="getCookies",
    object="session_id",
    key="session"
)

# Set cookie
BaseStep(
    id="set_cookie",
    action="setCookies",
    object="preference",
    value="dark_mode"
)
```

Read and write localStorage and sessionStorage:

```python
# Get localStorage value
BaseStep(
    id="get_storage",
    action="getLocalStorage",
    object="user_preference",
    key="preference"
)

# Set localStorage value
BaseStep(
    id="set_storage",
    action="setLocalStorage",
    object="theme",
    value="dark"
)

# Get all localStorage items
BaseStep(
    id="get_all_storage",
    action="getLocalStorage",
    key="all_storage"
)

# SessionStorage (same pattern)
BaseStep(
    id="get_session",
    action="getSessionStorage",
    object="temp_data",
    key="data"
)
```

Get or set the viewport size:

```python
# Get viewport size
BaseStep(
    id="get_viewport",
    action="getViewportSize",
    key="viewport"
)

# Set viewport size
BaseStep(
    id="set_viewport",
    action="setViewportSize",
    value="1920x1080"  # or "1920,1080" or "1920 1080"
)
```

Take screenshots:

```python
# Full page screenshot
BaseStep(
    id="screenshot",
    action="screenshot",
    value="./screenshots/page.png",
    data_type="full"  # Full page; omit for viewport only
)

# Element screenshot
BaseStep(
    id="element_screenshot",
    action="screenshot",
    object_type="id",
    object="content-area",
    value="./screenshots/element.png",
    key="screenshot_path"
)
```

Explicit wait for element state:
```python
BaseStep(
    id="wait_for_element",
    action="waitForSelector",
    object_type="id",
    object="dynamic-content",
    value="visible",   # visible, hidden, attached, detached
    wait=5000,         # Timeout in ms
    key="wait_result"  # Stores True/False
)
```

Execute custom JavaScript:
```python
BaseStep(
    id="custom_js",
    action="evaluate",
    value="() => document.querySelector('.counter').textContent",
    key="counter_value"
)
```

A complete example that scrapes a news site with pagination:

```python
import asyncio
from pathlib import Path

from stepwright import (
    run_scraper, TabTemplate, BaseStep,
    PaginationConfig, NextButtonConfig, RunOptions
)


async def main():
    templates = [
        TabTemplate(
            tab="news_scraper",
            initSteps=[
                BaseStep(
                    id="navigate",
                    action="navigate",
                    value="https://news-site.com"
                ),
                BaseStep(
                    id="search",
                    action="input",
                    object_type="id",
                    object="search-box",
                    value="technology"
                )
            ],
            perPageSteps=[
                BaseStep(
                    id="collect_articles",
                    action="foreach",
                    object_type="class",
                    object="article",
                    subSteps=[
                        BaseStep(
                            id="get_title",
                            action="data",
                            object_type="tag",
                            object="h2",
                            key="title",
                            data_type="text"
                        ),
                        BaseStep(
                            id="get_content",
                            action="data",
                            object_type="tag",
                            object="p",
                            key="content",
                            data_type="text"
                        ),
                        BaseStep(
                            id="get_link",
                            action="data",
                            object_type="tag",
                            object="a",
                            key="link",
                            data_type="value"
                        )
                    ]
                )
            ],
            pagination=PaginationConfig(
                strategy="next",
                nextButton=NextButtonConfig(
                    object_type="id",
                    object="next-page",
                    wait=2000
                ),
                maxPages=5
            )
        )
    ]

    # Run scraper
    results = await run_scraper(templates, RunOptions(
        browser={"headless": True}
    ))

    # Process results
    for i, article in enumerate(results):
        print(f"\nArticle {i + 1}:")
        print(f"Title: {article.get('title')}")
        print(f"Content: {article.get('content')[:100]}...")
        print(f"Link: {article.get('link')}")


if __name__ == "__main__":
    asyncio.run(main())
```

Development setup:

```bash
# Clone repository
git clone https://github.com/lablnet/stepwright.git
cd stepwright

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install in development mode
pip install -e ".[dev]"

# Install Playwright browsers
playwright install chromium
```

Running the tests:

```bash
# Run all tests
pytest

# Run with verbose output
pytest -v

# Run specific test file
pytest tests/test_scraper.py

# Run specific test class
pytest tests/test_scraper.py::TestGetBrowser

# Run specific test
pytest tests/test_scraper.py::TestGetBrowser::test_create_browser_instance

# Run with coverage
pytest --cov=src --cov-report=html

# Run integration tests only
pytest tests/test_integration.py
```

Project structure:

```
stepwright/
├── src/
│   ├── __init__.py
│   ├── step_types.py           # Type definitions and dataclasses
│   ├── helpers.py              # Utility functions
│   ├── executor.py             # Core step execution logic
│   ├── parser.py               # Public API (run_scraper)
│   ├── scraper.py              # Low-level browser automation
│   ├── handlers/               # Action-specific handlers
│   │   ├── __init__.py
│   │   ├── data_handlers.py    # Data extraction handlers
│   │   ├── file_handlers.py    # File download/PDF handlers
│   │   ├── loop_handlers.py    # Foreach/open handlers
│   │   └── page_actions.py     # Page-related actions (reload, getUrl, etc.)
│   └── scraper_parser.py       # Backward compatibility
├── tests/
│   ├── __init__.py
│   ├── conftest.py             # Pytest configuration
│   ├── test_page.html          # Test HTML page
│   ├── test_page_enhanced.html # Enhanced test page for new features
│   ├── test_scraper.py         # Core scraper tests
│   ├── test_parser.py          # Parser function tests
│   ├── test_new_features.py    # Tests for new features
│   └── test_integration.py     # Integration tests
├── pyproject.toml              # Package configuration
├── setup.py                    # Setup script
├── pytest.ini                  # Pytest configuration
├── README.md                   # This file
└── README_TESTS.md             # Detailed test documentation
```

Code quality tooling:

```bash
# Format code with black
black src/ tests/

# Lint with flake8
flake8 src/ tests/

# Type checking with mypy
mypy src/
```

The codebase follows separation of concerns:
- `step_types.py`: All type definitions (`BaseStep`, `TabTemplate`, etc.)
- `helpers.py`: Utility functions (placeholder replacement, locator creation, condition evaluation)
- `executor.py`: Core execution logic (execute steps, handle pagination, retry logic)
- `parser.py`: Public API (`run_scraper`, `run_scraper_with_callback`)
- `scraper.py`: Low-level Playwright wrapper (navigate, click, get_data)
- `handlers/`: Action-specific handlers organized by functionality
  - `data_handlers.py`: Data extraction logic with transformations
  - `file_handlers.py`: File download and PDF operations
  - `loop_handlers.py`: Foreach loops and new tab/window handling
  - `page_actions.py`: Page-related actions (reload, getUrl, cookies, storage, etc.)
- `scraper_parser.py`: Backward compatibility wrapper
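The step-level retry behavior handled by the executor (a step's `retry`/`retryDelay` fields) follows a common async pattern. The sketch below is an illustration of that pattern, not the library's actual code:

```python
import asyncio


async def run_with_retry(action, retries: int = 0, retry_delay_ms: int = 1000):
    """Run `action` (an async callable), retrying up to `retries` extra times.

    Mirrors BaseStep's retry/retryDelay semantics: the delay between
    attempts is expressed in milliseconds.
    """
    last_error = None
    for attempt in range(retries + 1):
        try:
            return await action()
        except Exception as exc:  # in practice, narrow this to expected errors
            last_error = exc
            if attempt < retries:
                await asyncio.sleep(retry_delay_ms / 1000)
    raise last_error


# Example: an action that fails twice, then succeeds.
calls = {"n": 0}


async def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"


print(asyncio.run(run_with_retry(flaky, retries=3, retry_delay_ms=10)))  # ok
```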
You can import from the main module or specific submodules:
```python
# From main module (recommended)
from stepwright import run_scraper, TabTemplate, BaseStep

# From specific modules
from stepwright.step_types import TabTemplate, BaseStep
from stepwright.parser import run_scraper
from stepwright.helpers import replace_data_placeholders
```

See README_TESTS.md for detailed testing documentation.
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Make your changes
- Add tests for new functionality
- Ensure all tests pass (`pytest`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
MIT License - see LICENSE file for details.
- 🐛 Issues: GitHub Issues
- 📖 Documentation: README.md and README_TESTS.md
- 💬 Discussions: GitHub Discussions
- Built with Playwright
- Inspired by declarative web scraping patterns
- Original TypeScript version: framework-Island/stepwright
Muhammad Umer Farooq (@lablnet)