
Installation

Install the package using pip:
```bash
pip install scrapegraph-py
```

Features

  • AI-Powered Extraction: Advanced web scraping using artificial intelligence
  • Flexible Clients: Both synchronous and asynchronous support
  • Type Safety: Structured output with Pydantic schemas
  • Production Ready: Detailed logging and automatic retries
  • Developer Friendly: Comprehensive error handling

Quick Start

Initialize the client with your API key:
```python
from scrapegraph_py import Client

client = Client(api_key="your-api-key-here")
```

You can also set the `SGAI_API_KEY` environment variable and initialize the client without arguments:

```python
client = Client()
```

Services

SmartScraper

Extract specific information from any webpage using AI:
```python
response = client.smartscraper(
    website_url="https://example.com",
    user_prompt="Extract the main heading and description"
)
```

Parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| website_url | string | Yes | The URL of the webpage that needs to be scraped. |
| user_prompt | string | Yes | A textual description of what you want to achieve. |
| output_schema | object | No | The Pydantic model that describes the structure and format of the response. |
| render_heavy_js | boolean | No | Enable enhanced JavaScript rendering for heavy-JS websites (React, Vue, Angular, etc.). Default: False |
Define a simple schema for basic data extraction:
```python
from pydantic import BaseModel, Field

class ArticleData(BaseModel):
    title: str = Field(description="The article title")
    author: str = Field(description="The author's name")
    publish_date: str = Field(description="Article publication date")
    content: str = Field(description="Main article content")
    category: str = Field(description="Article category")

response = client.smartscraper(
    website_url="https://example.com/blog/article",
    user_prompt="Extract the article information",
    output_schema=ArticleData
)

print(f"Title: {response.title}")
print(f"Author: {response.author}")
print(f"Published: {response.publish_date}")
```
Define a complex schema for nested data structures:
```python
from typing import List
from pydantic import BaseModel, Field

class Employee(BaseModel):
    name: str = Field(description="Employee's full name")
    position: str = Field(description="Job title")
    department: str = Field(description="Department name")
    email: str = Field(description="Email address")

class Office(BaseModel):
    location: str = Field(description="Office location/city")
    address: str = Field(description="Full address")
    phone: str = Field(description="Contact number")

class CompanyData(BaseModel):
    name: str = Field(description="Company name")
    description: str = Field(description="Company description")
    industry: str = Field(description="Industry sector")
    founded_year: int = Field(description="Year company was founded")
    employees: List[Employee] = Field(description="List of key employees")
    offices: List[Office] = Field(description="Company office locations")
    website: str = Field(description="Company website URL")

# Extract comprehensive company information
response = client.smartscraper(
    website_url="https://example.com/about",
    user_prompt="Extract detailed company information including employees and offices",
    output_schema=CompanyData
)

# Access nested data
print(f"Company: {response.name}")

print("\nKey Employees:")
for employee in response.employees:
    print(f"- {employee.name} ({employee.position})")

print("\nOffice Locations:")
for office in response.offices:
    print(f"- {office.location}: {office.address}")
```
For modern web applications built with React, Vue, Angular, or other JavaScript frameworks:
```python
from scrapegraph_py import Client
from pydantic import BaseModel, Field

class ProductInfo(BaseModel):
    name: str = Field(description="Product name")
    price: str = Field(description="Product price")
    description: str = Field(description="Product description")
    availability: str = Field(description="Product availability status")

client = Client(api_key="your-api-key")

# Enable enhanced JavaScript rendering for a React-based e-commerce site
response = client.smartscraper(
    website_url="https://example-react-store.com/products/123",
    user_prompt="Extract product details including name, price, description, and availability",
    output_schema=ProductInfo,
    render_heavy_js=True  # Enable for React/Vue/Angular sites
)

print(f"Product: {response['result']['name']}")
print(f"Price: {response['result']['price']}")
print(f"Available: {response['result']['availability']}")
```
When to use `render_heavy_js`:
  • React, Vue, or Angular applications
  • Single Page Applications (SPAs)
  • Sites with heavy client-side rendering
  • Dynamic content loaded via JavaScript
  • Interactive elements that depend on JavaScript execution

SearchScraper

Search and extract information from multiple web sources using AI:
```python
response = client.searchscraper(
    user_prompt="What are the key features and pricing of ChatGPT Plus?"
)
```

Parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| user_prompt | string | Yes | A textual description of what you want to achieve. |
| num_results | number | No | Number of websites to search (3-20). Default: 3 |
| extraction_mode | boolean | No | True = AI extraction mode (10 credits/page), False = markdown mode (2 credits/page). Default: True |
| output_schema | object | No | The Pydantic model that describes the structure and format of the response (AI extraction mode only). |
Define a simple schema for structured search results:
```python
from typing import List
from pydantic import BaseModel, Field

class ProductInfo(BaseModel):
    name: str = Field(description="Product name")
    description: str = Field(description="Product description")
    price: str = Field(description="Product price")
    features: List[str] = Field(description="List of key features")
    availability: str = Field(description="Availability information")

response = client.searchscraper(
    user_prompt="Find information about iPhone 15 Pro",
    output_schema=ProductInfo
)

print(f"Product: {response.name}")
print(f"Price: {response.price}")

print("\nFeatures:")
for feature in response.features:
    print(f"- {feature}")
```
Define a complex schema for comprehensive market research:
```python
from typing import List
from pydantic import BaseModel, Field

class MarketPlayer(BaseModel):
    name: str = Field(description="Company name")
    market_share: str = Field(description="Market share percentage")
    key_products: List[str] = Field(description="Main products in market")
    strengths: List[str] = Field(description="Company's market strengths")

class MarketTrend(BaseModel):
    name: str = Field(description="Trend name")
    description: str = Field(description="Trend description")
    impact: str = Field(description="Expected market impact")
    timeframe: str = Field(description="Trend timeframe")

class MarketAnalysis(BaseModel):
    market_size: str = Field(description="Total market size")
    growth_rate: str = Field(description="Annual growth rate")
    key_players: List[MarketPlayer] = Field(description="Major market players")
    trends: List[MarketTrend] = Field(description="Market trends")
    challenges: List[str] = Field(description="Industry challenges")
    opportunities: List[str] = Field(description="Market opportunities")

# Perform comprehensive market research
response = client.searchscraper(
    user_prompt="Analyze the current AI chip market landscape",
    output_schema=MarketAnalysis
)

# Access structured market data
print(f"Market Size: {response.market_size}")
print(f"Growth Rate: {response.growth_rate}")

print("\nKey Players:")
for player in response.key_players:
    print(f"\n{player.name}")
    print(f"Market Share: {player.market_share}")
    print("Key Products:")
    for product in player.key_products:
        print(f"- {product}")

print("\nMarket Trends:")
for trend in response.trends:
    print(f"\n{trend.name}")
    print(f"Impact: {trend.impact}")
    print(f"Timeframe: {trend.timeframe}")
```
Use markdown mode for cost-effective content gathering:
```python
from scrapegraph_py import Client

client = Client(api_key="your-api-key")

# Enable markdown mode for cost-effective content gathering
response = client.searchscraper(
    user_prompt="Latest developments in artificial intelligence",
    num_results=3,
    extraction_mode=False  # Markdown mode: 2 credits per page vs 10
)

# Access the raw markdown content
markdown_content = response['markdown_content']
reference_urls = response['reference_urls']

print(f"Markdown content length: {len(markdown_content)} characters")
print(f"Reference URLs: {len(reference_urls)}")

# Preview the markdown content
print("Content preview:", markdown_content[:500] + "...")

# Save to file for analysis
with open('ai_research_content.md', 'w', encoding='utf-8') as f:
    f.write(markdown_content)

print("Content saved to ai_research_content.md")
```
Markdown Mode Benefits:
  • Cost-effective: Only 2 credits per page (vs 10 credits for AI extraction)
  • Full content: Get complete page content in markdown format
  • Faster: No AI processing overhead
  • Perfect for: Content analysis, bulk data collection, building datasets
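For dataset building, a common first step is splitting the returned markdown into per-heading records. A minimal stdlib sketch of that pattern — the content string below is a stand-in for a real `response['markdown_content']`:

```python
# Stand-in for response['markdown_content'] from markdown mode
markdown_content = "# Intro\nAI news.\n# Models\nNew releases.\n# Outlook\nMore to come."

# Split on top-level headings and keep (title, body) records
chunks = []
for block in markdown_content.split("\n# "):
    block = block.lstrip("# ")
    title, _, body = block.partition("\n")
    chunks.append({"title": title, "body": body.strip()})

print(len(chunks))
```

Each record can then be stored or indexed independently, which is usually more useful than one monolithic markdown string.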

Markdownify

Convert any webpage into clean, formatted markdown:
```python
response = client.markdownify(
    website_url="https://example.com"
)
```
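A typical follow-up is persisting the converted page to a `.md` file. A minimal sketch — the markdown text here is a stub standing in for the markdownify result, since the exact response shape is not shown above:

```python
# Stub standing in for the markdown returned by client.markdownify(...)
markdown = "# Example Domain\n\nThis domain is for use in illustrative examples."

# Write the markdown to disk for later analysis or versioning
with open("example.md", "w", encoding="utf-8") as f:
    f.write(markdown)

print(f"Saved {len(markdown)} characters to example.md")
```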

Async Support

All endpoints support asynchronous operations:
```python
import asyncio
from scrapegraph_py import AsyncClient

async def main():
    async with AsyncClient() as client:
        response = await client.smartscraper(
            website_url="https://example.com",
            user_prompt="Extract the main content"
        )
        print(response)

asyncio.run(main())
```
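Because the async client's calls are awaitable, multiple pages can be fetched concurrently with `asyncio.gather`. A minimal sketch of the pattern with the network call stubbed out — in real use, `scrape` would `await client.smartscraper(...)` inside `async with AsyncClient() as client:`:

```python
import asyncio

# Hypothetical URLs used only to illustrate the fan-out pattern
URLS = ["https://example.com/a", "https://example.com/b"]

async def scrape(url):
    await asyncio.sleep(0)  # stand-in for the awaited API call
    return f"scraped {url}"

async def main():
    # gather schedules all requests concurrently instead of one after another
    return await asyncio.gather(*(scrape(u) for u in URLS))

results = asyncio.run(main())
print(results)
```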

Feedback

Help us improve by submitting feedback programmatically:
```python
client.submit_feedback(
    request_id="your-request-id",
    rating=5,
    feedback_text="Great results!"
)
```

License

This project is licensed under the MIT License. See the LICENSE file for details.