ScrapeGraph AI Integration Server
About
Enable language models to perform advanced AI-powered web scraping with enterprise-grade reliability. Transform webpages into structured markdown, extract data using AI, and execute AI-powered web searches seamlessly. Enhance your applications with powerful web data extraction capabilities through this integration.
It also provides web search, browser automation (JavaScript rendering and scrolling), and markdown conversion of webpages.
Capabilities
markdownify
Convert a webpage into clean, formatted markdown.
This tool fetches any webpage and converts its content into clean, readable markdown format. Useful for extracting content from documentation, articles, and web pages for further processing. Costs 2 credits per page. Read-only operation with no side effects.
Args:
website_url (str): The complete URL of the webpage to convert to markdown format.
- Must include protocol (http:// or https://)
- Supports most web content types (HTML, articles, documentation)
- Works with both static and dynamic content
- Examples:
  * https://example.com/page
  * https://docs.python.org/3/tutorial/
  * https://github.com/user/repo/README.md
- Invalid examples:
  * example.com (missing protocol)
  * ftp://example.com (unsupported protocol)
  * localhost:3000 (missing protocol)
Returns: Dictionary containing:
- markdown: The converted markdown content as a string
- metadata: Additional information about the conversion (title, description, etc.)
- status: Success/error status of the operation
- credits_used: Number of credits consumed (always 2 for this operation)
Raises:
ValueError: If website_url is malformed or missing protocol
HTTPError: If the webpage cannot be accessed or returns an error
TimeoutError: If the webpage takes too long to load (>120 seconds)
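A minimal sketch of how a client might invoke this tool. The call_tool(name, arguments) dispatcher is a hypothetical stand-in for whatever your MCP client provides, not part of this server's documented interface; the argument name and result fields come from the documentation above.

```python
from typing import Any, Callable, Dict

# Hypothetical dispatcher type: call_tool(tool_name, arguments) -> result dict.
ToolCaller = Callable[[str, Dict[str, Any]], Dict[str, Any]]

def convert_page_to_markdown(call_tool: ToolCaller, url: str) -> str:
    """Convert a single webpage to markdown via the markdownify tool."""
    result = call_tool("markdownify", {"website_url": url})
    # Per the Returns section: markdown, metadata, status, credits_used (always 2).
    if result.get("status") != "success":
        raise RuntimeError(f"markdownify failed: {result}")
    return result["markdown"]
```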
smartscraper
Extract structured data from a webpage, HTML, or markdown using AI-powered extraction.
This tool uses advanced AI to understand your natural language prompt and extract specific structured data from web content. Supports three input modes: URL scraping, local HTML processing, or local markdown processing. Ideal for extracting product information, contact details, article metadata, or any structured content. Costs 10 credits per page. Read-only operation.
Args:
user_prompt (str): Natural language instructions describing what data to extract.
- Be specific about the fields you want for better results
- Use clear, descriptive language about the target data
- Examples:
  * "Extract product name, price, description, and availability status"
  * "Find all contact methods: email addresses, phone numbers, and social media links"
  * "Get article title, author, publication date, and summary"
  * "Extract all job listings with title, company, location, and salary"
- Tips for better results:
  * Specify exact field names you want
  * Mention data types (numbers, dates, URLs, etc.)
  * Include context about where data might be located
website_url (Optional[str]): The complete URL of the webpage to scrape.
- Mutually exclusive with website_html and website_markdown
- Must include protocol (http:// or https://)
- Supports dynamic and static content
- Examples:
  * https://example.com/products/item
  * https://news.site.com/article/123
  * https://company.com/contact
- Default: None (must provide one of the three input sources)
website_html (Optional[str]): Raw HTML content to process locally.
- Mutually exclusive with website_url and website_markdown
- Maximum size: 2MB
- Useful for processing pre-fetched or generated HTML
- Use when you already have HTML content from another source
- Example: "<html><body><h1>Title</h1><p>Content</p></body></html>"
- Default: None
website_markdown (Optional[str]): Markdown content to process locally.
- Mutually exclusive with website_url and website_html
- Maximum size: 2MB
- Useful for extracting from markdown documents or converted content
- Works well with documentation, README files, or converted web content
- Example: "# Title\nSection\nContent here..."
- Default: None
output_schema (Optional[Union[str, Dict]]): JSON schema defining expected output structure.
- Can be provided as a dictionary or JSON string
- Helps ensure consistent, structured output format
- Optional but recommended for complex extractions
- Examples:
  * As dict: {'type': 'object', 'properties': {'title': {'type': 'string'}, 'price': {'type': 'number'}}}
  * As JSON string: '{"type": "object", "properties": {"name": {"type": "string"}}}'
  * For arrays: {'type': 'array', 'items': {'type': 'object', 'properties': {...}}}
- Default: None (AI will infer structure from prompt)
number_of_scrolls (Optional[int]): Number of infinite scrolls to perform before scraping.
- Range: 0-50 scrolls
- Default: 0 (no scrolling)
- Useful for dynamically loaded content (lazy loading, infinite scroll)
- Each scroll waits for content to load before continuing
- Examples:
  * 0: Static content, no scrolling needed
  * 3: Social media feeds, product listings
  * 10: Long articles, extensive product catalogs
- Note: Increases processing time proportionally
total_pages (Optional[int]): Number of pages to process for pagination.
- Range: 1-100 pages
- Default: 1 (single page only)
- Automatically follows pagination links when available
- Useful for multi-page listings, search results, catalogs
- Examples:
  * 1: Single page extraction
  * 5: First 5 pages of search results
  * 20: Comprehensive catalog scraping
- Note: Each page counts toward credit usage (10 credits × pages)
render_heavy_js (Optional[bool]): Enable heavy JavaScript rendering for dynamic sites.
- Default: false
- Set to true for Single Page Applications (SPAs), React apps, Vue.js sites
- Increases processing time but captures client-side rendered content
- Use when content is loaded dynamically via JavaScript
- Examples of when to use:
  * React/Angular/Vue applications
  * Sites with dynamic content loading
  * AJAX-heavy interfaces
  * Content that appears after page load
- Note: Significantly increases processing time (30-60 seconds vs 5-15 seconds)
stealth (Optional[bool]): Enable stealth mode to avoid bot detection.
- Default: false
- Helps bypass basic anti-scraping measures
- Uses techniques to appear more like a human browser
- Useful for sites with bot detection systems
- Examples of when to use:
  * Sites that block automated requests
  * E-commerce sites with protection
  * Sites that require "human-like" behavior
- Note: May increase processing time and is not 100% guaranteed
Returns: Dictionary containing:
- extracted_data: The structured data matching your prompt and optional schema
- metadata: Information about the extraction process
- credits_used: Number of credits consumed (10 per page processed)
- processing_time: Time taken for the extraction
- pages_processed: Number of pages that were analyzed
- status: Success/error status of the operation
Raises:
ValueError: If no input source provided or multiple sources provided
HTTPError: If website_url cannot be accessed
TimeoutError: If processing exceeds timeout limits
ValidationError: If output_schema is malformed JSON
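The sketch below shows one way to combine user_prompt, an output_schema, and the pagination/scrolling options. The call_tool dispatcher and the extract_products wrapper are illustrative assumptions; the argument names and the 10-credits-per-page cost come from the documentation above.

```python
from typing import Any, Callable, Dict

# Hypothetical dispatcher type: call_tool(tool_name, arguments) -> result dict.
ToolCaller = Callable[[str, Dict[str, Any]], Dict[str, Any]]

def extract_products(call_tool: ToolCaller, url: str) -> Any:
    """Extract product data from a listing page via smartscraper."""
    schema = {
        "type": "array",
        "items": {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "price": {"type": "number"},
                "availability": {"type": "string"},
            },
        },
    }
    result = call_tool(
        "smartscraper",
        {
            "user_prompt": "Extract product name, price, and availability status",
            "website_url": url,       # exactly one of the three input sources
            "output_schema": schema,  # optional, but keeps output structure stable
            "number_of_scrolls": 3,   # listing pages often lazy-load items
            "total_pages": 2,         # follows pagination; 10 credits x 2 pages
        },
    )
    return result["extracted_data"]
```

To process pre-fetched content instead of a live URL, pass website_html or website_markdown in place of website_url; the three inputs are mutually exclusive.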
searchscraper
Perform AI-powered web searches with structured data extraction.
This tool searches the web based on your query and uses AI to extract structured information from the search results. Ideal for research, competitive analysis, and gathering information from multiple sources. Each website searched costs 10 credits (default 3 websites = 30 credits). Read-only operation but results may vary over time (non-idempotent).
Args:
user_prompt (str): Search query or natural language instructions for information to find.
- Can be a simple search query or detailed extraction instructions
- The AI will search the web and extract relevant data from found pages
- Be specific about what information you want extracted
- Examples:
  * "Find latest AI research papers published in 2024 with author names and abstracts"
  * "Search for Python web scraping tutorials with ratings and difficulty levels"
  * "Get current cryptocurrency prices and market caps for top 10 coins"
  * "Find contact information for tech startups in San Francisco"
  * "Search for job openings for data scientists with salary information"
- Tips for better results:
  * Include specific fields you want extracted
  * Mention timeframes or filters (e.g., "latest", "2024", "top 10")
  * Specify data types needed (prices, dates, ratings, etc.)
num_results (Optional[int]): Number of websites to search and extract data from.
- Default: 3 websites (costs 30 credits total)
- Range: 1-20 websites (recommended to stay under 10 for cost efficiency)
- Each website costs 10 credits, so total cost = num_results × 10
- Examples:
  * 1: Quick single-source lookup (10 credits)
  * 3: Standard research (30 credits) - good balance of coverage and cost
  * 5: Comprehensive research (50 credits)
  * 10: Extensive analysis (100 credits)
- Note: More results provide broader coverage but increase costs and processing time
number_of_scrolls (Optional[int]): Number of infinite scrolls per searched webpage.
- Default: 0 (no scrolling on search result pages)
- Range: 0-10 scrolls per page
- Useful when search results point to pages with dynamic content loading
- Each scroll waits for content to load before continuing
- Examples:
  * 0: Static content pages, news articles, documentation
  * 2: Social media pages, product listings with lazy loading
  * 5: Extensive feeds, long-form content with infinite scroll
- Note: Increases processing time significantly (adds 5-10 seconds per scroll per page)
Returns: Dictionary containing:
- search_results: Array of extracted data from each website found
- sources: List of URLs that were searched and processed
- total_websites_processed: Number of websites successfully analyzed
- credits_used: Total credits consumed (num_results × 10)
- processing_time: Total time taken for search and extraction
- search_query_used: The actual search query sent to search engines
- metadata: Additional information about the search process
Raises:
ValueError: If user_prompt is empty or num_results is out of range
HTTPError: If search engines are unavailable or return errors
TimeoutError: If search or extraction process exceeds timeout limits
RateLimitError: If too many requests are made in a short time period
Note:
- Results may vary between calls due to changing web content (non-idempotent)
- Search engines may return different results over time
- Some websites may be inaccessible or block automated access
- Processing time increases with num_results and number_of_scrolls
- Consider using smartscraper on specific URLs if you know the target sites
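A hedged example of a research-style call, again assuming a hypothetical call_tool dispatcher; only the argument names, result fields, and the num_results × 10 credit cost are taken from the documentation above.

```python
from typing import Any, Callable, Dict, List

# Hypothetical dispatcher type: call_tool(tool_name, arguments) -> result dict.
ToolCaller = Callable[[str, Dict[str, Any]], Dict[str, Any]]

def research_topic(call_tool: ToolCaller, query: str, sites: int = 3) -> List[Any]:
    """Run an AI-powered web search via searchscraper and return per-site results."""
    result = call_tool(
        "searchscraper",
        {
            "user_prompt": query,
            "num_results": sites,    # total cost = sites x 10 credits
            "number_of_scrolls": 0,  # raise only for pages with infinite scroll
        },
    )
    print("Sources searched:", result["sources"])
    print("Credits used:", result["credits_used"])
    return result["search_results"]
```

Because results depend on live search engines, two identical calls may return different sources; pin down specific URLs with smartscraper when reproducibility matters.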
smartcrawler_initiate
Initiate an asynchronous multi-page web crawling operation with AI extraction or markdown conversion.
This tool starts an intelligent crawler that discovers and processes multiple pages from a starting URL. Choose between AI Extraction Mode (10 credits/page) for structured data or Markdown Mode (2 credits/page) for content conversion. The operation is asynchronous - use smartcrawler_fetch_results to retrieve results. Creates a new crawl request (non-idempotent, non-read-only).
SmartCrawler supports two modes:
- AI Extraction Mode: Extracts structured data based on your prompt from every crawled page
- Markdown Conversion Mode: Converts each page to clean markdown format
Args:
url (str): The starting URL to begin crawling from.
- Must include protocol (http:// or https://)
- The crawler will discover and process linked pages from this starting point
- Should be a page with links to other pages you want to crawl
- Examples:
  * https://docs.example.com (documentation site root)
  * https://blog.company.com (blog homepage)
  * https://example.com/products (product category page)
  * https://news.site.com/category/tech (news section)
- Best practices:
  * Use homepage or main category pages as starting points
  * Ensure the starting page has links to content you want to crawl
  * Consider site structure when choosing the starting URL
prompt (Optional[str]): AI prompt for data extraction.
- REQUIRED when extraction_mode is 'ai'
- Ignored when extraction_mode is 'markdown'
- Describes what data to extract from each crawled page
- Applied consistently across all discovered pages
- Examples:
  * "Extract API endpoint name, method, parameters, and description"
  * "Get article title, author, publication date, and summary"
  * "Find product name, price, description, and availability"
  * "Extract job title, company, location, salary, and requirements"
- Tips for better results:
  * Be specific about fields you want from each page
  * Consider that different pages may have different content structures
  * Use general terms that apply across multiple page types
extraction_mode (str): Extraction mode for processing crawled pages.
- Default: "ai"
- Options:
  * "ai": AI-powered structured data extraction (10 credits per page)
    - Uses the prompt to extract specific data from each page
    - Returns structured JSON data
    - More expensive but provides targeted information
    - Best for: Data collection, research, structured analysis
  * "markdown": Simple markdown conversion (2 credits per page)
    - Converts each page to clean markdown format
    - No AI processing, just content conversion
    - More cost-effective for content archival
    - Best for: Documentation backup, content migration, reading
- Cost comparison:
  * AI mode: 50 pages = 500 credits
  * Markdown mode: 50 pages = 100 credits
depth (Optional[int]): Maximum depth of link traversal from the starting URL.
- Default: unlimited (will follow links until max_pages or no more links)
- Depth levels:
  * 0: Only the starting URL (no link following)
  * 1: Starting URL + pages directly linked from it
  * 2: Starting URL + direct links + links from those pages
  * 3+: Continues following links to specified depth
- Examples:
  * 1: Crawl blog homepage + all blog posts
  * 2: Crawl docs homepage + category pages + individual doc pages
  * 3: Deep crawling for comprehensive site coverage
- Considerations:
  * Higher depth can lead to exponential page growth
  * Use with max_pages to control scope and cost
  * Consider site structure when setting depth
max_pages (Optional[int]): Maximum number of pages to crawl in total.
- Default: unlimited (will crawl until no more links or depth limit)
- Recommended ranges:
  * 10-20: Testing and small sites
  * 50-100: Medium sites and focused crawling
  * 200-500: Large sites and comprehensive analysis
  * 1000+: Enterprise-level crawling (high cost)
- Cost implications:
  * AI mode: max_pages × 10 credits
  * Markdown mode: max_pages × 2 credits
- Examples:
  * 10: Quick site sampling (20-100 credits)
  * 50: Standard documentation crawl (100-500 credits)
  * 200: Comprehensive site analysis (400-2000 credits)
- Note: Crawler stops when this limit is reached, regardless of remaining links
same_domain_only (Optional[bool]): Whether to crawl only within the same domain.
- Default: true (recommended for most use cases)
- Options:
  * true: Only crawl pages within the same domain as the starting URL
    - Prevents following external links
    - Keeps crawling focused on the target site
    - Reduces risk of crawling unrelated content
    - Example: Starting at docs.example.com only crawls docs.example.com pages
  * false: Allow crawling external domains
    - Follows links to other domains
    - Can lead to very broad crawling scope
    - May crawl unrelated or unwanted content
    - Use with caution and an appropriate max_pages limit
- Recommendations:
  * Use true for focused site crawling
  * Use false only when you specifically need cross-domain data
  * Always set max_pages when using false to prevent runaway crawling
Returns: Dictionary containing:
- request_id: Unique identifier for this crawl operation (use with smartcrawler_fetch_results)
- status: Initial status of the crawl request ("initiated" or "processing")
- estimated_cost: Estimated credit cost based on parameters (actual cost may vary)
- crawl_parameters: Summary of the crawling configuration
- estimated_time: Rough estimate of processing time
- next_steps: Instructions for retrieving results
Raises:
ValueError: If URL is malformed, prompt is missing for AI mode, or parameters are invalid
HTTPError: If the starting URL cannot be accessed
RateLimitError: If too many crawl requests are initiated too quickly
Note:
- This operation is asynchronous and may take several minutes to complete
- Use smartcrawler_fetch_results with the returned request_id to get results
- Keep polling smartcrawler_fetch_results until status is "completed"
- Actual pages crawled may be less than max_pages if fewer links are found
- Processing time increases with max_pages, depth, and extraction_mode complexity
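The following sketch starts a bounded crawl. The call_tool helper and the start_docs_crawl wrapper are assumptions; the parameter names, modes, and per-page credit costs (10 for "ai", 2 for "markdown") are taken from the documentation above.

```python
from typing import Any, Callable, Dict

# Hypothetical dispatcher type: call_tool(tool_name, arguments) -> result dict.
ToolCaller = Callable[[str, Dict[str, Any]], Dict[str, Any]]

def start_docs_crawl(call_tool: ToolCaller, start_url: str) -> str:
    """Kick off an asynchronous AI-extraction crawl and return its request_id."""
    result = call_tool(
        "smartcrawler_initiate",
        {
            "url": start_url,
            "extraction_mode": "ai",
            "prompt": "Extract page title, section headings, and a short summary",
            "depth": 2,                # start page + linked pages + their links
            "max_pages": 50,           # caps cost at 50 x 10 = 500 credits in AI mode
            "same_domain_only": True,  # stay on the target site
        },
    )
    # The crawl runs in the background; poll smartcrawler_fetch_results with this id.
    return result["request_id"]
```

For a cheaper content-archival crawl, set extraction_mode to "markdown" and omit the prompt; the same max_pages cap then bounds the cost at max_pages × 2 credits.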
smartcrawler_fetch_results
Retrieve the results of an asynchronous SmartCrawler operation.
This tool fetches the results from a previously initiated crawling operation using the request_id. The crawl request processes asynchronously in the background. Keep polling this endpoint until the status field indicates 'completed'. While processing, you'll receive status updates. Read-only operation that safely retrieves results without side effects.
Args:
request_id: The unique request ID returned by smartcrawler_initiate.
- Use this to retrieve the crawling results
- Keep polling until status is 'completed'
- Example: 'req_abc123xyz'
Returns: Dictionary containing:
- status: Current status of the crawl operation ('processing', 'completed', 'failed')
- results: Crawled data (structured extraction or markdown) when completed
- metadata: Information about processed pages, URLs visited, and processing statistics
Keep polling until status is 'completed' to get the final results.
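Because the crawl is asynchronous, results are retrieved by polling. A minimal polling loop, assuming the same hypothetical call_tool dispatcher and an arbitrary 15-second interval; the status values ('processing', 'completed', 'failed') come from the documentation above.

```python
import time
from typing import Any, Callable, Dict

# Hypothetical dispatcher type: call_tool(tool_name, arguments) -> result dict.
ToolCaller = Callable[[str, Dict[str, Any]], Dict[str, Any]]

def wait_for_crawl(call_tool: ToolCaller, request_id: str,
                   poll_seconds: float = 15.0, timeout: float = 1800.0) -> Any:
    """Poll smartcrawler_fetch_results until the crawl completes or fails."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = call_tool("smartcrawler_fetch_results", {"request_id": request_id})
        status = result.get("status")
        if status == "completed":
            return result["results"]
        if status == "failed":
            raise RuntimeError(f"Crawl {request_id} failed: {result}")
        time.sleep(poll_seconds)  # still 'processing'; wait and poll again
    raise TimeoutError(f"Crawl {request_id} did not complete within {timeout} seconds")
```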