This guide provides a quick overview for getting started with the ScrapeGraph tools. For detailed documentation of all ScrapeGraph features and configurations, head to the API reference. For more information about ScrapeGraph AI, visit scrapegraphai.com.

Overview

Integration details

| Class | Package |
| --- | --- |
| SmartScraperTool | langchain-scrapegraph |
| SmartCrawlerTool | langchain-scrapegraph |
| MarkdownifyTool | langchain-scrapegraph |
| AgenticScraperTool | langchain-scrapegraph |
| GetCreditsTool | langchain-scrapegraph |

All five tools ship in the langchain-scrapegraph package; the latest version is published on PyPI.

Tool features

| Tool | Purpose | Input | Output |
| --- | --- | --- | --- |
| SmartScraperTool | Extract structured data from websites | URL + prompt | JSON |
| SmartCrawlerTool | Extract data from multiple pages with crawling | URL + prompt + crawl options | JSON |
| MarkdownifyTool | Convert webpages to markdown | URL | Markdown text |
| GetCreditsTool | Check API credits | None | Credit info |
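
The Input column maps onto the dict each tool's invoke() expects. As a quick orientation, here is a sketch of those input shapes only; it assumes the tool instances created under Instantiation below, and the URL is an illustrative placeholder (full runnable examples are under Invocation):

```python
# Sketch: input shapes matching the table above. Assumes the tool instances
# from the "Instantiation" section; https://example.com is a placeholder URL.
smartscraper.invoke(
    {"user_prompt": "Extract the page title", "website_url": "https://example.com"}
)
markdownify.invoke({"website_url": "https://example.com"})
credits.invoke({})  # takes no input
```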

Setup

The integration requires the following packages:
```bash
pip install --quiet -U langchain-scrapegraph
```

Credentials

You’ll need a ScrapeGraph AI API key to use these tools. Get one at scrapegraphai.com.
```python
import getpass
import os

if not os.environ.get("SGAI_API_KEY"):
    os.environ["SGAI_API_KEY"] = getpass.getpass("ScrapeGraph AI API key:\n")
```
It’s also helpful (but not needed) to set up LangSmith for best-in-class observability:
os.environ["LANGSMITH_TRACING"] = "true" os.environ["LANGSMITH_API_KEY"] = getpass.getpass() 

Instantiation

Here we show how to instantiate the ScrapeGraph tools:
```python
import json

from langchain_scrapegraph.tools import (
    GetCreditsTool,
    MarkdownifyTool,
    SmartCrawlerTool,
    SmartScraperTool,
)
from scrapegraph_py.logger import sgai_logger

sgai_logger.set_logging(level="INFO")

smartscraper = SmartScraperTool()
smartcrawler = SmartCrawlerTool()
markdownify = MarkdownifyTool()
credits = GetCreditsTool()
```

Invocation

Invoke directly with args

Let’s try each tool individually:

SmartCrawler Tool

The SmartCrawlerTool allows you to crawl multiple pages from a website and extract structured data with advanced crawling options like depth control, page limits, and domain restrictions.
```python
# SmartScraper
result = smartscraper.invoke(
    {
        "user_prompt": "Extract the company name and description",
        "website_url": "https://scrapegraphai.com",
    }
)
print("SmartScraper Result:", result)

# Markdownify
markdown = markdownify.invoke({"website_url": "https://scrapegraphai.com"})
print("\nMarkdownify Result (first 200 chars):", markdown[:200])

# SmartCrawler
url = "https://scrapegraphai.com/"
prompt = (
    "What does the company do? and I need text content from their privacy and terms"
)

# Use the tool with crawling parameters
result_crawler = smartcrawler.invoke(
    {
        "url": url,
        "prompt": prompt,
        "cache_website": True,
        "depth": 2,
        "max_pages": 2,
        "same_domain_only": True,
    }
)

print("\nSmartCrawler Result:")
print(json.dumps(result_crawler, indent=2))

# Check credits
credits_info = credits.invoke({})
print("\nCredits Info:", credits_info)
```
```
SmartScraper Result: {'company_name': 'ScrapeGraphAI', 'description': "ScrapeGraphAI is a powerful AI web scraping tool that turns entire websites into clean, structured data through a simple API. It's designed to help developers and AI companies extract valuable data from websites efficiently and transform it into formats that are ready for use in LLM applications and data analysis."}

Markdownify Result (first 200 chars): [![ScrapeGraphAI Logo](https://scrapegraphai.com/images/scrapegraphai_logo.svg)ScrapeGraphAI](https://scrapegraphai.com/)  PartnersPricingFAQ[Blog](https://scrapegraphai.com/blog)DocsLog inSign up  Op

LocalScraper Result: {'company_name': 'Company Name', 'description': 'We are a technology company focused on AI solutions.', 'contact': {'email': '[email protected]', 'phone': '(555) 123-4567'}}

Credits Info: {'remaining_credits': 49679, 'total_credits_used': 914}
```
```python
# SmartCrawler example
import json

from langchain_scrapegraph.tools import SmartCrawlerTool
from scrapegraph_py.logger import sgai_logger

sgai_logger.set_logging(level="INFO")

# Will automatically get SGAI_API_KEY from the environment
tool = SmartCrawlerTool()

url = "https://scrapegraphai.com/"
prompt = (
    "What does the company do? and I need text content from their privacy and terms"
)

# Use the tool with crawling parameters
result = tool.invoke(
    {
        "url": url,
        "prompt": prompt,
        "cache_website": True,
        "depth": 2,
        "max_pages": 2,
        "same_domain_only": True,
    }
)

print(json.dumps(result, indent=2))
```

Invoke with ToolCall

We can also invoke the tool with a model-generated ToolCall:
```python
model_generated_tool_call = {
    "args": {
        "user_prompt": "Extract the main heading and description",
        "website_url": "https://scrapegraphai.com",
    },
    "id": "1",
    "name": smartscraper.name,
    "type": "tool_call",
}
smartscraper.invoke(model_generated_tool_call)
```
```
ToolMessage(content='{"main_heading": "Get the data you need from any website", "description": "Easily extract and gather information with just a few lines of code with a simple api. Turn websites into clean and usable structured data."}', name='SmartScraper', tool_call_id='1')
```
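
The tool returns a ToolMessage whose content field is a JSON string. If you need the structured fields, you can parse it yourself; a minimal sketch (the main_heading key matches the output above):

```python
import json

msg = smartscraper.invoke(model_generated_tool_call)
data = json.loads(msg.content)  # content is a JSON string, as shown above
print(data["main_heading"])
```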

Chaining

Let’s use our tools with an LLM to analyze a website:
```python
# | output: false
# | echo: false

# pip install -qU langchain langchain-openai
from langchain.chat_models import init_chat_model

model = init_chat_model(model="gpt-4o", model_provider="openai")
```
```python
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnableConfig, chain

prompt = ChatPromptTemplate(
    [
        (
            "system",
            "You are a helpful assistant that can use tools to extract structured information from websites.",
        ),
        ("human", "{user_input}"),
        ("placeholder", "{messages}"),
    ]
)

model_with_tools = model.bind_tools([smartscraper], tool_choice=smartscraper.name)
model_chain = prompt | model_with_tools


@chain
def tool_chain(user_input: str, config: RunnableConfig):
    input_ = {"user_input": user_input}
    ai_msg = model_chain.invoke(input_, config=config)
    tool_msgs = smartscraper.batch(ai_msg.tool_calls, config=config)
    return model_chain.invoke(
        {**input_, "messages": [ai_msg, *tool_msgs]}, config=config
    )


tool_chain.invoke(
    "What does ScrapeGraph AI do? Extract this information from their website https://scrapegraphai.com"
)
```
```
AIMessage(content='ScrapeGraph AI is an AI-powered web scraping tool that efficiently extracts and converts website data into structured formats via a simple API. It caters to developers, data scientists, and AI researchers, offering features like easy integration, support for dynamic content, and scalability for large projects. It supports various website types, including business, e-commerce, and educational sites. Contact: [email protected].', additional_kwargs={'tool_calls': [{'id': 'call_shkRPyjyAtfjH9ffG5rSy9xj', 'function': {'arguments': '{"user_prompt":"Extract details about the products, services, and key features offered by ScrapeGraph AI, as well as any unique selling points or innovations mentioned on the website.","website_url":"https://scrapegraphai.com"}', 'name': 'SmartScraper'}, 'type': 'function'}], 'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 47, 'prompt_tokens': 480, 'total_tokens': 527, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_name': 'gpt-4o-2024-08-06', 'system_fingerprint': 'fp_c7ca0ebaca', 'finish_reason': 'stop', 'logprobs': None}, id='run-45a12c86-d499-4273-8c59-0db926799bc7-0', tool_calls=[{'name': 'SmartScraper', 'args': {'user_prompt': 'Extract details about the products, services, and key features offered by ScrapeGraph AI, as well as any unique selling points or innovations mentioned on the website.', 'website_url': 'https://scrapegraphai.com'}, 'id': 'call_shkRPyjyAtfjH9ffG5rSy9xj', 'type': 'tool_call'}], usage_metadata={'input_tokens': 480, 'output_tokens': 47, 'total_tokens': 527, 'input_token_details': {'audio': 0, 'cache_read': 0}, 'output_token_details': {'audio': 0, 'reasoning': 0}})
```
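
Since the chain returns an AIMessage, the synthesized answer lives on its .content attribute; for example:

```python
answer = tool_chain.invoke(
    "What does ScrapeGraph AI do? Extract this information from their website"
    " https://scrapegraphai.com"
)
print(answer.content)
```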

API reference

For detailed documentation of all ScrapeGraph features and configurations, head to the LangChain API reference or the official SDK repository.