
What is extract()?

await stagehand.extract("extract the name of the repository"); 
extract grabs structured data from a webpage. You can define your schema with zod (TypeScript) or JSON. If you do not want to define a schema, you can also call extract with just a natural language prompt, or call extract with no parameters.

Why use extract()?

Using extract()

Pass extract() a natural language instruction describing the data you want. Add a schema as the second argument when you need a typed result.
const result = await stagehand.extract("extract the product details"); 

Return value of extract()?

When you use extract(), Stagehand returns a Promise<ExtractResult>. The shape of the result depends on how you call extract(): with a basic schema, an array schema, a primitive schema, an instruction only, or no parameters.
When extracting with a schema, the return type is inferred from your Zod schema:
const result = await stagehand.extract(
  "extract product details",
  z.object({
    name: z.string(),
    price: z.number(),
    inStock: z.boolean()
  })
);
Example result:
{
  name: "Wireless Mouse",
  price: 29.99,
  inStock: true
}
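As a rough illustration of what that inference gives you, the sketch below writes out by hand the type TypeScript would derive from the schema above and uses it in a small formatter. `ProductDetails` and `describeProduct` are hypothetical names introduced for this example; they are not part of Stagehand.

```typescript
// Hand-written equivalent of the type inferred from the schema above.
type ProductDetails = {
  name: string;
  price: number;
  inStock: boolean;
};

// Hypothetical helper: formats an extracted product for display.
function describeProduct(product: ProductDetails): string {
  const stock = product.inStock ? "in stock" : "out of stock";
  return `${product.name}: $${product.price.toFixed(2)} (${stock})`;
}

// Same values as the example result above.
const sampleProduct: ProductDetails = {
  name: "Wireless Mouse",
  price: 29.99,
  inStock: true,
};

console.log(describeProduct(sampleProduct));
```

Because the fields are statically typed, a typo such as `product.instock` or treating `price` as a string would be caught at compile time.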

Advanced Configuration

You can pass additional options to configure the model, timeout, and selector scope:
const result = await stagehand.extract("extract the repository name", {
  model: "anthropic/claude-sonnet-4-5",
  timeout: 30000,
  selector: "//header" // Focus on specific area
});

Targeted Extract

Pass a selector to extract() to target a specific element on the page.
This narrows the context passed to the LLM, which reduces token usage, speeds up the call, and improves accuracy.
const tableData = await stagehand.extract(
  "Extract the values of the third row",
  z.object({
    values: z.array(z.string())
  }),
  {
    // XPath or CSS selector
    selector: "xpath=/html/body/div/table"
  }
);

Best practices

Extract with Context

Add .describe() annotations to your schema fields to give the model extra context and help it extract the data more accurately.
const apartments = await stagehand.extract(
  "Extract ALL the apartment listings and their details, including address, price, and square feet.",
  z.array(
    z.object({
      address: z.string().describe("the address of the apartment"),
      price: z.string().describe("the price of the apartment"),
      square_feet: z.string().describe("the square footage of the apartment"),
    })
  )
);
To extract links or URLs, define the relevant field as z.string().url().
The call below extracts a single link. The same approach works for image links.
const contactLink = await stagehand.extract(
  "extract the link to the 'contact us' page",
  z.string().url() // note the usage of z.string().url() for URL validation
);

console.log("the link to the contact us page is: ", contactLink);
Inside Stagehand, extracting links works by asking the LLM to select an ID. Stagehand looks up that ID in a mapping of IDs -> URLs. When logging the LLM trace, you should expect to see IDs. The actual URLs will be included in the final ExtractResult.
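The lookup step described above can be pictured with a small sketch. This is not Stagehand's internal code: the `UrlMap` type, the `resolveLinkId` function, and the ID format are all assumptions made up for illustration.

```typescript
// Hypothetical stand-in for the ID -> URL mapping described above.
type UrlMap = Map<string, string>;

// Swap the ID chosen by the model for the URL recorded for that element.
function resolveLinkId(id: string, urlMap: UrlMap): string {
  const url = urlMap.get(id);
  if (url === undefined) {
    throw new Error(`No URL recorded for element ID "${id}"`);
  }
  return url;
}

// Illustrative mapping; the ID format here is invented.
const linkUrls: UrlMap = new Map([
  ["0-12", "https://example.com/contact"],
  ["0-37", "https://example.com/about"],
]);

// The model answers with an ID; the caller-facing result carries the URL.
const modelAnswer = "0-12";
console.log(resolveLinkId(modelAnswer, linkUrls));
```

One consequence of this design is that the model never has to reproduce a long URL character by character, which is why the trace shows IDs while the final result shows real links.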

Troubleshooting

Problem: extract() returns empty or incomplete data
Solutions:
  • Check your instruction clarity: Make sure your instruction is specific and describes exactly what data you want to extract
  • Verify the data exists: Use stagehand.observe() first to confirm the data is present on the page
  • Wait for dynamic content: If the page loads content dynamically, use stagehand.act("wait for the content to load") before extracting
Solution: Wait for content before extracting
// Wait for content before extracting
await stagehand.act("wait for the product listings to load");
const products = await stagehand.extract(
  "extract all product names and prices",
  z.array(z.object({
    name: z.string(),
    price: z.string()
  }))
);
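If content still arrives late, one complementary option is to retry the extraction a few times before giving up. The sketch below is a generic helper, not a Stagehand feature: `extractWithRetry` and `makeFlakyExtract` are hypothetical names, and the extraction call is passed in as a function so the helper makes no assumptions about Stagehand's API.

```typescript
// Retry an extraction until it returns at least one item or attempts run out.
async function extractWithRetry<T>(
  extractOnce: () => Promise<T[]>,
  maxAttempts = 3,
  delayMs = 1000
): Promise<T[]> {
  let last: T[] = [];
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    last = await extractOnce();
    if (last.length > 0) return last;
    if (attempt < maxAttempts) {
      // Pause before the next attempt to give the page time to render.
      await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
    }
  }
  return last; // Still empty after the final attempt.
}

// Stand-in for a page whose content appears only after some empty attempts.
function makeFlakyExtract(emptyCalls: number): () => Promise<string[]> {
  let calls = 0;
  return async () => {
    calls += 1;
    return calls <= emptyCalls ? [] : ["Wireless Mouse"];
  };
}

extractWithRetry(makeFlakyExtract(2), 5, 10).then((items) => {
  console.log(items);
});
```

In real use, `extractOnce` would wrap a call such as the stagehand.extract() example above.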
Problem: Getting schema validation errors or type mismatches
Solutions:
  • Use optional fields: Make fields optional with z.optional() if the data might not always be present
  • Use flexible types: Consider using z.string() instead of z.number() for prices that might include currency symbols
  • Add descriptions: Use .describe() to help the model understand field requirements
Solution: More flexible schema
const schema = z.object({
  price: z.string().describe("price including currency symbol, e.g., '$19.99'"),
  availability: z.string().optional().describe("stock status if available"),
  rating: z.number().optional()
});
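Extracting the price as a string moves the numeric conversion into your own code. A minimal sketch of that conversion follows; `parsePrice` is a hypothetical helper, and it assumes US-style formatting (comma as thousands separator, period as decimal point).

```typescript
// Convert a displayed price such as "$1,299.50" into a number.
// Returns null when the text contains no parseable number.
function parsePrice(raw: string): number | null {
  // Keep digits, periods, and minus signs; drop currency symbols,
  // thousands separators, whitespace, and any other text.
  const cleaned = raw.replace(/[^0-9.-]/g, "");
  if (cleaned === "") return null;
  const value = Number(cleaned);
  return Number.isFinite(value) ? value : null;
}

console.log(parsePrice("$19.99"));
console.log(parsePrice("$1,299.50"));
console.log(parsePrice("Call for price"));
```

Prices in other locales, such as "1.299,50 €", would need a different cleaning rule.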
Problem: Extraction results vary between runs
Solutions:
  • Be more specific in instructions: Instead of “extract prices”, use “extract the numerical price value for each item”
  • Use context in schema descriptions: Add field descriptions to guide the model
  • Combine with observe: Use stagehand.observe() to understand the page structure first
Solution: Validate with observe first
// First observe to understand the page structure
const elements = await stagehand.observe("find all product listings");
console.log("Found elements:", elements.map(e => e.description));

// Then extract with specific targeting
const products = await stagehand.extract(
  "extract name and price from each product listing shown on the page",
  z.array(z.object({
    name: z.string().describe("the product title or name"),
    price: z.string().describe("the price as displayed, including currency")
  }))
);
Problem: Extraction is slow or timing out
Solutions:
  • Reduce scope: Extract smaller chunks of data in multiple calls rather than everything at once
  • Use targeted instructions: Be specific about which part of the page to focus on
  • Consider pagination: For large datasets, extract one page at a time
  • Increase timeout: Use the timeout option for complex extractions
Solution: Break down large extractions
// Instead of extracting everything at once
const allData = [];
const pageNumbers = [1, 2, 3, 4, 5];

for (const pageNum of pageNumbers) {
  await stagehand.act(`navigate to page ${pageNum}`);

  const pageData = await stagehand.extract(
    "extract product data from the current page only",
    z.array(z.object({
      name: z.string(),
      price: z.number()
    })),
    { timeout: 60000 } // 60 second timeout
  );

  allData.push(...pageData);
}
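One thing the loop above does not handle is a single page failing partway through. A possible variation, sketched here under the assumption that you would rather keep partial results than abort, records failed pages and continues. `collectPages`, `PageExtractor`, and `fakeExtractPage` are hypothetical names; the per-page extraction is injected as a function so the helper does not depend on Stagehand.

```typescript
// Hypothetical signature: any function that returns the items found on one page.
type PageExtractor<T> = (pageNum: number) => Promise<T[]>;

interface CollectedPages<T> {
  items: T[];
  failedPages: number[];
}

// Extract pages one at a time, keeping results from pages that succeed.
async function collectPages<T>(
  pageNumbers: number[],
  extractPage: PageExtractor<T>
): Promise<CollectedPages<T>> {
  const items: T[] = [];
  const failedPages: number[] = [];
  for (const pageNum of pageNumbers) {
    try {
      items.push(...(await extractPage(pageNum)));
    } catch {
      // Record the failure and move on so earlier pages are not lost.
      failedPages.push(pageNum);
    }
  }
  return { items, failedPages };
}

// Stand-in extractor for demonstration; page 2 simulates a timeout.
const fakeExtractPage: PageExtractor<{ name: string }> = async (pageNum) => {
  if (pageNum === 2) throw new Error("timed out");
  return [{ name: `product from page ${pageNum}` }];
};

collectPages([1, 2, 3], fakeExtractPage).then(({ items, failedPages }) => {
  console.log(items.length, failedPages);
});
```

In real use, `extractPage` would wrap the act and extract calls from the loop above, and `failedPages` tells you which pages to retry.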

Next steps