
Welcome to ScrapeGraphAI

ScrapeGraphAI is a revolutionary web scraping Python library that uses Large Language Models (LLMs) and direct graph logic to create scraping pipelines for websites and local documents. Instead of writing complex CSS selectors or XPath queries, you simply describe the information you want to extract in natural language.

What is ScrapeGraphAI?

ScrapeGraphAI transforms the traditional web scraping paradigm by leveraging AI to understand and extract data from web pages. Just tell the library what information you want to extract, and it will do it for you automatically.
"You Only Scrape Once": ScrapeGraphAI's motto reflects its intelligent approach to web scraping, where AI agents understand a page's structure and extract exactly what you need.
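
For a concrete feel, here is a minimal single-page scrape following the pattern in the project README (the model name and config keys are illustrative; substitute your own API key):

```python
from scrapegraphai.graphs import SmartScraperGraph

# LLM configuration; "openai/gpt-4o-mini" is an illustrative model choice
graph_config = {
    "llm": {
        "api_key": "YOUR_OPENAI_API_KEY",
        "model": "openai/gpt-4o-mini",
    },
}

# Describe what you want in plain language; no selectors required
smart_scraper_graph = SmartScraperGraph(
    prompt="Find information about what the company does, its name, and a contact email.",
    source="https://scrapegraphai.com/",
    config=graph_config,
)

result = smart_scraper_graph.run()
print(result)  # a Python dict with the extracted fields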

Key Features

Natural Language Prompts

Extract data using simple text descriptions instead of complex selectors. No need to inspect HTML or write XPath queries.

Multiple LLM Support

Works with OpenAI, Anthropic, Groq, Azure, Gemini, and local models via Ollama. Choose the model that fits your needs and budget.

Multiple Data Sources

Scrape from websites, HTML, XML, JSON, CSV, Markdown, and other document formats with the same simple interface.
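
As a sketch of that source flexibility (assuming, per the library's docs, that SmartScraperGraph accepts raw HTML as well as URLs):

```python
from scrapegraphai.graphs import SmartScraperGraph

# The same interface accepts a URL or raw HTML content as the source
with open("page.html", "r", encoding="utf-8") as f:
    html = f.read()

graph = SmartScraperGraph(
    prompt="Extract the article title and the author's name.",
    source=html,  # raw HTML string instead of a URL
    config={"llm": {"api_key": "YOUR_OPENAI_API_KEY", "model": "openai/gpt-4o-mini"}},
)
print(graph.run())
```

Format-specific pipelines (for example, XMLScraperGraph and CSVScraperGraph) cover structured file formats with the same prompt-driven interface.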

Built-in Graph Pipelines

Pre-built scraping pipelines for single pages, multi-page scraping, search results, and more complex scenarios.

Why Choose ScrapeGraphAI?

AI-Powered Intelligence

Traditional web scrapers break when websites change their structure. ScrapeGraphAI uses LLMs to understand content semantically, making your scrapers more resilient to layout changes.

Developer-Friendly

No need to:
  • Inspect element structures
  • Write complex CSS selectors or XPath queries
  • Handle pagination logic manually
  • Parse and structure data manually

Flexible Architecture

Built on LangChain, ScrapeGraphAI provides:
  • Modular graph nodes for customizable pipelines
  • Schema validation using Pydantic models (see the sketch after this list)
  • Multi-source scraping from a single prompt
  • Parallel execution for improved performance
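
To illustrate the Pydantic-based schema validation, a minimal sketch (the schema classes and target URL are made up for the example; the schema parameter follows the usage shown in the library's docs):

```python
from typing import List
from pydantic import BaseModel, Field
from scrapegraphai.graphs import SmartScraperGraph

# Hypothetical output schema for the example
class Product(BaseModel):
    name: str = Field(description="Product name")
    price: str = Field(description="Listed price, including currency")

class ProductList(BaseModel):
    products: List[Product]

graph = SmartScraperGraph(
    prompt="List every product on the page with its price.",
    source="https://example.com/shop",  # placeholder URL
    config={"llm": {"api_key": "YOUR_OPENAI_API_KEY", "model": "openai/gpt-4o-mini"}},
    schema=ProductList,  # the output is structured against this model
)
print(graph.run())
```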

Available Pipelines

ScrapeGraphAI comes with multiple pre-built graph pipelines:
  • SmartScraperGraph: Single-page scraper with a user prompt and input source
  • SearchGraph: Multi-page scraper that extracts information from top search results
  • SpeechGraph: Extracts information and generates an audio file
  • ScriptCreatorGraph: Generates a Python script for scraping
  • SmartScraperMultiGraph: Multi-page scraper with a single prompt and multiple sources
  • ScriptCreatorMultiGraph: Generates Python scripts for multiple pages
Several pipelines have a Multi variant that makes parallel LLM calls for improved performance.
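
As an example of a multi-page pipeline, a SearchGraph run might look like this (max_results is an optional config key; treat the exact keys as illustrative):

```python
from scrapegraphai.graphs import SearchGraph

graph_config = {
    "llm": {
        "api_key": "YOUR_OPENAI_API_KEY",
        "model": "openai/gpt-4o-mini",
    },
    "max_results": 3,  # how many search results to scrape (illustrative key)
}

# SearchGraph performs a web search, then scrapes the top results
search_graph = SearchGraph(
    prompt="List the top open-source web scraping libraries and their licenses.",
    config=graph_config,
)
print(search_graph.run())
```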

Supported LLM Providers

ScrapeGraphAI integrates with major LLM providers:
  • OpenAI (GPT-4, GPT-3.5, GPT-4o)
  • Anthropic Claude
  • Google Gemini
  • Groq
  • Azure OpenAI
  • Ollama (local models like Llama, Mistral, Phi)
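
Switching providers is a matter of changing the llm section of the config. A sketch, assuming the provider/model naming used in the project's examples (the Ollama base_url is the default local endpoint):

```python
# Hosted model via OpenAI
openai_config = {
    "llm": {
        "api_key": "YOUR_OPENAI_API_KEY",
        "model": "openai/gpt-4o-mini",
    },
}

# Local model via Ollama; assumes an Ollama server running locally
ollama_config = {
    "llm": {
        "model": "ollama/llama3",
        "base_url": "http://localhost:11434",  # default Ollama port
    },
}
```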

Use Cases

Data Collection

Gather product information, pricing data, or market research from multiple websites automatically.

Content Aggregation

Extract articles, blog posts, or news from various sources into a structured format.

Lead Generation

Collect contact information, company details, and social media links from business websites.

AI Agent Integration

Provide clean, structured data to AI agents through integrations with LangChain, LlamaIndex, and Crew.ai.

Integration Ecosystem

ScrapeGraphAI seamlessly integrates with popular frameworks:
  • LLM Frameworks: LangChain, LlamaIndex, Crew.ai, Agno, CamelAI
  • Low-code Platforms: Pipedream, Bubble, Zapier, n8n, Dify
  • MCP Server: Available on Smithery
  • SDKs: Python and Node.js SDKs for the hosted API
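
As a sketch of the hosted API through the Python SDK (the Client class and smartscraper method follow the scrapegraph-py README; treat the exact signature as an assumption):

```python
# Assumes the scrapegraph-py SDK: pip install scrapegraph-py
from scrapegraph_py import Client

client = Client(api_key="YOUR_SGAI_API_KEY")

# smartscraper mirrors SmartScraperGraph, but runs on the hosted service
response = client.smartscraper(
    website_url="https://scrapegraphai.com/",
    user_prompt="Extract the company name and a contact email.",
)
print(response)
```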

Performance

According to the Firecrawl benchmark, ScrapeGraphAI is the most accurate fetcher available for data extraction.

Next Steps

Installation

Get started by installing ScrapeGraphAI and its dependencies

Quick Start

Build your first scraper in under 5 minutes

Community and Support

Join the ScrapeGraphAI community to ask questions, report issues, and share what you build.

ScrapeGraphAI is intended for data exploration and research purposes only. Always respect website terms of service and robots.txt files.