Quick Start

Get up and running with ScrapeGraphAI in under 5 minutes. This guide will walk you through creating your first scraper using the SmartScraperGraph pipeline.

Your First Scraper

The most common scraping pipeline is SmartScraperGraph, which extracts information from a single page given a natural language prompt and a source URL.

Using Local Models (Ollama)

Here’s a complete example using Ollama with a local Llama model:
Step 1: Ensure Ollama is running

Make sure you have Ollama installed and a model downloaded:
ollama pull llama3.2
ollama serve
Step 2: Create your scraper script

Create a new Python file (e.g., scraper.py) with the following code:
from scrapegraphai.graphs import SmartScraperGraph
import json

# Define the configuration for the scraping pipeline
graph_config = {
    "llm": {
        "model": "ollama/llama3.2",
        "temperature": 0,
        "model_tokens": 8192,
        "format": "json",
    },
    "verbose": True,
    "headless": False,
}

# Create the SmartScraperGraph instance
smart_scraper_graph = SmartScraperGraph(
    prompt="Find information about the founders and what the company does",
    source="https://scrapegraphai.com/",
    config=graph_config
)

# Run the pipeline
result = smart_scraper_graph.run()
print(json.dumps(result, indent=4))
Step 3: Run the scraper

Execute your script:
python scraper.py
You’ll see the scraper fetch the page, process it with the LLM, and return structured data:
{
    "description": "ScrapeGraphAI transforms websites into clean, organized data for AI agents and data analytics.",
    "founders": [
        {
            "name": "Marco Vinciguerra",
            "role": "Founder & Software Engineer",
            "linkedin": "https://www.linkedin.com/in/marco-vinciguerra-7ba365242/"
        },
        {
            "name": "Lorenzo Padoan",
            "role": "Founder & Product Engineer",
            "linkedin": "https://www.linkedin.com/in/lorenzo-padoan-4521a2154/"
        }
    ]
}

Using OpenAI

For production use cases, you might prefer OpenAI’s models:
import os
import json
from dotenv import load_dotenv
from scrapegraphai.graphs import SmartScraperGraph

load_dotenv()

# Define the configuration with OpenAI
graph_config = {
    "llm": {
        "api_key": os.getenv("OPENAI_API_KEY"),
        "model": "openai/gpt-4o-mini",
    },
    "verbose": True,
    "headless": False,
}

# Create and run the scraper
smart_scraper_graph = SmartScraperGraph(
    prompt="Extract the first article with its title and summary",
    source="https://www.wired.com",
    config=graph_config,
)

result = smart_scraper_graph.run()
print(json.dumps(result, indent=4))
Model Selection: gpt-4o-mini offers a great balance between performance and cost. For more complex scraping tasks, consider using gpt-4o or gpt-4-turbo.

Understanding the Configuration

Let’s break down the configuration object:
graph_config = {
    "llm": {
        "model": "ollama/llama3.2",      # LLM provider and model name
        "temperature": 0,                # Lower = more deterministic
        "model_tokens": 8192,            # Context window size
        "format": "json",                # Output format
    },
    "verbose": True,                     # Print execution details
    "headless": False,                   # Show browser window (useful for debugging)
}

Key Configuration Options

  • llm.model: Specifies the LLM provider and model (e.g., ollama/llama3.2, openai/gpt-4o-mini)
  • temperature: Controls randomness (0 = deterministic, 1 = creative)
  • verbose: Enables detailed logging of the scraping process
  • headless: Set to True for production (no browser UI), False for debugging
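
Pulling these options together, a production-leaning configuration might look like the sketch below. The specific values (quiet logging, headless browser, temperature 0) are illustrative choices, not library requirements:

```python
# Production-leaning configuration sketch: headless browser, quiet logs.
graph_config = {
    "llm": {
        "model": "openai/gpt-4o-mini",
        "temperature": 0,      # deterministic extraction
    },
    "verbose": False,          # keep logs quiet in production
    "headless": True,          # no browser window on a server
}
```

Flip `headless` back to `False` and `verbose` to `True` whenever you need to watch a run interactively.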

Using Schema for Structured Output

For predictable, type-safe output, define a Pydantic schema:
import os
from typing import List
from dotenv import load_dotenv
from pydantic import BaseModel, Field
from scrapegraphai.graphs import SmartScraperGraph

load_dotenv()

# Define the output schema
class Project(BaseModel):
    title: str = Field(description="The title of the project")
    description: str = Field(description="The description of the project")

class Projects(BaseModel):
    projects: List[Project]

# Configure the scraper with schema
graph_config = {
    "llm": {
        "api_key": os.getenv("OPENAI_API_KEY"),
        "model": "openai/gpt-4o-mini",
    },
    "verbose": True,
    "headless": False,
}

# Create scraper with schema
smart_scraper_graph = SmartScraperGraph(
    prompt="List all the projects with their descriptions",
    source="https://perinim.github.io/projects/",
    schema=Projects,  # Add schema here
    config=graph_config,
)

result = smart_scraper_graph.run()
print(result)
The output will be validated against your schema:
{
    'projects': [
        {'title': 'Project Alpha', 'description': 'A web application for...'},
        {'title': 'Project Beta', 'description': 'An AI-powered tool that...'}
    ]
}
Using schemas ensures your data is structured correctly and makes it easier to integrate with databases or APIs.
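
Because the schema is an ordinary Pydantic model, you can also re-validate the returned dict and work with typed objects downstream. A minimal sketch, where the sample dict stands in for a real `run()` result:

```python
from typing import List
from pydantic import BaseModel, Field

class Project(BaseModel):
    title: str = Field(description="The title of the project")
    description: str = Field(description="The description of the project")

class Projects(BaseModel):
    projects: List[Project]

# Stand-in for smart_scraper_graph.run(); a real result has the same shape.
result = {
    "projects": [
        {"title": "Project Alpha", "description": "A web application for..."},
    ]
}

validated = Projects.model_validate(result)  # raises ValidationError if fields are missing
print(validated.projects[0].title)           # typed attribute access
```

Validating before inserting into a database catches malformed extractions early instead of at query time.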

Supported LLM Providers

ScrapeGraphAI supports multiple LLM providers, all configured through the same llm block with a provider-prefixed model name (e.g., ollama/... as shown above). For example, with OpenAI:
graph_config = {
    "llm": {
        "api_key": os.getenv("OPENAI_API_KEY"),
        "model": "openai/gpt-4o-mini",
    },
}

Advanced Features

Get Execution Information

Debug and optimize your scraper by inspecting execution details:
from scrapegraphai.utils import prettify_exec_info

result = smart_scraper_graph.run()

# Get detailed execution information
graph_exec_info = smart_scraper_graph.get_execution_info()
print(prettify_exec_info(graph_exec_info))
This shows:
  • Total execution time
  • Time spent in each node
  • LLM tokens used
  • Success/failure status

Custom Base URL for Ollama

If Ollama is running on a different host or port:
graph_config = {
    "llm": {
        "model": "ollama/llama3.2",
        "base_url": "http://192.168.1.100:11434",  # Custom Ollama URL
    },
}

Scraping Local Files

ScrapeGraphAI can also process local HTML files:
smart_scraper_graph = SmartScraperGraph(
    prompt="Extract all product names and prices",
    source="/path/to/local/file.html",  # Local file path
    config=graph_config,
)

result = smart_scraper_graph.run()

Common Use Cases

Extract Contact Information

prompt = "Find email addresses, phone numbers, and social media links"

Get Product Details

prompt = "Extract product names, prices, descriptions, and availability"

Scrape Articles

prompt = "Get the article title, author, publication date, and full text"

Collect Company Data

prompt = "Find company name, headquarters, founder information, and description"

Troubleshooting

Browser Not Opening

If the browser doesn’t launch:
# Reinstall Playwright browsers
playwright install --with-deps

Empty Results

  • Ensure your prompt is clear and specific
  • Try setting headless: False to see what the browser loads
  • Check if the website requires JavaScript (Playwright handles this automatically)
  • Increase model_tokens if the page content is large
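
For the last point, a config tweak along these lines often helps with long pages; the 16384 value is an illustrative number, pick one your model's context window supports:

```python
# Debugging config for empty results: larger context, visible browser.
graph_config = {
    "llm": {
        "model": "ollama/llama3.2",
        "model_tokens": 16384,  # illustrative: raise if page content is truncated
        "format": "json",
    },
    "verbose": True,
    "headless": False,          # watch what the browser actually loads
}
```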

Rate Limiting

  • Add delays between requests when scraping multiple pages
  • Consider using local models (Ollama) for development
  • Use appropriate API rate limits for production
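
One simple way to add delays between requests is a small throttling helper; this is a hedged sketch (the helper and the 2-second delay are our own illustration, not part of the ScrapeGraphAI API):

```python
import time

def throttled(items, delay_seconds):
    """Yield items, pausing between consecutive ones to respect rate limits."""
    for i, item in enumerate(items):
        if i > 0:
            time.sleep(delay_seconds)
        yield item

# Hypothetical usage with the scraper from the examples above:
# for url in throttled(urls, delay_seconds=2.0):
#     result = SmartScraperGraph(prompt=prompt, source=url,
#                                config=graph_config).run()
```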

Next Steps

Now that you’ve built your first scraper, explore more advanced features:

Multi-Page Scraping

Learn how to scrape multiple URLs with a single prompt

Search Integration

Scrape top search results from search engines

Custom Graphs

Build custom scraping pipelines for complex scenarios

API Reference

Explore all available graphs and configuration options

Always respect website terms of service and robots.txt files. Use reasonable rate limits and add delays between requests to avoid overloading servers.