
Configuration

Configuration in ScrapeGraphAI is done through a dictionary that controls the behavior of graphs, nodes, and LLM models.

Configuration Structure

The main configuration dictionary has the following structure:
graph_config = {
    "llm": {                    # LLM configuration (required)
        "model": "provider/model-name",
        "api_key": "your-api-key",
        # ... model-specific settings
    },
    "verbose": False,           # Enable detailed logging
    "headless": True,          # Browser headless mode
    "timeout": 480,            # Request timeout (seconds)
    "cache_path": False,       # Enable caching
    "loader_kwargs": {},       # Browser loader options
    # ... additional options
}
The llm configuration is required for all graphs. All other settings are optional.

LLM Configuration

The llm section configures the language model used for generation.

Basic Configuration

"llm": {
    "model": "openai/gpt-4o-mini",
    "api_key": "sk-...",
    "temperature": 0
}

Model Specification

Models can be specified with or without a provider prefix:
# ScrapeGraphAI will auto-detect the provider
"llm": {
    "model": "gpt-4o-mini"  # Detected as OpenAI
}

"llm": {
    "model": "llama3.1"  # Detected as Ollama
}
If multiple providers support the same model, the first match is used.
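As a rough illustration (a hypothetical helper, not the library's actual detection code), the provider prefix convention amounts to splitting on the first slash and falling back to a default when no prefix is given:

```python
def split_model_spec(spec: str, default_provider: str = "openai") -> tuple:
    """Hypothetical sketch of the provider/model convention.

    "openai/gpt-4o-mini" splits into ("openai", "gpt-4o-mini"); a bare
    name like "gpt-4o-mini" falls back to the default provider, which is
    the spirit of the auto-detection described above.
    """
    if "/" in spec:
        provider, model = spec.split("/", 1)
        return provider, model
    return default_provider, spec


print(split_model_spec("ollama/llama3.1"))  # ('ollama', 'llama3.1')
```

Using the explicit provider/model form avoids any ambiguity when several providers host a model under the same name.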

Supported Providers

  • OpenAI: GPT-4, GPT-3.5, etc.
  • Anthropic: Claude models
  • Google: Gemini, Vertex AI
  • Groq: Fast inference models
  • Ollama: Local model hosting
  • Azure OpenAI: Azure-hosted models
  • Bedrock: AWS Bedrock models
  • Mistral AI: Mistral models
  • Hugging Face: HF model endpoints

Provider-Specific Settings

"llm": {
    "model": "openai/gpt-4o-mini",
    "api_key": os.getenv("OPENAI_API_KEY"),
    "temperature": 0,
    "streaming": False,
    "model_tokens": 128000  # Optional: override token limit
}
"llm": {
    "model": "anthropic/claude-3-sonnet-20240229",
    "api_key": os.getenv("ANTHROPIC_API_KEY"),
    "temperature": 0,
    "max_tokens": 4096
}
"llm": {
    "model": "ollama/llama3.1",
    "temperature": 0,
    "base_url": "http://localhost:11434",  # Ollama server URL
    "format": "json"  # Force JSON output
}
"llm": {
    "model": "google_genai/gemini-pro",
    "api_key": os.getenv("GOOGLE_API_KEY"),
    "temperature": 0
}
"llm": {
    "model": "groq/llama3-70b-8192",
    "api_key": os.getenv("GROQ_API_KEY"),
    "temperature": 0
}
"llm": {
    "model": "azure_openai/gpt-4",
    "api_key": os.getenv("AZURE_OPENAI_API_KEY"),
    "azure_endpoint": "https://your-resource.openai.azure.com/",
    "api_version": "2024-02-15-preview",
    "azure_deployment": "your-deployment-name",
    "temperature": 0
}

Using Model Instances

You can pass pre-configured model instances:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model="gpt-4o-mini",
    temperature=0,
    api_key="sk-..."
)

graph_config = {
    "llm": {
        "model_instance": llm,
        "model_tokens": 128000  # Required with model_instance
    },
    "verbose": True
}

Rate Limiting

Control request rates to avoid API limits:
"llm": {
    "model": "openai/gpt-4o-mini",
    "api_key": "sk-...",
    "rate_limit": {
        "requests_per_second": 3,  # Max requests per second
        "max_retries": 5           # Retry attempts on failure
    }
}
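Conceptually, the two settings throttle calls to a target rate and retry failed calls with backoff. The sketch below is a hypothetical wrapper showing that behaviour, not ScrapeGraphAI's internal implementation:

```python
import time


def rate_limited(func, requests_per_second=3, max_retries=5):
    """Illustrative sketch of rate_limit behaviour: space calls so at most
    `requests_per_second` happen per second, and retry failures with
    exponential backoff up to `max_retries` times."""
    min_interval = 1.0 / requests_per_second
    last_call = 0.0

    def wrapper(*args, **kwargs):
        nonlocal last_call
        for attempt in range(max_retries + 1):
            wait = min_interval - (time.monotonic() - last_call)
            if wait > 0:
                time.sleep(wait)                 # throttle to the target rate
            last_call = time.monotonic()
            try:
                return func(*args, **kwargs)
            except Exception:
                if attempt == max_retries:
                    raise                        # retries exhausted
                time.sleep(0.1 * 2 ** attempt)   # backoff (scaled down for the sketch)

    return wrapper
```

In practice the library applies this kind of throttling around its LLM calls, so you only set the numbers in the config.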

Behavior Settings

Verbose Mode

Enable detailed logging for debugging:
"verbose": True  # Shows detailed execution logs
Output:
--- Executing Fetch Node ---
Fetching HTML from: https://example.com
--- Executing Parse Node ---
Parsing document into 5 chunks
--- Executing GenerateAnswer Node ---
Processing chunks: 100%|██████████| 5/5 [00:03<00:00]

Headless Mode

Control browser visibility:
"headless": True   # Browser runs in background (default)
"headless": False  # Browser window visible (debugging)
Set headless: False when debugging to see what the browser is doing.

Timeout

Set request timeout in seconds:
"timeout": 30      # 30 seconds (default)
"timeout": 120     # 2 minutes for slow sites
"timeout": None    # No timeout (not recommended)
Applies to:
  • HTTP requests in FetchNode
  • Browser loading in ChromiumLoader
  • LLM generation requests

Cache Path

Enable result caching:
"cache_path": "./scraping_cache"  # Cache to specific directory
"cache_path": False               # Disable caching (default)

Browser Configuration

Loader Arguments

Pass custom arguments to the browser loader:
"loader_kwargs": {
    "proxy": "http://proxy.example.com:8080",
    "wait_until": "networkidle",  # Wait for network idle
    "timeout": 60000,              # Browser timeout (ms)
    "user_agent": "Custom User Agent"
}

Storage State

Use saved browser sessions (cookies, local storage):
"storage_state": "./auth_state.json"  # Load saved session
Useful for:
  • Logged-in scraping
  • Session persistence
  • Cookie-based access
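Assuming your setup uses Playwright (which the browser loader builds on), one way to produce such a session file is Playwright's codegen tool:

```bash
# Opens a browser: log in manually, then close it. Playwright writes the
# cookies and local storage to the given file for later reuse.
playwright codegen --save-storage=auth_state.json https://example.com/login
```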

Browser Services

"browser_base": {
    "api_key": "your-browserbase-key",
    "project_id": "your-project-id"
}
Uses BrowserBase for cloud browser automation.
"scrape_do": {
    "api_key": "your-scrapedo-key",
    "use_proxy": True,
    "geoCode": "us",
    "super_proxy": False
}
Uses ScrapeDo proxy service for scraping.

Advanced Settings

HTML Mode

Skip parsing and send raw HTML to LLM:
"html_mode": True  # Skip ParseNode, use raw HTML
Use when:
  • Content is already clean
  • You want maximum context
  • Parsing breaks important structure

Force Markdown

Force markdown conversion regardless of model:
"force": True  # Always convert to markdown
Default behavior:
  • OpenAI models: Automatic markdown conversion
  • Other models: No conversion unless forced

Reasoning Mode

Enable chain-of-thought reasoning:
"reasoning": True  # Add ReasoningNode to pipeline
Adds a reasoning step before answer generation for better quality.

Reattempt Failed Extractions

Retry when extraction fails:
"reattempt": True  # Retry if answer is empty or "NA"
Adds a ConditionalNode that checks answer quality and regenerates if needed.

Additional Context

Provide extra context to the LLM:
"additional_info": "Focus on products under $50. Ignore out of stock items."
This text is prepended to prompts in GenerateAnswerNode.
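The effect is roughly the following (the exact prompt template is internal to GenerateAnswerNode, so treat this as an approximation):

```python
def build_prompt(user_prompt: str, additional_info: str = "") -> str:
    """Approximate how additional_info is combined with the user prompt."""
    if additional_info:
        return f"{additional_info}\n\n{user_prompt}"
    return user_prompt


print(build_prompt(
    "Extract all available products",
    "Focus on products under $50. Ignore out of stock items.",
))
```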

Reduction Factor

Control HTML reduction (Code Generator only):
"reduction": 2  # Reduce HTML size by factor of 2

Max Iterations

Control code generation iterations:
"max_iterations": {
    "overall": 10,
    "syntax": 3,
    "execution": 3,
    "validation": 3,
    "semantic": 3
}

Burr Integration

Enable workflow tracking with Burr:
"burr_kwargs": {
    "app_instance_id": "scraping-session-123",
    "project_name": "my-scraper",
    "storage_dir": "./burr_state"
}
Burr provides state visualization, debugging tools, and execution replay capabilities.

Complete Configuration Example

Here’s a comprehensive example:
import os
from scrapegraphai.graphs import SmartScraperGraph
from pydantic import BaseModel, Field
from typing import List

# Define output schema
class Product(BaseModel):
    name: str = Field(description="Product name")
    price: float = Field(description="Price in USD")
    available: bool = Field(description="In stock status")

class Products(BaseModel):
    products: List[Product]

# Complete configuration
graph_config = {
    # LLM Configuration (Required)
    "llm": {
        "model": "openai/gpt-4o-mini",
        "api_key": os.getenv("OPENAI_API_KEY"),
        "temperature": 0,
        "streaming": False,
        "rate_limit": {
            "requests_per_second": 3,
            "max_retries": 5
        }
    },
    
    # Behavior Settings
    "verbose": True,              # Show detailed logs
    "headless": True,             # Run browser in background
    "timeout": 60,                # 60 second timeout
    "cache_path": "./cache",      # Enable caching
    
    # Browser Configuration
    "loader_kwargs": {
        "wait_until": "networkidle",
        "timeout": 60000
    },
    
    # Processing Options
    "html_mode": False,           # Use parsed text
    "force": True,                # Force markdown conversion
    "cut": True,                  # Enable HTML cleanup
    "reasoning": False,           # Disable reasoning step
    "reattempt": True,            # Retry on failure
    
    # Additional Context
    "additional_info": "Extract only products under $100"
}

# Create and run graph
scraper = SmartScraperGraph(
    prompt="Extract all available products",
    source="https://example.com/shop",
    config=graph_config,
    schema=Products
)

result = scraper.run()
print(result)

Environment Variables

Store sensitive data in environment variables:
# .env file
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
GROQ_API_KEY=gsk_...
GOOGLE_API_KEY=AIza...
from dotenv import load_dotenv
import os

load_dotenv()

graph_config = {
    "llm": {
        "model": "openai/gpt-4o-mini",
        "api_key": os.getenv("OPENAI_API_KEY")
    }
}
Never commit API keys to version control. Always use environment variables or secure vaults.

Configuration Validation

ScrapeGraphAI validates configuration at runtime:
try:
    graph = SmartScraperGraph(
        prompt="Extract data",
        source="https://example.com",
        config=invalid_config
    )
except ValueError as e:
    print(f"Configuration error: {e}")
    # Handle configuration errors
Common errors:
  • Missing llm configuration
  • Invalid model provider
  • Missing required API keys
  • Invalid parameter types
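To fail fast before constructing a graph, you can run a small pre-flight check of your own. This is a hypothetical helper mirroring the errors listed above, not part of the library:

```python
def check_config(config: dict) -> list:
    """Collect obvious configuration problems before building a graph."""
    errors = []
    llm = config.get("llm")
    if not isinstance(llm, dict):
        errors.append("missing 'llm' configuration")
    elif "model" not in llm and "model_instance" not in llm:
        errors.append("'llm' needs a 'model' name or a 'model_instance'")
    elif "model_instance" in llm and "model_tokens" not in llm:
        errors.append("'model_tokens' is required with 'model_instance'")
    timeout = config.get("timeout", 0)
    if timeout is not None and not isinstance(timeout, (int, float)):
        errors.append("'timeout' must be a number or None")
    return errors


assert check_config({"llm": {"model": "openai/gpt-4o-mini"}}) == []
```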

Best Practices

Begin with minimal configuration:
graph_config = {
    "llm": {
        "model": "openai/gpt-4o-mini",
        "api_key": os.getenv("OPENAI_API_KEY")
    }
}
Add settings as needed.
Store credentials securely:
# Good
"api_key": os.getenv("OPENAI_API_KEY")

# Bad
"api_key": "sk-hardcoded-key-123"
Use verbose mode during development:
"verbose": True if os.getenv("DEBUG") else False
Adjust timeouts based on site complexity:
# Fast static sites
"timeout": 30

# Slow dynamic sites
"timeout": 120

Next Steps

  • Schemas: Define structured output with Pydantic
  • Graphs: Learn about graph types and workflows
  • Examples: See configuration examples
  • API Reference: Complete API documentation