Overview

OpenAI provides industry-leading language models that work excellently with ScrapeGraphAI. GPT-4o and GPT-4o-mini offer the best balance of accuracy and performance for web scraping tasks.

Prerequisites

1. Get API Key

Sign up at platform.openai.com and create an API key from the API keys page.

2. Install ScrapeGraphAI

pip install scrapegraphai
playwright install

3. Set Environment Variable

export OPENAI_API_KEY="sk-..."

Or use a .env file:

OPENAI_API_KEY=sk-...
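
Before running a scraper, it can help to confirm that the variable is actually visible to Python. The following is a minimal sketch, not part of ScrapeGraphAI: `require_api_key` is a hypothetical helper name, and it only checks that the value is present and non-empty.

```python
import os


def require_api_key(env=os.environ, name="OPENAI_API_KEY"):
    """Return the API key from env, or raise a clear error if it is unset."""
    key = (env.get(name) or "").strip()
    if not key:
        # Fail early with an actionable message instead of a later API error.
        raise RuntimeError(
            f"{name} is not set; export it or add it to your .env file"
        )
    return key
```

If you use a .env file, call load_dotenv() first, then pass the returned value as "api_key" in the "llm" section of graph_config.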

Basic Configuration

import os
from dotenv import load_dotenv
from scrapegraphai.graphs import SmartScraperGraph

load_dotenv()

graph_config = {
    "llm": {
        "api_key": os.getenv("OPENAI_API_KEY"),
        "model": "openai/gpt-4o-mini",
    },
    "verbose": True,
    "headless": False,
}

smart_scraper_graph = SmartScraperGraph(
    prompt="Extract me the first article",
    source="https://www.wired.com",
    config=graph_config,
)

result = smart_scraper_graph.run()
print(result)

This example is from the official examples: examples/smart_scraper_graph/openai/smart_scraper_openai.py

Available Models

Configuration Options

Temperature

Control randomness in responses (0 = near-deterministic, 1 = more creative):
graph_config = {
    "llm": {
        "api_key": os.getenv("OPENAI_API_KEY"),
        "model": "openai/gpt-4o-mini",
        "temperature": 0,  # Recommended for scraping
    },
}

Use temperature: 0 for consistent scraping results. Higher values (0.7-1.0) may cause inconsistent data extraction.

Model Tokens

Specify the maximum context window:
graph_config = {
    "llm": {
        "api_key": os.getenv("OPENAI_API_KEY"),
        "model": "openai/gpt-4o-mini",
        "model_tokens": 128000,  # GPT-4o mini context size
    },
}

Base URL (Optional)

Use a custom OpenAI-compatible endpoint:
graph_config = {
    "llm": {
        "api_key": os.getenv("OPENAI_API_KEY"),
        "model": "openai/gpt-4o-mini",
        "base_url": "https://custom-endpoint.com/v1",
    },
}
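
If the endpoint differs between environments, one option is to read it from the environment alongside the key. The sketch below makes two assumptions: `OPENAI_BASE_URL` is a hypothetical variable name chosen for this example (ScrapeGraphAI is not known to read it on its own), and `build_llm_config` is a hypothetical helper. It adds "base_url" only when the variable is set.

```python
import os


def build_llm_config(env=os.environ, model="openai/gpt-4o-mini"):
    """Return the "llm" section of graph_config.

    OPENAI_BASE_URL is a hypothetical variable name used by this sketch only.
    """
    llm = {
        "api_key": env.get("OPENAI_API_KEY"),
        "model": model,
    }
    base_url = env.get("OPENAI_BASE_URL")
    if base_url:
        # Override the endpoint only when one is explicitly configured.
        llm["base_url"] = base_url
    return llm


graph_config = {"llm": build_llm_config()}
```

With the variable unset, the resulting config matches the basic configuration and uses the default endpoint.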

Complete Example

import json
import os
from dotenv import load_dotenv
from scrapegraphai.graphs import SmartScraperGraph
from scrapegraphai.utils import prettify_exec_info

load_dotenv()

graph_config = {
    "llm": {
        "api_key": os.getenv("OPENAI_API_KEY"),
        "model": "openai/gpt-4o-mini",
    },
    "verbose": True,
    "headless": False,
}

smart_scraper_graph = SmartScraperGraph(
    prompt="Extract me the first article",
    source="https://www.wired.com",
    config=graph_config,
)

result = smart_scraper_graph.run()
print(json.dumps(result, indent=4))

# Get execution info
graph_exec_info = smart_scraper_graph.get_execution_info()
print(prettify_exec_info(graph_exec_info))

Cost Optimization

GPT-4o-mini is far cheaper than GPT-4 and handles most scraping tasks well:
"model": "openai/gpt-4o-mini"  # Recommended
Limit token usage for simple extractions:
"llm": {
    "api_key": os.getenv("OPENAI_API_KEY"),
    "model": "openai/gpt-4o-mini",
    "max_tokens": 1000,  # Limit response length
}
Run the browser in headless mode for faster scraping:
graph_config = {
    "llm": {...},
    "headless": True,  # Faster scraping
}
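
The fragments above set keys at two different levels: "model" and "max_tokens" belong under "llm", while "headless" is a top-level key. A minimal sketch that combines them into one config follows; `cost_optimized_config` is a hypothetical helper name, and the values are the ones shown above.

```python
import os


def cost_optimized_config(api_key=None):
    """Combine the cost-related options into a single graph_config dict."""
    return {
        "llm": {
            "api_key": api_key or os.getenv("OPENAI_API_KEY"),
            "model": "openai/gpt-4o-mini",  # Cheaper model
            "max_tokens": 1000,  # Limit response length
        },
        "headless": True,  # No visible browser window
    }
```

The returned dict can be passed as config= when constructing SmartScraperGraph.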

Troubleshooting

Error: AuthenticationError: No API key provided

Solution: Ensure your API key is set:
# Option 1: Environment variable
export OPENAI_API_KEY="sk-..."

# Option 2: .env file
OPENAI_API_KEY=sk-...

# Option 3: Direct in code (not recommended)
"api_key": "sk-..."
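
Error: RateLimitError: Rate limit reached

Solution: Implement delays between requests:
import time

# `urls`, `prompt`, and `graph_config` are defined elsewhere in your script
for url in urls:
    # Build a scraper per URL so each page is actually fetched
    scraper = SmartScraperGraph(prompt=prompt, source=url, config=graph_config)
    result = scraper.run()
    time.sleep(1)  # Wait 1 second between requests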
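
A fixed delay may not be enough when the limit is hit repeatedly. One common alternative is to retry with exponentially growing delays. The sketch below is generic and not a ScrapeGraphAI feature: `run_with_backoff` is a hypothetical helper, and `retry_on` defaults to Exception because the exact rate-limit exception class depends on the installed client version, so narrow it to that class in your own setup.

```python
import time


def run_with_backoff(run, max_attempts=5, base_delay=1.0, sleep=time.sleep,
                     retry_on=(Exception,)):
    """Call run() and retry with exponentially growing delays on failure."""
    for attempt in range(max_attempts):
        try:
            return run()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the last error.
            # Delay doubles each time: base_delay, 2x, 4x, ...
            sleep(base_delay * 2 ** attempt)
```

Usage: result = run_with_backoff(scraper.run)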

Error: InvalidRequestError: The model 'gpt-4o' does not exist

Solution: Use the correct model format:
"model": "openai/gpt-4o-mini"  # Include 'openai/' prefix

Best Practices

Use Environment Variables

Never hardcode API keys in your code:
api_key = os.getenv("OPENAI_API_KEY")

Set Temperature to 0

For consistent scraping results:
"temperature": 0

Enable Verbose Mode

During development to see what’s happening:
"verbose": True

Use Schemas

For structured data extraction:
schema=MySchema

Next Steps

Advanced Configuration

Learn about proxy rotation, custom headers, and timeouts

Ollama (Local)

Run models locally for free