Overview
OpenAI provides industry-leading language models that pair well with ScrapeGraphAI. GPT-4o and GPT-4o-mini offer the best balance of accuracy and performance for web scraping tasks.
Prerequisites
Install ScrapeGraphAI
```bash
pip install scrapegraphai
playwright install
```
Set Environment Variable
```bash
export OPENAI_API_KEY="sk-..."
```
Or use a .env file:
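The .env file needs only a single line (the key shown is a placeholder):

```shell
OPENAI_API_KEY=sk-...
```

The examples below load it with `load_dotenv()` from the python-dotenv package.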
Basic Configuration
```python
import os

from dotenv import load_dotenv
from scrapegraphai.graphs import SmartScraperGraph

load_dotenv()

graph_config = {
    "llm": {
        "api_key": os.getenv("OPENAI_API_KEY"),
        "model": "openai/gpt-4o-mini",
    },
    "verbose": True,
    "headless": False,
}

smart_scraper_graph = SmartScraperGraph(
    prompt="Extract me the first article",
    source="https://www.wired.com",
    config=graph_config,
)

result = smart_scraper_graph.run()
print(result)
```
This example is from the official examples: examples/smart_scraper_graph/openai/smart_scraper_openai.py
Available Models
Recommended
All Models
Legacy
GPT-4o Mini (Recommended)
Best balance of cost, speed, and accuracy:
```python
"llm": {
    "api_key": os.getenv("OPENAI_API_KEY"),
    "model": "openai/gpt-4o-mini",
}
```
Context: 128K tokens
Best for: Most scraping tasks
Cost: Very affordable
Speed: Fast
GPT-4o
Highest accuracy for complex tasks:
```python
"llm": {
    "api_key": os.getenv("OPENAI_API_KEY"),
    "model": "openai/gpt-4o",
}
```
Context: 128K tokens
Best for: Complex scraping, high accuracy needed
Cost: Premium
ScrapeGraphAI supports all OpenAI models:

| Model | Context Window | Best For |
| --- | --- | --- |
| gpt-4o | 128K | Best accuracy |
| gpt-4o-mini | 128K | Best value |
| gpt-4-turbo | 128K | Vision tasks |
| gpt-3.5-turbo | 16K | Legacy support |
| o1-preview | 200K | Advanced reasoning |
| o1-mini | 128K | Fast reasoning |
Older models (not recommended for new projects):
```python
# GPT-3.5 Turbo
"model": "openai/gpt-3.5-turbo"

# GPT-4 (original)
"model": "openai/gpt-4"
```
Configuration Options
Temperature
Control randomness in responses (0 = deterministic, 1 = creative):
```python
graph_config = {
    "llm": {
        "api_key": os.getenv("OPENAI_API_KEY"),
        "model": "openai/gpt-4o-mini",
        "temperature": 0,  # Recommended for scraping
    },
}
```
Use `temperature: 0` for consistent scraping results. Higher values (0.7–1.0) may cause inconsistent data extraction.
Model Tokens
Specify the maximum context window:
```python
graph_config = {
    "llm": {
        "api_key": os.getenv("OPENAI_API_KEY"),
        "model": "openai/gpt-4o-mini",
        "model_tokens": 128000,  # GPT-4o mini context size
    },
}
```
Base URL (Optional)
Use a custom OpenAI-compatible endpoint:
```python
graph_config = {
    "llm": {
        "api_key": os.getenv("OPENAI_API_KEY"),
        "model": "openai/gpt-4o-mini",
        "base_url": "https://custom-endpoint.com/v1",
    },
}
```
Complete Example
Basic Scraping
Multi-Page Scraping
With Schema
```python
import json
import os

from dotenv import load_dotenv
from scrapegraphai.graphs import SmartScraperGraph
from scrapegraphai.utils import prettify_exec_info

load_dotenv()

graph_config = {
    "llm": {
        "api_key": os.getenv("OPENAI_API_KEY"),
        "model": "openai/gpt-4o-mini",
    },
    "verbose": True,
    "headless": False,
}

smart_scraper_graph = SmartScraperGraph(
    prompt="Extract me the first article",
    source="https://www.wired.com",
    config=graph_config,
)

result = smart_scraper_graph.run()
print(json.dumps(result, indent=4))

# Get execution info
graph_exec_info = smart_scraper_graph.get_execution_info()
print(prettify_exec_info(graph_exec_info))
```
Cost Optimization
Use GPT-4o-mini for Most Tasks
GPT-4o-mini is far cheaper than GPT-4 and handles the vast majority of scraping tasks well:
```python
"model": "openai/gpt-4o-mini"  # Recommended
```
Limit token usage for simple extractions:
```python
"llm": {
    "api_key": os.getenv("OPENAI_API_KEY"),
    "model": "openai/gpt-4o-mini",
    "max_tokens": 1000,  # Limit response length
}
```
Faster scraping means fewer API calls:
```python
graph_config = {
    "llm": { ... },
    "headless": True,  # Faster scraping
}
```
Troubleshooting
Error: `AuthenticationError: No API key provided`
Solution: Ensure your API key is set:
```bash
# Option 1: Environment variable
export OPENAI_API_KEY="sk-..."

# Option 2: .env file
OPENAI_API_KEY=sk-...

# Option 3: Directly in code (not recommended)
"api_key": "sk-..."
```
Error: `RateLimitError: Rate limit reached`
Solution: Implement delays between requests:
```python
import time

for url in urls:  # urls: list of pages to scrape
    scraper = SmartScraperGraph(prompt=prompt, source=url, config=graph_config)
    result = scraper.run()
    time.sleep(1)  # Wait 1 second between requests
```
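For heavier workloads, a fixed sleep wastes time when the API is healthy and may be too short when it is throttling. A generic exponential-backoff wrapper (a sketch of a common pattern, not part of the ScrapeGraphAI API) retries with growing, jittered delays:

```python
import random
import time


def with_backoff(fn, max_retries=5, base_delay=1.0):
    """Call fn(); on failure, retry with exponentially growing, jittered delays."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise  # Give up after the last attempt
            # base_delay, 2*base_delay, 4*base_delay, ... with up to 2x random jitter
            time.sleep(base_delay * (2 ** attempt) * (1 + random.random()))


# Usage with a scraper: result = with_backoff(smart_scraper_graph.run)
```

In production you would catch only the specific rate-limit exception rather than bare `Exception`, so that genuine bugs still surface immediately.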
Error: `InvalidRequestError: The model 'gpt-4o' does not exist`
Solution: Use the correct model format:
```python
"model": "openai/gpt-4o-mini"  # Include the 'openai/' prefix
```
Best Practices
Use Environment Variables
Never hardcode API keys in your code:
```python
api_key = os.getenv("OPENAI_API_KEY")
```
Set Temperature to 0
This keeps scraping results consistent across runs.
Enable Verbose Mode
During development, verbose output shows what each step is doing.
Use Schemas
Schemas give you structured, validated data instead of free-form output.
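Putting these practices together, a typical development-time configuration (same key name and model as in the examples above) might look like:

```python
import os

# Development configuration: deterministic output, verbose logging
graph_config = {
    "llm": {
        "api_key": os.getenv("OPENAI_API_KEY"),  # Read from the environment, never hardcoded
        "model": "openai/gpt-4o-mini",
        "temperature": 0,  # Deterministic, consistent extractions
    },
    "verbose": True,   # Log each step while developing
    "headless": False, # Watch the browser while debugging; switch to True in production
}
```

For production, flip `verbose` off and `headless` on to reduce noise and speed up runs.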
Next Steps
Advanced Configuration Learn about proxy rotation, custom headers, and timeouts
Ollama (Local) Run models locally for free