Overview
OpenAI provides industry-leading language models that pair well with ScrapeGraphAI. GPT-4o and GPT-4o-mini offer the best balance of accuracy and performance for web scraping tasks.
Prerequisites
Install ScrapeGraphAI
```bash
pip install scrapegraphai
playwright install
```
Set Environment Variable
```bash
export OPENAI_API_KEY="sk-..."
```
Or use a .env file:
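The .env file needs only a single line (the key shown is a placeholder):

```shell
OPENAI_API_KEY=sk-...
```

The examples below load it with `load_dotenv()` from the python-dotenv package.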
Basic Configuration
```python
import os

from dotenv import load_dotenv
from scrapegraphai.graphs import SmartScraperGraph

load_dotenv()

graph_config = {
    "llm": {
        "api_key": os.getenv("OPENAI_API_KEY"),
        "model": "openai/gpt-4o-mini",
    },
    "verbose": True,
    "headless": False,
}

smart_scraper_graph = SmartScraperGraph(
    prompt="Extract me the first article",
    source="https://www.wired.com",
    config=graph_config,
)

result = smart_scraper_graph.run()
print(result)
```
This example is from the official examples: examples/smart_scraper_graph/openai/smart_scraper_openai.py
Available Models
Recommended
All Models
Legacy
GPT-4o Mini (Recommended)
Best balance of cost, speed, and accuracy:
```python
"llm": {
    "api_key": os.getenv("OPENAI_API_KEY"),
    "model": "openai/gpt-4o-mini",
}
```
Context: 128K tokens
Best for: Most scraping tasks
Cost: Very affordable
Speed: Fast
GPT-4o
Highest accuracy for complex tasks:
```python
"llm": {
    "api_key": os.getenv("OPENAI_API_KEY"),
    "model": "openai/gpt-4o",
}
```
Context: 128K tokens
Best for: Complex scraping, high accuracy needed
Cost: Premium
ScrapeGraphAI supports all OpenAI models:

| Model | Context Window | Best For |
| --- | --- | --- |
| gpt-4o | 128K | Best accuracy |
| gpt-4o-mini | 128K | Best value |
| gpt-4-turbo | 128K | Vision tasks |
| gpt-3.5-turbo | 16K | Legacy support |
| o1-preview | 200K | Advanced reasoning |
| o1-mini | 128K | Fast reasoning |
Older models (not recommended for new projects):
```python
# GPT-3.5 Turbo
"model": "openai/gpt-3.5-turbo"

# GPT-4 (original)
"model": "openai/gpt-4"
```
Configuration Options
Temperature
Control randomness in responses (0 = deterministic, 1 = creative):
```python
graph_config = {
    "llm": {
        "api_key": os.getenv("OPENAI_API_KEY"),
        "model": "openai/gpt-4o-mini",
        "temperature": 0,  # Recommended for scraping
    },
}
```
Use `temperature: 0` for consistent scraping results. Higher values (0.7–1.0) may cause inconsistent data extraction.
Model Tokens
Specify the maximum context window:
```python
graph_config = {
    "llm": {
        "api_key": os.getenv("OPENAI_API_KEY"),
        "model": "openai/gpt-4o-mini",
        "model_tokens": 128000,  # GPT-4o mini context size
    },
}
```
Base URL (Optional)
Use a custom OpenAI-compatible endpoint:
```python
graph_config = {
    "llm": {
        "api_key": os.getenv("OPENAI_API_KEY"),
        "model": "openai/gpt-4o-mini",
        "base_url": "https://custom-endpoint.com/v1",
    },
}
```
Complete Example
Basic Scraping
Multi-Page Scraping
With Schema
```python
import json
import os

from dotenv import load_dotenv
from scrapegraphai.graphs import SmartScraperGraph
from scrapegraphai.utils import prettify_exec_info

load_dotenv()

graph_config = {
    "llm": {
        "api_key": os.getenv("OPENAI_API_KEY"),
        "model": "openai/gpt-4o-mini",
    },
    "verbose": True,
    "headless": False,
}

smart_scraper_graph = SmartScraperGraph(
    prompt="Extract me the first article",
    source="https://www.wired.com",
    config=graph_config,
)

result = smart_scraper_graph.run()
print(json.dumps(result, indent=4))

# Get execution info
graph_exec_info = smart_scraper_graph.get_execution_info()
print(prettify_exec_info(graph_exec_info))
```
Cost Optimization
Use GPT-4o-mini for Most Tasks
GPT-4o-mini is far cheaper than GPT-4 and handles the vast majority of scraping tasks well:
```python
"model": "openai/gpt-4o-mini"  # Recommended
```
Limit token usage for simple extractions:
```python
"llm": {
    "api_key": os.getenv("OPENAI_API_KEY"),
    "model": "openai/gpt-4o-mini",
    "max_tokens": 1000,  # Limit response length
}
```
Faster scraping means fewer API calls:
```python
graph_config = {
    "llm": { ... },
    "headless": True,  # Faster scraping
}
```
Troubleshooting
Error: `AuthenticationError: No API key provided`
Solution: Ensure your API key is set:
```bash
# Option 1: Environment variable
export OPENAI_API_KEY="sk-..."

# Option 2: .env file
OPENAI_API_KEY=sk-...

# Option 3: Directly in code (not recommended)
"api_key": "sk-..."
```
Error: `RateLimitError: Rate limit reached`
Solution: Implement delays between requests:
```python
import time

for url in urls:  # urls: list of pages to scrape
    scraper = SmartScraperGraph(prompt=prompt, source=url, config=graph_config)
    result = scraper.run()
    time.sleep(1)  # Wait 1 second between requests
```
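For heavier workloads, a fixed sleep wastes time when the API is healthy and may be too short when it is throttling. A generic exponential-backoff wrapper (a sketch of a common pattern, not part of the ScrapeGraphAI API) retries with growing, jittered delays:

```python
import random
import time


def with_backoff(fn, max_retries=5, base_delay=1.0):
    """Call fn(); on failure, retry with exponentially growing, jittered delays."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise  # Give up after the last attempt
            # base_delay, 2*base_delay, 4*base_delay, ... with up to 2x random jitter
            time.sleep(base_delay * (2 ** attempt) * (1 + random.random()))


# Usage with a scraper: result = with_backoff(smart_scraper_graph.run)
```

In production you would catch only the specific rate-limit exception rather than bare `Exception`, so that genuine bugs still surface immediately.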
Error: `InvalidRequestError: The model 'gpt-4o' does not exist`
Solution: Use the correct model format:
```python
"model": "openai/gpt-4o-mini"  # Include the 'openai/' prefix
```
Best Practices
Use Environment Variables
Never hardcode API keys in your code:
```python
api_key = os.getenv("OPENAI_API_KEY")
```
Set Temperature to 0
This keeps scraping results consistent across runs.
Enable Verbose Mode
During development, verbose output shows what each step is doing.
Use Schemas
Schemas give you structured, validated data instead of free-form output.
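Putting these practices together, a typical development-time configuration (same key name and model as in the examples above) might look like:

```python
import os

# Development configuration: deterministic output, verbose logging
graph_config = {
    "llm": {
        "api_key": os.getenv("OPENAI_API_KEY"),  # Read from the environment, never hardcoded
        "model": "openai/gpt-4o-mini",
        "temperature": 0,  # Deterministic, consistent extractions
    },
    "verbose": True,   # Log each step while developing
    "headless": False, # Watch the browser while debugging; switch to True in production
}
```

For production, flip `verbose` off and `headless` on to reduce noise and speed up runs.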
Next Steps
Advanced Configuration Learn about proxy rotation, custom headers, and timeouts
Ollama (Local) Run models locally for free