The SmartScraperGraph is the simplest and most powerful way to extract data from a single webpage using natural language prompts.
## Overview

This example demonstrates how to:

- Configure a basic scraping graph
- Use natural language to describe what you want to extract
- Process and display the results
- Monitor execution details
## Complete Example
Here’s a working example that extracts an article from Wired.com:
```python
import json
import os

from dotenv import load_dotenv

from scrapegraphai.graphs import SmartScraperGraph
from scrapegraphai.utils import prettify_exec_info

load_dotenv()

# Define the configuration for the graph
graph_config = {
    "llm": {
        "api_key": os.getenv("OPENAI_API_KEY"),
        "model": "openai/gpt-4o-mini",
    },
    "verbose": True,
    "headless": False,
}

# Create the SmartScraperGraph instance and run it
smart_scraper_graph = SmartScraperGraph(
    prompt="Extract me the first article",
    source="https://www.wired.com",
    config=graph_config,
)

result = smart_scraper_graph.run()
print(json.dumps(result, indent=4))

# Get graph execution info
graph_exec_info = smart_scraper_graph.get_execution_info()
print(prettify_exec_info(graph_exec_info))
```
## Step-by-Step Breakdown

### Import dependencies

```python
import json
import os

from dotenv import load_dotenv

from scrapegraphai.graphs import SmartScraperGraph
from scrapegraphai.utils import prettify_exec_info

load_dotenv()
```

Import the required modules and load environment variables from your `.env` file.
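For reference, a minimal `.env` file for this example could look like the following (the variable name matches the `os.getenv` call in the configuration; the value shown is a placeholder, not a real key):

```
OPENAI_API_KEY=sk-your-key-here
```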
### Configure the graph

```python
graph_config = {
    "llm": {
        "api_key": os.getenv("OPENAI_API_KEY"),
        "model": "openai/gpt-4o-mini",
    },
    "verbose": True,
    "headless": False,
}
```
Define your configuration with:

- `llm`: The language model to use (OpenAI GPT-4o-mini in this case)
- `verbose`: Enable detailed logging
- `headless`: Set to `False` to see the browser in action
### Create and run the graph

```python
smart_scraper_graph = SmartScraperGraph(
    prompt="Extract me the first article",
    source="https://www.wired.com",
    config=graph_config,
)

result = smart_scraper_graph.run()
```
Create a `SmartScraperGraph` instance with:

- `prompt`: A natural language description of what to extract
- `source`: The URL of the webpage to scrape
- `config`: Your configuration dictionary
### Process the results

```python
print(json.dumps(result, indent=4))

# Get execution details
graph_exec_info = smart_scraper_graph.get_execution_info()
print(prettify_exec_info(graph_exec_info))
```
Display the extracted data and execution information for debugging.
## Configuration Options

### OpenAI

```python
graph_config = {
    "llm": {
        "api_key": os.getenv("OPENAI_API_KEY"),
        "model": "openai/gpt-4o-mini",
    },
    "verbose": True,
    "headless": True,
}
```

### Ollama (Local)

```python
graph_config = {
    "llm": {
        "model": "ollama/llama3",
        "base_url": "http://localhost:11434",
    },
    "verbose": True,
    "headless": True,
}
```

### Azure OpenAI

```python
graph_config = {
    "llm": {
        "api_key": os.getenv("AZURE_OPENAI_KEY"),
        "model": "azure/gpt-4o",
        "api_base": "https://your-resource.openai.azure.com",
        "api_version": "2024-02-01",
    },
    "verbose": True,
    "headless": True,
}
```
## Expected Output

The script returns structured JSON data; the exact fields depend on your prompt and the page. For example:

```json
{
    "title": "The Latest in AI: What You Need to Know",
    "author": "John Doe",
    "date": "2024-03-15",
    "content": "Artificial intelligence continues to evolve...",
    "url": "https://www.wired.com/story/ai-latest-news"
}
```
## Common Use Cases

- **News Articles**: Extract headlines, authors, dates, and content from news websites
- **Product Information**: Scrape product names, prices, descriptions, and reviews
- **Contact Details**: Extract emails, phone numbers, and addresses from business websites
- **Event Data**: Gather event names, dates, locations, and descriptions
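Each of these use cases maps directly onto a prompt. The prompts below are illustrative sketches, not canonical examples from the library; adapt them to your target pages:

```python
# Illustrative prompts for common use cases; pair each with an
# appropriate source URL when constructing a SmartScraperGraph.
use_case_prompts = {
    "news": "Extract the headline, author, publication date, and full text of the main article",
    "products": "Extract the product name, price, description, and average review rating",
    "contacts": "Extract all email addresses, phone numbers, and the postal address",
    "events": "Extract the event name, date, location, and a short description",
}

for name, prompt in use_case_prompts.items():
    print(f"{name}: {prompt}")
```

Notice that each prompt names the exact fields it wants back, which is what the tips below recommend.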
## Tips for Better Results

- **Be specific in your prompts**: Instead of "get data", use "Extract the article title, author name, publication date, and first paragraph".
- **Use headless mode for production**: Set `"headless": True` to run the browser in the background for better performance.
- **Handle errors gracefully**: Wrap your scraping code in try-except blocks to handle network issues and parsing errors.
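The error-handling tip can be sketched as a small retry wrapper. This is a generic pattern, not part of the scrapegraphai API; `run_with_retries` and the broad exception handling shown are illustrative:

```python
import time

def run_with_retries(run, attempts=3, delay_seconds=2.0):
    """Call a zero-argument scraper entry point (e.g. smart_scraper_graph.run),
    retrying on failure with a fixed delay between attempts."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return run()
        except Exception as error:  # network issues, parsing errors, etc.
            last_error = error
            print(f"Attempt {attempt}/{attempts} failed: {error}")
            if attempt < attempts:
                time.sleep(delay_seconds)
    raise last_error

# Usage with a SmartScraperGraph instance would look like:
#   result = run_with_retries(smart_scraper_graph.run)
```

In production you would likely narrow the `except` clause to the specific exceptions you expect rather than catching everything.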
## Monitoring Execution

The `get_execution_info()` method provides valuable insights:

```python
graph_exec_info = smart_scraper_graph.get_execution_info()
print(prettify_exec_info(graph_exec_info))
```

This shows:

- Execution time for each node
- Token usage and costs
- Errors or warnings
- Graph traversal path
## Next Steps

- **Multi-Page Scraping**: Learn to scrape multiple URLs at once
- **Custom Schemas**: Define structured output with Pydantic
## Troubleshooting

**Issue**: Browser doesn't open

- Make sure Playwright's browsers are installed: `playwright install`
- Check whether `headless` is set to `True`; the browser window only appears when it is `False`
**Issue**: API rate limits

- Reduce the number of requests
- Add delays between requests
- Use a different model or provider
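The "add delays" suggestion can be sketched as a helper that spaces out successive scrapes. `scrape_all` is a hypothetical helper, and the five-second pause is an arbitrary illustrative value:

```python
import time

def scrape_all(sources, scrape, pause_seconds=5.0):
    """Scrape each source URL in turn, pausing between requests to stay
    under provider rate limits. `scrape` is any callable taking a URL."""
    results = {}
    for index, url in enumerate(sources):
        if index > 0:
            time.sleep(pause_seconds)  # wait before every request after the first
        results[url] = scrape(url)
    return results

# With SmartScraperGraph, `scrape` could build and run one graph per URL.
```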
**Issue**: Extraction is incomplete

- Make your prompt more specific
- Check if the page requires JavaScript rendering
- Verify the page structure hasn't changed