The SmartScraperGraph is the simplest and most powerful way to extract data from a single webpage using natural language prompts.
## Overview

This example demonstrates how to:

- Configure a basic scraping graph
- Use natural language to describe what you want to extract
- Process and display the results
- Monitor execution details
## Complete Example
Here’s a working example that extracts an article from Wired.com:
```python
import json
import os

from dotenv import load_dotenv

from scrapegraphai.graphs import SmartScraperGraph
from scrapegraphai.utils import prettify_exec_info

load_dotenv()

# Define the configuration for the graph
graph_config = {
    "llm": {
        "api_key": os.getenv("OPENAI_API_KEY"),
        "model": "openai/gpt-4o-mini",
    },
    "verbose": True,
    "headless": False,
}

# Create the SmartScraperGraph instance and run it
smart_scraper_graph = SmartScraperGraph(
    prompt="Extract me the first article",
    source="https://www.wired.com",
    config=graph_config,
)

result = smart_scraper_graph.run()
print(json.dumps(result, indent=4))

# Get graph execution info
graph_exec_info = smart_scraper_graph.get_execution_info()
print(prettify_exec_info(graph_exec_info))
```
## Step-by-Step Breakdown

### Import dependencies

```python
import json
import os

from dotenv import load_dotenv

from scrapegraphai.graphs import SmartScraperGraph
from scrapegraphai.utils import prettify_exec_info

load_dotenv()
```

Import the required modules and load environment variables from your `.env` file.
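For reference, a minimal `.env` file for this example could look like the following (the variable name matches the `os.getenv` call in the configuration; the value shown is a placeholder, not a real key):

```
OPENAI_API_KEY=sk-your-key-here
```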
### Configure the graph

```python
graph_config = {
    "llm": {
        "api_key": os.getenv("OPENAI_API_KEY"),
        "model": "openai/gpt-4o-mini",
    },
    "verbose": True,
    "headless": False,
}
```
Define your configuration with:

- `llm`: The language model to use (OpenAI GPT-4o-mini in this case)
- `verbose`: Enable detailed logging
- `headless`: Set to `False` to see the browser in action
### Create and run the graph

```python
smart_scraper_graph = SmartScraperGraph(
    prompt="Extract me the first article",
    source="https://www.wired.com",
    config=graph_config,
)

result = smart_scraper_graph.run()
```
Create a `SmartScraperGraph` instance with:

- `prompt`: A natural language description of what to extract
- `source`: The URL of the webpage to scrape
- `config`: Your configuration dictionary
### Process the results

```python
print(json.dumps(result, indent=4))

# Get execution details
graph_exec_info = smart_scraper_graph.get_execution_info()
print(prettify_exec_info(graph_exec_info))
```
Display the extracted data and execution information for debugging.
## Configuration Options

### OpenAI

```python
graph_config = {
    "llm": {
        "api_key": os.getenv("OPENAI_API_KEY"),
        "model": "openai/gpt-4o-mini",
    },
    "verbose": True,
    "headless": True,
}
```

### Ollama (Local)

```python
graph_config = {
    "llm": {
        "model": "ollama/llama3",
        "base_url": "http://localhost:11434",
    },
    "verbose": True,
    "headless": True,
}
```

### Azure OpenAI

```python
graph_config = {
    "llm": {
        "api_key": os.getenv("AZURE_OPENAI_KEY"),
        "model": "azure/gpt-4o",
        "api_base": "https://your-resource.openai.azure.com",
        "api_version": "2024-02-01",
    },
    "verbose": True,
    "headless": True,
}
```
## Expected Output

The script returns structured JSON data; the exact fields depend on your prompt and the page. For example:

```json
{
    "title": "The Latest in AI: What You Need to Know",
    "author": "John Doe",
    "date": "2024-03-15",
    "content": "Artificial intelligence continues to evolve...",
    "url": "https://www.wired.com/story/ai-latest-news"
}
```
## Common Use Cases

- **News Articles**: Extract headlines, authors, dates, and content from news websites
- **Product Information**: Scrape product names, prices, descriptions, and reviews
- **Contact Details**: Extract emails, phone numbers, and addresses from business websites
- **Event Data**: Gather event names, dates, locations, and descriptions
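Each of these use cases maps directly onto a prompt. The prompts below are illustrative sketches, not canonical examples from the library; adapt them to your target pages:

```python
# Illustrative prompts for common use cases; pair each with an
# appropriate source URL when constructing a SmartScraperGraph.
use_case_prompts = {
    "news": "Extract the headline, author, publication date, and full text of the main article",
    "products": "Extract the product name, price, description, and average review rating",
    "contacts": "Extract all email addresses, phone numbers, and the postal address",
    "events": "Extract the event name, date, location, and a short description",
}

for name, prompt in use_case_prompts.items():
    print(f"{name}: {prompt}")
```

Notice that each prompt names the exact fields it wants back, which is what the tips below recommend.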
## Tips for Better Results

- **Be specific in your prompts**: Instead of "get data", use "Extract the article title, author name, publication date, and first paragraph".
- **Use headless mode for production**: Set `"headless": True` to run the browser in the background for better performance.
- **Handle errors gracefully**: Wrap your scraping code in try-except blocks to handle network issues and parsing errors.
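The error-handling tip can be sketched as a small retry wrapper. This is a generic pattern, not part of the scrapegraphai API; `run_with_retries` and the broad exception handling shown are illustrative:

```python
import time

def run_with_retries(run, attempts=3, delay_seconds=2.0):
    """Call a zero-argument scraper entry point (e.g. smart_scraper_graph.run),
    retrying on failure with a fixed delay between attempts."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return run()
        except Exception as error:  # network issues, parsing errors, etc.
            last_error = error
            print(f"Attempt {attempt}/{attempts} failed: {error}")
            if attempt < attempts:
                time.sleep(delay_seconds)
    raise last_error

# Usage with a SmartScraperGraph instance would look like:
#   result = run_with_retries(smart_scraper_graph.run)
```

In production you would likely narrow the `except` clause to the specific exceptions you expect rather than catching everything.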
## Monitoring Execution

The `get_execution_info()` method provides valuable insights:

```python
graph_exec_info = smart_scraper_graph.get_execution_info()
print(prettify_exec_info(graph_exec_info))
```

This shows:

- Execution time for each node
- Token usage and costs
- Errors or warnings
- Graph traversal path
## Next Steps

- **Multi-Page Scraping**: Learn to scrape multiple URLs at once
- **Custom Schemas**: Define structured output with Pydantic
## Troubleshooting

**Issue**: Browser doesn't open

- Make sure Playwright's browsers are installed: `playwright install`
- Check whether `headless` is set to `True`; the browser window only appears when it is `False`
**Issue**: API rate limits

- Reduce the number of requests
- Add delays between requests
- Use a different model or provider
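The "add delays" suggestion can be sketched as a helper that spaces out successive scrapes. `scrape_all` is a hypothetical helper, and the five-second pause is an arbitrary illustrative value:

```python
import time

def scrape_all(sources, scrape, pause_seconds=5.0):
    """Scrape each source URL in turn, pausing between requests to stay
    under provider rate limits. `scrape` is any callable taking a URL."""
    results = {}
    for index, url in enumerate(sources):
        if index > 0:
            time.sleep(pause_seconds)  # wait before every request after the first
        results[url] = scrape(url)
    return results

# With SmartScraperGraph, `scrape` could build and run one graph per URL.
```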
**Issue**: Extraction is incomplete

- Make your prompt more specific
- Check if the page requires JavaScript rendering
- Verify the page structure hasn't changed