The SmartScraperMultiGraph allows you to scrape multiple webpages in a single operation, perfect for gathering data from related pages or different sections of a website.
## Overview
This example demonstrates how to:

- Scrape multiple URLs with one graph instance
- Process results from different sources
- Aggregate data from multiple pages
- Handle different page structures
## Complete Example
Here’s a working example that scrapes information from multiple portfolio pages:
```python
import json
import os

from dotenv import load_dotenv
from scrapegraphai.graphs import SmartScraperMultiGraph

load_dotenv()

# Define the configuration for the graph
openai_key = os.getenv("OPENAI_APIKEY")

graph_config = {
    "llm": {
        "api_key": openai_key,
        "model": "openai/gpt-4o",
    },
    "verbose": True,
    "headless": False,
}

# Create the SmartScraperMultiGraph instance and run it
multiple_search_graph = SmartScraperMultiGraph(
    prompt="Who is Marco Perini?",
    source=[
        "https://perinim.github.io/",
        "https://perinim.github.io/cv/"
    ],
    schema=None,
    config=graph_config,
)

result = multiple_search_graph.run()
print(json.dumps(result, indent=4))
```
## Step-by-Step Breakdown
### 1. Import dependencies

```python
import json
import os

from dotenv import load_dotenv
from scrapegraphai.graphs import SmartScraperMultiGraph

load_dotenv()
```
Import SmartScraperMultiGraph for multi-page scraping capabilities.
### 2. Configure the graph

```python
graph_config = {
    "llm": {
        "api_key": os.getenv("OPENAI_APIKEY"),
        "model": "openai/gpt-4o",
    },
    "verbose": True,
    "headless": False,
}
```
Use GPT-4o for better understanding across multiple page contexts.
### 3. Define multiple sources

```python
multiple_search_graph = SmartScraperMultiGraph(
    prompt="Who is Marco Perini?",
    source=[
        "https://perinim.github.io/",
        "https://perinim.github.io/cv/"
    ],
    schema=None,
    config=graph_config,
)
```
Pass a list of URLs as the source parameter. The graph will scrape all pages and aggregate the results.
### 4. Run and process results

```python
result = multiple_search_graph.run()
print(json.dumps(result, indent=4))
```
Results from all pages are combined into a single response.
## Multi-Graph Variants

### Standard Multi

```python
from scrapegraphai.graphs import SmartScraperMultiGraph

# Scrapes all pages and aggregates results
graph = SmartScraperMultiGraph(
    prompt="Extract contact information",
    source=[
        "https://example.com/about",
        "https://example.com/contact",
        "https://example.com/team"
    ],
    config=graph_config,
)
```

Standard multi-page scraping with result aggregation.

### Multi Concat

```python
from scrapegraphai.graphs import SmartScraperMultiConcatGraph

# Concatenates all page content before processing
graph = SmartScraperMultiConcatGraph(
    prompt="Find all product mentions",
    source=[
        "https://example.com/products/page1",
        "https://example.com/products/page2",
        "https://example.com/products/page3"
    ],
    config=graph_config,
)
```

Combines all HTML content into one document before extraction.

### Multi Lite

```python
from scrapegraphai.graphs import SmartScraperMultiLiteGraph

# Lightweight version for faster processing
graph = SmartScraperMultiLiteGraph(
    prompt="Get page titles",
    source=[
        "https://example.com/page1",
        "https://example.com/page2",
        "https://example.com/page3"
    ],
    config=graph_config,
)
```

Optimized for speed with reduced token usage.
## Expected Output
The results are organized by source URL:
```json
[
    {
        "source": "https://perinim.github.io/",
        "data": {
            "name": "Marco Perini",
            "role": "Software Engineer",
            "bio": "Passionate developer with experience in..."
        }
    },
    {
        "source": "https://perinim.github.io/cv/",
        "data": {
            "experience": [
                {
                    "company": "Tech Corp",
                    "position": "Senior Developer",
                    "duration": "2020-Present"
                }
            ],
            "skills": ["Python", "JavaScript", "AI"]
        }
    }
]
```
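If you need one merged record rather than per-page entries, the per-source structure above can be flattened with plain dict handling. A minimal sketch, assuming the list-of-`{"source", "data"}` shape shown above (the sample data here is abbreviated for illustration):

```python
# Merge per-source results into one flat dict. Keys that appear on
# several pages keep the value from the first page seen.
def merge_results(results: list) -> dict:
    merged = {}
    for entry in results:
        for key, value in entry.get("data", {}).items():
            merged.setdefault(key, value)
    return merged

results = [
    {"source": "https://perinim.github.io/", "data": {"name": "Marco Perini"}},
    {"source": "https://perinim.github.io/cv/", "data": {"skills": ["Python"]}},
]
profile = merge_results(results)
# profile == {"name": "Marco Perini", "skills": ["Python"]}
```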
## Common Use Cases

- **Product Catalogs**: Scrape multiple product pages to build a complete catalog
- **News Aggregation**: Collect articles from different sections or categories
- **Competitor Analysis**: Gather data from multiple competitor websites
- **Portfolio Scraping**: Extract information from various profile or portfolio pages
- **Parallel Processing**: Pages are scraped concurrently for better performance.
- **Token Usage**: Multi-page scraping consumes more tokens. Consider using the Lite variant for simple tasks.
- **Rate Limiting**: Be mindful of rate limits when scraping many pages. Add delays if needed.
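One simple way to add delays is to split the source list into small batches and pause between them. A sketch using only the standard library; the batch size and delay are arbitrary, and each batch would be passed as `source` to its own graph run:

```python
import time

def batched(urls: list, size: int) -> list:
    """Split a URL list into batches of at most `size` URLs."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]

urls = [f"https://example.com/page{n}" for n in range(1, 8)]

for batch in batched(urls, size=3):
    # Each batch would be passed as `source=batch` to a
    # SmartScraperMultiGraph here, and its .run() result collected.
    print(batch)
    time.sleep(1)  # crude delay to stay under rate limits
```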
## Advanced Example: Multi-Page with Schema
```python
from typing import List

from pydantic import BaseModel, Field
from scrapegraphai.graphs import SmartScraperMultiGraph

class ContactInfo(BaseModel):
    name: str = Field(description="Person's name")
    email: str = Field(description="Email address")
    role: str = Field(description="Job title or role")

class TeamMembers(BaseModel):
    members: List[ContactInfo]

multi_graph = SmartScraperMultiGraph(
    prompt="Extract all team member information",
    source=[
        "https://company.com/team/engineering",
        "https://company.com/team/design",
        "https://company.com/team/marketing"
    ],
    schema=TeamMembers,
    config=graph_config,
)

result = multi_graph.run()
```
This combines multi-page scraping with schema validation for structured output.
## Handling Different Page Structures
The AI automatically adapts to different page layouts:
```python
multi_graph = SmartScraperMultiGraph(
    prompt="Extract pricing information",
    source=[
        "https://competitor1.com/pricing",  # Table layout
        "https://competitor2.com/plans",    # Card layout
        "https://competitor3.com/pricing"   # List layout
    ],
    config=graph_config,
)
```
The graph understands your prompt and extracts relevant data regardless of HTML structure.
## Error Handling
```python
try:
    result = multiple_search_graph.run()

    # Check for partial failures
    for page_result in result:
        if "error" in page_result:
            print(f"Failed to scrape {page_result['source']}: {page_result['error']}")
        else:
            print(f"Successfully scraped {page_result['source']}")
except Exception as e:
    print(f"Scraping failed: {e}")
```
## Next Steps

- **Custom Schemas**: Add structure to your multi-page results
- **Local Documents**: Process multiple local files
## Tips for Multi-Page Scraping

- **Group related pages**: Scrape pages with similar content together for better context
- **Use specific prompts**: Be clear about what information should be extracted from all pages
- **Monitor performance**: Use `get_execution_info()` to track time and token usage
- **Handle failures gracefully**: Some pages might fail; ensure your code handles partial results
- **Consider pagination**: For paginated content, generate URLs programmatically
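For the pagination tip, the URL list can be generated with a comprehension before being handed to the graph. A minimal sketch; the URL pattern here is hypothetical:

```python
# Hypothetical paginated listing: /products?page=1 .. /products?page=5
base_url = "https://example.com/products"
page_urls = [f"{base_url}?page={n}" for n in range(1, 6)]
# page_urls[0] == "https://example.com/products?page=1"

# The generated list can then be passed directly as the `source`
# parameter of a SmartScraperMultiGraph.
```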