The SmartScraperMultiGraph allows you to scrape multiple webpages in a single operation, perfect for gathering data from related pages or different sections of a website.
## Overview
This example demonstrates how to:

- Scrape multiple URLs with one graph instance
- Process results from different sources
- Aggregate data from multiple pages
- Handle different page structures
## Complete Example
Here’s a working example that scrapes information from multiple portfolio pages:
```python
import json
import os

from dotenv import load_dotenv
from scrapegraphai.graphs import SmartScraperMultiGraph

load_dotenv()

# Define the configuration for the graph
openai_key = os.getenv("OPENAI_APIKEY")

graph_config = {
    "llm": {
        "api_key": openai_key,
        "model": "openai/gpt-4o",
    },
    "verbose": True,
    "headless": False,
}

# Create the SmartScraperMultiGraph instance and run it
multiple_search_graph = SmartScraperMultiGraph(
    prompt="Who is Marco Perini?",
    source=[
        "https://perinim.github.io/",
        "https://perinim.github.io/cv/"
    ],
    schema=None,
    config=graph_config,
)

result = multiple_search_graph.run()
print(json.dumps(result, indent=4))
```
## Step-by-Step Breakdown
### 1. Import dependencies

```python
import json
import os

from dotenv import load_dotenv
from scrapegraphai.graphs import SmartScraperMultiGraph

load_dotenv()
```
Import SmartScraperMultiGraph for multi-page scraping capabilities.
### 2. Configure the graph

```python
graph_config = {
    "llm": {
        "api_key": os.getenv("OPENAI_APIKEY"),
        "model": "openai/gpt-4o",
    },
    "verbose": True,
    "headless": False,
}
```
Use GPT-4o for better understanding across multiple page contexts.
### 3. Define multiple sources

```python
multiple_search_graph = SmartScraperMultiGraph(
    prompt="Who is Marco Perini?",
    source=[
        "https://perinim.github.io/",
        "https://perinim.github.io/cv/"
    ],
    schema=None,
    config=graph_config,
)
```
Pass a list of URLs as the source parameter. The graph will scrape all pages and aggregate the results.
### 4. Run and process results

```python
result = multiple_search_graph.run()
print(json.dumps(result, indent=4))
```
Results from all pages are combined into a single response.
## Multi-Graph Variants

### Standard Multi

```python
from scrapegraphai.graphs import SmartScraperMultiGraph

# Scrapes all pages and aggregates results
graph = SmartScraperMultiGraph(
    prompt="Extract contact information",
    source=[
        "https://example.com/about",
        "https://example.com/contact",
        "https://example.com/team"
    ],
    config=graph_config,
)
```

Standard multi-page scraping with result aggregation.

### Multi Concat

```python
from scrapegraphai.graphs import SmartScraperMultiConcatGraph

# Concatenates all page content before processing
graph = SmartScraperMultiConcatGraph(
    prompt="Find all product mentions",
    source=[
        "https://example.com/products/page1",
        "https://example.com/products/page2",
        "https://example.com/products/page3"
    ],
    config=graph_config,
)
```

Combines all HTML content into one document before extraction.

### Multi Lite

```python
from scrapegraphai.graphs import SmartScraperMultiLiteGraph

# Lightweight version for faster processing
graph = SmartScraperMultiLiteGraph(
    prompt="Get page titles",
    source=[
        "https://example.com/page1",
        "https://example.com/page2",
        "https://example.com/page3"
    ],
    config=graph_config,
)
```

Optimized for speed with reduced token usage.
## Expected Output
The results are organized by source URL:
```json
[
    {
        "source": "https://perinim.github.io/",
        "data": {
            "name": "Marco Perini",
            "role": "Software Engineer",
            "bio": "Passionate developer with experience in..."
        }
    },
    {
        "source": "https://perinim.github.io/cv/",
        "data": {
            "experience": [
                {
                    "company": "Tech Corp",
                    "position": "Senior Developer",
                    "duration": "2020-Present"
                }
            ],
            "skills": ["Python", "JavaScript", "AI"]
        }
    }
]
```
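If you need one merged record rather than per-page entries, the per-source structure above can be flattened with plain dict handling. A minimal sketch, assuming the list-of-`{"source", "data"}` shape shown above (the sample data here is abbreviated for illustration):

```python
# Merge per-source results into one flat dict. Keys that appear on
# several pages keep the value from the first page seen.
def merge_results(results: list) -> dict:
    merged = {}
    for entry in results:
        for key, value in entry.get("data", {}).items():
            merged.setdefault(key, value)
    return merged

results = [
    {"source": "https://perinim.github.io/", "data": {"name": "Marco Perini"}},
    {"source": "https://perinim.github.io/cv/", "data": {"skills": ["Python"]}},
]
profile = merge_results(results)
# profile == {"name": "Marco Perini", "skills": ["Python"]}
```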
## Common Use Cases

- **Product Catalogs**: Scrape multiple product pages to build a complete catalog
- **News Aggregation**: Collect articles from different sections or categories
- **Competitor Analysis**: Gather data from multiple competitor websites
- **Portfolio Scraping**: Extract information from various profile or portfolio pages
- **Parallel Processing**: Pages are scraped concurrently for better performance.
- **Token Usage**: Multi-page scraping consumes more tokens. Consider using the Lite variant for simple tasks.
- **Rate Limiting**: Be mindful of rate limits when scraping many pages. Add delays if needed.
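One simple way to add delays is to split the source list into small batches and pause between them. A sketch using only the standard library; the batch size and delay are arbitrary, and each batch would be passed as `source` to its own graph run:

```python
import time

def batched(urls: list, size: int) -> list:
    """Split a URL list into batches of at most `size` URLs."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]

urls = [f"https://example.com/page{n}" for n in range(1, 8)]

for batch in batched(urls, size=3):
    # Each batch would be passed as `source=batch` to a
    # SmartScraperMultiGraph here, and its .run() result collected.
    print(batch)
    time.sleep(1)  # crude delay to stay under rate limits
```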
## Advanced Example: Multi-Page with Schema
```python
from typing import List

from pydantic import BaseModel, Field
from scrapegraphai.graphs import SmartScraperMultiGraph

class ContactInfo(BaseModel):
    name: str = Field(description="Person's name")
    email: str = Field(description="Email address")
    role: str = Field(description="Job title or role")

class TeamMembers(BaseModel):
    members: List[ContactInfo]

multi_graph = SmartScraperMultiGraph(
    prompt="Extract all team member information",
    source=[
        "https://company.com/team/engineering",
        "https://company.com/team/design",
        "https://company.com/team/marketing"
    ],
    schema=TeamMembers,
    config=graph_config,
)

result = multi_graph.run()
```
This combines multi-page scraping with schema validation for structured output.
## Handling Different Page Structures
The AI automatically adapts to different page layouts:
```python
multi_graph = SmartScraperMultiGraph(
    prompt="Extract pricing information",
    source=[
        "https://competitor1.com/pricing",  # Table layout
        "https://competitor2.com/plans",    # Card layout
        "https://competitor3.com/pricing"   # List layout
    ],
    config=graph_config,
)
```
The graph understands your prompt and extracts relevant data regardless of HTML structure.
## Error Handling
```python
try:
    result = multiple_search_graph.run()

    # Check for partial failures
    for page_result in result:
        if "error" in page_result:
            print(f"Failed to scrape {page_result['source']}: {page_result['error']}")
        else:
            print(f"Successfully scraped {page_result['source']}")
except Exception as e:
    print(f"Scraping failed: {e}")
```
## Next Steps

- **Custom Schemas**: Add structure to your multi-page results
- **Local Documents**: Process multiple local files
## Tips for Multi-Page Scraping

- **Group related pages**: Scrape pages with similar content together for better context
- **Use specific prompts**: Be clear about what information should be extracted from all pages
- **Monitor performance**: Use `get_execution_info()` to track time and token usage
- **Handle failures gracefully**: Some pages might fail; ensure your code handles partial results
- **Consider pagination**: For paginated content, generate URLs programmatically
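For the pagination tip, the URL list can be generated with a comprehension before being handed to the graph. A minimal sketch; the URL pattern here is hypothetical:

```python
# Hypothetical paginated listing: /products?page=1 .. /products?page=5
base_url = "https://example.com/products"
page_urls = [f"{base_url}?page={n}" for n in range(1, 6)]
# page_urls[0] == "https://example.com/products?page=1"

# The generated list can then be passed directly as the `source`
# parameter of a SmartScraperMultiGraph.
```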