# Graphs
Graphs are the core building blocks of ScrapeGraphAI. A graph represents a complete scraping workflow composed of interconnected nodes that process data sequentially or conditionally.
## Overview

In ScrapeGraphAI, a graph defines:

- **Nodes**: Individual processing units (fetch, parse, generate, etc.)
- **Edges**: Connections between nodes defining the execution flow
- **Entry Point**: The starting node of the workflow
- **State**: A dictionary that flows through the graph, updated by each node

Two classes underpin every workflow:

- **BaseGraph**: the low-level graph execution engine
- **AbstractGraph**: high-level graph scaffolding for scraping workflows
## BaseGraph

BaseGraph manages the execution flow of interconnected nodes. It's the low-level engine that orchestrates node execution.

### Key Attributes

- **nodes** (list): The node instances that make up the graph
- **edges** (list): Directed edges passed as (from_node, to_node) pairs, defining node relationships
- **entry_point** (str): Name of the entry-point node
- **graph_name** (str): Name identifier for the graph
### Basic Example

```python
from scrapegraphai.graphs import BaseGraph
from scrapegraphai.nodes import FetchNode, ParseNode, GenerateAnswerNode

# llm_model is an already-initialized LLM instance (see Configuration)

# Create nodes
fetch_node = FetchNode(
    input="url | local_dir",
    output=["doc"],
    node_config={"llm_model": llm_model}
)
parse_node = ParseNode(
    input="doc",
    output=["parsed_doc"],
    node_config={"llm_model": llm_model, "chunk_size": 8192}
)
generate_answer_node = GenerateAnswerNode(
    input="user_prompt & parsed_doc",
    output=["answer"],
    node_config={"llm_model": llm_model}
)

# Create graph
graph = BaseGraph(
    nodes=[fetch_node, parse_node, generate_answer_node],
    edges=[
        (fetch_node, parse_node),
        (parse_node, generate_answer_node)
    ],
    entry_point=fetch_node,
    graph_name="MyCustomGraph"
)

# Execute graph
initial_state = {
    "user_prompt": "Extract product names",
    "url": "https://example.com"
}
final_state, execution_info = graph.execute(initial_state)
print(final_state["answer"])
```
### Execution Flow

The BaseGraph.execute() method processes the graph in five steps:

1. **Initialize**: Start with the entry-point node and the initial state
2. **Traverse**: Execute each node sequentially, following the edges
3. **Update State**: Each node updates the state dictionary
4. **Track Metrics**: Collect execution time, token usage, and costs
5. **Return Results**: Return the final state and execution information

Graph execution is synchronous by default. For async execution, use run_safe_async() from AbstractGraph.
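The traversal loop described above can be sketched in plain Python. This is a toy illustration, not the library's actual implementation: the stand-in node functions and the name-to-successor edge mapping are assumptions made for the example, and the real engine also tracks token usage and cost.

```python
import time

def execute(nodes, edges, entry_point, initial_state):
    """Toy version of BaseGraph.execute(): walk the edge map from the
    entry point, let each node update the shared state, and record
    per-node timings."""
    state = dict(initial_state)
    execution_info = []
    current = entry_point
    while current is not None:
        start = time.perf_counter()
        state = nodes[current](state)              # node updates the state
        execution_info.append({"node_name": current,
                               "exec_time": time.perf_counter() - start})
        current = edges.get(current)               # follow the outgoing edge
    return state, execution_info

# Two stand-in "nodes" that just add keys to the state
nodes = {
    "fetch": lambda s: {**s, "doc": "<html>"},
    "parse": lambda s: {**s, "parsed_doc": ["chunk1"]},
}
edges = {"fetch": "parse"}                         # "parse" has no successor

final_state, info = execute(nodes, edges, "fetch", {"url": "https://example.com"})
print([n["node_name"] for n in info])  # ['fetch', 'parse']
```

Each node sees every key written by its predecessors, which is why later nodes can declare inputs like `"user_prompt & parsed_doc"`.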
### Conditional Nodes

Conditional nodes enable branching logic based on runtime conditions:

```python
from scrapegraphai.nodes import ConditionalNode

cond_node = ConditionalNode(
    input="answer",
    output=["answer"],
    node_name="CheckAnswer",
    node_config={
        "key_name": "answer",
        "condition": 'not answer or answer=="NA"'
    }
)

# Conditional nodes require exactly 2 outgoing edges
graph = BaseGraph(
    nodes=[fetch_node, parse_node, generate_node, cond_node, regen_node],
    edges=[
        (fetch_node, parse_node),
        (parse_node, generate_node),
        (generate_node, cond_node),
        (cond_node, regen_node),  # True path
        (cond_node, None)         # False path (end)
    ],
    entry_point=fetch_node
)
```

Conditional nodes must have exactly two outgoing edges: one for the true condition and one for the false condition.
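The branching decision can be sketched as follows. This is illustrative only: the function name is hypothetical, and the real ConditionalNode evaluates the condition string with a safe expression evaluator rather than a bare eval().

```python
def pick_branch(state, key_name, condition, true_next, false_next):
    """Evaluate the condition against one state key and choose one of
    exactly two successors, mirroring the two outgoing edges above."""
    value = state.get(key_name)
    result = eval(condition, {"__builtins__": {}}, {key_name: value})
    return true_next if result else false_next

# Empty or "NA" answer -> take the regeneration path
branch = pick_branch({"answer": "NA"}, "answer",
                     'not answer or answer=="NA"',
                     "RegenerateAnswer", None)
print(branch)  # RegenerateAnswer

# A real answer -> condition is false, follow the end edge
branch = pick_branch({"answer": "Rome has many attractions"}, "answer",
                     'not answer or answer=="NA"',
                     "RegenerateAnswer", None)
print(branch)  # None
```

This is why the false edge may point to `None`: a missing successor simply ends the traversal.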
### Adding Nodes Dynamically

You can append nodes to an existing graph:

```python
from scrapegraphai.nodes import DescriptionNode

# Add a new node to the end of the graph
description_node = DescriptionNode(
    input="answer",
    output=["description"],
    node_config={"llm_model": llm_model}
)
graph.append_node(description_node)
```
## AbstractGraph

AbstractGraph is an abstract base class that provides scaffolding for creating high-level scraping workflows. All pre-built graphs, such as SmartScraperGraph, SearchGraph, and JSONScraperGraph, inherit from AbstractGraph.

### Key Features

- Automatic LLM model initialization
- Configuration management (verbose, headless, etc.)
- Schema validation support
- Common parameter propagation to all nodes
- Execution tracking and state management
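Schema validation relies on Pydantic models passed through the schema parameter. A small sketch of the idea, with model fields made up for illustration:

```python
from typing import List

from pydantic import BaseModel

class Product(BaseModel):
    name: str
    price: float

class ProductList(BaseModel):
    products: List[Product]

# A schema lets the graph validate (and coerce) the LLM's raw answer
raw_answer = {"products": [{"name": "Widget", "price": "9.99"}]}
validated = ProductList(**raw_answer)
print(validated.products[0].price)  # the string "9.99" is coerced to float
```

Passing such a model as `schema=ProductList` constrains the generated answer to that structure instead of free-form text.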
### Core Attributes

```python
class AbstractGraph:
    prompt: str             # User prompt/question
    source: str             # Data source (URL, file, etc.)
    config: dict            # Graph configuration
    schema: BaseModel       # Optional Pydantic output schema
    llm_model: object       # Initialized LLM instance
    verbose: bool           # Enable detailed logging
    headless: bool          # Browser headless mode
    final_state: dict       # Final execution state
    execution_info: list    # Execution metrics
```
### Creating a Custom Graph

Extend AbstractGraph to create custom scraping workflows:

```python
from typing import Optional, Type

from pydantic import BaseModel

from scrapegraphai.graphs import AbstractGraph, BaseGraph
from scrapegraphai.nodes import FetchNode, ParseNode, GenerateAnswerNode


class MyCustomGraph(AbstractGraph):
    """Custom graph for specialized scraping tasks"""

    def __init__(
        self,
        prompt: str,
        source: str,
        config: dict,
        schema: Optional[Type[BaseModel]] = None
    ):
        super().__init__(prompt, config, source, schema)
        self.input_key = "url" if source.startswith("http") else "local_dir"

    def _create_graph(self) -> BaseGraph:
        """Define the node structure and edges"""
        # Create nodes
        fetch_node = FetchNode(
            input="url | local_dir",
            output=["doc"],
            node_config={"llm_model": self.llm_model}
        )
        parse_node = ParseNode(
            input="doc",
            output=["parsed_doc"],
            node_config={
                "llm_model": self.llm_model,
                "chunk_size": self.model_token
            }
        )
        generate_node = GenerateAnswerNode(
            input="user_prompt & parsed_doc",
            output=["answer"],
            node_config={
                "llm_model": self.llm_model,
                "schema": self.schema
            }
        )
        # Build and return graph
        return BaseGraph(
            nodes=[fetch_node, parse_node, generate_node],
            edges=[
                (fetch_node, parse_node),
                (parse_node, generate_node)
            ],
            entry_point=fetch_node,
            graph_name=self.__class__.__name__
        )

    def run(self) -> str:
        """Execute the graph and return results"""
        inputs = {"user_prompt": self.prompt, self.input_key: self.source}
        self.final_state, self.execution_info = self.graph.execute(inputs)
        return self.final_state.get("answer", "No answer found.")


# Use the custom graph
graph_config = {
    "llm": {"model": "openai/gpt-4o-mini"},
    "verbose": True,
    "headless": True
}

custom_graph = MyCustomGraph(
    prompt="Extract all product information",
    source="https://example.com/products",
    config=graph_config
)
result = custom_graph.run()
print(result)
```
## Pre-built Graphs

ScrapeGraphAI includes several ready-to-use graphs.

**SmartScraperGraph** extracts structured data from web pages using AI:

```python
from scrapegraphai.graphs import SmartScraperGraph

graph = SmartScraperGraph(
    prompt="List all attractions",
    source="https://en.wikipedia.org/wiki/Rome",
    config={"llm": {"model": "openai/gpt-4o-mini"}}
)
result = graph.run()
```

See the SmartScraperGraph reference for details.

**SearchGraph** searches the internet and extracts information from the results:

```python
from scrapegraphai.graphs import SearchGraph

graph = SearchGraph(
    prompt="Latest AI news",
    config={"llm": {"model": "openai/gpt-4o-mini"}}
)
result = graph.run()
```

**JSONScraperGraph** extracts data from JSON sources:

```python
from scrapegraphai.graphs import JSONScraperGraph

graph = JSONScraperGraph(
    prompt="Extract user names",
    source="data.json",
    config={"llm": {"model": "openai/gpt-4o-mini"}}
)
result = graph.run()
```
## Graph Configuration

Graphs are configured via the config dictionary. Common parameters:

```python
config = {
    "llm": {
        "model": "openai/gpt-4o-mini",
        "api_key": "your-api-key",
        "temperature": 0
    },
    "verbose": True,        # Enable detailed logging
    "headless": True,       # Browser headless mode
    "cache_path": False,    # Enable/disable caching
    "timeout": 480,         # Request timeout in seconds
    "loader_kwargs": {}     # Additional loader arguments
}
```

See Configuration for complete details.
## State Management

The state is a dictionary that flows through the graph and is updated by each node:

```python
# Initial state
initial_state = {
    "user_prompt": "Extract product names",
    "url": "https://example.com"
}

# After FetchNode
state = {
    "user_prompt": "Extract product names",
    "url": "https://example.com",
    "doc": [Document(...)]                    # Added by FetchNode
}

# After ParseNode
state = {
    "user_prompt": "Extract product names",
    "url": "https://example.com",
    "doc": [Document(...)],
    "parsed_doc": ["chunk1", "chunk2", ...]   # Added by ParseNode
}

# After GenerateAnswerNode
state = {
    "user_prompt": "Extract product names",
    "url": "https://example.com",
    "doc": [Document(...)],
    "parsed_doc": ["chunk1", "chunk2", ...],
    "answer": {"products": [...]}             # Added by GenerateAnswerNode
}
```
### Accessing Final State

```python
final_state, execution_info = graph.execute(initial_state)

# Access specific keys
answer = final_state.get("answer")
parsed_doc = final_state.get("parsed_doc")

# Or use get_state() with AbstractGraph
graph.run()
answer = graph.get_state("answer")
all_state = graph.get_state()  # Get entire state
```
## Execution Info

Every graph execution returns detailed per-node metrics:

```python
final_state, execution_info = graph.execute(initial_state)

for node_info in execution_info:
    print(f"Node: {node_info['node_name']}")
    print(f"Tokens: {node_info['total_tokens']}")
    print(f"Cost: ${node_info['total_cost_USD']}")
    print(f"Time: {node_info['exec_time']}s")
```

Or use the helper utility:

```python
from scrapegraphai.utils import prettify_exec_info

info = graph.get_execution_info()
print(prettify_exec_info(info))
```
## Burr Integration

ScrapeGraphAI supports Burr for advanced workflow tracking:

```python
from scrapegraphai.graphs import SmartScraperGraph

graph_config = {
    "llm": {"model": "openai/gpt-4o-mini"},
    "burr_kwargs": {
        "app_instance_id": "my-scraping-session",
        "project_name": "web-scraper"
    }
}

graph = SmartScraperGraph(
    prompt="Extract data",
    source="https://example.com",
    config=graph_config
)
result = graph.run()
```

Burr integration provides advanced debugging, state visualization, and workflow tracking capabilities.
Next Steps
Nodes Learn about individual node types and their functions
Configuration Explore all configuration options
Schemas Define structured output with Pydantic schemas
API Reference View complete graph API documentation