Overview
SmartScraperGraph is a scraping pipeline that automates the process of extracting information from web pages, using a large language model (LLM) to interpret the prompt and answer it from the page content. It’s the most commonly used graph for web scraping tasks.
Class Signature
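The constructor takes a prompt, a source, a configuration dictionary, and an optional output schema (a sketch; the parameters are detailed below):

```python
SmartScraperGraph(
    prompt: str,                          # natural-language extraction prompt
    source: str,                          # URL or local directory path
    config: dict,                         # graph configuration
    schema: Optional[BaseModel] = None,   # optional structured-output model
)
```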
Constructor Parameters
prompt (str): The natural language prompt describing what information to extract from the source.
source (str): The source to scrape. Can be:
- A URL starting with http:// or https://
- A local directory path for offline HTML files
config (dict): Configuration parameters for the graph. Must include:
- llm: LLM configuration (e.g., {"model": "openai/gpt-4o"})

Optional keys:

- verbose (bool): Enable detailed logging
- headless (bool): Run the browser in headless mode (default: True)
- html_mode (bool): Process raw HTML without parsing (default: False)
- reasoning (bool): Enable chain-of-thought reasoning (default: False)
- reattempt (bool): Retry if the initial extraction fails (default: False)
- additional_info (str): Extra context for the LLM
- cut (bool): Trim long documents (default: True)
- force (bool): Force re-fetch even if cached
- loader_kwargs (dict): Additional parameters for page loading
- browser_base (dict): BrowserBase configuration
- scrape_do (dict): ScrapeDo configuration
- storage_state (str): Path to a browser state file
schema (Optional[BaseModel]): Optional Pydantic model defining the expected output structure. Ensures type-safe extraction.
Attributes
prompt: The user’s extraction prompt.
source: The source URL or local directory path.
config: Configuration dictionary for the graph.
schema: Optional output schema for structured data extraction.
llm_model: The configured language model instance.
verbose: Flag indicating whether verbose logging is enabled.
headless: Flag indicating whether the browser runs in headless mode.
input_key: Either “url” or “local_dir”, depending on the source type.
Methods
run()
Executes the scraping process and returns the answer to the prompt.

Returns: the extracted information as a string, or “No answer found.” if extraction fails.
Basic Usage
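A minimal end-to-end example (the API key is a placeholder; substitute your own):

```python
from scrapegraphai.graphs import SmartScraperGraph

graph_config = {
    "llm": {
        "model": "openai/gpt-4o",
        "api_key": "YOUR_OPENAI_API_KEY",  # placeholder
    },
    "verbose": True,
    "headless": True,
}

smart_scraper_graph = SmartScraperGraph(
    prompt="List all the article titles on the page",
    source="https://example.com",
    config=graph_config,
)

result = smart_scraper_graph.run()
print(result)
```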
Structured Output with Schema
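Passing a Pydantic model as schema constrains the output to that structure. A sketch (the model fields here are illustrative):

```python
from typing import List
from pydantic import BaseModel
from scrapegraphai.graphs import SmartScraperGraph

class Article(BaseModel):
    title: str
    url: str

class Articles(BaseModel):
    articles: List[Article]

graph = SmartScraperGraph(
    prompt="Extract every article's title and URL",
    source="https://example.com",
    config={"llm": {"model": "openai/gpt-4o", "api_key": "YOUR_OPENAI_API_KEY"}},
    schema=Articles,  # output will follow the Articles structure
)

result = graph.run()
```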
Advanced Configuration
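A configuration sketch combining several of the optional keys listed above (the particular flag values and the additional_info text are illustrative):

```python
graph_config = {
    "llm": {
        "model": "openai/gpt-4o",
        "api_key": "YOUR_OPENAI_API_KEY",
    },
    "verbose": True,
    "headless": True,
    "reasoning": True,        # chain-of-thought reasoning
    "reattempt": True,        # retry if the first extraction fails
    "cut": True,              # trim long documents to reduce token usage
    "additional_info": "Prices on this site are listed in EUR.",
}
```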
HTML Mode
Process raw HTML without parsing for maximum speed:

Reasoning Mode
Enable chain-of-thought reasoning for complex extractions:

Reattempt Mode
Automatically retry if the initial extraction fails:

Local HTML Files
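Pointing source at a local directory scrapes saved HTML files offline. A sketch (the directory path is illustrative):

```python
from scrapegraphai.graphs import SmartScraperGraph

graph = SmartScraperGraph(
    prompt="Extract the product names and prices",
    source="saved_pages",  # local directory containing the HTML files
    config={"llm": {"model": "openai/gpt-4o", "api_key": "YOUR_OPENAI_API_KEY"}},
)

result = graph.run()
```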
Browser Configuration
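Browser behaviour is controlled through the config dictionary; headless toggles the visible window and loader_kwargs is forwarded to the page loader. A sketch (the loader_kwargs contents are an assumption and depend on the underlying loader):

```python
graph_config = {
    "llm": {"model": "openai/gpt-4o", "api_key": "YOUR_OPENAI_API_KEY"},
    "headless": False,  # show the browser window while scraping
    "loader_kwargs": {
        # assumption: passed through to the underlying page loader
        "slow_mo": 100,
    },
}
```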
Using BrowserBase
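A sketch of routing page fetches through BrowserBase via the browser_base key (the exact key names inside the dictionary are an assumption based on BrowserBase's credential scheme):

```python
graph_config = {
    "llm": {"model": "openai/gpt-4o", "api_key": "YOUR_OPENAI_API_KEY"},
    "browser_base": {
        # assumed credential keys
        "api_key": "YOUR_BROWSERBASE_API_KEY",
        "project_id": "YOUR_BROWSERBASE_PROJECT_ID",
    },
}
```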
Using ScrapeDo
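A sketch of the scrape_do configuration (the key name inside the dictionary is an assumption):

```python
graph_config = {
    "llm": {"model": "openai/gpt-4o", "api_key": "YOUR_OPENAI_API_KEY"},
    "scrape_do": {
        "api_key": "YOUR_SCRAPEDO_API_KEY",  # assumed key name
    },
}
```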
Storage State (Cookies/Auth)
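storage_state points at a saved browser state file so authenticated sessions and cookies are reused across runs. A sketch (the file path is illustrative; the format is assumed to be a Playwright-style storage-state JSON file):

```python
graph_config = {
    "llm": {"model": "openai/gpt-4o", "api_key": "YOUR_OPENAI_API_KEY"},
    # cookies/auth captured from a previous logged-in browser session
    "storage_state": "state.json",
}
```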
Accessing Graph State
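After a run, per-node execution statistics can be inspected; a sketch using the library's execution-info helpers:

```python
from scrapegraphai.graphs import SmartScraperGraph
from scrapegraphai.utils import prettify_exec_info

graph = SmartScraperGraph(
    prompt="Summarize the page",
    source="https://example.com",
    config={"llm": {"model": "openai/gpt-4o", "api_key": "YOUR_OPENAI_API_KEY"}},
)
result = graph.run()

# per-node statistics (tokens, cost, execution time)
exec_info = graph.get_execution_info()
print(prettify_exec_info(exec_info))
```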
Graph Workflow Variations
The graph automatically adapts its workflow based on configuration. The standard workflow runs when html_mode=False, reasoning=False, and reattempt=False.

Error Handling
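Since run() returns the fallback string “No answer found.” when extraction fails rather than raising, check for it explicitly; other failures (network, LLM errors) surface as exceptions. A sketch:

```python
from scrapegraphai.graphs import SmartScraperGraph

graph = SmartScraperGraph(
    prompt="Extract the author's name",
    source="https://example.com",
    config={"llm": {"model": "openai/gpt-4o", "api_key": "YOUR_OPENAI_API_KEY"}},
)

try:
    result = graph.run()
except Exception as exc:
    # fetching or LLM errors are raised as exceptions
    print(f"Scraping failed: {exc}")
else:
    if result == "No answer found.":
        # extraction ran but found nothing; consider reattempt=True
        print("No answer found; refine the prompt or enable reattempt")
    else:
        print(result)
```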
Performance Tips
- Use HTML mode for simple extractions to skip parsing overhead
- Enable cut to trim long documents and reduce token usage
- Set appropriate chunk_size based on your LLM’s context window
- Use caching with cache_path to avoid re-fetching pages
- Enable verbose mode during development for debugging
Related Graphs
- SmartScraperMultiGraph - Scrape multiple URLs
- SearchGraph - Search the internet first, then scrape
- OmniScraperGraph - Include image analysis
