Overview
DepthSearchGraph is a scraping pipeline that performs deep crawling of websites by following internal links up to a specified depth. It uses RAG (Retrieval-Augmented Generation) to intelligently search through all crawled pages and extract relevant information.
Class Signature
Constructor Parameters
The natural language prompt describing what information to extract from the crawled pages.
The starting URL or local directory. The graph will crawl from this point:
- URL starting with `http://` or `https://`
- Local directory path
Configuration parameters for the graph. Must include:
- `llm`: LLM configuration (e.g., `{"model": "openai/gpt-4o"}`)
- `embedder_model`: Embeddings model for RAG (optional but recommended)

Optional keys:

- `depth` (int): Maximum crawl depth (default: 1)
- `only_inside_links` (bool): Only follow links within the same domain (default: False)
- `verbose` (bool): Enable detailed logging
- `force` (bool): Force re-fetch even if cached
- `cut` (bool): Trim long documents
- `browser_base` (dict): BrowserBase configuration
- `storage_state` (str): Browser state file path
- `cache_path` (str): Path for caching results
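Putting the parameters above together, a full config might look like the sketch below. The key names come from the list above; the model strings and the `api_key` key name are placeholder assumptions, not verified values:

```python
# Hedged sketch of a DepthSearchGraph config dict.
# Model strings and the "api_key" key name are assumptions.
graph_config = {
    "llm": {
        "api_key": "YOUR_API_KEY",
        "model": "openai/gpt-4o",
    },
    "embedder_model": "openai/text-embedding-3-small",  # assumed model name
    "depth": 2,                  # crawl two levels of links
    "only_inside_links": True,   # stay on the starting domain
    "verbose": True,
    "force": False,              # reuse cached pages when available
    "cache_path": "./crawl_cache",
}
```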
Optional Pydantic model defining the expected output structure.
Attributes
The user’s extraction prompt.
The starting URL or local directory path.
Configuration dictionary for the graph.
Optional output schema for structured data extraction.
The configured language model instance.
Either `"url"` or `"local_dir"` based on the source type.
Methods
run()
Executes the deep crawling and extraction process.

Returns: The extracted information from all crawled pages, or `"No answer"` if extraction fails.
Basic Usage
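The code for this section appears to have been lost in extraction. A minimal sketch follows; the import path is assumed from ScrapeGraphAI's package layout, and the live call is guarded so the snippet fails soft without the package or a real API key:

```python
# Hedged basic-usage sketch; import path and model strings are assumptions.
graph_config = {
    "llm": {"api_key": "YOUR_API_KEY", "model": "openai/gpt-4o"},
    "embedder_model": "openai/text-embedding-3-small",  # assumed model name
    "depth": 1,
    "verbose": True,
}

try:
    from scrapegraphai.graphs import DepthSearchGraph  # assumed import path

    graph = DepthSearchGraph(
        prompt="List the main topics covered on this site",
        source="https://example.com",
        config=graph_config,
    )
    print(graph.run())
except Exception as exc:  # missing package, bad key, network failure, ...
    print(f"Skipped live run: {exc}")
```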
Crawl Depth Configuration
Depth 1 (Default)
Crawls only the starting page and its direct links.

Depth 2

Crawls two levels deep.

Depth 3+

Deeper crawling (use with caution).

Domain Restriction
Only Inside Links (Recommended)
Allow External Links
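The code samples for the depth and domain settings above appear to have been stripped. Sketched below as config fragments, with key names taken from the parameter list (the model string is a placeholder assumption):

```python
# Depth 1 (default): starting page plus its direct links.
shallow = {"llm": {"model": "openai/gpt-4o"}, "depth": 1}

# Depth 2: also follow links found on the linked pages.
two_levels = {"llm": {"model": "openai/gpt-4o"}, "depth": 2}

# Depth 3+: use with caution; page counts grow exponentially.
deep = {"llm": {"model": "openai/gpt-4o"}, "depth": 3}

# Recommended: restrict the crawl to the starting domain.
restricted = {**two_levels, "only_inside_links": True}

# Allow external links (the stated default).
unrestricted = {**two_levels, "only_inside_links": False}
```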
Graph Workflow
The DepthSearchGraph uses the following node pipeline:

- FetchNodeLevelK: Fetches pages recursively up to the specified depth
- ParseNodeDepthK: Parses all crawled pages
- DescriptionNode: Generates descriptions for each page
- RAGNode: Creates vector database from all pages
- GenerateAnswerNodeKLevel: Retrieves relevant chunks and generates answer
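A conceptual, heavily simplified sketch of that five-stage flow. These are plain stand-in functions, not the real node classes, and the toy "crawl" just fabricates child URLs:

```python
def fetch_level_k(source, depth):
    # Stand-in for FetchNodeLevelK: pretend each page links to two children.
    pages, frontier = [source], [source]
    for _ in range(depth):
        frontier = [f"{p}/child{i}" for p in frontier for i in (1, 2)]
        pages += frontier
    return pages

def run_pipeline(prompt, source, depth=1):
    pages = fetch_level_k(source, depth)                     # FetchNodeLevelK
    parsed = [(p, f"text of {p}") for p in pages]            # ParseNodeDepthK
    described = [(p, t, f"desc of {p}") for p, t in parsed]  # DescriptionNode
    index = described                                        # RAGNode ("vector DB")
    relevant = index[:3]                    # GenerateAnswerNodeKLevel: retrieve
    return f"answer to {prompt!r} built from {len(relevant)} chunks"
```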
Embedder Configuration
OpenAI Embeddings
Ollama Embeddings (Local)
HuggingFace Embeddings
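The three snippets for this section seem to be missing. Sketched below as config fragments; the `embedder_model` strings are plausible assumptions based on common model names, not verified identifiers:

```python
llm = {"model": "openai/gpt-4o"}  # placeholder LLM config

# OpenAI embeddings (assumed model string)
openai_cfg = {"llm": llm, "embedder_model": "openai/text-embedding-3-small"}

# Ollama embeddings, running locally (assumed model string)
ollama_cfg = {"llm": llm, "embedder_model": "ollama/nomic-embed-text"}

# HuggingFace sentence-transformers embeddings (assumed model string)
hf_cfg = {"llm": llm, "embedder_model": "sentence-transformers/all-MiniLM-L6-v2"}
```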
Use Cases
- Documentation Scraping: Extract information across entire documentation sites
- Site Analysis: Analyze entire website structure and content
- Knowledge Base Extraction: Build knowledge bases from website content
- Competitive Intelligence: Gather comprehensive competitor information
- Content Audit: Audit and extract content from all pages
Advanced Usage
Documentation Site Scraping
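The example for this use case appears to have been stripped. A hedged sketch of the prompt and config one might use (key names from the parameter list; model strings are assumptions):

```python
# Documentation sites are usually well interlinked, so depth 2 with
# domain restriction covers index -> section -> API pages.
docs_prompt = "Extract all API endpoints with their parameters and return types"
docs_config = {
    "llm": {"model": "openai/gpt-4o"},
    "embedder_model": "openai/text-embedding-3-small",  # assumed model name
    "depth": 2,
    "only_inside_links": True,   # stay within the docs site
    "cache_path": "./docs_cache",
}
```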
Product Catalog Scraping
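Likewise for this use case, a hedged config sketch (model strings are assumptions); a Pydantic model could be passed as `schema=` to get structured output:

```python
# Catalogs typically nest category pages -> product pages, so depth 2.
catalog_prompt = "List every product with its name, price, and category"
catalog_config = {
    "llm": {"model": "openai/gpt-4o"},
    "embedder_model": "openai/text-embedding-3-small",  # assumed model name
    "depth": 2,
    "only_inside_links": True,
    "cut": True,   # trim very long product pages
}
```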
Caching for Performance
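The snippet for this section seems missing. Per the parameter list, `cache_path` plus `force=False` should let a second run with the same query reuse already-crawled pages; sketched as a config fragment:

```python
cached_config = {
    "llm": {"model": "openai/gpt-4o"},
    "depth": 2,
    "cache_path": "./site_cache",  # crawled pages are stored here
    "force": False,                # reuse the cache instead of re-fetching
}
```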
Accessing Results
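Per the Methods section, `run()` returns either the extracted content or the literal string `"No answer"`. A small hedged helper for normalizing that (the dict case assumes a schema was supplied):

```python
def handle_result(result):
    """Normalize a DepthSearchGraph result; 'No answer' is treated as failure."""
    if not result or result == "No answer":
        return None
    if isinstance(result, dict):
        return result               # structured output (e.g., schema was set)
    return str(result).strip()      # plain-text answer
```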
Performance Considerations
Crawl Estimates
| Depth | Avg Links Per Page | Est. Pages Crawled |
|---|---|---|
| 1 | 10 | ~10 pages |
| 1 | 50 | ~50 pages |
| 2 | 10 | ~100 pages |
| 2 | 50 | ~2,500 pages |
| 3 | 10 | ~1,000 pages |
| 3 | 50 | ~125,000 pages |
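The table follows a simple geometric growth model, keeping only the deepest level's count. A small calculator that reproduces it:

```python
def estimated_pages(depth: int, links_per_page: int) -> int:
    """Rough page-count estimate: links_per_page ** depth.

    Matches the estimates table, which keeps only the deepest level's count
    (the earlier levels add comparatively few pages).
    """
    return links_per_page ** depth
```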
Optimization Tips
- Use `only_inside_links`: Prevent crawling external sites
- Set appropriate depth: Start with `depth=1`, increase if needed
- Enable caching: Reuse crawled data for multiple queries
- Use `force=False`: Don't re-fetch if cached
- Monitor verbose output: Track crawling progress
Error Handling
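The snippet for this section appears missing. A generic defensive wrapper around `run()`, sketched under the assumption that failures surface as ordinary exceptions:

```python
def safe_run(graph):
    """Run a graph, mapping failures and 'No answer' to None."""
    try:
        result = graph.run()
    except Exception as exc:   # network errors, rate limits, LLM failures, ...
        print(f"Crawl failed: {exc}")
        return None
    return None if result == "No answer" else result
```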
Rate Limiting
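It is not clear from this page whether the graph config exposes throttling directly; absent that, a generic stdlib retry-with-backoff wrapper can soften rate-limit errors from target sites or the LLM API:

```python
import time

def with_backoff(fn, attempts=3, base_delay=1.0):
    """Call fn(), retrying with exponential backoff on any exception."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise          # out of retries; re-raise the last error
            time.sleep(base_delay * 2 ** attempt)

# Usage sketch: with_backoff(graph.run, attempts=3, base_delay=2.0)
```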
Comparison with Other Graphs
| Feature | DepthSearchGraph | SearchGraph | SmartScraperGraph |
|---|---|---|---|
| Crawl Depth | Multi-level | Single (search results) | Single page |
| Link Following | Yes | No | No |
| RAG/Vector Search | Yes | No | No |
| Best For | Site-wide extraction | Internet search | Single page |
| Performance | Slowest | Medium | Fastest |
Limitations
- Deep crawls can be very slow and expensive
- May encounter rate limiting on target sites
- Requires embedder model for RAG functionality
- Can generate large amounts of data
- May hit LLM token limits with many pages
Best Practices
- Start shallow: Begin with `depth=1`, increase only if needed
- Use domain restriction: Enable `only_inside_links` for focused crawls
- Enable caching: Always use `cache_path` for large crawls
- Monitor progress: Enable `verbose` mode to track crawling
- Test first: Test on small sites before large-scale crawls
- Respect robots.txt: Be mindful of site policies
- Rate limit: Don’t overwhelm target servers
Related Graphs
- SearchGraph - Search internet and scrape top results
- SmartScraperGraph - Scrape single page
- SmartScraperMultiGraph - Scrape specific list of URLs
