Overview

Google Gemini provides cutting-edge AI models with industry-leading context windows, well suited to scraping large documents and complex websites.

Key Features:
  • Massive Context: Up to 2M tokens (Gemini 2.0 Pro)
  • Multimodal: Process text, images, and video
  • Fast: Gemini Flash optimized for speed
  • Cost-Effective: Competitive pricing
  • Latest Tech: Google’s newest AI capabilities

Prerequisites

Step 1: Get API Key

  1. Go to Google AI Studio
  2. Click “Create API Key”
  3. Copy your API key

Step 2: Install ScrapeGraphAI

pip install scrapegraphai
playwright install

Step 3: Set Environment Variable

export GOOGLE_API_KEY="your-api-key"
Or create a .env file:
GOOGLE_API_KEY=your-api-key

Basic Configuration

import os
from dotenv import load_dotenv
from scrapegraphai.graphs import SmartScraperGraph

load_dotenv()

graph_config = {
    "llm": {
        "api_key": os.getenv("GOOGLE_API_KEY"),
        "model": "google_genai/gemini-2.0-flash-latest",
    },
    "verbose": True,
    "headless": False,
}

smart_scraper_graph = SmartScraperGraph(
    prompt="Extract all product information",
    source="https://example.com",
    config=graph_config,
)

result = smart_scraper_graph.run()
print(result)

Available Models

Model names take a provider prefix followed by the model name: google_genai/ for the Gemini API (Google AI Studio) and google_vertexai/ for Vertex AI, for example google_genai/gemini-2.0-flash-latest.

Configuration Options

Temperature

Control response randomness:
graph_config = {
    "llm": {
        "api_key": os.getenv("GOOGLE_API_KEY"),
        "model": "google_genai/gemini-2.0-flash-latest",
        "temperature": 0,  # Deterministic (recommended for scraping)
    },
}
  • 0: Deterministic, consistent results
  • 0.5: Balanced
  • 1.0: Creative, varied responses
Use temperature: 0 for web scraping to ensure consistent data extraction.

Max Tokens

Limit response length:
graph_config = {
    "llm": {
        "api_key": os.getenv("GOOGLE_API_KEY"),
        "model": "google_genai/gemini-2.0-flash-latest",
        "max_tokens": 4000,  # Limit output
    },
}

Safety Settings

Control content filtering:
graph_config = {
    "llm": {
        "api_key": os.getenv("GOOGLE_API_KEY"),
        "model": "google_genai/gemini-2.0-flash-latest",
        "safety_settings": [
            {
                "category": "HARM_CATEGORY_HARASSMENT",
                "threshold": "BLOCK_NONE"
            },
            {
                "category": "HARM_CATEGORY_HATE_SPEECH",
                "threshold": "BLOCK_NONE"
            },
        ],
    },
}
Adjust safety settings if legitimate content is being blocked during scraping.

Complete Examples

import os
import json
from dotenv import load_dotenv
from scrapegraphai.graphs import SmartScraperGraph
from scrapegraphai.utils import prettify_exec_info

load_dotenv()

graph_config = {
    "llm": {
        "api_key": os.getenv("GOOGLE_API_KEY"),
        "model": "google_genai/gemini-2.0-flash-latest",
        "temperature": 0,
    },
    "verbose": True,
    "headless": False,
}

smart_scraper = SmartScraperGraph(
    prompt="Extract the main article with title, author, and date",
    source="https://www.wired.com",
    config=graph_config,
)

result = smart_scraper.run()
print(json.dumps(result, indent=4))

# Get execution info
graph_exec_info = smart_scraper.get_execution_info()
print(prettify_exec_info(graph_exec_info))

Using Vertex AI

For enterprise deployments, use Google Cloud Vertex AI:

Step 1: Enable Vertex AI

  1. Go to Google Cloud Console
  2. Enable Vertex AI API
  3. Set up authentication

Step 2: Install Google Cloud SDK

pip install google-cloud-aiplatform

Step 3: Authenticate

gcloud auth application-default login

Step 4: Configure ScrapeGraphAI

graph_config = {
    "llm": {
        "model": "google_vertexai/gemini-2.0-flash",
        "project_id": "your-gcp-project-id",
        "location": "us-central1",
        "temperature": 0,
    },
}

Cost Optimization

Gemini Flash models are faster and cheaper:
"model": "google_genai/gemini-2.0-flash-latest"  # Faster, cheaper
Reduce token usage for simple tasks:
"max_tokens": 2000  # Limit response length
Run the browser headless to speed up page loads:
"headless": True  # Run browser in background
Implement caching for frequently scraped pages (lru_cache keys on the URL and only caches within a single process, so the scraper must be built per URL inside the cached function):
import functools

@functools.lru_cache(maxsize=100)
def scrape_cached(url):
    scraper = SmartScraperGraph(
        prompt="Extract all product information",
        source=url,
        config=graph_config,
    )
    return scraper.run()

Troubleshooting

Error: Invalid API key
Solution:
  1. Verify API key at Google AI Studio
  2. Ensure key is active and not expired
  3. Check environment variable is set:
echo $GOOGLE_API_KEY
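Besides the shell check, you can fail fast inside Python before any request is made. A minimal sketch (the helper name is illustrative, not part of ScrapeGraphAI):

```python
import os

def require_api_key(name: str = "GOOGLE_API_KEY") -> str:
    """Return the named API key, failing fast with a clear message if unset."""
    key = os.getenv(name)
    if not key:
        raise RuntimeError(f"{name} is not set; export it or add it to a .env file")
    return key
```

Call it once at startup so a missing key surfaces as a clear error instead of an opaque authentication failure mid-run.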
Error: 429 Rate limit exceeded
Solution: Implement retry logic:
from tenacity import retry, wait_exponential, stop_after_attempt

@retry(wait=wait_exponential(min=1, max=60), stop=stop_after_attempt(5))
def scrape_with_retry():
    return scraper.run()
Error: Content blocked due to safety settings
Solution: Adjust safety settings:
"safety_settings": [
    {"category": "HARM_CATEGORY_HARASSMENT", "threshold": "BLOCK_NONE"},
    {"category": "HARM_CATEGORY_HATE_SPEECH", "threshold": "BLOCK_NONE"},
    {"category": "HARM_CATEGORY_SEXUALLY_EXPLICIT", "threshold": "BLOCK_NONE"},
    {"category": "HARM_CATEGORY_DANGEROUS_CONTENT", "threshold": "BLOCK_NONE"},
]
Error: Model not found
Solution: Use the correct model name with the provider prefix:
# Correct
"model": "google_genai/gemini-2.0-flash-latest"

# Wrong
"model": "gemini-2.0-flash-latest"  # Missing prefix

Advantages of Gemini

Massive Context

Process up to 2M tokens in a single request, making it well suited to scraping entire websites or long documents.
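Before sending a very large page, a rough pre-flight estimate helps decide whether it fits in one request. This sketch uses the common ~4-characters-per-token heuristic for English text; use a real tokenizer for exact counts:

```python
def fits_in_context(text: str, context_limit: int = 2_000_000,
                    chars_per_token: float = 4.0) -> bool:
    """Rough estimate of whether text fits within the model's context window."""
    estimated_tokens = len(text) / chars_per_token
    return estimated_tokens <= context_limit
```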

Multimodal

Can process text, images, and video together for richer data extraction.

Fast

Gemini Flash models offer industry-leading speed for real-time scraping.

Cost-Effective

Competitive pricing compared to other providers, especially for large contexts.

Best Practices

Use Latest Models

Always use the latest model versions:
"model": "google_genai/gemini-2.0-flash-latest"

Set Temperature to 0

For consistent scraping:
"temperature": 0

Enable Verbose Mode

During development:
"verbose": True

Handle Rate Limits

Implement exponential backoff for production use.
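If you would rather not depend on tenacity, a hand-rolled backoff with jitter looks roughly like this (in practice, catch the client's specific rate-limit exception rather than bare Exception):

```python
import random
import time

def run_with_backoff(fn, max_attempts=5, base_delay=1.0, max_delay=60.0):
    """Call fn, retrying with exponential backoff plus jitter on failure."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of retries; surface the error
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(delay + random.uniform(0, delay / 2))
```

Usage: run_with_backoff(scraper.run) retries the scrape on transient failures, doubling the wait each attempt and adding jitter so parallel scrapers do not retry in lockstep.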

Next Steps

Advanced Configuration

Learn about proxy rotation and browser settings

Groq

Explore ultra-fast Groq inference