Web Scrapper Generator

Core package for AI-powered generation of web scraping rules. It contains the brain of Zenscrap: all the AI prompts, MCP integrations, and logic for creating and testing ScrapingBee extraction rules.

πŸ“‹ Overview

This package handles:

  • AI prompt engineering for web scraping tasks
  • Playwright MCP integration for browser automation
  • ScrapingBee MCP for testing extraction rules
  • Multi-SDK support (Gemini, Claude, Codex)
  • Dynamic proxy configuration
  • Cost optimization strategies

πŸ—οΈ Package Structure

web_scrapper_generator/
β”œβ”€β”€ lib/
β”‚   β”œβ”€β”€ src/
β”‚   β”‚   β”œβ”€β”€ implementations/      # SDK-specific implementations
β”‚   β”‚   β”‚   β”œβ”€β”€ claude_implementation.dart
β”‚   β”‚   β”‚   β”œβ”€β”€ codex_implementation.dart
β”‚   β”‚   β”‚   └── gemini_implementation.dart
β”‚   β”‚   β”œβ”€β”€ prompts.dart         # Core AI prompts and system instructions
β”‚   β”‚   β”œβ”€β”€ playwright_setup.dart # Playwright MCP configuration
β”‚   β”‚   β”œβ”€β”€ scraping_bee_mcp.dart # ScrapingBee MCP server
β”‚   β”‚   β”œβ”€β”€ scraping_bee_api_mixin.dart # API interaction logic
β”‚   β”‚   β”œβ”€β”€ mcp_adapters.dart    # Unified MCP setup across SDKs
β”‚   β”‚   β”œβ”€β”€ web_scrapper_response.dart # Response models
β”‚   β”‚   └── web_scrapper_generator_interface.dart # Abstract interface
β”‚   └── web_scrapper_generator.dart # Package exports
β”œβ”€β”€ bin/
β”‚   └── scraping_bee_mcp_server.dart # MCP server executable
└── pubspec.yaml

πŸš€ Installation

Add to your pubspec.yaml:

dependencies:
  web_scrapper_generator:
    path: ./web_scrapper_generator

πŸ’‘ Core Components

1. AI Prompts (prompts.dart)

The heart of the system. This file contains the sophisticated prompts that guide the AI through:

  • Page exploration and analysis
  • Extraction rule creation
  • JavaScript scenario generation
  • Cost optimization
  • Testing and validation

Key features:

  • Dynamic country proxy selection based on target site
  • Comprehensive testing workflow
  • Credit cost optimization strategy
  • Support for 195+ country proxies

2. MCP Integrations

Playwright MCP (playwright_setup.dart)

  • Provides real browser automation
  • Allows AI to interact with pages (click, type, scroll)
  • Captures screenshots and rendered HTML
  • Dynamic proxy configuration support

ScrapingBee MCP (scraping_bee_mcp.dart)

  • Custom MCP server for testing extraction rules
  • Validates rules against real ScrapingBee API
  • Ensures rules work before returning to user
  • Comprehensive error handling

3. SDK Implementations

Support for multiple AI providers through a unified interface:

Gemini Implementation

final generator = GeminiWebScrapperGenerator(
  geminiSDK: geminiSDK,
  scrapingBeeApiKey: 'your-api-key',
);

Claude Implementation

final generator = ClaudeWebScrapperGenerator(
  claudeSDK: claudeSDK,
  scrapingBeeApiKey: 'your-api-key',
);

Codex Implementation

final generator = CodexWebScrapperGenerator(
  codexSDK: codexSDK,
  scrapingBeeApiKey: 'your-api-key',
);

🎯 Usage Examples

Basic Setup and Initialization

import 'package:web_scrapper_generator/web_scrapper_generator.dart';
import 'package:gemini_cli_sdk/gemini_cli_sdk.dart';

void main() async {
  // Initialize SDK
  final geminiSDK = GeminiSDK();

  // Create generator instance
  final generator = GeminiWebScrapperGenerator(
    geminiSDK: geminiSDK,
    scrapingBeeApiKey: 'your-scrapingbee-api-key',
  );

  // Setup MCP tools if needed
  await generator.setupIfNeeded();

  // Now ready to generate scraping rules!
}

Creating New Scraping Rules

// Define the target URL and request structure
final request = WebScrapperRequest(
  url: 'https://example.com/products/{category}',
  queryParam: {'sort': 'price', 'limit': '20'},
  pathParams: ['category'],
);

// Initialize chat with AI
await generator.initChat(
  InitialPayloadDataCreatingFromZero(
    targetExampleUrl: 'https://example.com/products/electronics',
    webScrapperRequest: request,
  ),
);

// Send user requirements
final response = await generator.sendMessage(
  'Extract product names, prices, and ratings from the product listing page'
);

// Handle response
switch (response) {
  case WebScrapperChatAIResponseWithDataResponse():
    print('Success! Generated settings:');
    print('URL: ${response.fetchSettings.url}');
    print('Rules: ${response.fetchSettings.extract_rules}');
    break;
  case WebScrapperChatAIResponseJustMessage():
    print('AI Message: ${response.message}');
    break;
  case WebScrapperChatAIResponseErrorMessage():
    print('Error: ${response.errorDescription}');
    break;
}

Editing Existing Rules

// Edit existing scraping configuration
await generator.initChat(
  InitialPayloadDataEditingExistingWebScrapper(
    currentRequest: existingRequest,
    currentFetchSettings: existingSettings,
  ),
);

final response = await generator.sendMessage(
  'Add extraction for product images and availability status'
);

πŸ”§ MCP Setup

Automatic Setup

The package can automatically set up required MCP servers:

// This will:
// 1. Install Playwright if needed
// 2. Configure Playwright MCP
// 3. Compile and setup ScrapingBee MCP
await generator.setupIfNeeded();

Manual Setup

If you prefer manual control:

// Setup Playwright
await PlaywrightSetup.instance.setupIfNeeded(geminiSDK);

// Setup ScrapingBee MCP
await ScrapingBeeMcpServerSetup.instance.setupIfNeeded(geminiSDK);

πŸ“Š Response Models

WebScrapperRequest

Defines the URL pattern and parameters:

class WebScrapperRequest {
  final String url;                    // URL with {param} placeholders
  final Map<String, String?> queryParam; // Query parameters
  final List<String> pathParams;       // Path parameter names
}
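
As a quick illustration of how the placeholders fit together, the sketch below resolves a request into a concrete URL. resolveRequestUrl is a hypothetical helper written for this README, not part of the package API.

import 'package:web_scrapper_generator/web_scrapper_generator.dart';

/// Hypothetical helper: fill {param} placeholders and append the query
/// parameters to produce a concrete URL.
Uri resolveRequestUrl(
  WebScrapperRequest request,
  Map<String, String> pathValues,
) {
  var url = request.url;
  for (final param in request.pathParams) {
    url = url.replaceAll('{$param}', pathValues[param] ?? '');
  }
  final base = Uri.parse(url);
  return base.replace(queryParameters: {
    ...base.queryParameters,
    for (final entry in request.queryParam.entries)
      if (entry.value != null) entry.key: entry.value!,
  });
}

void main() {
  final request = WebScrapperRequest(
    url: 'https://example.com/products/{category}',
    queryParam: {'sort': 'price', 'limit': '20'},
    pathParams: ['category'],
  );
  // Prints https://example.com/products/electronics?sort=price&limit=20
  print(resolveRequestUrl(request, {'category': 'electronics'}));
}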

ScrappingBeeFetchSettings

Complete ScrapingBee configuration:

class ScrappingBeeFetchSettings {
  final String url;              // Target URL
  final String extract_rules;    // JSON extraction rules
  final String? js_scenario;     // JavaScript actions
  final bool render_js;          // Enable JS rendering
  final bool premium_proxy;      // Use premium proxy
  final bool stealth_proxy;      // Use stealth proxy
  final String? country_code;    // Proxy country
  // ... more settings
}
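
To make these fields concrete, here is a minimal sketch of how they map onto the query parameters of ScrapingBee's HTML API. buildRequestUri is illustrative only and is not part of this package.

import 'package:web_scrapper_generator/web_scrapper_generator.dart';

/// Illustrative only: build the ScrapingBee HTML API call that corresponds
/// to a ScrappingBeeFetchSettings instance.
Uri buildRequestUri(ScrappingBeeFetchSettings settings, String apiKey) {
  return Uri.https('app.scrapingbee.com', '/api/v1/', {
    'api_key': apiKey,
    'url': settings.url,
    'extract_rules': settings.extract_rules,
    if (settings.js_scenario != null) 'js_scenario': settings.js_scenario!,
    'render_js': settings.render_js.toString(),
    'premium_proxy': settings.premium_proxy.toString(),
    'stealth_proxy': settings.stealth_proxy.toString(),
    if (settings.country_code != null) 'country_code': settings.country_code!,
  });
}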

🌍 Proxy Configuration

The system intelligently selects proxy settings based on:

  • Target domain (e.g., .de domains use a German proxy)
  • User requirements
  • Site difficulty level

Cost optimization priority (illustrated in the sketch after this list):

  1. No proxy (1-5 credits)
  2. Premium proxy (25 credits)
  3. Stealth proxy (75 credits - only for LinkedIn, Meta, etc.)
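
The escalation itself is driven by the prompts, but the intended order can be sketched as follows. ProxyAttempt and candidateAttempts are hypothetical names used only for this example.

/// Hypothetical sketch: start with the cheapest configuration and only add
/// premium or stealth proxies when a cheaper attempt fails.
class ProxyAttempt {
  const ProxyAttempt({
    required this.premiumProxy,
    required this.stealthProxy,
    this.countryCode,
  });

  final bool premiumProxy;
  final bool stealthProxy;
  final String? countryCode;
}

List<ProxyAttempt> candidateAttempts(Uri target) {
  // Naive TLD-to-country guess (e.g. "shop.example.de" -> "de"); a real
  // mapping needs more care than this.
  final tld = target.host.split('.').last;
  final countryCode = tld.length == 2 ? tld : null;
  return [
    const ProxyAttempt(premiumProxy: false, stealthProxy: false), // 1-5 credits
    ProxyAttempt(premiumProxy: true, stealthProxy: false, countryCode: countryCode), // 25 credits
    ProxyAttempt(premiumProxy: false, stealthProxy: true, countryCode: countryCode), // 75 credits
  ];
}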

πŸ§ͺ Testing Workflow

The AI follows a strict testing protocol (sketched in code after the list):

  1. Exploration: Use Playwright to understand the page
  2. Rule Creation: Design extraction rules
  3. Testing: Validate with ScrapingBee MCP
  4. Optimization: Find cheapest working configuration
  5. Validation: Ensure data matches requirements
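
Steps 3-5 boil down to something like the sketch below; the real workflow is orchestrated by the prompts and MCP tools, and findCheapestWorkingSettings and its function parameters are hypothetical.

import 'package:web_scrapper_generator/web_scrapper_generator.dart';

/// Hypothetical outline of steps 3-5: test candidates cheapest-first and
/// return the first configuration whose extracted data meets the requirements.
Future<ScrappingBeeFetchSettings?> findCheapestWorkingSettings(
  List<ScrappingBeeFetchSettings> candidatesByCost,
  Future<Map<String, dynamic>> Function(ScrappingBeeFetchSettings) testWithScrapingBee,
  bool Function(Map<String, dynamic> extracted) matchesRequirements,
) async {
  for (final candidate in candidatesByCost) {
    try {
      final extracted = await testWithScrapingBee(candidate);
      if (matchesRequirements(extracted)) return candidate;
    } catch (_) {
      // A failed fetch moves on to the next, more capable configuration.
    }
  }
  return null; // No tested configuration produced acceptable data.
}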

πŸ” Debugging

Enable Verbose Logging

# Set environment variable
export DEBUG_MCP=true

Check MCP Status

final mcpInfo = await geminiSDK.isMcpInstalled();
print('MCP Support: ${mcpInfo.hasMcpSupport}');
print('Servers: ${mcpInfo.servers}');

Test ScrapingBee Connection

final result = await generator.testScrapingBeeConnection();
print('ScrapingBee API Status: $result');

⚠️ Important Notes

  1. Always Test Rules: The AI must test extraction rules before returning them
  2. Cost Awareness: The system optimizes for lowest credit usage
  3. Dynamic Proxies: Proxy country is selected based on target site
  4. MCP Required: Both Playwright and ScrapingBee MCPs must be configured
  5. API Key Security: Never expose ScrapingBee API keys to end users (see the example below)
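
For note 5, one simple approach is to read the key from the server-side environment at startup. SCRAPINGBEE_API_KEY is just an example variable name.

import 'dart:io';

void main() {
  // Keep the ScrapingBee key on the server: read it from the environment
  // instead of hard-coding it or shipping it to clients.
  final apiKey = Platform.environment['SCRAPINGBEE_API_KEY'];
  if (apiKey == null) {
    throw StateError('SCRAPINGBEE_API_KEY is not set');
  }
  // Pass apiKey to the generator constructors shown above.
}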

🀝 Contributing

When contributing to this package:

  1. Maintain the testing workflow in prompts
  2. Ensure MCP compatibility across all SDKs
  3. Add tests for new extraction scenarios
  4. Document any new proxy requirements

πŸ“„ License

This package is part of the Zenscrap project. See main project license.

πŸ› Troubleshooting

"MCP not found" Error

# Compile the ScrapingBee MCP server
dart compile exe bin/scraping_bee_mcp_server.dart -o build/scraping_bee_mcp_server

"Playwright not installed" Error

# Install Playwright
npm install playwright
npx playwright install

"Invalid extraction rules" Error

  • Ensure rules are valid JSON (see the example after this list)
  • Test rules with ScrapingBee MCP before using
  • Check CSS/XPath selector syntax
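
For reference, extraction rules generally follow the shape ScrapingBee documents: field names map to CSS selectors, and lists use an object with "selector", "type": "list" and nested "output" rules. A small example, encoded as the JSON string that extract_rules expects:

import 'dart:convert';

// Example extraction rules: a single field plus a repeated list of items.
final extractRules = jsonEncode({
  'title': 'h1',
  'products': {
    'selector': '.product-card',
    'type': 'list',
    'output': {
      'name': '.product-name',
      'price': '.price',
    },
  },
});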
