
A Dart package for generating web scrapers using various AI models - Claude, Gemini, Codex.

Web Scrapper Generator #

The core AI-powered scraping-rule generation package and the brain of Zenscrap: all the AI prompts, MCP integrations, and logic for creating and testing ScrapingBee extraction rules live here.

πŸ“‹ Overview #

This package handles:

  • AI prompt engineering for web scraping tasks
  • Playwright MCP integration for browser automation
  • ScrapingBee MCP for testing extraction rules
  • Multi-SDK support (Gemini, Claude, Codex)
  • Dynamic proxy configuration
  • Cost optimization strategies

πŸ—οΈ Package Structure #

web_scrapper_generator/
β”œβ”€β”€ lib/
β”‚   β”œβ”€β”€ src/
β”‚   β”‚   β”œβ”€β”€ implementations/      # SDK-specific implementations
β”‚   β”‚   β”‚   β”œβ”€β”€ claude_implementation.dart
β”‚   β”‚   β”‚   β”œβ”€β”€ codex_implementation.dart
β”‚   β”‚   β”‚   └── gemini_implementation.dart
β”‚   β”‚   β”œβ”€β”€ prompts.dart         # Core AI prompts and system instructions
β”‚   β”‚   β”œβ”€β”€ playwright_setup.dart # Playwright MCP configuration
β”‚   β”‚   β”œβ”€β”€ scraping_bee_mcp.dart # ScrapingBee MCP server
β”‚   β”‚   β”œβ”€β”€ scraping_bee_api_mixin.dart # API interaction logic
β”‚   β”‚   β”œβ”€β”€ mcp_adapters.dart    # Unified MCP setup across SDKs
β”‚   β”‚   β”œβ”€β”€ web_scrapper_response.dart # Response models
β”‚   β”‚   └── web_scrapper_generator_interface.dart # Abstract interface
β”‚   └── web_scrapper_generator.dart # Package exports
β”œβ”€β”€ bin/
β”‚   └── scraping_bee_mcp_server.dart # MCP server executable
└── pubspec.yaml

πŸš€ Installation #

Add to your pubspec.yaml:

dependencies:
  web_scrapper_generator:
    path: ./web_scrapper_generator

πŸ’‘ Core Components #

1. AI Prompts (prompts.dart) #

The heart of the system. This file contains the prompts that guide the AI through:

  • Page exploration and analysis
  • Extraction rule creation
  • JavaScript scenario generation
  • Cost optimization
  • Testing and validation

Key features:

  • Dynamic country proxy selection based on target site
  • Comprehensive testing workflow
  • Credit cost optimization strategy
  • Support for 195+ country proxies

2. MCP Integrations #

Playwright MCP (playwright_setup.dart)

  • Provides real browser automation
  • Allows AI to interact with pages (click, type, scroll)
  • Captures screenshots and rendered HTML
  • Dynamic proxy configuration support

ScrapingBee MCP (scraping_bee_mcp.dart)

  • Custom MCP server for testing extraction rules
  • Validates rules against real ScrapingBee API
  • Ensures rules work before returning to user
  • Comprehensive error handling

3. SDK Implementations #

Support for multiple AI providers through a unified interface:

Gemini Implementation

final generator = GeminiWebScrapperGenerator(
  geminiSDK: geminiSDK,
  scrapingBeeApiKey: 'your-api-key',
);

Claude Implementation

final generator = ClaudeWebScrapperGenerator(
  claudeSDK: claudeSDK,
  scrapingBeeApiKey: 'your-api-key',
);

Codex Implementation

final generator = CodexWebScrapperGenerator(
  codexSDK: codexSDK,
  scrapingBeeApiKey: 'your-api-key',
);

🎯 Usage Examples #

Basic Setup and Initialization #

import 'package:web_scrapper_generator/web_scrapper_generator.dart';
import 'package:gemini_cli_sdk/gemini_cli_sdk.dart';

void main() async {
  // Initialize SDK
  final geminiSDK = GeminiSDK();

  // Create generator instance
  final generator = GeminiWebScrapperGenerator(
    geminiSDK: geminiSDK,
    scrapingBeeApiKey: 'your-scrapingbee-api-key',
  );

  // Setup MCP tools if needed
  await generator.setupIfNeeded();

  // Now ready to generate scraping rules!
}

Creating New Scraping Rules #

// Define the target URL and request structure
final request = WebScrapperRequest(
  url: 'https://example.com/products/{category}',
  queryParam: {'sort': 'price', 'limit': '20'},
  pathParams: ['category'],
);

// Initialize chat with AI
await generator.initChat(
  InitialPayloadDataCreatingFromZero(
    targetExampleUrl: 'https://example.com/products/electronics',
    webScrapperRequest: request,
  ),
);

// Send user requirements
final response = await generator.sendMessage(
  'Extract product names, prices, and ratings from the product listing page'
);

// Handle response
switch (response) {
  case WebScrapperChatAIResponseWithDataResponse():
    print('Success! Generated settings:');
    print('URL: ${response.fetchSettings.url}');
    print('Rules: ${response.fetchSettings.extract_rules}');
    break;
  case WebScrapperChatAIResponseJustMessage():
    print('AI Message: ${response.message}');
    break;
  case WebScrapperChatAIResponseErrorMessage():
    print('Error: ${response.errorDescription}');
    break;
}

Editing Existing Rules #

// Edit existing scraping configuration
await generator.initChat(
  InitialPayloadDataEditingExistingWebScrapper(
    currentRequest: existingRequest,
    currentFetchSettings: existingSettings,
  ),
);

final response = await generator.sendMessage(
  'Add extraction for product images and availability status'
);

πŸ”§ MCP Setup #

Automatic Setup #

The package can automatically set up required MCP servers:

// This will:
// 1. Install Playwright if needed
// 2. Configure Playwright MCP
// 3. Compile and setup ScrapingBee MCP
await generator.setupIfNeeded();

Manual Setup #

If you prefer manual control:

// Setup Playwright
await PlaywrightSetup.instance.setupIfNeeded(geminiSDK);

// Setup ScrapingBee MCP
await ScrapingBeeMcpServerSetup.instance.setupIfNeeded(geminiSDK);

πŸ“Š Response Models #

WebScrapperRequest #

Defines the URL pattern and parameters:

class WebScrapperRequest {
  final String url;                    // URL with {param} placeholders
  final Map<String, String?> queryParam; // Query parameters
  final List<String> pathParams;       // Path parameter names
}
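To make the placeholder convention concrete, here is a sketch of how a `{param}` template could be resolved into a final URL. The `resolveUrl` helper is illustrative, not the package's actual implementation:

```dart
// Hypothetical sketch: substitute {param} placeholders and append the
// query string. Not the package's real resolution logic.
String resolveUrl(String template, Map<String, String> pathValues,
    Map<String, String?> queryParam) {
  var url = template;
  pathValues.forEach((name, value) {
    url = url.replaceAll('{$name}', Uri.encodeComponent(value));
  });
  final query = queryParam.entries
      .where((e) => e.value != null)
      .map((e) => '${e.key}=${Uri.encodeQueryComponent(e.value!)}')
      .join('&');
  return query.isEmpty ? url : '$url?$query';
}

void main() {
  print(resolveUrl(
    'https://example.com/products/{category}',
    {'category': 'electronics'},
    {'sort': 'price', 'limit': '20'},
  ));
  // https://example.com/products/electronics?sort=price&limit=20
}
```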

ScrappingBeeFetchSettings #

Complete ScrapingBee configuration:

class ScrappingBeeFetchSettings {
  final String url;              // Target URL
  final String extract_rules;    // JSON extraction rules
  final String? js_scenario;     // JavaScript actions
  final bool render_js;          // Enable JS rendering
  final bool premium_proxy;      // Use premium proxy
  final bool stealth_proxy;      // Use stealth proxy
  final String? country_code;    // Proxy country
  // ... more settings
}
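For reference, `extract_rules` is a JSON string in ScrapingBee's data-extraction format. A rules object for a product listing page might look like the following (the selectors are placeholders for illustration):

```json
{
  "products": {
    "selector": ".product-card",
    "type": "list",
    "output": {
      "name": ".product-name",
      "price": ".product-price",
      "rating": ".product-rating"
    }
  }
}
```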

🌍 Proxy Configuration #

The system intelligently selects proxy settings based on:

  • Target domain (e.g., .de domains use German proxy)
  • User requirements
  • Site difficulty level

Cost optimization priority:

  1. No proxy (1-5 credits)
  2. Premium proxy (25 credits)
  3. Stealth proxy (75 credits - only for LinkedIn, Meta, etc.)
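The escalation strategy above can be sketched as "try tiers in ascending credit cost, keep the first that works." The tier names, costs, and `tryFetch` callback below are assumptions for illustration, not the package's real API:

```dart
// Illustrative sketch of cheapest-configuration-first proxy selection.
class ProxyTier {
  final String name;
  final int credits;
  const ProxyTier(this.name, this.credits);
}

const tiers = [
  ProxyTier('none', 5),     // plain request (1 credit) or render_js (5)
  ProxyTier('premium', 25), // premium_proxy
  ProxyTier('stealth', 75), // stealth_proxy, hardest sites only
];

Future<ProxyTier?> cheapestWorkingTier(
    Future<bool> Function(ProxyTier) tryFetch) async {
  for (final tier in tiers) {
    if (await tryFetch(tier)) return tier; // first success is cheapest
  }
  return null; // every tier failed; the rules need rework
}
```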

πŸ§ͺ Testing Workflow #

The AI follows a strict testing protocol:

  1. Exploration: Use Playwright to understand the page
  2. Rule Creation: Design extraction rules
  3. Testing: Validate with ScrapingBee MCP
  4. Optimization: Find cheapest working configuration
  5. Validation: Ensure data matches requirements

πŸ” Debugging #

Enable Verbose Logging #

# Set environment variable
export DEBUG_MCP=true

Check MCP Status #

final mcpInfo = await geminiSDK.isMcpInstalled();
print('MCP Support: ${mcpInfo.hasMcpSupport}');
print('Servers: ${mcpInfo.servers}');

Test ScrapingBee Connection #

final result = await generator.testScrapingBeeConnection();
print('ScrapingBee API Status: $result');

⚠️ Important Notes #

  1. Always Test Rules: The AI must test extraction rules before returning them
  2. Cost Awareness: The system optimizes for lowest credit usage
  3. Dynamic Proxies: Proxy country is selected based on target site
  4. MCP Required: Both Playwright and ScrapingBee MCPs must be configured
  5. API Key Security: Never expose ScrapingBee API keys to end users

🀝 Contributing #

When contributing to this package:

  1. Maintain the testing workflow in prompts
  2. Ensure MCP compatibility across all SDKs
  3. Add tests for new extraction scenarios
  4. Document any new proxy requirements

πŸ“„ License #

This package is part of the Zenscrap project. See main project license.

πŸ› Troubleshooting #

"MCP not found" Error #

# Compile the ScrapingBee MCP server
dart compile exe bin/scraping_bee_mcp_server.dart -o build/scraping_bee_mcp_server

"Playwright not installed" Error #

# Install Playwright
npm install playwright
npx playwright install

"Invalid extraction rules" Error #

  • Ensure rules are valid JSON
  • Test rules with ScrapingBee MCP before using
  • Check CSS/XPath selector syntax
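As a quick sanity check, you can confirm a rules string parses as JSON before handing it to ScrapingBee (this example writes a sample `rules.json`; adapt the path to wherever you keep your rules):

```shell
# Write a sample rules file and confirm it parses as JSON.
echo '{"title": "h1", "price": ".price"}' > rules.json
python3 -m json.tool rules.json
```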
