web_scrapper_generator 2.0.3
A Dart package for generating web scrapers using various AI models - Claude, Gemini, Codex.
Web Scrapper Generator #
Core AI-powered web scraping rules generation package. This package contains the brain of Zenscrap - all the AI prompts, MCP integrations, and logic for creating and testing ScrapingBee extraction rules.
Overview #
This package handles:
- AI prompt engineering for web scraping tasks
- Playwright MCP integration for browser automation
- ScrapingBee MCP for testing extraction rules
- Multi-SDK support (Gemini, Claude, Codex)
- Dynamic proxy configuration
- Cost optimization strategies
Package Structure #
web_scrapper_generator/
├── lib/
│   ├── src/
│   │   ├── implementations/                       # SDK-specific implementations
│   │   │   ├── claude_implementation.dart
│   │   │   ├── codex_implementation.dart
│   │   │   └── gemini_implementation.dart
│   │   ├── prompts.dart                           # Core AI prompts and system instructions
│   │   ├── playwright_setup.dart                  # Playwright MCP configuration
│   │   ├── scraping_bee_mcp.dart                  # ScrapingBee MCP server
│   │   ├── scraping_bee_api_mixin.dart            # API interaction logic
│   │   ├── mcp_adapters.dart                      # Unified MCP setup across SDKs
│   │   ├── web_scrapper_response.dart             # Response models
│   │   └── web_scrapper_generator_interface.dart  # Abstract interface
│   └── web_scrapper_generator.dart                # Package exports
├── bin/
│   └── scraping_bee_mcp_server.dart               # MCP server executable
└── pubspec.yaml
Installation #
Add to your pubspec.yaml:
dependencies:
  web_scrapper_generator:
    path: ./web_scrapper_generator
Core Components #
1. AI Prompts (prompts.dart) #
The heart of the system: sophisticated prompts that guide the AI through:
- Page exploration and analysis
- Extraction rule creation
- JavaScript scenario generation
- Cost optimization
- Testing and validation
Key features:
- Dynamic country proxy selection based on target site
- Comprehensive testing workflow
- Credit cost optimization strategy
- Support for 195+ country proxies
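For context, ScrapingBee's extract_rules format is a JSON object mapping output field names to CSS selectors, optionally nested for list extraction. The selectors below are purely illustrative, not taken from a real generated ruleset:

```json
{
  "title": "h1",
  "products": {
    "selector": ".product-card",
    "type": "list",
    "output": {
      "name": ".product-name",
      "price": ".product-price"
    }
  }
}
```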
2. MCP Integrations #
Playwright MCP (playwright_setup.dart)
- Provides real browser automation
- Allows AI to interact with pages (click, type, scroll)
- Captures screenshots and rendered HTML
- Dynamic proxy configuration support
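As a rough sketch of what playwright_setup.dart produces, a Playwright MCP server is typically registered with the host SDK through a config entry like the following. The proxy flag and address are illustrative placeholders; the exact wiring is handled by the package:

```json
{
  "mcpServers": {
    "playwright": {
      "command": "npx",
      "args": ["@playwright/mcp@latest", "--proxy-server=http://127.0.0.1:3128"]
    }
  }
}
```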
ScrapingBee MCP (scraping_bee_mcp.dart)
- Custom MCP server for testing extraction rules
- Validates rules against real ScrapingBee API
- Ensures rules work before they are returned to the user
- Comprehensive error handling
3. SDK Implementations #
Support for multiple AI providers through a unified interface:
Gemini Implementation
final generator = GeminiWebScrapperGenerator(
  geminiSDK: geminiSDK,
  scrapingBeeApiKey: 'your-api-key',
);
Claude Implementation
final generator = ClaudeWebScrapperGenerator(
  claudeSDK: claudeSDK,
  scrapingBeeApiKey: 'your-api-key',
);
Codex Implementation
final generator = CodexWebScrapperGenerator(
  codexSDK: codexSDK,
  scrapingBeeApiKey: 'your-api-key',
);
Usage Examples #
Basic Setup and Initialization #
import 'package:web_scrapper_generator/web_scrapper_generator.dart';
import 'package:gemini_cli_sdk/gemini_cli_sdk.dart';

void main() async {
  // Initialize SDK
  final geminiSDK = GeminiSDK();

  // Create generator instance
  final generator = GeminiWebScrapperGenerator(
    geminiSDK: geminiSDK,
    scrapingBeeApiKey: 'your-scrapingbee-api-key',
  );

  // Setup MCP tools if needed
  await generator.setupIfNeeded();

  // Now ready to generate scraping rules!
}
Creating New Scraping Rules #
// Define the target URL and request structure
final request = WebScrapperRequest(
  url: 'https://example.com/products/{category}',
  queryParam: {'sort': 'price', 'limit': '20'},
  pathParams: ['category'],
);

// Initialize chat with AI
await generator.initChat(
  InitialPayloadDataCreatingFromZero(
    targetExampleUrl: 'https://example.com/products/electronics',
    webScrapperRequest: request,
  ),
);

// Send user requirements
final response = await generator.sendMessage(
  'Extract product names, prices, and ratings from the product listing page',
);

// Handle response
switch (response) {
  case WebScrapperChatAIResponseWithDataResponse():
    print('Success! Generated settings:');
    print('URL: ${response.fetchSettings.url}');
    print('Rules: ${response.fetchSettings.extract_rules}');
    break;
  case WebScrapperChatAIResponseJustMessage():
    print('AI Message: ${response.message}');
    break;
  case WebScrapperChatAIResponseErrorMessage():
    print('Error: ${response.errorDescription}');
    break;
}
Editing Existing Rules #
// Edit existing scraping configuration
await generator.initChat(
  InitialPayloadDataEditingExistingWebScrapper(
    currentRequest: existingRequest,
    currentFetchSettings: existingSettings,
  ),
);

final response = await generator.sendMessage(
  'Add extraction for product images and availability status',
);
MCP Setup #
Automatic Setup #
The package can automatically set up required MCP servers:
// This will:
// 1. Install Playwright if needed
// 2. Configure Playwright MCP
// 3. Compile and setup ScrapingBee MCP
await generator.setupIfNeeded();
Manual Setup #
If you prefer manual control:
// Setup Playwright
await PlaywrightSetup.instance.setupIfNeeded(geminiSDK);
// Setup ScrapingBee MCP
await ScrapingBeeMcpServerSetup.instance.setupIfNeeded(geminiSDK);
Response Models #
WebScrapperRequest #
Defines the URL pattern and parameters:
class WebScrapperRequest {
  final String url;                       // URL with {param} placeholders
  final Map<String, String?> queryParam;  // Query parameters
  final List<String> pathParams;          // Path parameter names
}
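To make the placeholder semantics concrete, here is a minimal sketch of how such a request template could be resolved into a concrete URL at fetch time. resolveUrl is a hypothetical helper for illustration, not part of the package API:

```dart
// Hypothetical helper: fills {param} placeholders and appends query params.
String resolveUrl(String template, Map<String, String> pathValues,
    Map<String, String?> queryParam) {
  var url = template;
  pathValues.forEach((key, value) {
    url = url.replaceAll('{$key}', Uri.encodeComponent(value));
  });
  final query = queryParam.entries
      .where((e) => e.value != null)
      .map((e) => '${e.key}=${Uri.encodeQueryComponent(e.value!)}')
      .join('&');
  return query.isEmpty ? url : '$url?$query';
}

void main() {
  final url = resolveUrl(
    'https://example.com/products/{category}',
    {'category': 'electronics'},
    {'sort': 'price', 'limit': '20'},
  );
  print(url); // https://example.com/products/electronics?sort=price&limit=20
}
```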
ScrappingBeeFetchSettings #
Complete ScrapingBee configuration:
class ScrappingBeeFetchSettings {
  final String url;            // Target URL
  final String extract_rules;  // JSON extraction rules
  final String? js_scenario;   // JavaScript actions
  final bool render_js;        // Enable JS rendering
  final bool premium_proxy;    // Use premium proxy
  final bool stealth_proxy;    // Use stealth proxy
  final String? country_code;  // Proxy country
  // ... more settings
}
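The js_scenario field holds ScrapingBee's JSON instruction list for in-page actions performed before extraction. An illustrative scenario (the selector is hypothetical):

```json
{
  "instructions": [
    {"click": "#load-more"},
    {"wait": 1000},
    {"scroll_y": 1080}
  ]
}
```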
Proxy Configuration #
The system intelligently selects proxy settings based on:
- Target domain (e.g., .de domains use German proxy)
- User requirements
- Site difficulty level
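The domain-based part of this decision can be pictured as a simple TLD lookup. This sketch is illustrative only; in this package the actual choice is made by the AI via the prompts, and the country set shown is a tiny subset of the 195+ supported codes:

```dart
// Illustrative sketch: pick a proxy country code from the target's TLD.
String? countryForDomain(Uri target) {
  final host = target.host; // e.g. shop.example.de
  final tld = host.substring(host.lastIndexOf('.') + 1);
  // A small illustrative subset of supported country codes.
  const countryTlds = {'de', 'fr', 'it', 'es', 'nl', 'jp', 'br'};
  return countryTlds.contains(tld) ? tld : null; // null → default proxy pool
}

void main() {
  print(countryForDomain(Uri.parse('https://shop.example.de/products'))); // de
  print(countryForDomain(Uri.parse('https://example.com/'))); // null
}
```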
Cost optimization priority:
1. No proxy (1-5 credits)
2. Premium proxy (25 credits)
3. Stealth proxy (75 credits; only for LinkedIn, Meta, etc.)
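The escalation order above amounts to a retry ladder: try the cheapest tier, escalate only on failure. fetchWithCheapestProxy is a hypothetical illustration of that idea, not the package's actual implementation; fetch stands in for a real ScrapingBee call that throws when the target blocks the request:

```dart
// Hypothetical sketch of the credit-escalation strategy described above.
T fetchWithCheapestProxy<T>(
    T Function({bool premiumProxy, bool stealthProxy}) fetch) {
  try {
    return fetch(premiumProxy: false, stealthProxy: false); // 1-5 credits
  } catch (_) {
    // Cheapest tier blocked; escalate.
  }
  try {
    return fetch(premiumProxy: true, stealthProxy: false); // 25 credits
  } catch (_) {
    // Premium blocked too; last resort.
  }
  return fetch(premiumProxy: false, stealthProxy: true); // 75 credits
}

void main() {
  var attempts = 0;
  // Simulated fetch: only succeeds once the stealth proxy is used.
  final result = fetchWithCheapestProxy<String>(
      ({bool premiumProxy = false, bool stealthProxy = false}) {
    attempts++;
    if (!stealthProxy) throw StateError('blocked');
    return 'ok';
  });
  print('$result after $attempts attempts'); // ok after 3 attempts
}
```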
Testing Workflow #
The AI follows a strict testing protocol:
1. Exploration: Use Playwright to understand the page
2. Rule Creation: Design extraction rules
3. Testing: Validate with ScrapingBee MCP
4. Optimization: Find cheapest working configuration
5. Validation: Ensure data matches requirements
Debugging #
Enable Verbose Logging #
# Set environment variable
export DEBUG_MCP=true
Check MCP Status #
final mcpInfo = await geminiSDK.isMcpInstalled();
print('MCP Support: ${mcpInfo.hasMcpSupport}');
print('Servers: ${mcpInfo.servers}');
Test ScrapingBee Connection #
final result = await generator.testScrapingBeeConnection();
print('ScrapingBee API Status: $result');
Important Notes #
- Always Test Rules: The AI must test extraction rules before returning them
- Cost Awareness: The system optimizes for lowest credit usage
- Dynamic Proxies: Proxy country is selected based on target site
- MCP Required: Both Playwright and ScrapingBee MCPs must be configured
- API Key Security: Never expose ScrapingBee API keys to end users
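One way to honor the last point is to resolve the key from the environment at startup rather than embedding it in client code. scrapingBeeApiKeyFromEnv is a hypothetical helper, sketched here for illustration:

```dart
import 'dart:io';

/// Reads the ScrapingBee API key from the environment so it never ships
/// inside client code. Throws if the variable is missing.
String scrapingBeeApiKeyFromEnv([Map<String, String>? env]) {
  final key = (env ?? Platform.environment)['SCRAPINGBEE_API_KEY'];
  if (key == null || key.isEmpty) {
    throw StateError('SCRAPINGBEE_API_KEY is not set');
  }
  return key;
}

void main() {
  // Injected map used here so the example runs without a real environment.
  print(scrapingBeeApiKeyFromEnv({'SCRAPINGBEE_API_KEY': 'demo-key'})); // demo-key
}
```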
Contributing #
When contributing to this package:
- Maintain the testing workflow in prompts
- Ensure MCP compatibility across all SDKs
- Add tests for new extraction scenarios
- Document any new proxy requirements
License #
This package is part of the Zenscrap project. See main project license.
Troubleshooting #
"MCP not found" Error #
# Compile the ScrapingBee MCP server
dart compile exe bin/scraping_bee_mcp_server.dart -o build/scraping_bee_mcp_server
"Playwright not installed" Error #
# Install Playwright
npm install playwright
npx playwright install
"Invalid extraction rules" Error #
- Ensure rules are valid JSON
- Test rules with ScrapingBee MCP before using
- Check CSS/XPath selector syntax
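A minimal pre-flight check for the first point, assuming rules arrive as a JSON string as in ScrappingBeeFetchSettings.extract_rules; looksLikeValidExtractRules is a hypothetical helper that only checks the string parses as a JSON object:

```dart
import 'dart:convert';

/// Quick sanity check that an extract_rules string parses as a JSON object.
bool looksLikeValidExtractRules(String rules) {
  try {
    return jsonDecode(rules) is Map<String, dynamic>;
  } on FormatException {
    return false;
  }
}

void main() {
  print(looksLikeValidExtractRules('{"title": "h1"}')); // true
  print(looksLikeValidExtractRules('{title: h1}'));     // false (not valid JSON)
}
```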